THE LUCS-KDD IMPLEMENTATIONS OF THE FOIL, PRM AND CPAR ALGORITHMS




Frans Coenen

Department of Computer Science

The University of Liverpool

13 February 2004


CONTENTS

1. Introduction.
2. Downloading the software.
2.1. Compiling.
2.2. Documentation.
3. Running the software.
4. More detail on application classes.
5. More detail on algorithms used.
6. Some results.
7. Conclusions.



1. INTRODUCTION


FOIL (First Order Inductive Learner) is an inductive learning algorithm for generating Classification Association Rules (CARs) developed by Ross Quinlan and Mike Cameron-Jones (Quinlan and Cameron-Jones 1993). This algorithm was later developed further by Xiaoxin Yin and Jiawei Han to produce the PRM (Predictive Rule Mining) CAR generation algorithm (Yin and Han 2003). PRM was then developed further, again by Xiaoxin Yin and Jiawei Han, to produce CPAR (Classification based on Predictive Association Rules).

The set of WWW pages "hanging" from this page describes the Java implementations, developed by the LUCS-KDD research team, of the FOIL, PRM and CPAR algorithms. The implementations are founded on the descriptions found in Yin and Han (2003), and have been used to compare the performance of these algorithms with that of other CARM algorithms.




2. DOWNLOADING THE SOFTWARE


There are six source files and six application files available here. The source files are as follows:

  1. AssocRuleMining.java: Set of general CARM (and ARM) utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g. operations such as subset, member, union etc.) and (iv) data and parameter output.
  2. CPAR_CARgen.java: The CPAR algorithm.
  3. Classification.java: Set of general methods common to FOIL, PRM and CPAR.
  4. FOIL_CARgen.java: The FOIL algorithm.
  5. PRM_CARgen.java: The PRM algorithm.
  6. RuleList.java: Set of methods that allow the creation of classifiers and the manipulation (e.g. pruning, ordering, etc.) of the classification rules contained in such a classifier. Also contains methods to classify records in a given test set.

The source files are arranged in a class hierarchy of the following form:

                            AssocRuleMining
                                   |
                    +--------------+---------------+
                    |                              |
               Classification                    RuleList
                    |
     +--------------+---------------+
     |              |               |
FOIL_CARgen     PRM_CARgen      CPAR_CARgen

The applications are of two types:

  1. Applications that use Ten Cross Validation (TCV) to arrive at an accuracy projection for the method.
  2. Applications that use a 50:50 split of the input data to produce a training and a test set. The training set is used to build a classifier and the test set to validate it.

The applications are as follows:

  1. ClassCPAR_App.java: CPAR algorithm using a 50:50 training/test split of the input data.
  2. ClassCPAR_App10.java: CPAR algorithm using TCV.
  3. ClassFOIL_App.java: FOIL algorithm using a 50:50 training/test split of the input data.
  4. ClassFOIL_App10.java: FOIL algorithm using TCV.
  5. ClassPRM_App.java: PRM algorithm using a 50:50 training/test split of the input data.
  6. ClassPRM_App10.java: PRM algorithm using TCV.

There is also a "tar ball" foilPrmCpar.tgz that can be downloaded that includes all of the source and application files. It can be unpacked using tar -zxf foilPrmCpar.tgz.


2.1. Compiling

The software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, which should make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:

javac *.java

2.2. Documentation

The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:

javadoc -d Documentation/ *.java

This will produce a hierarchy of WWW pages contained in the Documentation directory.




3. RUNNING THE SOFTWARE


When compiled, the software can be invoked in the normal manner using the Java interpreter:

java FILE_NAME

If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:

java -Xms600m -Xmx600m FILE_NAME

The input to the software, in all cases, is a (space separated) binary valued data set R. The set R comprises N records such that each record r comprises a subset of the available attributes plus a class. Thus:

R = {r | r = S ∪ {c}, where S ⊆ A and c ∈ C}

Where A is the set of available attributes, and C is the set of available classes. The value D is then defined as:

D = |A| + |C|

We then say that a particular data set has D columns and N rows. A small example data set might be as follows:

1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7

where, in this case, A = {1, 2, 3, 4, 5}, C = {6, 7}, D = 7 and N = 5. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning), and that the class is always the last item listed in each record. The input file is specified using the -F flag.

The programs also require information regarding the number of classes; this is input using the -N flag.

Some example invocations, using a discretised/normalised Pima Indians data set (also available from this site), are given below:

java ClassFOIL_App -FpimaIndians.D42.N768.C2.num -N2
java ClassPRM_App10 -FpimaIndians.D42.N768.C2.num -N2
java ClassCPAR_App10 -N2 -FpimaIndians.D42.N768.C2.num

(Note that the ordering of the flags is not significant.) The output from each application is a classifier plus some diagnostic information (run time, number of rules generated, etc.).
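
For illustration, the following minimal stand-alone reader parses the input format described above. This is a sketch only: it is not the LUCS-KDD inputDataSet() method, and the class and method names are illustrative.

import java.io.*;
import java.util.*;

public class DataReader {

    /** Reads a space separated data set of the form described above;
        each line is one record and the last number on each line is the
        class. */
    public static List<int[]> read(String fileName) throws IOException {
        List<int[]> records = new ArrayList<int[]>();
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.length() == 0) continue;
            String[] tokens = line.split("\\s+");
            int[] record = new int[tokens.length];
            for (int i = 0; i < tokens.length; i++)
                record[i] = Integer.parseInt(tokens[i]);
            records.add(record);    // record[record.length-1] is the class
        }
        in.close();
        return records;
    }
}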




4. MORE DETAIL ON APPLICATION CLASSES


The FOIL, PRM and CPAR application classes all have the following basic form:

public class < CLASS_NAME >  {

    /** Main method */

    public static void main(String[] args) throws IOException {

        // Create an instance of the class CPAR_CARgen, PRM_CARgen or
        // FOIL_CARgen, as desired, using the appropriate constructor.
        < OBJECT > < INSTANCE_NAME > = new < CONSTRUCTOR > ;

        // Read the data to be mined from file using the inputDataSet()
        // method found in the AssocRuleMining parent class, which is
        // inherited by the CPAR_CARgen, PRM_CARgen and FOIL_CARgen classes.
        < INSTANCE_NAME > .inputDataSet();

        // Create training and test data sets. Two options: (1) if requiring
        // a 50:50 training/test data split use the
        // createTrainingAndTestDataSets() method, (2) if requiring Ten
        // Cross Validation (TCV) use the createTenthsDataSets() method. Both
        // are contained in the Classification class (and are inherited by
        // the CPAR_CARgen, PRM_CARgen and FOIL_CARgen classes).
        < INSTANCE_NAME > .createTrainingAndTestDataSets();
        //  or
        < INSTANCE_NAME > .createTenthsDataSets();

        // Mine the data and generate CARs using a call to either: (1)
        // startClassification() if not requiring TCV, (2) commenceTCV()
        // if requiring TCV, or (3) commenceTCVwithOutput() if (for
        // diagnostic or monitoring purposes) output is required during the
        // TCV process. Each of these methods returns an accuracy value
        // which can be "captured" and output.
        < INSTANCE_NAME > .startClassification();
        //  or
        < INSTANCE_NAME > .commenceTCV();
        //  or
        < INSTANCE_NAME > .commenceTCVwithOutput();

        // Output as desired using the many output methods that are
        // available (see below for further details).

        // End
        System.exit(0);
    }
}
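
By way of a concrete example, a filled-in version of the above template, for CPAR with a 50:50 training/test split, might look as follows. Note that this is a hypothetical sketch: in particular, the assumption that the constructor takes the command line arguments (so that the -F and -N flags can be processed) should be checked against the Javadoc generated as described in Section 2.2.

import java.io.IOException;

public class MyCPAR_App {

    /** Main method */

    public static void main(String[] args) throws IOException {

        // Create a CPAR rule generator. The constructor arguments are
        // assumed (not confirmed) to be the command line arguments, e.g.
        // -FpimaIndians.D42.N768.C2.num -N2.
        CPAR_CARgen newClassification = new CPAR_CARgen(args);

        // Read the data set named by the -F flag.
        newClassification.inputDataSet();

        // Create a 50:50 training/test data split.
        newClassification.createTrainingAndTestDataSets();

        // Build the classifier; capture and output the accuracy.
        double accuracy = newClassification.startClassification();
        System.out.println("Accuracy = " + accuracy);

        // Output the number of rules and the rules themselves.
        newClassification.getCurrentRuleListObject().outputNumRules();
        newClassification.getCurrentRuleListObject().outputRules();

        // End
        System.exit(0);
    }
}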

Some output is always generated, such as: (1) the input parameters and start settings, and (2) "milestones" during processing. Additional output statements can be included in application classes. The available additional output options are as follows:

  1. outputDataArray(): Outputs the input data.
  2. outputDataArraySize(): Outputs the size of the input data (number of records and columns, number of elements and overall density).
  3. outputDuration(double time1, double time2): Outputs the elapsed time, in seconds, between time1 and time2. Typically used to measure processing time:
    double time1 = (double) System.currentTimeMillis();
    
    // Program statements
    
    < INSTANCE_NAME > .outputDuration(time1,
    				(double) System.currentTimeMillis());
    
  4. outputNumRules(): The number of rules in the resulting classifier.
  5. outputRules(): The classifier expressed as a set of rules each with its expected Laplace accuracy value.

Note that the first three of the above output methods are instance methods contained in the AssocRuleMining class and are therefore inherited by the CPAR_CARgen, PRM_CARgen and FOIL_CARgen classes. The last two of the above are instance methods of the RuleList class and thus must be called using an instance of that class. An instance of the RuleList class is created during processing and may be accessed using the getCurrentRuleListObject() public method found in the Classification class. Thus, for example, the outputRules() method would be invoked as follows:

< INSTANCE_NAME > .getCurrentRuleListObject().outputRules();

When using TCV it makes little sense to use the outputNumRules() and outputRules() methods, as these will only serve to output the results for the last tenth of the TCV process. If output is required during TCV then the commenceTCVwithOutput() method should be used.
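
Finally, for reference, the expected Laplace accuracy attached to each rule by outputRules() is, following Yin and Han (2003), (n_c + 1) / (n_total + |C|), where n_total is the number of training records satisfying the rule's antecedent and n_c is the number of those records that also belong to the rule's consequent class. A minimal sketch (the method name is illustrative, not part of the LUCS-KDD API):

    /** Expected Laplace accuracy of a rule (after Yin and Han 2003).
        numSatisfied = training records matching the rule's antecedent,
        numCorrect   = those that also have the rule's consequent class,
        numClasses   = the number of available classes, |C|. */
    public static double laplaceAccuracy(int numSatisfied, int numCorrect,
                                                        int numClasses) {
        return (numCorrect + 1.0) / (numSatisfied + numClasses);
    }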




5. MORE DETAIL ON ALGORITHMS USED


More information on the:

  1. LUCS-KDD FOIL implementation,
  2. LUCS-KDD PRM implementation and
  3. LUCS-KDD CPAR implementation.

is available.




6. SOME RESULTS


The following tables give some performance indicators whereby the algorithms can be compared. The figures have been obtained using a number of data sets taken from the UCI Machine Learning Repository (Blake and Merz 1998) which have been discretised/normalised using the LUCS-KDD DN software system. In each case continuous attributes have been discretised into five sub-ranges. It would have been desirable to use the same data sets as those used by Xiaoxin Yin and Jiawei Han; however, it was discovered (January 2003) that many of these data sets were no longer available in the UCI repository. In two cases (auto and heart) the parameters for the data sets are not identical to those reported in Yin and Han (2003), whose paper actually refers back to an earlier paper by Wenmin Li, Jiawei Han and Jian Pei (Li et al. 2001). A number of further data sets are included below, selected because of their size (the data sets used in Yin and Han (2003) are all quite small). In each case the name of the data set includes: the number of attributes after discretisation (D), the number of records (N) and the number of classes (C).

The parameters for the individual algorithms (same as those reported in Yin and Han 2003) are as follows:

  1. FOIL: Maximum of 3 attributes in the antecedent of a rule.
  2. PRM: Minimum gain threshold = 0.7, total weight threshold = 0.05, and decay factor = 2/3.
  3. CPAR: Minimum gain threshold = 0.7, total weight threshold = 0.05, decay factor = 2/3.
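
For context, the decay factor and total weight threshold operate as follows (after Yin and Han 2003): where FOIL removes every training example covered by a newly generated rule, PRM and CPAR instead multiply the weight of each covered example by the decay factor, so that covered examples only gradually lose influence; rule generation for a class stops once the total remaining weight falls below the total weight threshold multiplied by the original total weight. A minimal sketch (all names are illustrative, not the LUCS-KDD API):

    /** PRM/CPAR style weight decay (after Yin and Han 2003). weights[i]
        is the current weight of training example i (initially 1.0).
        Decays the weight of each example covered by the new rule and
        returns the total weight remaining, to be compared against
        totalWeightThreshold * initialTotalWeight. */
    public static double decayCoveredExamples(double[] weights,
                                boolean[] covered, double decayFactor) {
        double totalWeight = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (covered[i]) weights[i] = weights[i] * decayFactor; // e.g. 2/3
            totalWeight = totalWeight + weights[i];
        }
        return totalWeight;
    }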

Note also that the best 5 rules are used, in each case, when classifying a test record, and that for CPAR a similarity ratio of 1:0.99 was used.
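
The "best k rules" mechanism referred to above (with k = 5 here) operates, following Yin and Han (2003), by averaging, for each class, the Laplace accuracies of the (up to) k best rules whose antecedents the test record satisfies, and then predicting the class with the highest average. A sketch (all names are illustrative, not the LUCS-KDD API):

import java.util.*;

public class BestKRulesSketch {

    /** Illustrative rule: antecedent attribute numbers, consequent
        class number and precomputed Laplace accuracy. */
    static class Rule {
        Set<Integer> antecedent;
        int consequent;
        double laplace;
    }

    /** Classifies a record using the best k rules per class; returns
        -1 if no rule fires (in practice a default class would then be
        assigned). */
    static int classify(Set<Integer> record, List<Rule> rules, int k) {
        // Collect, per class, the Laplace accuracies of the rules whose
        // antecedents the record satisfies.
        Map<Integer, List<Double>> accsByClass =
                                new HashMap<Integer, List<Double>>();
        for (Rule r : rules) {
            if (record.containsAll(r.antecedent)) {
                List<Double> accs = accsByClass.get(r.consequent);
                if (accs == null) {
                    accs = new ArrayList<Double>();
                    accsByClass.put(r.consequent, accs);
                }
                accs.add(r.laplace);
            }
        }
        // Average the (up to) k best accuracies for each class and pick
        // the class with the highest average.
        int bestClass = -1;
        double bestAverage = -1.0;
        for (Map.Entry<Integer, List<Double>> e : accsByClass.entrySet()) {
            List<Double> accs = e.getValue();
            Collections.sort(accs, Collections.reverseOrder());
            int n = Math.min(k, accs.size());
            double sum = 0.0;
            for (int i = 0; i < n; i++) sum = sum + accs.get(i);
            if (sum / n > bestAverage) {
                bestAverage = sum / n;
                bestClass = e.getKey();
            }
        }
        return bestClass;
    }
}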

In Table 1 a set of accuracy values is presented. The table includes reported CPAR accuracies (taken from Yin and Han 2003). It is worth commenting that in many cases the reported CPAR accuracies are better than those produced by the LUCS-KDD implementation. In some cases the difference is negligible; in others it is significant (for example in the cases of the "glass" and "tic-tac-toe" data sets). In the case of the "ionosphere", "zoo" and "pima indians" data sets the LUCS-KDD implementation of CPAR performs better than the reported version. There are a number of reasons for the differences between the results:

  1. In the case of the "heart" and "auto" data sets it is clear from the literature that the versions of the data sets used here are not identical to those used by Xiaoxin Yin and Jiawei Han, thus the difference is not remarkable.
  2. The LUCS-KDD research team had no information as to the nature of the discretisation used, in particular the number of ranges used to discretise continuous attributes. It is clear from the available literature that the data sets used by Xiaoxin Yin and Jiawei Han were the same as those used to evaluate CMAR (Li et al. 2001), which in turn were the same as those used to evaluate CBA (Bing et al. 1998). According to Liu Bing et al. the data sets used to evaluate CBA were discretised using software available from the MLC++ machine learning library; however, no information is provided detailing this discretisation. The data sets used here have been discretised using the LUCS-KDD DN software available at:
    http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN.

  3. In the case of the data sets used to evaluate CMAR, Wenmin Li et al. reported that they used "C4.5's shuffle utility to shuffle the data sets". However, it is not clear whether this shuffling was also used by Liu Bing et al. or by Xiaoxin Yin and Jiawei Han. Whatever the case, it has not been applied to the data sets used here.
  4. Many of the data sets used to evaluate the LUCS-KDD implementations of FOIL, PRM and CPAR which were also used by Xiaoxin Yin and Jiawei Han are small, i.e. with the exception of "led7" they have less than 1000 records, and in some cases ("hepatitus", "iris", "wine" and "zoo") less than 200 records. This means that, when using TCV, the test set may comprise only a few tens of records, and we should therefore expect differences of a few percentage points either way.
DATA                       FOIL    PRM     CPAR    Reported CPAR
adult.D131.N48842.C2       82.5    76.7    76.7    -
anneal.D106.N798.C6        96.9    89.2    90.2    98.4
auto.D142.N205.C7 (a)      46.1    39.9    48.0    82.0
breast.D48.N699.C2         94.4    94.8    94.8    96.0
chessKRvK.D66.N28056.C18   42.6    32.9    32.8    -
connect4.D129.N67557.C3    65.7    57.8    54.3    -
glass.D52.N214.C7          49.3    47.0    48.0    74.4
heart.D53.N303.C5 (b)      57.4    56.7    51.1    82.6
hepatitus.D58.N155.C2      77.5    76.5    76.5    79.4
horseColic.D94.N368.C2     83.5    82.0    82.3    84.2
ionosphere.D104.N351.C2    89.5    93.2    92.9    92.6
iris.D23.N150.C3           94.0    94.7    94.7    94.7
led7.D24.N3200.C10         62.3    71.2    71.2    73.6
letRecog.D106.N20000.C26   57.5    60.6    59.9    -
mushroom.D127.N8124.C2     99.5    98.8    98.8    -
nursery.D32.N12960.C5      91.3    79.4    78.5    -
pageBlocks.D55.N5473.C5    91.6    80.7    76.2    -
penDigits.D90.N10992.C10   88.0    84.1    83.0    -
pimaIndians.D42.N768.C2    73.8    76.2    75.6    73.8
ticTacToe.D29.N958.C2      96.0    70.0    72.2    98.6
waveform.D108.N5000.C3     75.6    75.7    75.4    80.9
wine.D68.N178.C3           86.4    92.5    92.5    95.5
zoo.D43.N101.C7            96.0    92.0    96.0    95.1

Table 1: Comparison of the accuracies of FOIL, PRM, CPAR and reported CPAR, produced using Ten Cross Validation (TCV). (a) The CPAR "auto" data set, prior to discretisation, has 25 attributes, while only 24 were identified for the data set used here. (b) The CPAR "heart" data set has 270 records and 2 classes!

In Table 2 a set of execution times is presented. The results support those reported by Xiaoxin Yin and Jiawei Han in that CPAR is significantly faster than FOIL. The table also illustrates that for large data sets all three algorithms require significant amounts of processing time.

DATA                       FOIL       PRM       CPAR
adult.D131.N48842.C2       10251.0    348.6     809.0
anneal.D106.N798.C6        3.0        1.0       1.8
auto.D142.N205.C7          3.4        0.6       1.2
breast.D48.N699.C2         1.5        0.4       0.7
chessKRvK.D66.N28056.C18   10122.8    1005.1    1736.0
connect4.D129.N67557.C3    35572.5    7292.0    24047.1
glass.D52.N214.C7          0.8        0.3       0.6
heart.D53.N303.C5          2.3        0.5       1.0
hepatitus.D58.N155.C2      0.6        0.2       0.3
horseColic.D94.N368.C2     5.0        0.3       0.6
ionosphere.D104.N351.C2    5.8        0.6       1.1
iris.D23.N150.C3           0.1        0.1       0.2
led7.D24.N3200.C10         11.5       3.1       5.7
letRecog.D106.N20000.C26   4365.6     430.7     764.0
mushroom.D127.N8124.C2     38.3       9.6       15.4
nursery.D32.N12960.C5      73.1       27.8      51.7
pageBlocks.D55.N5473.C5    43.1       8.6       15.5
penDigits.D90.N10992.C10   821.1      55.4      101.9
pimaIndians.D42.N768.C2    3.8        0.5       1.0
ticTacToe.D29.N958.C2      1.6        0.3       0.6
waveform.D108.N5000.C3     295.3      20.3      38.1
wine.D68.N178.C3           0.5        0.2       0.3
zoo.D43.N101.C7            0.2        0.1       0.2

Table 2: Comparison of execution times (in seconds) for FOIL, PRM and CPAR.

Finally, Table 3 lists the number of rules generated in each case by each of the three approaches. From the table it can be seen that CPAR does not necessarily produce more rules than FOIL (although we might expect it to).

DATA                       FOIL      PRM       CPAR
adult.D131.N48842.C2       331.8     180.5     183.1
anneal.D106.N798.C6        41.4      33.7      34.0
auto.D142.N205.C7          47.1      47.5      76.3
breast.D48.N699.C2         34.6      16.3      16.0
chessKRvK.D66.N28056.C18   1116.7    1442.0    1504.8
connect4.D129.N67557.C3    285.8     813.8     816.1
glass.D52.N214.C7          45.0      35.3      39.3
heart.D53.N303.C5          59.0      45.6      53.2
hepatitus.D58.N155.C2      24.1      13.2      14.4
horseColic.D94.N368.C2     67.9      18.1      18.4
ionosphere.D104.N351.C2    29.6      24.3      26.7
iris.D23.N150.C3           13.6      10.9      10.9
led7.D24.N3200.C10         80.6      31.3      31.4
letRecog.D106.N20000.C26   560.9     590.2     643.0
mushroom.D127.N8124.C2     16.2      23.5      30.8
nursery.D32.N12960.C5      57.4      77.8      83.6
pageBlocks.D55.N5473.C5    123.1     53.3      56.2
penDigits.D90.N10992.C10   204.6     153.4     166.9
pimaIndians.D42.N768.C2    75.9      24.0      23.1
ticTacToe.D29.N958.C2      24.0      10.6      11.7
waveform.D108.N5000.C3     159.7     110.8     114.3
wine.D68.N178.C3           18.3      17.7      18.2
zoo.D43.N101.C7            10.1      14.8      18.9

Table 3: Comparison of the number of rules generated by FOIL, PRM and CPAR.




7. CONCLUSIONS


The LUCS-KDD implementations of FOIL, PRM and CPAR described here have been used successfully by the LUCS-KDD research group to contrast and compare various CARM algorithms and techniques. The software is available for free; however, the author would appreciate appropriate acknowledgement. The following reference formats for referring to the FOIL, PRM and CPAR implementations are suggested:

  1. Coenen, F. (2004), LUCS-KDD implementations of FOIL (First Order Inductive Learner), http://www.csc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/foil.html, Department of Computer Science, The University of Liverpool, UK.
  2. Coenen, F. (2004), LUCS-KDD implementations of PRM (Predictive Rule Mining), http://www.csc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/prm.html, Department of Computer Science, The University of Liverpool, UK.
  3. Coenen, F. (2004), LUCS-KDD implementations of CPAR (Classification based on Predictive Association Rules), http://www.csc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/cpar.html, Department of Computer Science, The University of Liverpool, UK.

Should you wish to refer to this WWW page the following reference format is suggested.

Coenen, F. (2004), LUCS-KDD implementations of the FOIL, PRM and CPAR algorithms, http://www.csc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.




REFERENCES


  1. Bing, L., Hsu, W. and Ma, Y. (1998). Integrating Classification and Association Rule Mining. Proc. KDD-98.
  2. Blake, C.L. and Merz, C.J. (1998). UCI Repository of machine learning databases http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science.
  3. Li W., Han, J. and Pei, J. (2001). CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. Proc ICDM 2001, pp369-376.
  4. Quinlan, J. R. and Cameron-Jones, R. M. (1993). FOIL: A Midterm Report. Proc. ECML, Vienna, Austria, pp3-20.
  5. Yin, X. and Han, J. (2003). CPAR: Classification based on Predictive Association Rules. Proc. SIAM Int. Conf. on Data Mining (SDM'03), San Francisco, CA, pp331-335.



Created and maintained by Frans Coenen. Last updated 17 March 2004