1. INTRODUCTION
FOIL (First Order Inductive Learner) is an inductive learning algorithm for generating Classification Association Rules (CARs) developed by Ross Quinlan and Mike Cameron-Jones (Quinlan and Cameron-Jones 1993). This algorithm was later further developed by Xiaoxin Yin and Jiawei Han to produce the PRM (Predictive Rule Mining) CAR generation algorithm (Yin and Han 2003). PRM was then further developed, by the same authors, to produce CPAR (Classification based on Predictive Association Rules).
The set of WWW pages "hanging" from this page describes Java implementations, developed by the LUCS-KDD research team, of the FOIL, PRM and CPAR algorithms. The implementations are founded on the descriptions found in Yin and Han (2003), and have been used to compare performance with other CARM algorithms.
2. DOWNLOADING THE SOFTWARE
There are six source files and six application files available here. The source files are as follows:
The source files are arranged in a class hierarchy of the following form:
The applications are of two types:
The applications are as follows:
There is also a "tar ball" foilPrmCpar.tgz that can be downloaded that includes all of the source and application files. It can be unpacked using tar -zxf foilPrmCpar.tgz.
The software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:
javac *.java
The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:
javadoc -d Documentation/ *.java
This will produce a hierarchy of WWW pages contained in the Documentation directory.
3. RUNNING THE SOFTWARE
When compiled the software can be invoked in the normal manner using the Java interpreter:
java FILE_NAME
If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:
java -Xms600m -Xmx600m FILE_NAME
The input to the software, in all cases, is a (space separated) binary valued data set R. The set R comprises N records such that each record r comprises a set of attributes and a class. Thus:
R = {r | r = {subset of A} ∪ {one member of C}}
Where A is the set of available attributes, and C is the set of available classes. The value D is then defined as:
D = |A| + |C|
We then say that a particular data set has D columns and N rows. A small example data set might be as follows:
1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7
where, in this case, A = {1, 2, 3, 4, 5}, C = {6, 7}, D = 7 and N = 5. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning). Note also that the class is always the last item listed in each record. The input file is included using a -F flag.
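The record layout described above (space-separated attribute numbers followed by the class as the last item) is simple to parse. The following sketch illustrates this; the RecordParser class and its method names are hypothetical helpers for illustration, not part of the LUCS-KDD distribution:

```java
import java.util.Arrays;

public class RecordParser {

    // Split one space-separated record into its attribute numbers;
    // every item except the last is an attribute.
    public static int[] attributes(String record) {
        String[] items = record.trim().split("\\s+");
        int[] attrs = new int[items.length - 1];
        for (int i = 0; i < attrs.length; i++) {
            attrs[i] = Integer.parseInt(items[i]);
        }
        return attrs;
    }

    // The class is always the last item listed in each record.
    public static int classLabel(String record) {
        String[] items = record.trim().split("\\s+");
        return Integer.parseInt(items[items.length - 1]);
    }

    public static void main(String[] args) {
        String record = "1 2 3 6"; // first record of the example data set
        System.out.println(Arrays.toString(attributes(record))); // [1, 2, 3]
        System.out.println(classLabel(record));                  // 6
    }
}
```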
The programs also require information regarding the number of classes. This is input using a -N flag.
Some example invocations, using a discretised/normalised Pima Indians data set (also available from this site), are given below:
java ClassFOIL_App -FpimaIndians.D42.N768.C2.num -N2
java ClassPRM_App10 -FpimaIndians.D42.N768.C2.num -N2
java ClassCPAR_App10 -N2 -FpimaIndians.D42.N768.C2.num
(note that the ordering of flags is not significant). The output from each application is a classifier plus some diagnostic information (run time, number of rules generated etc.).
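Because flag ordering is not significant, -F and -N can be matched by prefix in any order. A minimal sketch of this style of flag handling (the FlagParser class is hypothetical, not the actual LUCS-KDD parser) is:

```java
public class FlagParser {

    // Return the value following a flag prefix such as "-F" or "-N",
    // or null if the flag is absent. Flags may appear in any order.
    public static String valueOf(String prefix, String[] args) {
        for (String arg : args) {
            if (arg.startsWith(prefix)) {
                return arg.substring(prefix.length());
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[] demo = {"-N2", "-FpimaIndians.D42.N768.C2.num"};
        System.out.println(valueOf("-F", demo)); // pimaIndians.D42.N768.C2.num
        System.out.println(valueOf("-N", demo)); // 2
    }
}
```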
4. MORE DETAIL ON APPLICATION CLASSES
FOIL, PRM and CPAR Applications classes all have the following basic form:
Some output is always generated, such as: (1) the input parameters and start settings, and (2) "milestones" during processing. Additional output statements can be included in application classes. The available additional output options are as follows:
double time1 = (double) System.currentTimeMillis();
// Program statements
<INSTANCE_NAME>.outputDuration(time1, (double) System.currentTimeMillis());
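The outputDuration() call above reports the elapsed wall-clock time between two System.currentTimeMillis() readings. A self-contained illustration of the same start/end timing pattern (the Timing class and elapsedSeconds helper are illustrative, not part of the distribution) might look like:

```java
public class Timing {

    // Convert two System.currentTimeMillis() readings (taken as doubles,
    // as in the outputDuration() usage pattern) into elapsed seconds.
    public static double elapsedSeconds(double start, double end) {
        return (end - start) / 1000.0;
    }

    public static void main(String[] args) throws InterruptedException {
        double time1 = (double) System.currentTimeMillis();
        Thread.sleep(50); // stands in for the program statements being timed
        double time2 = (double) System.currentTimeMillis();
        System.out.printf("Duration = %.2f seconds%n",
                elapsedSeconds(time1, time2));
    }
}
```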
Note that the first three of the above output methods are instance methods contained in the AssocRuleMining class and are therefore inherited by the CPAR_CARgen, PRM_CARgen and FOIL_CARgen classes. The last two of the above are instance methods of the RuleList class and thus must be called using an instance of that class. An instance of the RuleList class is created during processing and may be accessed using the getCurrentRuleListObject() public method found in the Classification class. Thus, for example, the outputRules() method would be invoked as follows:
< INSTANCE_NAME > .getCurrentRuleListObject().outputRules();
When using TCV it makes little sense to use the outputNumRules() and outputRules() methods as these will only serve to output the last tenth of the TCV process. If output is required then the commenceTCVwithOutput() method should be used.
5. MORE DETAIL ON ALGORITHMS USED
More information on the FOIL, PRM and CPAR algorithms is available.
6. SOME RESULTS
The following tables give some performance indicators whereby the algorithms can be compared. The figures have been obtained using a number of data sets taken from the UCI Machine Learning Repository (Blake and Merz 1998) which have been discretised/normalised using the LUCS-KDD DN software system. In each case continuous attributes have been discretised into five sub-ranges. It would have been desirable to use the same data sets as those used by Xiaoxin Yin and Jiawei Han; however, it was discovered (January 2003) that many of these data sets were no longer available in the UCI repository. In two cases (auto and heart) the parameters for the data sets are not identical to those reported in Yin and Han (2003), whose paper actually refers back to an earlier paper by Wenmin Li, Jiawei Han and Jian Pei (Li et al. 2001). A number of further data sets are included below, selected because of their size (the data sets used in Yin and Han (2003) are all quite small). In each case the name of the data set includes: the number of attributes after discretisation (D), the number of records (N) and the number of classes (C).
The parameters for the individual algorithms (same as those reported in Yin and Han 2003) are as follows:
Note also that the best 5 rules are used, in each case, when classifying a test record, and that for CPAR a similarity ratio of 1:0.99 was used.
In Table 1 a set of accuracy values are presented. The table includes reported CPAR accuracies (taken from Yin and Han 2003). It is worth commenting that in many cases the reported CPAR accuracies are better than those produced by the LUCS-KDD implementation. In some cases the difference is negligible, in others it is significant (for example in the cases of the "glass" and "tic-tac-toe" data sets). In the case of the "ionosphere", "zoo" and "pima indians" data sets the LUCS-KDD implementation of CPAR performs better than the reported version. There are a number of reasons for the differences between the results:
DATA | FOIL | PRM | CPAR | Reported CPAR |
---|---|---|---|---|
adult.D131.N48842.C2 | 82.5 | 76.7 | 76.7 | _ |
anneal.D106.N798.C6 | 96.9 | 89.2 | 90.2 | 98.4 |
auto.D142.N205.C7a | 46.1 | 39.9 | 48.0 | 82.0 |
breast.D48.N699.C2 | 94.4 | 94.8 | 94.8 | 96.0 |
chessKRvK.D66.N28056.C18 | 42.6 | 32.9 | 32.8 | - |
connect4.D129.N67557.C3 | 65.7 | 57.8 | 54.3 | - |
glass.D52.N214.C7 | 49.3 | 47.0 | 48.0 | 74.4 |
heart.D53.N303.C5b | 57.4 | 56.7 | 51.1 | 82.6 |
hepatitus.D58.N155.C2 | 77.5 | 76.5 | 76.5 | 79.4 |
horseColic.D94.N368.C2 | 83.5 | 82.0 | 82.3 | 84.2 |
ionosphere.D104.N351.C2 | 89.5 | 93.2 | 92.9 | 92.6 |
iris.D23.N150.C3 | 94.0 | 94.7 | 94.7 | 94.7 |
led7.D24.N3200.C10 | 62.3 | 71.2 | 71.2 | 73.6 |
letRecog.D106.N20000.C26 | 57.5 | 60.6 | 59.9 | - |
mushroom.D127.N8124.C2 | 99.5 | 98.8 | 98.8 | - |
nursery.D32.N12960.C5 | 91.3 | 79.4 | 78.5 | - |
pageBlocks.D55.N5473.C5 | 91.6 | 80.7 | 76.2 | - |
penDigits.D90.N10992.C10 | 88.0 | 84.1 | 83.0 | - |
pimaIndians.D42.N768.C2 | 73.8 | 76.2 | 75.6 | 73.8 |
ticTacToe.D29.N958.C2 | 96.0 | 70.0 | 72.2 | 98.6 |
waveform.D108.N5000.C3 | 75.6 | 75.7 | 75.4 | 80.9 |
wine.D68.N178.C3 | 86.4 | 92.5 | 92.5 | 95.5 |
zoo.D43.N101.C7 | 96.0 | 92.0 | 96.0 | 95.1 |
Table 1: Comparison of accuracy of FOIL, PRM, CPAR and Reported CPAR produced using Ten Cross Validation (TCV). a The CPAR "auto" data set, prior to discretisation, has 25 attributes, while only 24 were identified for the test set used here. b The CPAR "heart" data set has 270 records and 2 classes!
In Table 2 a set of execution times are presented. The results support those reported by Xiaoxin Yin and Jiawei Han in that CPAR is significantly faster than FOIL. The table also illustrates that for large datasets all three algorithms require significant amounts of processing time.
DATA | FOIL | PRM | CPAR |
---|---|---|---|
adult.D131.N48842.C2 | 10251.0 | 348.6 | 809.0 |
anneal.D106.N798.C6 | 3.0 | 1.0 | 1.8 |
auto.D142.N205.C7 | 3.4 | 0.6 | 1.2 |
breast.D48.N699.C2 | 1.5 | 0.4 | 0.7 |
chessKRvK.D66.N28056.C18 | 10122.8 | 1005.1 | 1736.0 |
connect4.D129.N67557.C3 | 35572.5 | 7292.0 | 24047.1 |
glass.D52.N214.C7 | 0.8 | 0.3 | 0.6 |
heart.D53.N303.C5 | 2.3 | 0.5 | 1.0 |
hepatitus.D58.N155.C2 | 0.6 | 0.2 | 0.3 |
horseColic.D94.N368.C2 | 5.0 | 0.3 | 0.6 |
ionosphere.D104.N351.C2 | 5.8 | 0.6 | 1.1 |
iris.D23.N150.C3 | 0.1 | 0.1 | 0.2 |
led7.D24.N3200.C10 | 11.5 | 3.1 | 5.7 |
letRecog.D106.N20000.C26 | 4365.6 | 430.7 | 764.0 |
mushroom.D127.N8124.C2 | 38.3 | 9.6 | 15.4 |
nursery.D32.N12960.C5 | 73.1 | 27.8 | 51.7 |
pageBlocks.D55.N5473.C5 | 43.1 | 8.6 | 15.5 |
penDigits.D90.N10992.C10 | 821.1 | 55.4 | 101.9 |
pimaIndians.D42.N768.C2 | 3.8 | 0.5 | 1.0 |
ticTacToe.D29.N958.C2 | 1.6 | 0.3 | 0.6 |
waveform.D108.N5000.C3 | 295.3 | 20.3 | 38.1 |
wine.D68.N178.C3 | 0.5 | 0.2 | 0.3 |
zoo.D43.N101.C7 | 0.2 | 0.1 | 0.2 |
Table 2: Comparison of execution times for FOIL, PRM, and CPAR.
Finally, Table 3 lists the number of rules generated in each case by each of the three approaches. From the table it can be seen that CPAR does not necessarily produce more rules than FOIL (although we might expect it to).
DATA | FOIL | PRM | CPAR |
---|---|---|---|
adult.D131.N48842.C2 | 331.8 | 180.5 | 183.1 |
anneal.D106.N798.C6 | 41.4 | 33.7 | 34.0 |
auto.D142.N205.C7 | 47.1 | 47.5 | 76.3 |
breast.D48.N699.C2 | 34.6 | 16.3 | 16.0 |
chessKRvK.D66.N28056.C18 | 1116.7 | 1442.0 | 1504.8 |
connect4.D129.N67557.C3 | 285.8 | 813.8 | 816.1 |
glass.D52.N214.C7 | 45.0 | 35.3 | 39.3 |
heart.D53.N303.C5 | 59.0 | 45.6 | 53.2 |
hepatitus.D58.N155.C2 | 24.1 | 13.2 | 14.4 |
horseColic.D94.N368.C2 | 67.9 | 18.1 | 18.4 |
ionosphere.D104.N351.C2 | 29.6 | 24.3 | 26.7 |
iris.D23.N150.C3 | 13.6 | 10.9 | 10.9 |
led7.D24.N3200.C10 | 80.6 | 31.3 | 31.4 |
letRecog.D106.N20000.C26 | 560.9 | 590.2 | 643.0 |
mushroom.D127.N8124.C2 | 16.2 | 23.5 | 30.8 |
nursery.D32.N12960.C5 | 57.4 | 77.8 | 83.6 |
pageBlocks.D55.N5473.C5 | 123.1 | 53.3 | 56.2 |
penDigits.D90.N10992.C10 | 204.6 | 153.4 | 166.9 |
pimaIndians.D42.N768.C2 | 75.9 | 24.0 | 23.1 |
ticTacToe.D29.N958.C2 | 24.0 | 10.6 | 11.7 |
waveform.D108.N5000.C3 | 159.7 | 110.8 | 114.3 |
wine.D68.N178.C3 | 18.3 | 17.7 | 18.2 |
zoo.D43.N101.C7 | 10.1 | 14.8 | 18.9 |
Table 3: Comparison of the number of rules generated by FOIL, PRM and CPAR.
7. CONCLUSIONS
The LUCS-KDD implementations of FOIL, PRM and CPAR described here have been used successfully by the LUCS-KDD research group to contrast and compare various CARM algorithms and techniques. The software is available for free; however, the author would appreciate appropriate acknowledgement. Should you wish to refer to this WWW page, the following reference format is suggested:
Coenen, F. (2004), LUCS-KDD implementations of the FOIL, PRM and CPAR algorithms, http://www.cxc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/, Department of Computer Science, The University of Liverpool, UK.
Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.
REFERENCES
Blake, C.L. and Merz, C.J. (1998). UCI Repository of Machine Learning Databases. University of California, Irvine, Department of Information and Computer Science.
Li, W., Han, J. and Pei, J. (2001). CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. Proc. IEEE International Conference on Data Mining (ICDM 2001), pp. 369-376.
Quinlan, J.R. and Cameron-Jones, R.M. (1993). FOIL: A Midterm Report. Proc. European Conference on Machine Learning (ECML'93), pp. 3-20.
Yin, X. and Han, J. (2003). CPAR: Classification based on Predictive Association Rules. Proc. SIAM International Conference on Data Mining (SDM'03), pp. 331-335.
Created and maintained by Frans Coenen. Last updated 17 March 2004