DISCLAIMER! The following description of the CMAR algorithm (Li et al. 2001) is that implemented by the author of this WWW page; it is not identical to that first produced by Wenmin Li, Jiawei Han and Jian Pei, but it certainly displays all the "salient" features of the CMAR algorithm.
1. INTRODUCTION
CMAR (Classification based on Multiple Association Rules) is a Classification Association Rule Mining (CARM) algorithm developed by Wenmin Li, Jiawei Han and Jian Pei (Li et al. 2001). CMAR operates using a two-stage approach to generating a classifier:
The two-stage approach is a fairly common approach used by many CARM algorithms, for example the CBA (Classification Based on Associations) algorithm (Liu et al. 1998).
CMAR makes significant use of the statistical concept of Chi-Squared testing; readers who are not familiar with this concept are recommended to refer to http://www.csc.liv.ac.uk/~frans/Notes/chiTesting.html.
The CMAR algorithm (as described in Li et al. 2001) uses the FP-growth algorithm (Han et al. 2000) to produce a set of CARs which are stored in a data structure referred to as a CR tree. CARs are inserted into the CR tree if:
Given two CARs, R1 and R2, R1 is said to be a more general rule than R2 if the antecedent of R1 is a subset of the antecedent of R2. In CMAR, CARs are prioritised using the following ordering:
(A similar ordering to that used in CBA).
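The generality test and prioritisation described above can be sketched in Java as follows. This is an illustrative reconstruction, not the LUCS-KDD source: the CandidateRule class and its field names are assumptions, and the ordering follows Li et al. 2001 (higher confidence first, ties broken by higher support, then by smaller antecedent).

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of CMAR rule generality and prioritisation
// (assumed class and field names, not the LUCS-KDD source).
public class RuleOrdering {
    static class CandidateRule {
        Set<Integer> antecedent;   // attribute numbers in the rule LHS
        int consequent;            // the class attribute (rule RHS)
        double confidence;
        int support;
        CandidateRule(Set<Integer> a, int c, double conf, int sup) {
            antecedent = a; consequent = c; confidence = conf; support = sup;
        }
    }

    // R1 is more general than R2 if antecedent(R1) is a subset of antecedent(R2).
    static boolean isMoreGeneral(CandidateRule r1, CandidateRule r2) {
        return r2.antecedent.containsAll(r1.antecedent);
    }

    // CMAR ordering: confidence desc, then support desc, then antecedent size asc.
    static final Comparator<CandidateRule> CMAR_ORDER =
        Comparator.comparingDouble((CandidateRule r) -> r.confidence).reversed()
                  .thenComparing(Comparator.comparingInt(
                          (CandidateRule r) -> r.support).reversed())
                  .thenComparingInt(r -> r.antecedent.size());

    // Convenience factory used in the examples below.
    static CandidateRule rule(double conf, int sup, Integer... ante) {
        return new CandidateRule(new HashSet<>(Arrays.asList(ante)), 6, conf, sup);
    }
}
```

Sorting a rule list with CMAR_ORDER yields the prioritised list R referred to below.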
Once a complete CR tree has been produced, the generated rules are placed into a list R, ordered according to the above prioritisation, which is then pruned. The pruning algorithm (using the cover principle) is presented in Table 1.
Table 1: CMAR rule pruning using the "cover" principle.
R' is then the resulting classifier. In their experiments Li et al. used a MIN_COVER value of 3, i.e. each record had to be satisfied (covered) by at least three rules before it was removed from the training set and no longer considered in the generation process.
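The cover-principle pruning can be sketched as follows. This is an assumed reconstruction for illustration (the Rule and Record classes are hypothetical, not the LUCS-KDD source): rules are processed in priority order; a rule is kept if it correctly classifies at least one remaining record, in which case the cover count of every record it satisfies is incremented, and records covered MIN_COVER times are removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch of CMAR "cover" pruning (assumed reconstruction,
// not the LUCS-KDD source). A rule covers a record if its antecedent
// is a subset of the record's attribute set.
public class CoverPruning {
    static final int MIN_COVER = 3;  // value used by Li et al.

    static class Rule {
        Set<Integer> antecedent; int consequent;
        Rule(Set<Integer> a, int c) { antecedent = a; consequent = c; }
    }
    static class Record {
        Set<Integer> attributes; int classLabel; int coverCount = 0;
        Record(Set<Integer> a, int c) { attributes = a; classLabel = c; }
    }

    // rules are assumed already sorted according to the CMAR prioritisation.
    static List<Rule> prune(List<Rule> rules, List<Record> training) {
        List<Rule> kept = new ArrayList<>();
        List<Record> remaining = new ArrayList<>(training);
        for (Rule rule : rules) {
            if (remaining.isEmpty()) break;      // all records covered enough
            List<Record> matched = new ArrayList<>();
            boolean classifiesCorrectly = false;
            for (Record rec : remaining) {
                if (rec.attributes.containsAll(rule.antecedent)) {
                    matched.add(rec);
                    if (rec.classLabel == rule.consequent) classifiesCorrectly = true;
                }
            }
            if (classifiesCorrectly) {
                kept.add(rule);                  // rule survives pruning
                for (Record rec : matched) rec.coverCount++;
                remaining.removeIf(rec -> rec.coverCount >= MIN_COVER);
            }
        }
        return kept;                             // this is the classifier R'
    }
}
```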
To test the resulting classifier Li et al. propose the following process. Given a record r in the test set:
The strength of a group is calculated using a Weighted Chi-Squared (WCS) measure. This is done by first defining a Maximum Chi-Squared (MCS) value for each rule A->c:
MCS = (min(sup(A),sup(c)) - sup(A)sup(c)/N)^2 * N * e
Where:
e = 1/(sup(A)sup(c)) + 1/(sup(A)(N-sup(c))) + 1/((N-sup(A))sup(c)) + 1/((N-sup(A))(N-sup(c)))
For each group of rules the Weighted Chi Squared value is defined as:
WCS = the sum, over all rules in the group, of (Chi-Squared * Chi-Squared)/MCS
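The MCS and WCS definitions above translate directly into code. The following is an illustrative sketch (not the LUCS-KDD source; method and parameter names are assumptions), where supA and supC are the supports of the rule antecedent A and the consequent class c, and n is the number of training records:

```java
// Illustrative sketch of the Weighted Chi-Squared computation
// defined above (assumed method names, not the LUCS-KDD source).
public class WeightedChiSquared {

    // e term from the MCS definition.
    static double e(double supA, double supC, double n) {
        return 1.0 / (supA * supC)
             + 1.0 / (supA * (n - supC))
             + 1.0 / ((n - supA) * supC)
             + 1.0 / ((n - supA) * (n - supC));
    }

    // Maximum possible Chi-Squared value for a rule A -> c.
    static double mcs(double supA, double supC, double n) {
        double d = Math.min(supA, supC) - supA * supC / n;
        return d * d * n * e(supA, supC, n);
    }

    // WCS for a group of rules: sum of chiSq^2 / MCS over the group.
    // chiSq[i], supA[i] and supC[i] describe the i-th rule in the group.
    static double wcs(double[] chiSq, double[] supA, double[] supC, double n) {
        double total = 0.0;
        for (int i = 0; i < chiSq.length; i++)
            total += (chiSq[i] * chiSq[i]) / mcs(supA[i], supC[i], n);
        return total;
    }
}
```

The class whose rule group yields the largest WCS value is then assigned to the record being classified.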
The LUCS-KDD implementation of CMAR operates in exactly the same manner as described above except that the CARs are generated using the Apriori-TFP algorithm (Coenen et al. 2001, Coenen et al. 2004 and Coenen and Leng 2004) and placed directly into the rule list R, whereas the original CMAR algorithm used a variation of FP-growth (Han et al. 2000).
2. DOWNLOADING THE SOFTWARE
The LUCS KDD CMAR software comprises nine source files. These are provided from this WWW page together with three application classes. The source files are as follows:
The PtreeNode, PtreeNodeTop and TtreeNode classes are separate from the remaining classes, which are arranged in a class hierarchy of the form illustrated below.
The Apriori-TFPC CMAR application classes included here are as follows:
There is also a "tar ball" cmar.tgz that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf cmar.tgz.
The LUCS-KDD CMAR software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:
javac *.java
The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:
javadoc -d Documentation/ *.java
This will produce a hierarchy of WWW pages contained in the Documentation directory.
3. RUNNING THE SOFTWARE
When compiled the software can be invoked in the normal manner using the Java interpreter:
java APPLICATION_CLASS_FILE_NAME -F FILE_NAME -N NUMBER_OF_CLASSES
For example:
java ClassCMAR_App -FpimaIndians.D42.N768.C2.num -N2
The -F flag is used to input the name of the data file to be used, and the -N flag to input the number of classes represented within the input data.
If you are planning to process a very large data set it is a good idea to grab some extra memory. Thus:
java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME -F FILE_NAME -N NUMBER_OF_CLASSES
For example:
java -Xms600m -Xmx600m ClassCMAR_App10 -FpimaIndians.D42.N768.C2.num -N2
In the case of ClassCMARfull_App10 the output can be extensive, so it is a good idea to direct the output to a file. For example:
java ClassCMAR_App10 -FpimaIndians.D42.N768.C2.num -N2 > myFile
The input to the software, in all cases, is a (space separated) binary valued data set T. The set T comprises a set of N records such that each record t, in turn, comprises a set of attributes. Thus:
T = {t | t ⊆ A}
Where A is the set of available attributes. The value D is then defined as:
D = |A|
We then say that a particular data set has D columns and N rows. A small example data set might be as follows:
1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7
where, in this case, A = {1, 2, 3, 4, 5, 6, 7}, of which 6 and 7 represent the possible classes (the class attributes). Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning). This includes the class attributes, which should follow on from the last attribute number, as in the above example. It is not a good idea to assign some very high value to the class attributes simply to guarantee that they come numerically after the last attribute number, as this introduces inefficiencies into the code. For example, if in the above case the class numbers were 1001 and 1002 (instead of 6 and 7 respectively), the algorithm would assume that there are 1002 attributes in total, in which case there would be 2^1002 - 1 candidate frequent item sets (instead of only 2^7 - 1 candidate sets).
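Reading this format is straightforward, since each line is one record with the class attribute last. The following is a minimal illustrative sketch (the actual LUCS-KDD input routines live in the AssocRuleMining class hierarchy; the class and method names here are assumptions):

```java
// Minimal illustrative reader for the space-separated input format
// described above (assumed names, not the LUCS-KDD source). Each line
// is one record: attribute numbers in ascending order, class last.
public class DataReader {

    // Parse one record line into an array of attribute numbers.
    static int[] parseRecord(String line) {
        String[] tokens = line.trim().split("\\s+");
        int[] record = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++)
            record[i] = Integer.parseInt(tokens[i]);
        return record;
    }

    // The class label is the final attribute on each line.
    static int classOf(int[] record) {
        return record[record.length - 1];
    }
}
```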
The program assumes support and confidence default threshold values of 20% and 80% respectively. However the user may enter their own thresholds using the -S and -C flags. Thresholds are expressed as percentages.
By default the program reorders the input according to frequency (this makes the software much more efficient), however this can be stopped by commenting out the lines:
newClassification.idInputDataOrdering();  // ClassificationAprioriT
newClassification.recastInputData();      // AssocRuleMining
Some example invocations, using a discretized/normalised version of the Pima Indians data set (also available from this site; the raw data can be obtained from the UCI Machine Learning Repository (Blake and Merz 1998)) and the three application classes provided by this WWW site, are given below:
java ClassCMAR_App -FpimaIndians.D42.N768.C2.num -N2 -S1 -C50
java ClassCMAR_App10 -FpimaIndians.D42.N768.C2.num -N2
java ClassOnlyCMAR_App -FpimaIndians.D42.N768.C2.num -N2 -S5
(note that the ordering of flags is not significant). The output from each application is a set of CRs (ordered according to the prioritisation given in Sub-section 1.1) plus some diagnostic information (run time, number of rules generated etc.).
4. VERSION 2
This is Version 2 of the software. Differences between this and Version 1 are: (i) a more sophisticated way of conducting TCV, (ii) the capability of inputting separate training and test data files, and (iii) better statistical analysis of end results (AUC values, etc.). Version 1 is still available.
5. CONCLUSIONS
The LUCS-KDD implementation of the CMAR algorithm described here has been used successfully by the LUCS-KDD research group to contrast and compare a variety of classification algorithms and techniques. The software is available for free; however, the author would appreciate appropriate acknowledgement. The following reference format for referring to the LUCS-KDD implementation of CMAR, available from this WWW page, is suggested:
Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.
REFERENCES
Created and maintained by Frans Coenen. Last updated 5 May 2004