THE LUCS-KDD IMPLEMENTATIONS OF THE CMAR ALGORITHM (VERSION 2)



Liverpool University

Frans Coenen

Department of Computer Science

The University of Liverpool

Tuesday 27 April 2004

Updated: Wednesday 3 December 2008, Monday 1 June 2009, Saturday 3 February 2012.

DISCLAIMER! The following description of the CMAR algorithm (Li et al. 2001) is that implemented by the author of this WWW page; i.e. it is not identical to that first produced by Wenmin Li, Jiawei Han and Jian Pei, but certainly displays all the "salient" features of the CMAR algorithm.


CONTENTS

1. Introduction.
1.1. Overview of CMAR.
1.2. Overview of LUCS-KDD implementation of CMAR.
2. Downloading the software.
2.1. Compiling.
2.2. Documentation.
3. Running the software.
4. Version 1.
5. Conclusions.



1. INTRODUCTION


CMAR (Classification based on Multiple Association Rules) is a Classification Association Rule Mining (CARM) algorithm developed by Wenmin Li, Jiawei Han and Jian Pei (Li et al. 2001). CMAR operates using a two-stage approach to generating a classifier:

  1. Generate the complete set of CARs according to user-supplied:
    1. Support threshold, used to identify the frequent (large) item sets, and
    2. Confidence threshold, used to confirm CARs.
  2. Prune this set to produce a classifier.

This two-stage approach is common to many CARM algorithms, for example the CBA (Classification Based on Associations) algorithm (Liu et al. 1998).

CMAR makes significant use of the statistical concept of Chi-squared testing; readers who are unfamiliar with this concept may wish to refer to http://www.csc.liv.ac.uk/~frans/Notes/chiTesting.html.


1.1. Overview of CMAR

The CMAR algorithm (as described in Li et al. 2001) uses an FP-growth algorithm (Han et al. 2000) to produce a set of CARs which are stored in a data structure referred to as a CR tree. CARs are inserted into the CR tree if:

  1. The Chi-Squared value is above a user specified critical threshold (Li et al. suggest a 5% significance level; assuming one degree of freedom, this equates to a threshold of 3.8415), and
  2. The CR tree does not contain a more general rule with a higher priority.

Given two CARs, R1 and R2, R1 is said to be a more general rule than R2 if the antecedent of R1 is a subset of the antecedent of R2. In CMAR, CARs are prioritised using the following ordering:

  1. Confidence: A rule r1 has priority over a rule r2 if confidence(r1) > confidence(r2).
  2. Support: A rule r1 has priority over a rule r2 if confidence(r1)==confidence(r2) && support(r1)>support(r2).
  3. Size of antecedent: A rule r1 has priority over a rule r2 if confidence(r1)==confidence(r2) && support(r1)==support(r2) && |Ar1|<|Ar2|, where |Ar| denotes the size of the antecedent of rule r.

(This is similar to the ordering used in CBA; a Java sketch of the ordering is given below.)
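
To make this concrete, the following Java sketch implements the above prioritisation as a Comparator, together with the "more general rule" test. The Rule class and its field names are hypothetical illustrations, not the actual LUCS-KDD classes:

import java.util.Comparator;
import java.util.Set;

// Hypothetical CAR representation; the field names are illustrative only.
class Rule {
    Set<Integer> antecedent;   // attribute numbers making up the antecedent
    int consequent;            // the class attribute number
    double confidence;
    double support;
}

// CMAR rule ordering: higher confidence first, then higher support,
// then smaller antecedent.
class CmarRuleOrdering implements Comparator<Rule> {
    public int compare(Rule r1, Rule r2) {
        if (r1.confidence != r2.confidence)
            return Double.compare(r2.confidence, r1.confidence);
        if (r1.support != r2.support)
            return Double.compare(r2.support, r1.support);
        return Integer.compare(r1.antecedent.size(), r2.antecedent.size());
    }

    // r1 is a more general rule than r2 if r1's antecedent is a
    // subset of r2's antecedent.
    static boolean moreGeneral(Rule r1, Rule r2) {
        return r2.antecedent.containsAll(r1.antecedent);
    }
}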

Once a complete CR tree has been produced, the generated rules are placed into a list R, ordered according to the above prioritisation, which is then pruned. The pruning algorithm (using the cover principle) is presented in Table 1.

T = the set of training records
C = an array of length |T| with all elements set to 0
R = the prioritised rule list
R' = an empty rule list
For each r in R
    if (T is empty) break
    coverFlag <-- false
    Loop for each ti in T (1 <= i <= |T|)
        if (r.antecedent is a subset of ti)    (rule satisfies record)
            C[i] <-- C[i] + 1
            coverFlag <-- true
    End loop
    if (coverFlag = true)                      (rule satisfies at least one record)
        R' <-- R' union {r}
    Loop from i <-- 1 to i <-- |C| in steps of 1
        if (C[i] > MIN_COVER) remove ti from T
    End loop
End loop

Table 1: CMAR rule pruning using the "cover" principle.


R' is then the resulting classifier. In their experiments Li et al. used a MIN_COVER value of 3; i.e. each record had to be satisfied (covered) by at least three rules before it was removed from the training set and no longer considered in the generation process.
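
The following minimal Java sketch illustrates the Table 1 pruning. The representations used (a rule reduced to its antecedent, records and antecedents as sets of attribute numbers) are assumptions made purely for illustration and do not correspond to the LUCS-KDD classes:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of Table 1 (cover-based pruning): keep each rule that satisfies
// at least one remaining record; drop a record once it has been covered
// more than minCover times.
class CoverPruner {

    static List<Set<Integer>> prune(List<Set<Integer>> ruleAntecedents,
                                    List<Set<Integer>> trainingSet,
                                    int minCover) {
        List<Set<Integer>> classifier = new ArrayList<>();         // R'
        List<Set<Integer>> records = new ArrayList<>(trainingSet); // working copy of T
        List<Integer> cover = new ArrayList<>();                   // C, parallel to records
        for (int i = 0; i < records.size(); i++) cover.add(0);

        for (Set<Integer> antecedent : ruleAntecedents) {          // for each r in R
            if (records.isEmpty()) break;
            boolean coverFlag = false;
            for (int i = 0; i < records.size(); i++) {
                if (records.get(i).containsAll(antecedent)) {      // rule satisfies record
                    cover.set(i, cover.get(i) + 1);
                    coverFlag = true;
                }
            }
            if (coverFlag) classifier.add(antecedent);             // R' <-- R' union {r}
            for (int i = records.size() - 1; i >= 0; i--) {        // remove covered records
                if (cover.get(i) > minCover) {
                    records.remove(i);
                    cover.remove(i);
                }
            }
        }
        return classifier;
    }
}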

To test the resulting classifier Li et al. propose the following process. Given a record r in the test set:

  1. Collect all rules that satisfy r, and:
    1. If the consequents of all the rules are identical, or there is only one rule, classify the record according to that consequent.
    2. Else group the rules according to their consequent (class), determine the combined effect of the rules in each group, and select the class associated with the "strongest group".

The strength of a group is calculated using a Weighted Chi-Squared (WCS) measure. This is done by first defining a Maximum Chi-Squared (MCS) value for each rule A -> c:

MCS = (min(sup(A), sup(c)) - sup(A)sup(c)/N)^2 * N * e

Where:

  1. sup(A) = support for the antecedent.
  2. sup(c) = support for the consequent.
  3. N = the number of records in the training set.
  4. e is calculated as follows:

e = 1/(sup(A)sup(c)) + 1/(sup(A)(N-sup(c))) +
    1/((N-sup(A))sup(c)) + 1/((N-sup(A))(N-sup(c)))

For each group of rules the Weighted Chi-Squared value is then defined as:

WCS = the sum, over all rules in the group, of (Chi-Squared * Chi-Squared) / MCS
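
The following Java sketch pulls the above definitions together, computing the observed Chi-Squared value, the MCS and the WCS for one group of rules. All support values are absolute counts; representing each rule by a {sup(A), sup(c), sup(A union c)} triple is an assumption made purely for illustration:

import java.util.List;

class WcsCalculator {

    // Observed Chi-Squared value for a rule A -> c, computed from the
    // 2x2 contingency table defined by sup(A), sup(c) and sup(A union c).
    static double chiSquared(double supA, double supC, double supAC, double n) {
        double[] obs = { supAC, supA - supAC, supC - supAC,
                         n - supA - supC + supAC };
        double[] exp = { supA * supC / n, supA * (n - supC) / n,
                         (n - supA) * supC / n, (n - supA) * (n - supC) / n };
        double chi2 = 0.0;
        for (int i = 0; i < 4; i++)
            chi2 += (obs[i] - exp[i]) * (obs[i] - exp[i]) / exp[i];
        return chi2;
    }

    // Maximum Chi-Squared (MCS) value for a rule A -> c, as defined above.
    static double maxChiSquared(double supA, double supC, double n) {
        double e = 1.0 / (supA * supC) + 1.0 / (supA * (n - supC))
                 + 1.0 / ((n - supA) * supC) + 1.0 / ((n - supA) * (n - supC));
        double d = Math.min(supA, supC) - (supA * supC) / n;
        return d * d * n * e;
    }

    // WCS for a group of rules sharing the same consequent. Each element
    // of the group holds {sup(A), sup(c), sup(A union c)}.
    static double weightedChiSquared(List<double[]> group, double n) {
        double wcs = 0.0;
        for (double[] r : group) {
            double chi2 = chiSquared(r[0], r[1], r[2], n);
            wcs += (chi2 * chi2) / maxChiSquared(r[0], r[1], n);
        }
        return wcs;
    }
}

The group with the largest WCS value then provides the classification for the record under consideration.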


1.2. Overview of LUCS-KDD implementation of CMAR

The LUCS-KDD implementation of CMAR operates in exactly the same manner as described above except that the CARs are generated using the Apriori-TFP algorithm (Coenen et al. 2001, Coenen et al. 2004 and Coenen and Leng 2004) and placed directly into the rule list R, whereas the original CMAR algorithm used a variation of FP-growth (Han et al. 2000).




2. DOWNLOADING THE SOFTWARE


The LUCS KDD CMAR software comprises ten source files, provided from this WWW page together with four application classes. The source files are as follows:

  1. AssocRuleMining.java: Set of general ARM utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g. operations such as subset, member, union etc.) and (iv) data and parameter output.
  2. PartialSupportTree.java: Methods to implement the "Apriori-TFP" algorithm using both the "Partial support" and "Total support" tree data structure (P-tree and T-tree).
  3. PtreeNode.java: Methods concerned with the structure of Ptree nodes.
  4. PtreeNodeTop.java: Methods concerned with the structure of the top level of the P-tree which comprises, to allow direct indexing for reasons of efficiency, an array of "top P-tree nodes".
  5. RuleNode.java: Class for storing binary tree of CARs as appropriate. (Used to be defined as an inner class in AssocRuleMining class.)
  6. TotalSupportTree.java: Methods to implement the "Apriori-T" algorithm using the "Total support" tree data structure (T-tree).
  7. TtreeNode.java: Methods concerned with the structure of Ttree nodes.
  8. AprioriTFPclass.java: Parent class for classification rule generator.
  9. AprioriTFP_CMAR.java: Methods to produce classification rules using Wenmin Li, Jiawei Han and Jian Pei's CMAR (Classification based on Multiple Association Rules) algorithm but founded on Apriori-TFP.
  10. AprioriTFP_CARgen.java: Methods to produce classification rules using an Apriori-TFP approach.

The PtreeNode, PtreeNodeTop and TtreeNode classes are separate from the remaining classes, which are arranged in a class hierarchy of the form illustrated below.

                AssocRuleMining
                       |
        +--------------+--------------+
        |                             |
TotalSupportTree                   RuleList
        |
PartialSupportTree
        |
 AprioriTFPclass
        |
AprioriTFP_CMAR

The Apriori-TFPC CMAR application classes included here are as follows:

  1. ClassCMAR_2file_App.java: Produces a classifier from a given data set, using the CMAR algorithm, given particular support and confidence thresholds as input. Takes two input files: (i) a training set, and (ii) a test set.
  2. ClassCMAR_App.java: Fundamental LUCS-KDD CMAR application using a 50:50 training/test set split.
  3. ClassCMAR_App10.java: LUCS-KDD CMAR application using TCV (Ten Cross Validation).
  4. ClassOnlyCMAR_App.java: LUCS-KDD CMAR application that creates a classifier using the entire data set (no measure of accuracy).

There is also a "tar ball" cmar.tgz that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf cmar.tgz.


2.1. Compiling

The LUCS-KDD CMAR software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, and should therefore be highly portable. The code does not require any special packages and can thus be compiled using the standard Java compiler:

javac *.java

2.2. Documentation

The code can be documented using javadoc. First create a directory Documentation in which to place the resulting HTML pages, then type:

javadoc -d Documentation/ *.java

This will produce a hierarchy of WWW pages contained in the Documentation directory.




3. RUNNING THE SOFTWARE


Once compiled, the software can be invoked in the normal manner using the Java interpreter:

java APPLICATION_CLASS_FILE_NAME -F FILE_NAME -N NUMBER_OF_CLASSES

For example:

java ClassCMAR_App -FpimaIndians.D42.N768.C2.num -N2

The -F flag is used to input the name of the data file to be used, and the -N flag to input the number of classes represented within the input data.

If you are planning to process a very large data set, it is a good idea to allocate some extra memory to the Java virtual machine. Thus:

java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME -F FILE_NAME
				-N NUMBER_OF_CLASSES

For example:

java -Xms600m -Xmx600m ClassCMAR_App10 -FpimaIndians.D42.N768.C2.num -N2

In the case of ClassCMAR_App10 the output can be extensive, so it is a good idea to direct the output to a file. For example:

java ClassCMAR_App10 -FpimaIndians.D42.N768.C2.num -N2 > myFile

The input to the software, in all cases, is a (space separated) binary valued data set T. The set T comprises a set of N records such that each record t, in turn, comprises a set of attributes. Thus:

T = {t | t is a subset of A}

Where A is the set of available attributes. The value D is then defined as:

D = |A|

We then say that a particular data set has D columns and N rows. A small example data set might be as follows:

1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7

where, in this case, A = {1, 2, 3, 4, 5, 6, 7} of which 6 and 7 represent the possible classes (the class attributes). Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning). This includes the class attributes, which should follow on directly from the last attribute number, as in the above example. It is not a good idea to assign some very high value to the class attributes to guarantee that they come numerically after the last attribute number, as this introduces inefficiencies into the code. For example, if in the above case the class numbers were 1001 and 1002 (instead of 6 and 7 respectively), the algorithm would assume that there were 1002 attributes in total, in which case there would be 2^1002 - 1 candidate frequent item sets (instead of only 2^7 - 1 candidate sets).
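
For illustration, the following minimal Java sketch reads this format into a list of records. In the actual LUCS-KDD software the input handling (including error checking) lives in AssocRuleMining.java, so this is an assumption-laden simplification:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Reads a space separated binary valued data set: one record per line,
// each an ascending list of attribute numbers with the class attribute last.
class DataReader {

    static List<int[]> read(String fileName) throws IOException {
        List<int[]> records = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(fileName))) {
            line = line.trim();
            if (line.isEmpty()) continue;
            String[] tokens = line.split("\\s+");
            int[] record = new int[tokens.length];
            for (int i = 0; i < tokens.length; i++)
                record[i] = Integer.parseInt(tokens[i]);
            records.add(record);
        }
        return records;
    }
}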

The program assumes support and confidence default threshold values of 20% and 80% respectively. However the user may enter their own thresholds using the -S and -C flags. Thresholds are expressed as percentages.

By default the program reorders the input according to frequency (this makes the software much more efficient); this behaviour can be disabled by commenting out the lines:

newClassification.idInputDataOrdering();  // ClassificationAprioriT
newClassification.recastInputData();      // AssocRuleMining

Some example invocations, using a discretised/normalised version of the Pima Indians data set (also available from this site; the raw data can be obtained from the UCI Machine Learning Repository (Blake and Merz 1998)) and the application classes provided by this WWW site, are given below:

java ClassCMAR_App -FpimaIndians.D42.N768.C2.num -N2 -S1 -C50
java ClassCMAR_App10 -FpimaIndians.D42.N768.C2.num -N2
java ClassOnlyCMAR_App -FpimaIndians.D42.N768.C2.num -N2 -S5

(note that the ordering of flags is not significant). The output from each application is a set of CARs (ordered according to the prioritisation given in Sub-section 1.1) plus some diagnostic information (run time, number of rules generated, etc.).




4. VERSION 1


This is Version 2 of the software. Differences between this and Version 1 are: (i) a more sophisticated way of conducting TCV, (ii) the capability of inputting separate training and test data files, and (iii) better statistical analysis of end results (AUC values, etc.). Version 1 is still available.




5. CONCLUSIONS


The LUCS-KDD implementation of the CMAR algorithm described here has been used successfully by the LUCS-KDD research group to contrast and compare a variety of classification algorithms and techniques. The software is available free of charge; however, the author would appreciate appropriate acknowledgement. The following reference format for referring to the LUCS-KDD implementation of CMAR, available from this WWW page, is suggested:

  1. Coenen, F. (2004). LUCS KDD implementation of CMAR (Classification based on Multiple Association Rules). http://www.csc.liv.ac.uk/~frans/KDD/Software/CMAR/cmar.html, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.




REFERENCES


  1. Blake, C.L. and Merz, C.J. (1998). UCI Repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science.
  2. Coenen, F., Goulbourne, G. and Leng, P. (2001). Computing Association Rules Using Partial Totals. In de Raedt, L. and Siebes, A. (Eds), Principles of Data Mining and Knowledge Discovery, Proc PKDD 2001, Springer Verlag LNAI 2168, pp54-66.
  3. Coenen, F. and Leng, P. (2004). Data Structures for Association Rule Mining: T-trees and P-trees. To appear in IEEE Transactions on Knowledge and Data Engineering.
  4. Coenen, F., Leng, P. and Goulbourne, G. (2004). Tree Structures for Mining Association Rules. Journal of Data Mining and Knowledge Discovery, Vol 15 (7), pp391-398.
  5. Han, J., Pei, J. and Yin, Y. (2000). Mining Frequent Patterns Without Candidate Generation. Proceedings ACM-SIGMOD International Conference on Management of Data, ACM Press, pp1-12.
  6. Liu, B., Hsu, W. and Ma, Y. (1998). Integrating Classification and Association Rule Mining. Proceedings KDD-98, New York, 27-31 August, AAAI, pp80-86.
  7. Li W., Han, J. and Pei, J. (2001). CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. Proc ICDM 2001, pp369-376.



Created and maintained by Frans Coenen. Last updated 5 May 2004