THE LUCS-KDD TFPC CLASSIFICATION ASSOCIATION RULE MINING ALGORITHM (VERSION 2)



Liverpool University

Frans Coenen

Department of Computer Science

The University of Liverpool

Thursday 20 January 2011

Updates:


CONTENTS

1. Introduction.
2. Downloading the software.
2.1. Compiling.
2.2. Documentation.
3. Running the software.
3.1. Attribute reordering.
3.2. Output options.
4. The TFPC algorithm in more detail.
4.1. Classification using a 50:50 split.
4.2. Classification using Ten Cross Validation (TCV).
4.3. Classification using 50:50 split and hill climbing.
4.4. Classification using Ten Cross Validation (TCV) and hill climbing.
4.5. Classification using hill climbing with 50:50 split to identify optimum support and TCV for accuracy.
4.6. Classifier generation using entire dataset.
5. Concluding remarks.



1. INTRODUCTION


TFPC, Total From Partial Classification (Coenen and Leng 2005, Coenen et al. 2005), is a Classification Association Rule Mining (CARM) algorithm founded on the TFP (Total From Partial) Association Rule Mining (ARM) algorithm, which, in turn, is an extension of the Apriori-T (Apriori Total) ARM algorithm. All three algorithms --- Apriori-T, TFP and TFPC --- were developed by the LUCS-KDD research team. To fully understand the TFPC algorithm (as described here) it is recommended that interested readers first study the Apriori-T and TFP algorithms, both of which are also available from these WWW pages.

Briefly, Apriori-T is an "apriori" style algorithm (of the kind first proposed by Agrawal and Srikant 1994) designed to process a binary valued input data set so as to identify frequent (large) itemsets and store the resulting frequent itemset information in a "reverse" set enumeration tree called a T-tree (Total support tree). This T-tree can then be processed to identify ARs.

TFP (Coenen et al. 2004a, 2004b) proceeds in a similar manner to Apriori-T except that, instead of operating on the raw input data directly, the input data is first preprocessed and placed in a P-tree (Partial support tree). As such TFP has all the functionality of Apriori-T with the additional advantage that it operates in a much more efficient manner, an advantage which is particularly significant when operating with data that features many duplicate records and/or records with duplicate leading substrings (in this respect attribute ordering is advantageous).

TFPC is an extension of TFP designed to produce Classification Association Rules (CARs), whereas Apriori-T and TFP are designed to generate ARs. In its simplest form TFPC determines a classifier according to given support and confidence thresholds. The nature of the selected thresholds is therefore the most significant factor influencing classification accuracy. A more sophisticated version of TFPC uses a hill climbing technique to find the best accuracy given starting support and confidence thresholds.

TFPC differs from many other CARM algorithms in that it does not use the "standard" two stage approach: (1) generate all CARs, (2) prune the CARs to produce a classifier. Instead TFPC comprises only a single stage in which CRs are generated as part of the frequent set identification process. On completion of the generation process the only pruning undertaken is the removal of "unreachable rules" and the identification of a default rule.

This is an updated version of the software (Version 1 is still available). The principal distinction between versions 1 and 2 is that the GUI in version 2: (i) allows separate training and test set files to be loaded (instead of using Ten Cross Validation (TCV) or some other training/test set split) and (ii) produces Area Under the ROC Curve (AUC) values as well as accuracy values.




2. DOWNLOADING THE SOFTWARE


The TFPC software comprises fourteen source files. These are provided from this WWW page together with seven application classes (see below). The source files are as follows:

  1. AssocRuleMining.java: Set of general ARM utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g. operations such as subset, member, union etc.) and (iv) data and parameter output.
  2. PartialSupportTree.java: Methods to implement the "TFP" algorithm using both the "Partial support" and "Total support" tree data structure (P-tree and T-tree).
  3. PtreeNode.java: Methods concerned with the structure of Ptree nodes.
  4. PtreeNodeTop.java: Methods concerned with the structure of the top level of the P-tree which comprises, so as to allow direct indexing for reasons of efficiency, an array of "top P-tree nodes".
  5. RuleList.java: Set of methods that allow the creation and manipulation (e.g. ordering, etc.) of a list of ARs.
  6. TotalSupportTree.java: Methods to produce and manipulate a "Total support" tree data structure (a T-tree).
  7. TtreeNode.java: Methods concerned with the structure of Ttree nodes.
  8. AprioriTFPclass.java: Parent class for classification rule generator.
  9. AprioriTFP_CRgen.java: The fundamental TFPC algorithm.
  10. AprioriTFP_CRgen_HC8.java: TFPC algorithm with hill climbing.
  11. AprioriTFPCcontrol: Control class for the Apriori TFPC GUI.
  12. AprioriTFPCmodel: Apriori TFPC GUI model. Application combinations are made up of an algorithm (TFPC, TFPC with hill climbing, TFPC with hill climbing on 1/10th), an evaluation strategy (50:50 and TCV) and a satisfaction strategy (CSA ordering only).
  13. BatchModeParams: Window to input batch mode parameters when using the GUI application.
  14. Thresholds: Window to input parameters when using the GUI application.

The PtreeNode, PtreeNodeTop and TtreeNode classes are separate to the remaining classes which are arranged in a class hierarchy of the form illustrated in Figure 1.

                              AssocRuleMining
                                      |
                       +--------------+---------------+
                       |                              |
                TotalSupportTree                    RuleList
                       |
               PartialSupportTree
                       |
                AprioriTFPclass
                       |
        +--------------+---------------+
        |                              |
 AprioriTFP_CRgen            AprioriTFP_CRgen_HC8

Figure 1: Class Hierarchy for TFPC algorithm components


The TFPC application classes included here are as follows:

  1. AprioriTFP_CR_App.java: Fundamental TFPC application using a 50:50 training/test set split. Using this application accuracy of classification is entirely dependent on the user supplied support and confidence thresholds.
  2. AprioriTFP_CR_App10.java: TFPC application using TCV (Ten Cross Validation). Again accuracy of classification is entirely dependent on the user supplied support and confidence thresholds.
  3. AprioriTFP_CR_AppHC8.java: TFPC application using an "8 POINT" hill-climbing technique to obtain the best "fit" using support and confidence values as parameters. During the hill climbing process TFPC is repeatedly applied to maximise classification accuracy by adjusting support and confidence values as appropriate. The first half of the data set is used as the training set and the last half as the test set.
  4. AprioriTFP_CR_AppHC8_10.java: TFPC application using "8 POINT" hill climbing and Ten Cross Validation (TCV). As before hill climbing is used by repeatedly applying TFPC to find the most accurate support and confidence values, using different 9/10ths of the data set as the training set and the remaining 1/10th as the test set.
  5. AprioriTFP_CR_AppHC_10.java: A classifier that uses TCV but performs hill climbing using a 50:50 split to identify optimum support and confidence thresholds, and then uses the above AprioriTFP_CR_App10 approach to determine accuracy.
  6. AprioriTFPclassOnlyApp.java: TFPC algorithm that uses the entire input data set as the training set (no estimate of accuracy is produced through the use of a test set).
  7. AprioriTFPC_GUI_App.java: "All singing, all dancing" TFPC GUI that incorporates the above application functionality and many output options (including, new for version 2, output of AUC values and an option to load separate test and training sets).

There is also a "tar ball", aprioriTFPC.tgz, that can be downloaded and that includes all of the above source and application class files. It can be unpacked using tar -zxf aprioriTFPC.tgz.


2.1. Compiling

The TFPC software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:

javac *.java

2.2. Documentation

The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:

javadoc -d Documentation/ *.java

This will produce a hierarchy of WWW pages contained in the Documentation directory.




3. RUNNING THE SOFTWARE


The software, as noted above, includes seven distinct application classes; these are discussed in some further detail below. Whatever the case, when compiled the software can be invoked in the normal manner using the Java interpreter:

java APPLICATION_CLASS_FILE_NAME -F FILE_NAME -N NUMBER_OF_CLASSES

The -F flag is used to input the name of the input file and the -N flag the number of classes represented in the input data.

If you are planning to process a very large data set it is a good idea to grab some extra memory:

java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME -F FILE_NAME -N NUMBER_OF_CLASSES

The input to the software, in all cases, is a (space separated) binary valued data set R. The set R comprises a set of N rows or records such that each record (r), in turn, comprises a set of attributes. Thus:

R = {r | r ⊆ A}

where A is the set of available attributes. The value D is then defined as:

D = |A|

We then say that a particular data set has D columns and N rows. A small example data set might be as follows:

1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7

where, in this case, A = {1, 2, 3, 4, 5, 6, 7} of which 6 and 7 represent the possible classes. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning). The input file name is passed to the software using the -F flag.

The program assumes support and confidence default threshold values of 20% and 80% respectively. However the user may enter their own thresholds using the -S and -C flags. Thresholds are expressed as percentages.

Some example invocations, using a discretized/normalised (using the LUCS-KDD DN software) Pima Indians data set (also available from this site; the raw data can be obtained from the UCI Machine Learning Repository (Blake and Merz 1998)) and each of the six command line application classes provided by this WWW site, are given below:

java AprioriTFP_CR_App -FpimaIndians.D42.N768.C2.num -N2
java AprioriTFP_CR_App10 -FpimaIndians.D42.N768.C2.num -N2
java AprioriTFP_CR_AppHC8 -FpimaIndians.D42.N768.C2.num -N2 -S1 -C70
java AprioriTFP_CR_AppHC8_10 -S1 -FpimaIndians.D42.N768.C2.num -C70 -N2
java AprioriTFP_CR_AppHC_10 -FpimaIndians.D42.N768.C2.num -N2 -S1.5 -C70
java AprioriTFPclassOnlyApp -S5 -C70 -FpimaIndians.D42.N768.C2.num -N2

(note that the ordering of flags is not significant). The output from each application is a set of CRs (ordered according to confidence), i.e. a classifier, plus some diagnostic information (run time, number of rules generated etc.).


3.1. Attribute Reordering

To enhance the efficiency of the TFPC algorithm the attributes are reordered according to frequency. This means that the original set of input attributes is transformed (for example the original attribute 1 may no longer be attribute 1, while some other attribute N has been redesignated as attribute 1). To avoid reordering, users can comment out the lines:

newClassification.idInputDataOrdering(); 
newClassification.recastInputData();  

in the appropriate application source code files.


3.2. Output Options

The AprioriTFP_CR_App, AprioriTFP_CR_AppHC8 and AprioriTFPclassOnlyApp applications (i.e. not the GUI or the applications that provide TCV --- Ten Cross Validation) include a number of output options (listed below) that may be included in the application file (in most cases these are already included and are either active or "commented out"). Note that in many cases it would not make sense to include them in the TCV applications because a different classifier will be generated for each tenth!

Method Summary
public static void outputDataArray()
           Outputs the input data.
public static void outputDataArraySize()
           Outputs size (number of records and number of elements) of stored input data set read from input data file.
public static void outputConversionArrays()
           Outputs conversion array (used to renumber columns for input data in terms of frequency of single attributes --- reordering will enhance performance for some ARM algorithms).
public static void outputSettings()
           Outputs command line values provided by user.
public static void outputRulesWithDefault()
           Outputs contents of rule linked list (if any), with reconversion, such that last rule is the default rule.
public static void outputRules()
           Outputs contents of rule linked list (if any), without identifying a default rule.
public static void outputNumRules()
           Outputs number of generated rules.
public static void outputFrequentSets()
           Commences the process of outputting the frequent sets contained in the T-tree.
public static void outputNumFreqSets()
           Commences the process of counting and outputting the number of supported nodes in the T-tree. A supported set is assumed to be a non-null node in the T-tree.
public static void outputNumUpdates()
           Outputs the number of support count increments (essentially a measure of the amount of work done).
public static void outputStorage()
           Outputs the storage used by the complete T-tree (measure of storage efficiency).
public static void outputTtree()
           Outputs the T-tree (in an encoded form, the GUI application includes an option to output the T-tree in graphical form).
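
By way of illustration, calls of the following form might appear (or be uncommented) towards the end of one of the above application files. The instance name newClassification follows the examples used elsewhere in this document, and the exact placement is an assumption rather than a prescription; the rule output is invoked via getCurrentRuleListObject() as described in Section 4.1.

// Illustrative only: instance name and placement are assumptions.
newClassification.outputSettings();      // echo the command line settings
newClassification.outputNumFreqSets();   // number of frequent sets in the T-tree
newClassification.getCurrentRuleListObject().outputRules();  // the generated CRs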

All the applications also include the option to output the elapsed time, in seconds, using the outputDuration(double time1, double time2) method. This is typically used to measure processing time as follows:

double time1 = (double) System.currentTimeMillis();

// Program statements

< INSTANCE_NAME > .outputDuration(time1,
				(double) System.currentTimeMillis());



4. THE TFPC ALGORITHM IN MORE DETAIL



The TFPC algorithm is founded on a tree data structure, the T-tree (Total support tree), designed to hold information regarding the frequent sets contained in a binary valued input data set. A frequent set is any subset of the available set of attributes that appears in a given percentage (the support threshold) of the available records. Each level in the T-tree is represented as an array of references to T-tree nodes. T-tree nodes have two fields: (1) a support value, and (2) a reference to the following level in the T-tree, which will be set to null if there is no following sub-branch. A representation of the T-tree, for the data set presented in Section 3 and assuming a support threshold of 25%, is given in Figure 2.

Figure 2: Example T-tree

Note, from Figure 2, that itemsets are stored in "reverse"; for example the set {w,x,y,z} is stored commencing with attribute z. This offers two principal advantages:

  1. It facilitates fast access times using direct array indexing using attribute "column numbers".
  2. All sets ending in a particular attribute are contained in a single branch. This means that all frequent sets containing a particular class are contained in a single branch of the T-tree.

Note also, from Figure 2, that the tree increases in size as it progresses from left to right. To reduce the number of nodes it helps if the attributes (not the classes) are ordered according to the frequency with which they occur in the data set.
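
As an aside, the T-tree node structure described above might be sketched as follows. This is a minimal illustration only; the field names are assumptions and the definitive definitions are those in the distributed TtreeNode.java.

/** Minimal sketch of a T-tree node as described above (a support count plus
    a reference to the next T-tree level). Field names are illustrative and
    may differ from those used in TtreeNode.java. */
class TtreeNodeSketch {
    int support = 0;                      // support count for the itemset
    TtreeNodeSketch[] childRef = null;    // next level array (null if no sub-branch)
}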

A T-tree can be constructed in an apriori manner directly from an input file. However, in common with other apriori style approaches, this has the disadvantage that it requires a number of file accesses. To avoid this the TFPC algorithm first loads the input data into another tree structure, the P-tree (Partial support tree), which holds all the required information from the input data while at the same time performing a partial support count. The P-tree is essentially a set enumeration tree or prefix tree. P-tree nodes have four fields:

  1. Label
  2. Support
  3. Sibling reference (siblingRef)
  4. Child reference (childRef)

and are organised so that siblings are stored in lexicographic order and parent nodes represent leading subsets of their child nodes. The first node in the P-tree is pointed at by a variable startRef. When generating the P-tree, records are added by comparing the current record (r) with existing nodes (N) in the P-tree commencing with the start node (startRef). When comparing r with an existing node N there are six possibilities (the < and > operators should be interpreted as lexicographically before/after):

  1. If startRef=null (i.e. the P-tree so far is empty) create a node M representing r and point startRef at it.
  2. If (r = N.label) then N.support=N.support+1.
  3. If (r < N.label and r subset N.label):
    1. Create node M representing r with M.support<--N.support+1, M.childRef<--N, M.siblingRef<--N.siblingRef.
    2. N.siblingRef <-- null.
    3. Connect M to the previous reference to N (startRef, siblingRef or childRef as appropriate).
  4. If (r < N.label and r notSubset N.label):
    1. Create node M representing r with M.support <-- 1, M.childRef <-- null, M.siblingRef <-- N.
    2. If r and N.label share a leading substring:
      1. Create a "dummy" node D representing the leading substring with D.support <-- N.support+1, D.childRef <-- M, and D.siblingRef <-- N.siblingRef.
      2. N.siblingRef <-- null.
      3. Connect D so it is pointed at by the previous reference to N.
    3. Otherwise connect M to the previous reference to N.
  5. If (r > N.label and r superset N.label):
    1. N.support <-- N.support + 1.
    2. If N.childRef=null create node M representing r, with M.support <-- 1, M.childRef <-- null and M.siblingRef <-- null and connect this to N.childRef.
    3. Otherwise N <-- N.childRef and repeat.
  6. If (r > N.label and r notSuperset N.label):
    1. If N.siblingRef=null:
      1. Create node M representing r with M.support <-- 1, M.childRef <-- null, M.siblingRef <-- null.
      2. N.siblingRef = M.
      3. If r and N.label share a leading substring, create a "dummy" node D representing the leading substring with D.support <-- N.support+1, D.childRef <-- N, and D.siblingRef <-- null, and connect D so it is pointed at by the previous reference to N.
    2. Otherwise
      1. If r and N.label share a leading substring create two nodes: M representing r with M.support <-- 1, M.childRef <-- null, M.siblingRef <-- null; and a "dummy" node D representing the leading substring with D.support <-- N.support+1, D.childRef <-- N, and D.siblingRef <-- N.siblingRef. Then connect D so it is pointed at by the previous reference to N, and cause N.siblingRef to point at M.
      2. Otherwise N = N.siblingRef and repeat.

Table 1: P-tree generation rules
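
For illustration, the P-tree node structure assumed by the above rules might be sketched as follows. The field names follow the list given earlier; the definitive definitions are those in the distributed PtreeNode.java and PtreeNodeTop.java and may differ in detail.

/** Minimal sketch of a P-tree node with the four fields listed above.
    Illustrative only; the distributed PtreeNode.java may differ in detail. */
class PtreeNodeSketch {
    short[] label = null;              // itemset (attribute column numbers) represented by this node
    int support = 1;                   // partial support count
    PtreeNodeSketch siblingRef = null; // next sibling, in lexicographic order
    PtreeNodeSketch childRef = null;   // first child (supersets with this label as leading substring)
}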


To build a T-tree using a P-tree, and during the generation process produce CRs, the following "apriori" style procedure is applied:

Preprocess input data so that:
	i)  Unsupported attributes are removed, and
	ii) Recast records so that remaining attributes (not
	    classes) are ordered according to frequency.

Create a P-tree	

Create top level array of T-tree for columns 1 to D (representing 1-itemsets)
and obtain supports by passing through P-tree.

Generate level 2 arrays (the 2-item candidate sets)

R <-- null
K <-- 2

Loop
	pass through P-tree to obtain support count for level K
	for all unsupported elements in level K arrays set value to null
	for all remaining elements in level K in branches representing a
		class, generate a CR with associated confidence value:
	    if (confidence > confidence threshold)
	    	R <-- R union CR
	    	set corresponding level K array element to null
	    End if
	K <-- K+1
	Generate level K arrays (the K-item candidate sets) from the
		remaining level K-1 array elements using the downward closure
		property of itemsets
	if no level K arrays generated break
End loop

Table 2: T-tree generation algorithm
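
The key test in the above loop, by which a candidate CR is either added to R or discarded, can be illustrated as follows. This is a sketch only; the method name and signature are assumptions, the actual calculation being performed within AprioriTFP_CRgen.java.

/** Sketch of the confidence test applied to a candidate CR of the form
    antecedent -> class. Confidence is the support for the complete itemset
    (antecedent plus class) divided by the support for the antecedent alone,
    expressed as a percentage. Illustrative only. */
static boolean aboveConfidenceThreshold(double supportAntecedentAndClass,
                                        double supportAntecedent,
                                        double confidenceThreshold) {
    double confidence = (supportAntecedentAndClass / supportAntecedent) * 100.0;
    return (confidence > confidenceThreshold);
}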


In the following sub-sections the various application classes available from this WWW page are discussed in some further detail.

4.1. Classification using a 50:50 split

The AprioriTFP_CR_App application class uses user supplied support and confidence thresholds to produce a classifier. To enhance the efficiency of the TFPC algorithm it helps if the attributes are ordered according to frequency. This is achieved by the methods:

newClassification.idInputDataOrdering();
newClassification.recastInputData();

The first identifies the ordering and the second recasts the input data (apart from the class) according to the identified ordering.

There are many output options:

  1. outputTtree(): Lists the T-tree nodes.
  2. outputTtreeStats(): Output parameters for the T-tree --- storage, number of nodes and number of updates.
  3. outputNumFreqSets(): The number of identified frequent sets.
  4. outputFrequentSets(): Lists the identified frequent sets contained in the T-tree, and their associated support values.
  5. outputNumUpdates(): The number of updates used to build the P-tree (a measure of the amount of "work done").
  6. outputStorage(): The storage requirements, in Bytes, for the T-tree.
  7. outputNumRules(): The number of discovered CRs.
  8. outputRules(): Lists the CRs and their associated confidence values.
  9. outputRulesWithReconversion(): Lists the CRs but using reconversion (only appropriate where attributes have been reordered).

Note that six of the above output methods are instance methods contained in either the PartialSupportTree class or its parent classes. The remainder are instance methods of the RuleList class and thus must be called using an instance of that class. An instance of the RuleList class is created during processing and may be accessed using the getCurrentRuleListObject() public method found in the TotalSupportTree class. Thus, for example, the outputRules() method would be invoked as follows:

< INSTANCE_NAME > .getCurrentRuleListObject().outputRules();

4.2. Classification using Ten Cross Validation (TCV)

The AprioriTFP_CR_App10 application class is similar to the AprioriTFP_CR_App application class (see above) except that classification accuracy is determined using TCV. The output options are much more limited here; there are two options, with and without output:

newClassification.commenceTCVwithOutput();
newClassification.commenceTCV();

The first includes a summary of the TCV process.


4.3. Classification using 50:50 split and hill climbing

The accuracy of the classifier produced by the previous two applications is dependent on the nature of the user specified support and confidence thresholds. The AprioriTFP_CR_AppHC8 application class uses a hill climbing technique to maximise accuracy by repeatedly adjusting the support and confidence threshold values.

Output options include:

  1. outputNumRules(): The number of discovered CRs.
  2. outputSuppAndConf(): The final support and confidence.
  3. outputAccuracy(): The final classification accuracy.
  4. outputRules(): Lists the CRs and their associated confidence values.
  5. outputRulesWithReconversion(): Lists the CRs but using reconversion (only appropriate where attributes have been reordered).

Note that the first three of the above output methods are instance methods contained in either the PartialSupportTree class or its parent classes. The remainder are instance methods of the RuleList class and thus must be called using an instance of that class, as illustrated above.


4.3.1. Hill Climbing

The hill climbing technique makes use of a 3-D playing area measuring 100x100x100. The axes of the playing area represent: (1) support thresholds, (2) confidence thresholds and (3) accuracies (all defined in terms of percentages). The technique commences with a start support and confidence threshold, describing a currentLocation (cl) in the playing area, from which a start accuracy is calculated using TFPC. The technique then attempts to move round the playing area with the aim of maximising the accuracy value. In its "quest" to do this it continuously generates data for a set of eight test locations --- hence the term 8 point hill climbing (the research team also tried a 4 point approach but with disappointing results). The test locations are defined by applying two values, dSupp (change in support threshold) and dConf (change in confidence threshold), to the support and confidence threshold values associated with cl. Consequently the current and test locations form a 3x3 location grid with cl at the center (Figure 3). The test locations are labeled n, ne, e, se, s, sw, w and nw, with the obvious interpretations.

Figure 3: 8 point location grid within the hill climbing playing area on the plane of the support and confidence axes.

Information regarding the current and test locations in the location grid is contained in a set of structures with the following fields: support, confidence, accuracy, inArea and notCalculated. The last two take boolean values. The inArea field is set to true if the associated location is within the playing area and false otherwise (it is possible, when working close to the edge of the playing area, to generate a test location which is outside the area). The notCalculated field is set to false if an accuracy has previously been calculated and true otherwise --- this is to prevent the application of TFPC to support and confidence locations for which an accuracy value is already known. The hill climbing process operates as shown in Table 3.
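
A minimal sketch of this location structure, and of the way the eight test locations surrounding cl might be generated, is given below. The field names, the compass-point to axis mapping and the boundary test are all assumptions; the actual code is in AprioriTFP_CRgen_HC8.java.

/** Sketch of a hill climbing location record and of the generation of the
    eight test locations (n, ne, e, se, s, sw, w, nw) surrounding the
    current location cl. Illustrative only. */
class LocationSketch {
    double support, confidence, accuracy;
    boolean inArea;          // false if the location falls outside the playing area
    boolean notCalculated;   // true until TFPC has been run for this location
}

static LocationSketch[] generateLocationGrid(LocationSketch cl,
                                             double dSupp, double dConf) {
    // Compass-point to axis mapping is an assumption: "north" = +confidence,
    // "east" = +support.
    double[] suppOffsets = {0.0,  1.0, 1.0,  1.0,  0.0, -1.0, -1.0, -1.0};
    double[] confOffsets = {1.0,  1.0, 0.0, -1.0, -1.0, -1.0,  0.0,  1.0};
    LocationSketch[] grid = new LocationSketch[8];
    for (int i = 0; i < 8; i++) {
        LocationSketch loc = new LocationSketch();
        loc.support    = cl.support    + suppOffsets[i] * dSupp;
        loc.confidence = cl.confidence + confOffsets[i] * dConf;
        loc.inArea = (loc.support > 0.0 && loc.support <= 100.0 &&
                      loc.confidence > 0.0 && loc.confidence <= 100.0);
        loc.notCalculated = true;    // accuracy to be obtained by running TFPC
        grid[i] = loc;
    }
    return grid;
}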

1  supp <-- user supplied start support
2  conf <-- user supplied start confidence
3  dSupp <-- start delta support (usually 6.4)
4  dConf <-- start delta confidence (usually 8.0)
5  locationGrid <-- null
6  Loop
7      if (dSupp=MIN_dSUPP & dConf=MIN_dCONF) break
8      locationGrid <-- generateLocationGrid according to supp and conf values
9      identify bestAccuracy and numBestAccuracyLocations from locationGrid
10     if (bestAccuracy = cl.accuracy)
11         if (dSupp!=MIN_dSUPP) dSupp <-- dSupp/2
12         if (dConf!=MIN_dCONF) dConf <-- dConf/2
13     else if (numBestAccuracyLocations = 1) cl <-- bestAccuracyLocation
14     else if (numBestAccuracyLocations > numAvailableLocations*BTAR)
15         cl <-- location with a best accuracy
16         break
17     else cl <-- selectNewLocation(locationGrid)
18 End loop
19 classifier <-- classifier associated with cl

Table 3: Hill climbing procedure


Note that the procedure makes use of three constants: MIN_dSUPP, the minimum adjustment in the support value, usually set to 0.1; MIN_dCONF, the minimum adjustment in the confidence value, usually set to 1.0; and BTAR (Best To Available Ratio), the ratio of the number of best locations to the number of available locations, usually set to 2/3.
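
Lines 10 to 12 of Table 3, together with these constants, amount to the following "zoom in" step. This is a sketch using the default values quoted above; the clamping to the minima is an assumption about how overshooting them is avoided.

// Sketch of the grid shrinking step (Table 3, lines 10-12): when the current
// location already exhibits the best accuracy the deltas are halved, but not
// reduced below MIN_dSUPP / MIN_dCONF. Illustrative only.
static final double MIN_dSUPP = 0.1;
static final double MIN_dCONF = 1.0;

static double[] shrinkLocationGrid(double dSupp, double dConf) {
    if (dSupp != MIN_dSUPP) dSupp = Math.max(dSupp / 2.0, MIN_dSUPP);
    if (dConf != MIN_dCONF) dConf = Math.max(dConf / 2.0, MIN_dCONF);
    return new double[] {dSupp, dConf};
}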

The above procedure loops continuously until either: (1) no new location grid can be generated because the dSupp and dConf values have been reduced to a minimum (line 7), in which case the current location is taken as producing the best accuracy, or (2) the majority of locations (as defined using the BTAR constant) display an equal best accuracy (lines 14 to 16), in which case one of these locations is selected as producing the best accuracy. In determining whether a majority of locations display a best accuracy the number of "available" locations is taken into consideration. It may be, for example, that if the current location is situated in a corner of the playing area five of the test locations will be outside the playing area (i.e. the number of available locations will be equivalent to 4).

Where the above end conditions do not apply there are two options: either (1) if the best accuracy is featured by the current location (regardless of whether some other locations may also feature this best accuracy), zero in on this location by reducing the "grid size", or (2) move to a location with the best accuracy. In the case of option 2 the decision to move to a location with the best accuracy is straightforward if there is only one such location. Otherwise the hill climbing process proceeds by assigning weightings to potential directions to move in and then selecting a direction according to these weightings.


4.4. Classification using Ten Cross Validation (TCV) and hill climbing

The AprioriTFP_CR_AppHC8_10 application class is similar to the AprioriTFP_CR_AppHC8 class except that it incorporates TCV with hill climbing instead of using a 50:50 training/test set split.


4.5. Classification using hill climbing with 50:50 split to identify optimum support and TCV for accuracy

The AprioriTFP_CR_AppHC8_10 application class (see above) is computationally expensive because of the amount of "hill-climbing" that it includes. The AprioriTFP_CR_AppHC_10 application class is more computationally effective because it only uses hill climbing (with a 50:50 training/test set split) to obtain optimum support and confidence thresholds, and then makes use of a straightforward TCV approach to establish accuracy using the identified support and confidence values. Output is fixed. Note that AprioriTFP_CR_AppHC_10 is not as accurate as AprioriTFP_CR_AppHC8_10 because the latter generates optimum support and confidence thresholds for each 9/10ths.


4.6. Classifier generation using entire dataset

The AprioriTFPclassOnlyApp application class runs the TFPC algorithm to generate a classifier using the entire data set as the training set (i.e. without using part of the data as a test set, so no estimate of accuracy is produced).




5. CONCLUDING REMARKS


The TFPC algorithm described here has been used successfully by the LUCS-KDD research group to contrast and compare a variety of classification algorithms and techniques. The software is available for free; however, the author would appreciate appropriate acknowledgment. The following reference format for referring to the TFPC implementation available here is suggested:

  1. Coenen, F. (2004). The LUCS-KDD TFPC Classification Association Rule Mining Algorithm. http://www.cSc.liv.ac.uk/~frans/KDD/Software/Apriori_TFPC/Version2/aprioriTFPC.html, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.




REFERENCES


  1. Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference, Morgan Kaufmann, pp487-499.
  2. Blake, C.L. and Merz, C.J. (1998). UCI Repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science.
  3. Coenen, F., Leng, P. and Ahmed, S. (2004a). Data Structures for Association Rule Mining: T-trees and P-trees. IEEE Transactions on Knowledge and Data Engineering, Vol 16, No 6, pp774-778.
  4. Coenen, F., Leng, P. and Goulbourne, G. (2004b). Tree Structures for Mining Association Rules. Journal of Data Mining and Knowledge Discovery, Vol 8, No 1, pp25-51.
  5. Coenen, F. and Leng, P. (2005). Obtaining Best Parameter Values for Accurate Classification. Proc. ICDM'2005, IEEE, pp597-600.
  6. Coenen, F., Leng, P. and Zhang, L. (2005). Threshold Tuning for Improved Classification Association Rule Mining. Proceedings of PAKDD 2005, LNAI 3518, Springer, pp216-225.



Created and maintained by Frans Coenen. Last updated 20 January 2011