THE LUCS-KDD TFP (TOTAL FROM PARTIAL) ASSOCIATION RULE MINING ALGORITHM



Liverpool University

Frans Coenen

Department of Computer Science

The University of Liverpool

5 April 2004

Page last revised May 2006, May 2009


CONTENTS

1. Introduction.
2. Downloading the software.
2.1. Compiling.
2.2. Documentation.
 
3. Running the software.
4. More detail on application classes.
5. The P-tree data structure.
6. The TFP algorithm.
7. Conclusions.



1. INTRODUCTION


TFP (Total From Partial) is an Association Rule Mining (ARM) algorithm developed by the LUCS-KDD research team. The algorithm is an extension of the Apriori-T (Apriori Total) ARM algorithm, which is in turn an "apriori" (Agrawal and Srikant 1994) style algorithm. Apriori-T is designed to process a binary valued input data set so as to identify frequent (large) itemsets and store the resulting frequent itemset information in a "reverse" set enumeration tree called a T-tree (Total support Tree). This T-tree can then be processed to identify ARs.

TFP proceeds in a similar manner to Apriori-T except that, instead of operating directly on the raw input data, the input data is first preprocessed and placed in a P-tree (Partial support tree). As such TFP has all the functionality of Apriori-T but operates in a much more efficient manner as a result of the pre-processing advantages offered by use of the P-tree data structure. These advantages are particularly significant when operating with data that features many duplicate records and/or records with duplicate leading substrings.

The initial ideas incorporated in the TFP algorithm were first described in Goulbourne et al. (1999) and reprinted in Goulbourne et al. (2000); further developments were published in Coenen et al. (2001). The TFP algorithm in its current form, together with further details concerning P-trees and T-trees that are not available from this WWW page, can be found in Coenen et al. (2004a, 2004b). Some additional refinements are discussed in Coenen and Leng (2002).

Subsequently the TFP algorithm has been further adapted in a number of additional directions: (1) distributed/parallel ARM (Coenen et al. 2003, Coenen and Leng 2006), (2) mining of very large data sets that cannot be held in primary storage (Ahmed et al. 2003, 2004, 2006), and (3) Classification Association Rule Mining (Coenen and Leng 2005, Coenen et al. 2005).




2. DOWNLOADING THE SOFTWARE


The TFP ARM software comprises seven source files. These are provided from this WWW page together with three application classes. The source files are as follows:

  1. AssocRuleMining.java: Set of general ARM utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g. operations such as subset, member, union etc.) and (iv) data and parameter output.
  2. PartialSupportTree.java: Methods to implement the "TFP" algorithm using both the "Partial support" and "Total support" tree data structures (P-tree and T-tree).
  3. PtreeNode.java: Methods concerned with the structure of Ptree nodes.
  4. PtreeNodeTop.java: Methods concerned with the structure of the top level of the P-tree which comprises, to allow direct indexing for reasons of efficiency, an array of "top P-tree nodes".
  5. RuleList.java: Set of methods that allow the creation and manipulation (e.g. ordering, etc.) of a list of ARs.
  6. TotalSupportTree.java: Methods to implement the "Apriori-T" algorithm using the "Total support" tree data structure (T-tree).
  7. TtreeNode.java: Methods concerned with the structure of Ttree nodes.

The PtreeNode, PtreeNodeTop and TtreeNode classes are separate from the remaining classes, which are arranged in a class hierarchy of the form illustrated below.

                AssocRuleMining
                       |
        +--------------+---------------+
        |                              |
TotalSupportTree                    RuleList
        |
PartialSupportTree

The TFP application classes included here are as follows:

  1. AprioriTFPapp.java: Fundamental TFP application.
  2. AprioriTFPsortedApp.java: TFP application with input data preprocessed so that it is ordered according to frequency of single items --- this serves to reduce the computation time.
  3. AprioriTFPsortedPrunedApp.java: TFP application with data ordered according to frequency of single items and columns representing unsupported 1-itemsets removed --- again this serves to enhance computational efficiency.

The TFP algorithm will also allow use of the Apriori-T application classes. Three are provided here, as listed below; however, for details the interested reader should refer to the Apriori-T documentation.

  1. AprioriTapp.java
  2. AprioriTsortedApp.java
  3. AprioriTsortedPrunedApp.java

There is also a "tar ball" tfpLink.tgz that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf tfpLink.tgz.


2.1. Compiling

The TFP software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:

javac *.java

2.2. Documentation

The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:

javadoc -d Documentation/ *.java

This will produce a hierarchy of WWW pages contained in the Documentation directory.




3. RUNNING THE SOFTWARE


When compiled, the software can be invoked in the normal manner using the Java interpreter:

java APPLICATION_CLASS_FILE_NAME

If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:

java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME

The input to the software, in all cases, is a (space separated) binary valued data set R. The set R comprises a set of N records such that each record (r), in turn, comprises a set of attributes. Thus:

R = {r | r ⊆ A}

Where A is the set of available attributes. The value D is then defined as:

D = |A|

We then say that a particular data set has D columns and N rows. A small example data set might be as follows:

1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7

where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning). The input file is specified using the -F flag.
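
For illustration only, the following minimal sketch (not part of the TFP distribution; the class name and array layout are invented for the example) shows one way a space separated data file of this form might be read into a two-dimensional integer array:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: read a space separated, binary valued data
    set of the form described above into an int[][] (one row per record). */
public class ReadDataSketch {

    public static int[][] readDataSet(String fileName) throws IOException {
        List<int[]> records = new ArrayList<int[]>();
        BufferedReader reader = new BufferedReader(new FileReader(fileName));
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.length() == 0) continue;            // skip blank lines
            String[] tokens = line.split("\\s+");
            int[] record = new int[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                record[i] = Integer.parseInt(tokens[i]); // attributes numbered from 1
            }
            records.add(record);
        }
        reader.close();
        return records.toArray(new int[records.size()][]);
    }

    public static void main(String[] args) throws IOException {
        int[][] data = readDataSet(args[0]);
        System.out.println("N (number of records) = " + data.length);
    }
}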

The program assumes support and confidence default threshold values of 20% and 80% respectively. However the user may enter their own thresholds using the -S and -C flags. Thresholds are expressed as percentages.

Some example invocations, using a discretized/normalised Pima Indians data set (also available from this site) and each of the three application classes provided by this WWW site, are given below:

java AprioriTFPapp -FpimaIndians.D42.N768.C2.num
java AprioriTFPsortedApp -FpimaIndians.D42.N768.C2.num -S2.5
java AprioriTFPsortedPrunedApp -FpimaIndians.D42.N768.C2.num -S1 -C90

(note that the ordering of flags is not significant). The output from each application is a set of ARs (ordered according to confidence) plus some diagnostic information (run time, number of rules generated etc.).




4. MORE DETAIL ON APPLICATION CLASSES


TFP application classes all have the following basic form:

import java.io.IOException;

public class < CLASS_NAME >  {

    /** Main method */

    public static void main(String[] args) throws IOException {

        // Create instance of class PartialSupportTree
        PartialSupportTree newtfpLink = new PartialSupportTree(args);

        // Read data to be mined from file
        newtfpLink.inputDataSet();

        // If desired either: (1) keep the data set as it is (do no
        // preprocessing), or (2) reorder the data set according to the
        // frequency of single attributes:
        newtfpLink.idInputDataOrdering();
        newtfpLink.recastInputData();
        // or (3) reorder AND prune the input data (in which case comment
        // out the two statements above and use these two instead):
        // newtfpLink.idInputDataOrdering();
        // newtfpLink.recastInputDataAndPruneUnsupportedAtts();
        newtfpLink.setNumOneItemSets();

        // Mine data and produce T-tree
        double time1 = (double) System.currentTimeMillis();
        newtfpLink.createTotalSupportTree();
        newtfpLink.outputDuration(time1,(double) System.currentTimeMillis());

        // Generate ARs
        newtfpLink.generateARs();

        // Output as desired using the many output methods that are available
        // (see below for further details)

        // End
        System.exit(0);
        }

    }  // end of class

Some output is always generated, such as: (1) the input parameters and start settings, and (2) "milestones" during processing. Additional output statements can be included in application classes. The available additional output options are as follows:

  1. outputDataArray(): Outputs the input data.
  2. outputDataArraySize(): Outputs the size of the input data (number of records and columns, number of elements and overall density).
  3. outputDuration(double time1, double time2): The elapsed time, in seconds, between time1 and time2. Typically used to measure processing time:
    double time1 = (double) System.currentTimeMillis();
    
    // Program statements
    
    < INSTANCE_NAME > .outputDuration(time1,
    				(double) System.currentTimeMillis());
    
  4. outputPtreeTable(): Output P-tree table (for computational convenience the P-tree is cast into a table prior to being processed further).
  5. outputPtreeTableStats() : Output parameters for P-tree table --- storage, number of nodes.
  6. outputTtree(): Lists the T-tree nodes.
  7. outputTtreeStats(): Output parameters for T-tree --- storage, number of nodes and number of updates.
  8. outputNumFreqSets(): The number of identified frequent sets.
  9. outputNumUpdates(): The number of updates used to build the P-tree (a measure of the amount of "work done").
  10. outputFrequentSets(): Lists the identified frequent sets contained in the T-tree, and their associated support values.
  11. outputStorage(): The storage requirements, in Bytes, for the T-tree.
  12. outputNumRules(): The number of discovered ARs.
  13. outputRules(): Lists the ARs and their associated confidence values.
  14. outputRulesWithReconversion(): Lists the ARs but using reconversion (only appropriate where attributes have been reordered).

Note that the first eleven of the above output methods are instance methods contained in either the PartialSupportTree class or its parent classes. The last three are instance methods of the RuleList class and thus must be called using an instance of that class. An instance of the RuleList class is created during processing and may be accessed using the getCurrentRuleListObject() public method found in the TotalSupportTree class. Thus, for example, the outputRules() method would be invoked as follows:

< INSTANCE_NAME > .getCurrentRuleListObject().outputRules();
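
For example, the following statements might be appended to the main method of an application class (immediately before the System.exit(0) call) to produce some of the additional output described above; the fragment assumes the instance name newtfpLink used in the listing in Section 4:

    // Size of the input data and number of frequent sets found
    newtfpLink.outputDataArraySize();
    newtfpLink.outputNumFreqSets();

    // Number of ARs and the rules themselves; these are RuleList methods
    // and so are accessed via getCurrentRuleListObject()
    newtfpLink.getCurrentRuleListObject().outputNumRules();
    newtfpLink.getCurrentRuleListObject().outputRules();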



5. THE P-TREE DATA STRUCTURE


A P-tree is a set enumeration tree structure in which to store partial support counts for itemsets. The top, single attribute, level comprises an array of references to structures of the form shown below, one for each column.

Identifier          Type
Support             int (4 Bytes)
Child reference     Pointer to child P-tree node (4 Bytes)

Each of these top level structures is then the root of a sub-tree of the overall P-tree. The nodes in each of these sub-trees are represented as structures of the form:

Identifier          Type
Node code           Array of short integers (2 Bytes each)
Support             int (4 Bytes)
Child reference     Pointer to child P-tree node (4 Bytes)
Sibling reference   Pointer to sibling P-tree node (4 Bytes)
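
Expressed in Java, these two node structures might be sketched as follows (the class and field names here are illustrative; the actual PtreeNodeTop and PtreeNode classes in the distribution may differ in detail):

/** Sketch of a top level P-tree node (one per column, indexed directly). */
class PtreeNodeTopSketch {
    int support = 0;                     // partial support count (4 Bytes)
    PtreeNodeSketch childRef = null;     // reference to child P-tree node
}

/** Sketch of an internal P-tree node (the body of each sub-tree). */
class PtreeNodeSketch {
    short[] nodeCode;                    // column numbers stored at this node (2 Bytes each)
    int support = 0;                     // partial support count
    PtreeNodeSketch childRef = null;     // first child
    PtreeNodeSketch siblingRef = null;   // next sibling
}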

Assuming a sample data set as shown below:

Row Number   Columns
1            1 3 4
2            2 4 5
3            2 4 6

where D=6 (number of columns/attributes) and N=3 (number of records). Prior to commencement of the generation process the P-tree will comprise a 6 element array as shown in Figure 1(a). The first row will be stored in the P-tree as shown in Figure 1(b).

Figure 1: P-tree generation: (a) prior to start; (b) addition of the first row ({1, 3, 4}); (c) addition of the second row ({2, 4, 5}); (d) addition of the third row ({2, 4, 6}).

Note that the support for the parent node is incremented by one. The second row is then included as shown in Figure 1(c), and the third as shown in Figure 1(d). Note that because this last row shares a common leading substring with an existing node in the P-tree, a dummy node is created to ensure that the tree does not become "flat".

Inspection of the final P-tree as shown in Figure 1(d) indicates that nothing is lost if we do not store that part of any particular "node code" which is duplicated in the parent code (Figure 2).

 
Figure 2: P-tree in its final form.

The internal representation of the P-tree presented in Figure 2 is then as shown in Figure 3.

Figure 3: Internal representation of the P-tree presented in Figure 2.

On completion of the generation process the P-tree itself is "thrown away" and the contents of the tree are stored in a P-tree table --- an array of arrays, each element of which is a pointer to a structure of the form:

Identifier             Type
Node code              short []
(Entire) parent code   short []
Support                int

The node code is the set of column numbers represented by an individual node (not the complete itemset represented by that node). The parent code is the union of the node codes of that node's "ancestor nodes" (parent node, grandparent node, etc.). The union of the parent code and the node code is therefore the itemset represented by the node. The first index into the array of arrays is the size (cardinality) of this itemset; thus all singletons are stored at index 1, doubles at index 2, and so on.
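
A P-tree table record of the kind just described might be sketched in Java as follows (again, the names are illustrative rather than those used in the actual implementation):

/** Illustrative sketch of a single P-tree table record. */
class PtreeTableRecordSketch {
    short[] nodeCode;      // column numbers held at the node itself
    short[] parentCode;    // union of the node codes of all ancestor nodes
    int support;           // partial support count
}

// The table itself is an array of arrays of such records, indexed by the
// cardinality of the itemset represented (node code union parent code), so
// that, for example, table[1] holds the singleton records and table[2] the
// records whose itemsets contain two attributes:
//
//     PtreeTableRecordSketch[][] table;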

The advantages offered by the P-tree table are:

  1. Reduced storage requirements (particularly where the data set contains duplicate rows).
  2. Faster run times, because the desired total support counts have already been partially calculated.

Internally the P-tree presented in Figure 3 will be stored in tabular form as shown in Figure 4.

Figure 4: P-tree table.




6. THE TFP ALGORITHM


The TFP algorithm is used to determine ARs by first creating a T-tree from the P-tree. This process can best be explained by considering the P-tree table given in Figure 4. The T-tree is generated in an Apriori manner. There are a number of features of the P-tree Table that enhance the efficiency of this process:

  1. The first pass of the P-tree table is used to calculate supports for singletons, and thus the entire table must be traversed. However, on the second pass, when calculating the support for "doubles", we can ignore the top level of the P-tree table, i.e. we can start processing from index 2. Further, at the end of the previous pass we can delete the top level (cardinality = 1) part of the table. Consequently, as the T-tree grows in size the P-tree table shrinks.

  2. To prevent double counting, on the first pass of the P-tree we update only those elements in the top level array of the T-tree that correspond to the column numbers in node codes (not parent codes). On the second pass, for each P-tree table record found, we consider only those branches in the T-tree that emanate from a top level element corresponding to a column number represented by the node code (not the parent code). Once the appropriate branch has been located, we proceed down to level 2 and update those elements that correspond to the column numbers in the union of the parent and node codes. We then repeat this process for all subsequent levels until there are no more levels in the T-tree to consider (a sketch illustrating the resulting counting rule is given below).
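
In effect, the total support for a candidate itemset can be obtained by summing the partial supports of those P-tree table records whose itemset (parent code union node code) contains the candidate and whose node code contains at least one of the candidate's attributes. The following self-contained Java sketch illustrates this counting rule only; it is not the distributed code, the names are invented, and the table contents are those reconstructed from the walkthrough below:

import java.util.Arrays;

/** Illustrative sketch (not the distributed code) of the "total from
    partial" counting rule described in points (1) and (2) above. */
public class TotalFromPartialSketch {

    /** A P-tree table record: parent code, node code and partial support. */
    static class Record {
        final int[] parentCode, nodeCode;
        final int support;
        Record(int[] parentCode, int[] nodeCode, int support) {
            this.parentCode = parentCode;
            this.nodeCode = nodeCode;
            this.support = support;
        }
    }

    /** Total support of a candidate itemset: sum the partial supports of all
        records whose itemset (parent code union node code) contains the
        candidate and whose node code contains at least one of the candidate's
        attributes (contributions via parent codes alone are already counted
        at the ancestor records, hence ignoring them prevents double counting). */
    static int totalSupport(int[] candidate, Record[] table) {
        int total = 0;
        for (Record r : table) {
            int[] itemset = union(r.parentCode, r.nodeCode);
            if (containsAll(itemset, candidate) && intersects(candidate, r.nodeCode)) {
                total += r.support;
            }
        }
        return total;
    }

    static boolean member(int item, int[] set) {
        for (int s : set) if (s == item) return true;
        return false;
    }

    static boolean containsAll(int[] set, int[] subset) {
        for (int item : subset) if (!member(item, set)) return false;
        return true;
    }

    static boolean intersects(int[] a, int[] b) {
        for (int item : a) if (member(item, b)) return true;
        return false;
    }

    static int[] union(int[] a, int[] b) {
        int[] result = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, result, a.length, b.length);
        return result;    // possible duplicates do no harm for membership tests
    }

    public static void main(String[] args) {
        // P-tree table for the worked example data set of Section 5, as
        // reconstructed from the walkthrough that follows.
        Record[] table = {
            new Record(new int[] {},     new int[] {1},    1),
            new Record(new int[] {},     new int[] {2},    2),
            new Record(new int[] {2},    new int[] {4},    2),
            new Record(new int[] {1},    new int[] {3, 4}, 1),
            new Record(new int[] {2, 4}, new int[] {5},    1),
            new Record(new int[] {2, 4}, new int[] {6},    1)
        };
        System.out.println("support({4})    = " + totalSupport(new int[] {4}, table));    // 3
        System.out.println("support({2, 4}) = " + totalSupport(new int[] {2, 4}, table)); // 2
    }
}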

Thus returning to the P-tree table presented in Figure 4 we proceed as follows:



1. T-tree Level 1.

  1. Commence with an empty top-level T-tree Figure 5(a).

  2. Pass through the P-tree table level by level, starting with level 1, updating the top-level of the T-tree according to the nature of each record in the P-tree table. Thus:
    1. The first record encountered represents the node {1}, and thus we update the support for element 1 of the top level of the T-tree (Figure 5(b)).
    2. The second record in the P-tree table to be considered is the node {2} and consequently we update the support element 2 in the top-level of the T-tree (Figure 5(c)) --- note that the support for node {2} is 2.
    3. We now pass to the second level in the P-tree table (index 2) which represents doubles (set cardinality = 2). There is only one record at this level, with parent code {2} and node code {4} (the record in its entirety represents the P-tree node {2,4}). The contribution of this node to element 2 (the parent code) in the T-tree has already been considered; thus, as before, we only consider the node code and increment the support for element 4 in the T-tree (Figure 5(d)).
    4. We now pass down to the third level in the P-tree table. There are three nodes here. We only consider the node codes and thus first increment the supports for elements 3 and 4 in the T-tree (Figure 5(e)), then element 5 (Figure 5(f)), and then element 6 (Figure 5(g)).
    We have now completed one pass of the P-tree and we can remove the singles from the P-tree table (Figure 6), prune the T-tree of those nodes that are not adequately supported (Figure 5(h)), and generate the next T-tree level (Figure 5(i)).
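
The pruning and next-level generation steps at the end of each pass are standard Apriori processing. As a general illustration only (this is not the T-tree code; the names are invented and itemsets are represented here as flat int arrays rather than T-tree nodes), these two steps might be sketched as follows:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** General Apriori-style illustration (not the T-tree code itself) of the
    "prune unsupported sets, then generate the next level" step. Itemsets
    are int arrays with attributes in ascending order. */
public class PruneAndGenerateSketch {

    /** Keep only those candidate itemsets whose support meets the threshold. */
    static List<int[]> prune(List<int[]> candidates, int[] supports, int minSupport) {
        List<int[]> frequent = new ArrayList<int[]>();
        for (int i = 0; i < candidates.size(); i++) {
            if (supports[i] >= minSupport) frequent.add(candidates.get(i));
        }
        return frequent;
    }

    /** Generate (k+1) element candidates by joining frequent k-itemsets that
        share their first k-1 attributes (the usual Apriori join step; the
        additional "all subsets frequent" check is omitted for brevity). */
    static List<int[]> generateNextLevel(List<int[]> frequentK) {
        List<int[]> next = new ArrayList<int[]>();
        for (int i = 0; i < frequentK.size(); i++) {
            for (int j = i + 1; j < frequentK.size(); j++) {
                int[] a = frequentK.get(i), b = frequentK.get(j);
                boolean samePrefix = true;
                for (int x = 0; x < a.length - 1; x++) {
                    if (a[x] != b[x]) { samePrefix = false; break; }
                }
                if (samePrefix && a[a.length - 1] != b[b.length - 1]) {
                    int[] candidate = Arrays.copyOf(a, a.length + 1);
                    candidate[a.length - 1] = Math.min(a[a.length - 1], b[b.length - 1]);
                    candidate[a.length]     = Math.max(a[a.length - 1], b[b.length - 1]);
                    next.add(candidate);
                }
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // With the singleton supports from the worked example and an assumed
        // minimum support of 2 records, only {2} and {4} survive pruning and
        // the single level 2 candidate generated is {2, 4}.
        List<int[]> singles = prune(
                Arrays.asList(new int[] {1}, new int[] {2}, new int[] {3},
                              new int[] {4}, new int[] {5}, new int[] {6}),
                new int[] {1, 2, 1, 3, 1, 1}, 2);
        for (int[] candidate : generateNextLevel(singles)) {
            System.out.println(Arrays.toString(candidate));   // [2, 4]
        }
    }
}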
Figure 5: T-tree generation using the P-tree table (Level 1): (a) prior to start; (b) P-tree table, level 1, node 1; (c) P-tree table, level 1, node 2; (d) P-tree table, level 2, node 1; (e) P-tree table, level 3, node 1; (f) P-tree table, level 3, node 2; (g) P-tree table, level 3, node 3; (h) level 1 pruning; (i) T-tree prior to second pass of the P-tree table.

Figure 6: P-tree table with cardinality 1 nodes removed.



2. T-tree Level 2.

  1. Continuing on to level two of the T-tree we commence with the cardinality 2 sets in the P-tree table:

    1. There is only one cardinality 2 node in the table; this has parent code {2} and node code {4}. Therefore we search the T-tree branch emanating from element 4 only (the node code), and do not consider the branch emanating from element 2 (the parent code). Thus we pass down branch 4 to the second level and attempt to update elements 4 and 2 (the union of the node and parent codes). However, there is no element 4 in this part of the T-tree, so we only update the support associated with element 2 (Figure 7(a)).
    2. We then move to the cardinality 3 nodes in the P-tree table. The first we encounter has a parent code of {1} and a node code of {3,4}; thus we search only the T-tree branches emanating from elements 3 and 4. There is no element 3, so we only consider element 4, pass down this branch to level 2 in the T-tree, and update the elements corresponding to the columns in {1,3,4} (the union of the node and parent P-tree codes) if they are present in this part of the T-tree (Figure 7(b)).
    3. We repeat this process for the other two nodes. The first of these has node code {5} which does not appear in the top level of the T-tree (as pruned so far) and consequently we can ignore this node. The same applies to the last level 3 P-tree Table node.

This completes the second pass of the P-tree table; we set the size 2 reference in the P-tree table to null (Figure 8), prune the T-tree so far (Figure 7(c)), and generate the third level of the T-tree (Figure 7(d)).

Figure 7: T-tree generation using the P-tree table (Level 2): (a) P-tree table, level 2, node 1; (b) P-tree table, level 3, node 1; (c) level 2 pruning; (d) T-tree prior to third pass of the P-tree table.

Figure 8: P-tree table with cardinality 1 and 2 nodes removed.



3. T-tree Level 3.

  1. On subsequent levels (level three and beyond) we repeat the process, in this case starting with the cardinality 3 nodes in the P-tree table:
    1. The first node has node code {3,4}, thus we search only those T-tree branches emanating from elements 3 and 4 in the T-tree --- there is no element 3 so we only consider element 4. When we get down to the right level (level 3) we update the elements corresponding to {1,3,4} if present (Figure 9(a)).
    2. The second node has node code {5}, which is not in the top level of the T-tree and so can be ignored.
    3. The last node has node code {6} and this is also not in the T-tree.

This completes the third pass of the P-tree table; consequently we prune level 3 of the T-tree (Figure 9(b)) and attempt to generate the next level in the T-tree. However, in this case there are no more levels to generate, and thus we have come to the end of the T-tree generation process (Figure 9(c)). The generation process would also end because, after removing level 3 of the P-tree table (Figure 10), there are no more nodes in the table to consider.

Figure 9: T-tree generation using the P-tree table (Level 3): (a) P-tree table, level 3, node 1; (b) level 3 pruning; (c) end of the T-tree generation process.

Figure 10: P-tree table with cardinality 1, 2 and 3 nodes removed.



7. CONCLUSIONS


The TFP ARM algorithm described here has been used successfully by the LUCS-KDD research group to contrast and compare a variety of ARM algorithms and techniques. The software is available free of charge; however, the author would appreciate appropriate acknowledgement. The following format for referring to the TFP implementation available here is suggested:

  1. Coenen, F. (2004), The LUCS-KDD TFP Association Rule Mining Algorithm, http://www.csc.liv.ac.uk/~frans/KDD/Software/Apriori_TFP/aprioriTFP.html, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.




REFERENCES


  1. Agrawal, R. and Srikant, R. (1994). Fast Algorithms for Mining Association Rules. In Proceedings of the 20th VLDB Conference, Morgan Kaufmann, pp487-499.
  2. Ahmed, S., Coenen, F.P. and Leng, P. (2003). Strategies for Partitioning Data in Association Rule Mining. In Coenen, F.P., Preece, A. and Macintosh, A.L. (Eds.), Research and Development in Intelligent Systems XXI, Springer, London, pp127-140.
  3. Ahmed, S., Coenen, F. and Leng, P. (2004). A Tree Partitioning Method for Memory Management in Association Rule Mining. In Kambayashi, Y., Mohania, M. and Woll, W. (eds), Data Warehousing and Knowledge Discovery (Proc DAWAK 2004 conference, Zaragoza), September 2004: LNCS 3181, Springer, pp331-340.
  4. Ahmed, S., Coenen, F.P. and Leng, P. (2006). Tree-based Partitioning of Data for Association Rule Mining. Knowledge and Information Systems Journal, Vol 10, Num. 3 (October, 2006), pp315-331.
  5. Coenen, F.P., Goulbourne, G. and Leng, P. (2001). Computing Association Rules Using Partial Totals. In de Raedt, L. and Siebes, A. (Eds), Principles of Data Mining and Knowledge Discovery, Proc PKDD 2001, Springer-Verlag LNAI 2168, pp 54-66.
  6. Coenen, F.P. and Leng, P. (2002). Finding Association Rules With Some Very Frequent Attributes. In "Principles of Data Mining and Knowledge Discovery", eds Elomaa, T., Mannila, H. and Toivonen, H., Proc PKDD 2002 Conference, Helsinki, August 2002: Lecture Notes in AI 2431, Springer-Verlag: 99-111.
  7. Coenen, F.P., Leng, P. and Ahmed, S. (2003). T-Trees, Vertical Partitioning and Distributed Association Rule Mining. Proceedings ICDM-2003, pp513-516.
  8. Coenen, F., Leng, P. and Ahmed, S. (2004a). Data Structures for Association Rule Mining: T-trees and P-trees. IEEE Transactions on Knowledge and Data Engineering, Vol 16, No 6, pp774-778.
  9. Coenen, F.P., Leng, P. and Goulbourne, G. (2004b). Tree Structures for Mining Association Rules. Journal of Data Mining and Knowledge Discovery, Vol 8, No 1, pp25-51.
  10. Coenen, F. and Leng, P. (2005). Obtaining Best Parameter Values for Accurate Classification. Proc. ICDM'2005, IEEE, pp597-600.
  11. Coenen, F., Leng, P. and Zhang, L. (2005). Threshold Tuning for Improved Classification Association Rule Mining. Proceedings PAKDD 2005, LNAI 3158, Springer, pp216-225.
  12. Coenen, F.P. and Leng, P. (2006). Partitioning Strategies for Association Rule Mining. The Knowledge Engineering Review, Vol. 21, Issue 1 (March 2006), pp 25-47.
  13. Goulbourne, G., Coenen, F.P. and Leng, P. (1999). Algorithms for Computing Association Rules Using A Partial-Support Tree. Proceedings ES99, Springer, London, pp132-147.
  14. Goulbourne, G., Coenen, F.P. and Leng, P. (2000). Algorithms for Computing Association Rules Using a Partial-Support Tree. Journal of Knowledge-Based Systems, Vol (13), pp141-149. (Reprint of ES'99 paper).



Created and maintained by Frans Coenen. Last updated May 2009