THE LUCS-KDD APRIORI-T ASSOCIATION RULE MINING ALGORITHM



Liverpool University

Frans Coenen

Department of Computer Science

The University of Liverpool

13 February 2004


CONTENTS

1. Introduction.
2. Downloading the software.
2.1. Compiling.
2.2. Documentation.
 
3. Running the software.
4. More detail on application classes.
5. The T-tree data structure.
6. The Apriori-T algorithm.
7. Conclusions.



1. INTRODUCTION


Apriori-T (Apriori Total) is an Association Rule Mining (ARM) algorithm, developed by the LUCS-KDD research team which makes use of a "reverse" set enumeration tree where each level of the tree is defined in terms of an array (i.e. the T-tree data structure is a form of Trie). The Apriori-T algorithm was actually developed as part of a more sophisticated ARM algorithm --- Apriori-TFP (Apriori Total From Partial). Both algorithms are described in Coenen and Leng (2004).




2. DOWNLOADING THE SOFTWARE


The Apriori-T ARM software comprises four source files. These are provided from this WWW page together with three application classes. The source files are as follows:

  1. AssocRuleMining.java: Set of general ARM utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g. operations such as subset, member, union etc.) and (iv) data and parameter output.
  2. RuleList.java: Set of methods that allow the creation and manipulation (e.g. ordering, etc.) of a list of ARs.
  3. TotalSupportTree.java: Methods to implement the "Apriori-T" algorithm using the "Total support" tree data structure (T-tree).
  4. TtreeNode.java: Methods concerned with the structure of Ttree nodes. Arrays of these structures are used to store nodes at the same level in any sub-branch of the T-tree. Note this is a separate class to the other T-tree classes which are arranged in a class hierarchy as illustrated below.
             AssocRuleMining
                    |
     +--------------+---------------+
     |                              |
TotalSupportTree                    RuleList

The application classes are as follows:

  1. AprioriTapp.java: Fundamental Apriori-T application.
  2. AprioriTsortedApp.java: Apriori-T application with input data preprocessed so that it is ordered according to frequency of single items --- this serves to reduce the computation time.
  3. AprioriTsortedPrunedApp.java: Apriori-T application with data ordered according to frequency of single items and columns representing unsupported 1-itemsets removed --- again this serves to enhance computational efficiency.

There is also a "tar ball" aprioriT.tgz that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf aprioriT.tgz.


2.1. Compiling

The software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:

javac *.java

2.2. Documentation

The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:

javadoc -d Documentation/ *.java

This will produce a hierarchy of WWW pages contained in the Document directory.




3. RUNNING THE SOFTWARE


When compiled the software can be invoked in the normal manner using the Java interpreter:

java APPLICATION_CLASS_FILE_NAME

If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:

java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME

The input to the software, in all cases is a (space separated) binary valued data set R. The set R comprises a set of N records such that each record (r), in turn, comprises a set of attributes. Thus:

R  = {r | r = subset A}

Where A is the set of available attributes. The value D is then defined as:

D = |A|

We then say that a particular data set has D columns and N rows. A small example data sets might be as follows:

1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7

where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning). The input file is included using a -F flag.

The program assumes support and confidence default threshold values of 20% and 80% respectively. However the user may enter their own thresholds using the -S and -C flags. Thresholds are expressed as percentages.

Some example invocations, using a discretized/ normalised Pima Indians data set (also available from this site) and each of the three application classes provided by this WWW site, are given below:

java AprioriTapp -FpimaIndians.D42.N768.C2.num
java AprioriTsortedApp -FpimaIndians.D42.N768.C2.num -S2.5
java AprioriTsortedPrunedApp -FpimaIndians.D42.N768.C2.num -S1 -C50

(note that the ordering of flags is not significant). The output from each application is a set of ARs (ordered according to confidence) plus some diagnostic information (run time, number of rules generated etc.).




4. MORE DETAIL ON APPLICATION CLASSES


Apriori-T Applications classes all have the following basic form:

public class < CLASS_NAME >  {

    /** Main method */

    public static void main(String[] args) throws IOException {

	// Create instance of class TotalSupportTree
	TotalSupportTree newAprioriT = new TotalSupportTree(args);
	
	// Read data to be mined from file
	newAprioriT.inputDataSet();	
	
	// If desired either: (1) keep data set as it is (do no
	// preprocessing), (2) reorder the data sets according to
	// frequency of single attributes:
	newAprioriT.idInputDataOrdering();
	newAprioriT.recastInputData();
	// or (3) reorder and prune the input data
	newAprioriT.idInputDataOrdering();
	newAprioriT.recastInputDataAndPruneUnsupportedAtts();
	newAprioriT.setNumOneItemSets();
	
	// Mine data and produce T-tree	
	double time1 = (double) System.currentTimeMillis();
	newAprioriT.createTotalSupportTree();
	newAprioriT.outputDuration(time1,(double) System.currentTimeMillis());
	
	// Generate ARS
	newAprioriT.generateARs();
	
	// Output as desired using the many output methods that are available
	// (see below for further details)
	
	// End
	System.exit(0);
	}

Some output is always generated such as: (1) the input parameters and start settings and (2) "mile stones" during processing. Additional output statements can be included in application classes. The available additional output options are as follows:

  1. outputDataArray(): Outputs the input data.
  2. outputDataArraySize(): Outputs the size of the input data (number of records and columns, number of elements and overall density).
  3. outputDuration(double time1, double time2): The elapsed time, in seconds between time1 and time2. Typically used to measure processing time:
    double time1 = (double) System.currentTimeMillis();
    
    // Program statements
    
    < INSTANCE_NAME > .outputDuration(time1,
    				(double) System.currentTimeMillis());
    
  4. outputTtree(): Lists the T-tree nodes.
  5. outputNumFreqSets(): The number of identified frequent sets.
  6. outputNumUpdates(): The number of updates used to build the T-tree (a measure of the amount of "work done").
  7. outputStorage(): The storage requirements, in Bytes, for the T-tree. .
  8. outputFrequentSets(): Lists the identified frequent sets and their associated support values.
  9. outputNumRules(): The number of discovered ARs.
  10. outputRules(): Lists the ARs and their associated confidence values.
  11. outputRulesWithReconversion(): Lists the ARs but using reconversion (only appropriate where attributes have been reordered).

Note that the first nine of the above output methods are instance methods contained in either the TotalSupportTree class or the AssocRuleMining class the latter are therefore inherited by the TotalSupportTree class). The last three of the above are instance methods of the RuleList class and thus must be called using an instance of that class. An instance of the RuleList class is created during processing and may be accessed using the getCurrentRuleListObject() public method found in the TotalSupportTree class. Thus, for example, the outputRules() method would be invoked as follows:

< INSTANCE_NAME > .getCurrentRuleListObject().outputRules();



5. THE T-TREE DATA STRUCTURE


A T-tree is a set enumeration tree structure in which to store frequent item set information. What distinguishes the T-tree from other set enumeration tree structures is:

  1. Levels in each sub-branch of the tree are defined using arrays. This thus permits "indexing in" at all levels and consequently offers computational advantages.
  2. To aid this indexing the tree is built in "reverse". Each branch is founded on the last element of the frequent sets to be stored. This allows direct indexing with attribute number rather than first applying some offset.

Thus given a data set of the form:

{1 3 4}
{2 4 5}
{2 4 6}

and assuming a support count of less than 1, we can identify the following frequent sets (support counts in parenthesise):

1 (1)
2 (2)
3 (1)
4 (3)
5 (1)
6 (1)
 
1 3 (1)
1 4 (1)
2 4 (2)
2 5 (1)
2 6 (1)
3 4 (1)
 
4 5 (1)
4 6 (1)
1 3 4 (1)
2 4 5 (1)
2 4 6 (1)

These can be presented in a T-tree of the form given in Figure 1 (note the reverse nature of the tree). The internal representation of this "reverse" T-tree founded on arrays of T-tree nodes that can be conceptualised as shown in Figure 2. The storage required for each node (representing a frequent set) in the T-tree is then 12 Bytes:

  1. Reference to T-tree node structure (4 Bytes)
  2. Support count field in T-tree node structure (4 Bytes)
  3. Reference to child array field in T-tree node structure (4 Bytes)

Thus house keeping requirements are still 8 Bytes, however storage gains are obtained because it is not necessary to explicitly store individual attribute labels (i.e. column numbers representing instantiated elements) as these are implied by the indexing. Of course this approach must also require storage for "stubs" (4 Bytes) where nodes are missing (unsupported). Overall the storage advantages for this technique is thus, in part, dependent on the number of missing combinations contained in the data set.


REVERSE T-TREE

Figure 1: Conceptual example of the T-tree data structure

VERSION 3 OF T-TREE

Figure 2: Internal representation of T-tree presented in Figure 1




6. THE APRIOR-T ALGORITHM


The T-tree described above is built in an apriori manner, as first proposed by Agrawal and Srikant (1994), starting with the one item sets and continuing until there are no more candidate N-itemsets. Thus, at a high level, a standard apriori algorithm is used.

K <-- 1
nextlevelFlag=true;

generate candidate K-itemsets
Loop
    count support values for candidate K-itemsets
    prune unsupported K-itemsets
    K <-- 2
    generate candidate K2 itemsets from previous level
    if no K2 itemsets break
end Loop

In more detail the Apriori-T algorithm commences with a method createTotalSupportTree which is presented in Table 1. The method commences by generating the top level of the T-tree (createTtreeTopLevel) and then generating the next level (generateLevel2) from the supported sets in level 1 --- remember that if a 1-itemset is not supported none of its super sets will be supported. Once we have generated level 2 further levels can be generated (createTtreeLevelN).

Method: createTotalSupportTree
Arguments: none
Return: none
Fields: NA
------------------------------
createTtreeTopLevel()
generateLevel2()
createTtreeLevelN()
------------------------------

Table 1: The createTotalSupportTree method

The method to generate the top level of a T-tree is as presented in Table 2. Note that the method includes a call to a general T-tree utility method pruneLevelN described later.

Method: createTtreeTopLevel
Arguments: none
Return: none
Fields: D number of attributes
        startTtreeRef start of T-tree
	dataArray 2D array holding input sets
--------------------------------------
Dimension and initialise top level of T-tree (length = D)

Loop from i = 0 to i = number of records in dataArray
    Loop from j = 0 to j = number of attributes in dataArray[i]
    	startTtreeRef[i][j]++
    End loop
End Loop

pruneLevelN(startTtreeRef,1)
--------------------------------------

Table 2: The createTtreeTopLevel method

The generateLevel2 methods loops through the top level of the T-tree creating new T-tree arrays where appropriate (i.e. where the immediate parent nodes is supported). The method is outlined in Table 3. Note that the method includes a call to a general T-tree utility method generateNextLevel (also described later).

Method: generateLevel2
Arguments: none
Return: none
Fields: startTtreeRef start of T-tree
	nextlevelFlag set to true if next level exists
--------------------------------------
nextlevelFlag <-- false

loop from i = 0 to i = number of nodes in T-tree top level
    if startTtreeRef[i] != null
    		generateNextLevel(startTtreeRef,i,{i})
End Loop
--------------------------------------

Table 3: The createTtreeTopLevel method

Once we have a top level T-tree and a set of candidate second levels (arrays) we can proceed with generating the rest of the T-tree using an iterative process --- the createTtreeLevelN method presented in Table 4. The createTtreeLevelN method calls a number of other methods addSupportToTtreeLevelN, pruneLevelN (also called by the createTtreeTopLevel method) and generateLevelN which are presented in Tables 5, 6 and 7 respectively.

Method: createTtreeLevelN
Arguments: none
Return: none
Fields: startTtreeRef start of T-tree
	nextlevelFlag set to true if next level exists
--------------------------------------
K <-- 2

while (nextlevelFlag)
    addSupportToTtreeLevelN(K)
    pruneLevelN(startTtreeRef,K)
    nextlevelFlag <-- false
    generateLevelN(startTtreeRef,K,{})
    K <-- K+1
End loop
--------------------------------------

Table 4: The createTtreeLevelN method

Method: addSupportToTtreeLevelN
Arguments: K the current level
Return: none
Fields: startTtreeRef start of T-tree
	dataArray 2D array holding input sets
--------------------------------------
Loop from i = 0 to i = number of records in dataArray
    length = number of attributes in dataArray[i]
    addSupportToTtreeFindLevel(startTtreeRef,K,length,
    		dataArray[i]
End loop
--------------------------------------

Method: addSupportToTtreeFindLevel
Arguments: linkref reference to current array in T-tree
	K level marker
	length length of array at current branch in t-tree
	record input data record under consideration
Return: none
Fields: None
--------------------------------------
if (K=1)
    Loop from i = 0 to i = length
    	if (linkref[record[i]] != null)
		increment linkref[record[i]].support by 1
	End if
    End Loop
else
    Loop from i = K-1 to i = length
        if (linkref[record[i]] != null &&
			linkref[record[i]].childRef != null)
	    addSupportToTtreeFindLevel(linkref[record[i]].childRe,
	 		K-1,i,record)
	End if
    End loop
end if else   		   	
--------------------------------------

Table 5: The addSupportToTtreeLevelN method and its related addSupportToTtreeFindLevel method

Method: pruneLevelN
Arguments: linkref reference to current array in T-tree
	K level marker
Return: true if entire array pruned
Fields: minSupport the minimum support threshold
--------------------------------------
if (K=1)
    allUnsupported <-- true
    Loop from i = 1 to i = length of array
    	if (linkref[i] != null)
	    if (linkref[i].support < minSupport)
	    		linkref[i] <-- null
            else allUnsupported <-- false
	    End if else
	End if
    return allUnsupported
    End Loop
else
    Loop from i = K to i = length of array
        if (linkref[i] != null)
	    if (pruneLevelN(linkref[i].childRef,K-1)
	    	linkref[i].childRef <-- null
	    End if
	 End if
    End loop
End if else   		   	
--------------------------------------

Table 6: The pruneLevelN

Method: generateLevelN
Arguments: linkref reference to current array in T-tree
	K level marker
	I the item set represented by the parent node
Return: None
Fields: None
--------------------------------------
if (K=1)
    Loop from i = 2 to i = length of array
	if (linkref[i] != null)
	    generateNextLevel(linkref,i,I union i)
	End if
    End loop
else
    Loop from i = K to i = length of array
	if (linkref[i] != null && linkref[i].childRef != null)
	    generateLevelN(linkref[i].childRef,K-1,I union i
--------------------------------------

Method: generateNextLevel
Arguments: linkref reference to current array in T-tree
	i index to parent node in vurrent array
	I the item set represented by the parent node
Return: None
Fields: nextLevelExists flafg set to true or false
--------------------------------------
linkref[i].childRef <-- array of empty t-tree nodes of length i

Loop from j = 1 to j = i
     if (linkRef[j] != null)
         newI <-- I union j
	 if (all newI true subsets are all supported in the
	 		T-tree sofar)
	     linkRef[i].childRef[j] <-- new T-tree Node
	     nextLevelExists <-- true
	 else linkRef[i].childRef[j] <-- null
	 End if else
     End if
End loop
--------------------------------------

Table 7: The generateLevelN method and its related generateNextLevel method




7. CONCLUSSIONS


The Apriori-T ARM algorithm described here have been use successfully by the LUCS-KDD reseach group to contrast and compare a variety of ARM algorithms and techniques. The software is available for free, however the author would appreciate appropriate acknowledgement. The following reference format for referring to the Apriori-T implementation available here is suggested:

  1. Coenen, F. (2004), The LUCS-KDD Apriori-T Association Rule Mining Algorrithm, http://www.cxc.liv.ac.uk/~frans/KDD/Software/Apriori_T/aprioriT.html, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.




REFERENCES


  1. Agrawal, R. and Srikant, R. (1994), t algorithms for mining association rules, In proceedings of 20th VLDB Conference, Morgan Kaufman, pp487-499.
  2. Coenen and Leng (2004). Data Structures for Association Rule Mining: T-trees and P-trees To appear in IEEE Transaction in Knowledge and Data Engineering.



Created and maintained by Frans Coenen. Last updated 17 March 2004