THE LUCS-KDD TFP (TOTAL FROM PARTIAL) ASSOCIATION RULE MINING ALGORITHM
Frans Coenen
Department of Computer Science
The University of Liverpool
5 April 2004
Page last revised May 2006, May 2009
1. OVERVIEW
TFP (Total From Partial) is an Association Rule Mining (ARM)
algorithm developed by the
LUCS-KDD research team. The algorithm is an extension of the
Apriori-T (Apriori Total) ARM
algorithm, which is, in turn, an "apriori" (Agrawal and Srikant 1994)
style algorithm. Apriori-T is designed to process a binary valued input data set
so as to identify frequent (large) itemsets and store the resulting frequent
itemset information in a "reverse" set enumeration tree called a T-tree
(Total support Tree).
This T-tree can then be processed to identify ARs.
TFP proceeds in a similar manner to Apriori-T
except that, instead of operating on the raw input data directly,
the input data is first preprocessed and placed in a P-tree (Partial support
tree). As such, TFP has all the functionality of Apriori-T but
operates in a much more efficient manner as a result of the preprocessing
advantages offered by the P-tree data structure. These
advantages are particularly significant when operating with data that
features many duplicate records and/or records with duplicate leading
substrings.
The initial ideas incorporated in the TFP algorithm were first described in
Goulbourne et al. (1999) and reprinted in Goulbourne et al. (2000); further
developments were published in Coenen et al. (2001). The TFP algorithm in its
current form, together with further details concerning P-trees and T-trees
that are not available from this WWW page, can be found in Coenen et al. (2004a, 2004b).
Some additional refinements are discussed in Coenen and Leng (2002).
Subsequently the TFP algorithm has been further adapted in a number of
additional directions: (1) distributed/parallel ARM (Coenen et al. 2003,
Coenen and Leng 2006), (2) mining of very large databases that cannot be held in
primary storage (Ahmed et al. 2003, 2004, 2006), and (3) Classification Association
Rule Mining (Coenen and Leng 2005, Coenen et al. 2005).
2. DOWNLOADING THE SOFTWARE
The TFP ARM software comprises seven source files. These are provided
from this WWW page together with three application classes. The source files are
as follows:
- AssocRuleMining.java:
Set of general ARM utility methods to allow: (i) data input and input
error checking, (ii) data preprocessing, (iii) manipulation of records (e.g.
operations such as subset, member, union etc.) and (iv) data and parameter
output.
- PartialSupportTree.java:
Methods to implement the "TFP" algorithm using both the "Partial
support" and "Total support" tree data structure (P-tree and T-tree).
- PtreeNode.java: Methods concerned
with the structure of Ptree nodes.
- PtreeNodeTop.java: Methods
concerned with the structure of the top level of the P-tree, which comprises
an array of "top P-tree nodes" to allow direct indexing for reasons
of efficiency.
- RuleList.java: Set of methods that
allow the creation
and manipulation (e.g. ordering, etc.) of a list of ARs.
- TotalSupportTree.java:
Methods to implement the "Apriori-T"
algorithm using the "Total support" tree data structure (T-tree).
- TtreeNode.java: Methods concerned
with the structure of Ttree nodes.
The PtreeNode, PtreeNodeTop and TtreeNode
classes are separate from the remaining classes, which are
arranged in a class hierarchy of the form illustrated below.
AssocRuleMining
|
+--------------+---------------+
| |
TotalSupportTree RuleList
|
PartialSupportTree
The TFP application classes included here are as follows:
- AprioriTFPapp.java: Fundamental
TFP application.
- AprioriTFPsortedApp.java:
TFP application with input data preprocessed so that it is ordered
according to frequency of single items --- this serves to reduce the computation
time.
- AprioriTFPsortedPrunedApp.java:
TFP application with data ordered according to frequency of single items
and columns representing unsupported 1-itemsets removed --- again this serves
to enhance computational efficiency.
The TFP algorithm will also allow use of Apriori-T application classes.
Three are provided here as listed below; however, for details the interested
reader should refer to
the Apriori-T documentation.
- AprioriTapp.java
- AprioriTsortedApp.java
- AprioriTsortedPrunedApp.java
There is also a "tar ball" tfpLink.tgz
that can be downloaded that includes all of the above source and application
class files. It can be unpacked using tar -zxf tfpLink.tgz.
2.1. Compiling
The TFP software has been implemented in Java using the Java2 SDK
(Software Development Kit) Version 1.4.0, which should therefore make it highly
portable. The code does not require any special packages and thus can be
compiled using the standard Java compiler:
javac *.java
2.2. Documentation
The code can be documented using Java Doc. First create a directory
Documentation in which to place the resulting HTML pages and then type:
javadoc -d Documentation/ *.java
This will produce a hierarchy of WWW pages contained in the Documentation
directory.
3. RUNNING THE SOFTWARE
Once compiled, the software can be invoked in the normal manner using the
Java interpreter:
java APPLICATION_CLASS_FILE_NAME
If you are planning to process a very large data set it is a good idea to
grab some extra memory. For example:
java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME
The input to the software is, in all cases, a (space separated)
binary valued data set R. The set R
comprises a set of N records such that each record r,
in turn, comprises a set of attributes. Thus:
R = {r | r ⊆ A}
where A is the set of available attributes. The value D is
then defined as:
D = |A|
We then say that a particular data set has D columns and
N rows. A small example data set might be as follows:
1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7
where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that
attribute numbers are ordered sequentially commencing with the
number 1 (the value 0 has a special meaning). The input
file is specified using the -F flag.
The program assumes support and confidence default threshold values of 20% and
80% respectively. However the user may enter their own thresholds using
the -S and -C flags. Thresholds are expressed as percentages.
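As an illustration only (this helper is not part of the TFP distribution; the class and method names are hypothetical), a record line in the format described above can be parsed into the kind of short-array representation the software works with along the following lines:

```java
// Hypothetical helper (not part of the TFP source) showing how a record in
// the space separated format described above maps to a short array.
public class RecordParser {

    /** Parse one record line, e.g. "1 2 3 6", into an ascending short array. */
    public static short[] parseRecord(String line) {
        String[] tokens = line.trim().split("\\s+");
        short[] record = new short[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            // Attribute numbers start at 1; the value 0 has a special meaning
            record[i] = Short.parseShort(tokens[i]);
        }
        java.util.Arrays.sort(record); // attributes are held in ascending order
        return record;
    }
}
```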
Some example invocations, using a discretized/normalised Pima Indians
data set (also available from this
site) and each of the three application classes provided by this WWW site, are
given below:
java AprioriTFPapp -FpimaIndians.D42.N768.C2.num
java AprioriTFPsortedApp -FpimaIndians.D42.N768.C2.num -S2.5
java AprioriTFPsortedPrunedApp -FpimaIndians.D42.N768.C2.num -S1 -C90
(note that the ordering of flags is not significant). The output from each
application is a set of ARs (ordered according to confidence) plus some
diagnostic
information (run time, number of rules generated etc.).
4. MORE DETAIL ON APPLICATION CLASSES
TFP application classes all have the following basic form:
public class < CLASS_NAME > {
    /** Main method. */
    public static void main(String[] args) throws IOException {
        // Create instance of class PartialSupportTree
        PartialSupportTree newtfpLink = new PartialSupportTree(args);
        // Read data to be mined from file
        newtfpLink.inputDataSet();
        // If desired either: (1) keep the data set as it is (do no
        // preprocessing), (2) reorder the data set according to
        // frequency of single attributes:
        newtfpLink.idInputDataOrdering();
        newtfpLink.recastInputData();
        // or (3) reorder and prune the input data:
        newtfpLink.idInputDataOrdering();
        newtfpLink.recastInputDataAndPruneUnsupportedAtts();
        newtfpLink.setNumOneItemSets();
        // Mine data and produce T-tree
        double time1 = (double) System.currentTimeMillis();
        newtfpLink.createTotalSupportTree();
        newtfpLink.outputDuration(time1, (double) System.currentTimeMillis());
        // Generate ARs
        newtfpLink.generateARs();
        // Output as desired using the many output methods that are available
        // (see below for further details)
        // End
        System.exit(0);
    }
}
Some output is always generated, such as: (1) the input parameters and start
settings, and (2) "milestones" during processing.
Additional output statements can be included in application classes.
The available additional output options are as follows:
- outputDataArray(): Outputs the input data.
- outputDataArraySize(): Outputs the size of the input data (number
of records and columns, number of elements and overall density).
- outputDuration(double time1, double time2): The elapsed time,
in seconds, between time1 and time2. Typically used to measure processing time:
double time1 = (double) System.currentTimeMillis();
// Program statements
< INSTANCE_NAME > .outputDuration(time1,
(double) System.currentTimeMillis());
- outputPtreeTable(): Output P-tree table (for computational
convenience the P-tree is cast into a table prior to being processed
further).
- outputPtreeTableStats() : Output parameters for P-tree table ---
storage, number of nodes.
- outputTtree(): Lists the T-tree nodes.
- outputTtreeStats(): Output parameters for T-tree --- storage, number
of nodes and number of updates.
- outputNumFreqSets(): The number of identified frequent sets.
- outputNumUpdates(): The number of updates used to build the
P-tree (a measure of the amount of "work done").
- outputFrequentSets(): Lists the identified frequent sets
contained in the T-tree, and their associated support values.
- outputStorage(): The storage requirements, in Bytes, for the
T-tree.
- outputNumRules(): The number of discovered ARs.
- outputRules(): Lists the ARs and their
associated confidence values.
- outputRulesWithReconversion(): Lists the ARs but using reconversion
(only appropriate where attributes have been reordered).
Note that the first eleven of the above output methods are instance
methods contained in either the PartialSupportTree class or its
parent classes. The last three of the above are instance
methods of the RuleList class and thus must be called using an
instance of that class. An instance of the RuleList class is created
during processing and may be accessed using the getCurrentRuleListObject()
public method found in the TotalSupportTree class. Thus, for example,
the outputRules() method would be invoked as follows:
< INSTANCE_NAME > .getCurrentRuleListObject().outputRules();
5. THE P-TREE DATA STRUCTURE
A P-tree is a set enumeration tree structure in which to store
partial counts for item sets. The top,
single attribute, level comprises an array of
references to structures of the form shown below, one for each column.

Identifier      | Type
----------------|----------------------------------------
Support         | int (4 Bytes)
Child reference | Pointer to child P-tree node (4 Bytes)
Each of these top level structures is then the root of
a sub-tree of the overall P-tree. The nodes in each of these sub-trees are
represented as structures of the form:
Identifier        | Type
------------------|------------------------------------------
Node code         | Array of short integers (2 Bytes each)
Support           | int (4 Bytes)
Child reference   | Pointer to child P-tree node (4 Bytes)
Sibling reference | Pointer to sibling P-tree node (4 Bytes)
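The two node structures above can be sketched in Java as follows; the class and field names are illustrative and may differ from those used in the actual PtreeNode and PtreeNodeTop classes:

```java
// Illustrative sketch of the P-tree node structures described above; the
// actual PtreeNode/PtreeNodeTop classes may use different names.
public class PtreeSketch {

    /** Internal node of a P-tree sub-tree. */
    static class Node {
        short[] nodeCode;  // attribute numbers (2 bytes each)
        int support;       // partial support count (4 bytes)
        Node childRef;     // reference to child P-tree node
        Node siblingRef;   // reference to sibling P-tree node
    }

    /** Top level structure: one per column, held in an array for direct indexing. */
    static class TopNode {
        int support;       // partial support count (4 bytes)
        Node childRef;     // reference to child P-tree node
    }

    /** The top level of the P-tree: element i represents attribute i+1. */
    static TopNode[] createTopLevel(int numAttributes) {
        TopNode[] top = new TopNode[numAttributes];
        for (int i = 0; i < numAttributes; i++) top[i] = new TopNode();
        return top;
    }
}
```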
Assuming a sample data set as shown below:

Row Number | Columns
-----------|--------
1          | 1 3 4
2          | 2 4 5
3          | 2 4 6
where D=6 (number of columns/attributes) and N=3
(number of records). Prior to commencement of the generation
process the P-tree will comprise a 6 element array as
shown in Figure 1(a). The first row will be stored in the P-tree as shown in
Figure 1(b).
(a): Prior to start
(b): Addition of first row ({1, 3, 4}).
(c): Addition of second row ({2, 4, 5}).
(d): Addition of third row ({2, 4, 6}).
Figure 1: P-tree Generation
Note that the support for the parent node is incremented by one. The second
row is then included as shown in
Figure 1(c), and the third as shown in Figure 1(d). Note
that because this last row shares a common leading substring with an existing
node in the P-tree a dummy node is created to ensure that the tree
does not become "flat".
Inspection of the final P-tree as shown in Figure 1(d) indicates that nothing is lost if
we do not store that part of any particular "node code" which is duplicated in the parent
code (Figure 2).
Figure 2: P-tree in its final form.
The internal representation of the P-tree presented in Figure 2
is then as shown in Figure 3.
Figure 3: Internal representation of P-tree presented in Figure 2.
On completion of the generation process the P-tree itself is "thrown away"
and the contents of the tree stored in a P-tree table --- an array of
arrays each element of which is a pointer to a structure of the form:
Identifier           | Type
---------------------|---------
Node code            | short []
(Entire) parent code | short []
Support              | int
The node code is the set of column numbers represented by any
individual node (not the complete itemset represented by the node). The
parent code is the union of all the node codes of the current node's
"ancestor nodes" (parent node, grandparent node, etc.). The union of the parent
code and the node code for any individual node is the itemset represented by
the node. The first index of the array of arrays
represents the size (cardinality) of the itemset represented by the node.
Thus all singletons are stored at index 1, doubles at index 2, and so
on.
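The table record and its indexing by cardinality can be sketched as follows (the class, field and method names are assumptions, not taken from the TFP source); since a node code and its parent code are disjoint, the cardinality of the itemset is simply the sum of their lengths:

```java
import java.util.Arrays;

// Illustrative sketch of a P-tree table record; names are assumptions,
// not those of the actual TFP implementation.
public class PtreeTableRecord {
    short[] nodeCode;    // column numbers represented by the node itself
    short[] parentCode;  // union of the node codes of all ancestor nodes
    int support;         // partial support count

    PtreeTableRecord(short[] parentCode, short[] nodeCode, int support) {
        this.parentCode = parentCode;
        this.nodeCode = nodeCode;
        this.support = support;
    }

    /** The complete itemset represented by this record: parentCode ∪ nodeCode. */
    short[] itemSet() {
        short[] union = new short[parentCode.length + nodeCode.length];
        System.arraycopy(parentCode, 0, union, 0, parentCode.length);
        System.arraycopy(nodeCode, 0, union, parentCode.length, nodeCode.length);
        Arrays.sort(union);
        return union;
    }

    /** Index into the array of arrays: the cardinality of the itemset. */
    int tableIndex() {
        // parent and node codes are disjoint, so cardinality is the sum
        return parentCode.length + nodeCode.length;
    }
}
```

For example, the record with parent code {2} and node code {4} represents the itemset {2, 4} and is stored at index 2.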
The advantages offered by the P-tree table are:
- Reduced storage requirements (particularly where the data set contains duplicate rows).
- Faster run times, because the desired total support counts have already been partially
calculated.
Internally the P-tree presented in Figure 3 will be stored in
tabular form as shown in Figure 4.
Figure 4: P-tree table.
6. GENERATING THE T-TREE FROM THE P-TREE TABLE
The TFP algorithm is used to determine ARs by first creating a T-tree
from the P-tree. This process can best be explained by considering the P-tree table
given in Figure 4. The T-tree is generated in an Apriori manner. There are a
number of features of the P-tree Table that enhance the efficiency of this process:
- The first pass of the P-tree table is to calculate supports for singletons,
and thus the entire table must be traversed. However, on the second pass, when
calculating the support for "doubles", we can ignore the top level of the
P-tree table, i.e. we can start processing from index 2. Further, at the end of the
previous pass we can delete the top level (cardinality = 1) part of the table.
Consequently, as the T-tree grows in size the P-tree table shrinks.
- To prevent double counting, on the first pass of the P-tree, we update only
those elements in the top level array of the T-tree that correspond to the
column numbers in node codes (not parent codes). On the
second pass, for each P-tree table record found, we consider only those
branches in the T-tree that emanate from a top level element corresponding to a
column number represented by the node code (not the parent code). Once the
appropriate branch has
been located we proceed down to level 2 and update those elements that
correspond to the column numbers in the union of the parent and node codes.
We then repeat this process for all subsequent levels until there are no more
levels in the T-tree to consider.
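The double-counting rule for the first pass can be sketched as below; this is a simplified illustration (not the actual TFP code, and the names are hypothetical) in which each table record contributes its support only to the attributes in its node code, never its parent code:

```java
// Simplified sketch (not the actual TFP code) of the first pass over the
// P-tree table: singleton supports are accumulated from node codes only,
// never from parent codes, so no partial count is counted twice.
public class FirstPass {

    /**
     * Returns an array where element i holds the support for singleton {i};
     * index 0 is unused because attribute numbers start at 1.
     */
    public static int[] countSingletons(short[][] nodeCodes, int[] supports,
                                        int numAttributes) {
        int[] topLevel = new int[numAttributes + 1];
        for (int r = 0; r < nodeCodes.length; r++) {
            for (short attr : nodeCodes[r]) {
                topLevel[attr] += supports[r]; // node code only, not parent code
            }
        }
        return topLevel;
    }
}
```

Applied to the example P-tree table (node codes {1}, {2}, {4}, {3,4}, {5} and {6} with partial supports 1, 2, 2, 1, 1 and 1), this yields the correct singleton supports, e.g. a support of 3 for attribute 4.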
Thus returning to the P-tree table presented in Figure 4 we proceed as
follows:
1. T-tree Level 1.
- Commence with an empty top-level T-tree (Figure 5(a)).
- Pass through the P-tree table
level by level, starting with level 1, updating the top-level of the T-tree according to the nature of
each record in the P-tree table. Thus:
- The first record encountered represents the node {1},
and thus we update the support for element 1 of the top-level of the T-tree
(Figure 5(b)).
- The second record in the P-tree table to be considered is the node
{2} and consequently we update the support
element 2 in the top-level of the T-tree (Figure 5(c)) --- note that the support for
node {2} is 2.
- We now pass to the second level in the P-tree table (index 2), which
represents doubles (set cardinality = 2). There is only one record at this level,
with parent code
{2} and node code {4} (the record in its entirety represents the
P-tree node {2,4}). The contribution of this node to element 2 (the parent
code) in the T-tree
has already been considered; thus, as before, we only consider the node code and increment
the support for element 4 in the T-tree (Figure 5(d)).
- We now pass down to the third level in the P-tree table. There are three
nodes here. We only consider the node codes and thus first increment the supports
for elements 3 and 4 in the T-tree (Figure 5(e)), then
element 5 (Figure 5(f)), and then element 6 (Figure 5(g)).
We have now completed one pass of the P-tree and we can remove the singles from
the P-tree table (Figure 6), prune the T-tree of those nodes that are not
adequately supported (Figure 5(h)), and generate the next T-tree level
(Figure 5(i)).
(a): Prior to start
(b): P-tree Table, Level 1, Node 1.
(c): P-tree Table, Level 1, Node 2.
(d): P-tree Table, Level 2, Node 1.
(e): P-tree Table, Level 3, Node 1.
(f): P-tree Table, Level 3, Node 2.
(g): P-tree Table, Level 3, Node 3.
(h): Level 1 pruning
(i): T-tree prior to second pass of P-tree Table
Figure 5: T-tree Generation using a P-tree table (Level 1).
Figure 6: P-tree table with cardinality 1 nodes removed.
2. T-tree Level 2.
- Continuing on to level two of the T-tree we commence with the cardinality
2 sets in the P-tree table:
- There is only one cardinality 2 node in the table; this has parent code
{2} and node code {4}. Therefore we search the T-tree branch
emanating from
element 4 only (the node code), and do not consider the branch emanating from
element 2 (the parent code). Thus we pass down branch 4 to the second level
and attempt to update elements 4 and 2 (the union of the node and parent codes).
However, there is no element 4 in this part of the T-tree so we only
update the support associated with element 2 (Figure 7(a)).
- We then move to the cardinality 3 nodes in the P-tree Table. The first we
encounter has a parent code of {1} and a
node code of {3,4}, thus we search only the T-tree branches emanating from
elements 3 and 4. There is no element 3, so we only consider element 4 and pass
down this branch to level 2 in the T-tree and update the elements
{1,3,4} (the union of the node and parent P-tree codes) if they are
present in this part of the T-tree (Figure 7(b)).
- We repeat this process for the other two nodes. The first of these has node
code {5} which does not appear in the top level of the T-tree (as pruned
so far) and consequently we can ignore this node. The same applies to the last
level 3 P-tree Table node.
This completes the second pass of the P-tree table; we set the size 2 reference
in the P-tree Table to null (Figure 8), prune the T-tree so far (Figure 7(c)),
and generate the third level of the T-tree (Figure 7(d)).
(a): P-tree Table, Level 2, Node 1.
(b): P-tree Table, Level 3, Node 1.
(c): Level 2 pruning
(d): T-tree prior to third pass of P-tree Table
Figure 7: T-tree Generation using a P-tree table (Level 2).
Figure 8: P-tree table with cardinality 1 and 2 nodes removed.
3. T-tree Level 3.
- On subsequent levels (level three and beyond) we repeat the process, starting with the cardinality
3 nodes in the P-tree table:
- The first node has node code {3,4}, thus we search only those T-tree
branches emanating from elements 3 and 4 in the T-tree --- there is no element 3
so we only consider element 4. When we get down to the right level (level 3) we
update the elements {1,3,4} if present (Figure 9(a)).
- The second node has node code {5}, which is not in the top level of the
T-tree, so it can be ignored.
- The last node has node code {6}, and this is also not in the T-tree.
This then completes the third pass of the P-tree table and consequently we prune
level 3 of the T-tree (Figure 9(b)), and attempt to generate the next level in
the T-tree; however, in this case there are no more levels to generate and
thus we have come to the end of the T-tree generation process (Figure 9(c)).
The generation process is also ended because, after removing level 3 in the P-tree
Table (Figure 10) there are no more nodes in the table to consider.
(a): P-tree Table, Level 3, Node 1.
(b): Level 3 pruning
(c): End of T-tree generation process
Figure 9: T-tree Generation using a P-tree table (Level 3).
Figure 10: P-tree table with cardinality 1, 2 and 3 nodes removed.
7. CONCLUSIONS
The TFP ARM algorithm described here
has been used successfully by the LUCS-KDD research group to contrast and
compare a variety of ARM algorithms and techniques. The software is available for
free, however the author
would appreciate appropriate acknowledgement. The following reference format
for referring to the TFP implementation available here
is suggested:
- Coenen, F. (2004), The LUCS-KDD
TFP Association Rule Mining Algorithm,
http://www.csc.liv.ac.uk/~frans/KDD/Software/Apriori_TFP/aprioriTFP.html,
Department of Computer Science, The University of Liverpool, UK.
Should you discover any "bugs" or other problems within the software (or this
documentation), do not hesitate to contact
the author.

REFERENCES
- Agrawal, R. and Srikant, R. (1994),
Fast algorithms for mining association rules,
In Proceedings of the 20th VLDB Conference, Morgan Kaufmann, pp487-499.
- Ahmed, S., Coenen, F.P. and Leng, P. (2003).
Strategies for
Partitioning Data in Association Rule Mining.
In Coenen, F.P., Preece, A. and Macintosh, A.L. (Eds.), Research and
Development in Intelligent Systems XXI, Springer, London, pp127-140.
- Ahmed, S., Coenen, F. and Leng, P. (2004).
A Tree Partitioning Method for Memory Management
in Association Rule Mining.
In Kambayashi, Y., Mohania, M. and Woll, W. (eds) Data Warehousing and
Knowledge Discovery, (Proc DAWAK 2004 conference, Zaragosa), September 2004:
LNCS 3181, Springer, pp331-340.
- Ahmed, S., Coenen, F.P. and Leng, P. (2006).
Tree-based Partitioning of Data for Association
Rule Mining.
Knowledge and Information Systems Journal, Vol 10, Num. 3 (October, 2006),
pp315-331.
- Coenen, F.P., Goulbourne, G. and Leng, P. (2001).
Computing Association Rules Using
Partial Totals.
In de Raedt, L. and Siebes, A. (Eds), Principles of Data Mining and Knowledge Discovery, Proc PKDD 2001, Springer Verlag LNAI 2168, pp54-66.
- Coenen, F.P. and Leng, P. (2002).
Finding Association Rules With Some Very
Frequent Attributes.
In "Principles of Data Mining and Knowledge discovery", eds Elomaa, T., Mannila, H. and Toivonen, H., Proc PKDD 2002 Conference, Helsinki, August 2002: Lecture Notes in AI 2431, Springer-Verlag: 99-111.
- Coenen, F.P., Leng, P. and Ahmed, S. (2003).
T-Trees, Vertical Partitioning and
Distributed Association Rule Mining.
Proceedings ICDM-2003, pp513-516.
- Coenen, F., Leng, P. and Ahmed, S. (2004a).
Data Structures for Association Rule
Mining: T-trees and P-trees.
IEEE Transactions on Data and Knowledge Engineering, Vol 16, No 6, pp774-778.
- Coenen, F.P. Leng, P. and Goulbourne, G. (2004b).
Tree Structures for Mining Association
Rules.
Journal of Data Mining and Knowledge Discovery, Vol 8, No 1, pp25-51.
- Coenen, F. and Leng, P. (2005).
Obtaining Best Parameter Values for
Accurate Classification.
Proc. ICDM'2005, IEEE, pp597-600.
- Coenen, F., Leng, P. and Zhang, L. (2005).
Threshold Tuning for Improved
Classification Association Rule Mining.
Proceeding PAKDD 2005, LNAI3158, Springer, pp216-225.
- Coenen, F.P. and Leng, P. (2006).
Partitioning
Strategies for Association Rule Mining.
The Knowledge Engineering Review, Vol. 21, Issue 1 (March 2006), pp 25-47.
- Goulbourne, G., Coenen, F.P. and Leng, P. (1999).
Algorithms for Computing Association
Rules Using A Partial-Support Tree.
Proceedings ES99, Springer, London, pp132-147.
- Goulbourne, G., Coenen, F.P. and Leng, P. (2000).
Algorithms for Computing Association
Rules Using a Partial-Support Tree.
Journal of Knowledge-Based Systems, Vol 13, pp141-149. (Reprint of ES'99
paper).
Created and maintained by
Frans Coenen.
Last updated May 2009