THE LUCS-KDD TFP ASSOCIATION RULE MINING ALGORITHM GUI



Liverpool University

Frans Coenen

Department of Computer Science

The University of Liverpool

13 February 2008


CONTENTS

1. Introduction.
2. Downloading the software.
2.1. Compiling.
2.2. Documentation.
 
3. Running the software.
4. GUI Options.
5. Output schemas.



1. INTRODUCTION

TFP (Total From Partial), also sometimes referred to as Apriori-TFP, is an Association Rule Mining (ARM) algorithm, developed by the LUCS-KDD research team. The TFP algorithm in its current form together with further details concerning P-trees and T-trees that are not available from this WWW page can be found in Coenen et al. (2004a, 2004b). The code obtainable from this page is a GUI version that includes (for comparison purposes) the LUCS-KDD implementation of the FPgrowth algorithm developed by Han et al. (2000).




2. DOWNLOADING THE SOFTWARE


The TFP ARM software comprises thirteen source files. These are provided from this WWW page together with an application class. The class files are as follows:

  1. AprioriTFP_GUI_App: Application class.
  2. AprioriTFPcontrol: Class containing the GUI control methods.
  3. AssocRuleMining: Set of general ARM utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g. operations such as subset, member, union etc.) and (iv) data and parameter output.
  4. FPtree: An implementation of Han's FP-growth ARM algorithm.
  5. PartialSupportTree: Methods to implement the TFP (Total From Partial) ARM algorithm using both the T-tree (Total support tree) and P-tree (Partial support tree data structures.
  6. PtreeCanvas: Paint canvas in which to display the P-tree.
  7. PtreeNode: Tree structure to store Ptree nodes. Same as top level structure (see below) but with the addition of a sibling branch.
  8. PtreeNodeTop: Top level Ptree node structure. An array of such structures is created in which to store the top level of the Ptree.
  9. PtreeWindow: Window in which to place the P-tree paint canvas.
  10. RuleNode: Set of methods that allow the creation and manipulation (e.g. ordering, etc.) of a list of ARs.
  11. TotalSupportTree: Methods to implement the "Apriori-T" algorithm using the "Total support" tree data structure (T-tree).
  12. TtreeCanvas: Methods to create a canvas onto which the T-tree may be painted if rquired.
  13. TtreeNode: Methods concerned with the structure of Ttree nodes. Arrays of these structures are used to store nodes at the same level in any sub-branch of the T-tree. Note this is a separate class to the other T-tree classes which are arranged in a class hierarchy as illustrated below.
  14. TtreeWindow: Methods to create a GUI window to contain a T-tree canvas (if requested).

The relationship between some of the principal classes is as shown in the hierarchy in Table 1 below.

                   AssocRuleMining
                          |
                   TotalSupportTree
                          |
   +----------------------+----------------------+
   |                                             |
FPtree                                 ParrtialSupportTree

Table 1: Class hierarchy


There is also a "tar ball" aprioriTFP_GUI.tgz that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf aprioriT.tgz.


2.1. Compiling

The software has been implemented in Java which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:

javac *.java

2.2. Documentation

The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:

javadoc -d Documentation/ *.java

This will produce a hierarchy of WWW pages contained in the Document directory.




3. RUNNING THE SOFTWARE


When compiled the software can be invoked in the normal manner using the Java interpreter:

java AprioriTFP_GUI_App

The resulting GUI window is shown in Figure 1. If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:

java -Xms600m -Xmx600m AprioriTFP_GUI_App
TFP GUI Window

Figure 1: TFP GUI Window

The input to the software, in all cases is a (space separated) binary valued data set D. The set R comprises a set of N records such that each record (r), in turn, comprises a set of attributes (A). We say that a particular data set has M columns and N rows. A small example data sets might be as follows:

1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7

where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning).

EXAMPLE

  1. Select File -> Load Data and load the pima.D38.N768.C2.num preprocessed data set also available from this WWW page.
  2. Select File -> Load Schema and load the pima.D38.N768.C2.outSchema output schema file available from this WWW page. This is optional but it makes the output more informative. For further details see Section 5 below.
  3. Select Param. Input -> Support and input a support value of 70. This is very high in terms of ARM but for the example we only want to generate a few rules, a much lower value is usually used.
  4. Select Param. Input -> Confidence and input a confidence value of 98. A high confidence value is not unusual in ARM, although a value of about 80% is more common.
  5. Select Gen. Frequent Sets -> TFP (with X check), this will generate a set of frerquent item sets with a support value greater or equal to 70% (the input value).
  6. Selecdt Generator -> Generate ARs (Min Conf), this will generate a set of ARs with a confidence value greater or equal to 98% (the input value).
  7. Select Output -> Association Rules -> ARs Att. Numbers this will produce output of the form given in Table 2, i.e. using attribute numbers.
  8. Select Output -> Association Rules -> ARs Output Schema this will produce output of the form given in Table 3, i.e. more informative output using the schema file loaded earlier.

Some notes on the suppoprt and confidence framework are available at:

http://www.csc.liv.ac.uk/~frans/KDD/Software/KDDnotes/noteOnSuppConfFwork.html.

OUTPUT ASSOCIATION RULES USING ATTRIBUTE NUMBERS:
(N) ANTECEDENT -> CONSEQUENT CONFIDENCE (%)
(1) {23 5} -> {19} 99.28
(2) {9 23 5} -> {19} 99.26
(3) {5} -> {19} 99.13
(4) {9 5} -> {19} 99.1
(5) {19 9 27 14} -> {23} 98.12
(6) {19 9 1 14} -> {23} 98.06
(7) {19 9 14 32} -> {23} 98.0

Table 2: ARs listed usinmg attribute numbers

 
OUTPUT ASSOCIATION RULES USING OUTPUT SCHEMA:
(N) ANTECEDENT -> CONSEQUENT CONFIDENCE (%)
(1) {BodyMassIndex <= 44.733 PlasmaGluConcent <= 142} -> {2-HourSerumIns <= 341} 99.28
(2) {DiastolicBldPress <= 94 BodyMassIndex <= 44.733 PlasmaGluConcent <= 142} -> {2-HourSerumIns <= 341} 99.26
(3) {PlasmaGluConcent <= 142} -> {2-HourSerumIns <= 341} 99.13
(4) {DiastolicBldPress <= 94 PlasmaGluConcent <= 142} -> {2-HourSerumIns <= 341} 99.1
(5) {2-HourSerumIns <= 341 DiastolicBldPress <= 94 DiabPedFunc <= 0.93 TricepsSkinFold <= 41} -> {BodyMassIndex <= 44.733} 98.12
(6) {2-HourSerumIns <= 341 DiastolicBldPress <= 94 NumberPregnacies <= 10 TricepsSkinFold <= 41} -> {BodyMassIndex <= 44.733} 98.06
(7) {2-HourSerumIns <= 341 DiastolicBldPress <= 94 TricepsSkinFold <= 41 Age <= 48} -> {BodyMassIndex <= 44.733} 98.0

Table 3: ARs listed using the output schema




4. GUI OPTIONS


The interface comprises six menus:

  1. File: Four items.
    1. About: Displays some instructions on a screen.
    2. Load Data: Allows input of a data set.
    3. Load Schema: Allows input of a data schema describing the data set (optional). Further details about schemas are given below.
    4. Exit: Shuts down the GUI.
  2. Parameter Input: Two items.
    1. Support: Allows input of a minimum support threshold.
    2. Confidence: Allows input of a minimum confidence threshold.
  3. Data Pre-Processing: Two items.
    1. Sort: Orders input data according to frequency of 1-itemsets (this make tree based ARM algorithms more efficient).
    2. Sort and Prune: Orders input data according to frequency of 1-itemsets and then prunes all occurrences of unsupported 1-itemset (again this make tree based ARM algorithms more efficient)
  4. Generate Frequent Sets: Two items.
    1. TFP (With X-check): Invokes TFP algorithm (version with cross checking).
    2. FP-Growth: Invokes FP Growth algorithm of Han et al.
  5. Generator: Options to generate ARs, two items.
    1. Generate ARs (Min Conf): Causes the generation of ARs using the confidence measure.
    2. Generate ARs (Lift): Causes the generation of ARs using the lift measure.
  6. Output: Many output options.



5. OUTPUT SCHEMAS


A feature of the TFP GUI is that output can be done using an output schema. This is simply a text file of attribute labels, one label per line (of course we need as many labels as attributes). Table 4 gives the first ten lines of an output schema for the UCL Pima indians data set after it has been processed uisng the LUCS-KDD Data Normalisation software. Note that in this case the output schema was also automatically genertaed by the LUCS-KDD Data Normalisation software, but it could equally well have been "hand crafted". Both the processed Pima indians set and the associated output schema are available from this WWW page.

NumberPregnacies <= 10
10 < NumberPregnacies <= 11
11 < NumberPregnacies <= 12
12 < NumberPregnacies
PlasmaGluConcent <= 142
142 < PlasmaGluConcent <= 150
150 < PlasmaGluConcent <= 152
152 < PlasmaGluConcent
DiastolicBldPress <= 94
94 < DiastolicBldPress <= 103
....

Table 4: Class hierarchy




REFERENCES


Coenen, F., Leng, P. and Ahmed, S. (2004a).
Data Structures for association Rule Mining: T-trees and P-trees. IEEE Transactions on Data and Knowledge Engineering, Vol 16, No 6, pp774-778.
Coenen, F.P. Leng, P. and Goulbourne, G. (2004b).
Tree Structures for Mining Association Rules. Journal of Data Mining and Knowledge Discovery, Vol 8, No 1, pp25-51.
Han, J., Pei, J. and Yiwen, Y. (2000).
Mining Frequent Patterns Without Candidate Generation. Proceedings ACM-SIGMOD International Conference on Management of Data, ACM Press, pp1-12.



Created and maintained by Frans Coenen. Last updated 25 August 2008