THE LUCS-KDD FUZZY APRIORI-T SOFTWARE



Liverpool University

Frans Coenen

Department of Computer Science

The University of Liverpool

20 August 2008

Revisions:


CONTENTS

1. Introduction.
2. Downloading the software.
2.1. Compiling.
2.2. Documentation.
 
3. Running the software.
3.1. Documentation.
4. Support calculation.
5. Conclusions.



1. INTRODUCTION


Association Rule Mining (ARM) is concerned with finding "interesting" patterns in binary valued data sets. Standard (classical) ARM requires that all attributes are binary valued (yes-no, true-false, 0-1, etc.). Of course, in real life, not all fields in the data sets we want to apply ARM to are binary valued. Given a field that can take one of a sequence of values, for example an attribute colour with a value set of {Red, Green, Blue}, the practice is to turn this into a sequence of binary valued attributes (three in the example). Given a numeric attribute that can take a range of values, for example the attribute age with a value range of {0..120}, the practice is to "range" this set of values into a sub ranges. In the case of the above age attribute these might be {0..30, 31..60, 61..90, 91..120}, i.e. four binary valued attributes. We can also ascribe labels to these ranges, for example {young, middleAged, old, veryOld}. The LUCS-KDD research team have produced software to perform such "discretisation/normalisation".

The problem with the above approach to dividing ranged values into sub-ranges is that the boindaries between the sub-ranges are crisp boundaries. Thus, in the above example, an age of 30 will be allocated to the range young when it could be argued that this value should also be partly associated with the range middle_aged. The reverse argument might be applied to the age 31. It is suggested that this crisp boundary problem effects the outcome of ARM when applied to ranged data.

Fuzzy Association Rule Mining (FARM) is intended to address the crisp boundary problem encountered in traditional ARM. The principal idea is that ranged values can belong to more than one sub-range, we say that the value has a membership degree that associates it with each available sub-ranges. The total membership degree for a single attribute typically totals to one, and this is the expectancy with respect to The Fuzzy Apriori-T algorithm. There are many different membership functions reported in the literature that are used to calculate the membership degree for a given numeric value. Discussion of the nature of these membership functions is outside the scope of this WWW page. The code described here assumes that membership degree values have been pre calculated ancd that these are presented as attribute number (N) and membership degree (M) pairs of the form:

<N,M>

More specifically the software that may be downloaded from this page is a fuzzy version of the Apriori-T algorithm. Apriori-T, like many ARM algorithms uses the support-confidence framework, a feature of which is that for an item set to be "frequent" its supersets must also be frequent. This downward closure property is maintained in the Fuzzy Apriori-T algorithm.

The support for a 1-itemset is simply the sum of the membership degree values divided by the number of records in the data set. The support for N-item sets, for each record containing the item set, is the sum of the products of the membership degree values in each record. Thus given a data set of the form (we have used letters as attribute identifiers, the code actually expects integers starting from 1:

<c,1.0>
<a,0.5> <b,0.5>
<a,0.5> <c,0.5>
<a,0.5> <b,0.5>

The calculated support values will be:

{a} = 0.375
{c} = 0.375
{a c} = 0.0625
{b} = 0.25
{a b} = 0.125

Another example is given in Section 4 below.




2. DOWNLOADING THE SOFTWARE


The Fuzzy Apriori-T software comprises eleven source files including an application class. These are provided from this WWW page together with an application class. The source files are as follows:

  1. AssocRuleMining.java: Set of general ARM utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g. operations such as subset, member, union etc.) and (iv) data and parameter output.
  2. FuzzyAprioriT.java: Class that contains methods to support Fuzzy Association Rule Mining (FARM) based in the Apriori-T algorithm.
  3. FuzzyAprioriTcontrol.java: Control class for Fuzzy Apriori-T, contains methods for manipulating the GUI etc.
  4. FuzzyDataItem.java: Class (record) describing a "fuzzy" data item is per input data.
  5. FuzzyTtreeNode.java: Methods concerned with fuzzy T-tree node structure. Arrays of these structures are used to store nodes at the same level in any sub-branch of the T-tree.
  6. RuleNode.java: Class for storing binary tree of CRs (Classification Rules).
  7. TotalSupportTree.java: Methods concerned with the generation, processing and manipulation of T-tree data storage structures used to hold the total support counts for large itemsets.
  8. TtreeCanvas.java: Modules to paint a representation of the T-tree so that all T-tree nodes are placed in an X-Y plane which is then processed and output.
  9. TtreeNode.java: Methods concerned with T-tree node structure. Arrays of these structures are used to store nodes at the same level in any sub-branch of the T-tree.
  10. TtreeWindow.java: Class to create a window in which the T-tree canvas can be displayed.

Three of the classes are arranged in a class hierarchy of the form illustrated below.

 AssocRuleMining
        |
TotalSupportTree
        |
  FuzzyAprioriT

The fuzzy Apriori-T application class included here is as follows:

  1. FuzzyAprioriT_GUI_app.java: Fuzzy AprioriT GUI application class.

There is also a "tar ball" decTree.tgz that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf fuzzyAprioriT.tgz.


2.1. Compiling

The decision tree software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.5.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:

javac *.java

2.2. Documentation

The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:

javadoc -d Documentation/ *.java

This will produce a hierarchy of WWW pages contained in the Document directory.




3. RUNNING THE SOFTWARE


When compiled the software can be invoked in the normal manner using the Java interpreter:

java APPLICATION_CLASS_FILE_NAME

If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:

java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME

The input to the software, in all cases is a (space separated) tuple data set. Two example data sets test1.num and tesyt2.num are available from this WWW page to act as an examples for users.

To ease output, attributes can be identified by labels instead og integers (which tend to convey little meaning). This is done by loading a output schema file which is simply a list of strings (no quotes required) one per line for each attribute. Two example schema files are also available from this WWW page: test1.schema and tesyt2.schema.


3.1. Documentation

The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:

javadoc -d Documentation/ *.java

This will produce a hierarchy of WWW pages contained in the Document directory.




4. SUPPORT CALCULATION


For single item sets the support is simply the sum of the membership functions divided by the total number of records.

For two item sets and greater the support is the sum of the products of the membership functions.

Thus given the data set (where the attribute set is {a, b, c}:

<a,1.0>
<a,0.5> <b,0.5>
<a,0.5> <c,0.5>
<a,0.5> <b,0.5>

The support calculations will be:

{a} = (0.5+0.5)/4 = 0.25
{b} = (0.5+0.5)/4 = 0.25
{a} = (1.0+0.5+0.5)/4 = 0.5
{a b} = ((0.5*0.5))/4 = 0.06
{a c} = ((0.5*0.5))/4 = 0.06
{b c} = ((0.5*0.5))/4 = 0.06



5. CONCLUSIONS


The fuzzy Apriori-T algorithms described here has been use successfully used by the LUCS-KDD research group in its work on FARM. The software is available for non-commercial usage free of charge, however the author would appreciate appropriate acknowledgement. The following reference format for referring to the fuzzy Apeiori-T implementation available from this WWW site is suggested:

  1. Coenen, F. (2008), The LUCS-KDD Fuzzy Apriori-T Software, http://www.csc.liv.ac.uk/~frans/KDD/Software/Apriori_TFP/aprioriTFP.html, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.




Created and maintained by Frans Coenen. Last updated 20 August 2008