THE LUCS-KDD WEIGHTED FUZZY ASSOCIATION RULE MININER: THE WEIGHTED, FUZZY AND FUZZY WEIGHTED APRIORI-T ALGORITHMS



Liverpool University

Frans Coenen

Department of Computer Science

The University of Liverpool

22 September 2008

Revisions: 3 January 2013


CONTENTS

1. Introduction.
2. Downloading the software.
2.1. Compiling.
2.2. Documentation.
3. Running the software.
3.1. Input Data.
3.2. Weighting Data.
3.3. Output Schemas.
 
4. Support calculation.
4.1. Weighted Support.
4.2. Fuzzy Support.
4.3. Fuzzy Weighted Support.
5. Notes on GUI Interface
6. Acknowledgements.
7. Conclusions.



1. INTRODUCTION


Association Rule Mining (ARM) is concerned with finding "interesting" patterns in binary valued data sets.

In classical ARM the assumption is that all attributes in the dataset are of equal significance; in practice there are many applications where this is not the case. Consequently Weighted ARM (WARM) is a variation of classical ARM where individual attributes are weighted. ARM algorithms typically obtain an efficiency gain by using what is commonly termed the "Downward Closure Property" (the DC property) of items sets, i. e. that for an item set to be frequent all its subsets must be frequent; or, vice-versa, if an itemset is not frwquent that non of its supersets can be frequent. There are many schemes proposed in the literature for implementing WARM, some which maintain the DC property and some which do not. The view here is that for WARM to be effective the application of weightings must be such that the DC property is maintained.

From the above ARM data sets are expected to be binary valued. Of course, in practice, not all fields in the data sets we want to apply ARM to are binary valued. Given a field that can take one of a sequence of values, for example an attribute colour with a value set of {Red, Green, Blue}, the practice is to turn this into a sequence of binary valued attributes (three in the example). Given a numeric attribute that can take a range of values, for example the attribute age with a value range of {0..120} the practice is to "range" this set of values into a sub ranges. In the case of the above age attribute these might be {0..30, 31..60, 61..90, 91..120}, i.e. four binary valued attributes. We can also also ascribe labels to these ranges, for example {young, middle_aged, old, very_old}. The LUCS-KDD research team have produced software to perform such LUCS-KDD Data Normalisation"> discretisation/normalisation.

The problem with the above approach to dividing ranged values into sub-ranges is that the boundaries between the sub-ranges are crisp boundaries. Thus, in the above example, an age of 30 will be allocated to the range young when it could be argued that this value should also be partly associated with the range middle_aged. It is suggested that this crisp boundary problem effects the outcome of ARM when applied to ranged data.

Fuzzy Association Rule Mining (FARM) is intended to address the crisp boundary problem encountered in traditional ARM. The principal idea is that ranged values can belong to more than one sub-range, we say that the value has a membership degree that associates it with one or more sub-ranges. The total membership degree for a value typically totals to one. There are many different membership functions reported in the literature that are used to calculate the membership degree foe a given numeric value.

The software that may be downloaded from this page allows: WARM, FARM and FWARM; the last being a combination of Fuzzy and Weighted ARM. All are variations of the established Apriori-T algorithm. Apriori-T, like many ARM algorithms uses the support-confidence framework, a feature of DC property discussed above.




2. DOWNLOADING THE SOFTWARE


The Fuzzy and/or Weighted Apriori-T software comprises fourteen source files. These are provided from this WWW page together with an application class. The source files are as follows:

  1. AssocRuleMining.java: Set of general ARM utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g. operations such as subset, member, union etc.) and (iv) data and parameter output.
  2. FuzzyAprioriT.java: Class that contains methods to support Fuzzy Association Rule Mining (FARM) based on the Apriori-T algorithm.
  3. FuzzyDataItem.java: Class (record) describing a "fuzzy" data item is per input data.
  4. FuzzyTtreeNode.java: Methods concerned with fuzzy Ttree node structure. Arrays of these structures are used to store nodes at the same level in any sub-branch of the T-tree.
  5. FWARMaprioriT.java: Class that contains methods to support Fuzzy Weighted Association Rule Mining (FWARM) based on the Apriori-T algorithm.
  6. FWARMaprioriTcontrol.java: ontrol class for Fuzzy Weighted (FWARM) Apriori-T, contains methods for manipulating the GUI, etc.
  7. RuleNode.java: Class for storing binary tree of CRs (Classification Rules).
  8. TotalSupportTree.java: Methods concerned with the generation, processing and manipulation of T-tree data storage structures used to hold the total support counts for large itemsets.
  9. TtreeCanvas.java: Modules to paint a representation of the T-tree so that all T-tree nodes are placed in an X-Y plane which is then processed and output.
  10. TtreeCanvasSub.java: Modules to paint a representation of the T-tree where the T-tree comprises fuzzy data. Methods place all T-tree nodes are placed in an X-Y plane which is then processed and output.
  11. TtreeNode.java: Methods concerned with Ttree node structure. Arrays of these structures are used to store nodes at the same level in any sub-branch of the T-tree.
  12. TtreeWindow.java: Class to create a window in which the T-tree canvas can be displayed.
  13. WeightedAprioriT.java: Class that contains methods to support Weighted Association Rule Mining (WARM) based in the Apriori-T algorithm.

The decisionTreeNode and RuleNode classes are separate to the remaining classes which are arranged in a class hierarchy of the form illustrated below.

        AssocRuleMining
               |
       TotalSupportTree
               |
         FuzzyAprioriT
               |
      +--------+--------+
      |                 |
FWARMaprioriT   WeightedAprioriT

One fuzzy weighted Apriori-T application class is included here is as follows:

  1. WeightedFuzzyAprioriT_GUI_App.java: Contains methods to implement the Fuzzy Weighted ARM variation of Apriori-T algorithm, and Fuzzy and Weighted only versions.

There is also a "tar ball" weightedFuzzyAprioriT.tgz that can be downloaded that includes all of the above source and application class files. It can be unpacked using tar -zxf weightedFuzzyAprioriT.tgz.


2.1. Compiling

The decision tree software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.5.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:

javac *.java

2.2. Documentation

The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:

javadoc -d Documentation/ *.java

This will produce a hierarchy of WWW pages contained in the Document directory.




3. RUNNING THE SOFTWARE


When compiled the software can be invoked in the normal manner using the Java interpreter:

java APPLICATION_CLASS_FILE_NAME

If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:

java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME

The input to the software always comprises at least two files:

  1. A data file, and
  2. A weighting file.

In addition an optional input schema file may also be loaded. Each style of input file is discussed further in the following sub- sections.

3.1. Data Input

There are two forms of input data: (i) binary valued data for Weighted ARM only and (ii) fuzzy valued data for Fuzzy or Weighted Fuzzy ARM.

Binary Valued Data

Binary valued input comprises a (space separated) binary valued data set R. The set R comprises a set of N records such that each record (r), in turn, comprises a set of attributes. Thus:

R  = {r | r = subset A}

Where A is the set of available attributes. The value D is then defined as:

D = |A|

We then say that a particular data set has D columns and N rows. A small example data sets might be as follows:

1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7

where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning).

Fuzzy Valued Data

In this case the input is a (space separated) numeric data set R with values ranging between 0..1. Two example data sets test1.num and test2.num are available from this WWW page to act as an examples for users. The test1.num data set (the smaller of the two) is as follows:

<c,1.0>
<a,0.25> <b,0.5>
<a,0.5> <c,0.75>
<a,0.5> <b,0.25>

3.2. Weighting Data

Weightings are loaded in a separate text file. The assumption is that the weighting for each attribute is constant. Weightings are expressed in the form of real numbers between 1.0 and 0.0. The format is one weighting per line. Thus there should be as many lines in the weighting file as attributes. Example (the testWeightings1.txt file available from this WWW page):

0.2
1.0
0.3

3.3. Output Schema

A feature of the GUI is that output can be done using an output schema. This is simply a text file of attribute labels, one label per line (of course we need as many labels as attributes). Table 2 gives the output schema for test1 data file test1.schema.

verySmall_A
small_A
medium_A

Table 2: Schema file example




4. SUPPORT CALCULATION


There are three different mechanisms for calculating support according to the nature of the algorithm chosen:

  1. Weighted Support
  2. Fuzzy Support
  3. Weighted Fuzzy Support

4.1. Weighted Support

For single item sets the support is simply the sum of the weightings divided by the total number of records.

For two item sets and greater the support is the sum of the products of the weightings.

Thus given the data set (where the attribute set is {a, b, c}):

c
b c
a c
a b

and a weighting file of the form:

0.25
0.5
0.75

The support calculations will be:

{a} = (0.25+0.25)/4 = 0.125
{b} = (0.5+0.5)/4 = 0.25
{c} = (0.75+0.75+0.75)/4 = 0.563
{a b} = ((0.25*0.5))/4 = 0.031
{a c} = ((0.25*0.75))/4 = 0.047
{b c} = ((0.5*0.75))/4 = 0.094

4.3. Fuzzy Support

For single item sets the support is simply the sum of the membership functions divided by the total number of records.

For two item sets and greater the support is the sum of the products of the membership functions.

Thus given the data set (again where the attribute set is {a, b, c}):

<c,1.0>
<b,0.5><c,0.5>
<a,0.5><c,0.5>
<a,0.5><b,0.5>

The support calculations will be:

{a} = (0.5+0.5)/4 = 0.25
{b} = (0.5+0.5)/4 = 0.25
{a} = (1.0+0.5+0.5)/4 = 0.5
{a b} = ((0.5*0.5))/4 = 0.06
{a c} = ((0.5*0.5))/4 = 0.06
{b c} = ((0.5*0.5))/4 = 0.06

4.3. Fuzzy Weighted Support

For single items sets the support is the sum of the product calculation for each weighting/fuzzy member ship pair (w*f). For 2-itemsets and larger the support is the sum of the products of all the weightings and fuzzy membership calculations. Thus given the data set below (for ease of reading the attribute numbers have been replaced by letters):

<c,1.0>
<a,0.25> <b,0.5>
<a,0.5> <c,0.75>
<a,0.5> <b,0.25>

and the associated weighting file:

0.2
1.0
0.3

The support calculations will be as follows:

{a}   = ((0.25*0.2)+(0.5*0.2)+(0.5*0.2))/4        = *(0.05 + 0.1 + 0.1)/4 = 0.0625
{b}   = ((0.5*1.0)+(0.25*1.0))/4                  = (0.5+0.25)/4          = 0.1875
{c}   = ((1.0*0.3)+(0.75*0.3))/4                  = (0.3+0.225)/5         = 0.13125
{a b} = ((0.25*0.2*0.5*1.0)+(0.5*0.2*0.25*1.0))/4 = (0.025+0.025)/4       = 0.0125
{a c} = ((0.5*0.2*0.75*0.3))/4                    =  0.0225/4             = 0.005625



5. NOTES ON GUI INTERFACE


The interface has six menus. Each is discussed below.

  1. File: Five options:
    1. About: Brief overview of software displayed in text area.
    2. Load Data: Load input data (binary or fuzzy).
    3. Load Weightings: Load associated weightings file (required if doing Weighted ARM).
    4. Load Schema: Load output schema (optional).
    5. Reset: Resets all variables to default state.
    6. Exit: End program.
  2. Param. Input: Two options:
    1. Support: Input support value.
    2. Confidence: Input confidence value.
  3. Data Pre-Proc.: Two options:
    1. Sort: Sort data so that most frequent one itemsets are listed first (this makes T-tree more balanced and therefore aids efficiency).
    2. Sort & Prune: Prunes unsupported one itemsets as well as sorting and thus adds extra efficiency.
  4. Apriori-T: Three variations of the Apriori-T algotith as listed below. Their individual focus is self evident from their titles. The process of X checking refers to search of the T-tree to determine whether all subsets of a candidate set are supported, this adds an extra level of complexity but is usually worth doing.
    1. Weighted Apriori-T (X check).
    2. Fuzzy Apriori-T (X check).
    3. Weighted Fuzzy Apriori-T (X check)".
  5. Generator: Two options for generating ARs:
    1. Generate ARs (Min Conf): Generate ARs using the support-confidence framework.
    2. Generate ARs (Lift): Generate ARs using the support-lift framework.
  6. Output: 6 options, most with sub options:
    1. Output Schema: Outputs the output schema (if loaded).
    2. Data Array: Outputs the data or weighting input as follows:
      1. Data Att. Numbers: Outputs the input data file in "raw" form as attribute numbers).
      2. Data Output Schema: Outputs the input data file using the output schema.
      3. Weightings: Outputs the weightings file.
    3. T-tree: Outputs the T-tree information as follows:
      1. T-tree Statistics: Outputs t-tree statistics.
      2. T-Tree (Att. Numbers): Outputs T-tree in numeric form (to text area)
      3. T-tree (Graph): Outputs T-tree as a graph in a separate window (provided there are only a limited number of nodes)
    4. Fequent Sets: Outputs the identified frequent sets:
      1. FSs Att. Numbers: In the form of attribute numbers.
      2. FSs Output Schema: Using the output schema.
    5. Association Rules: Outputs ARS:
      1. ARs Att. Numbers: In the form of attribute numbers.
      2. ARs Output Schema: Using the output schema.
    6. Diagnostocs: Currently only one option:
      1. Conversion arrays: Outputs the conversion arrays generated when pruning has been invoked.



6. ACKNOWLEDGEMENTS


Many thanks Behzad Eslami Tehrani for spotting an error in the code and supplying the brrast.weightfile.




7. CONCLUSIONS


The Weighted Fuzzy Apriori-T algorithms described here has been use successfully used by the LUCS-KDD research group in its work on FWARM. The software is available for non-commercial usage free charge, however the author would appreciate appropriate acknowledgement. The following reference format for referring to the fuzzy Apeiori-T implementation available from this www site is suggested:

  1. Coenen, F. (2008), The LUCS-KDD Weighted Fuzzy Apriori-T Software, http://www.csc.liv.ac.uk/~frans/KDD/Software/WFapriori_TFP/weightedFuzzyAprioriTFP.html, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.




Created and maintained by Frans Coenen. Last updated 26 September 2008