THE LUCS-KDD DECISION TREE CLASSIFIER SOFTWARE



Liverpool University

Frans Coenen

Department of Computer Science

The University of Liverpool

5 December 2007

Revisions: 31 December 2007, 9 January 2012, 2 August 2017


CONTENTS

1. Introduction.
2. Downloading the software.
2.1. Compiling.
2.2. Documentation.
 
3. Running the software.
4. Output.
5. Some example results.
6. Conclusions.



1. INTRODUCTION


Classification using decision trees was one of the earliest forms of data mining. Ross Quinlan's C4.5 is arguably the most referenced decision tree algorithm (Quinlan, 1993). The code available from this WWW page is the LUCS-KDD (Liverpool University Department of Computer Science - Knowledge Discovery in Data) group's own implementation of a decision tree classifier for binary valued data sets, such as those typically used in Association Rule Mining (ARM).

One of the most significant issues in decision tree generation is deciding on which attribute to split. Various algorithms have been proposed in the literature. Two are used here:

  1. Most frequent.
  2. Information gain.

The first selects the first attribute in a list of attributes ordered according to frequency within the entire data set. Information gain (Mitchell, 1997) is one of the standard measures used in decision tree construction.
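The following is a minimal sketch, in Java but not part of the LUCS-KDD code itself, of the information gain calculation for a single binary valued attribute as described by Mitchell (1997): the gain is the entropy of the class distribution over the whole data set minus the weighted entropy of the two partitions obtained by splitting on the attribute.

// A sketch of the information gain calculation for one binary valued
// attribute (illustration only, not the LUCS-KDD code itself).
public class InfoGainSketch {

    // Entropy of a class distribution, given the record count for each class.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int count : classCounts) total += count;
        double e = 0.0;
        for (int count : classCounts) {
            if (count == 0) continue;
            double p = (double) count / total;
            e -= p * (Math.log(p) / Math.log(2.0));   // log base 2
        }
        return e;
    }

    // Information gain of splitting on an attribute: withAttr[i] is the number
    // of records of class i that contain the attribute, withoutAttr[i] the
    // number of records of class i that do not.
    static double infoGain(int[] withAttr, int[] withoutAttr) {
        int numWith = 0, numWithout = 0;
        for (int count : withAttr) numWith += count;
        for (int count : withoutAttr) numWithout += count;
        int numRecords = numWith + numWithout;

        int[] all = new int[withAttr.length];
        for (int i = 0; i < all.length; i++) all[i] = withAttr[i] + withoutAttr[i];

        return entropy(all)
                - ((double) numWith / numRecords) * entropy(withAttr)
                - ((double) numWithout / numRecords) * entropy(withoutAttr);
    }

    public static void main(String[] args) {
        // Example: 9 records of class 1 and 5 of class 2; the attribute is
        // present in 8 of them (6 of class 1, 2 of class 2). Gain is about 0.048.
        System.out.println(infoGain(new int[] {6, 2}, new int[] {3, 3}));
    }
}

At each node the attribute with the highest gain (computed over the records that reach that node) is selected as the split attribute.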

The code given here divides the input data into a training set and a test set using a 50:50 split. The training set is used to build the decision tree, which is then evaluated using the test set. Once the decision tree has been generated a set of classification rules is extracted. The rules are ordered according to the size of the antecedent, with the rules with the largest antecedent (i.e. the most specific rules) listed first. The classifier is applied to the test set by selecting, for each test set example, the first rule whose antecedent is satisfied by that example. The last rule in the classifier is the default rule.
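The ordering and first-match strategy just described can be illustrated with the following sketch (again Java; the Rule class and method names are hypothetical and not the software's own RuleNode class): rules are sorted so that those with the largest antecedent come first, and a record is assigned the class of the first rule whose antecedent it contains, with the empty-antecedent default rule guaranteeing a match.

import java.util.*;

// A sketch of "order by antecedent size, classify with the first matching
// rule" (illustration only; Rule is a hypothetical stand-in, not RuleNode).
public class FirstMatchClassifierSketch {

    // A classification rule: antecedent (set of attribute numbers) => class.
    static class Rule {
        final Set<Integer> antecedent;
        final int consequent;
        Rule(Set<Integer> antecedent, int consequent) {
            this.antecedent = antecedent;
            this.consequent = consequent;
        }
    }

    // Order rules by antecedent size, largest (most specific) first.
    static void orderBySpecificity(List<Rule> rules) {
        Collections.sort(rules, new Comparator<Rule>() {
            public int compare(Rule a, Rule b) {
                return b.antecedent.size() - a.antecedent.size();
            }
        });
    }

    // Classify a record using the first rule whose antecedent it contains;
    // the empty-antecedent default rule ensures that some rule always fires.
    static int classify(List<Rule> rules, Set<Integer> record) {
        for (Rule rule : rules) {
            if (record.containsAll(rule.antecedent)) return rule.consequent;
        }
        throw new IllegalStateException("classifier has no default rule");
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<Rule>();
        rules.add(new Rule(new HashSet<Integer>(), 6));                    // default rule
        rules.add(new Rule(new HashSet<Integer>(Arrays.asList(2, 3)), 6));
        rules.add(new Rule(new HashSet<Integer>(Arrays.asList(4, 5)), 7));
        orderBySpecificity(rules);
        // Record {1, 4, 5} matches the rule {4, 5} => 7, so 7 is printed.
        System.out.println(classify(rules, new HashSet<Integer>(Arrays.asList(1, 4, 5))));
    }
}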




2. DOWNLOADING THE SOFTWARE


The decision tree software comprises six source files. These are provided from this WWW page together with eight application classes. The source files are as follows:

  1. AssocRuleMining.java: Set of general ARM utility methods to allow: (i) data input and input error checking, (ii) data preprocessing, (iii) manipulation of records (e.g. operations such as subset, member, union etc.) and (iv) data and parameter output.
  2. DecisionTree.java: Generic methods to build decision trees and generate a classifier from the result.
  3. DecisionTreeNode.java: Class defining a decision tree node.
  4. DecTreeInfoGain.java: Generates a decision tree using information gain as the splitting criterion.
  5. DecTreeRandom.java: Generates a decision tree using most frequent attribute selection as the splitting criterion.
  6. RuleNode.java: Class for storing binary tree of CRs (Classification Rules).

The DecisionTreeNode and RuleNode classes are separate from the remaining classes, which are arranged in a class hierarchy of the form illustrated below.

                AssocRuleMining
                      |
                 DecisionTree
                      |
       +--------------+---------------+
       |                              |
DecTreeInfoGain                 DecTreeRandom

The decision tree application classes included here are as follows:

  1. DecTreeRandomApp.java: decision tree application class using frequency of attribute occurrence as the "splitting criterion" (so not really random!) and a 50:50 training-test set split.
  2. DecTreeInfoGainApp.java: decision tree application class using information gain as the "splitting criterion" and a 50:50 training-test set split.
  3. DecTreeRandomNoTestingApp.java: decision tree application class using frequency of attribute occurrence as the "splitting criterion" (so not really random!) and the entire input as the training set.
  4. DecTreeInfoGainNoTestingApp.java: decision tree application class using information gain as the "splitting criterion" and the entire input as the training set.
  5. DecTreeRandom_2file_App.java: decision tree application class using frequency of attribute occurrence as the "splitting criterion" and separate training and test set files.
  6. DecTreeInfoGain_2file_App.java: decision tree application class using information gain as the "splitting criterion" and separate training and test set files.
  7. DecTreeRandomApp10.java: decision tree application class using frequency of attribute occurrence and Ten-fold Cross Validation (TCV).
  8. DecTreeInfoGainApp10.java: decision tree application class using information gain and Ten-fold Cross Validation (TCV).

2.1. Compiling

The decision tree software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.5.0, which should therefore make it highly portable. The code does not require any special packages and thus can be compiled using the standard Java compiler:

javac *.java

2.2. Documentation

The code can be documented using Java Doc. First create a directory Documentation in which to place the resulting HTML pages and then type:

javadoc -d Documentation/ *.java

This will produce a hierarchy of WWW pages contained in the Documentation directory.




3. RUNNING THE SOFTWARE


Once compiled, the software can be invoked in the normal manner using the Java interpreter:

java APPLICATION_CLASS_FILE_NAME

If you are planning to process a very large data set it is a good idea to grab some extra memory. For example:

java -Xms600m -Xmx600m APPLICATION_CLASS_FILE_NAME

The input to the software, in all cases, is a (space separated) binary valued data set R. The set R comprises N records such that each record r, in turn, comprises a set of attributes. Thus:

R = {r | r ⊆ A}

where A is the set of available attributes. The value D is then defined as:

D = |A|

We then say that a particular data set has D columns and N rows. A small example data set might be as follows:

1 2 3 6
1 4 5 7
1 3 4 6
1 2 6
1 2 3 4 5 7

where, in this case, A = {1, 2, 3, 4, 5, 6, 7}. Note that attribute numbers are ordered sequentially commencing with the number 1 (the value 0 has a special meaning). The input file is specified using the -F flag.

The program assumes that the last attribute in each row is the class for that record (exactly one class per record). The software does not calculate how many classes there are; the user must inform the software using the -N flag.
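The following sketch (illustration only, not the AssocRuleMining input code itself) shows how a data file of the above form can be read: each line is one record of ascending attribute numbers, and the last number on each line is taken to be the record's class.

import java.io.*;
import java.util.*;

// A sketch of reading a space separated, binary valued data file of the kind
// described above (illustration only, not the LUCS-KDD input code).
public class DataFileReaderSketch {

    // Read the file into one int array per record (the class is the last element).
    static List<int[]> readDataFile(String fileName) throws IOException {
        List<int[]> records = new ArrayList<int[]>();
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.length() == 0) continue;          // skip blank lines
            String[] tokens = line.split("\\s+");
            int[] record = new int[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                record[i] = Integer.parseInt(tokens[i]);
            }
            records.add(record);
        }
        in.close();
        return records;
    }

    public static void main(String[] args) throws IOException {
        List<int[]> data = readDataFile(args[0]);       // file name as first argument
        System.out.println("Number of records = " + data.size());
        for (int[] record : data) {
            System.out.println("Class = " + record[record.length - 1]);
        }
    }
}

Run against the small example data set above, this would report 5 records with classes 6, 7, 6, 6 and 7 respectively.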

If required (by the DecTreeRandom_2file_App and DecTreeInfoGain_2file_App applications) the name of a separate test file can be specified using the -T flag.

Some example invocations, using a discretised/normalised Pima Indians data set (also available from this site) and the application classes provided by this WWW site, are given below:

java DecTreeRandomApp -FpimaIndians.D42.N768.C2.num -N2
java DecTreeInfoGainApp -N2 -FpimaIndians.D42.N768.C2.num
java DecTreeRandom_2file_App -FpimaIndians.D42.N568.C2.num -N2 -TpimaIndians.D42.N200.C2.num 
java DecTreeRandomApp10 -FpimaIndians.D42.N768.C2.num -N2

(note that the ordering of flags is not significant). The output from each application is a set of CRs plus some diagnostic information (run time, number of rules generated etc.).

Tables 1 and 2 give some sample output. Note that in these examples the information gain splitting criterion gives a better result than the "most frequent attribute" (random) criterion.

SETTINGS
--------
Training File name            = pimaIndians.D42.N768.C2.num
Number of classes             = 2

Reading input file: pimaIndians.D42.N768.C2.num
Number of records = 768
Number of columns = 42
Min support       = 153.6 (records)
Accuracy = 82.03125

Generation time = 0.03 seconds (0.0 mins)

Table 1: java DecTreeRandomApp -FpimaIndians.D42.N768.C2.num -N2

SETTINGS
--------
Training File name            = pimaIndians.D42.N768.C2.num
Number of classes             = 2

Reading input file: pimaIndians.D42.N768.C2.num
Number of records = 768
Number of columns = 42
Min support       = 153.6 (records)
Accuracy = 92.96875

Generation time = 0.02 seconds (0.0 mins)

Table 2: java DecTreeInfoGainApp -N2 -FpimaIndians.D42.N768.C2.num




4. OUTPUT


Some output is always generated, such as: (i) the input parameters and start settings, and (ii) "milestones" during processing. Additional output statements can be included in application classes. The available additional output options are as follows:

  1. outputDataArray(): Outputs the training set.
  2. outputTestDataArray(): Outputs the test set.
  3. outputDataArraySize(): Outputs the size of the input data (number of records and columns, number of elements and overall density).
  4. outputDuration(double time1, double time2): The elapsed time, in seconds, between time1 and time2. Typically used to measure processing time:
    double time1 = (double) System.currentTimeMillis();
    
    // Program statements
    
    <INSTANCE_NAME>.outputDuration(time1,
    				(double) System.currentTimeMillis());
    
  5. outputNumClasses(): Outputs number of classes.
  6. outputNumRules(): Outputs the number of rules in the generated classifier.
  7. outputDecTree(): Outputs a textual representation of the generated decision tree.
  8. outputRulesWithDefault(): Lists the generated classifier.
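A hypothetical sketch of an extended application class is given below, showing where the optional output methods listed above might be called. Only the output methods documented in this section are used; how the DecTreeInfoGain instance is constructed and how the tree is built are placeholders (assumptions), so consult the bundled application classes (e.g. DecTreeInfoGainApp.java) for the actual sequence of calls.

// Hypothetical application class sketch; constructor arguments and the
// tree-building step are assumptions, only the documented output methods
// listed above are used.
public class MyDecTreeApp {

    public static void main(String[] args) {
        double time1 = (double) System.currentTimeMillis();

        // ASSUMPTION: the command line arguments are passed to the constructor
        // as in the bundled application classes; check DecTreeInfoGainApp.java.
        DecTreeInfoGain newClassification = new DecTreeInfoGain(args);

        newClassification.outputDataArraySize();     // records, columns, density
        newClassification.outputNumClasses();        // number of classes (-N)

        // ... build the decision tree and extract the classifier here,
        //     exactly as done in the bundled application classes ...

        newClassification.outputDecTree();           // textual view of the tree
        newClassification.outputRulesWithDefault();  // the generated classifier
        newClassification.outputNumRules();          // number of rules

        newClassification.outputDuration(time1, (double) System.currentTimeMillis());
    }
}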



5. SOME EXAMPLE RESULTS


Below are some example test runs using ten data sets taken from the UCI machine learning repository (Blake and Merz, 1998) and discretised using the LUCS-KDD DN software (Coenen, 2003). The data sets may be obtained from:

http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html

The naming convention used is as follows: D = total number of attributes (including class attributes), N = number of records and C = number of classes. The results were obtained using a 50:50 training/test set split.

Data set                          Information Gain            Random
                                  Accuracy (%)  Run time (s)  Accuracy (%)  Run time (s)
mushroom.D90.N8124.C2.num         96.28         0.71          99.43         0.09
nursery.D32.N12960.C5.num         22.89         1.35          22.53         1.11
pageBlocks.D46.N5473.C5.num       92.54         0.14          92.51         0.07
penDigits.D89.N10992.C10.num      71.43         1.19          75.93         0.42
pimaIndians.D42.N768.C2.num       72.14         0.04          71.61         0.03
soybean-large.D118.N683.C19.num   47.21         0.09          46.63         0.02
ticTacToe.D29.N958.C2.num         58.87         0.04          75.99         0.03
waveform.D101.N5000.C3.num        73.04         0.72          96.80         0.21
wine.D68.N178.C3.num              57.30         0.04          66.29         0.03
zoo.D42.N101.C7.num               90.00         0.04          88.00         0.01



6. CONCLUSIONS


The decision tree algorithms described here have been used successfully by the LUCS-KDD research group to contrast and compare a variety of classification algorithms and techniques. The software is available free of charge; however, the author would appreciate appropriate acknowledgement. The following reference format for referring to the decision tree implementation available from this WWW site is suggested:

  1. Coenen, F. (2007), The LUCS-KDD Decision Tree Classifier Software, http://www.csc.liv.ac.uk/~frans/KDD/Software/DecisionTrees/decisionTree.html, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.




REFERENCES


  1. Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science.
  2. Coenen, F. (2003). LUCS-KDD DN Software, http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/, Department of Computer Science, The University of Liverpool, UK.

  3. Mitchell, T.M. (1997). Machine Learning. McGraw-Hill.
  4. Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.



Created and maintained by Frans Coenen. Last updated 9 January 2012