THE LUCS-KDD DATA DISCRETISATION/NORMALISATION (DN) JAVA SOFTWARE FOR CLASSIFICATION ASSOCIATION RULE MINING (VERSION 2)



Liverpool University

Frans Coenen

Department of Computer Science

The University of Liverpool

Friday 7 January 2005

Updated: Thursday 24 November 2011

IMPORTANT NOTE: This is an upgrade of the September 2003 version (Version 1), which is still available from this site. The distinctions between Version 2, described here, and Version 1 are as follows:

  1. Fewer attribute columns leading to enhanced processing efficiency.
  2. A better discretisation, which takes account of the relationship between numeric attribute values and class values, leading to greater data mining accuracy.

CONTENTS

1. Introduction.
1.1. Missing values.
2. Downloading the software.
3. Before you start.
4. Running the LUCS-KDD DN software.
4.1. Loading a schema file.
4.2. Loading a data file.
4.3. Entering a value for the desired number of sub-ranges.
4.4. Discretisation/normalisation.
4.5. Output.
5. How it works.
5.1. Numeric attributes.
5.2. Nominal attributes.
6. Comparison with version 1.
7. Literature review.
8. Conclusion.
9. References.



1. INTRODUCTION

The LUCS-KDD (Liverpool University Computer Science - Knowledge Discovery in Data) DN (Discretisation/ Normalisation) software has been developed to convert data files available in the UCI data repository ([1]) into a binary format suitable for use with Classification Association Rule Mining (CARM) applications. The software can, of course, equally well be used to convert data files obtained from other sources. We define discretisation and normalisation as follows:

Discretisation
The process of converting the range of possible values associated with a continuous data item (e.g. a double precision number) into a number of sub-ranges each identified by a unique integer label; and converting all the values associated with instances of this data item to the corresponding integer labels.
Normalisation
The process of converting values associated with nominal data items so that they correspond to unique integer labels.

CARM requires "binary valued" input data sets where each record represents an itemset which in turn is a subset of the available set of attributes. Thus:

N = The number of available attributes
A = A set of attributes = {a | 0 < a <= N}
D = A data set comprising M records (R)
R = {r | r subset A}

For example if A={1 2 3 4} then we might have a data set of the form:

1 2 3 4
1 2 3
1 2 4
1 3 4
2 3 4
1 2
1 3
1 4
2 3
2 4
3 4
1
2
3
4

which may be used as the input to an ARM software system. The distinction between the data sets used for ARM and those used for CARM is that in the latter case one of the attributes in each record represents a class to which the record is said to "belong". Usually, to facilitate identification, either the last or first attribute in each record in the input data represents the class. Thus we might have a CARM data set of the following form where the last item in each record represents the class (either 3 or 4):

1 2 4
1 2 3
1 4
2 4
1 3
2 3

(Note that records in a CARM input set must have at least one attribute as well as a class).

Generally speaking real data sets do not comprise only binary fields. Typically such data sets comprise a mixture of nominal, continuous and integer fields. Thus real data will require discretisation/normalisation, i.e. conversion into a binary valued format, before it can be used for CARM. How this is done using the LUCS-KDD DN software will depend on the nature of each field in the data:

Nominal Data (nominal):
A nominal data field has a value selected from an available list of values; for example we may have a nominal data item colour which can have a value taken from the set {blue, green, red, yellow}. Such data items can be normalised by allocating a unique column number to each possible value. For example in the above case we might assign the column numbers {1, 2, 3, 4} to the possible values. This was the approach taken by Version 1 of the LUCS-KDD-DN software; in Version 2, if two or more nominal values are generally associated with the same class (typically in 90% of all records), the columns are merged.
Continuous Data (double):
Continuous data fields take values that are real numbers within some range defined by minimum and maximum limits. In such cases we divide the given range into a number of sub-ranges and allocate a unique column number to each sub-range. For example we might have a data item average which takes a real number within the range 0.0 to 100.0. In Version 1 of the software this range of values might be divided into three sub-ranges (divisions): 0.0 <= n <= 33.3, 33.3 < n <= 66.6 and 66.6 < n <= 100.0 (where n is some instance of the field average), with each sub-range assigned a column number (say) {5, 6, 7} respectively. In Version 2, instead of dividing the range into equally sized divisions, the range is divided so as to take account of the probability that a particular value is associated with a particular class (see below for further detail).
Integer Data (int):
Integer data fields can be treated as either continuous data items or as nominal data items. For example if we have a field age which can take a value from 1 to 120 we might treat this as a continuous data item and represent it as a sequence of (say) 4 ranges: 0 <= n <= 30, 31 <= n <= 60, 61 <= n <= 90 and 91 <= n <= 120 (where n is some instance of the field age), and assign each range a unique column number (say) {8, 9, 10, 11} respectively. Alternatively we may have a field gender which can have the value 0 or 1 (i.e. a boolean data item), in which case this could be treated as a nominal data item with one column (say 12) allocated to the value 0 and another (say 13) to the value 1. Whether to treat an integer data item as a nominal or a continuous item depends on the number of attributes the user of a CARM programme is prepared to tolerate --- usually the time complexity of these programmes is (at least in part) proportional to the number of columns (attributes) in the input data. It would therefore not be a good idea to discretise the above age field in terms of 120 individual columns! In Version 2 of the software integer fields are treated in the same way as doubles.
Unused (unused):
Unused data fields generally describe fields such as "record number" or "TID". Data fields of "type" unused are entirely ignored by the LUCS-KDD DN software and thus play no part in the discretisation/normalisation process.

Thus given a (space separated) data set with the schema {colour (nominal), average (double), age (int), class (nominal)} such as:

red    25.6  56 1
green  33.3   1 1
green   2.5  23 0
blue   67.2 111 1
red    29.0  34 0
yellow 99.5  78 1
yellow 10.2  23 1
yellow  9.9  30 0
blue   67.0  47 0
red    41.8  99 1

This would be discretised/ normalised as follows:

3 5  9 13
2 5  8 13
2 5  8 12
1 7 11 13
3 5  9 12
4 7 10 13
4 5  8 13
4 5  8 12
1 7  9 12
3 6 11 13

The entire data set is now represented by 13 binary valued attributes and can be mined using appropriate CARM software.


1.1 Missing values

It is not unusual for data sets in the UCI repository to include missing values. The convention is to indicate these using a `?' character. This convention has also been adopted for the purpose of the LUCS-KDD DN software described here.

WARNING: Where a missing value occurs in a numeric field it is recorded internally as 100000001.0. In the unfortunate case where input data includes very large numbers (integers or doubles) these will be identified as missing values!
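
By way of illustration, the following minimal Java sketch shows how the '?' convention and the sentinel value might be handled when parsing a numeric token; the class and method names are illustrative assumptions, not taken from the actual LUCS_KDD_DN source:

// Illustrative sketch (not the actual source) of the missing value
// convention described above.
public class MissingValueSketch {

    // Sentinel used internally to record a missing numeric value.
    static final double MISSING_VALUE = 100000001.0;

    // Parse one numeric token, mapping the '?' convention to the sentinel.
    static double readNumber(String token) {
        return token.equals("?") ? MISSING_VALUE : Double.parseDouble(token);
    }

    public static void main(String[] args) {
        System.out.println(readNumber("33.6")); // 33.6
        System.out.println(readNumber("?"));    // 1.00000001E8
    }
}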




2. DOWNLOADING THE SOFTWARE

The LUCS-KDD DN software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, which should therefore make it highly portable. In the interest of "user friendliness" the user interface has been implemented as a GUI. The source code is available at:

http://csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/LUCS_KDD_DN.java

The code does not require any special packages and thus can be compiled using the standard Java compiler:

javac LUCS_KDD_DN.java



3. BEFORE YOU START

Before the LUCS-KDD DN software can convert a data file into the desired binary valued format it needs to know the schema for the data to be converted. The schemas for the data files in the UCI repository are (usually) described in free text format within a .names file. Users of the LUCS-KDD DN software will thus have to create their own schema files before any conversion can take place.

LUCS-KDD DN Schema files comprise three lines of text each containing a sequence of N literals separated by white space (not carriage returns) where N is the number of attributes in the data set to be converted:

  1. The first line is used to describe the "type" of each field; permitted options are: unused, nominal, int and double.
  2. The second line gives the names of the fields; this is not used in the discretisation/normalisation itself, but may be useful when the time comes to match the column numbers contained in ARs and CARs to the output schema.
  3. The third line contains the possible "legal" values for each nominal data item; in the case of "unuseds", "doubles" and "ints" the literal null is used. Nominal values are separated by a '/' character.

The schema file for the above example data set would be:

nominal double int nominal
colour average age class
blue/green/red/yellow null null 0/1

The schema file for the Pima Indians UCI data set might be:

int int int int int double double int nominal
NumberPregnacies PlasmaGluConcent DiastolicBldPress TricepsSkinFold 2-HourSerumIns BodyMassIndex DiabPedFunc Age Class
null null null null null null null null 0/1

(Remember that data fields of "type" unused are ignored by the LUCS-KDD DN software and do not appear in the output.)
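
For illustration, the following Java sketch shows one way the three-line schema format described above could be read; the class and field names are assumptions and the actual LUCS_KDD_DN source will differ in detail:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Illustrative reader for the three-line schema format described above.
public class SchemaReaderSketch {
    String[] types;     // line 1: unused, nominal, int or double, per field
    String[] names;     // line 2: field names
    String[][] values;  // line 3: legal values for nominals, null otherwise

    void load(String fileName) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            types = in.readLine().trim().split("\\s+");
            names = in.readLine().trim().split("\\s+");
            String[] tokens = in.readLine().trim().split("\\s+");
            values = new String[tokens.length][];
            for (int i = 0; i < tokens.length; i++) {
                // "null" marks unused, int and double fields;
                // nominal values are '/' separated.
                values[i] = tokens[i].equals("null") ? null : tokens[i].split("/");
            }
        }
    }
}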




4. RUNNING THE LUCS-KDD DN SOFTWARE

When compiled the software can be invoked in the normal manner using the Java interpreter:

java LUCS_KDD_DN

If you are planning to process a very large data set, such as covertype, it might be an idea to grab some extra memory. For example:

java -Xms600m -Xmx600m LUCS_KDD_DN
 

Once invoked you will see an interface of the form presented in Figure 1. Note that in Figure 1 all but one of the interface buttons are "greyed out"; this is because, at this stage, we have no data to work with. The one functional button is:

  1. Input Schema: Selecting this button will bring up a second GUI to allow the user to load the identified schema file.

Note also, in Figure 1, that some instructions are presented in the main window of the GUI.

Figure 1: LUCS-KDD DN interface on start up.


4.1 Loading a schema file

A schema file is loaded by selecting the "Input Schema" button. Once the schema has been "read" three more buttons become active:

  1. Input Data (' ' separated): Raw data files in the UCI repository are either space or comma separated. This button allows the user to select and load a space separated data file to be discretised/normalised.
  2. Input Data (',' separated): Button to allow the user to select and load a comma separated data file.
  3. List Schema: Button to allow the user to list the contents of the current schema file. In the case of the Pima Indians data set the schema will be presented as follows:
 
    (1) int: NumberPregnacies
    (2) int: PlasmaGluConcent
    (3) int: DiastolicBldPress
    (4) int: TricepsSkinFold
    (5) int: 2-HourSerumIns
    (6) double: BodyMassIndex
    (7) double: DiabPedFunc
    (8) int: Age
    (9) nominal: Class { 0 1 }
    

Figure 2 shows the LUCS-KDD DN interface once a schema file (the example for the Pima Indians data set given above) has been loaded and listed using the "Input Schema" and "List Schema" buttons respectively.

Figure 2: LUCS-KDD DN interface with schema file loaded.


4.2 Loading a data file

A data file can be loaded using either the "Input Data (' ' separated)" or "Input Data (',' separated)" buttons. Some example raw data (the first few lines from the Pima Indians comma separated data set) is given below:

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
......

Figure 3 shows the LUCS-KDD DN interface once a data set has been loaded (the Pima Indians data set in this case). Note that seven more buttons have now become active:

  1. Input Num. Divs: To normalise a continuous attribute the LUCS-KDD discretisation software divides the values for the attribute into a number of sub-ranges. Using the "Input Num. Divs" button the user can specify the desired number of sub-ranges. It is usually a good idea to select a low value; this will have the effect of producing fewer columns in the output data which in turn will provide computational efficiency benefits. It is usually not a good idea, however, to enter a value which is less than the number of classes to be represented by the input data, unless the number of classes is relatively high (>5). Values of 3, 4 or 5 have all been successfully used.
  2. List Input Data: This button allows the user to list the input data; note that the data will be listed using the LUCS-KDD DN internal representation where nominal values are represented as index values. Some sample output for the Pima Indians set is given below:
6.0 148.0 72.0 35.0 0.0 33.6 0.627 50.0 1.0
1.0 85.0 66.0 29.0 0.0 26.6 0.351 31.0 0.0
8.0 183.0 64.0 0.0 0.0 23.3 0.672 32.0 1.0
1.0 89.0 66.0 23.0 94.0 28.1 0.167 21.0 0.0
0.0 137.0 40.0 35.0 168.0 43.1 2.288 33.0 1.0
5.0 116.0 74.0 0.0 0.0 25.6 0.201 30.0 0.0
3.0 78.0 50.0 32.0 88.0 31.0 0.248 26.0 1.0
10.0 115.0 0.0 0.0 0.0 35.3 0.134 29.0 0.0
2.0 197.0 70.0 45.0 543.0 30.5 0.158 53.0 1.0
8.0 125.0 96.0 0.0 0.0 0.0 0.232 54.0 1.0
4.0 110.0 92.0 0.0 0.0 37.6 0.191 30.0 0.0
10.0 168.0 74.0 0.0 0.0 38.0 0.537 34.0 1.0
10.0 139.0 80.0 0.0 0.0 27.1 1.441 57.0 0.0
1.0 189.0 60.0 23.0 846.0 30.1 0.398 59.0 1.0
5.0 166.0 72.0 19.0 175.0 25.8 0.587 51.0 1.0
....
  3. List Min Max Data: Button to list the minimum and maximum values for the current data set. For the Pima Indians data set this will be as follows:

    Range col. 1: 0.0 to 17.0 (range of 17.0)
    Range col. 2: 0.0 to 199.0 (range of 199.0)
    Range col. 3: 0.0 to 122.0 (range of 122.0)
    Range col. 4: 0.0 to 99.0 (range of 99.0)
    Range col. 5: 0.0 to 846.0 (range of 846.0)
    Range col. 6: 0.0 to 67.1 (range of 67.1)
    Range col. 7: 0.078 to 2.42 (range of 2.342)
    Range col. 8: 21.0 to 81.0 (range of 60.0)
    Range col. 9: 0.0 to 1.0 (range of 1.0)
    
  4. Move N to End: Most CARM algorithms require the class for each record to be the last value listed (the alternative is to require the class to be the first value listed). Not all of the UCI, and other, raw data sets do this (in some cases there are a number of possible fields that may be used as the class). This button allows the user to specify a column to be moved to the end of each record (a sketch of this operation is given after this list).
  5. Move N to before M: Similarly it is sometimes required to move a column to a position in the data set other than the end. Typically this might be required where a CARM algorithm requires the class to be listed at the start of each record.
  6. Remove Col N: Some of the UCI raw data sets (and other such sets) include redundant columns. For example it is not uncommon for data sets to include an identifying incremental number for each record. This button allows the user to remove a column from the input data.
  7. Randomise Rows: Some data sets are ordered in some way, sometimes on the class variable, sometimes according to a particular attribute or attributes. The effect of this will be that any CARM software applied to the data will produce a "skewed" result. This button causes the input data to be randomised (note that randomisation is not the same as distribution --- see Section 4.5 below).
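
By way of illustration, the following minimal Java sketch shows one way the "Move N to End" operation could be implemented over the internal representation (records held as arrays of doubles, as in the listing above); the method name and representation are assumptions, not taken from the actual LUCS_KDD_DN source:

// Illustrative sketch (not the actual source) of "Move N to End".
public class MoveColumnSketch {

    // Shift column n (1-based, as in the GUI) to the last position in
    // every record; the remaining columns close up to fill the gap.
    static void moveToEnd(double[][] data, int n) {
        for (double[] record : data) {
            double tmp = record[n - 1];
            System.arraycopy(record, n, record, n - 1, record.length - n);
            record[record.length - 1] = tmp;
        }
    }
}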

Figure 3: LUCS-KDD DN interface with schema and data files loaded.


4.3 Entering a value for the desired number of sub-ranges

The desired number of divisions is entered using the "Input Num. Divs" button. With respect to the Pima Indians data set the minimum number of ranges setting could be 2 as there are two classes --- "NoSignsOfDiabetes" and "SignsOfDiabetes". However, experiments have shown that even with two classes it is a good idea to have at least 3 divisions. An even better classifier will usually be produced using a greater number of divisions; a value of 5 has been found to give a good trade-off between accuracy and computational efficiency with respect to the mining algorithm used.

Remember that the higher the number of divisions the more output attributes will be generated during the discretisation/normalisation process, which in turn will have an effect on the computational efficiency of any CARM algorithm that may be applied to the data.

Where an integer data field has a range less than the user specified number of sub-ranges the LUCS-KDD DN software will assign a number of columns to the integer field equivalent to the size of the range. For example if we have an integer field which can take values in the range of 99 to 102 inclusive (i.e. a range size of 4) and the number of divisions value is 5 then 4 columns would be assigned to the integer field. If, however, the number of sub-ranges value was set to 3, the 99 to 102 range would be divided into three divisions, each with its own unique column number assigned to it (how this is done is described below).
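
Expressed as code, the rule amounts to taking the smaller of the range size and the requested number of divisions. The following fragment is an illustrative sketch, not taken from the actual source:

// Illustrative sketch of the column allocation rule described above.
public class IntFieldColumnsSketch {

    // Number of columns allocated to an int field whose values span
    // [min..max] inclusive, given the user-selected number of divisions.
    static int columnsForIntField(int min, int max, int numDivisions) {
        int rangeSize = max - min + 1;            // e.g. 99..102 gives 4
        return Math.min(rangeSize, numDivisions); // never more than requested
    }

    public static void main(String[] args) {
        System.out.println(columnsForIntField(99, 102, 5)); // 4
        System.out.println(columnsForIntField(99, 102, 3)); // 3
    }
}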


4.4 Discretisation/normalisation

Once a value for the number of divisions has been successfully entered the "Normalisation" button becomes available:

  1. Normalisation: Button to cause discretisation/normalisation to be carried out.

To discretise/normalise the data set the user simply selects the "Normalisation" button. Note that for some data sets this process may take a few seconds. The result in the case of the Pima Indians data set will be of the following form:

 
1 6 9 14 19 23 27 33 38
1 5 9 14 19 23 27 32 37
1 8 9 14 19 23 27 32 38
1 5 9 14 19 23 27 32 37
1 5 9 14 19 23 31 32 38
1 5 9 14 19 23 27 32 37
1 5 9 14 19 23 27 32 38
3 5 9 14 19 23 27 32 37
1 8 9 17 22 23 27 34 38
1 5 10 14 19 23 27 35 38
1 5 9 14 19 23 27 32 37
3 8 9 14 19 23 27 32 38
3 5 9 14 19 23 31 36 37
1 8 9 14 22 23 27 36 38
1 8 9 14 19 23 27 33 38
............

4.5 Output

Once normalisation is complete a further five (output) buttons become active (Figure 4) as follows:

  1. Distribute Classes: So as to improve the accuracy of CARM algorithms it is usually beneficial to ensure that each class is distributed evenly across the entire data set --- this is achieved using the "Distribute Classes" button (a sketch of one way such a distribution might be implemented is given after this list).
  2. List Parameters: The "List Parameters" button outputs statistics relating to the processed data set. In the case of the Pima Indians set this will be:
    Number of records           = 768
    Number of cols (input data) = 9
    Number of cols (schema)     = 9
    Num. missing values         = 0
    Max num. divisions          = 5
    Number of attributes        = 38
    Density (%)                 = 23.68
    Number of classes           = 2
    Number of records per class:
        Class 38 = 268 records (34.9%)
        Class 37 = 500 records (65.1%)
    
  3. List Attribute Labels: This button produces a set of labels, one for each column, for the discretised/normalised data which may be used when explaining the results of CARM exercises. Some example output in the context of the Pima Indians set is given below:
    (1) NumberPregnacies <= 9
    (2) 9 < NumberPregnacies <= 10
    (3) 10 < NumberPregnacies <= 11
    (4) 11 < NumberPregnacies
    (5) PlasmaGluConcent <= 142
    (6) 142 < PlasmaGluConcent <= 150
    (7) 150 < PlasmaGluConcent <= 152
    (8) 152 < PlasmaGluConcent
    (9) DiastolicBldPress <= 94
    (10) 94 < DiastolicBldPress <= 103
    (11) 103 < DiastolicBldPress <= 107
    (12) 107 < DiastolicBldPress <= 117
    (13) 117 < DiastolicBldPress
    (14) TricepsSkinFold <= 40
    (15) 40 < TricepsSkinFold <= 41
    (16) 41 < TricepsSkinFold <= 42
    (17) 42 < TricepsSkinFold <= 48
    (18) 48 < TricepsSkinFold
    (19) 2-HourSerumIns <= 341
    (20) 341 < 2-HourSerumIns <= 376
    (21) 376 < 2-HourSerumIns <= 444
    (22) 444 < 2-HourSerumIns
    (23) BodyMassIndex <= 44.733
    (24) 44.733 < BodyMassIndex <= 45.411
    (25) 45.411 < BodyMassIndex <= 46.767
    (26) 46.767 < BodyMassIndex
    (27) DiabPedFunc <= 0.93
    (28) 0.93 < DiabPedFunc <= 0.953
    (29) 0.953 < DiabPedFunc <= 1.001
    (30) 1.001 < DiabPedFunc <= 1.379
    (31) 1.379 < DiabPedFunc
    (32) Age <= 47
    (33) 47 < Age <= 53
    (34) 53 < Age <= 54
    (35) 54 < Age <= 55
    (36) 55 < Age
    (37) Class=0
    (38) Class=1
    
  4. List Output Data: The "List Output Data" button is used to output the results of discretisation/normalisation to the screen thus allowing the user to "eye ball" the data before saving it to file. The first few lines of the discretised, normalised and distributed Pima Indians set are given below. Note that in the case of the Pima Indians set the ratio of "NoSignsOfDiabetes" (37) to "SignsOfDiabetes" (38) records is roughly 2:1, therefore (as a result of distribution) every third record is a "SignsOfDiabetes" record.
    1 6 9 14 19 23 27 33 38
    1 5 9 14 19 23 27 32 37
    1 8 9 14 19 23 27 32 38
    1 5 9 14 19 23 27 32 37
    1 5 9 14 19 23 31 32 38
    1 5 9 14 19 23 27 32 37
    1 5 9 14 19 23 27 32 38
    3 5 9 14 19 23 27 32 37
    1 8 9 17 22 23 27 34 38
    1 5 10 14 19 23 27 35 38
    1 5 9 14 19 23 27 32 37
    3 8 9 14 19 23 27 32 38
    3 5 9 14 19 23 31 36 37
    1 8 9 14 22 23 27 36 38
    1 8 9 14 19 23 27 33 38
    ............
    
  5. Save Normalisation: Finally the "Save Normalisation" button allows the user to specify a file name and directory in which to store the discretised/normalised (and possibly distributed) data set. It is a good idea to include information concerning: (i) the number of columns (attributes), (ii) the number of records and (iii) the number of classes in the output file name. For example the Pima Indians output data set might be stored in the file pimaIndians.D38.N768.C2.num (where D is the number of attributes, N the number of records and C the number of classes).
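
With respect to the "Distribute Classes" button (item 1 above), the following Java sketch illustrates one way an even class distribution could be achieved: records are grouped by class and written back so that, at every point in the output, each class stays close to its proportional share. The class and method names are assumptions; the actual LUCS_KDD_DN source may operate differently:

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the actual source) of even class distribution.
public class DistributeClassesSketch {

    // Interleave records (int[] itemsets) grouped per class so that each
    // class is always close to its overall proportion; e.g. a 2:1 ratio
    // yields roughly every third record belonging to the smaller class.
    static List<int[]> distribute(List<List<int[]>> perClass) {
        int total = 0;
        for (List<int[]> c : perClass) total += c.size();
        List<int[]> result = new ArrayList<int[]>(total);
        int[] taken = new int[perClass.size()];
        for (int slot = 0; slot < total; slot++) {
            int best = -1;
            double bestLag = Double.NEGATIVE_INFINITY;
            for (int c = 0; c < perClass.size(); c++) {
                if (taken[c] == perClass.get(c).size()) continue;
                // How far class c is behind its proportional share.
                double lag = (double) perClass.get(c).size() * (slot + 1)
                             / total - taken[c];
                if (lag > bestLag) { bestLag = lag; best = c; }
            }
            result.add(perClass.get(best).get(taken[best]++));
        }
        return result;
    }
}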

Figure 4: LUCS-KDD DN interface after discretisation/normalisation.




5. HOW IT WORKS

The discretisation/normalisation mechanism operates according to whether the attribute column under consideration is numeric or nominal.

5.1 Numeric attributes

When processing numeric attributes the procedure is as follows:

  1. Divide the range of the attribute into N discrete sub-ranges. Where the attribute in question is of type int and the range is less than 100, N is equivalent to the range. In all other cases N is equivalent to 100 (i.e. many more sub-ranges than we would ultimately want).
  2. For each sub-range: (a) count the number of records that fall into the sub-range, and (b) record the distribution of these records with respect to the available classes.
  3. For each sub-range identify the dominant class. Where there is no dominant class, because no records fall into a particular sub-range or because two or more class counts are equal, determine the dominant class with reference to the nearest neighbouring sub-range.
  4. Combine sub-ranges with identical dominant classes to form a set of divisions.
  5. If the number of divisions is less than or equal to the maximum desired number of divisions, stop. Otherwise merge divisions until the maximum is reached. A pair of divisions is selected for merging as follows:
    1. For each possible merge calculate the combined probability for the resulting dominant class.
    2. Select the merge with the highest probability.

Note that as a result of a merge the resulting dominant class may be the dominant class originally associated with one of the divisions selected for the merge or it may be a different class.

Once a merge is complete it may be that the dominant class associated with the resulting division is identical to that associated with either of the divisions immediately before or after the merged pair, in which case the new division can automatically be extended. Consequently the reduction in the number of divisions obtained by a single merge may be 1, 2 or 3.
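
For concreteness, the following Java sketch outlines steps 4 and 5 of the above procedure: combining equal-dominant-class neighbours, then greedily merging the pair whose combined dominant class has the highest probability. It is an illustrative sketch under the stated assumptions, not the actual LUCS_KDD_DN source, and the nearest-neighbour rule of step 3 is omitted for brevity:

import java.util.List;

// Illustrative sketch (not the actual source) of steps 4 and 5 above.
public class NumericDiscretiserSketch {

    // A division: a value range plus per-class record counts.
    static class Division {
        double lo, hi;
        int[] counts;                       // records per class in [lo, hi]
        Division(double lo, double hi, int numClasses) {
            this.lo = lo; this.hi = hi; counts = new int[numClasses];
        }
        int dominant() {                    // index of the dominant class
            int best = 0;
            for (int c = 1; c < counts.length; c++)
                if (counts[c] > counts[best]) best = c;
            return best;
        }
        double dominantProb() {             // P(dominant class | division)
            int total = 0;
            for (int n : counts) total += n;
            return total == 0 ? 0.0 : (double) counts[dominant()] / total;
        }
    }

    // Merge the division at index i with its right neighbour.
    static void merge(List<Division> divs, int i) {
        Division a = divs.get(i), b = divs.remove(i + 1);
        a.hi = b.hi;
        for (int c = 0; c < a.counts.length; c++) a.counts[c] += b.counts[c];
    }

    // Step 4: combine adjacent divisions with identical dominant classes.
    static void combineEqualNeighbours(List<Division> divs) {
        for (int i = 0; i < divs.size() - 1; )
            if (divs.get(i).dominant() == divs.get(i + 1).dominant())
                merge(divs, i);
            else i++;
    }

    // Step 5: greedily merge until at most maxDivs divisions remain.
    static void reduce(List<Division> divs, int maxDivs) {
        combineEqualNeighbours(divs);
        while (divs.size() > maxDivs) {
            int best = 0; double bestProb = -1.0;
            for (int i = 0; i < divs.size() - 1; i++) {
                // Probability of the dominant class if i and i+1 merged.
                Division trial = new Division(divs.get(i).lo,
                        divs.get(i + 1).hi, divs.get(i).counts.length);
                for (int c = 0; c < trial.counts.length; c++)
                    trial.counts[c] = divs.get(i).counts[c]
                                    + divs.get(i + 1).counts[c];
                if (trial.dominantProb() > bestProb) {
                    bestProb = trial.dominantProb(); best = i;
                }
            }
            merge(divs, best);
            combineEqualNeighbours(divs);  // a merge may enable step 4 again
        }
    }
}

Note how, after each greedy merge, equal-dominant-class neighbours are recombined; this is why a single merge may reduce the number of divisions by 1, 2 or 3, as described above.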

5.2 Nominal attributes

In the case of nominal attributes the normalisation process is slightly different, as follows:

  1. Divide the range of the attribute into N discrete divisions, where N is equivalent to the number of possible discrete values that the attribute can take.
  2. For each division: (a) count the number of records that fall into the division, and (b) record the distribution of these records with respect to the available classes.
  3. For each division identify the dominant class where such a class exists. (Note that where a division has no records associated with it the division is removed.)
  4. If any pair of divisions share a dominant class with a joint probability in excess of 90%, merge the divisions.
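
The merging rule for nominal attributes (step 4 above) might be sketched in Java as follows; each division is represented simply as an array of per-class record counts, and the 0.9 threshold corresponds to the 90% figure quoted in the text. This is an illustration, not the actual source:

import java.util.List;

// Illustrative sketch (not the actual source) of the nominal merging
// rule in step 4 above. Each division holds the per-class record counts
// for one (or, after merging, several) nominal values.
public class NominalMergeSketch {

    static final double MERGE_THRESHOLD = 0.9;  // the 90% figure above

    static int dominant(int[] counts) {
        int best = 0;
        for (int c = 1; c < counts.length; c++)
            if (counts[c] > counts[best]) best = c;
        return best;
    }

    // Merge any pair of divisions that share a dominant class whose
    // combined (joint) probability exceeds the threshold.
    static void mergeNominals(List<int[]> divisions) {
        for (int i = 0; i < divisions.size(); i++) {
            for (int j = i + 1; j < divisions.size(); j++) {
                int[] a = divisions.get(i), b = divisions.get(j);
                int dom = dominant(a);
                if (dom != dominant(b)) continue;
                int domCount = a[dom] + b[dom], total = 0;
                for (int c = 0; c < a.length; c++) total += a[c] + b[c];
                if (total > 0 && (double) domCount / total > MERGE_THRESHOLD) {
                    for (int c = 0; c < a.length; c++) a[c] += b[c];
                    divisions.remove(j--);  // a now covers both values
                }
            }
        }
    }
}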



6. COMPARISON WITH VERSION 1

Version 2 of the LUCS-KDD-DN software offers the following advantages over Version 1:

  1. In most cases a reduced number of attributes in the output (binary valued) data set leading to more computationally efficient processing.
  2. In most cases a reduction in the overall size of the output data set to further enhance processing efficiency.
  3. A better discretisation leading to greater classification accuracy.

These advantages are evidenced in the following tables. Table 1 compares the size (in bytes) and the number of output attributes produced using Version 2 and Version 1 of the software. Table 2 compares the classification accuracy, obtained using TCV (Ten Cross Validation), for both versions of the software.

NAME                             Size (Bytes) V2  Size (Bytes) V1  # Attribs. V2  # Attribs. V1
adult.D97.N48842.C2.num                  2034781          2140596             97            131
anneal.D73.N898.C6.num                     36705            34985             73            106
auto.D137.N205.C7.num                      17150            17277            137            142
breast.D20.N699.C2.num                     18126            19597             20             47
chessKRvK.D58.N28056.C18.num              529596           548454             58             66
connect4.D129.N67557.C3.num              9187752          9187752            129            129
cylBands.D124.N540.C2.num                  56675                -            124              -
flare.D39.N1389.C9.num                     43153                -             39              -
glass.D48.N214.C7.num                       5946             5993             48             52
heart.D52.N303.C5.num                      11943            12029             52             53
hepatitis.D56.N155.C2.num                   8335             8335             56             58
horseColic.D85.N368.C2.num                 18303            18515             85             94
ionosphere.D157.N351.C2.num                40300            41469            157            172
iris.D19.N150.C3.num                        2017             1954             19             23
led7.D24.N3200.C10.num                     62154            62154             24             24
letRecog.D106.N20000.C26.num              992990           986452            106            106
mushroom.D90.N8124.C2.num                 539424           575924             90            127
nursery.D32.N12960.C5.num                 320760           320760             32             32
pageBlocks.D46.N5473.C5.num               169663           169814             46             55
penDigits.D89.N10992.C10.num              543038           546304             89             90
pima.D38.N768.C2.num                       18462            19302             38             42
soybean-large.D118.N683.C19.num            66107                -            118              -
ticTacToe.D29.N958.C2.num                  25866            25866             29             29
waveform.D101.N5000.C3.num                324562           330164            101            108
wine.D68.N178.C3.num                        7146             7126             68            178
zoo.D42.N101.C7.num                         4670             4670             42             43

Table 1: Comparison of size and number of attributes in the output datasets produced using both versions of the LUCS-KDD-DN software.

NAME                             Accuracy V1  Accuracy V2  Time V1  Time V2
adult.D97.N48842.C2.num                 76.1         80.8     80.0     48.1
anneal.D73.N898.C6.num                  84.6         88.3     12.5     10.9
auto.D137.N205.C7.num                   50.8         70.6     23.4     27.0
breast.D20.N699.C2.num                  95.9         90.0      1.0      1.1
chessKRvK.D58.N28056.C18.num             NRG          NRG      NRG      NRG
connect4.D129.N67557.C3.num             65.9         65.8    206.4    265.1
cylBands.D124.N540.C2.num                  -         68.3        -     87.4
flare.D39.N1389.C9.num                     -         84.3        -      5.2
glass.D48.N214.C7.num                   43.7         64.1      2.5      3.2
heart.D52.N303.C5.num                   57.1         51.4     11.8     10.3
hepatitis.D56.N155.C2.num               77.3         81.2     15.7     14.5
horseColic.D85.N368.C2.num              78.7         79.1      6.4      9.0
ionosphere.D157.N351.C2.num             83.8         86.1      8.4      7.3
iris.D19.N150.C3.num                    94.7         95.3      0.1      1.0
led7.D24.N3200.C10.num                  69.0         68.7      0.7      1.6
letRecog.D106.N20000.C26.num            43.0         27.6    115.6    106.3
mushroom.D90.N8124.C2.num               96.1         99.0     31.6     38.1
nursery.D32.N12960.C5.num               77.9         77.8      5.4      5.6
pageBlocks.D46.N5473.C5.num             89.8         90.0      2.4      2.3
penDigits.D89.N10992.C10.num            79.7         81.7     56.8     91.2
pima.D38.N768.C2.num                    74.9         74.4      1.3      1.1
soybean-large.D118.N683.C19.num            -         87.3        -      8.6
ticTacToe.D29.N958.C2.num               63.6         67.1      3.1      3.1
waveform.D101.N5000.C3.num              71.7         66.7     89.7     77.7
wine.D68.N178.C3.num                    86.3         72.1      8.7      9.6
zoo.D42.N101.C7.num                     93.0         93.0      6.0      7.1

Table 2: Comparison of classification accuracy, obtained using the LUCS-KDD TFPC classifier and TCV, for datasets generated using both versions of the LUCS-KDD-DN software. (Support threshold = 1%, Confidence threshold = 50%, NRG = No Rules Generated.)




7. LITERATURE REVIEW

Class dependent discretisation techniques can be categorised in various ways: (i) supervised v. unsupervised, (ii) bottom-up v. top-down, (iii) direct v. indirect. Supervised methods take into consideration the class labels while unsupervised methods do not (it can thus be argued that the latter are not class dependent discretisation techniques). Bottom-up methods allocate each attribute value to a range (interval) and then merge ranges until no further merging can take place (according to some criterion); top-down methods start with a single range and iteratively split this range according to some class related statistical mechanism until some stopping criterion is met. Direct methods pre-specify a maximum number of ranges while indirect methods (also called incremental methods) do not. The LUCS-KDD-DN software described on this page is a supervised, bottom-up, direct method. Examples of supervised, bottom-up, incremental methods include (Kerber, 1992) and (Liu and Setiono, 1997). Examples of supervised, top-down, incremental methods include (Kurgan and Cios, 2004) and (Tsai et al., 2008). For further work on class dependent discretisation see (Bryson and Joseph, 2001; Ching et al., 1995; Das and Vyas, 2010; Dougherty et al., 1995; Hua and Zhao, 2009; Liu et al., 2002; Liu and Setiono, 1997).




8. CONCLUSION

The LUCS-KDD DN software has been used successfully within the LUCS-KDD group for some time. The software is available free of charge; however, the author would appreciate appropriate acknowledgement. The following reference format for referring to the software is suggested:

Coenen, F. (2003), LUCS-KDD DN Software (Version 2), http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/, Department of Computer Science, The University of Liverpool, UK.

 

Should you discover any "bugs" or other problems within the software (or this documentation), or have suggestions for additional features, please do not hesitate to contact the author.

Some additional notes on processing a number of example data sets from the UCI repository, together with example schema files are available at:

http://csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/exmpleDNnotes.html

A number of example discretised/normalised data sets, taken from the UCI library, are available at:

http://csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html




9. REFERENCES

  1. Blake, C.L. and Merz, C.J. (1998). UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science.
  2. Bryson, N. and Joseph, A. (2001). Optimal Techniques for Class Dependent Attribute Discretization. The Journal of the Operational Research Society, 52(10), pp1130-1143.
  3. Ching, J.Y., Wong, A.K.C. and Chan, K.C.C. (1995). Class-dependent discretization for inductive learning from continuous and mixed-mode data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7), pp641-651.
  4. Das, K. and Vyas, O.P. (2010). A Suitability Study of Discretization Methods for Associative Classifiers. International Journal of Computer Applications, 5(10), pp46-51.
  5. Dougherty, J., Kohavi, R. and Sahami, M. (1995). Supervised and Unsupervised Discretization of Continuous Features. Proc. Int. Conf. on Machine Learning (ICML), pp194-202.
  6. Hua, H. and Zhao, H. (2009). A Discretization Algorithm of Continuous Attributes Based on Supervised Clustering. Proc. Chinese Conference on Pattern Recognition (CCPR'09), pp1-5.
  7. Kerber, R. (1992). ChiMerge: Discretization of Numeric Attributes. Proc. Conf. on Artificial Intelligence (AAAI-92), pp123-128.
  8. Kurgan, L.A. and Cios, K.J. (2004). CAIM Discretization Algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(2), pp145-153.
  9. Liu, H., Hussain, F., Tan, C.L. and Dash, M. (2002). Discretization: An Enabling Technique. Data Mining and Knowledge Discovery, 6(4), pp393-423.
  10. Liu, H. and Setiono, R. (1997). Feature selection via discretization. IEEE Transactions on Knowledge and Data Engineering, 9(4), pp642-645.
  11. Tsai, C-J., Lee, C-I. and Yang, W-P. (2008). A discretization algorithm based on Class-Attribute Contingency Coefficient. Information Sciences, 178, pp714-731.



Created and maintained by Frans Coenen. Last updated 24 November 2011