THE LUCS-KDD ARM DATA DISCRETISATION/NORMALISATION (DN) JAVA SOFTWARE



Frans Coenen

Department of Computer Science

The University of Liverpool

Wednesday 17 September 2003

CONTENTS

1. Introduction.
1.1. Missing values.
2. Downloading the software.
3. Before you start.
4. Running the LUCS-KDD DN software.
 
4.1. Loading a schema file.
4.2. Loading a data file.
4.3. Entering a value for the desired number of sub-ranges.
4.4. Discretisation/normalisation.
4.5. Output.
5. Conclusions.
6. References.



1. INTRODUCTION

The LUCS-KDD (Liverpool University Computer Science - Knowledge Discovery in Data) ARM (Association Rule Mining) DN (Discretisation/Normalisation) software has been developed to convert data files available in the UCI data repository ([1]) into a binary format suitable for use with Association Rule Mining (ARM) applications. The software can, of course, equally well be used to convert data files obtained from other sources. We define discretisation and normalisation as follows:

Discretisation
The process of converting the range of possible values associated with a continuous data item (e.g. a double precision number) into a number of sub-ranges each identified by a unique integer label; and converting all the values associated with instances of this data item to the corresponding integer labels.
Normalisation
The process of converting values associated with nominal data items so that they correspond to unique integer labels.

ARM requires input data sets where each record represents an itemset which in turn is a subset of the available set of attributes. Thus:

N = The number of available attributes
A = A set of attributes = {a | 0 < a <= N}
D = A data set comprising M records (R)
R = {r | r subset A}

For example if A={1 2 3 4} then we might have a data set of the form:

1 2 3 4
1 2 3
1 2 4
1 3 4
2 3 4
1 2
1 3
1 4
2 3
2 4
3 4
1
2
3
4

which may be used as the input to an ARM software system.

Generally speaking, real data sets do not comprise only binary fields. Typically such data sets comprise a mixture of nominal, continuous and integer fields. Thus real data will require discretisation/normalisation, i.e. conversion into a binary valued format, before it can be used for ARM. How this is done using the LUCS-KDD DN software will depend on the nature of each field in the data:

Nominal Data (nominal):
A nominal data field has a value selected from an available list of values; for example, we may have a nominal data item colour which can take a value from the set {blue, green, red, yellow}. Such data items can be normalised by allocating a unique column number to each possible value. For example, in the above case we might assign the column numbers {1, 2, 3, 4} to the possible values.
Continuous Data (double):
Continuous data fields take values that are real numbers within some range defined by minimum and maximum limits. In such cases we divide the given range into a number of sub-ranges and allocate a unique column number to each sub-range. For example, we might have a data item average which takes a real number within the range 0.0 to 100.0, in which case we might divide this range of values into three sub-ranges: 0.0 <= n <= 33.3 , 33.3 < n <= 66.6 and 66.6 < n <= 100.0 (where n is some instance of the field average), and assign each sub-range to a column number (say) {5, 6, 7} respectively.
 
Integer Data (int):
Integer data fields can be treated as either continuous data items or as nominal data items. For example if we have a field age which can take a value from 1 to 120 we might treat this as a continuous data item and represent it as a sequence of (say) 4 ranges: 0 <= n <= 30 , 31 <= n <= 60 , 61 <= n <= 90 and 91 <= n <= 120 (where n is some instance of the field age), and assign each range to a unique column number (say) {8, 9, 10, 11} respectively. Alternatively we may have a field class which can have the value 0 or 1 (i.e. a Boolean data item), in which case this could be treated as a nominal data item with one column (say 12) allocated to the value 0 and another (say 13) to the value 1. Whether to treat an integer data item as a nominal or a continuous item depends on the number of attributes the user of an ARM programme is prepared to tolerate --- usually the time complexity of these programmes (at least in part) is proportional to the number of columns (attributes) in the input data. It would therefore not be a good idea to discretise the above age field in terms of 120 individual columns!
Unused (unused):
Unused data fields generally describe fields such as "record number" or "TID". Data fields of "type" unused are entirely ignored by the LUCS-KDD DN software and thus play no part in the discretisation/normalisation process.

Thus given a (space separated) data set with the schema {colour (nominal), average (double), age (int), class (int)} such as:

red    25.6  56 1
green  33.3   1 1
green   2.5  23 0
blue   67.2 111 1
red    29.0  34 0
yellow 99.5  78 1
yellow 10.2  23 1
yellow  9.9  30 0
blue   67.0  47 0
red    41.8  99 1

This would be discretised/normalised as follows:

3 5  9 13
2 5  8 13
2 5  8 12
1 7 11 13
3 5  9 12
4 7 10 13
4 5  8 13
4 5  8 12
1 7  9 12
3 6 11 13

The entire data set is now represented by 13 binary valued attributes and can be mined using appropriate ARM software.
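To make the mapping concrete, the fragment below sketches, in Java, how a single record with the example schema {colour, average, age, class} could be converted into the column numbers used above. It is an illustration of the idea only; the sub-range boundaries are those chosen in the text, but the class and method names are invented for this example and are not taken from the LUCS-KDD DN source code.

import java.util.Arrays;

// Illustrative sketch only -- not the LUCS-KDD DN implementation.
public class ExampleNormalisation {

    // Nominal values for "colour" map to columns 1 to 4.
    static final String[] COLOURS = {"blue", "green", "red", "yellow"};

    static int[] convert(String colour, double average, int age, int clas) {
        int[] record = new int[4];
        // Nominal field: one column per possible value.
        for (int i = 0; i < COLOURS.length; i++) {
            if (COLOURS[i].equals(colour)) record[0] = i + 1;
        }
        // Continuous field "average" (0.0 to 100.0), three sub-ranges, columns 5 to 7.
        if (average <= 33.3)      record[1] = 5;
        else if (average <= 66.6) record[1] = 6;
        else                      record[1] = 7;
        // Integer field "age" (0 to 120), four sub-ranges, columns 8 to 11.
        if (age <= 30)      record[2] = 8;
        else if (age <= 60) record[2] = 9;
        else if (age <= 90) record[2] = 10;
        else                record[2] = 11;
        // Boolean "class" field treated as nominal, columns 12 and 13.
        record[3] = (clas == 0) ? 12 : 13;
        return record;
    }

    public static void main(String[] args) {
        // First record of the example data set: red 25.6 56 1
        System.out.println(Arrays.toString(convert("red", 25.6, 56, 1)));
        // Prints [3, 5, 9, 13], matching the first line of the output above.
    }
}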


1.1 Missing values

It is not unusual for data sets in the UCI repository to include missing values. The convention is to indicate these using a `?' character. This convention has also been adopted for the purpose of the LUCS-KDD DN software described here.

WARNING: Where a missing value has occurred in a numeric field this is recorded internally as -100000001.0. In the unfortunate case where input data includes very large negative numbers (integers or doubles) these will be identified as missing values!
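By way of illustration, the following Java fragment shows the convention as described above; the method is invented for the example (it is not part of the LUCS-KDD DN source), but the marker value is the one given in the warning.

// Sketch of the missing value convention (illustrative only).
static final double MISSING_VALUE = -100000001.0;

static double parseField(String token) {
    // A '?' in the raw data denotes a missing value.
    if (token.equals("?")) return MISSING_VALUE;
    return Double.parseDouble(token);
}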




2. DOWNLOADING THE SOFTWARE

The LUCS-KDD DN software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, which should therefore make it highly portable. In the interest of "user friendliness" the user interface has been implemented as a GUI. The source code is available at:

http://csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN_ARM/LUCS_KDD_DN_ARM.java

The code does not require any special packages and thus can be compiled using the standard Java compiler:

 
javac LUCS_KDD_DN_ARM.java

An example data file and schema are also available from this WWW page (they are used to illustrate the software in Section 4):

The Pima Indians data set is a well known data set often used for "benchmarking" purposes within the Knowledge Discovery in Data (KDD) community. The data set is one of many available from the UCI Machine Learning Repository (Blake and Merz, 1998).




3. BEFORE YOU START

Before the LUCS-KDD DN software can convert a data file into the desired binary valued format it needs to know the schema for the data to be converted. The schemas for the data files in the UCI repository are described (usually) in free text format within a .names file. Users of the LUCS-KDD DN software will thus have to create their own schema files before any conversion can take place.

LUCS-KDD DN schema files comprise three lines of text, each containing a sequence of N literals separated by white space (not carriage returns), where N is the number of attributes in the data set to be converted:

  1. The first line is used to describe the "type" of each field; the permitted options are: unused, nominal, int and double.
  2. The second line gives the names of the fields; these are not used in the discretisation/normalisation itself, but may be useful when the time comes to match the column numbers contained in ARs and CARs to the output schema.
 
  3. The third line contains the possible "legal" values for each nominal data item; in the case of "unuseds", "doubles" and "ints" the literal null is used. Nominal values are separated by a '/' character.

The schema file for the above example data set would be:

nominal double int nominal
colour average age class
blue/green/red/yellow null null 0/1
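
To illustrate the three line format, the fragment below sketches how such a schema file might be read and listed; the class name and the exact output format are assumptions made for this example and do not necessarily reflect the LUCS-KDD DN source code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Illustrative sketch only: reads the three line schema format described above.
public class SchemaReader {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String[] types  = in.readLine().trim().split("\\s+");  // line 1: field "types"
        String[] names  = in.readLine().trim().split("\\s+");  // line 2: field names
        String[] values = in.readLine().trim().split("\\s+");  // line 3: nominal values or null
        in.close();
        for (int i = 0; i < types.length; i++) {
            String entry = "(" + (i + 1) + ") " + types[i] + ": " + names[i];
            if (types[i].equals("nominal")) {
                // Nominal values are separated by '/' characters.
                entry = entry + " { " + values[i].replace('/', ' ') + " }";
            }
            System.out.println(entry);
        }
    }
}

Applied to the example schema above, this would list the four fields together with the legal values of the two nominal fields.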

The schema file for the Pima Indians UCI data set, available from this WWW page, is as follows:

int int int int int double double int nominal
NumberPregnacies PlasmaGluConcent DiastolicBldPress TricepsSkinFold 2-HourSerumIns BodyMassIndex DiabPedFunc Age Class
null null null null null null null null 0/1

(Remember that data fields of "type" unused are ignored by the LUCS-KDD DN software and do not appear in the output.)




4. RUNNING THE LUCS-KDD DN SOFTWARE

When compiled the software can be invoked in the normal manner using the Java interpreter:

java LUCS_KDD_DN_ARM

If you are planning to process a very large data set, such as covertype, it might be an idea to grab some extra memory. For example:

java -Xms600m -Xmx600m LUCS_KDD_DN_ARM

Once invoked you will see an interface of the form presented in Figure 1. Note that in Figure 1 most of the interface buttons are "greyed out"; this is because, at this stage, we have no data to work with. The two functional buttons are:

 
  1. Input Schema: Selecting this button will bring up a second GUI to allow the user to load the identified schema file.
  2. Input num. ranges: To normalise a continuous attribute the LUCS-KDD discretisation software divides the values for the attribute into a number of sub-ranges. Using the "Input num. ranges" button the user can specify the desired number of sub-ranges. It is usually a good idea to select a low value; this will have the effect of producing fewer columns in the output data, which in turn will provide computational efficiency benefits. Values of 3, 4 or 5 have all been successfully used.

Note also, in Figure 1, that some instructions are presented in the main window of the GUI.


Figure 1: LUCS-KDD DN interface on start up.


4.1 Loading a schema file

A schema file is loaded by selecting the "Input Schema" button. Once the schema has been "read" three more buttons become active:

  1. Input Data ('  ' separated): Raw data files in the UCI repository are either space or comma separated. This button allows the user to select and load a space separated data file to be discretised/normalised.
  2. Input Data (',' separated): Button to allow the user to select and load a comma separated data file.
  3. List Schema: Button to allow the user to list the contents of the current schema file. In the case of the Pima Indians data set the schema will be presented as follows:
 
    (1) int: NumberPregnacies
    (2) int: PlasmaGluConcent
    (3) int: DiastolicBldPress
    (4) int: TricepsSkinFold
    (5) int: 2-HourSerumIns
    (6) double: BodyMassIndex
    (7) double: DiabPedFunc
    (8) int: Age
    (9) nominal: Class { 0 1 }
    

Figure 2 shows the LUCS-KDD DN interface once a schema file (the example for the Pima Indians data set given above) has been loaded and listed using the "Input Schema" and "List Schema" buttons respectively.

Figure 2: LUCS-KDD DN interface with schema file loaded.


4.2 Loading a data file

A data file can be loaded using either the "Input Data ('  ' separated)" or the "Input Data (',' separated)" buttons. Some example raw data (the first few lines from the Pima Indians comma separated data set) is given below:

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
......

Figure 3 shows the LUCS-KDD DN interface once a data set has been loaded (the Pima Indians data set in this case). Note that six more buttons have now become active:

  1. List Input Data: This button allows the user to list the input data; note that the data will be listed using the LUCS-KDD DN internal representation. Some sample output for the Pima Indians set is given below:
    6.0 148.0 72.0 35.0 0.0 33.6 0.627 50.0 1.0
    1.0 85.0 66.0 29.0 0.0 26.6 0.351 31.0 0.0
    8.0 183.0 64.0 0.0 0.0 23.3 0.672 32.0 1.0
    1.0 89.0 66.0 23.0 94.0 28.1 0.167 21.0 0.0
    0.0 137.0 40.0 35.0 168.0 43.1 2.288 33.0 1.0
    5.0 116.0 74.0 0.0 0.0 25.6 0.201 30.0 0.0
    3.0 78.0 50.0 32.0 88.0 31.0 0.248 26.0 1.0
    10.0 115.0 0.0 0.0 0.0 35.3 0.134 29.0 0.0
    2.0 197.0 70.0 45.0 543.0 30.5 0.158 53.0 1.0
    8.0 125.0 96.0 0.0 0.0 0.0 0.232 54.0 1.0
    4.0 110.0 92.0 0.0 0.0 37.6 0.191 30.0 0.0
    10.0 168.0 74.0 0.0 0.0 38.0 0.537 34.0 1.0
    10.0 139.0 80.0 0.0 0.0 27.1 1.441 57.0 0.0
    1.0 189.0 60.0 23.0 846.0 30.1 0.398 59.0 1.0
    5.0 166.0 72.0 19.0 175.0 25.8 0.587 51.0 1.0
    ....
    
 
  2. List Min Max Data: Button to list the minimum and maximum values for the current data set. For the Pima Indians data set this will be as follows:

    Range col. 1: 0.0 to 17.0 (range of 17.0)
    Range col. 2: 0.0 to 199.0 (range of 199.0)
    Range col. 3: 0.0 to 122.0 (range of 122.0)
    Range col. 4: 0.0 to 99.0 (range of 99.0)
    Range col. 5: 0.0 to 846.0 (range of 846.0)
    Range col. 6: 0.0 to 67.1 (range of 67.1)
    Range col. 7: 0.078 to 2.42 (range of 2.342)
    Range col. 8: 21.0 to 81.0 (range of 60.0)
    Range col. 9: 0.0 to 1.0 (range of 1.0)
    
  3. Move N to End: This button allows the user to specify a column to be moved to the end of each record (a sketch of this operation is given after this list).
  4. Move N to before M: Similarly, this button allows the user to move a column to a position in the data set other than the end, i.e. to immediately before a specified column.
  5. Remove Col N: Some of the UCI raw data sets (and other such sets) include redundant columns. For example it is not uncommon for data sets to include an identifying incremental number for each record. This button allows the user to remove a column from the input data.
  6. Randomise Rows: Some data sets are ordered in some way, for example according to a particular attribute or attributes. This button causes the input data to be randomised should this be desired.
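
As an indication of what the column manipulation buttons do, the fragment below sketches the "Move N to End" operation applied to a single record; the method name and the 1-based column numbering are assumptions made for this illustration and are not taken from the LUCS-KDD DN source code.

// Illustrative sketch only: move column n (numbered from 1) to the end of a record.
static double[] moveColumnToEnd(double[] record, int n) {
    double[] result = new double[record.length];
    int index = 0;
    for (int i = 0; i < record.length; i++) {
        if (i != n - 1) result[index++] = record[i];   // copy all other columns in order
    }
    result[record.length - 1] = record[n - 1];         // place column n last
    return result;
}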

Figure 3: LUCS-KDD DN interface with schema and data files loaded.


4.3 Entering a value for the desired number of sub-ranges

The desired number of sub-ranges is entered using the "Input num. ranges" button; for the purposes of this example we will assume a value of 5.

Remember that the higher the number of sub-ranges, the more output attributes will be generated during the discretisation (normalisation) process, which in turn will affect the computational efficiency of any ARM algorithm that may be applied to the data.

 

Where an integer data field has a range less than the user specified number of sub-ranges the LUCS-KDD DN software will assign a number of columns to the integer field equivalent to the size of the range. For example, if we have an integer field which can take values in the range of 99 to 102 inclusive (i.e. a range size of 4) and the number of sub-ranges value is 5, then 4 columns would be assigned to the integer field. If, however, the number of sub-ranges value was set to 3, the 99 to 102 range would be divided into three sub-ranges (99 <= n <= 100, 100 < n <= 101 and 101 < n <= 102), each with its own unique column number assigned to it.
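
The fragment below sketches this rule; it is a reconstruction of the behaviour described above for illustration only, not an extract from the LUCS-KDD DN source code.

// Illustrative sketch only: the number of columns assigned to an integer field
// is the smaller of the user supplied number of sub-ranges and the size of the
// field's value range.
static int columnsForIntegerField(int min, int max, int numSubRanges) {
    int rangeSize = max - min + 1;              // e.g. 99 to 102 inclusive gives 4
    return Math.min(rangeSize, numSubRanges);   // 4 columns if numSubRanges = 5,
                                                // 3 columns if numSubRanges = 3
}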


4.4 Discretisation/normalisation

Once a value for the number of sub-ranges has been successfully entered the "Normalisation" button becomes available:

  1. Normalisation: Button to cause discretisation/normalisation to be carried out.

To discretise/normalise the data set the user simply selects the "Normalisation" button. Note that for some data sets this process may take a few seconds. The result in the case of the Pima Indians data set will be of the following form:

 
2 9 13 17 21 28 32 38 42
1 8 13 17 21 27 31 36 41
3 10 13 16 21 27 32 36 42
1 8 13 17 21 28 31 36 41
1 9 12 17 21 29 35 37 42
2 8 14 16 21 27 31 36 41
1 7 13 17 21 28 31 36 42
3 8 11 16 21 28 31 36 41
1 10 13 18 24 28 31 38 42
3 9 14 16 21 26 31 38 42
2 8 14 16 21 28 31 36 41
3 10 14 16 21 28 31 37 42
3 9 14 16 21 28 33 39 41
1 10 13 17 25 28 31 39 42
2 10 13 16 22 27 32 38 42
............
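
To see where these labels come from, consider the first raw record listed in Section 4.2 (6, 148, 72, 35, 0, 33.6, 0.627, 50, 1) together with the minimum and maximum values reported by "List Min Max Data" and five equal-width sub-ranges per field. NumberPregnacies ranges from 0 to 17, so each sub-range has width 3.4; the value 6 falls in the second sub-range (between 3.4 and 6.8) and is labelled 2. PlasmaGluConcent ranges from 0 to 199 (sub-range width 39.8); the value 148 falls in the fourth sub-range (between 119.4 and 159.2) and, since this field is allocated labels 6 to 10, becomes 9. BodyMassIndex ranges from 0.0 to 67.1 (sub-range width 13.42); the value 33.6 falls in the third sub-range and, with labels 26 to 30 allocated to this field, becomes 28. Finally the Class value 1 maps to label 42. Continuing in this way for the remaining fields yields the first output record above, 2 9 13 17 21 28 32 38 42.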

4.5 Output

Once normalisation is complete a further five (output) buttons become active (Figure 4) as follows:

  1. List Parameters: The "List Parameters" button outputs statistics relating to the processed data set. In the case of the Pima Indians set this will be:
    Number of records             = 768
    Number of cols (input data)   = 9
    Number of cols (schema)       = 9
    Num. missing values           = 0
    Max num. ranges               = 5
    Number of attributes          = 42
    Density (%)                   = 21.43
    Number of classes             = 2
    Number of records per class:
    	Class 42 = 268 records (34.9%)
    	Class 41 = 500 records (65.1%)
    
    
  2. List Output Schema: This button produces a set of labels, one for each column, for the discretised/normalised data which may be used when explaining the results of ARM exercises. Some example output in the context of the Pima Indians set is given below (a sketch showing how these labels can be derived is given after this list):

    (1) NumberPregnacies<3.4
    (2) NumberPregnacies<6.8
    (3) NumberPregnacies<10.2
    (4) NumberPregnacies<13.6
    (5) NumberPregnacies<17.0
    (6) PlasmaGluConcent<39.8
    (7) PlasmaGluConcent<79.6
    (8) PlasmaGluConcent<119.4
    (9) PlasmaGluConcent<159.2
    (10) PlasmaGluConcent<199.0
    (11) DiastolicBldPress<24.4
    (12) DiastolicBldPress<48.8
    (13) DiastolicBldPress<73.2
    (14) DiastolicBldPress<97.6
    (15) DiastolicBldPress<122.0
    (16) TricepsSkinFold<19.8
    (17) TricepsSkinFold<39.6
    (18) TricepsSkinFold<59.4
    (19) TricepsSkinFold<79.2
    (20) TricepsSkinFold<99.0
    (21) 2-HourSerumIns<169.2
    (22) 2-HourSerumIns<338.4
    (23) 2-HourSerumIns<507.6
    (24) 2-HourSerumIns<676.8
    (25) 2-HourSerumIns<846.0
    (26) BodyMassIndex<13.419999999999998
    (27) BodyMassIndex<26.839999999999996
    (28) BodyMassIndex<40.26
    (29) BodyMassIndex<53.67999999999999
    (30) BodyMassIndex<67.1
    (31) DiabPedFunc<0.5464
    (32) DiabPedFunc<1.0148000000000001
    (33) DiabPedFunc<1.4832
    (34) DiabPedFunc<1.9516000000000002
    (35) DiabPedFunc<2.42
    (36) Age<33.0
    (37) Age<45.0
    (38) Age<57.0
    (39) Age<69.0
    (40) Age<81.0
    (41) Class=0
    (42) Class=1
    
  3. List Output Data: The "List Output Data" button is used to output the results of discretisation/normalisation to the screen, thus allowing the user to "eye ball" the data before saving it to file. The first few lines of the discretised, normalised and distributed Pima Indians set are given below. Note that in the case of the Pima Indians set the ratio of "NoSignsOfDiabetes" (41) to "SignsOfDiabetes" (42) is roughly 2:1; therefore (as a result of distribution) every third record is a "SignsOfDiabetes" record.
    1 8 13 17 21 27 31 36 41
    2 9 13 17 21 28 32 38 42
    1 8 13 17 21 28 31 36 41
    2 8 14 16 21 27 31 36 41
    3 10 13 16 21 27 32 36 42
    3 8 11 16 21 28 31 36 41
    2 8 14 16 21 28 31 36 41
    1 9 12 17 21 29 35 37 42
    3 9 14 16 21 28 33 39 41
    1 8 12 17 21 29 31 37 41
    1 7 13 17 21 28 31 36 42
    1 9 14 18 22 28 32 36 41
    3 8 14 16 21 28 31 38 41
    1 10 13 18 24 28 31 38 42
    1 8 13 16 21 27 31 36 41
    .........
    
  4. Save Normalisation: Finally the "Save Normalisation" button allows the user to specify a file name and directory in which to store the discretised/normalised (and possibly distributed) data set. It is a good idea to include, in the output file name, information concerning: (i) the number of columns (attributes) and (ii) the number of records. For example the Pima Indians output data set might be stored in the file pimaIndians.D42.N768.num (where D is the number of attributes, and N the number of records).
  5. Save Output Schema: Causes the output schema to be saved to file.
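
The "<" labels produced by "List Output Schema" follow directly from the minimum and maximum values reported by "List Min Max Data": each continuous or integer field is divided into equal-width sub-ranges and the upper bound of each sub-range is appended to the field name (the long decimals such as 13.419999999999998 are simply double precision rounding). The fragment below sketches the calculation; the method name and signature are invented for this illustration and are not taken from the LUCS-KDD DN source code.

// Illustrative sketch only: generate "<boundary" labels for a field divided
// into equal-width sub-ranges, in the style of the output schema listed above.
static void printLabels(String name, double min, double max,
                        int numSubRanges, int firstColumn) {
    double width = (max - min) / numSubRanges;
    for (int i = 1; i <= numSubRanges; i++) {
        // e.g. for 2-HourSerumIns (0.0 to 846.0, 5 sub-ranges) this produces
        // columns 21 to 25 labelled <169.2, <338.4, <507.6, <676.8 and <846.0.
        System.out.println("(" + (firstColumn + i - 1) + ") "
                + name + "<" + (min + i * width));
    }
}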

Figure 4: LUCS-KDD DN interface after discretisation/normalisation.




5. CONCLUSIONS

The LUCS-KDD DN software has been in use successfully within the LUCS-KDD group for some time. The software is available free of charge for non-commercial use; however, the author would appreciate appropriate acknowledgement. The following reference format for referring to the software is suggested:

Coenen, F. (2003), LUCS-KDD ARM DN Software, http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN_ARM/, Department of Computer Science, The University of Liverpool, UK.

 

Should you discover any "bugs" or other problems within the software (or this documentation), or have suggestions for additional features, please do not hesitate to contact the author.




6. REFERENCES

  1. Blake, C.L. and Merz, C.J. (1998). UCI Repository of machine learning databases http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science.



Created and maintained by Frans Coenen. Last updated 18 January 2005