IMPORTANT NOTE: This is an upgrade of the September 2003 version. The latter is still available from this site. The distinctions between Version 2, described here, and Version 1 are as follows:
1. INTRODUCTION
The LUCS-KDD (Liverpool University Computer Science - Knowledge Discovery in Data) DN (Discretisation/Normalisation) software has been developed to convert data files available in the UCI data repository ([1]) into a binary format suitable for use with Classification Association Rule Mining (CARM) applications. The software can, of course, equally well be used to convert data files obtained from other sources. We define discretisation and normalisation as follows:
CARM requires "binary valued" input data sets where each record represents an itemset, which in turn is a subset of the available set of attributes. Thus:

N = The number of available attributes
A = A set of attributes = {a | 0 < a <= N}
D = A data set comprising M records (R)
R = {r | r subset A}

For example if A={1 2 3 4} then we might have a data set of the form:

1 2 3 4
1 2 3
1 2 4
1 3 4
2 3 4
1 2
1 3
1 4
2 3
2 4
3 4
1
2
3
4

which may be used as the input to an ARM software system. The distinction between the data sets used for ARM and those used for CARM is that in the latter case one of the attributes in each record represents a class to which the record is said to "belong". Usually, to facilitate identification, either the last or first attribute in each record in the input data represents the class. Thus we might have a CARM data set of the following form, where the last item in each record represents the class (either 3 or 4):

1 2 4
1 2 3
1 4
2 4
1 3
2 3

(Note that records in a CARM input set must have at least one attribute as well as a class.)

Generally speaking, real data sets do not comprise only binary fields. Typically such data sets comprise a mixture of nominal, continuous and integer fields. Thus real data will require discretisation/normalisation, i.e. conversion into a binary valued format, before it can be used for CARM. How this is done using the LUCS-KDD DN software will depend on the nature of each field in the data:
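The CARM record format described above (an itemset whose last item is the class) can be sketched in Java as follows. This is an illustration only: the class name CarmRecord and its parse method are hypothetical and are not taken from the LUCS-KDD DN source code.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the LUCS-KDD DN source.
final class CarmRecord {
    final int[] attributes; // the items that precede the class
    final int classLabel;   // the last item in the record

    private CarmRecord(int[] attributes, int classLabel) {
        this.attributes = attributes;
        this.classLabel = classLabel;
    }

    /** Parses one space-separated line whose last number is the class. */
    static CarmRecord parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException(
                "A CARM record needs at least one attribute and a class: " + line);
        }
        int[] items = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            items[i] = Integer.parseInt(tokens[i]);
        }
        return new CarmRecord(Arrays.copyOf(items, items.length - 1),
                              items[items.length - 1]);
    }

    public static void main(String[] args) {
        CarmRecord r = CarmRecord.parse("1 2 4");
        // prints "[1, 2] -> class 4"
        System.out.println(Arrays.toString(r.attributes) + " -> class " + r.classLabel);
    }
}
```

A line holding a single number is rejected, reflecting the requirement that a record has at least one attribute as well as a class.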
Thus given a (space separated) data set with the schema {colour (nominal), average (double), age (int), class (int)} such as:

red 25.6 56 1
green 33.3 1 1
green 2.5 23 0
blue 67.2 111 1
red 29.0 34 0
yellow 99.5 78 1
yellow 10.2 23 1
yellow 9.9 30 0
blue 67.0 47 0
red 41.8 99 1

This would be discretised/normalised as follows:

3 5 9 13
2 5 8 13
2 5 8 12
1 7 11 13
3 5 9 12
4 7 10 13
4 5 8 13
4 5 8 12
1 7 9 12
3 6 11 13

The entire data set is now represented by 13 binary valued attributes and can be mined using appropriate CARM software.

1.1 Missing values

It is not unusual for data sets in the UCI repository to include missing values. The convention is to indicate these using a "?" character. This convention has also been adopted for the purpose of the LUCS-KDD DN software described here.

WARNING: Where a missing value has occurred in a numeric field this is recorded internally as 100000001.0. Consequently, if the input data includes very large numbers (integers or doubles), these will be identified as missing values!
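The pitfall described in the warning can be illustrated with a minimal Java sketch. The value 100000001.0 is taken from the text above; the class and method names are hypothetical. How the software actually tests for the sentinel is not stated here, so the sketch assumes a simple equality check.

```java
// Hypothetical illustration of the missing-value sentinel; not the actual source.
final class MissingValueDemo {
    // Sentinel value quoted in the warning above.
    static final double MISSING = 100000001.0;

    /** Converts a raw numeric field to a double; "?" becomes the sentinel. */
    static double parseNumeric(String field) {
        String f = field.trim();
        return f.equals("?") ? MISSING : Double.parseDouble(f);
    }

    /** Assumption: a stored value is treated as missing when it equals the sentinel. */
    static boolean looksMissing(double value) {
        return value == MISSING;
    }

    public static void main(String[] args) {
        System.out.println(looksMissing(parseNumeric("25.6")));      // false
        System.out.println(looksMissing(parseNumeric("?")));         // true
        // A genuine input value that collides with the sentinel:
        System.out.println(looksMissing(parseNumeric("100000001"))); // true
    }
}
```

The third call shows why a genuine value that coincides with the sentinel cannot be told apart from a missing one once the data has been read.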
The LUCS-KDD DN software has been implemented in Java using the Java2 SDK (Software Development Kit) Version 1.4.0, and should therefore be highly portable. In the interest of "user friendliness" the user interface has been implemented as a GUI. The source code is available at:

http://csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/LUCS_KDD_DN.java

The code does not require any special packages and thus can be compiled using the standard Java compiler:

javac LUCS_KDD_DN.java
Before the LUCS-KDD DN software can convert a data file into the desired binary valued format it needs to know the schema for the data to be converted. The schemas for the data files in the UCI repository are described (usually) in free text format within a .names file. Users of the LUCS-KDD DN software will thus have to create their own schema files before any conversion can take place. LUCS-KDD DN schema files comprise three lines of text, each containing a sequence of N literals separated by white space (not carriage returns), where N is the number of attributes in the data set to be converted:
The schema file for the above example data set would be:

nominal double int nominal
colour average age class
blue/green/red/yellow null null 0/1

The schema file for the Pima Indians UCI data set might be:

int int int int int double double int nominal
NumberPregnacies PlasmaGluConcent DiastolicBldPress TricepsSkinFold 2-HourSerumIns BodyMassIndex DiabPedFunc Age Class
null null null null null null null null 0/1

(Remember that data fields of "type" unused are ignored by the LUCS-KDD DN software and do not appear in the output.)
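To make the three-line layout concrete, the following Java sketch reads the three schema lines into a list of fields. It is a hedged illustration rather than the parser used by LUCS-KDD DN: the names SchemaSketch, Field and parse are hypothetical, the type literals are stored without being interpreted, and the code uses language features newer than the Java 1.4 SDK mentioned earlier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical schema reader; not the parser used by LUCS-KDD DN.
final class SchemaSketch {
    /** One column of the schema: type, name and (for nominal fields) its values. */
    static final class Field {
        final String type;
        final String name;
        final String[] values; // empty when the third line holds "null"

        Field(String type, String name, String[] values) {
            this.type = type;
            this.name = name;
            this.values = values;
        }
    }

    /** Builds the field list from the three white-space separated schema lines. */
    static List<Field> parse(String typeLine, String nameLine, String valueLine) {
        String[] types = typeLine.trim().split("\\s+");
        String[] names = nameLine.trim().split("\\s+");
        String[] values = valueLine.trim().split("\\s+");
        if (types.length != names.length || types.length != values.length) {
            throw new IllegalArgumentException("Each schema line must hold N literals");
        }
        List<Field> fields = new ArrayList<>();
        for (int i = 0; i < types.length; i++) {
            String[] vals = values[i].equals("null") ? new String[0] : values[i].split("/");
            fields.add(new Field(types[i], names[i], vals));
        }
        return fields;
    }

    public static void main(String[] args) {
        List<Field> fields = parse(
            "nominal double int nominal",
            "colour average age class",
            "blue/green/red/yellow null null 0/1");
        for (int i = 0; i < fields.size(); i++) {
            Field f = fields.get(i);
            StringBuilder line = new StringBuilder();
            line.append('(').append(i + 1).append(") ")
                .append(f.type).append(": ").append(f.name);
            if (f.values.length > 0) {
                line.append(" { ").append(String.join(" ", f.values)).append(" }");
            }
            System.out.println(line);
        }
    }
}
```

The length check enforces the rule that every line carries exactly N literals, which is the most likely mistake when writing a schema file by hand.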
When compiled the software can be invoked in the normal manner using the Java interpreter:

java LUCS_KDD_DN

If you are planning to process a very large data set, such as covertype, it may be worth allocating some extra memory. For example:

java -Xms600m -Xmx600m LUCS_KDD_DN
Once invoked you will see an interface of the form presented in Figure 1. Note that in Figure 1 all but one of the interface buttons are "greyed out"; this is because, at this stage, we have no data to work with. The one functional button is:
Note also, in Figure 1, that some instructions are presented in the main window of the GUI.
Figure 1: LUCS-KDD DN interface on start up.
A schema file is loaded by selecting the "Input Schema" button. Once the schema has been "read" three more buttons become active:
(1) int: NumberPregnacies
(2) int: PlasmaGluConcent
(3) int: DiastolicBldPress
(4) int: TricepsSkinFold
(5) int: 2-HourSerumIns
(6) double: BodyMassIndex
(7) double: DiabPedFunc
(8) int: Age
(9) nominal: Class { 0 1 }

Figure 2 shows the LUCS-KDD DN interface once a schema file (the example for the Pima Indians data set given above) has been loaded and listed using the "Input Schema" and "List Schema" buttons respectively.
Figure 2: LUCS-KDD DN interface with schema file loaded.
A data file can be loaded using either the "Input Data (' ' separated)" or the "Input Data (',' separated)" button. Some example raw data (the first few lines from the Pima Indians comma separated data set) is given below:

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
5,116,74,0,0,25.6,0.201,30,0
3,78,50,32,88,31.0,0.248,26,1
10,115,0,0,0,35.3,0.134,29,0
2,197,70,45,543,30.5,0.158,53,1
8,125,96,0,0,0.0,0.232,54,1
4,110,92,0,0,37.6,0.191,30,0
10,168,74,0,0,38.0,0.537,34,1
10,139,80,0,0,27.1,1.441,57,0
1,189,60,23,846,30.1,0.398,59,1
5,166,72,19,175,25.8,0.587,51,1
......

Figure 3 shows the LUCS-KDD DN interface once a data set has been loaded (the Pima Indians data set in this case). Note that seven more buttons have now become active:
6.0 148.0 72.0 35.0 0.0 33.6 0.627 50.0 1.0
1.0 85.0 66.0 29.0 0.0 26.6 0.351 31.0 0.0
8.0 183.0 64.0 0.0 0.0 23.3 0.672 32.0 1.0
1.0 89.0 66.0 23.0 94.0 28.1 0.167 21.0 0.0
0.0 137.0 40.0 35.0 168.0 43.1 2.288 33.0 1.0
5.0 116.0 74.0 0.0 0.0 25.6 0.201 30.0 0.0
3.0 78.0 50.0 32.0 88.0 31.0 0.248 26.0 1.0
10.0 115.0 0.0 0.0 0.0 35.3 0.134 29.0 0.0
2.0 197.0 70.0 45.0 543.0 30.5 0.158 53.0 1.0
8.0 125.0 96.0 0.0 0.0 0.0 0.232 54.0 1.0
4.0 110.0 92.0 0.0 0.0 37.6 0.191 30.0 0.0
10.0 168.0 74.0 0.0 0.0 38.0 0.537 34.0 1.0
10.0 139.0 80.0 0.0 0.0 27.1 1.441 57.0 0.0
1.0 189.0 60.0 23.0 846.0 30.1 0.398 59.0 1.0
5.0 166.0 72.0 19.0 175.0 25.8 0.587 51.0 1.0
....
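The listing above suggests that every field of a loaded record is held as a double. A minimal Java sketch of reading one raw line in that way is given below. The class and method names are hypothetical, and handling of nominal fields and of missing ("?") values is deliberately left out.

```java
import java.util.Arrays;

// Hypothetical row reader for purely numeric records; not the actual loader.
final class RowParser {
    /** Splits a raw line on the given separator (a regex) and parses each field. */
    static double[] parseRow(String line, String separatorRegex) {
        String[] tokens = line.trim().split(separatorRegex);
        double[] row = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            row[i] = Double.parseDouble(tokens[i].trim());
        }
        return row;
    }

    public static void main(String[] args) {
        double[] row = parseRow("6,148,72,35,0,33.6,0.627,50,1", ",");
        // prints [6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
        System.out.println(Arrays.toString(row));
    }
}
```

For a space separated file the same method could be called with "\\s+" as the separator.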
Figure 3: LUCS-KDD DN interface with schema and data files loaded.
The desired number of divisions is entered using the "Input Num. Divs" button. With respect to the Pima Indians data set the minimum number of divisions could be 2, as there are two classes ("NoSignsOfDiabetes" and "SignsOfDiabetes"). However, experiments have shown that even with two classes it is a good idea to have at least 3 divisions. An even better classifier will usually be produced using a greater number of divisions; a value of 5 has been found to give a good trade-off between accuracy and computational efficiency with respect to the mining algorithm used.
Remember that the higher the number of divisions, the more output attributes will be generated during the discretisation/normalisation process, which in turn will have an effect on the computational efficiency of any CARM algorithm that may be applied to the data. Where an integer data field has a range less than the user specified number of divisions, the LUCS-KDD DN software will assign a number of columns to the integer field equivalent to the size of the range. For example, if we have an integer field which can take values in the range of 99 to 102 inclusive (i.e. a range size of 4) and the number of divisions is 5, then 4 columns would be assigned to the integer field. If, however, the number of divisions was set to 3, the 99 to 102 range would be divided into three divisions, each with its own unique column number assigned to it (how this is done is described below).
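The rule for integer fields amounts to taking the smaller of the range size and the requested number of divisions. A short Java sketch of the arithmetic, using the 99 to 102 example, is given below (the class and method names are hypothetical):

```java
// Hypothetical illustration of the column-count rule for integer fields.
final class ColumnCount {
    /** Number of output columns for an integer field with values in [min, max]. */
    static int columnsForIntField(int min, int max, int numDivisions) {
        int rangeSize = max - min + 1; // the range is inclusive at both ends
        return Math.min(rangeSize, numDivisions);
    }

    public static void main(String[] args) {
        System.out.println(columnsForIntField(99, 102, 5)); // 4
        System.out.println(columnsForIntField(99, 102, 3)); // 3
    }
}
```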
Once a value for the number of divisions has been successfully entered the "Normalisation" button becomes available:
To discretise/normalise the data set the user simply selects the "Normalisation" button. Note that for some data sets this process may take a few seconds. The result in the case of the Pima Indians data set will be of the following form:

1 6 9 14 19 23 27 33 38
1 5 9 14 19 23 27 32 37
1 8 9 14 19 23 27 32 38
1 5 9 14 19 23 27 32 37
1 5 9 14 19 23 31 32 38
1 5 9 14 19 23 27 32 37
1 5 9 14 19 23 27 32 38
3 5 9 14 19 23 27 32 37
1 8 9 17 22 23 27 34 38
1 5 10 14 19 23 27 35 38
1 5 9 14 19 23 27 32 37
3 8 9 14 19 23 27 32 38
3 5 9 14 19 23 31 36 37
1 8 9 14 22 23 27 36 38
1 8 9 14 19 23 27 33 38
............
Once normalisation is complete a further five (output) buttons become active (Figure 4) as follows:
(27) DiabPedFunc <= 0.93
(28) 0.93 < DiabPedFunc <= 0.953
(29) 0.953 < DiabPedFunc <= 1.001
(30) 1.001 < DiabPedFunc <= 1.379
(31) 1.379 < DiabPedFunc
(32) Age <= 47
(33) 47 < Age <= 53
(34) 53 < Age <= 54
(35) 54 < Age <= 55
(36) 55 < Age
(37) Class=0
(38) Class=1
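The attribute listing above can be read as a lookup table from raw values to output column numbers. The Java sketch below applies the DiabPedFunc and Age boundaries from the listing; the class and method names are hypothetical and the code only illustrates how the listing is read, not how the software stores its ranges.

```java
// Hypothetical lookup from a raw value to its output column number.
final class RangeLookup {
    /**
     * Returns the column for a value, given the upper bounds of all but the
     * last range and the column number of the first range. A value equal to
     * an upper bound belongs to that range, matching the "<=" in the listing.
     */
    static int columnFor(double value, double[] upperBounds, int firstColumn) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                return firstColumn + i;
            }
        }
        return firstColumn + upperBounds.length; // the open-ended last range
    }

    public static void main(String[] args) {
        double[] diabPedFunc = {0.93, 0.953, 1.001, 1.379}; // columns 27 to 31
        double[] age = {47, 53, 54, 55};                    // columns 32 to 36
        System.out.println(columnFor(0.627, diabPedFunc, 27)); // 27
        System.out.println(columnFor(2.288, diabPedFunc, 27)); // 31
        System.out.println(columnFor(50, age, 32));            // 33
    }
}
```

These results agree with the normalised output shown earlier: the first Pima Indians record (DiabPedFunc 0.627, Age 50) maps to columns 27 and 33, and the fifth record (DiabPedFunc 2.288) maps to column 31.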
Figure 4: LUCS-KDD DN interface after discretisation/normalisation.
5. HOW IT WORKS
The discretisation/normalisation mechanism operates according to whether the attribute column under consideration is either numeric or nominal.
When processing numeric attributes the procedure is as follows:
Note that as a result of a merge the resulting dominant class may be the dominant class originally associated with one of the divisions selected for the merge or it may be a different class.
Once a merge is complete it may be that the dominant class associated with the resulting division is identical to that associated with the division immediately before the merged pair, the division immediately after it, or both. In that case the new division can automatically be extended to absorb those neighbours. Consequently the reduction in the number of divisions obtained by a single merge may be 1, 2 or 3.
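The neighbour-extension step described above can be sketched in Java as follows. This is a hedged illustration under several assumptions: each division is represented only by its per-class record counts, ties for the dominant class go to the lowest class index, and the index of the pair to merge is supplied by the caller (the rule for choosing that pair is not covered by the sketch). The names MergeSketch, dominantClass and mergeAt are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of merging two divisions and absorbing matching neighbours.
final class MergeSketch {
    /** Index of the class with the highest count (assumption: ties go to the lowest index). */
    static int dominantClass(int[] counts) {
        int best = 0;
        for (int c = 1; c < counts.length; c++) {
            if (counts[c] > counts[best]) {
                best = c;
            }
        }
        return best;
    }

    /** Element-wise sum of two per-class count arrays of equal length. */
    static int[] add(int[] a, int[] b) {
        int[] sum = new int[a.length];
        for (int c = 0; c < a.length; c++) {
            sum[c] = a[c] + b[c];
        }
        return sum;
    }

    /** Merges divisions i and i+1, absorbs matching neighbours, returns the reduction. */
    static int mergeAt(List<int[]> divisions, int i) {
        int before = divisions.size();
        int[] merged = add(divisions.get(i), divisions.remove(i + 1));
        divisions.set(i, merged);
        // Extend over the following division if it shares the dominant class.
        if (i + 1 < divisions.size()
                && dominantClass(divisions.get(i + 1)) == dominantClass(merged)) {
            merged = add(merged, divisions.remove(i + 1));
            divisions.set(i, merged);
        }
        // Extend over the preceding division if it shares the dominant class.
        if (i > 0 && dominantClass(divisions.get(i - 1)) == dominantClass(merged)) {
            merged = add(divisions.remove(i - 1), merged);
            divisions.set(i - 1, merged);
        }
        return before - divisions.size();
    }

    public static void main(String[] args) {
        List<int[]> divisions = new ArrayList<>(Arrays.asList(
            new int[]{5, 1, 0},   // dominant class 0
            new int[]{2, 3, 0},   // dominant class 1
            new int[]{2, 0, 3},   // dominant class 2
            new int[]{4, 0, 1},   // dominant class 0
            new int[]{0, 0, 6})); // dominant class 2
        // Merging the second and third divisions gives counts {4, 3, 3}, whose
        // dominant class (0) differs from both originals and matches both neighbours.
        System.out.println(mergeAt(divisions, 1)); // 3
    }
}
```

In the example the merged division takes on a dominant class that neither of the original pair had, and both neighbours are then absorbed, giving the maximum reduction of 3.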
In the case of nominal attributes the normalisation process is slightly different, as follows:
6. COMPARISON WITH VERSION 1
Version 2 of the LUCS-KDD-DN software offers the following advantages over Version 1:
These advantages are evidenced in the following tables. Table 1 compares the size (Bytes) and the number of output attributes produced using both Version 2 and Version 1 of the software. Table 2 gives a comparison of classification accuracy, obtained using TCV (Ten-fold Cross Validation), for both versions of the software.
NAME | Size (Bytes) V2 | Size (Bytes) V1 | # Attribs. V2 | # Attribs. V1 |
---|---|---|---|---|
adult.D97.N48842.C2.num | 2034781 | 2140596 | 97 | 131 |
anneal.D73.N898.C6.num | 36705 | 34985 | 73 | 106 |
auto.D137.N205.C7.num | 17150 | 17277 | 137 | 142 |
breast.D20.N699.C2.num | 18126 | 19597 | 20 | 47 |
chessKRvK.D58.N28056.C18.num | 529596 | 548454 | 58 | 66 |
connect4.D129.N67557.C3.num | 9187752 | 9187752 | 129 | 129 |
cylBands.D124.N540.C2.num | 56675 | - | 124 | - |
flare.D39.N1389.C9.num | 43153 | - | 39 | - |
glass.D48.N214.C7.num | 5946 | 5993 | 48 | 52 |
heart.D52.N303.C5.num | 11943 | 12029 | 52 | 53 |
hepatitis.D56.N155.C2.num | 8335 | 8335 | 56 | 58 |
horseColic.D85.D368.C2.num | 18303 | 18515 | 85 | 94 |
ionosphere.D157.N351.C2.num | 40300 | 41469 | 157 | 172 |
iris.D19.N150.C3.num | 2017 | 1954 | 19 | 23 |
led7.D24.N3200.C10.num | 62154 | 62154 | 24 | 24 |
letRecog.D106.N20000.C26.num | 992990 | 986452 | 106 | 106 |
mushroom.D90.N8124.C2.num | 539424 | 575924 | 90 | 127 |
nursery.D32.N12960.C5.num | 320760 | 320760 | 32 | 32 |
pageBlocks.D46.N5473.C5.num | 169663 | 169814 | 46 | 55 |
penDigits.D89.N10992.C10.num | 543038 | 546304 | 89 | 90 |
pima.D38.N768.C2.num | 18462 | 19302 | 38 | 42 |
soybean-large.D118.N683.C19.num | 66107 | - | 118 | - |
ticTacToe.D29.N958.C2.num | 25866 | 25866 | 29 | 29 |
waveform.D101.N5000.C3.num | 324562 | 330164 | 101 | 108 |
wine.D68.N178.C3.num | 7146 | 7126 | 68 | 178 |
zoo.D42.N101.C7.num | 4670 | 4670 | 42 | 43 |
Table 1: Comparison of size and number of attributes in the output datasets produced using both versions of the LUCS-KDD-DN software.
NAME | Accuracy V1 | Accuracy V2 | Time V1 | Time V2 |
---|---|---|---|---|
adult.D97.N48842.C2.num | 76.1 | 80.8 | 80.0 | 48.1 |
anneal.D73.N898.C6.num | 84.6 | 88.3 | 12.5 | 10.9 |
auto.D137.N205.C7.num | 50.8 | 70.6 | 23.4 | 27.0 |
breast.D20.N699.C2.num | 95.9 | 90.0 | 1.0 | 1.1 |
chessKRvK.D58.N28056.C18.num | NRG | NRG | NRG | NRG |
connect4.D129.N67557.C3.num | 65.9 | 65.8 | 206.4 | 265.1 |
cylBands.D124.N540.C2.num | - | 68.3 | - | 87.4 |
flare.D39.N1389.C9.num | - | 84.3 | - | 5.2 |
glass.D48.N214.C7.num | 43.7 | 64.1 | 2.5 | 3.2 |
heart.D52.N303.C5.num | 57.1 | 51.4 | 11.8 | 10.3 |
hepatitis.D56.N155.C2.num | 77.3 | 81.2 | 15.7 | 14.5 |
horseColic.D85.D368.C2.num | 78.7 | 79.1 | 6.4 | 9.0 |
ionosphere.D157.N351.C2.num | 83.8 | 86.1 | 8.4 | 7.3 |
iris.D19.N150.C3.num | 94.7 | 95.3 | 0.1 | 1.0 |
led7.D24.N3200.C10.num | 69.0 | 68.7 | 0.7 | 1.6 |
letRecog.D106.N20000.C26.num | 43.0 | 27.6 | 115.6 | 106.3 |
mushroom.D90.N8124.C2.num | 96.1 | 99.0 | 31.6 | 38.1 |
nursery.D32.N12960.C5.num | 77.9 | 77.8 | 5.4 | 5.6 |
pageBlocks.D46.N5473.C5.num | 89.8 | 90.0 | 2.4 | 2.3 |
penDigits.D89.N10992.C10.num | 79.7 | 81.7 | 56.8 | 91.2 |
pima.D38.N768.C2.num | 74.9 | 74.4 | 1.3 | 1.1 |
soybean-large.D118.N683.C19.num | - | 87.3 | - | 8.6 |
ticTacToe.D29.N958.C2.num | 63.6 | 67.1 | 3.1 | 3.1 |
waveform.D101.N5000.C3.num | 71.7 | 66.7 | 89.7 | 77.7 |
wine.D68.N178.C3.num | 86.3 | 72.1 | 8.7 | 9.6 |
zoo.D42.N101.C7.num | 93.0 | 93.0 | 6.0 | 7.1 |
Table 2: Comparison of classification accuracy, obtained using the LUCS-KDD TFPC classifier and TCV, for datasets generated using both versions of the LUCS-KDD-DN software. (Support threshold = 1%, Confidence threshold = 50%, NRG = No Rules Generated.)
7. LITERATURE REVIEW
Class dependent discretisation techniques can be categorised in various ways: (i) supervised v. unsupervised, (ii) bottom-up v. top-down, (iii) direct v. in-direct. Supervised methods take the class labels into consideration while unsupervised methods do not (it can thus be argued that the latter are not class dependent discretisation techniques). Bottom-up methods allocate each attribute value to a range (interval) and then merge ranges until no further merging can take place (according to some criterion); top-down methods start with a single range and iteratively split this range according to some class related statistical mechanism until a stopping criterion is met. Direct methods pre-specify a maximum number of ranges while in-direct methods (also called incremental methods) do not. The LUCS-KDD-DN software described on this page is a supervised, bottom-up, direct method. Examples of supervised, bottom-up, incremental methods include (Kerber, 1992) and (Haun and Setiono, 1997). Examples of supervised, top-down, incremental methods include (Kurgan and Cios, 2004 and Tsai et al. 2008). For further work on class dependent discretisation see (Bryson and Joseph, 2001, Ching, et al., 1995, Das and Vyas, 2010, Dougherty et al., 1995, Hua and Zhao, 2009, Liu et al. 2002, and Liu and Setiono, 1997).
8. CONCLUSION
The LUCS-KDD DN software has been in use successfully within the LUCS-KDD group for some time. The software is available for free; however, the author would appreciate appropriate acknowledgement. The following reference format for referring to the software is suggested:

Coenen, F. (2003), LUCS-KDD DN Software (Version 2), http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), or have suggestions for additional features, please do not hesitate to contact the author. Some additional notes on processing a number of example data sets from the UCI repository, together with example schema files, are available at:

http://csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/exmpleDNnotes.html

A number of example discretised/normalised data sets, taken from the UCI library, are available at:

http://csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html
Created and maintained by Frans Coenen. Last updated 24 November 2011