LUCS KDD DATASET ENCODING

Frans Coenen

Department of Computer Science, The University of Liverpool

August 2006

On this page we briefly describe the representation in which input data sets are encoded for use with the example LUCS-KDD data mining software systems available at . The text is primarily intended for MSc project students but may prove useful to other readers.

Association Rule Mining (ARM) typically operates on what are known as binary valued data sets. This means that fields in the data sets can have one of two possible values: 1 or 0 (true or false, yes or no, etc.). The data sets are typically presented as a single N×M table comprising M rows and N fields, where the rows represent data records and the columns represent fields. Each field is called an item or attribute (of the data set). The global set of available items (attributes) represented in the data set is usually denoted by the letter I. A set of items (attributes) is known as an itemset. An itemset is thus some subset of the global item set I.

The exemplar ARM application is supermarket basket analysis, where we wish to identify purchasing patterns associated with supermarket customers. In this case the global attribute set (I) is the set of potential retail items that a customer can purchase. Each record then represents some subset of the global set of available retail items purchased in a single transaction. In this context each record is referred to as a transaction and may be identified by a Transaction Identification Number or TID. The data set is then sometimes referred to as a transaction set.

Consider the following example transaction dataset where the presence of a 1 indicates the purchase of a retail item and a 0 the non-purchase of an item. In the table I = {Apples, Beans, Coffee, Damsons}, N = |I| = 4 and M = 5.

TID	Apples	Beans	Coffee	Damsons
1	1	1	1	0
2	1	0	1	1
3	1	0	0	1
4	0	1	1	1
5	0	0	1	1

Usually, for discussion purposes, we abstract away from individual entity names or labels and use sequences of letters instead to represent attributes/items. The above data set thus becomes:

A B C
A C D
A D
B C D
C D
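
The step from the binary table to the lettered records can be sketched in Python as follows. This is a minimal illustration only: the names `labels`, `binary_rows` and `to_itemsets` are hypothetical and are not part of the LUCS-KDD software.

```python
# Single-letter labels for the columns of the example table
# (Apples, Beans, Coffee, Damsons). Hypothetical names, for illustration.
labels = ["A", "B", "C", "D"]

# The binary valued transaction table, one row per transaction.
# TIDs are implied by the ordering of the rows.
binary_rows = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]


def to_itemsets(rows, labels):
    """For each row keep the labels of the columns that hold a 1."""
    return [[lab for lab, bit in zip(labels, row) if bit == 1] for row in rows]


itemsets = to_itemsets(binary_rows, labels)
for record in itemsets:
    print(" ".join(record))
```

Each printed line is one record, listing only the items that are present in it.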

Note that the TIDs (record numbers) have been omitted as these can be implied from the ordering of the records. Although letters are useful for discussion purposes it is better, for implementational purposes, to represent attributes as a sequence of integers starting with the number 1. The above data set thus becomes:

1 2 3
1 3 4
1 4
2 3 4
3 4

This representation then allows the attribute/item identifiers to be used as indexes into arrays etc., or to have arithmetic applied to them. The numeric representation is not so good for discussion purposes because it is sometimes difficult to distinguish between the attribute labels 1 and 2 and the attribute label 12.
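
As a rough illustration of why integer identifiers are convenient, the sketch below maps the letters of the example to integers starting at 1 and then uses those integers directly as array indexes to count how many records contain each item. The names `encode`, `numeric_records` and `counts` are assumptions made for this example and do not come from the LUCS-KDD software.

```python
# The example data set in its lettered form.
letter_records = [
    ["A", "B", "C"],
    ["A", "C", "D"],
    ["A", "D"],
    ["B", "C", "D"],
    ["C", "D"],
]


def encode(records):
    """Map single letters to integers starting at 1 (A -> 1, B -> 2, ...)."""
    return [[ord(item) - ord("A") + 1 for item in record] for record in records]


numeric_records = encode(letter_records)

# Item identifiers double as array indexes: counts[i] is the number of
# records containing item i. Index 0 is unused because numbering starts at 1.
num_items = 4
counts = [0] * (num_items + 1)
for record in numeric_records:
    for item in record:
        counts[item] += 1

print(numeric_records)
print(counts[1:])  # per-item record counts for items 1..4
```

Leaving index 0 unused wastes one array cell but keeps the item number and the array index identical, which avoids off-by-one adjustments elsewhere.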

The LUCS-KDD data mining software systems all use the above numeric representation for the input data.
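
A reader for data in this numeric form might look like the following sketch. It assumes one record per line with item numbers separated by white space; that layout is an assumption made for illustration and is not stated in the text above. The function name `parse_records` is likewise hypothetical.

```python
def parse_records(text):
    """Parse records from text, assuming one record per line and
    item numbers separated by white space. Blank lines are skipped."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore empty lines
        records.append([int(field) for field in fields])
    return records


# The example data set written out in the assumed layout.
sample = "1 2 3\n1 3 4\n1 4\n2 3 4\n3 4\n"
records = parse_records(sample)
print(records)
```

Splitting on white space also keeps the labels 1 and 2 distinct from the label 12, since "1 2" yields two fields and "12" yields one.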