THE LUCS-KDD DISCRETISED/NORMALISED (VERSION 2) ARM AND CARM DATA LIBRARY

A SELECTION OF DISCRETISED DATA SETS




Frans Coenen

Department of Computer Science

The University of Liverpool

Thursday 27 May 2004

Revisions and additions: 22 June 2005, 14 December 2006

CONTENTS

1. Introduction.
 
2. Data Sets.



1. INTRODUCTION

This page makes available a number (27) of data sets, mostly taken from the UCI machine learning repository [1] and discretised using the LUCS-KDD DN software [2]. Where a data set has been obtained from elsewhere this is indicated. Discretisation has been carried out assuming a maximum of 5 "divisions". Note that all the discretised files have been "zipped" using gzip. The files are intended for use with Association Rule Mining (ARM) and Classification Association Rule Mining (CARM) software, which requires binary-valued input data, but may well have further uses.

In each case the file name describes the key characteristics of the data set in the form in which it was discretised. For example, the label adult.D131.N48842.C2 denotes the "adult" data set, which includes 48842 records in 2 classes, with attributes that, for the experiments described here, have been discretised into 131 binary categories. Details of the discretisation in each case are available.
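The naming convention just described can be decoded mechanically. The following is a minimal sketch, not part of the LUCS-KDD software: the names parse_label and DataSetLabel are hypothetical, and the handling of an optional ".num" or ".gz" suffix is an assumption based on the file names listed on this page and the fact that the files are gzipped.

```python
import re
from typing import NamedTuple


class DataSetLabel(NamedTuple):
    """Fields encoded in a label such as adult.D131.N48842.C2 (hypothetical type)."""
    name: str        # data set name, e.g. "adult"
    categories: int  # D: number of binary categories after discretisation
    records: int     # N: number of records
    classes: int     # C: number of classes


# Assumption: an optional ".num" and/or ".gz" suffix may follow the label.
_LABEL = re.compile(
    r"^(?P<name>.+?)\.D(?P<d>\d+)\.N(?P<n>\d+)\.C(?P<c>\d+)(?:\.num)?(?:\.gz)?$"
)


def parse_label(label: str) -> DataSetLabel:
    """Split a label of the form name.D<d>.N<n>.C<c> into its parts."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"label does not follow name.D<d>.N<n>.C<c>: {label!r}")
    return DataSetLabel(
        match["name"], int(match["d"]), int(match["n"]), int(match["c"])
    )


if __name__ == "__main__":
    print(parse_label("adult.D131.N48842.C2"))
```

A label that does not contain all three of the D, N and C fields in that order is rejected with a ValueError rather than guessed at.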

If you make use of these discretised data sets the author would appreciate appropriate acknowledgement. The following reference format for referring to this page is suggested:

Coenen, F. (2003), The LUCS-KDD Discretised/normalised ARM and CARM Data Library, http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/, Department of Computer Science, The University of Liverpool, UK.




2. DATA SETS

The available datasets are as follows.

NAME                              Size unzipped (Bytes)
adult.D97.N48842.C2.num           2034781
anneal.D73.N898.C6.num            36705
auto.D137.N205.C7.num             17150
breast.D20.N699.C2.num            18126
car.D25.N1728.C4.num              32400
chessKRvK.D58.N28056.C18.num      529596
congres.D34.N435.C2.num           19139
connect4.D129.N67557.C3.num       9187752
cylBands.D124.N540.C2.num         56675
dematology.D49.N366.C6.num [3]    13526
ecoli.D34.N336.C8.num             7485
flare.D39.N1389.C9.num            43153
glass.D48.N214.C7.num             5946
heart.D52.N303.C5.num             11943
hepatitis.D56.N155.C2.num         8335
horseColic.D85.D368.C2.num        18303
ionosphere.D157.N351.C2.num       40300
iris.D19.N150.C3.num              2017
led7.D24.N3200.C10.num            62154
letRecog.D106.N20000.C26.num      992990
lymphography.D59.N148.C4.num      (Restricted Access)
mushroom.D90.N8124.C2.num         539424
nursery.D32.N12960.C5.num         320760
pageBlocks.D46.N5473.C5.num       169663
penDigits.D89.N10992.C10.num      543038
pima.D38.N768.C2.num              18462
soybean-large.D118.N683.C19.num   66107
ticTacToe.D29.N958.C2.num         25866
waveform.D101.N5000.C3.num        324562
wine.D68.N178.C3.num              7146
zoo.D42.N101.C7.num               4670

A "tarball", dataSets.tgz (1.2 MBytes), containing all of the above (except the lymphography set) is also available. To unpack the "tarball" use the Linux command:

tar -zxf dataSets.tgz

Then use gunzip to unzip individual files. Please contact me if any problems are encountered.
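For users without tar and gunzip to hand, the two unpacking steps above can also be scripted. The following is a minimal sketch using only the Python standard library; the function name unpack_library is hypothetical, and it assumes that the individual files inside the tarball carry a ".gz" suffix.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_library(tarball: str, dest: str) -> list[Path]:
    """Extract the tarball into dest, then gunzip every .gz file found there."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: equivalent of "tar -zxf dataSets.tgz".
    with tarfile.open(tarball, "r:gz") as archive:
        archive.extractall(dest_dir)

    # Step 2: equivalent of running gunzip on each extracted file.
    # Assumption: compressed members are recognisable by a ".gz" suffix.
    unpacked = []
    for gz_path in sorted(dest_dir.rglob("*.gz")):
        target = gz_path.with_suffix("")  # drop the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as out:
            shutil.copyfileobj(src, out)
        gz_path.unlink()  # gunzip removes the compressed file as well
        unpacked.append(target)
    return unpacked


if __name__ == "__main__":
    for path in unpack_library("dataSets.tgz", "dataSets"):
        print(path)
```

The function returns the paths of the uncompressed files, so the result can be passed straight on to whatever ARM or CARM software is being used.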




REFERENCES

  1. Blake, C.L. & Merz, C.J. (1998), UCI Repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, CA: University of California, Department of Information and Computer Science.
  2. Coenen, F. (2003), LUCS-KDD DN Software, http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS_KDD_DN/, Department of Computer Science, The University of Liverpool, UK.
  3. NeuNet Pro 2.3 for Windows WWW site. http://www.cormactech.com/neunet/download.html.



Created and maintained by Frans Coenen. Last updated 03 March 2008