THE LUCS-KDD IMPLEMENTATIONS OF FOIL (FIRST ORDER INDUCTIVE LEARNER)


Frans Coenen

Department of Computer Science

The University of Liverpool

13 February 2004


DISCLAIMER! The following description of the FOIL algorithm is that implemented by the author of this WWW page, i.e. it may not be identical to that first produced by Ross Quinlan but certainly displays all the "salient" features.

CONTENTS

1. Overview of FOIL algorithm.
2. Rule pruning.
3. Testing.
4. Worked Example.
5. Conclusions.

Note: A C implementation of Foil (by Ross Quinlan and Mike Cameron-Jones) is available by anonymous ftp from ftp.cs.su.oz.au, file pub/foil6.sh.




1. OVERVIEW OF FOIL ALGORITHM


The FOIL algorithm (Quinlan and Cameron-Jones 1993) takes as input a (space separated) binary valued data set R and produces a set of CARs (Classification Association Rules). The resulting classifier, as generated by the LUCS-KDD FOIL algorithm described here, comprises a linked list of rules ordered according to their Laplace accuracy (Clark and Boswell 1991).

The FOIL algorithm commences by, for each class in C (the set of classes), producing two arrays of positive (P) and negative (N) example records. Thus given a data set of the form:

1 2 5
1 5
2 5
3 4 6
3 6
4 6

for the class 5 we would generate the following P and N arrays:

P = {{1,2,5},{1,5},{2,5}}
N = {{3,4,6},{3,6},{4,6}}

Note that R = P union N. The LUCS-KDD implementation of the FOIL algorithm also makes use of a 2-D attribute array (A). The columns (first indexes) of A represent attribute numbers (from 1 to D, where D is the number of attributes). The first element of each column (second index equals 0) is an indicator, set to 0 if the attribute has not yet been considered for inclusion in the antecedent of the current rule under consideration, and 1 otherwise. The second element (second index equals 1) is used to store gains (see below). Throughout the generation process copies of P, N and A (P', N' and A') are continuously revised.
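
By way of illustration, the following Python fragment shows how the P and N arrays might be derived for a given class. It is a sketch only, not the LUCS-KDD code; the function name partition is my own.

def partition(R, c):
    # Split a binary-valued data set R into positive examples
    # (records containing the class attribute c) and negative
    # examples (all remaining records).
    P = [rec for rec in R if c in rec]
    N = [rec for rec in R if c not in rec]
    return P, N

R = [{1, 2, 5}, {1, 5}, {2, 5}, {3, 4, 6}, {3, 6}, {4, 6}]
P, N = partition(R, 5)
print(P)   # [{1, 2, 5}, {1, 5}, {2, 5}]
print(N)   # [{3, 4, 6}, {3, 6}, {4, 6}]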

Rules are built up by repeatedly adding an attribute to the antecedent of a current rule (commencing with an empty antecedent) according to the gain measure.

The LUCS-KDD implementation is broadly described by the pseudocode given in Table 1.

method: startFOIL
Parameters: none
Global access to: R, C
----------------------------------------------

generate an empty global attributes array A
for each c in C
    generate global P and N example arrays
    while P != {}
        N' <-- N, P' <-- P, A' <-- A
        if no attribute can produce a gain above
                the required minimum: break
        foilGeneration({}, c)
    end loop
end loop

Table 1: The startFOIL method.

The foilGeneration/2 method recursively adds attributes to the antecedent of a proposed rule until either:

  1. The effect of adding another attribute would have no positive influence on the classifier (this is measured in terms of gain).
  2. A user specified maximum size (MAX_NUM_ATTS_IN_ANTECEDENT) for the antecedent has been reached.
  3. The negative example set (N') has been reduced to {}.

The method takes two arguments, the antecedent so far and the class (consequent), and is described in Table 2.

method: foilGeneration(ante, cons)
Input: parameters ante (antecedent so far) and
	cons (consequent) for current rule.
Access to: A, A', N, N', P and P'
-------------------------------------------------

for each a in A'
    if (a not in ante [1]) calculate the gain for a
            and store it in A'
end loop
i = "available" column in A' with best gain
if (A'[i][1] <= MIN_BEST_GAIN)
    add rule (ante -> cons) to rule list
    for all records in P reduce weighting by the decay
            factor and adjust the P and N arrays accordingly
    return

ante = ante union {i}
A'[i][0] = 1 (i.e. column i is no longer "available")
remove records from P' and N' which do NOT contain all
            the attributes in ante

if (|ante| >= MAX_NUM_ATTS_IN_ANTECEDENT or N' == {})
    add rule (ante -> cons) to rule list
    remove records from P that contain ante
    return
else foilGeneration(ante, cons)

Table 2: The foilGeneration method.


NOTES:

  1. An attribute is not in the antecedent (ante) if the appropriate cell in the attributes array (A) has the value 0.
  2. In experiments MAX_NUM_ATTS_IN_ANTECEDENT has typically been set to 4, and MIN_BEST_GAIN to 0.7.
  3. Gain is calculated using natural logarithms as follows:

     gain(a) = |P'| x ( log(|P'| / (|P'| + |N'|)) - log(|P| / (|P| + |N|)) )

     where P and N are the current positive and negative example sets, and P' and N' are the subsets of P and N whose records contain the candidate attribute a (together with any attributes already in ante).
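
Pulling Tables 1 and 2 and the gain measure together, the following Python sketch illustrates the overall generation process. It is illustrative only and is NOT the LUCS-KDD code: the attributes array is omitted (it exists purely for efficiency), covered records are removed rather than re-weighted by the decay factor, and all function and variable names are my own.

import math

MAX_NUM_ATTS_IN_ANTECEDENT = 4
MIN_BEST_GAIN = 0.7

def gain(P, N, Pa, Na):
    # FOIL gain for a candidate attribute; Pa/Na are the records in
    # P/N that contain it (natural logarithms, as in note 3 above).
    if not Pa:
        return float("-inf")
    return len(Pa) * (math.log(len(Pa) / (len(Pa) + len(Na)))
                      - math.log(len(P) / (len(P) + len(N))))

def foilGeneration(ante, cons, P, N, P2, N2, atts, rules):
    # Recursively grow one rule (cf. Table 2); returns the revised P.
    gains = {a: gain(P, N, [r for r in P2 if a in r],
                     [r for r in N2 if a in r]) for a in atts - ante}
    best = max(gains, key=gains.get)
    if gains[best] <= MIN_BEST_GAIN:
        if ante:  # finalise the rule built so far (the LUCS-KDD version
                  # re-weights covered records by a decay factor instead)
            rules.append((frozenset(ante), cons))
        return [r for r in P if not ante or not ante <= r]
    ante = ante | {best}
    P2 = [r for r in P2 if best in r]  # discard examples that do not
    N2 = [r for r in N2 if best in r]  # contain the extended antecedent
    if len(ante) >= MAX_NUM_ATTS_IN_ANTECEDENT or not N2:
        rules.append((frozenset(ante), cons))
        return [r for r in P if not ante <= r]
    return foilGeneration(ante, cons, P, N, P2, N2, atts, rules)

def startFOIL(R, C):
    # Outer loop (cf. Table 1): generate rules for each class in turn.
    rules = []
    atts = set().union(*R) - set(C)  # attribute columns, classes excluded
    for c in C:
        P = [r for r in R if c in r]      # positive examples for c
        N = [r for r in R if c not in r]  # negative examples for c
        while P:
            newP = foilGeneration(set(), c, P, N, list(P), list(N),
                                  atts, rules)
            if len(newP) == len(P):  # no attribute gave sufficient gain
                break
            P = newP
    return rules  # these would then be ordered by Laplace accuracy

R = [frozenset(r) for r in ([1,2,5], [1,5], [2,5], [3,4,6], [3,6], [4,6])]
print(startFOIL(R, [5, 6]))
# e.g. [(frozenset({1}), 5), (frozenset({2}), 5),
#       (frozenset({3}), 6), (frozenset({4}), 6)]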



2. RULE PRUNING


Unlike many other CARM algorithms, FOIL does not include a rule pruning phase. This is because FOIL already generates a small number of rules: each record in the training set may be covered by only one rule in the classifier.




3. TESTING


During testing, each record in the test set is classified using the following procedure (as recommended in Yin and Han 2003); a sketch of the procedure is given after the list:

  1. Obtain all rules whose antecedent is a subset of the given record.
  2. Select the best K rules for each class according to their Laplace accuracy (in experiments K was set to 5).
  3. Determine the average expected accuracy over the selected rules for each class.
  4. Select the class with the best average expected accuracy.
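
The following Python sketch illustrates this procedure. It is illustrative only; the storage of rules as (antecedent, class, Laplace accuracy) triples is my own assumption.

def classify(record, rules, classes, k=5):
    # Return the class whose best k matching rules have the highest
    # average Laplace accuracy (expected accuracy), or None if no
    # rule fires for the record.
    bestClass, bestAvg = None, -1.0
    for c in classes:
        # Laplace accuracies of the rules for c whose antecedent is a
        # subset of the record; keep the best k only.
        accs = sorted((lap for ante, cons, lap in rules
                       if cons == c and ante <= record), reverse=True)[:k]
        if accs:
            avg = sum(accs) / len(accs)
            if avg > bestAvg:
                bestClass, bestAvg = c, avg
    return bestClass

rules = [(frozenset({8}), 41, 6/7), (frozenset({33}), 41, 2/3),
         (frozenset({10}), 42, 6/7)]
print(classify(frozenset({1, 8, 13, 17, 21, 27, 31, 36}), rules, [41, 42]))
# prints 41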



4. WORKED EXAMPLE

Given the following data set (the first 15 lines from the Pima Indians data set):

2 9 13 17 21 28 32 38 42 
1 8 13 17 21 27 31 36 41 
3 10 13 16 21 27 32 36 42 
1 8 13 17 21 28 31 36 41 
1 9 12 17 21 29 35 37 42 
2 8 14 16 21 27 31 36 41 
1 7 13 17 21 28 31 36 42 
3 8 11 16 21 28 31 36 41 
1 10 13 18 24 28 31 38 42 
3 9 14 16 21 26 31 38 42 
2 8 14 16 21 28 31 36 41 
3 10 14 16 21 28 31 37 42 
3 9 14 16 21 28 33 39 41 
1 10 13 17 25 28 31 39 42 
2 10 13 16 22 27 32 38 42 

We commence with class 41 and create the following P and N sets and the attribute array A. Column numbers in the attributes array indicate attribute labels 1 to 40.

positiveExamples:
 {1 8 13 17 21 27 31 36 41}
 {1 8 13 17 21 28 31 36 41}
 {2 8 14 16 21 27 31 36 41}
 {3 8 11 16 21 28 31 36 41}
 {2 8 14 16 21 28 31 36 41}
 {3 9 14 16 21 28 33 39 41}

negativeExamples:
 {2 9 13 17 21 28 32 38 42}
 {3 10 13 16 21 27 32 36 42}
 {1 9 12 17 21 29 35 37 42}
 {1 7 13 17 21 28 31 36 42}
 {1 10 13 18 24 28 31 38 42}
 {3 9 14 16 21 26 31 38 42}
 {3 10 14 16 21 28 31 37 42}
 {1 10 13 17 25 28 31 39 42}
 {2 10 13 16 22 27 32 38 42}

attributes:
Col 1: 0.0 0.0
Col 2: 0.0 0.0
Col 3: 0.0 0.0
Col 4: 0.0 0.0
Col 5: 0.0 0.0
Col 6: 0.0 0.0
Col 7: 0.0 0.0
Col 8: 0.0 0.0
Col 9: 0.0 0.0
Col 10: 0.0 0.0
Col 11: 0.0 0.0
Col 12: 0.0 0.0
Col 13: 0.0 0.0
Col 14: 0.0 0.0
Col 15: 0.0 0.0
Col 16: 0.0 0.0
Col 17: 0.0 0.0
Col 18: 0.0 0.0
Col 19: 0.0 0.0
Col 20: 0.0 0.0
Col 21: 0.0 0.0
Col 22: 0.0 0.0
Col 23: 0.0 0.0
Col 24: 0.0 0.0
Col 25: 0.0 0.0
Col 26: 0.0 0.0
Col 27: 0.0 0.0
Col 28: 0.0 0.0
Col 29: 0.0 0.0
Col 30: 0.0 0.0
Col 31: 0.0 0.0
Col 32: 0.0 0.0
Col 33: 0.0 0.0
Col 34: 0.0 0.0
Col 35: 0.0 0.0
Col 36: 0.0 0.0
Col 37: 0.0 0.0
Col 38: 0.0 0.0
Col 39: 0.0 0.0
Col 40: 0.0 0.0

The number of positive examples is 6 and the number of negative examples is 9 (note that the latter will remain constant for the class in question). We then process the set P until it is empty. First we make copies of P, N and A: P', N' and A'. We then calculate the gain for each attribute were it to be added to the rule antecedent (which is currently empty) and place the resulting gains in the attributes array. The result is as follows:

Col 1: 0.0 -0.3646431135879096 
Col 2: 0.0 0.4462871026284194 
Col 3: 0.0 0.0 
Col 4: 0.0 0.0 
Col 5: 0.0 0.0 
Col 6: 0.0 0.0 
Col 7: 0.0 0.0 
Col 8: 0.0 4.581453659370775 
Col 9: 0.0 -0.4700036292457356 
Col 10: 0.0 0.0 
Col 11: 0.0 0.916290731874155 
Col 12: 0.0 0.0 
Col 13: 0.0 -0.9400072584914712 
Col 14: 0.0 1.216395324324493 
Col 15: 0.0 0.0 
Col 16: 0.0 0.8925742052568388 
Col 17: 0.0 -0.3646431135879096 
Col 18: 0.0 0.0 
Col 19: 0.0 0.0 
Col 20: 0.0 0.0 
Col 21: 0.0 1.3388613078852583 
Col 22: 0.0 0.0 
Col 23: 0.0 0.0 
Col 24: 0.0 0.0 
Col 25: 0.0 0.0 
Col 26: 0.0 0.0 
Col 27: 0.0 0.4462871026284194 
Col 28: 0.0 0.4214420626313049 
Col 29: 0.0 0.0 
Col 30: 0.0 0.0 
Col 31: 0.0 1.1157177565710485 
Col 32: 0.0 0.0 
Col 33: 0.0 0.916290731874155 
Col 34: 0.0 0.0 
Col 35: 0.0 0.0 
Col 36: 0.0 2.899092476264711 
Col 37: 0.0 0.0 
Col 38: 0.0 0.0 
Col 39: 0.0 0.2231435513142097 
Col 40: 0.0 0.0 
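
As a check on these values: attribute 8 appears in five of the six positive examples and in none of the nine negative examples, so gain(8) = 5 x (ln(5/5) - ln(6/15)) = 5 x 0.9163 = 4.5815 (to 4 decimal places), the value shown for column 8 above.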
 

The first value is set to 0.0 to indicate that the attribute is available to be included in a rule (i.e. not already in the current rule antecedent); the second holds the gain value once calculated. When an attribute is no longer available the first value is set to 1.0. The attributes array is used purely to enhance the computational efficiency of the algorithm.

If no gain were above the user specified minimum we would add the rule so far to the rule list. However, there are attributes with a gain above the minimum. The attribute with the best gain is attribute 8 (gain = 4.581453659370775). This attribute is added to the rule antecedent, giving the rule 8 -> 41, and P' and N' are adjusted so that all examples which do not contain attribute 8 are removed:

positiveExamples2:
 {1 8 13 17 21 27 31 36 41} 
 {1 8 13 17 21 28 31 36 41} 
 {2 8 14 16 21 27 31 36 41} 
 {3 8 11 16 21 28 31 36 41} 
 {2 8 14 16 21 28 31 36 41} 

negativeExamples2: Null

As there are no negative examples left we add the rule 8 -> 41 to the rule list and remove all examples from P (NOT P') that satisfy the rule. This will leave:

positiveExamples:
 {3 9 14 16 21 28 33 39 41} 

Because P is not empty we loop round again to try to find another rule for the class 41. The number of positive examples is now 1 and the number of negative examples is 9 (remember that the latter remains constant for the class in question). We again make copies of (the revised) P and the original N sets of examples to give P' and N', and again create an empty attributes array A' (by copying A). As before, we calculate the gain for each attribute and store it in the attributes array:

Col 1: 0.0 0.0
Col 2: 0.0 0.0
Col 3: 0.0 0.9162907318741549
Col 4: 0.0 0.0
Col 5: 0.0 0.0
Col 6: 0.0 0.0
Col 7: 0.0 0.0
Col 8: 0.0 0.0
Col 9: 0.0 0.9162907318741549
Col 10: 0.0 0.0
Col 11: 0.0 0.0
Col 12: 0.0 0.0
Col 13: 0.0 0.0
Col 14: 0.0 1.2039728043259357
Col 15: 0.0 0.0
Col 16: 0.0 0.6931471805599452
Col 17: 0.0 0.0
Col 18: 0.0 0.0
Col 19: 0.0 0.0
Col 20: 0.0 0.0
Col 21: 0.0 0.356674943938732
Col 22: 0.0 0.0
Col 23: 0.0 0.0
Col 24: 0.0 0.0
Col 25: 0.0 0.0
Col 26: 0.0 0.0
Col 27: 0.0 0.0
Col 28: 0.0 0.5108256237659905
Col 29: 0.0 0.0
Col 30: 0.0 0.0
Col 31: 0.0 0.0
Col 32: 0.0 0.0
Col 33: 0.0 2.3025850929940455
Col 34: 0.0 0.0
Col 35: 0.0 0.0
Col 36: 0.0 0.0
Col 37: 0.0 0.0
Col 38: 0.0 0.0
Col 39: 0.0 1.6094379124341
Col 40: 0.0 0.0
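
Again as a check: attribute 33 appears in the single remaining positive example and in none of the nine negative examples, so gain(33) = 1 x (ln(1/1) - ln(1/10)) = ln(10) = 2.3026 (to 4 decimal places), as shown for column 33 above.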

The attribute with the best gain is attribute 33 (gain = 2.3025850929940455). This is added to the rule antecedent, giving the rule 33 -> 41, and P' and N' are adjusted so that all examples which do not contain attribute 33 are removed:

positiveExamples2:
 {3 9 14 16 21 28 33 39 41}

negativeExamples2: Null

There are no negative examples left, thus we add the rule 33 -> 41 to the rule list and remove all examples from P (NOT P') that satisfy the rule. This will leave:

positiveExamples: Null

The example set P is now empty so we generate no more rules for class 41. We thus go on to the second class, 42, and proceed in the same manner.

At the end of the process we will have the following rule list. Note that rules are inserted into the rule list, as they are derived, in order of their calculated Laplace accuracy (derived from each rule's Laplace expected error estimate):

(1)  {8}  ->  {41}  0.8571428571428571
(2)  {10}  ->  {42}  0.8571428571428571
(3)  {33}  ->  {41}  0.6666666666666666
(4)  {9}  ->  {42}  0.6666666666666666
(5)  {7}  ->  {42}  0.6666666666666666
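
The accuracy values listed above are consistent with the Laplace expected accuracy estimate of Clark and Boswell (1991):

    laplaceAccuracy = (Nc + 1) / (Ntotal + |C|)

where Ntotal is the number of training records covered by the rule's antecedent, Nc is the number of those records belonging to the rule's consequent class, and |C| is the number of classes. For example, rule (1) above, {8} -> {41}, covers five training records, all of class 41, giving (5 + 1) / (5 + 2) = 6/7 = 0.8571...; rule (3), {33} -> {41}, covers a single record of class 41, giving (1 + 1) / (1 + 2) = 2/3.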



5. CONCLUSIONS


The LUCS-KDD implementation of FOIL described here has been used by the LUCS-KDD research group to contrast and compare a variety of CARM algorithms and techniques. The software is available free of charge; however, the author would appreciate appropriate acknowledgement. The following format for referencing the software is suggested:

Coenen, F. (2004), LUCS-KDD implementations of FOIL (First Order Inductive Learner), http://www.csc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/foil.html, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.




REFERENCES


  1. Clark, P. and Boswell, R. (1991). Rule Induction with CN2: Some Recent Improvements. Proc. European Working Session on Learning (EWSL'91), Porto, Portugal, pp151-163.
  2. Quinlan, J. R. and Cameron-Jones, R. M. (1993). FOIL: A Midterm Report. Proc. ECML, Vienna, Austria, pp3-20.
  3. Yin, X. and Han, J. (2003). CPAR: Classification based on Predictive Association Rules. Proc. SIAM Int. Conf. on Data Mining (SDM'03), San Francisco, CA, pp331-335.



Created and maintained by Frans Coenen. Last updated 23 June 2005