THE LUCS-KDD IMPLEMENTATIONS OF PRM (PREDICTIVE RULE MINING)



Liverpool University

Frans Coenen

Department of Computer Science

The University of Liverpool

15 February 2004

DISCLAIMER! The following description of the PRM algorithm is that implemented by the author of this WWW page, i.e. it may not be identical to that reported in Yin and Han (2003), but certainly displays all the "salient" features of the PRM algorithm.

CONTENTS

1. Overview of PRM algorithm.
2. Rule pruning.
3. Testing.
4. Worked Example.
5. Conclusions.

NOTE: The description given here should be read in the context of the LUCS-KDD implementation of the FOIL (First Order Inductive Learner) algorithm developed by Quinlan and Cameron-Jones.




1. OVERVIEW OF PRM ALGORITHM

The PRM (Predictive Rule Mining) algorithm (Yin and Han 2003) is an extension of FOIL (Quinlan and Cameron-Jones 1993). The enhancements are as follows:

  1. Instead of removing records from the positive examples array P once a rule has been derived, the records in P have a weighting associated with them (a number between 0.0 and 1.0) which is reduced by a decay factor. Records are therefore never removed from P; instead P is processed until its total weight drops below some user-specified Total Weight Threshold (TWT). Thus, whereas in FOIL each record in the training set may be covered by only one rule in the classifier, in PRM each training record will usually be covered by more than one rule.
  2. Gain values are calculated using the total weightings in the positive and negative example sets instead of simply counting up the number of records in the example sets.
  3. To reduce the overhead of repeatedly passing through the example sets to calculate the total weighting (to be used, in turn, to calculate the gains associated with each attribute) PRM uses a "PN array" in which to store the total positive and negative weightings associated with each attribute were that attribute to be included in the current rule.
  4. PRM does not enforce a maximum on the number of attributes that can be included in a rule antecedent (unlike FOIL).

The LUCS-KDD implementation of PRM has the additional feature that, instead of representing the example sets as 2-D arrays (as in the LUCS-KDD FOIL implementation), the example sets are represented as arrays of structures (ExamplesStruct) with two fields: (1) the record (i.e. an attribute list), and (2) the current weighting associated with the record.
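
A minimal sketch of such a structure is given below. The class name follows the description above; the field names are illustrative assumptions and may not match those used in the actual LUCS-KDD source.

// Illustrative sketch only: field names are assumptions, not necessarily
// those used in the LUCS-KDD source code.
public class ExamplesStruct {
    int[] record;       // the record, i.e. a list of attribute labels
    double weighting;   // the current weighting associated with the record

    public ExamplesStruct(int[] record) {
        this.record = record;
        this.weighting = 1.0;   // every record starts with a full weight of 1.0
    }
}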

Like FOIL, PRM takes as input a (space separated) binary-valued data set R and produces a set of CARs (Classification Association Rules). The classifier produced by the LUCS-KDD PRM implementation comprises a linked list of rules ordered according to the Laplace accuracy (Clark and Boswell 1991) of the individual rules.
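
For reference, the Laplace accuracy of a rule (Clark and Boswell 1991) can be computed as sketched below; the class and method names are illustrative and are not taken from the LUCS-KDD source.

// Sketch: Laplace expected accuracy of a rule.
//   nCorrect   = number of training records satisfying the antecedent that
//                also belong to the rule's class
//   nTotal     = number of training records satisfying the antecedent
//   numClasses = number of classes in the data set
public class LaplaceAccuracy {
    public static double laplace(double nCorrect, double nTotal, int numClasses) {
        return (nCorrect + 1.0) / (nTotal + numClasses);
    }

    public static void main(String[] args) {
        // A rule covering 5 positive and 0 negative examples in a two class
        // data set has Laplace accuracy (5+1)/(5+2) = 0.86, as in the worked
        // example in Section 4.
        System.out.println(laplace(5, 5, 2));
    }
}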

The PRM algorithm commences by, for each class, producing two global arrays of ExamplesStruct structures representing the positive (P) and negative (N) example sets. Like the LUCS-KDD implementation of FOIL, the PRM implementation also makes use of an attributes array A.

Again, rules are built up by repeatedly adding an attribute to the antecedent of a current rule (commencing with an empty antecedent) according to the gain measure (also used in FOIL). The LUCS-KDD implementation of PRM is broadly described in Table 1.

method: startPRM
Parameters: none
Global access to: R, C
----------------------------------------------

generate an empty global attributes array A

For each c in C
    generate global P and N example arrays
    generate global PN array
    determine the Total Weight Threshold (TWT) for P

    while (total weight of P > TWT)
        A' <-- A, N' <-- N, P' <-- P, PN' <-- PN
        if no attribute exists with weightings that can
                produce a gain above the minimum, break
        prmGeneration({},c)
    end loop
end loop

Table 1: The startPRM method.

The prmGeneration/2 procedure repeatedly adds attributes to the antecedent of a proposed rule until either:

  1. There are no more attributes to add.
  2. The effect of adding another attribute has no positive influence on the classifier (this is measured in terms of gain).

The procedure is recursive and has two arguments: the antecedent so far and the class. The procedure is described in Table 2.

method: prmGeneration/2
Parameters: ante (antecedent so far) and
	cons (consequent) for the current rule.
Global access to: A, A', N, N', P, P', PN, PN'
----------------------------------------------

for each a in A
    if (a not in ante^1) calculate gain using
            information in the PN' array and add to attributes array A'
end loop
i = "available" column in A' with best gain
if (A'[i][1] <= MIN_BEST_GAIN)
    add ante -> cons to rule list
    for all records in P reduce weighting by decay factor
            and adjust PN array accordingly
    return

ante = ante union {i}
A'[i][0] = 1 (i.e. column i is no longer "available")
remove records in P' and N' which DO NOT contain attribute i
        and adjust PN' array accordingly

if (N' == {})
    add ante -> cons to rule list
    for all records in P reduce weighting by decay factor
            and adjust PN array accordingly
    return
else prmGeneration(ante,cons)

Table 2: The prmGeneration method.

1. An attribute is not in the antecedent if the appropriate cell in the attributes array has the value 0.

In their reported experiments Yin and Han (2003) used a MIN_BEST_GAIN of 0.7 and a decay factor of 1/3. Using this decay factor, new weightings are calculated thus:

newWeighting = oldWeighting*decayFactor

Gain is then calculated as follows, where WP' and WN' are the total weightings that P' and N' would have if the candidate attribute a were added to the rule antecedent (these values are held in the PN array):

WP  = totalWeight(P)
WN  = totalWeight(N)
WP' = totalWeight(P')
WN' = totalWeight(N')
gain(a) = WP' * (log(WP'/(WP'+WN')) - log(WP/(WP+WN)))

The Total Weight Threshold (TWT) for P is calculated by multiplying the start weight of P by a TOTAL_WEIGHT_FACTOR; Yin and Han used a TOTAL_WEIGHT_FACTOR of 0.05 in their reported experiments.
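
The following sketch draws the above together. The class, method and constant names are illustrative (only the three constant values are taken from Yin and Han 2003), and natural logarithms are assumed; this assumption reproduces the gain values in the worked example of Section 4.

// Illustrative sketch of the weight decay, gain and TWT calculations.
public class PrmMeasures {
    static final double MIN_BEST_GAIN       = 0.7;       // Yin and Han (2003)
    static final double DECAY_FACTOR        = 1.0 / 3.0; // Yin and Han (2003)
    static final double TOTAL_WEIGHT_FACTOR = 0.05;      // Yin and Han (2003)

    // New weighting for a positive example covered by a newly generated rule.
    static double decay(double oldWeighting) {
        return oldWeighting * DECAY_FACTOR;
    }

    // FOIL style gain using total weightings (natural logarithms assumed).
    // wpDash/wnDash: total positive/negative weight if attribute a were added;
    // wp/wn: total positive/negative weight of the current P and N sets.
    static double gain(double wpDash, double wnDash, double wp, double wn) {
        return wpDash * (Math.log(wpDash / (wpDash + wnDash))
                       - Math.log(wp / (wp + wn)));
    }

    // Total Weight Threshold for a P set with the given start weight.
    static double twt(double startWeightOfP) {
        return startWeightOfP * TOTAL_WEIGHT_FACTOR;
    }

    public static void main(String[] args) {
        double g = gain(5.0, 0.0, 6.0, 9.0);
        System.out.println(g);                 // approx. 4.58 (attribute 8 in Section 4)
        System.out.println(g > MIN_BEST_GAIN); // true, so the attribute would be selected
        System.out.println(twt(6.0));          // 0.3, the TWT used in Section 4
        System.out.println(decay(1.0));        // approx. 0.33
    }
}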




2. RULE PRUNING


It is possible, using PRM, to generate two or more rules with identical antecedents and consequents. This is because records are not removed from the P and N example sets once a rule that satisfies them has been found; instead these records simply have their weighting reduced. The reduced weighting forms part of the gain calculation, however it is still possible for the same attributes to be selected to form the antecedent of a rule because these attributes (despite their reduced weighting) still produce the best gain. Eventually the weighting for the affected records is reduced so far that the attributes no longer produce the best gain and are therefore not selected. Where duplicate rules occur, those with the lower accuracy are removed from the rule list.
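
The pruning step might be coded along the following lines (an illustrative sketch only; the LUCS-KDD source will differ in detail).

import java.util.*;

// Sketch: remove duplicate rules (identical antecedent and consequent),
// keeping the copy with the highest Laplace accuracy.
public class RulePruning {
    static class Rule {
        Set<Integer> antecedent;
        int consequent;
        double laplaceAccuracy;
        Rule(Set<Integer> a, int c, double l) {
            antecedent = a; consequent = c; laplaceAccuracy = l;
        }
    }

    static List<Rule> pruneDuplicates(List<Rule> rules) {
        Map<String, Rule> best = new LinkedHashMap<>();
        for (Rule r : rules) {
            // Normalise the antecedent order so identical rules share a key.
            String key = new TreeSet<>(r.antecedent) + "->" + r.consequent;
            Rule kept = best.get(key);
            if (kept == null || r.laplaceAccuracy > kept.laplaceAccuracy) {
                best.put(key, r);   // keep the higher accuracy copy
            }
        }
        return new ArrayList<>(best.values());
    }
}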




3. TESTING


During testing each record in the test set is classified using the following procedure (as recommended in Yin and Han 2003); a sketch is given after the list:

  1. Obtain all rules whose antecedent is a subset of the given record.
  2. Select the best K rules for each class according to their Laplace accuracy (in the reported experiments K was set to 5).
  3. Determine the average expected accuracy over the selected rules for each class.
  4. Select the class with the best average expected accuracy.
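
A sketch of this classification procedure is given below; it is an interpretation of the four steps above, using illustrative names, rather than the actual LUCS-KDD code.

import java.util.*;

// Sketch: classify a record using the best K rules per class
// (average Laplace accuracy, as described in steps 1 to 4 above).
public class PrmClassifier {
    static class Rule {
        Set<Integer> antecedent;
        int consequent;            // class label
        double laplaceAccuracy;
        Rule(Set<Integer> a, int c, double l) {
            antecedent = a; consequent = c; laplaceAccuracy = l;
        }
    }

    static int classify(Set<Integer> record, List<Rule> rules, int k) {
        // Step 1: rules whose antecedent is a subset of the record, grouped by class.
        Map<Integer, List<Double>> accuraciesByClass = new HashMap<>();
        for (Rule r : rules) {
            if (record.containsAll(r.antecedent)) {
                accuraciesByClass.computeIfAbsent(r.consequent, c -> new ArrayList<>())
                                 .add(r.laplaceAccuracy);
            }
        }
        // Steps 2 to 4: average the best k accuracies per class, pick the best class.
        int bestClass = -1;
        double bestAverage = -1.0;
        for (Map.Entry<Integer, List<Double>> e : accuraciesByClass.entrySet()) {
            List<Double> acc = e.getValue();
            acc.sort(Collections.reverseOrder());
            int n = Math.min(k, acc.size());
            double sum = 0.0;
            for (int i = 0; i < n; i++) sum += acc.get(i);
            double average = sum / n;
            if (average > bestAverage) {
                bestAverage = average;
                bestClass = e.getKey();
            }
        }
        return bestClass;   // -1 if no rule fires (a default class could be returned instead)
    }
}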



4. WORKED EXAMPLE

Given the following data set (the first 15 lines from the Pima Indians data set):

2 9 13 17 21 28 32 38 42 
1 8 13 17 21 27 31 36 41 
3 10 13 16 21 27 32 36 42 
1 8 13 17 21 28 31 36 41 
1 9 12 17 21 29 35 37 42 
2 8 14 16 21 27 31 36 41 
1 7 13 17 21 28 31 36 42 
3 8 11 16 21 28 31 36 41 
1 10 13 18 24 28 31 38 42 
3 9 14 16 21 26 31 38 42 
2 8 14 16 21 28 31 36 41 
3 10 14 16 21 28 31 37 42 
3 9 14 16 21 28 33 39 41 
1 10 13 17 25 28 31 39 42 
2 10 13 16 22 27 32 38 42 

We commence with class 41 and create the following P and N example sets and a PN array. The first column of the PN array, presented below, gives the attribute label (1 to 40); the following two columns give the total positive and negative weightings for the P and N example sets respectively. On "start up" each record has a weighting of 1.0 associated with it; for example, attribute 1 occurs twice in P and four times in N, giving weightings of 2.0 and 4.0 respectively.

positiveExamples:
 {1 8 13 17 21 27 31 36 41}  (1.0)
 {1 8 13 17 21 28 31 36 41}  (1.0)
 {2 8 14 16 21 27 31 36 41}  (1.0)
 {3 8 11 16 21 28 31 36 41}  (1.0)
 {2 8 14 16 21 28 31 36 41}  (1.0)
 {3 9 14 16 21 28 33 39 41}  (1.0)

negativeExamples:
 {2 9 13 17 21 28 32 38 42}  (1.0)
 {3 10 13 16 21 27 32 36 42}  (1.0)
 {1 9 12 17 21 29 35 37 42}  (1.0)
 {1 7 13 17 21 28 31 36 42}  (1.0)
 {1 10 13 18 24 28 31 38 42}  (1.0)
 {3 9 14 16 21 26 31 38 42}  (1.0)
 {3 10 14 16 21 28 31 37 42}  (1.0)
 {1 10 13 17 25 28 31 39 42}  (1.0)
 {2 10 13 16 22 27 32 38 42}  (1.0)

PN array
Att 1: 	2.0 	4.0
Att 2: 	2.0 	2.0
Att 3: 	2.0 	3.0
Att 4: 	0.0 	0.0
Att 5: 	0.0 	0.0
Att 6: 	0.0 	0.0
Att 7: 	0.0 	1.0
Att 8: 	5.0 	0.0
Att 9: 	1.0 	3.0
Att 10: 0.0 	5.0
Att 11: 1.0 	0.0
Att 12: 0.0 	1.0
Att 13: 2.0 	6.0
Att 14: 3.0 	2.0
Att 15: 0.0 	0.0
Att 16: 4.0 	4.0
Att 17: 2.0 	4.0
Att 18: 0.0 	1.0
Att 19: 0.0 	0.0
Att 20: 0.0 	0.0
Att 21: 6.0 	6.0
Att 22: 0.0 	1.0
Att 23: 0.0 	0.0
Att 24: 0.0 	1.0
Att 25: 0.0 	1.0
Att 26: 0.0 	1.0
Att 27: 2.0 	2.0
Att 28: 4.0 	5.0
Att 29: 0.0 	1.0
Att 30: 0.0 	0.0
Att 31: 5.0 	5.0
Att 32: 0.0 	3.0
Att 33: 1.0 	0.0
Att 34: 0.0 	0.0
Att 35: 0.0 	1.0
Att 36: 5.0 	2.0
Att 37: 0.0 	2.0
Att 38: 0.0 	4.0
Att 39: 1.0 	1.0
Att 40: 0.0 	0.0
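
The following sketch shows how such totals can be derived from weighted example sets (the names are illustrative). Weights are summed, rather than records counted, so the same code still applies once the weightings have been decayed.

// Sketch: build a PN array (total positive and negative weighting per
// attribute) from weighted example sets. For the Pima Indians fragment
// used here numAttributes is 40; class labels (41 and 42) are ignored.
public class PnArrayBuilder {
    public static double[][] buildPN(int[][] positives, double[] posWeights,
                                     int[][] negatives, double[] negWeights,
                                     int numAttributes) {
        double[][] pn = new double[numAttributes + 1][2];   // indices 1..numAttributes
        for (int i = 0; i < positives.length; i++)
            for (int att : positives[i])
                if (att <= numAttributes) pn[att][0] += posWeights[i];
        for (int i = 0; i < negatives.length; i++)
            for (int att : negatives[i])
                if (att <= numAttributes) pn[att][1] += negWeights[i];
        return pn;
    }
}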

The positive set of examples (P) comprises six records. This means that the total weight of P is 6.0 and the Total Weight Threshold (TWT) is:

TWT = totalWeight*TOTAL_WEIGHT_FACTOR
    = 6.0*0.05 = 0.3

The number of negative examples is 9 (note that the latter will remain constant for the class in question). We then process the set P until its total weight falls below the threshold. On each iteration we first make copies of P, N, A and PN: P', N', A' and PN'. We then calculate the gain (using the weightings held in the PN array) for each attribute if it were to be added to the rule antecedent (which currently is empty) and place the resulting gains in the attributes array:

Attributes:
Col 1: 0.0 -0.36
Col 2: 0.0 0.45
Col 3: 0.0 0.0 
Col 4: 0.0 0.0 
Col 5: 0.0 0.0 
Col 6: 0.0 0.0 
Col 7: 0.0 0.0 
Col 8: 0.0 4.58 
Col 9: 0.0 -0.47
Col 10: 0.0 0.0 
Col 11: 0.0 0.92
Col 12: 0.0 0.0 
Col 13: 0.0 -0.94
Col 14: 0.0 1.22
Col 15: 0.0 0.0 
Col 16: 0.0 0.89
Col 17: 0.0 -0.36
Col 18: 0.0 0.0 
Col 19: 0.0 0.0 
Col 20: 0.0 0.0 
Col 21: 0.0 1.34
Col 22: 0.0 0.0 
Col 23: 0.0 0.0 
Col 24: 0.0 0.0 
Col 25: 0.0 0.0 
Col 26: 0.0 0.0 
Col 27: 0.0 0.45 
Col 28: 0.0 0.42 
Col 29: 0.0 0.0 
Col 30: 0.0 0.0 
Col 31: 0.0 1.12 
Col 32: 0.0 0.0 
Col 33: 0.0 0.92 
Col 34: 0.0 0.0 
Col 35: 0.0 0.0 
Col 36: 0.0 2.90
Col 37: 0.0 0.0 
Col 38: 0.0 0.0 
Col 39: 0.0 0.22
Col 40: 0.0 0.0 

As in the LUCS-KDD implementation of the FOIL algorithm, the first element in each column of the attributes 2-D array is set to 0.0 to indicate that the attribute is available to be included in a rule (i.e. not already used in the current rule antecedent), and the second element holds the gain value once calculated. When an attribute becomes "no longer available" the first element is set to 1.0. The attributes array is used purely to enhance the computational efficiency of the algorithm and serves no other purpose.
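
For example, the gain of 4.58 shown above for attribute 8 (Col 8) can be reproduced from the PN array and the total weightings of P and N, assuming natural logarithms:

WP  = 6.0, WN  = 9.0  (total weights of P and N)
WP' = 5.0, WN' = 0.0  (weights of the examples containing attribute 8)

gain(8) = WP' * (log(WP'/(WP'+WN')) - log(WP/(WP+WN)))
        = 5.0 * (log(5.0/5.0) - log(6.0/15.0))
        = 5.0 * (0.0 - (-0.92))
        = 4.58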

If there were no gain above the user specified minimum we would add the rule so far to the rule list. However, there are attributes (see above) with a gain above the minimum of 0.7. The attribute with the best gain is attribute 8 (gain = 4.58); this attribute is then added to the rule antecedent, giving 8 -> 41, and P' and N' are adjusted so that all examples which do not contain attribute 8 are removed, with the PN array adjusted accordingly. For example, considering attribute 1, no records containing this attribute are removed from P' so the total positive weighting remains unchanged; however, four records containing attribute 1 are removed from N', thus the negative weighting for attribute 1 is reduced by 4.0 to give 0.0. In the LUCS-KDD implementation the removeExDoNotSatRule method in the PRM_CARgen class carries out the necessary adjustment of P', N' and PN'.
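
A sketch of this adjustment is given below; it illustrates the idea only and is not the actual removeExDoNotSatRule code.

import java.util.*;

// Sketch: restrict a weighted example set to records containing a given
// attribute; the PN' array is then rebuilt from the surviving weights.
public class ExampleRestriction {
    static class Example {
        Set<Integer> record;
        double weighting;
        Example(Set<Integer> r, double w) { record = r; weighting = w; }
    }

    // Remove examples that do NOT contain the newly added attribute.
    static List<Example> restrictTo(int attribute, List<Example> examples) {
        List<Example> result = new ArrayList<>();
        for (Example e : examples)
            if (e.record.contains(attribute)) result.add(e);
        return result;
    }

    // Recompute one PN' entry from the surviving examples, e.g. the negative
    // weighting for attribute 1 above falls from 4.0 to 0.0.
    static double totalWeightFor(int attribute, List<Example> examples) {
        double total = 0.0;
        for (Example e : examples)
            if (e.record.contains(attribute)) total += e.weighting;
        return total;
    }
}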

positiveExamples2:
 {1 8 13 17 21 27 31 36 41} 
 {1 8 13 17 21 28 31 36 41} 
 {2 8 14 16 21 27 31 36 41} 
 {3 8 11 16 21 28 31 36 41} 
 {2 8 14 16 21 28 31 36 41} 

negativeExamples2: Null

PN array2:
Att 1: 	2.0 	0.0
Att 2: 	2.0 	0.0
Att 3: 	1.0 	0.0
Att 4: 	0.0 	0.0
Att 5: 	0.0 	0.0
Att 6: 	0.0 	0.0
Att 7: 	0.0 	0.0
Att 8: 	5.0 	0.0
Att 9: 	0.0 	0.0
Att 10: 0.0 	0.0
Att 11: 1.0 	0.0
Att 12: 0.0 	0.0
Att 13: 2.0 	0.0
Att 14: 2.0 	0.0
Att 15: 0.0 	0.0
Att 16: 3.0 	0.0
Att 17: 2.0 	0.0
Att 18: 0.0 	0.0
Att 19: 0.0 	0.0
Att 20: 0.0 	0.0
Att 21: 5.0 	0.0
Att 22: 0.0 	0.0
Att 23: 0.0 	0.0
Att 24: 0.0 	0.0
Att 25: 0.0 	0.0
Att 26: 0.0 	0.0
Att 27: 2.0 	0.0
Att 28: 3.0 	0.0
Att 29: 0.0 	0.0
Att 30: 0.0 	0.0
Att 31: 5.0 	0.0
Att 32: 0.0 	0.0
Att 33: 0.0 	0.0
Att 34: 0.0 	0.0
Att 35: 0.0 	0.0
Att 36: 5.0 	0.0
Att 37: 0.0 	0.0
Att 38: 0.0 	0.0
Att 39: 0.0 	0.0
Att 40: 0.0 	0.0
 

N' is now empty, thus the rule 8 -> 41 is inserted into the rule list. We revise the weightings in P by the decay factor to give a P example set of the form given below. Note that the weighting of each record covered by the new rule has been multiplied by the decay factor of 1/3 (i.e. reduced by two thirds).

positiveExamples:
 {1 8 13 17 21 27 31 36 41}  (0.33)
 {1 8 13 17 21 28 31 36 41}  (0.33)
 {2 8 14 16 21 27 31 36 41}  (0.33)
 {3 8 11 16 21 28 31 36 41}  (0.33)
 {2 8 14 16 21 28 31 36 41}  (0.33)
 {3 9 14 16 21 28 33 39 41}  (1.0) 

We also produce a new PN' copied from the PN array as revised when the rule was discovered (i.e. reflecting the reduced weightings):

PN array2:
Att 1: 	0.67 	4.0
Att 2: 	0.67 	2.0
Att 3: 	1.33 	3.0
Att 4: 	0.0 	0.0
Att 5: 	0.0 	0.0
Att 6: 	0.0 	0.0
Att 7: 	0.0 	1.0
Att 8: 	1.67 	0.0
Att 9: 	1.0 	3.0
Att 10: 0.0 	5.0
Att 11: 0.33 	0.0
Att 12: 0.0 	1.0
Att 13: 0.67 	6.0
Att 14: 1.67 	2.0
Att 15: 0.0 	0.0
Att 16: 2.0	4.0
Att 17: 0.67 	4.0
Att 18: 0.0 	1.0
Att 19: 0.0 	0.0
Att 20: 0.0 	0.0
Att 21: 2.67 	6.0
Att 22: 0.0 	1.0
Att 23: 0.0 	0.0
Att 24: 0.0 	1.0
Att 25: 0.0 	1.0
Att 26: 0.0 	1.0
Att 27: 0.67 	2.0
Att 28: 2.0 	5.0
Att 29: 0.0 	1.0
Att 30: 0.0 	0.0
Att 31: 1.67 	5.0
Att 32: 0.0 	3.0
Att 33: 1.0 	0.0
Att 34: 0.0 	0.0
Att 35: 0.0 	1.0
Att 36: 1.67 	2.0
Att 37: 0.0 	2.0
Att 38: 0.0 	4.0
Att 39: 1.0 	1.0
Att 40: 0.0 	0.0

The total weight of P is now 2.67, above the threshold of 0.3, so we repeat the process with revised example sets P' (copied from the revised version of P) and N' (copied from the original N):

positiveExamples2:
 {1 8 13 17 21 27 31 36 41}  (0.33)
 {1 8 13 17 21 28 31 36 41}  (0.33)
 {2 8 14 16 21 27 31 36 41}  (0.33)
 {3 8 11 16 21 28 31 36 41}  (0.33)
 {2 8 14 16 21 28 31 36 41}  (0.33)
 {3 9 14 16 21 28 33 39 41}  (1.0)

negativeExamples2:
 {2 9 13 17 21 28 32 38 42}  (1.0)
 {3 10 13 16 21 27 32 36 42}  (1.0)
 {1 9 12 17 21 29 35 37 42}  (1.0)
 {1 7 13 17 21 28 31 36 42}  (1.0)
 {1 10 13 18 24 28 31 38 42}  (1.0)
 {3 9 14 16 21 26 31 38 42}  (1.0)
 {3 10 14 16 21 28 31 37 42}  (1.0)
 {1 10 13 17 25 28 31 39 42}  (1.0)
 {2 10 13 16 22 27 32 38 42}  (1.0)

We also create an empty attributes array A' (by copying A). As before, we then calculate the gain for each attribute (and store it in the attributes array):

Attributes:
Col 1: 0.0 0.0 
Col 2: 0.0 0.0 
Col 3: 0.0 -0.35 
Col 4: 0.0 0.0 
Col 5: 0.0 0.0 
Col 6: 0.0 0.0 
Col 7: 0.0 0.0 
Col 8: 0.0 1.53 
Col 9: 0.0 -0.47 
Col 10: 0.0 0.0 
Col 11: 0.0 0.0 
Col 12: 0.0 0.0 
Col 13: 0.0 0.0 
Col 14: 0.0 0.21 
Col 15: 0.0 0.0 
Col 16: 0.0 -0.36 
Col 17: 0.0 0.0 
Col 18: 0.0 0.0 
Col 19: 0.0 0.0 
Col 20: 0.0 0.0 
Col 21: 0.0 -0.7 
Col 22: 0.0 0.0 
Col 23: 0.0 0.0 
Col 24: 0.0 0.0 
Col 25: 0.0 0.0 
Col 26: 0.0 0.0 
Col 27: 0.0 0.0 
Col 28: 0.0 -0.67 
Col 29: 0.0 0.0 
Col 30: 0.0 0.0 
Col 31: 0.0 -0.78 
Col 32: 0.0 0.0 
Col 33: 0.0 0.92 
Col 34: 0.0 0.0 
Col 35: 0.0 0.0 
Col 36: 0.0 0.21 
Col 37: 0.0 0.0 
Col 38: 0.0 0.0 
Col 39: 0.0 0.22 
Col 40: 0.0 0.0 

The attribute with the best gain is attribute 8 (gain = 1.53). Note that this is the same attribute as discovered on the previous iteration; it is quite possible to generate the same rule more than once using PRM. The attribute is then added to the rule antecedent (8 -> 41) and P' and N' are adjusted so that all examples which do not contain attribute 8 are removed, with PN' adjusted accordingly:

positiveExamples2:
 {1 8 13 17 21 27 31 36 41}  (0.33)
 {1 8 13 17 21 28 31 36 41}  (0.33)
 {2 8 14 16 21 27 31 36 41}  (0.33)
 {3 8 11 16 21 28 31 36 41}  (0.33)
 {2 8 14 16 21 28 31 36 41}  (0.33)

negativeExamples2: null

PN array2:
Att 1: 	0.67 	0.0
Att 2: 	0.67 	0.0
Att 3: 	0.333 	0.0
Att 4: 	0.0 	0.0
Att 5: 	0.0 	0.0
Att 6: 	0.0 	0.0
Att 7: 	0.0 	0.0
Att 8: 	1.67 	0.0
Att 9: 	0.0 	0.0
Att 10: 0.0 	0.0
Att 11: 0.33 	0.0
Att 12: 0.0 	0.0
Att 13: 0.67 	0.0
Att 14: 0.67 	0.0
Att 15: 0.0 	0.0
Att 16: 1.0 	0.0
Att 17: 0.67 	0.0
Att 18: 0.0 	0.0
Att 19: 0.0 	0.0
Att 20: 0.0 	0.0
Att 21: 1.67 	0.0
Att 22: 0.0 	0.0
Att 23: 0.0 	0.0
Att 24: 0.0 	0.0
Att 25: 0.0 	0.0
Att 26: 0.0 	0.0
Att 27: 0.67 	0.0
Att 28: 1.0 	0.0
Att 29: 0.0 	0.0
Att 30: 0.0 	0.0
Att 31: 1.67 	0.0
Att 32: 0.0 	0.0
Att 33: 0.0 	0.0
Att 34: 0.0 	0.0
Att 35: 0.0 	0.0
Att 36: 1.67 	0.0
Att 37: 0.0 	0.0
Att 38: 0.0 	0.0
Att 39: 0.0 	0.0
Att 40: 0.0 	0.0

N' is now empty so the rule 8 -> 41 is (again) inserted into the rule list (duplicates are removed in the LUCS-KDD implementation during pruning) and the weightings in P and the PN array are adjusted accordingly. The total weight of P is now 1.56, still above the threshold of 0.3, so we repeat the process with revised example sets P' and N' and an adjusted PN' array.

On the next iteration the attribute with the best gain is attribute 33 (gain = 0.92), so this is added to the (currently empty) antecedent of a new rule to give 33 -> 41 and the process is repeated. On the following iteration the best gain is below the threshold, so the rule is added to the rule list.

The total weight of P has now been reduced to 0.89, still above 0.3; however, there are no attributes which can produce a gain in excess of the required minimum, thus we stop generating rules for class 41 and repeat the process for class 42.

At the end of the process we will have the following rule list. Note that rules are inserted into the rule list, as they are derived, in order of their calculated Laplace accuracy (i.e. each rule's Laplace expected error estimate):

(1)  {8}   ->  {41}  0.86
(2)  {10}  ->  {42}  0.86
(3)  {38}  ->  {42}  0.83
(4)  {33}  ->  {41}  0.67



5. CONCLUSIONS


The LUCS-KDD implementation of PRM described here has been used by the LUCS-KDD research group to contrast and compare a variety of CARM algorithms and techniques. The software is available free of charge; however, the author would appreciate appropriate acknowledgement. The following reference format for referring to the software is suggested:

Coenen, F. (2004), LUCS-KDD implementations of PRM (Predictive Rule Mining), http://www.cxc.liv.ac.uk/~frans/KDD/Software/FOIL_PRM_CPAR/prm.html, Department of Computer Science, The University of Liverpool, UK.

Should you discover any "bugs" or other problems within the software (or this documentation), do not hesitate to contact the author.




REFERENCES

  1. Clark, P. and Boswell, R. (1991). Rule Induction With CN2: Some Recent Improvements. Proc. European Working Session on Learning (EWSL-91), Porto, Portugal, pp. 151-163.
  2. Quinlan, J. R. and Cameron-Jones, R. M. (1993). FOIL: A Midterm Report. Proc. ECML, Vienna, Austria, pp. 3-20.
  3. Yin, X. and Han, J. (2003). CPAR: Classification based on Predictive Association Rules. Proc. SIAM Int. Conf. on Data Mining (SDM'03), San Francisco, CA, pp. 331-335.



Created and maintained by Frans Coenen. Last updated 23 June 2005