THE SUPPORT-CONFIDENCE FRAMEWORK FOR ASSOCIATION RULE MINING

Frans Coenen

Department of Computer Science, The University of Liverpool

August 2006

UNI. OF L'POOL

In this page we give a brief explanation of the support-confidence framework for Association Rule MIning (ARM). This is the most common framework used to enhance the computational efficiencvy of ARM algorithms. The text is primarily intended for MSc project students but may proove usefull to other readers.

The objective of Association Rule Mining (ARM) is to find interesting patterns within binary valued data tables. The patterns are uysaully expressed as rules of the form:

antecedent => consequnet

Where the antecedent and consequent are two disjoint sub-sets of some global itemset I. Such rules afre called association rules (ARs). Given an AR of the form X => Y this is read as "if X exists then Y is also likely to exist". The most common mechanisum for defining "likelyhood" is in terms of the support-confidence framework. Using this framework we assign a statistical confidence value, expressed as a percentage, to each AR and reject all those rules that are below a user defined confidence threshold. Given a rule X =>Y its confidence vlaue is calculated using the identity:

 support(A union B)
-------------------- * 100
   support(A)

The support of an item set is a count of the number of records in the data set in which it appears.

Consider the following data set:

A B C
A C D
A D
B C D
C D

A number of confidence values for a selection of ARs are given in the table below

RuleSupport for AntecedentSupport for RuleConfidence (%)
A => Bsupport(A) = 3support(AB) = 1confidence(A=>B) = (1/3)*100 = 33.3
A => Dsupport(A) = 3support(ABD) = 2confidence(A=>D) = (2/3)*100 = 66.7
A B => Csupport(AB) = 1support(ABC) = 1confidence(AB=>C) = (1/1)*100 = 100.0
A => C Bsupport(A) = 3support(ABC) = 1confidence(A=>CB) = (1/3)*100 = 33.3

Note that ARs are generated from itemsets, for example the item set {A B C} gives rise to the rules A => B C, B => A C, C => A B, A B => C, A C => B and B C => A (with confidence values of 33.3, 66.7, 33.3, 33.3, 33.3 and 66.7 respectively).

The number of item combinatioins (itemsets) represented in a datbase is given by the identity 2N-1. Thus the number of itemsets, and consequently the number of potential rules, increases exponentially with N. One tactiv often employed to reduce the number of potential ARs is the concept of the frequent itemset. A frequent itemset is an itemset whose support count exceeds some user specifide support threshold. The significance is that potential ARs are only generetaed from frequent itemsets, all other itemsets being rejected.

Using the support-confidence vframework ARM becomes a two stage proces:

  1. Identify all frequent itemsets (those with a suypport count above the given support threshold) represented in the input dataset.
  2. Generate the set of potential ARs and keep those with a confidence value above the given confidence threshold.

The most computationally expensive phase is recognised to be the first frequrnt itemset generation phase.

Some authors refer to frequent itemsets as interesting or large itemsets. In the case of interesting itemsets the associated support count is interpreted as a measure of "interestingness" (ARM can be defined as the process of finding interesting poatterns in data). The term large is used to indicate that the magnitude of the support count is large enough to be greater than the given support threshold. However, students often find the term confusing as they assume (wrongly) that it equates to the size of the itemset (the number of items contained within it); in fact, in terms of the number of items wiythin an itemset, an itemset comprised of relativly few items is more likely to be frequent that one with many items. For tnhis reason the term "karge itemset" is best avoided.