Rule models
What’s next?
5 Rule models
Learning ordered rule lists
Rule lists for ranking and probability estimation
Learning unordered rule sets
Rule sets for ranking and probability estimation
Descriptive rule learning
Rule learning for subgroup discovery
Association rule mining
Consider again our small dolphins data set with positive examples
p1: Length = 3 ∧ Gills = no ∧ Beak = yes ∧ Teeth = many
p2: Length = 4 ∧ Gills = no ∧ Beak = yes ∧ Teeth = many
p3: Length = 3 ∧ Gills = no ∧ Beak = yes ∧ Teeth = few
p4: Length = 5 ∧ Gills = no ∧ Beak = yes ∧ Teeth = many
p5: Length = 5 ∧ Gills = no ∧ Beak = yes ∧ Teeth = few
and negatives
n1: Length = 5 ∧ Gills = yes ∧ Beak = yes ∧ Teeth = many
n2: Length = 4 ∧ Gills = yes ∧ Beak = yes ∧ Teeth = many
n3: Length = 5 ∧ Gills = yes ∧ Beak = no ∧ Teeth = many
n4: Length = 4 ∧ Gills = yes ∧ Beak = no ∧ Teeth = many
n5: Length = 4 ∧ Gills = no ∧ Beak = yes ∧ Teeth = few
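For readers who want to reproduce the calculations in this chapter, here is one possible Python encoding of the data set (the variable names positives and negatives are my own choice, not anything from the book; later sketches in this section assume them):

    # The ten dolphin examples as feature dictionaries; p1-p5 are positive, n1-n5 negative.
    positives = [
        {"Length": 3, "Gills": "no", "Beak": "yes", "Teeth": "many"},   # p1
        {"Length": 4, "Gills": "no", "Beak": "yes", "Teeth": "many"},   # p2
        {"Length": 3, "Gills": "no", "Beak": "yes", "Teeth": "few"},    # p3
        {"Length": 5, "Gills": "no", "Beak": "yes", "Teeth": "many"},   # p4
        {"Length": 5, "Gills": "no", "Beak": "yes", "Teeth": "few"},    # p5
    ]
    negatives = [
        {"Length": 5, "Gills": "yes", "Beak": "yes", "Teeth": "many"},  # n1
        {"Length": 4, "Gills": "yes", "Beak": "yes", "Teeth": "many"},  # n2
        {"Length": 5, "Gills": "yes", "Beak": "no",  "Teeth": "many"},  # n3
        {"Length": 4, "Gills": "yes", "Beak": "no",  "Teeth": "many"},  # n4
        {"Length": 4, "Gills": "no",  "Beak": "yes", "Teeth": "few"},   # n5
    ]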
- The nine possible literals are shown with their coverage counts in Figure 6.2 (left).
- Three of these are pure; in the impurity isometrics plot in Figure 6.2 (right) they end up on the x-axis and y-axis.
- One of the literals covers two positives and two negatives, and therefore has the same impurity as the overall data set; this literal ends up on the ascending diagonal in the coverage plot. (A small sketch that recomputes these counts follows this list.)
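To check these counts, here is a small sketch building on the positives and negatives lists above (coverage and literals are hypothetical helper names of my own):

    # Coverage count [p+, n-] of a single literal (feature, value).
    def coverage(literal, pos, neg):
        feature, value = literal
        p = sum(1 for x in pos if x[feature] == value)
        n = sum(1 for x in neg if x[feature] == value)
        return p, n

    # All nine literals: every feature paired with every value it takes in the data.
    literals = [(f, v) for f in ["Length", "Gills", "Beak", "Teeth"]
                for v in sorted({x[f] for x in positives + negatives}, key=str)]

    for lit in literals:
        p, n = coverage(lit, positives, negatives)
        print(f"{lit[0]}={lit[1]}: [{p}+, {n}-]" + ("  (pure)" if p == 0 or n == 0 else ""))
    # Three literals come out pure: Length=3 [2+, 0-], Gills=yes [0+, 4-], Beak=no [0+, 2-];
    # Length=5 [2+, 2-] has the same class distribution as the whole data set.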
Figure 6.2. (left) All literals with their coverage counts on the data in Example 6.1 (root: true [5+, 5-]):
    Length=3 [2+, 0-]    Length=4 [1+, 3-]    Length=5 [2+, 2-]
    Gills=yes [0+, 4-]   Gills=no [5+, 1-]
    Beak=yes [5+, 3-]    Beak=no [0+, 2-]
    Teeth=many [3+, 4-]  Teeth=few [2+, 1-]
The ones in green (red) are pure for the positive (negative) class. (right) The nine literals plotted as points in coverage space, with their impurity values indicated by impurity isometrics (away from the ascending diagonal is better). Impurity values are colour-coded: towards green if ṗ > 1/2, towards red if ṗ < 1/2, and orange if ṗ = 1/2 (on a 45 degree isometric). The violet arrow indicates the selected literal, which excludes all five positives and one negative.
ROC isometrics for entropy (rescaled to have a maximum value of 1/2), Gini index and minority class. The grey dotted symmetry line is defined by ṗ = 1/2: each isometric has two parts, one above the symmetry line (where impurity decreases with increasing empirical probability ṗ) and its mirror image below the symmetry line (where impurity is proportional to ṗ).
(left) Revised coverage counts after removing the four negative examples covered by the first rule found (literals not covering any examples are omitted; root: true [5+, 1-]):
    Length=3 [2+, 0-]    Length=4 [1+, 1-]    Length=5 [2+, 0-]
    Gills=no [5+, 1-]    Beak=yes [5+, 1-]
    Teeth=many [3+, 0-]  Teeth=few [2+, 1-]
(right) We are now operating in the right-most ‘slice’ of Figure 6.2.
(left) The third rule covers the one remaining negative example, so that the remaining positives can be swept up by a default rule (root: true [2+, 1-]):
    Length=3 [1+, 0-]    Length=4 [0+, 1-]    Length=5 [1+, 0-]
    Gills=no [2+, 1-]    Beak=yes [2+, 1-]    Teeth=few [2+, 1-]
(right) This will collapse the coverage space.
[Figure: the rule list drawn as an equivalent right-branching feature tree, splitting on Gills = yes, Teeth = many and Length = 4, with the corresponding segments shown in coverage space.]
Rule lists inherit the property of decision trees that their training set coverage curve is always convex.
- The first segment of the curve corresponds to all instances which are covered by B but not by A, which is why we use the set-theoretical notation B \ A.
- Notice that while this segment corresponds to the second rule in the rule list, it comes first in the coverage curve because it has the highest proportion of positives.
- The second coverage segment corresponds to rule A, and the third coverage segment denoted ‘-’ corresponds to the default rule.
- This segment comes last, not because it represents the last rule, but because it happens to cover no positives. (A small sketch of this ordering follows this list.)
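The rules A and B referred to here come from Example 6.2, which is not reproduced on these slides. As a stand-in, the following sketch orders the segments of the rule list suggested by the preceding figures (Gills = yes → negative, then Teeth = many → positive, then Length = 4 → negative, with a positive default) by their empirical positive proportion, which is exactly the ordering used to draw the convex coverage curve; all names are hypothetical and the data encoding from earlier is assumed.

    # Each example is assigned to the first rule in the list that fires; per-rule segment
    # counts [positives, negatives] are then ordered by decreasing empirical probability
    # of the positive class to obtain the (convex) training-set coverage curve.
    rule_list = [
        (("Gills", "yes"), "negative"),
        (("Teeth", "many"), "positive"),
        (("Length", 4), "negative"),
        (None, "positive"),                        # default rule
    ]

    def first_firing_rule(x, rules):
        for i, (condition, _) in enumerate(rules):
            if condition is None or x[condition[0]] == condition[1]:
                return i

    segments = [[0, 0] for _ in rule_list]         # [p, n] per rule segment
    for x in positives:
        segments[first_firing_rule(x, rule_list)][0] += 1
    for x in negatives:
        segments[first_firing_rule(x, rule_list)][1] += 1

    order = sorted(range(len(rule_list)),
                   key=lambda i: segments[i][0] / sum(segments[i]), reverse=True)
    print([segments[i] for i in order])
    # [[3, 0], [2, 0], [0, 4], [0, 1]]: the segment of the second rule comes first,
    # the default rule second, and the segments covering no positives come last.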
Coverage curves of two rule lists consisting of the rules from Example 6.2, in different order (AB in blue and BA in violet). B \ A corresponds to the coverage of rule B once the coverage of rule A is taken away, and ‘-’ denotes the default rule. The dotted segment in red connecting the two curves corresponds to the overlap of the two rules A ∧ B, which is not accessible by either rule list.
Rule lists are similar to decision trees in that the empirical probabilities
associated with each rule yield convex ROC and coverage curves on the training
data.
What’s next?
5 Rule models
Learning ordered rule lists
Rule lists for ranking and probability estimation
Learning unordered rule sets
Rule sets for ranking and probability estimation
Descriptive rule learning
Rule learning for subgroup discovery
Association rule mining
Figure 6.7 shows that the first rule learned for the positive class is ·if Length = 3 then Class = ⊕·.
The two examples covered by this rule are removed, and a new rule is learned.
We now encounter a new situation, as none of the candidates is pure (Figure
6.8). We thus start a second-level search, from which the following pure rule
emerges:
·if Gills = no ∧ Length = 5 then Class = ⊕·
To cover the remaining positive, we again need a rule with two conditions (Figure
6.9):
·if Gills = no ∧ Teeth = many then Class = ⊕·
Notice that, even though these rules are overlapping, their overlap only covers
positive examples (since each of them is pure) and so there is no need to
organise them in an if-then-else list.
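The search just described can be sketched in a few lines of Python. This is a simplified greedy stand-in for the book's algorithms, not a faithful reproduction; it reuses the data encoding and the literals list defined earlier, and covers, precision, learn_rule and learn_rule_set are hypothetical names of my own.

    def covers(rule, x):
        """A rule body is a list of (feature, value) literals, read as a conjunction."""
        return all(x[f] == v for f, v in rule)

    def precision(rule, pos, neg):
        p = sum(covers(rule, x) for x in pos)
        n = sum(covers(rule, x) for x in neg)
        return p / (p + n) if p + n > 0 else 0.0

    def learn_rule(pos, neg):
        """Greedily add the literal with highest precision until the rule is pure."""
        rule = []
        while sum(covers(rule, x) for x in neg) > 0:
            candidates = [l for l in literals if l[0] not in {f for f, _ in rule}]
            rule.append(max(candidates, key=lambda l: precision(rule + [l], pos, neg)))
        return rule

    def learn_rule_set(pos, neg):
        """Learn pure rules for the positive class until every positive is covered."""
        rules, remaining = [], list(pos)
        while remaining:
            rule = learn_rule(remaining, neg)
            rules.append(rule)
            remaining = [x for x in remaining if not covers(rule, x)]
        return rules

    print(learn_rule_set(positives, negatives))
    # [[('Length', 3)], [('Gills', 'no'), ('Length', 5)], [('Gills', 'no'), ('Teeth', 'many')]]

On this data set the sketch reproduces the three rules discussed above: Length = 3 first, then Gills = no ∧ Length = 5, then Gills = no ∧ Teeth = many.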
Figure 6.7. (left) The first rule is learned for the positive class; the literal coverage counts are those of Figure 6.2 (root: true [5+, 5-]), and the pure literal Length=3 [2+, 0-] is selected. (right) Precision isometrics look identical to impurity isometrics (Figure 6.2); however, the difference is that precision is lowest on the x-axis and highest on the y-axis, while purity is lowest on the ascending diagonal and highest on both the x-axis and the y-axis.
Figure 6.8. (left) The second rule needs two literals: we use maximum precision to select both. Coverage counts after removing the two positives covered by the first rule (root: true [3+, 5-]):
    Length=4 [1+, 3-]    Length=5 [2+, 2-]
    Gills=yes [0+, 4-]   Gills=no [3+, 1-]
    Beak=yes [3+, 3-]    Beak=no [0+, 2-]
    Teeth=many [2+, 4-]  Teeth=few [1+, 1-]
Extensions of the best literal Gills=no:
    Gills=no & Length=4 [1+, 1-]    Gills=no & Length=5 [2+, 0-]    Gills=no & Beak=yes [3+, 1-]
    Gills=no & Teeth=many [2+, 0-]  Gills=no & Teeth=few [1+, 1-]
(right) The coverage space is smaller because the two positives covered by the first rule are removed. The blue box on the left indicates an even smaller coverage space in which the search for the second literal is carried out, after the condition Gills = no filters out four negatives. Inside the blue box precision isometrics overlap with those in the outer box (this is not necessarily the case with search heuristics other than precision).
Figure 6.9. (left) The third and final rule again needs two literals (root: true [1+, 5-]):
    Length=4 [1+, 3-]    Length=5 [0+, 2-]    Gills=yes [0+, 4-]   Gills=no [1+, 1-]
    Beak=yes [1+, 3-]    Beak=no [0+, 2-]     Teeth=many [1+, 4-]  Teeth=few [0+, 1-]
Extensions of Gills=no:
    Length=4 & Gills=no [1+, 1-]    Gills=no & Beak=yes [1+, 1-]
    Gills=no & Teeth=many [1+, 0-]  Gills=no & Teeth=few [0+, 1-]
(right) The first literal excludes four negatives, the second excludes the one remaining negative.
One issue with using precision as a search heuristic is that it tends to focus a bit too much on finding pure rules, thereby occasionally missing near-pure literals that can be specialised into a pure rule that is more general.
- Consider Figure 6.10 (left): precision favours the rule ·if Length = 3 then Class = ⊕·, even though the near-pure literal Gills = no leads to the pure rule ·if Gills = no ∧ Teeth = many then Class = ⊕·.
- A convenient way to deal with this ‘myopia’ of precision is the Laplace correction, which ensures that [5+, 1−] is ‘corrected’ to [6+, 2−] and thus considered to be of the same quality as [2+, 0−] aka [3+, 1−] (Figure 6.10 (right)). (A small numerical check follows this list.)
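The Laplace correction adds one positive and one negative pseudo-count before computing precision. A quick check of the numbers quoted above (laplace_precision is a hypothetical helper name):

    def laplace_precision(p, n):
        # Laplace-corrected precision: (p + 1) / (p + n + 2).
        return (p + 1) / (p + n + 2)

    print(laplace_precision(2, 0))   # 0.75 -- Length=3, [2+, 0-] corrected to [3+, 1-]
    print(laplace_precision(5, 1))   # 0.75 -- Gills=no, [5+, 1-] corrected to [6+, 2-]
    print(laplace_precision(3, 0))   # 0.8  -- Gills=no ∧ Teeth=many, [3+, 0-]

Under the corrected heuristic the pure extension Gills = no ∧ Teeth = many (covering three positives) scores higher than Length = 3 (covering only two), in line with the preference for more general rules described in Figure 6.10.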
Figure 6.10. (left) Using Laplace-corrected precision allows learning a better rule in the first iteration. First-level coverage counts are as in Figure 6.2 (root: true [5+, 5-]); extensions of Gills=no:
    Gills=no & Length=3 [2+, 0-]    Gills=no & Length=4 [1+, 1-]    Gills=no & Length=5 [2+, 0-]
    Gills=no & Beak=yes [5+, 1-]    Gills=no & Teeth=many [3+, 0-]  Gills=no & Teeth=few [2+, 1-]
(right) Laplace correction adds one positive and one negative pseudo-count, which means that the isometrics now rotate around (−1, −1) in coverage space, resulting in a preference for more general rules.
Consider the following rule set (the first two rules were also used in Example
6.2):
- The figures on the right indicate coverage of each rule over the whole training set. For instances covered by single rules we can use these coverage counts to calculate probability estimates: e.g., an instance covered only by rule A would receive probability p̂(A) = 1/4 = 0.25, and similarly p̂(B) = 5/8 = 0.63 and p̂(C) = 2/4 = 0.50.
- Clearly A and C are mutually exclusive, so the only overlaps we need to take into account are AB and BC.
- A simple trick that is often applied is to average the coverage of the rules involved: for example, the coverage of AB is estimated as [3+, 3−], yielding p̂(AB) = 3/6 = 0.50. Similarly, p̂(BC) = 3.5/6 = 0.58.
- The corresponding ranking is thus B – BC – [AB, C] – A, resulting in the orange training set coverage curve in Figure 6.11. (A small sketch of this calculation follows this list.)
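Here is the averaging trick spelled out in Python. The rule bodies themselves are not reproduced on these slides, so only the coverage counts [p+, n−] quoted above are used; coverage_counts and p_hat are hypothetical names.

    # Coverage [p, n] of each rule over the whole training set, as given in the text.
    coverage_counts = {"A": (1, 3), "B": (5, 3), "C": (2, 2)}

    def p_hat(firing):
        """Average the coverage of all rules covering an instance, then take p / (p + n)."""
        p = sum(coverage_counts[r][0] for r in firing) / len(firing)
        n = sum(coverage_counts[r][1] for r in firing) / len(firing)
        return p / (p + n)

    for firing in (["B"], ["B", "C"], ["A", "B"], ["C"], ["A"]):
        print("".join(firing), round(p_hat(firing), 3))
    # B 0.625, BC 0.583, AB 0.5, C 0.5, A 0.25 -- i.e. the ranking B - BC - [AB, C] - A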
Let us now compare this rule set with the following rule list ABC :
The coverage curve of this rule list is indicated in Figure 6.11 as the blue line. We
see that the rule set outperforms the rule list, by virtue of being able to
distinguish between examples covered by B only and those covered by both B
and C.
Figure 6.11. Coverage curves of the rule set in Example 6.4 (in orange) and the rule list ABC (in blue). The rule set partitions the instance space in smaller segments, which in this case lead to better ranking performance.
What’s next?
5 Rule models
Learning ordered rule lists
Rule lists for ranking and probability estimation
Learning unordered rule sets
Rule sets for ranking and probability estimation
Descriptive rule learning
Rule learning for subgroup discovery
Association rule mining
Subgroup discovery
Transaction Items
1 nappies
2 beer, crisps
3 apples, nappies
4 beer, crisps, nappies
5 apples
6 apples, beer, crisps, nappies
7 apples, crisps
8 crisps
Each transaction in this table involves a set of items; conversely, for each item we
can list the transactions in which it was involved: transactions 1, 3, 4 and 6 for
nappies, transactions 3, 5, 6 and 7 for apples, and so on. We can also do this for
sets of items: e.g., beer and crisps were bought together in transactions 2, 4 and
6; we say that item set {beer, crisps} covers transaction set {2, 4, 6}.
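The cover and support of an item set are easy to compute directly from the table. A minimal sketch (transactions, cover and support are my own names):

    transactions = {
        1: {"nappies"},
        2: {"beer", "crisps"},
        3: {"apples", "nappies"},
        4: {"beer", "crisps", "nappies"},
        5: {"apples"},
        6: {"apples", "beer", "crisps", "nappies"},
        7: {"apples", "crisps"},
        8: {"crisps"},
    }

    def cover(item_set):
        """The set of transactions containing every item in the item set."""
        return {t for t, items in transactions.items() if item_set <= items}

    def support(item_set):
        """The number of transactions covered by the item set."""
        return len(cover(item_set))

    print(cover({"beer", "crisps"}))   # {2, 4, 6}
    print(support({"nappies"}))        # 4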
Figure 6.17. The item set lattice over {nappies, beer, crisps, apples}, from the empty item set {} down to sets of three and four items. Item sets in dotted ovals cover a single transaction; in dashed ovals, two transactions; in triangles, three transactions; and in polygons with n sides, n transactions. The maximal item sets with support 3 or more are indicated in green.
Association rules I
Frequent item sets can be used to build association rules, which are rules of the
form ·if B then H · where both body B and head H are item sets that
frequently appear in transactions together.
- Pick any edge in Figure 6.17, say the edge between {beer} and {nappies, beer}. We know that the support of the former is 3 and of the latter, 2: that is, three transactions involve beer and two of those involve nappies as well. We say that the confidence of the association rule ·if beer then nappies· is 2/3.
Association rules II
But we only want to construct association rules that involve frequent items.
- The rule ·if beer ∧ apples then crisps· has confidence 1, but there is only one transaction involving all three and so this rule is not strongly supported by the data.
- So we first use Algorithm 6.6 to mine for frequent item sets; we then select bodies B and heads H from each frequent set m, discarding rules whose confidence is below a given confidence threshold.
- Notice that we are free to discard some of the items in the maximal frequent sets (i.e., H ∪ B may be a proper subset of m), because any subset of a frequent item set is frequent as well.
A run of the algorithm with support threshold 3 and confidence threshold 0.6 gives the following association rules:

    ·if beer then crisps·     support 3, confidence 3/3
    ·if crisps then beer·     support 3, confidence 3/5
    ·if true then crisps·     support 5, confidence 5/8

Association rule mining often includes a post-processing stage in which superfluous rules are filtered out, e.g., special cases which don’t have higher confidence than the general case.
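The following brute-force sketch reproduces this run on the small example. It is not the book's Algorithm 6.6, just a naive enumeration; it reuses the transactions and support helpers defined earlier, and frequent and rules are my own names.

    from itertools import combinations

    SUPPORT, CONFIDENCE = 3, 0.6

    # All item sets with support >= 3 (naive enumeration; Apriori would prune this search).
    items = sorted(set().union(*transactions.values()))
    frequent = [set(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r) if support(set(c)) >= SUPPORT]

    # Split each frequent set m into body and head; keep rules above the confidence threshold.
    rules = []
    for m in frequent:
        for r in range(len(m)):                           # r = 0 gives the empty body ('true')
            for body in map(set, combinations(sorted(m), r)):
                head = m - body
                confidence = support(m) / support(body)   # support({}) counts all transactions
                if confidence >= CONFIDENCE:
                    rules.append((body, head, support(m), confidence))

    for body, head, supp, conf in rules:
        print(f"if {' & '.join(sorted(body)) or 'true'} then {' & '.join(sorted(head))}: "
              f"support {supp}, confidence {conf:.2f}")
    # if true then crisps: support 5, confidence 0.62
    # if beer then crisps: support 3, confidence 1.00
    # if crisps then beer: support 3, confidence 0.60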
Post-processing
One quantity that is often used in post-processing is lift, defined as

    Lift(·if B then H ·) = n · Supp(B ∪ H) / (Supp(B) · Supp(H))

where n is the number of transactions.
- For example, for the first two association rules above we would have lifts of (8 · 3) / (3 · 5) = 1.6, as Lift(·if B then H ·) = Lift(·if H then B ·).
- For the third rule we have Lift(·if true then crisps·) = (8 · 5) / (8 · 5) = 1. This holds for any rule with B = Ø, as

    Lift(·if Ø then H ·) = n · Supp(Ø ∪ H) / (Supp(Ø) · Supp(H)) = n · Supp(H) / (n · Supp(H)) = 1

More generally, a lift of 1 means that Supp(B ∪ H) is entirely determined by the marginal frequencies Supp(B) and Supp(H) and is not the result of any meaningful interaction between B and H. Only association rules with lift larger than 1 are of interest.
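Continuing the earlier sketch, lift is one line of code (lift is a hypothetical helper built on the support function above):

    def lift(body, head, n=len(transactions)):
        return n * support(body | head) / (support(body) * support(head))

    print(lift({"beer"}, {"crisps"}))   # 1.6
    print(lift({"crisps"}, {"beer"}))   # 1.6 -- lift is symmetric in body and head
    print(lift(set(), {"crisps"}))      # 1.0 -- any rule with an empty body has lift 1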
The item set lattice corresponding to the positive examples of the dolphin example in Example 4.4. Each ‘item’ is a literal Feature = Value; each feature can occur at most once in an item set. The resulting structure is exactly the same as what was called the hypothesis space in Chapter 4.