Unit 4
Basic Concepts and Algorithms: Problem Definition, Frequent Item Set Generation,
Apriori Principle, Apriori Algorithm, Rule Generation, Compact Representation of
Frequent Item sets, FP-Growth Algorithm.
• Association analysis is useful for discovering interesting relationships
hidden in large data sets.
• The uncovered relationships can be represented in the form of
association rules.
{Milk} → {Bread}
Problem Definition
• Let I = {i1, i2, ..., id} be the set of all items in a market basket data and T = {t1, t2, ..., tN} be the
set of all transactions.
• If an itemset contains k items, it is called a k-itemset. For instance, {Bread, Diapers, Milk}
is an example of a 3-itemset.
Association Rule
• An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets.
• The strength of an association rule can be measured in terms of its support and
confidence.
• An important property of an itemset is its support count, which refers to the number of
transactions that contain a particular itemset.
• Support determines how often a rule is applicable to a given data set, while
confidence determines how frequently items in Y appear in transactions that
contain X.
• Formally, support(X → Y) = σ(X ∪ Y)/N and confidence(X → Y) = σ(X ∪ Y)/σ(X),
where σ(·) denotes the support count and N is the total number of transactions.
• Consider the rule {Milk, Diapers} −→ {Bread}. Since the support count for
{Milk, Diapers, Bread} is 2 and the total number of transactions is 5, the rule’s
support is 2/5=0.4.
• The rule’s confidence is obtained by dividing the support count for {Milk,
Diapers, Bread} by the support count for {Milk, Diapers}. Since there are 3
transactions that contain milk and diapers, the confidence for this rule is 2/3=0.67.
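• The same numbers can be reproduced programmatically. The Python sketch below is illustrative only; the five-transaction table is an assumed example chosen to be consistent with the counts quoted above (it is not reproduced from the notes).

# Illustrative support/confidence computation; the transaction table is assumed.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diapers"}, {"Bread"}        # rule antecedent and consequent
N = len(transactions)
support = support_count(X | Y, transactions) / N                                  # 2/5 = 0.4
confidence = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 ≈ 0.67
print(support, round(confidence, 2))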
Formulation of Association Rule Mining Problem
• The association rule mining problem is commonly decomposed into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy
the minsup threshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the
frequent itemsets found in the previous step. These rules are called strong rules.
Frequent Item Set Generation
• A lattice structure can be used to enumerate the list of all possible itemsets; a data set that
contains d items can generate up to 2^d − 1 candidate itemsets (excluding the empty set).
• There are two main ways to reduce the computational complexity of frequent itemset generation:
1. Reduce the number of candidate itemsets (M). The Apriori principle is an effective way
to eliminate some of the candidate itemsets without counting their support values.
2. Reduce the number of comparisons. Instead of matching each candidate itemset against
every transaction, we can reduce the number of comparisons by using more advanced data
structures, either to store the candidate itemsets or to compress the data set.
The Apriori Principle
• The Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
• For example, if {c, d, e} is frequent, then any transaction that contains {c, d, e} must also contain
its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}; as a result, all subsets of {c, d, e} must also be frequent.
• Conversely, if an itemset is found to be infrequent, then all of its supersets must be infrequent too.
• This follows from the anti-monotone property of the support measure: for all X, Y, if X ⊆ Y then
f(Y) ≤ f(X), which means that if X is a subset of Y, then f(Y) must not exceed f(X).
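• A quick sanity check of the anti-monotone property, written as a small Python sketch on an assumed toy data set (illustrative only):

from itertools import combinations

# Toy data set (assumed) used only to verify f(Y) <= f(X) whenever X is a subset of Y.
transactions = [{"c", "d", "e"}, {"c", "d"}, {"c", "e"}, {"a", "c", "d", "e"}, {"b", "e"}]

def support_count(itemset, transactions):
    return sum(1 for t in transactions if set(itemset) <= t)

superset = ("c", "d", "e")
for k in range(1, len(superset)):
    for subset in combinations(superset, k):
        # The superset can never be contained in more transactions than any of its subsets.
        assert support_count(superset, transactions) <= support_count(subset, transactions)
print("anti-monotone property holds on this data set")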
Frequent Itemset Generation in the Apriori Algorithm
• Apriori is the first association rule mining algorithm that pioneered the use of support-based
pruning to systematically control the exponential growth of candidate itemsets.
• After counting their supports, the candidate 1-itemsets {egg} and {Ketchup} are
discarded because they turn out to be infrequent.
• In the next iteration, candidate 2-itemsets are generated using only the frequent 1-
itemsets.
• Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets
generated by the algorithm is (4 choose 2) = 6.
• The remaining four candidate 2-itemsets are frequent, and thus will be used to generate candidate
3-itemsets.
• Because of support-based pruning, a candidate 3-itemset is generated only if all of its 2-item
subsets are frequent; the only candidate that has this property is {Milk, Bread, Butter}.
• The effectiveness of the Apriori pruning strategy can be shown by counting the
number of candidate itemsets generated.
• The algorithm initially makes a single pass over the data set to determine the support
of each item.
• Upon completion of this step, the set of frequent 1-itemsets, F1, will be known (steps
1 and 2).
• Next, the algorithm will iteratively generate new candidate k-itemsets using the
frequent (k − 1)-itemsets found in the previous iteration (step 5).
• To count the support of the candidates, the algorithm needs to make an additional pass
over the data set (steps 6–10).
• The subset function is used to determine all the candidate itemsets in Ck that are
contained in each transaction t.
• After counting their supports, the algorithm eliminates all candidate itemsets
whose support counts are less than minsup (step 12).
• The algorithm terminates when there are no new frequent itemsets generated, i.e.,
Fk = ∅ (step 13).
• The frequent itemset generation part of the Apriori algorithm has two important
characteristics.
• First, it is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time,
from frequent 1-itemsets to the maximum size of frequent itemsets.
• Second, it employs a generate-and-test strategy: at each iteration, new candidate itemsets are
generated from the frequent itemsets found in the previous iteration, and the support of each
candidate is then counted and tested against minsup.
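• The steps described above can be summarized in the following Python sketch. It is a simplified, illustrative implementation (candidate generation merges frequent (k − 1)-itemsets and prunes candidates with an infrequent subset); it is not the textbook pseudocode, and minsup is an absolute support count here.

from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]

    def support_counts(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:                 # one pass over the data set (steps 6-10)
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return counts

    # Steps 1-2: a single pass determines the support of each item, giving F1.
    items = {frozenset([i]) for t in transactions for i in t}
    counts = support_counts(items)
    Fk = {c for c, n in counts.items() if n >= minsup}
    frequent = {c: counts[c] for c in Fk}

    k = 2
    while Fk:                                  # step 13: stop when no new frequent itemsets appear
        # Step 5: generate candidate k-itemsets from the frequent (k-1)-itemsets,
        # keeping only candidates whose (k-1)-subsets are all frequent (Apriori pruning).
        candidates = {a | b for a in Fk for b in Fk
                      if len(a | b) == k
                      and all(frozenset(s) in Fk for s in combinations(a | b, k - 1))}
        counts = support_counts(candidates)
        Fk = {c for c, n in counts.items() if n >= minsup}   # step 12: support-based elimination
        frequent.update({c: counts[c] for c in Fk})
        k += 1
    return frequent

• Called on a small transaction list with, say, minsup = 2, the function returns a dictionary mapping each frequent itemset to its support count.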
Rule Generation
• Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, each obtained by
partitioning Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the
confidence threshold.
• Let X = {1, 2, 3} be a frequent itemset.
• There are six candidate association rules that can be generated from X:
{1, 2} → {3}, {1, 3} → {2}, {2, 3} → {1}, {1} → {2, 3}, {2} → {1, 3}, and {3} → {1, 2}.
• Theorem (confidence-based pruning): if a rule X → Y − X does not satisfy the confidence
threshold, then any rule X1 → Y − X1, where X1 is a subset of X, must not satisfy the confidence
threshold as well.
• To prove this theorem, consider the following two rules: X1 → Y − X1 and X → Y − X,
where X1 ⊂ X.
• The confidences of the rules are σ(Y)/σ(X1) and σ(Y)/σ(X), respectively.
• Since X1 is a subset of X, σ(X1) ≥ σ(X). Therefore, the former rule cannot have a
higher confidence than the latter rule.
Rule Generation in Apriori Algorithm
• The Apriori algorithm uses a level-wise approach for generating association rules, where
each level corresponds to the number of items that belong to the rule consequent.
• Initially, all the high-confidence rules that have only one item in the rule consequent are
extracted.
• If, for example, the rule {bcd} → {a} has low confidence, then all the rules containing item a
in their consequent, including {cd} → {ab}, {bd} → {ac}, {bc} → {ad}, and {d} → {abc}, can be discarded.
A sketch of the rule generation step is given below.
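• The following Python sketch is illustrative only (not the textbook pseudocode). It assumes a dictionary `support` mapping frequent itemsets (frozensets) to their support counts, for example the output of the frequent itemset sketch above, and a confidence threshold minconf.

def generate_rules(itemset, support, minconf):
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]   # level 1: single-item consequents
    m = 1
    while consequents and m < len(itemset):
        kept = []
        for consequent in consequents:
            antecedent = itemset - consequent
            conf = support[itemset] / support[antecedent]
            if conf >= minconf:
                rules.append((antecedent, consequent, conf))
                kept.append(consequent)               # only high-confidence consequents are extended
            # A low-confidence consequent is pruned: no larger consequent containing it is generated.
        # Merge the kept size-m consequents to form candidate consequents of size m + 1.
        consequents = list({a | b for a in kept for b in kept if len(a | b) == m + 1})
        m += 1
    return rules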
Compact Representation of Frequent Itemsets
• In practice, the number of frequent itemsets produced from a transaction data set can be
very large.
• It is useful to identify a small representative set of itemsets from which all other
frequent itemsets can be derived.
Maximal Frequent Itemsets
• (Maximal Frequent Itemset). A frequent itemset is maximal if none of its immediate supersets is frequent.
• The itemsets in the lattice are divided into two groups: frequent and infrequent.
• Every itemset located above the border is frequent, while those located below the
border are infrequent.
• Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e} are
considered to be maximal frequent itemsets because their immediate supersets are
infrequent.
• An itemset such as {a, d} is maximal frequent because all of its immediate supersets, {a,
b, d}, {a, c, d}, and {a, d, e}, are infrequent.
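• A small illustrative check in Python (the item labels and the frequent-itemset collection below are assumed, not taken from the figure): a frequent itemset is maximal if adding any remaining item yields an infrequent itemset.

def is_maximal(itemset, frequent_itemsets, all_items):
    itemset = frozenset(itemset)
    # Maximal frequent: no immediate superset (itemset plus one more item) is frequent.
    return all(itemset | {i} not in frequent_itemsets for i in all_items - itemset)

frequent_itemsets = {frozenset(s) for s in [{"a"}, {"c"}, {"d"}, {"a", "d"}, {"c", "d"}]}
all_items = {"a", "b", "c", "d"}
print(is_maximal({"a", "d"}, frequent_itemsets, all_items))   # True: neither {a, b, d} nor {a, c, d} is frequent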
Closed Frequent Itemsets
• (Closed Itemset). An itemset X is closed if none of its immediate supersets has exactly the
same support count as X.
• (Closed Frequent Itemset). An itemset is a closed frequent itemset if it is closed and
its support is greater than or equal to minsup.
• In the previous example, assuming that the support threshold is 40%, {b,c} is a
frequent itemset because its support is 60%.
• In that example, every transaction that contains b also contains c. Consequently, the
support for {b} is identical to the support for {b, c}, and {b} should not be considered a closed itemset.
• {b, c} is a closed itemset because it does not have the same support count as any of its
supersets.
• For example, consider the frequent itemset {a, d}. Because this itemset is not closed, its support
count must be identical to that of one of its immediate supersets.
• The key is to determine which superset (among {a, b, d}, {a, c, d}, or {a, d, e}) has
exactly the same support count as {a, d}.
• Any transaction that contains a superset of {a, d} must also contain {a, d}, so the support for
{a, d} must be equal to the largest support among its supersets. Since {a, c, d} has a larger support
than both {a, b, d} and {a, d, e}, the support for {a, d} must be identical to the support for {a, c, d}.
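• Both checks can be written in the same style as the earlier sketches (illustrative; `support` maps frozensets to support counts and is an assumed input, with the example counts chosen to match the {b}/{b, c} discussion above under a five-transaction data set): an itemset is closed if no immediate superset has the same support count, and the support of a non-closed frequent itemset can be recovered as the largest support among its immediate supersets.

def is_closed(itemset, support, all_items):
    itemset = frozenset(itemset)
    # Supersets never have larger support, so "no superset with equal support"
    # is the same as "every immediate superset has strictly smaller support".
    return all(support.get(itemset | {i}, 0) < support[itemset] for i in all_items - itemset)

def derived_support(itemset, support, all_items):
    """Support of a non-closed frequent itemset, recovered from its immediate supersets."""
    itemset = frozenset(itemset)
    return max(support.get(itemset | {i}, 0) for i in all_items - itemset)

support = {frozenset({"b"}): 3, frozenset({"c"}): 3, frozenset({"b", "c"}): 3}
print(is_closed({"b"}, support, {"b", "c"}))      # False: {b, c} has the same support count
print(is_closed({"b", "c"}, support, {"b", "c"}))  # True within this small item universe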
FP-Growth Algorithm
• FP-growth encodes the data set using a compact data structure called an FP-tree and
extracts frequent itemsets directly from this structure.
FP-Tree Representation
• It is constructed by reading the data set one transaction at a time and mapping each
transaction onto a path in the FP-tree.
• As different transactions have several items in common, their paths may overlap.
• The FP-tree is constructed in the following way:
• The data set is scanned once to determine the support count of each item. Infrequent
items are discarded, while the frequent items are sorted in decreasing order of their support counts.
• The algorithm makes a second pass over the data to construct the FP-tree. After reading
the first transaction, {a, b}, the nodes labeled a and b are created.
• After reading the second transaction, {b,c,d}, a new set of nodes is created for items b, c,
and d. A path is then formed to represent the transaction by connecting the nodes null →
b → c → d.
• This process continues until every transaction has been mapped onto one of the paths
given in the FP-tree.
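• A minimal FP-tree construction sketch in Python (illustrative; the class and field names are assumptions, not from the notes). It follows the two passes described above: the first pass counts item supports, the second inserts each transaction, with its frequent items sorted by decreasing support, as a path from the root.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item label (None for the root)
        self.count = 1            # number of transactions passing through this node
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fp_tree(transactions, minsup):
    # Pass 1: support count of each item; infrequent items are discarded.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, n in counts.items() if n >= minsup}

    root = FPNode(None, None)
    for t in transactions:                     # pass 2: map each transaction onto a path
        items = sorted((i for i in t if i in frequent), key=lambda i: -counts[i])
        node = root
        for item in items:
            if item in node.children:          # path overlaps with an earlier transaction
                node.children[item].count += 1
            else:                              # otherwise a new node is created
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root, counts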
Frequent Itemset Generation in FP-Growth Algorithm
• Given the example tree, the algorithm looks for frequent itemsets ending in e first,
followed by d, c, b, and finally, a.
• FP-growth finds all the frequent itemsets ending with a particular suffix by
employing a divide-and-conquer strategy.
• For example, find all frequent itemsets ending in e. To do this, we must first check
whether the itemset {e} itself is frequent.
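• The divide-and-conquer idea can be illustrated with the following simplified Python sketch. For brevity it recurses on projected (conditional) transaction lists rather than on conditional FP-trees, and it fixes a lexicographic item order instead of the increasing-frequency order used above, so it demonstrates only the suffix-based decomposition, not the actual FP-growth data structure.

from collections import Counter

def mine(transactions, minsup, suffix=()):
    counts = Counter(i for t in transactions for i in t)
    results = {}
    for item in sorted(counts):
        if counts[item] < minsup:
            continue
        itemset = (item,) + suffix            # a frequent itemset ending in the current suffix
        results[itemset] = counts[item]
        # Conditional data set: transactions containing `item`, restricted to items that
        # precede it in the fixed order, so every itemset is generated exactly once.
        conditional = [[j for j in t if j < item] for t in transactions if item in t]
        results.update(mine(conditional, minsup, itemset))
    return results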