
UNIT - IV: Association Analysis

Basic Concepts and Algorithms: Problem Definition, Frequent Item Set Generation,
Apriori Principle, Apriori Algorithm, Rule Generation, Compact Representation of
Frequent Item sets, FP-Growth Algorithm.
• Association analysis is useful for discovering interesting relationships hidden in large data sets.
• The uncovered relationships can be represented in the form of association rules, for example:
{Milk} → {Bread}
Problem Definition

• Binary Representation: Market basket data can be represented in a binary format, where each row corresponds to a transaction and each column corresponds to an item. An item's value is 1 if it is present in a transaction and 0 otherwise.
Itemset and Support Count

• Let I = {i1, i2, ..., id} be the set of all items in market basket data and T = {t1, t2, ..., tN} be the set of all transactions.

• Each transaction ti contains a subset of items chosen from I.

• In association analysis, a collection of zero or more items is termed an itemset.

• If an itemset contains k items, it is called a k-itemset. For instance, {Bread, Diapers, Milk}
is an example of a 3-itemset.
Association Rule

• An association rule is an expression of the form

X → Y , where X and Y are disjoint itemsets, i.e., X ∩ Y = ∅.

• The strength of an association rule can be measured in terms of its support and
confidence.

• An important property of an itemset is its support count, which refers to the number of
transactions that contain a particular itemset.
• Support determines how often a rule is applicable to a given data set, while confidence determines how frequently items in Y appear in transactions that contain X. Formally, for a rule X → Y, support s(X → Y) = σ(X ∪ Y)/N and confidence c(X → Y) = σ(X ∪ Y)/σ(X), where σ(·) denotes the support count and N is the total number of transactions.
• Consider the rule {Milk, Diapers} −→ {Bread}. Since the support count for
{Milk, Diapers, Bread} is 2 and the total number of transactions is 5, the rule’s
support is 2/5=0.4.

• The rule’s confidence is obtained by dividing the support count for {Milk,
Diapers, Bread} by the support count for {Milk, Diapers}. Since there are 3
transactions that contain milk and diapers, the confidence for this rule is 2/3=0.67.
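• A minimal Python sketch of these computations is given below. The five-transaction table is an assumption chosen to be consistent with the counts quoted above (support count 2 for {Milk, Diapers, Bread} and 3 for {Milk, Diapers}); the actual table from the slides is not reproduced here.

# Hypothetical transaction table (assumed; consistent with the example counts above)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset, transactions):
    # sigma(X): number of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

X = {"Milk", "Diapers"}   # rule antecedent
Y = {"Bread"}             # rule consequent

support = support_count(X | Y, transactions) / len(transactions)                   # 2/5 = 0.4
confidence = support_count(X | Y, transactions) / support_count(X, transactions)   # 2/3 ~ 0.67
print(support, confidence)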
Formulation of Association Rule Mining Problem

• The association rule mining problem can be stated as follows (Association Rule Discovery): Given a set of transactions T, find all the rules having support ≥ minsup and confidence ≥ minconf, where minsup and minconf are the corresponding support and confidence thresholds.
• A common strategy used by association rule mining algorithms is to decompose the
problem into two major subtasks:

1. Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy
the minsup threshold. These itemsets are called frequent itemsets.

2. Rule Generation, whose objective is to extract all the high-confidence rules from the
frequent itemsets found in the previous step. These rules are called strong rules.
Frequent Item Set Generation

• A lattice structure can be used to enumerate the list of all possible itemsets.

• Figure shows an itemset lattice for I = {a, b, c, d, e}.

• In general, a data set that contains k items can potentially generate up to 2^k − 1 frequent itemsets, excluding the null set.


An itemset lattice
• A brute-force approach for finding frequent itemsets is to determine the support
count for every candidate itemset in the lattice structure.

• To do this, we need to compare each candidate against every transaction.

• If the candidate is contained in a transaction, its support count will be incremented, as the sketch below illustrates.
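• A rough Python sketch of this brute-force approach (the helper names are illustrative only): with M candidate itemsets and N transactions it performs on the order of N × M subset tests, which quickly becomes impractical.

from itertools import combinations

def brute_force_frequent(transactions, items, minsup_count):
    # Enumerate every candidate itemset in the lattice and count its support
    frequent = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(sorted(items), k):
            cset = set(candidate)
            # Compare the candidate against every transaction
            count = sum(1 for t in transactions if cset <= t)
            if count >= minsup_count:
                frequent[candidate] = count
    return frequent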


• There are several ways to reduce the computational complexity of frequent itemset
generation.

1. Reduce the number of candidate itemsets (M). The Apriori principle is an effective way
to eliminate some of the candidate itemsets without counting their support values.

2. Reduce the number of comparisons. Instead of matching each candidate itemset against
every transaction, we can reduce the number of comparisons by using more advanced data
structures, either to store the candidate itemsets or to compress the data set.
The Apriori Principle

• If an itemset is frequent, then all of its subsets must also be frequent.

• Consider the itemset lattice.


• If {c, d, e} is frequent, then all subsets of this itemset are frequent.
• Suppose {c, d, e} is a frequent itemset. Clearly, any transaction that contains {c, d, e} must also contain its subsets, {c, d}, {c, e}, {d, e}, {c}, {d}, and {e}.

• As a result, if {c, d, e} is frequent, then all subsets of {c, d, e} (i.e., the shaded itemsets in this figure) must also be frequent.


• Conversely, if an itemset such as {a, b} is infrequent, then all of its supersets must
be infrequent too.
If {a, b} is infrequent, then all supersets of {a, b} are infrequent.
• Definition (Monotonicity Property). Let I be a set of items, and J = 2^I be the power set of I. A measure f is monotone (or upward closed) if

∀X, Y ∈ J : (X ⊆ Y) → f(X) ≤ f(Y),

which means that if X is a subset of Y, then f(X) must not exceed f(Y).

• On the other hand, f is anti-monotone (or downward closed) if

∀X, Y ∈ J : (X ⊆ Y) → f(Y) ≤ f(X),

which means that if X is a subset of Y, then f(Y) must not exceed f(X).

• Support is anti-monotone: the support of an itemset never exceeds the support of any of its subsets. This is the property the Apriori principle exploits for support-based pruning.
Frequent Itemset Generation in the Apriori Algorithm

• Apriori is the first association rule mining algorithm that pioneered the use of support-based pruning to systematically control the exponential growth of candidate itemsets.

• We assume that the support threshold is 33%, which is equivalent to a minimum support count equal to 4.
Frequent itemset generation using the Apriori algorithm
• Initially, every item is considered as a candidate 1-itemset.

• After counting their supports, the candidate itemsets {egg} and {Ketchup} are
discarded.

• In the next iteration, candidate 2-itemsets are generated using only the frequent 1-
itemsets.
• Because there are only four frequent 1-itemsets, the number of candidate 2-itemsets generated by the algorithm is C(4, 2) = 6.
• The remaining four candidates are frequent, and thus will be used to generate candidate 3-itemsets.

• A candidate 3-itemset is kept only if all of its 2-itemset subsets are frequent; the only candidate that has this property is {Milk, Bread, Butter}.
• The effectiveness of the Apriori pruning strategy can be shown by counting the
number of candidate itemsets generated.

• Since the data set contains six items, a brute-force strategy for listing all itemsets (up to size 3) as candidates generates C(6, 1) + C(6, 2) + C(6, 3) = 6 + 15 + 20 = 41 candidates.

• With the Apriori principle, this number decreases to 6 + 6 + 1 = 13 candidates.
• Let Ck denote the set of candidate k-itemsets and Fk denote the set of frequent k-
itemsets:

• The algorithm initially makes a single pass over the data set to determine the support
of each item.

• Upon completion of this step, the set of frequent 1-itemsets, F1, will be known (steps
1 and 2).
• Next, the algorithm will iteratively generate new candidate k-itemsets using the frequent (k − 1)-itemsets found in the previous iteration (step 5).
• To count the support of the candidates, the algorithm needs to make an additional pass
over the data set (steps 6–10).

• The subset function is used to determine all the candidate itemsets in Ck that are contained in each transaction t.
• After counting their supports, the algorithm eliminates all candidate itemsets
whose support counts are less than minsup (step 12).

• The algorithm terminates when there are no new frequent itemsets generated, i.e.,
Fk = ∅ (step 13).
• The frequent itemset generation part of the Apriori algorithm has two important
characteristics.

• First, it is a level-wise algorithm; i.e., it traverses the itemset lattice one level at a time,
from frequent 1-itemsets to the maximum size of frequent itemsets.

• Second, it employs a generate-and-test strategy for finding frequent itemsets.
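• A rough, self-contained Python sketch of this level-wise, generate-and-test procedure is given below. It is a simplified illustration, not the textbook pseudocode: candidates are generated by merging pairs of frequent (k − 1)-itemsets and pruned with the Apriori principle, and a plain containment test stands in for the subset function.

from itertools import combinations

def apriori(transactions, minsup_count):
    # transactions: a list of sets of items
    # Steps 1-2: one pass over the data to find the frequent 1-itemsets F1
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {iset for iset, c in counts.items() if c >= minsup_count}
    frequent = {iset: counts[iset] for iset in Fk}

    k = 2
    while Fk:
        # Step 5: generate candidate k-itemsets from the frequent (k-1)-itemsets,
        # keeping only candidates whose (k-1)-subsets are all frequent (Apriori pruning)
        Ck = set()
        for a in Fk:
            for b in Fk:
                union = a | b
                if len(union) == k and all(frozenset(s) in Fk for s in combinations(union, k - 1)):
                    Ck.add(union)
        # Steps 6-10: an additional pass over the data to count candidate supports
        counts = {c: 0 for c in Ck}
        for t in transactions:
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        # Step 12: eliminate candidates whose support counts are below minsup
        Fk = {c for c in Ck if counts[c] >= minsup_count}
        frequent.update({c: counts[c] for c in Fk})
        k += 1
    # Step 13: the loop ends when no new frequent itemsets are generated (Fk is empty)
    return frequent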


Rule Generation

• Each frequent k-itemset, Y, can produce up to 2^k − 2 association rules, ignoring rules that have empty antecedents or consequents (∅ → Y or Y → ∅).

• An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence threshold.
• Let X = {1, 2, 3} be a frequent itemset.

• There are six candidate association rules that can be generated from X:
{1, 2} → {3}, {1, 3} → {2}, {2, 3} → {1}, {1} → {2, 3},

{2} → {1, 3}, and {3} → {1, 2}.


• Consider the rule {1, 2} → {3}, which is generated from the frequent
itemset X = {1, 2, 3}.

• The confidence for this rule is σ({1, 2, 3})/σ({1, 2}).
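• A short Python sketch of this enumeration (reusing the hypothetical support_count helper from the earlier example, and assuming a transaction list over the items 1, 2, and 3):

from itertools import combinations

Y = frozenset({1, 2, 3})                      # a frequent itemset
for r in range(1, len(Y)):                    # size of the antecedent X
    for X in map(frozenset, combinations(Y, r)):
        consequent = Y - X                    # candidate rule X -> Y - X
        conf = support_count(Y, transactions) / support_count(X, transactions)
        print(set(X), "->", set(consequent), "confidence =", conf)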


Confidence-Based Pruning

• Unlike the support measure, confidence does not have any monotone property.

• For example, the confidence for X → Y can be larger, smaller, or equal to the confidence for another rule X′ → Y′, where X′ ⊆ X and Y′ ⊆ Y.
Theorem. If a rule X → Y − X does not satisfy the confidence threshold, then any rule X1 → Y − X1, where X1 is a subset of X, must not satisfy the confidence threshold either.

• To prove this theorem, consider the following two rules: X1 → Y − X1 and X → Y − X, where X1 ⊂ X.

• The confidences of these rules are σ(Y)/σ(X1) and σ(Y)/σ(X), respectively.

• Since X1 is a subset of X, σ(X1) ≥ σ(X). Therefore, the former rule cannot have a higher confidence than the latter rule.
Rule Generation in Apriori Algorithm

• The Apriori algorithm uses a level-wise approach for generating association rules, where
each level corresponds to the number of items that belong to the rule consequent.

• Initially, all the high-confidence rules that have only one item in the rule consequent are
extracted.

• These rules are then used to generate new candidate rules.


• The below diagram shows a lattice structure for the association rules generated
from the frequent itemset {a, b, c, d}.
Pruning of association rules using the confidence measure
• If any node in the lattice has low confidence, then according to the theorem above, the entire subgraph spanned by the node can be pruned immediately.

• Suppose the confidence for {bcd} → {a} is low.

• All the rules containing item a in its consequent, including {cd} → {ab}, {bd} →
{ac}, {bc} → {ad}, and {d} → {abc} can be discarded.
A rough sketch of the rule generation step is given below.
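• This is a simplified Python illustration of the level-wise scheme described above, not the textbook's pseudocode: consequents grow by one item per level, and a low-confidence rule prunes all rules derived from it. The dictionary support is assumed to map every frequent itemset (as a frozenset) to its support count.

from itertools import combinations

def generate_rules(freq_itemset, support, minconf):
    rules = []
    if len(freq_itemset) < 2:
        return rules                      # a 1-itemset produces no rules
    # Level 1: high-confidence rules with a single item in the consequent
    consequents = []
    for i in freq_itemset:
        H = frozenset([i])
        if support[freq_itemset] / support[freq_itemset - H] >= minconf:
            rules.append((freq_itemset - H, H))
            consequents.append(H)
    m, k = 1, len(freq_itemset)
    while m + 1 < k and consequents:
        # Merge high-confidence consequents of size m into candidates of size m + 1
        candidates = {a | b for a in consequents for b in consequents if len(a | b) == m + 1}
        consequents = []
        for H in candidates:
            conf = support[freq_itemset] / support[freq_itemset - H]
            if conf >= minconf:
                rules.append((freq_itemset - H, H))
                consequents.append(H)     # only high-confidence consequents are extended
        m += 1
    return rules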
Compact Representation of Frequent Itemsets

• In practice, the number of frequent itemsets produced from a transaction data set can be
very large.

• It is useful to identify a small representative set of itemsets from which all other
frequent itemsets can be derived.
Maximal Frequent Itemsets

• Definition (Maximal Frequent Itemset). A maximal frequent itemset is defined as a frequent itemset for which none of its immediate supersets are frequent.
Maximal frequent itemset.
• Consider the itemset lattice.

• The itemsets in the lattice are divided into two groups: frequent and infrequent.

• Every itemset located above the border is frequent, while those located below the
border (the shaded nodes) are infrequent.
• Among the itemsets residing near the border, {a, d}, {a, c, e}, and {b, c, d, e} are
considered to be maximal frequent itemsets because their immediate supersets are
infrequent.

• An itemset such as {a, d} is maximal frequent because all of its immediate supersets, {a,
b, d}, {a, c, d}, and {a, d, e}, are infrequent.
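• A small Python sketch of this maximality test, assuming the set of frequent itemsets has already been computed (names are illustrative only):

def maximal_frequent(frequent, all_items):
    # frequent: a set of frozensets, the frequent itemsets
    # all_items: the set of all items in the data set
    maximal = set()
    for itemset in frequent:
        immediate_supersets = (itemset | {i} for i in all_items if i not in itemset)
        if all(s not in frequent for s in immediate_supersets):
            maximal.add(itemset)          # no immediate superset is frequent
    return maximal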
Closed Frequent Itemsets
• Definition (Closed Itemset). An itemset X is closed if none of its immediate supersets has exactly the same support count as X.

• Definition (Closed Frequent Itemset). An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.

• In the previous example, assuming that the support threshold is 40%, {b,c} is a
frequent itemset because its support is 60%.
• In this diagram, every transaction that contains b also contains c. Consequently, the
support for {b} is identical to {b, c} and {b} should not be considered a closed itemset.

• {b, c} is a closed itemset because it does not have the same support count as any of its
supersets.
• For example, consider the frequent itemset {a, d}. Because {a, d} is not closed, its support count must be identical to that of one of its immediate supersets.

• The key is to determine which superset (among {a, b, d}, {a, c, d}, or {a, d, e}) has exactly the same support count as {a, d}.

• Since support is anti-monotone, no superset can have a higher support than {a, d}, so the support for {a, d} must be equal to the largest support among its immediate supersets. Since {a, c, d} has a larger support than both {a, b, d} and {a, d, e}, the support for {a, d} must be identical to the support for {a, c, d}.
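• A Python sketch of this idea: the supports of the remaining frequent itemsets can be derived from the closed frequent itemsets by processing itemsets from the largest to the smallest. The inputs closed and frequent_itemsets are assumed names for illustration.

def derive_supports(frequent_itemsets, closed, all_items):
    # closed: dict mapping each closed frequent itemset (frozenset) to its support count
    # frequent_itemsets: all frequent itemsets (frozensets)
    support = dict(closed)
    for itemset in sorted(frequent_itemsets, key=len, reverse=True):
        if itemset in support:
            continue                      # closed itemsets already carry their true counts
        # A non-closed itemset has the same count as one of its immediate supersets,
        # and no superset can have a larger count, so take the maximum.
        support[itemset] = max(support[itemset | {i}]
                               for i in all_items
                               if i not in itemset and (itemset | {i}) in support)
    return support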
FP-Growth Algorithm

• FP-growth encodes the data set using a compact data structure called an FP-tree and extracts frequent itemsets directly from this structure.
FP-Tree Representation

• An FP-tree is a compressed representation of the input data.

• It is constructed by reading the data set one transaction at a time and mapping each
transaction onto a path in the FP-tree.

• As different transactions have several items in common, their paths may overlap.
• The FP-tree is constructed in the following way:

• The data set is scanned once to determine the support count of each item. Infrequent
items are discarded, while the frequent items are sorted in decreasing support counts.

• The algorithm makes a second pass over the data to construct the FP-tree. After reading the first transaction, {a, b}, the nodes labeled a and b are created.
• After reading the second transaction, {b,c,d}, a new set of nodes is created for items b, c,
and d. A path is then formed to represent the transaction by connecting the nodes null →
b → c → d.

• This process continues until every transaction has been mapped onto one of the paths
given in the FP-tree.
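• A compact Python sketch of this construction. It is a simplified illustration: the header table with node links that FP-growth also maintains is omitted here.

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # the item this node represents (None for the null root)
        self.count = 1
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fptree(transactions, minsup_count):
    # First pass: determine support counts and discard infrequent items
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= minsup_count}

    root = FPNode(None, None)     # the null node
    root.count = 0
    # Second pass: map each transaction onto a path, items in decreasing support order
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1            # overlapping path: increment count
            else:
                node.children[item] = FPNode(item, node)  # start a new branch
            node = node.children[item]
    return root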
Frequent Itemset Generation in FP-Growth Algorithm

• FP-growth is an algorithm that generates frequent itemsets from an FP-tree by exploring the tree in a bottom-up fashion.

• Given the example tree, the algorithm looks for frequent itemsets ending in e first,
followed by d, c, b, and finally, a.
• FP-growth finds all the frequent itemsets ending with a particular suffix by
employing a divide-and-conquer strategy.

• For example, find all frequent itemsets ending in e. To do this, we must first check
whether the itemset {e} itself is frequent.

• If it is frequent, we consider the subproblem of finding frequent itemsets ending in de, followed by ce, be, and ae.
• Consider the task of finding frequent itemsets ending with e.
1. The first step is to gather all the paths containing node e. These initial paths are called prefix paths.
2. From the prefix paths, the support count of e is obtained; if e is frequent, the prefix paths are converted into a conditional FP-tree for e (the counts are updated, node e is removed, and items that become infrequent are discarded).
3. The conditional FP-tree is then mined recursively to find the frequent itemsets ending in de, ce, be, and ae.
• Combining the frequent itemsets found for each suffix (e, d, c, b, and a) gives the complete set of frequent itemsets.
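• A Python sketch of step 1 above, gathering the prefix paths of an item from the FP-tree built earlier. A full FP-growth implementation would instead follow the node links of a header table rather than scanning the whole tree.

def prefix_paths(root, item):
    # For every node labeled `item`, walk back to the root and record the path
    # from the root to that node (excluding the node itself) with the node's count.
    paths = []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children.values())
        if node.item == item:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            paths.append((list(reversed(path)), node.count))
    return paths

These (prefix path, count) pairs form the conditional pattern base for the item, from which the conditional FP-tree is built and mined recursively.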
• THANK YOU
