Aggregation
MARKET BASKET ANALYSIS IN PYTHON
Isaiah Hull
Economist
Exploring the data
import pandas as pd
# Load novelty gift data.
gifts = pd.read_csv('datasets/novelty_gifts.csv')
# Preview data with head() method.
print(gifts.head())
InvoiceNo Description
0 562583 IVORY STRING CURTAIN WITH POLE
1 562583 PINK AND BLACK STRING CURTAIN
2 562583 PSYCHEDELIC TILE HOOK
3 562583 ENAMEL COLANDER CREAM
4 562583 SMALL FOLDING SCISSOR(POINTED EDGE)
Exploring the data
# Print number of transactions.
print(len(gifts['InvoiceNo'].unique()))
9709
# Print number of items.
print(len(gifts['Description'].unique()))
3461
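The same counts can also be obtained with pandas' `nunique()`. The snippet below is a minimal sketch on a made-up three-row frame; the rows and the name `toy` are illustrative assumptions, not taken from `novelty_gifts.csv`.

```python
import pandas as pd

# Toy stand-in for the gifts data (hypothetical rows, for illustration only).
toy = pd.DataFrame({
    'InvoiceNo': [1, 1, 2],
    'Description': ['MUG', 'BAG', 'MUG'],
})

# nunique() counts distinct values directly; with no missing values it
# agrees with len(series.unique()).
n_transactions = toy['InvoiceNo'].nunique()
n_items = toy['Description'].nunique()
print(n_transactions, n_items)
```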
Pruning and aggregation
[Figure: two panels, Pruning and Aggregation]
Aggregating the data
# Load one-hot encoded data
onehot = pd.read_csv('datasets/online_retail_onehot.csv')
# Print preview of DataFrame
print(onehot.head(2))
50'S CHRISTMAS GIFT BAG LARGE DOLLY GIRL BEAKER ... ZINC WILLIE WINKIE CANDLE STICK
0 False False False
1 False False True
Aggregating the data
# Select the column names for bags and boxes
bag_headers = [i for i in onehot.columns if 'bag' in i.lower()]
box_headers = [i for i in onehot.columns if 'box' in i.lower()]
# Select the corresponding columns
bags = onehot[bag_headers]
boxes = onehot[box_headers]
print(bags)
50'S CHRISTMAS GIFT BAG LARGE RED SPOT GIFT BAG LARGE
0 False False
1 False False
... ... ...
Aggregating the data
# Sum over columns
bags = (bags.sum(axis=1) > 0.0).values
boxes = (boxes.sum(axis=1) > 0.0).values
print(bags)
[False True False ... False True False]
Aggregating the data
import numpy as np  # needed for np.vstack
# Add results to DataFrame
aggregated = pd.DataFrame(np.vstack([bags, boxes]).T, columns = ['bags', 'boxes'])
print(aggregated.head())
bags boxes
0 False False
1 True False
2 False False
3 False False
4 True False
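The bag and box steps above repeat the same pattern, so they can be wrapped in one helper. This is only a sketch: the function name `aggregate`, the toy item names, and the keyword list are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical one-hot data: rows are transactions, columns are items.
toy_onehot = pd.DataFrame({
    'RED GIFT BAG': [True, False, False],
    'JUMBO BAG': [False, False, True],
    'CAKE BOX': [False, True, True],
})

def aggregate(onehot, keywords):
    """Return one Boolean column per keyword: True if any matching item was bought."""
    out = {}
    for word in keywords:
        # Columns whose name contains the keyword (case-insensitive).
        cols = [c for c in onehot.columns if word in c.lower()]
        out[word] = onehot[cols].any(axis=1)
    return pd.DataFrame(out)

toy_aggregated = aggregate(toy_onehot, ['bag', 'box'])
print(toy_aggregated)
```

Substring matching is crude: a keyword such as 'bag' would also match an unrelated item name that happens to contain those letters, so the selected headers are worth inspecting before aggregating.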
Market basket analysis with aggregates
Aggregation process:
Items -> Categories
Compute metrics
Identify rules
# Compute support
print(aggregated.mean())
bags 0.130075
boxes 0.071429
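Once items are aggregated into categories, the usual metrics follow from column means. The sketch below computes support, confidence, and lift for a bags -> boxes rule on hypothetical data; the values are invented and are not the course dataset.

```python
import pandas as pd

# Hypothetical aggregated data (not the course dataset).
toy = pd.DataFrame({
    'bags':  [True, True, False, True],
    'boxes': [True, False, False, True],
})

# Support: share of transactions containing the item(s).
support_bags = toy['bags'].mean()
support_boxes = toy['boxes'].mean()
support_both = (toy['bags'] & toy['boxes']).mean()

# Confidence of bags -> boxes, and lift of the pair.
confidence = support_both / support_bags
lift = support_both / (support_bags * support_boxes)
print(support_both, confidence, lift)
```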
The Apriori algorithm
Counting itemsets
$\binom{n}{k} = \frac{n!}{(n-k)!\,k!}$

Item Count (n)   Itemset Size (k)   Combinations
3461             0                  1
3461             1                  3461
3461             2                  5,987,530
3461             3                  6,903,622,090
3461             4                  5,968,181,296,805
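The combination counts can be reproduced with the standard library. A minimal sketch, assuming Python 3.8 or later for `math.comb`:

```python
from math import comb

n = 3461  # number of distinct items in the novelty gift data
for k in range(5):
    # comb(n, k) evaluates n! / ((n - k)! k!) exactly with integers.
    print(k, f"{comb(n, k):,}")
```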
Counting itemsets
$\sum_{k=0}^{n} \binom{n}{k} = 2^n$
$n = 3461 \rightarrow 2^{3461}$
$2^{3461} \gg 10^{82}$
Number of atoms in universe: $10^{82}$.
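Python integers have arbitrary precision, so the identity and the comparison can be checked directly. This is a small sketch and the variable names are illustrative.

```python
from math import comb

n = 3461
# Sum the number of itemsets of every size k = 0, ..., n.
total = sum(comb(n, k) for k in range(n + 1))

print(total == 2**n)     # the binomial identity
print(total > 10**82)    # compare with the atom-count figure
print(len(str(total)))   # number of decimal digits in 2**n
```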
Reducing the number of itemsets
Not possible to consider all itemsets.
Not even possible to enumerate them.
How do we remove an itemset without even evaluating it?
Could set maximum k value.
Apriori algorithm offers alternative.
Doesn't require enumeration of all itemsets.
Sensible rule for pruning.
The Apriori principle
Apriori principle.
Subsets of frequent sets are frequent.
Retain sets known to be frequent.
Prune sets not known to be frequent.
Example:
{Candles} = Infrequent -> {Candles, Signs} = Infrequent
{Candles, Signs} = Infrequent -> {Candles, Signs, Boxes} = Infrequent
{Candles, Signs, Boxes} = Infrequent -> {Candles, Signs, Boxes, Bags} = Infrequent
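The pruning idea can be sketched in plain Python: build candidates level by level and count support only for candidates whose smaller subsets are all frequent. This is an illustrative toy, not mlxtend's implementation; `apriori_sketch` and the baskets are hypothetical.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support, max_len):
    """Level-wise frequent itemset search with Apriori pruning (illustrative only)."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    current = [frozenset([i]) for i in items]  # size-1 candidates
    k = 1
    while current and k <= max_len:
        # Count support only for candidates that survived pruning.
        level = {}
        for cand in current:
            support = sum(cand <= t for t in transactions) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Build size-(k+1) candidates from frequent size-k sets, and keep a
        # candidate only if every size-k subset is itself frequent.
        k += 1
        unions = {a | b for a in level for b in level if len(a | b) == k}
        current = [c for c in unions
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Hypothetical transactions (sets of item names).
baskets = [{'bags', 'boxes'}, {'bags', 'boxes', 'signs'},
           {'bags', 'signs'}, {'candles'}, {'bags', 'boxes'}]
result = apriori_sketch(baskets, min_support=0.4, max_len=3)
for itemset, support in result.items():
    print(sorted(itemset), support)
```

In this toy run, {boxes, signs} falls below the threshold, so {bags, boxes, signs} is discarded without its support ever being counted.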
Apriori implementation
# Import Apriori algorithm
from mlxtend.frequent_patterns import apriori
# Load one-hot encoded novelty gifts data
onehot = pd.read_csv('datasets/online_retail_onehot.csv')
# Print header.
print(onehot.head())
50'S CHRISTMAS GIFT BAG LARGE ... ZINC WILLIE WINKIE CANDLE STICK \
0 False ... False
1 False ... False
2 False ... False
3 False ... False
4 False ... False
Apriori implementation
# Compute frequent itemsets
frequent_itemsets = apriori(onehot, min_support = 0.0005,
max_len = 4, use_colnames = True)
# Print number of itemsets
print(len(frequent_itemsets))
3652
Apriori implementation
# Print itemsets
print(frequent_itemsets.head())
support itemsets
0 0.000752 ( 50'S CHRISTMAS GIFT BAG LARGE)
1 0.001504 ( DOLLY GIRL BEAKER)
...
1500 0.000752 (PING MICROWAVE APRON, FOOD CONTAINER SET 3 LO...
1501 0.000752 (WOOD 2 DRAWER CABINET WHITE FINISH, FOOD CONT...
...
Basic Apriori results pruning
Apriori and association rules
Apriori prunes itemsets.
Applies minimum support threshold.
Modified version can prune by number of items.
Doesn't tell us about association rules.
Association rules.
Many more association rules than itemsets.
{Bags, Boxes}: Bags -> Boxes OR Boxes -> Bags.
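To see why rules outnumber itemsets, the sketch below lists every way to split a single itemset into a non-empty antecedent and a non-empty consequent. The helper name `rules_from_itemset` is hypothetical.

```python
from itertools import combinations

def rules_from_itemset(itemset):
    """List every antecedent -> consequent split of one itemset."""
    items = sorted(itemset)
    rules = []
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules

pair_rules = rules_from_itemset({'Bags', 'Boxes'})
triple_rules = rules_from_itemset({'Bags', 'Boxes', 'Signs'})
print(len(pair_rules), len(triple_rules))
```

An itemset with k items yields 2^k - 2 such splits, so the rule count grows much faster than the itemset count.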
How to compute association rules
Computing rules from Apriori results.
Difficult to enumerate for high n and k.
Could undo itemset pruning by Apriori.
Reducing number of association rules.
mlxtend module offers means of pruning association rules.
association_rules() takes frequent itemsets, metric, and threshold.
How to compute association rules
# Import Apriori algorithm and association rules function
from mlxtend.frequent_patterns import apriori, association_rules
# Load one-hot encoded novelty gifts data
onehot = pd.read_csv('datasets/online_retail_onehot.csv')
# Apply Apriori algorithm
frequent_itemsets = apriori(onehot,
use_colnames=True,
min_support=0.0001)
# Compute association rules
rules = association_rules(frequent_itemsets,
metric = "support",
min_threshold = 0.0)
The importance of pruning
# Print the rules.
print(rules)
antecedents ... conviction
0 (CARDHOLDER GINGHAM CHRISTMAS TREE) ... inf
...
79505 (SET OF 3 HEART COOKIE CUTTERS) ... 1.998496
# Print the frequent itemsets.
print(frequent_itemsets)
support itemsets
0 0.000752 ( 50'S CHRISTMAS GIFT BAG LARGE)
...
4707 0.000752 (PIZZA PLATE IN BOX, CHRISTMAS ...
The importance of pruning
# Compute association rules
rules = association_rules(frequent_itemsets,
metric = "support",
min_threshold = 0.001)
# Print the rules.
print(rules)
antecedents conviction
0 (BIRTHDAY CARD, RETRO SPOT) ... 2.977444
1 (JUMBO BAG RED RETROSPOT) ... 1.247180
Exploring the set of rules
print(rules.columns)
Index(['antecedents', 'consequents', 'antecedent support',
'consequent support', 'support', 'confidence', 'lift', 'leverage',
'conviction'],
dtype='object')
print(rules[['antecedents','consequents']])
antecedents consequents
0 (JUMBO BAG RED RETROSPOT) (BIRTHDAY CARD, RETRO SPOT)
1 (BIRTHDAY CARD, RETRO SPOT) (JUMBO BAG RED RETROSPOT)
Pruning with other metrics
# Compute association rules
rules = association_rules(frequent_itemsets,
metric = "antecedent support",
min_threshold = 0.002)
# Print the number of rules.
print(len(rules))
3899
Advanced Apriori results pruning
Applications
[Figure: two panels, Cross-Promotion and Aggregation]
The Apriori algorithm
List of Lists -> One-Hot Encoding -> Apriori Algorithm
The Apriori algorithm
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
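from mlxtend.frequent_patterns import apriori, association_rules
# Lists of varying length are stored as an object array, so pickling must be allowed
itemsets = np.load('itemsets.npy', allow_pickle=True)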
print(itemsets)
[['EASTER CRAFT 4 CHICKS'],
['CERAMIC CAKE DESIGN SPOTTED MUG', 'CHARLOTTE BAG APPLES DESIGN'],
['SET 12 COLOUR PENCILS DOLLY GIRL'],
...
['JUMBO BAG RED RETROSPOT', ... 'LIPSTICK PEN FUSCHIA']]
The Apriori algorithm
# One-hot encode data
encoder = TransactionEncoder()
onehot = encoder.fit(itemsets).transform(itemsets)
onehot = pd.DataFrame(onehot, columns = encoder.columns_)
# Apply Apriori algorithm and print
frequent_itemsets = apriori(onehot, use_colnames=True, min_support=0.001)
print(frequent_itemsets)
support itemsets
0 0.001504 ( DOLLY GIRL BEAKER)
1 0.002256 ( RED SPOT GIFT BAG LARGE)
...
428 0.001504 (BIRTHDAY CARD, RETRO SPOT, JUMBO BAG RED RETR...
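For intuition about what the encoding step produces, here is a by-hand version on a hypothetical list of lists. It is a sketch of the idea only, not of `TransactionEncoder` internals, and the item names are made up.

```python
import pandas as pd

# Hypothetical transactions in list-of-lists form.
toy_itemsets = [['MUG'], ['MUG', 'BAG'], ['PEN', 'BAG']]

# Column order: every distinct item, sorted.
columns = sorted({item for basket in toy_itemsets for item in basket})

# One row per transaction, True where the item appears in the basket.
rows = [[item in basket for item in columns] for basket in toy_itemsets]
toy_onehot = pd.DataFrame(rows, columns=columns)
print(toy_onehot)
```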
Apriori algorithm results
print(len(onehot.columns))
4201
print(len(frequent_itemsets))
2328
rules = association_rules(frequent_itemsets)
Association rules
print(rules['consequents'])
0 (DOTCOM POSTAGE)
...
9 (HERB MARKER THYME)
...
234 (JUMBO BAG RED RETROSPOT)
235 (WOODLAND CHARLOTTE BAG)
236 (RED RETROSPOT CHARLOTTE BAG)
237 (STRAWBERRY CHARLOTTE BAG)
238 (CHARLOTTE BAG SUKI DESIGN)
Name: consequents, Length: 239, dtype: object
Filtering with multiple metrics
targeted_rules = rules[rules['consequents'] == {'HERB MARKER THYME'}].copy()
filtered_rules = targeted_rules[(targeted_rules['antecedent support'] > 0.01) &
(targeted_rules['support'] > 0.009) &
(targeted_rules['confidence'] > 0.85) &
(targeted_rules['lift'] > 1.00)]
print(filtered_rules['antecedents'])
9 (HERB MARKER BASIL)
25 (HERB MARKER PARSLEY)
27 (HERB MARKER ROSEMARY)
Name: antecedents, dtype: object
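The multi-metric filter is ordinary Boolean masking in pandas. The sketch below applies the same pattern to a small hypothetical rules table; the rows and thresholds are invented, and only the column names follow the `association_rules()` output shown earlier.

```python
import pandas as pd

# Hypothetical rules table using the same column names as above.
toy_rules = pd.DataFrame({
    'antecedents': [frozenset({'A'}), frozenset({'B'}), frozenset({'C'})],
    'consequents': [frozenset({'X'}), frozenset({'X'}), frozenset({'Y'})],
    'antecedent support': [0.020, 0.005, 0.030],
    'support': [0.018, 0.004, 0.025],
    'confidence': [0.90, 0.80, 0.83],
    'lift': [1.5, 1.2, 0.9],
})

# Match the consequent explicitly, element by element.
target = frozenset({'X'})
is_target = toy_rules['consequents'].apply(lambda c: c == target)

# Combine Boolean masks with & ; each condition needs its own parentheses.
mask = (is_target &
        (toy_rules['antecedent support'] > 0.01) &
        (toy_rules['confidence'] > 0.85) &
        (toy_rules['lift'] > 1.0))
kept = toy_rules[mask]
print(kept['antecedents'].tolist())
```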
Grouping products
Aggregation and dissociation
# Load aggregated data
aggregated = pd.read_csv('datasets/online_retail_aggregated.csv')
# Compute frequent itemsets (encoder is the TransactionEncoder created earlier)
onehot = encoder.fit(aggregated).transform(aggregated)
data = pd.DataFrame(onehot, columns = encoder.columns_)
frequent_itemsets = apriori(data, use_colnames=True)
# Compute standard metrics
rules = association_rules(frequent_itemsets)
# Compute Zhang's metric (zhangs_rule() is a helper defined outside this section)
rules['zhang'] = zhangs_rule(rules)
Zhang's rule
# Print rules that indicate dissociation
print(rules[rules['zhang'] < 0][['antecedents','consequents']])
antecedents consequents
2 (bag) (candle)
3 (candle) (bag)
4 (sign) (bag)
5 (bag) (sign)
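The `zhangs_rule()` helper is not defined in this section. One possible version, assuming the standard definition of Zhang's metric and the column names produced by `association_rules()`, is sketched below; the name `zhang_metric` and the support values are hypothetical.

```python
import numpy as np
import pandas as pd

def zhang_metric(rules):
    """Zhang's metric from support columns; values lie in [-1, 1]."""
    p_ab = rules['support']
    p_a = rules['antecedent support']
    p_b = rules['consequent support']
    numerator = p_ab - p_a * p_b
    # Element-wise maximum of the two normalising terms.
    denominator = np.maximum(p_ab * (1 - p_a), p_a * (p_b - p_ab))
    return numerator / denominator

# Hypothetical support values for two rules.
toy_rules = pd.DataFrame({
    'antecedent support': [0.5, 0.5],
    'consequent support': [0.5, 0.5],
    'support': [0.4, 0.1],
})
toy_rules['zhang'] = zhang_metric(toy_rules)
print(toy_rules['zhang'].tolist())
```

Positive values indicate association and negative values indicate dissociation, which is why the filter above keeps rows with `zhang < 0`.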
Selecting a floorplan