Mining Tutorial Slides
on Data Mining
5 April 2001
Tutorial on
E-commerce and
Clickstream Mining
Jonathan Becher
VP, Product Strategy
Accrue Software, Inc.
jonbecher@yahoo.com
Ronny Kohavi
Director, Data Mining
Blue Martini Software
ronnyk@CS.Stanford.edu
http://www.Kohavi.com
Agenda
l Introduction (45 min)
l Architecture and Data Flow (45 min)
ä Collecting the data
ä Break (10 min)
ä Building the warehouse
ä Closing the loop
l Mining Web Data (75 min)
ä Transformations
ä Unofficial Break (10 min)
ä Reporting and OLAP
ä Mining
ä Visualization
l Teasers & Summary (15 min)
Jon Becher and Ronny Kohavi
l Logistics: bathroom is …
l Questions? Special requests?
† Stats from E-performance: The path to rational exuberance, McKinsey Quarterly, 2001, No 1
Value Proposition
l Why mine e-commerce and clickstream data?
ä Improve conversion rate through personalization
ä Optimize marketing campaigns (banners, email, other
media) that bring visitors to your site by measuring
return on investment (ROI).
ä Improve basket size through cross-sells and up-sells
ä Streamline navigation paths through the site
ä Avoid content delivery issues (poorly formatted for
AOL, too rich for low bandwidth users, redundant or
confusing content)
ä Identify customer segments that you can target offline
ä Experiment quickly. The Web is a laboratory.
Understand what works quickly
Definitions
l Hit – any Web server request that generates a log file entry. A
page has many elements (html, gifs), each generating a hit.
l Page – Web server file that is sent to the client user agent, usually a
browser. Typically HTML files, but not all HTML files are considered
pages (e.g., frame sets). Can be static or dynamic.
l Session – all actions (e.g., requests, resets) made in a single visit,
from entry until logout or timeout (e.g., 20 minutes of no activity).
l Visitor – a user or bot/spider/crawler that makes requests at a
site. Can be new, returning, registered, anonymous
l Buyer – visitor that purchases something
l Customer – a visitor that registers (sometimes defined as buyer)
l Conversion – rate at which visitors transition to desired state
(buyers, customers, registered, started checkout)
l Host – remote machine, identified by IP address, used for visit.
l Referrer – a page that provides a link to another page. Can be
internal or external.
Clearly there was a page view at Yahoo, but was there also a
page view at Weathernews? How about a hit? A visit?
Teaser - Conversion
l Product conversions are computed as
rate = “Product quantity sold” / “Number of product views”
l How can conversion rates be above 100%?
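A minimal Python sketch of the product conversion formula. The function name and the numbers are made up for illustration. It shows one possible way to get a rate above 100%: buyers order several units after a single product view.

```python
def product_conversion_rate(quantity_sold: int, product_views: int) -> float:
    """Product conversion = product quantity sold / number of product views."""
    if product_views == 0:
        raise ValueError("conversion is undefined with zero product views")
    return quantity_sold / product_views

# Hypothetical product: 40 product views, but some buyers ordered several units.
rate = product_conversion_rate(quantity_sold=60, product_views=40)
print(f"{rate:.0%}")  # prints "150%"
```

Because the numerator counts units and the denominator counts views, the ratio is not bounded by 1.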
Heavy Purchasers
l Factors correlating with heavy purchasers:
ä Not an AOL user (defined by browser) - browser window
too small for layout (inappropriate site design)
ä Came to site from print-ad or news, not friends & family
- broadcast ads versus viral marketing
ä Very high and very low income
ä Older customers (Acxiom)
ä High home market value, owners of luxury vehicles
(Acxiom)
ä Geographic: Northeast U.S. states
ä Repeat visitors (four or more times) - loyalty, replenishment
ä Visits to areas of site - personalize differently
Referring Traffic
Referring site traffic changed dramatically over time.
Graph of relative percentages of top 5 sites
[Figure: Top Referrers. X axis: session date. Y axes: 0% to 100% and 0 to 6000. Series: Fashion Mall, Yahoo, ShopNow, MyCoupons, Winnie-Cooper, and total from top referrers. Labels on the chart: MyCoupons.com, Winnie-Cooper, "... and Companies/Apparel/Lingerie", FashionMall.com.]
Referrers - Ad Policy
l Referrers - establish ad policy based on
conversion rates, not clickthroughs!
ä Overall conversion rate: 0.8% (relatively low)
ä MyCoupons had an 8.2% conversion rate, but low
spenders
ä FashionMall and ShopNow brought 35,000 visitors.
Only 23 purchased (0.07% conversion rate!)
ä What about Winnie-Cooper?
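The arithmetic behind these figures, as a short Python sketch. The function name is hypothetical. The visitor and buyer counts and the 0.8% overall rate are the ones quoted on this slide.

```python
def conversion_rate(buyers: int, visitors: int) -> float:
    """Fraction of visitors who purchased."""
    return buyers / visitors

# FashionMall and ShopNow combined, as quoted on the slide.
rate = conversion_rate(buyers=23, visitors=35_000)
print(f"{rate:.2%}")  # prints "0.07%"

# The overall site rate of 0.8% is roughly 12 times higher.
overall = 0.008
print(round(overall / rate, 1))
```

Ranking referrers by this rate rather than by clickthroughs is what exposes the gap.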
Today
[Figure: Total Visits vs. Effective Linespeed (X axis: 10 to 70)]
[Figure: Reset Frequency (0 to 0.45) vs. Effective Linespeed (X axis: 10 to 70)]
Agenda
l Introduction (45 min)
l Architecture and Data Flow (45 min)
ä Collecting the data
ä Building the warehouse
ä Closing the loop
l Break (10 min)
l Mining Web Data (75 min)
ä Transformations
ä Reporting and OLAP
ä Mining
ä Visualization
l Summary (20 min)
Architecture
[Figure: architecture diagram. Web Site (operations) and Warehouse (analysis). Labeled components: Internet, corporate firewall, routers, switches, load balancers, secured local site, test site, local mirrored site, mirrored sites in London and Tokyo.]
Data Collection
l Visitor activity information
ä Web server log files
ä Web server instrumentation (plug-ins)
ä TCP/IP packet sniffing (network collection)
ä Application server instrumentation
l Disadvantages of Web server log files
ä Multiple file formats (e.g., ELF, the Extended Log Format)
ä Designed for debugging Web servers, not for analysis
ä Multiple log files for multiple Web servers
ä Distributed sites make sessionizing more difficult
l Disadvantages
ä No more data than is available from the log file
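To make the log-file option concrete, here is a minimal Python sketch that parses one hit from a Web server log. It assumes the widely used Common/Combined Log Format. The slides do not say which format any particular site used, and the sample line, field names, and function name are made up.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed layout: host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line: str) -> Optional[dict]:
    """Return the fields of one hit, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    hit = match.groupdict()
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    hit["status"] = int(hit["status"])
    return hit

sample = ('192.0.2.7 - - [05/Apr/2001:10:15:32 -0700] "GET /index.html HTTP/1.0" '
          '200 2326 "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 5.0)"')
hit = parse_log_line(sample)
print(hit["host"], hit["url"], hit["status"], hit["referrer"])
```

Each matched line is one hit. Pages, sessions, and visitors all have to be derived from these records afterwards.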
Advertising Analysis
How effective are banner ads?
Agenda
l Introduction (45 min)
l Architecture and Data Flow (45 min)
ä Collecting the data
ä Building the warehouse
ä Closing the loop
l Mining Web Data (75 min)
ä Transformations
ä Unofficial break (10 min)
ä Reporting and Visualization
ä OLAP
ä Mining
l Summary (20 min)
Analytical                 Operational
Few large transactions     Many small transactions
Customer centric           Session and product centric
Hard to parallelize        Easy to parallelize (multiple web/app servers)
[Figure: closing the loop. Labeled components: touchpoints (Web store, store, call center, wireless, campaigns), OLTP, data warehouse, data mining, visualization, other data sources, and syndicated demographic data (e.g., Experian/Acxiom).]
Agenda
l Introduction (45 min)
l Architecture and Data Flow (45 min)
ä Collecting the data
ä Building the warehouse
ä Closing the loop
l Mining Web Data (75 min)
ä Transformations
ä Reporting and Visualization
ä OLAP
ä Mining
l Summary (20 min)
Transformations
l Creating a warehouse is not enough; you need to:
ä Make URLs more understandable (dynamic content, page titles)
ä Handle reverse DNS lookup (208.216.181.15 → www.amazon.com)
ä Sessionize (decide which requests belong to same session if you are
not using an application server). Commonly cookie-based
ä Identify crawlers/robots
ä Identify test users
ä Compute session-level attributes (number of pages, time spent,
session milestones)
ä Create customer attributes (repeat visitor, frequent purchaser,
high spender)
ä Use product and content attributes
ä Compute abstractions of existing attributes (e.g., product
hierarchies, referrers, browsers, regions)
ä Calculate date/time attributes
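A minimal Python sketch of cookie-based sessionizing, one of the transformations listed here. It assumes each request already carries a cookie ID and a timestamp. It uses the 20-minute inactivity timeout given as an example in the Definitions slide. The tuple layout and function name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=20)  # assumed inactivity threshold

def sessionize(requests):
    """Group (cookie_id, timestamp, url) requests into sessions per cookie.

    A new session starts when the gap since the cookie's previous request
    exceeds SESSION_TIMEOUT. Returns a list of sessions (lists of requests).
    """
    by_cookie = defaultdict(list)
    for request in requests:
        by_cookie[request[0]].append(request)

    sessions = []
    for cookie_requests in by_cookie.values():
        cookie_requests.sort(key=lambda r: r[1])
        current = [cookie_requests[0]]
        for request in cookie_requests[1:]:
            if request[1] - current[-1][1] > SESSION_TIMEOUT:
                sessions.append(current)
                current = [request]
            else:
                current.append(request)
        sessions.append(current)
    return sessions

t0 = datetime(2001, 4, 5, 10, 0)
requests = [
    ("cookie-a", t0, "/home"),
    ("cookie-a", t0 + timedelta(minutes=5), "/products"),
    ("cookie-a", t0 + timedelta(minutes=50), "/home"),  # 45-minute gap: new session
    ("cookie-b", t0 + timedelta(minutes=1), "/home"),
]
print([len(s) for s in sessionize(requests)])  # prints [2, 1, 1]
```

Without cookies, the same loop could key on host IP plus browser string, with the accuracy limits that proxies impose.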
Dynamic Content
l Must rewrite the URL to increase understanding and facilitate
analysis of served content
http://www.music.com/tape/db=sd-7599/0,1,2,00.html → http://www.music.com/tape/classical/bach
http://www.music.com/generic.asp → http://www.music.com/tape/classical/bach
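One way to do this rewriting is a lookup from the content ID embedded in the dynamic URL to a descriptive path. The Python sketch below is only an illustration. The `db=` segment convention and the mapping of `sd-7599` to `/tape/classical/bach` are assumptions read off the example URLs on this slide, and the table and function names are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical lookup; in practice it would come from the content database.
CONTENT_PATHS = {"sd-7599": "/tape/classical/bach"}

def rewrite_url(url: str) -> str:
    """Replace an opaque dynamic URL with a descriptive one when the ID is known."""
    parts = urlsplit(url)
    for segment in parts.path.split("/"):
        if segment.startswith("db="):
            readable = CONTENT_PATHS.get(segment[len("db="):])
            if readable is not None:
                return f"{parts.scheme}://{parts.netloc}{readable}"
    return url  # unknown URLs are left unchanged

print(rewrite_url("http://www.music.com/tape/db=sd-7599/0,1,2,00.html"))
# prints "http://www.music.com/tape/classical/bach"
```

A URL such as generic.asp carries no ID at all, so rewriting it needs information from the application server rather than from the URL.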
Crawler/Robots
l Crawlers are programs that visit your site
ä Search crawlers - Good
ä Shopping bots - Good
ä IE5 offline viewer
ä Performance assessment (e.g., Keynote)
ä E-mail harvesters - Evil
ä Students learning Perl scripts
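A rough Python sketch of how sessions from crawlers might be flagged before analysis. Both heuristics (substrings in the user-agent string and a request for /robots.txt) are common practice but are assumptions here, not rules taken from the slides. Real crawler lists are longer and change over time.

```python
# Assumed substrings; a production list would be maintained separately.
BOT_AGENT_HINTS = ("bot", "crawler", "spider", "slurp")

def looks_like_crawler(user_agent, requested_urls):
    """Heuristic: flag a session as a crawler by user agent or robots.txt access."""
    agent = user_agent.lower()
    if any(hint in agent for hint in BOT_AGENT_HINTS):
        return True
    # Well-behaved crawlers fetch /robots.txt; browsers normally do not.
    return "/robots.txt" in requested_urls

print(looks_like_crawler("Googlebot/2.1 (+http://www.googlebot.com/bot.html)", ["/"]))
print(looks_like_crawler("Mozilla/4.0 (compatible; MSIE 5.0)", ["/", "/products"]))
```

Crawlers that disguise their user agent, such as e-mail harvesters, need behavioral signals instead (for example request rate).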
Test Users
l Every respectable site has a QA department
l Their users hit the site with different patterns
ä Their goal is to break the site, not to purchase
ä They’ll change URLs
ä They’ll surf quickly
ä They’ll click on random links
Session-level Attributes
l Pages
ä Page views per session (deep vs. shallow)
ä Unique pages per session
ä Promotional vs. standard entry
l Time
ä Time spent per session
ä Average time per page
ä Fast vs. slow connection
l Session Milestones
ä Did they go through registration, when?
ä Did they look at the privacy statement?
ä Did they use search?
ä Did they start and/or complete checkout?
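A minimal Python sketch of deriving a few of these session-level attributes from the page views of one session. The URL prefixes used for the milestones (`/search`, `/checkout`) and the function name are hypothetical.

```python
from datetime import datetime

def session_attributes(page_views):
    """page_views: list of (timestamp, url) tuples for one session, in time order."""
    urls = [url for _, url in page_views]
    # Time on the last page is unknown, so the session duration is undercounted.
    duration = (page_views[-1][0] - page_views[0][0]).total_seconds()
    return {
        "page_views": len(urls),
        "unique_pages": len(set(urls)),
        "seconds_spent": duration,
        "avg_seconds_per_page": duration / max(len(urls) - 1, 1),
        "used_search": any(url.startswith("/search") for url in urls),
        "started_checkout": any(url.startswith("/checkout") for url in urls),
    }

session = [
    (datetime(2001, 4, 5, 10, 0), "/home"),
    (datetime(2001, 4, 5, 10, 1), "/search?q=jeans"),
    (datetime(2001, 4, 5, 10, 3), "/product/42"),
    (datetime(2001, 4, 5, 10, 4), "/home"),
]
print(session_attributes(session))
```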
Customer Attributes
l Some attributes based on customer history
ä Initial vs. Repeat visitor/purchaser
ä Recent visitor/purchaser - Recency
ä Frequent visitor/purchaser - Frequency
ä Readers vs. browsers (time per page) - Monetary / Duration
ä Heavy spender - Monetary / Duration
ä Original referrer
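A minimal Python sketch of computing recency, frequency, and monetary attributes from purchase history. The record layout, customer names, and amounts are made up.

```python
from datetime import date

def rfm(purchases, today):
    """purchases: list of (customer_id, purchase_date, amount).

    Returns {customer_id: (recency_days, frequency, monetary)}.
    """
    summary = {}
    for customer, when, amount in purchases:
        last, count, total = summary.get(customer, (when, 0, 0.0))
        summary[customer] = (max(last, when), count + 1, total + amount)
    return {
        customer: ((today - last).days, count, total)
        for customer, (last, count, total) in summary.items()
    }

purchases = [
    ("ann", date(2001, 1, 10), 40.0),
    ("ann", date(2001, 3, 20), 60.0),
    ("bob", date(2000, 12, 1), 15.0),
]
print(rfm(purchases, today=date(2001, 4, 5)))
# prints {'ann': (16, 2, 100.0), 'bob': (125, 1, 15.0)}
```

The same aggregation works for visits instead of purchases, with time per page standing in for the monetary value.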
Abstract Attributes
l Many attributes have too many values
ä There are over 100 colors for jeans
ä There are hundreds of area codes and zip codes
ä There are hundreds of referring sites
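One way to tame such attributes is to map each raw value to its parent in a hierarchy. The Python sketch below uses a made-up color hierarchy. A real one would come from the product catalog.

```python
# Hypothetical hierarchy from specific colors to color families.
COLOR_FAMILIES = {
    "indigo": "blue", "stonewash": "blue", "navy": "blue",
    "charcoal": "black", "jet": "black",
}

def color_family(color: str) -> str:
    """Return the color family, or "other" for colors not in the hierarchy."""
    return COLOR_FAMILIES.get(color.lower(), "other")

print([color_family(c) for c in ["Indigo", "Jet", "Khaki"]])
# prints ['blue', 'black', 'other']
```

The same pattern applies to zip codes (rolled up to regions) and referring sites (rolled up to categories).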
Date/Time Attributes
l There are many date/time attributes
ä First session time
ä Registration time
ä Delivery time
l Most tools are poor at handling date/time
l Abstract attributes can be created
ä Day of week or month
(people get paid on Fridays or on the 1st and 15th)
ä Hour of day
(behavior is different in the morning than at night)
ä Weekend vs. Weekdays
ä Seasons
l Differences between dates are important for showing
trends
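A minimal Python sketch of deriving these abstract date/time attributes with the standard library. The attribute names and the two timestamps are hypothetical.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def date_time_attributes(timestamp: datetime) -> dict:
    """Derive coarse attributes that mining tools can use directly."""
    return {
        "day_of_week": DAY_NAMES[timestamp.weekday()],  # weekday(): Monday is 0
        "day_of_month": timestamp.day,
        "hour_of_day": timestamp.hour,
        "is_weekend": timestamp.weekday() >= 5,
    }

first_session = datetime(2001, 4, 5, 21, 30)
registration = datetime(2001, 4, 13, 9, 0)
print(date_time_attributes(first_session))
# Difference between dates, in whole days, as a derived attribute.
print((registration - first_session).days)  # prints 7
```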
Tracking Visitors
Within one session
l Referring URLs
ä When traffic is due to a specific reason (search, ad, affiliate)
l Special URLs
ä www.kodak.com/go/freestuff
l Query Strings at the end of URLs
ä www.kodak.com?AdName=freestuff
Across sessions
l Host IP + Browser String
ä Proxies limit accuracy (e.g., AOL, WebTV)
http://webusagemining.com/sys-itmpl/webdataminingworkshop/
l Cookies
ä Stored on visitor’s browser on first visit to site
l Registration
ä Require login for every visit
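A short Python sketch of reading a campaign tag from a query string, using the `AdName` parameter from the Kodak example on this slide. The function name is hypothetical, and the `http://` scheme is added so the standard URL parser can split the address.

```python
from urllib.parse import urlsplit, parse_qs

def campaign_from_url(url: str, parameter: str = "AdName"):
    """Return the campaign tag from a landing URL's query string, or None."""
    values = parse_qs(urlsplit(url).query).get(parameter)
    return values[0] if values else None

print(campaign_from_url("http://www.kodak.com?AdName=freestuff"))  # prints "freestuff"
print(campaign_from_url("http://www.kodak.com/go/freestuff"))      # prints "None"
```

Special URLs such as /go/freestuff carry the tag in the path instead, so they need a separate rule.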
Agenda
l Introduction (45 min)
l Architecture and Data Flow (45 min)
ä Collecting the data
ä Building the warehouse
ä Closing the loop
l Break (10 min)
l Mining Web Data (75 min)
ä Transformations
ä Reporting and Visualization
ä OLAP
ä Mining
l Summary (20 min)
Reports
l Traditional representation of data as tables
ä Elements may be changed by user (which columns appear)
ä Format may be changed by user (order of columns, color, etc.)
ä Once report has been generated, user typically cannot change it or
ask questions of it, without regenerating the report
l The most important tool for business users
The most unappreciated tool by companies
ä Many companies provide great analytics but miss basic reporting
ä WebTrends has simple log analysis but very clear and nice reports
l Examples: Actuate, AlphaBlox, Brio, Business
Objects, Crystal Decisions (Seagate), Microsoft Excel
Visualizations
l Tabular data can be hard to interpret
ä Provide simple bar charts and scatter plots
l Business users need to quickly see trends
ä Provide time-series graphs
l Avoid creating state-of-the-art visualizations that
only the creators can understand
Hierarchical Decomposition
Every node shows browser type
on X-axis
Height = number of sessions
Color = average order amount
OLAP
l Relational OLAP (ROLAP)
ä Query data directly from relational structure
ä Typically requires multi-way joins
ä Performance suffers with complexity of questions
ä Verdict: very flexible but doesn’t scale well
ä Examples: Business Objects, Cognos, MicroStrategy
l Multi-dimensional OLAP (MOLAP)
ä Build n-dimensional cubes from source data
ä Data access is n-dimensional lookup
ä Building cubes can be time intensive
ä Verdict: very fast but not very flexible
ä Examples: Hyperion, Microsoft, Oracle Express
Tree Drill-Down
l Front-ends to MDDB (multi-dimensional
databases) provide easy access to data
[Figure provided by Knosys]
OLAP Visualizations
[Figure provided by Knosys]
OLAP Example
Case Study: How do visitor preferences vary by content?
l Why is pages/visit for politics relatively low?
l Theory: politics readers are high frequency and low pages/visit
l Let’s test the theory: drill down on politics, show frequency
OLAP Example
Drilldown on “Politics”
l Answer: time/visit increases dramatically at high frequency
l Politics readers read instead of browse!
l From here, we could continue to drill down or drill back up.
Mining – Induction
l Analysis Type
ä Prediction, or business rules created by a person
l Sample Applications
ä Which product or banner should be displayed?
ä Which person is most likely to respond to an outbound email?
ä How likely is a visitor to return to the Web site?
ä Which customers are the heaviest spenders?
l Objections
ä Dynamic nature of Web data is difficult to model
ä Algorithms are not well understood by business users
l Example Companies
ä Accrue, Angoss, Broadbase, Blue Martini, E.piphany,
Microsoft analytical services, SAS
Mining – Segmentation
l Analysis Type
ä Cluster to discover groups of similar behavior or a similar profile
l Sample Applications
ä Find customer segments
ä Generate small number of different web sites or stores
ä Discover communities of visitors with similar interests
ä Identify substitute or cannibal products
l Objections
ä How well do customers fit in a particular group?
ä Hard to understand high-dimensional segments
l Example Companies
ä Accrue, ATG Scenario Server, Blue Martini
Mining – Associations
l Analysis Type
ä Link analysis for associations or time-based sequences
l Sample Applications
ä Shopping cart analysis
ä Up-sell and cross-sell
ä Path analysis
l Objections
ä Sheer number of rules makes interpretation difficult
ä With no holdout testing, difficult to know whether results will
stand up over time
l Example Companies
ä Accrue, IBM, SGI, Vignette
Association Example
Recommend potential purchases based on basket contents
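A minimal Python sketch of basket-based association rules, restricted to item pairs and scored by support and confidence. The baskets, the threshold, and the function name are made up. Commercial tools use more scalable algorithms.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.3):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets
        for pair in combinations(sorted(set(basket)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules

baskets = [
    {"jeans", "belt"},
    {"jeans", "belt", "socks"},
    {"jeans", "socks"},
    {"belt"},
]
for a, b, support, confidence in sorted(pair_rules(baskets)):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

To recommend for a given basket, pick the consequents of rules whose antecedent is in the basket, ordered by confidence.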
Collaborative Filtering
l Analysis Type
ä Recommend small # of products out of 1,000's
l Benefits
ä No need for a training set; algorithm bootstraps itself
ä Can be used directly against operational data store
ä Learning is incremental and should improve over time
l Objections
ä Time lag to gather data before recommendations are valid
ä Black box perception: Why is a recommendation made?
ä Difficult to produce a confidence interval for a prediction.
ä In practice, few examples lead to sparse data, so the
recommendations are weak
l Example Companies
ä Like Minds, Net Perceptions
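A toy Python sketch of neighborhood-based collaborative filtering. It scores products the target customer has not bought by the similarity of the customers who did buy them. Jaccard similarity over purchase sets is an assumption chosen for brevity, not the method of the companies listed, and the data is made up.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two purchase sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, purchases, top_n=2):
    """Recommend up to top_n products bought by customers similar to target."""
    owned = purchases[target]
    scores = {}
    for other, items in purchases.items():
        if other == target:
            continue
        similarity = jaccard(owned, items)
        for item in items - owned:
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]

purchases = {
    "ann": {"jeans", "belt"},
    "bob": {"jeans", "belt", "socks"},
    "cat": {"jeans", "hat"},
    "dan": {"scarf"},
}
print(recommend("ann", purchases))  # prints ['socks', 'hat']
```

The sparse-data objection shows up directly: a customer such as dan, who shares no purchases with anyone, gets no recommendations.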
Why?
Hint: Acxiom only conflicted with females,
claiming some females are males.
Never in the other direction
Some images used herein were obtained from IMSI's MasterClips/Master Photo(C) Collection,
1895 Francisco Blvd East, San Rafael 94901-5506, USA
[Figure: chart of counts (200 to 1800) by year, 1910 to 1995. Surviving annotation fragments: "contained anomalies", "[Georges and Milley,", "zero (white dots on blue)".]
Summary
Resources (I)
l The Data Webhouse Toolkit: Building the Web-
Enabled Data Warehouse by Ralph Kimball,
Richard Merz. ISBN: 0471376809 (Jan 2000)
l Mastering Data Mining: The Art and Science of
Customer Relationship Management by Michael J. A.
Berry, Gordon Linoff. ISBN: 0471331236
l KDNuggets, Software for Web Mining
http://www.kdnuggets.com/software/web.html
l WEBKDD - Workshops in Web Mining
http://robotics.Stanford.EDU/~ronnyk/WEBKDD2000/index.html
http://robotics.Stanford.EDU/~ronnyk/WEBKDD2001/index.html
Resources (II)
l Web Mining Research: A Survey
http://www.acm.org/sigs/sigkdd/explorations/issue2-1/contents.htm#Kosala
Resources (III)
l An Ideal E-Commerce Architecture for Building Web
Sites Supporting Analysis and Personalization
http://robotics.Stanford.EDU/~ronnyk/ronnyk-bib.html