
First SIAM International Conference on Data Mining
5 April 2001
Tutorial on E-commerce and Clickstream Mining
Jonathan Becher, VP, Product Strategy, Accrue Software, Inc. (jonbecher@yahoo.com)
Ronny Kohavi, Director, Data Mining, Blue Martini Software (ronnyk@CS.Stanford.edu, http://www.Kohavi.com)

Agenda
l Introduction (45 min)
l Architecture and Data Flow (45 min)
ä Collecting the data
ä Break (10 min)
ä Building the warehouse
ä Closing the loop
l Mining Web Data (75 min)
ä Transformations
ä Unofficial Break (10 min)
ä Reporting and OLAP
ä Mining
ä Visualization
l Teasers & Summary (15 min)
Jon Becher and Ronny Kohavi

Introductions - Who are We?


l Ronny Kohavi
l Jon Becher
l Audience
ä How many from Academia vs. Vendor vs. Site?
ä How many analyzed clickstream data?
ä How many analyzed transactional data?
ä How many collect web-based data today?

l Logistics: bathroom is …
l Questions? Special requests?


Web Mining: Site Categories


l Brochureware - simplest sites
ä Mostly static brochure content
ä About <company>
ä Examples: Exxon Mobil, Philip Morris

l Content Providers - dynamic content: Communities, Portals, Aggregators
ä High conversion rates to members (over 50%) for repeat visitors †
ä Low ad revenue per visitor (less than $0.50)
ä Subscription revenues are rare
ä Examples: Yahoo!, CNN, Levi’s, Wall Street Journal

† Stats from E-performance: The path to rational exuberance, McKinsey Quarterly, 2001, No 1

Web Mining: Site Categories II


l Transaction oriented sites
ä Sell items
ä Conversion rates (browsers to shoppers) around 2%
ä Revenue per customer around $150/month
(high average includes travel sites) †
ä Visitor acquisition cost $1-$5 (=$50-$200 / customer)
ä Examples: Amazon, Dell

l Data Mining is most important for transaction sites and content providers

† Stats from E-performance: The path to rational exuberance, McKinsey Quarterly, 2001, No 1

What is (not) Covered


l Both Jon and Ronny have more experience
in B2C (Business to Consumer) clients,
although most principles apply to B2B
l We will not cover information retrieval and
network management.
Rajeev Rastogi and Minos Garofalakis will
cover these in tomorrow’s tutorial
l Disclaimer: we will mention books, products,
and URLs that we found useful. This is not a
comprehensive list
l Vendor slides are attached at the beginning

Value Proposition
l Why mine e-commerce and clickstream data?
ä Improve conversion rate through personalization
ä Optimize marketing campaigns (banners, email, other
media) that bring visitors to your site by measuring
return on investment (ROI).
ä Improve basket size through cross-sells and up-sells
ä Streamline navigation paths through the site
ä Avoid content delivery issues (poorly formatted for
AOL, too rich for low bandwidth users, redundant or
confusing content)
ä Identify customer segments that you can target offline
ä Experiment quickly. The Web is a laboratory.
Understand what works quickly

Web is DM’s Killer Domain


l Successful data mining benefits from:
ä Large amount of data (many records)
ä Rich data with many attributes
(wide records)
ä Clean data collection (avoid GIGO)
ä Actionable domain
(have real-world impact)
ä Measurable return-on-investment
(did the recipe help?)
l Web mining has all the right
ingredients

Definitions
l Hit – any Web server request that generates a log file entry. A
page has many elements (html, gifs), each generating a hit.
l Page – Web server file that is sent to client user agent, usually a
browser. Typically HTML files, but not all HTML files are considered
pages (e.g., a frame set). Can be static or dynamic
l Session – all actions (e.g., requests, resets) made in a single visit,
from entry until logout or time out (e.g., 20 minutes of no activity).
l Visitor – a user or bot/spider/crawler that makes requests at a
site. Can be new, returning, registered, anonymous
l Buyer – visitor that purchases something
l Customer – a visitor that registers (sometimes defined as buyer)
l Conversion – rate at which visitors transition to desired state
(buyers, customers, registered, started checkout)
l Host – remote machine, identified by IP address, used for visit.
l Referrer – page that provides a link to another page. Can be
internal or external

Teaser - Page Definitions


A user visits Yahoo (weather.yahoo.com) to find out what the
weather in Chicago will be next week.

The weather map image for Chicago is dynamically loaded
from another site (www.weathernews.com), when needed.

Clearly there was a page view at Yahoo, but was there also a
page view at Weathernews? How about a hit? A visit?

Teaser - Conversion
l Product conversions are computed as
rate = “Product quantity sold” / “Number of product views”
l How can conversion rates be above 100%?
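
One possible answer, offered as an illustration rather than as the presenters' own resolution: the numerator counts units sold while the denominator counts page views, so a single view followed by a multi-unit order pushes the ratio past 1. The function name and the numbers below are hypothetical.

```python
def product_conversion_rate(quantity_sold: int, product_views: int) -> float:
    """Conversion as defined on the slide: quantity sold / number of product views."""
    if product_views <= 0:
        raise ValueError("product_views must be positive")
    return quantity_sold / product_views


# Hypothetical numbers: the product page was viewed twice,
# but one of those visitors ordered 3 units.
rate = product_conversion_rate(quantity_sold=3, product_views=2)
print(f"{rate:.0%}")  # 150%
```

Products that can be added to the cart without visiting their own product page would produce the same effect.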


Case Study: KDD Cup 2000


l Gazelle.com was a legcare and legwear retailer
l Data available for KDD Cup 2000
l Data enhanced with Acxiom
demographics
l See http://www.ecn.purdue.edu/KDDCUP
for details and access to data


Heavy Purchasers
l Factors correlating with heavy purchasers:
ä Not an AOL user (defined by browser) - browser window
too small for layout (inappropriate site design)
ä Came to site from print-ad or news, not friends & family
- broadcast ads versus viral marketing
ä Very high and very low income
ä Older customers (Acxiom)
ä High home market value, owners of luxury vehicles
(Acxiom)
ä Geographic: Northeast U.S. states
ä Repeat visitors (four or more times)-loyalty, replenishment
ä Visits to areas of site - personalize differently


Referring Traffic
Referring site traffic changed dramatically over time.
Graph of relative percentages of top 5 sites
[Figure: Top Referrers. Series: Fashion Mall, Yahoo, ShopNow, MyCoupons, Winnie-Cooper, and total from top referrers. X-axis: session date; left axis: percent of top referrers (0% to 100%); right axis: total from top referrers (0 to 6000). Annotation: Yahoo searches for THONGS and Companies/Apparel/Lingerie.]

Referrers - Ad Policy
l Referrers - establish ad policy based on
conversion rates, not clickthroughs!
ä Overall conversion rate: 0.8% (relatively low)
ä MyCoupons had an 8.2% conversion rate, but low
spenders
ä FashionMall and ShopNow brought 35,000 visitors;
only 23 purchased (0.07% conversion rate!)
ä What about Winnie-Cooper?
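
The arithmetic behind the FashionMall/ShopNow figure can be checked with a minimal sketch. The function name is an assumption; the two counts are the ones quoted on the slide.

```python
def conversion_rate(purchasers: int, visitors: int) -> float:
    """Fraction of visitors from a referrer who went on to purchase."""
    return purchasers / visitors if visitors else 0.0


# Counts quoted on the slide for FashionMall and ShopNow combined.
rate = conversion_rate(purchasers=23, visitors=35_000)
print(f"{rate:.2%}")  # 0.07%
```

Ranking referrers by this ratio instead of by raw visitor counts is the point of the slide: a referrer can send a lot of traffic and still convert almost none of it.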


Who is Winnie Cooper?


l Winnie-cooper is a 31 year old
guy who wears pantyhose
l He has a pantyhose site
l 7000 visitors came from his site
l Actions:
ä Make him a celebrity and interview him about how hard it
is for a man to buy pantyhose in stores
ä Personalize for XL sizes


Case Study: On-line Newspaper


l Regional newspaper focused on editorial content,
classifieds, “yellow pages”, and syndicated content
from third party providers
l Goals:
ä Increase traffic to increase advertising revenue
(acquisition)
ä Increase percentage of registered users (conversion)
ä Increase pages/visits and visits/visitor (stickiness)
ä Deliver more targeted content to registered users


The War Effect


l When US launched its campaign in Serbia, site put up special section
with links to past stories on Kosovo
ä Dramatic single day shift in mix of visitor domains to EDU and ORG
ä Biggest increase in referrers from education and teaching sites.
l Conclusion: outreach programs to classrooms based on special events

[Chart: mix of visitor domains (COM, ORG, EDU, GOV, MIL), Week Ago vs. Today; y-axis 0 to 90]

Visitor Sources: Biggest Increases

REFERRER               Today      Yesterday  Variance  Pct Variance
infoplease (ORG)       5,013.00   3,580.00   1,433.00  40.03
myschoolonline (ORG)   21,719.00  20,933.00  786.00    3.75
teachervision (ORG)    2,066.00   1,765.00   301.00    17.05
lycos (ORG)            266.00     207.00     59.00     28.50
kidsource (ORG)        266.00     214.00     52.00     24.30
familyeducation (ORG)  616.00     575.00     41.00     7.13
awesomelibrary (ORG)   173.00     136.00     37.00     27.21
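
The Variance and Pct Variance columns appear to follow from the two daily counts; a small sketch, assuming Pct Variance is the change relative to yesterday's value (this reproduces the rows in the table, but the report's exact definition is not stated on the slide):

```python
def variance(today: float, yesterday: float) -> float:
    """Absolute day-over-day change in referred visits."""
    return today - yesterday


def pct_variance(today: float, yesterday: float) -> float:
    """Day-over-day change as a percentage of yesterday's value."""
    return variance(today, yesterday) / yesterday * 100


# infoplease row from the table.
print(variance(5013, 3580), round(pct_variance(5013, 3580), 2))  # 1433 40.03
```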


The Bandwidth Effect


l Users with high effective line speed are more likely to
be return visitors

[Figure: Total Visits by Effective Linespeed. X-axis: effective linespeed (10 to 70); y-axis: total visits. Bars represent one standard deviation from average.]



The Bandwidth Effect II


l Users with low effective linespeed connections are
much more likely to give up on a page before it’s done
l Conclusion: 1) offer two versions of the site, one with less
rich graphics; 2) use HTML instead of PDFs

[Figure: Reset Frequency by Effective Linespeed. X-axis: effective linespeed (10 to 70); y-axis: reset frequency (0 to 0.5).]


The Referrer Effect


l Check on stickiness of the site based on the location
of the referrer reveals visitors from banner ads,
search engines, and portals have shallow visits
l Best results come from affiliates – content partners
that share similar demographics
l Worst: banner advertising – almost no one looks at
any pages beyond the initial redirect
Referrer               0 Pages  1 Page   2 Pages  3-5 Pages  6-10 Pages  11-25 Pages  26+ Pages
doubleclick (ORG)      210,175  6,941    1,422    1,217      1,170       804          479
yahoo (COM)            132,719  12,846   10,159   14,696     15,482      13,139       9,274
familyeducation (ORG)  103,942  116,371  11,357   13,252     16,352      19,749       26,396
myschoolonline (ORG)   97,066   10,225   10,575   9,842      14,228      19,839       22,343
lycos (COM)            38,967   3,628    476      289        265         149          77
google (COM)           11,623   3,265    615      387        257         145          114
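
A rough way to turn the table into a single stickiness number per referrer is the share of visits that fall in the shallowest depth bucket. This is a sketch under that assumption, not the report's own metric; the counts are copied from three rows of the table, shallowest bucket first.

```python
# Visit counts per page-depth bucket, shallowest first, copied from the table.
visits_by_depth = {
    "doubleclick": [210_175, 6_941, 1_422, 1_217, 1_170, 804, 479],
    "yahoo": [132_719, 12_846, 10_159, 14_696, 15_482, 13_139, 9_274],
    "familyeducation": [103_942, 116_371, 11_357, 13_252, 16_352, 19_749, 26_396],
}


def shallow_share(buckets: list[int]) -> float:
    """Fraction of a referrer's visits that land in the shallowest bucket."""
    return buckets[0] / sum(buckets)


# List referrers from least sticky to most sticky.
for referrer in sorted(visits_by_depth, key=lambda r: shallow_share(visits_by_depth[r]), reverse=True):
    print(f"{referrer:16s} {shallow_share(visits_by_depth[referrer]):.1%}")
```

With these rows the banner network comes out far shallower than the affiliate, which matches the slide's conclusion.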


Agenda
l Introduction (45 min)
l Architecture and Data Flow (45 min)
ä Collecting the data
ä Building the warehouse
ä Closing the loop
l Break (10 min)
l Mining Web Data (75 min)
ä Transformations
ä Reporting and OLAP
ä Mining
ä Visualization
l Summary (20 min)


Architecture
[Diagram labels: Visitor, Internet, Web Site (operations), Test Site, Corporate Firewall, Warehouse (analysis), Merchandisers, Marketers]

Web Site Topology


Need for scalability causes complexity of design

[Diagram labels: Visitor, Internet, Routers, Load Balancers, Local Switches, Secured and Dynamic servers, San Francisco site, mirrored sites in London and Tokyo]

Data Collection
l Visitor activity information
ä Web server log files
ä Web server instrumentation (plug-ins)
ä TCP/IP packet sniffing (network collection)
ä Application server instrumentation

l Other sources of data


ä Transactions
ä Marketing programs (banner ads, emails, etc)
ä Demographic (registration, third party overlay)
ä Call center (WISMO)
ä Supply chain (inventory and fulfillment)


Collection: Server Log Files


l Advantages
ä Everyone has got one
ä Useful for specialized data types (e.g. streaming media)

l Disadvantages
ä Multiple file formats (e.g., ELF)
ä Designed for debugging Web servers, not for analysis
ä Multiple log files for multiple Web servers
ä Distributed sites make sessionizing more difficult
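
To make the log-file option concrete, here is a minimal sketch of parsing one entry, assuming the widely used Common Log Format; real servers may be configured with other formats, and the sample line is invented for illustration.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line: str):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = CLF.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec


# Invented sample entry (documentation IP address).
sample = '192.0.2.7 - - [05/Apr/2001:10:15:32 -0700] "GET /products/index.html HTTP/1.0" 200 2326'
record = parse_clf(sample)
print(record["host"], record["url"], record["status"])
```

Each image or stylesheet on a page produces its own entry, which is why hits have to be filtered down to page views before analysis.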


Collection: Server Plug-ins


l Advantages
ä Allows for pre-processing of data before storage
ä Can automate scheduling of data to analysis server

l Disadvantages
ä No more data than is available from the log file

Collection: Packet Sniffing


l Advantages
ä Additional information available
– timing (server response, page download, packet roundtrip)
– browser resets (stop button, move on before load)
ä Any Web server can be supported
ä Data can be captured in real time
ä Multiple Web servers are handled as one
ä Reduces load on Web servers
l Disadvantages
ä Cannot handle encrypted traffic (SSL)
ä Does not capture sub URL information


Collection: Application Servers


More e-commerce sites now employ application
servers, which control logic and allow logging
l Advantages
ä Can provide sub-page information (product shown,
assortment if multiple products, promotion, ads, prices, etc.)
ä No issues sessionizing (app server controls sessions)
ä Can log events at higher levels than URLs
– completing a scenario (registration, checkout)
– form information, such as search keywords
ä Clickstream and purchase transactions share IDs
ä Robust to changes in URLs
l Disadvantages
ä Must work with an application server and design it properly
ä Does not capture network effects


Collection: Other Sources


l Advertising networks
ä Which banner ads on which sites cause the best traffic?
ä e.g., Angara, Doubleclick, Engage, Matchlogic, MediaPlex
l Campaign management products
ä Which marketing campaigns are bringing the most qualified visitors to your
site?
ä e.g., Annuncio, Blue Martini, MarketFirst, Prime Response, Unica, Xchange
l Commerce/transactional engines
ä Which products are most likely to be abandoned on the weekend?
ä e.g., ATG Dynamo Commerce, BEA Weblogic, Broadvision, IBM Websphere
Commerce, OpenMarket Transact
l Overlay data providers
ä How do visitors’ psychographic and demographic information correlate with
their Web site browsing behavior?
ä e.g., Acxiom, Experian, InfoUSA, Nielsen

Advertising Analysis
How effective are banner ads?

Report: Ad by Content Preference
(Compare the effectiveness of ads at driving traffic to different areas of the site)

Ad Name   Content Group   Visitors  Visitor Yield  New Visitors  Cost Per Visitor
Mustang   Financing       1179      44%            5.7%          $1.60
          Auto Ratings    533       20%            4.0%          $3.24
          Safety Info     433       16%            3.3%          $3.99
          Repair History  363       13%            6.7%          $4.76
          Used Cars       191       7%             21.6%         $9.05
          Total Ad        2699      100%           5.7%          $3.26
Corvette  Financing       1009      41%            3.1%          $1.71
          Auto Ratings    502       21%            1.8%          $3.44
          Safety Info     441       18%            3.3%          $3.91
          Repair History  291       12%            4.2%          $5.93
          Used Cars       191       8%             11.3%         $9.03
          Total Ad        2434      100%           3.0%          $3.54

Report: Impressions to Explorations
(Compare the effectiveness of ads at driving traffic from different external sites)

Site Name  Ad Name   Impressions  Click-On Rate  Visitor Yield  Page Depth  Time (secs)
Portal 1   Mustang   341          6.2%           6.0%           1.2         256
           Sebring   346          4.0%           4.0%           1.7         314
           Corvette  921          3.4%           3.3%           2.9         563
           Intrigue  643          3.3%           3.3%           3.5         419
           Camaro    937          1.6%           1.5%           2.1         401
Portal 2   Mustang   98           3.1%           3.1%           6.4         772
           Corvette  106          2.0%           1.8%           4.3         456
Portal 3   Corvette  34           35.0%          22.0%          3.8         421
           Camaro    33           6.3%           4.3%           2.9         398
           Sebring   59           3.5%           2.9%           4.5         489

Agenda
l Introduction (45 min)
l Architecture and Data Flow (45 min)
ä Collecting the data
ä Building the warehouse
ä Closing the loop
l Mining Web Data (75 min)
ä Transformations
ä Unofficial break (10 min)
ä Reporting and Visualization
ä OLAP
ä Mining
l Summary (20 min)

Data Storage
Operational vs. Analytical Storage

l Decision support (data warehouse) has different
needs than a transaction OLTP system

Analytical              Operational
Few large transactions  Many small transactions
Customer centric        Session and product centric
Hard to parallelize     Easy to parallelize (multiple web/app servers)

l Tuning Oracle to perform well in a warehouse is not
like tuning it for an operational system

Building the Data Warehouse


[Diagram: multiple data sources (Internet, Bricks and Mortar, Wireless, Call Center, Demographic) feed the Data Warehouse, which serves multiple tools for analysis/mining (Reporting, OLAP, Data Mining, Visualization)]

Extract Transform Load (ETL)


l Building a data warehouse is a complex
process involving data migration,
consolidation, cleansing, transformations,
and meta-data creation/transfer
l Use ETL tools such as Informatica, Data
Junction, Sane’s NetTracker for weblog data
l Resources:
ä Ralph Kimball’s books
ä http://www.informatica.com
ä http://www.datajunction.com
ä http://www.sane.com


Alternatives to Data Warehouse


l Simple models can be computed efficiently
at the touchpoint (e.g., webstore)
ä Top items (easy to increment counters)
ä Item pair associations (people who bought this book
also liked that book)
ä Incremental models (e.g., Perceptron, Naïve-Bayes)
ä Some lazy learning techniques (e.g., collaborative
filtering) although these usually do not scale well
without backend work
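
The first two bullets can be sketched with plain counters updated as each order arrives at the touchpoint. This is a minimal in-memory illustration; the function names and the sample orders are hypothetical, and a real store would persist the counts.

```python
from collections import Counter
from itertools import combinations

item_counts = Counter()  # top items: one counter per item
pair_counts = Counter()  # item pair associations: one counter per unordered pair


def record_order(items):
    """Incrementally update both models from one completed order."""
    unique = sorted(set(items))
    item_counts.update(unique)
    pair_counts.update(combinations(unique, 2))


def also_bought(item, n=3):
    """Items most often bought together with `item`."""
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            scores[b] = count
        elif b == item:
            scores[a] = count
    return [other for other, _ in scores.most_common(n)]


# Hypothetical orders.
for order in (["bach", "mozart"], ["bach", "mozart", "chopin"], ["bach", "chopin"], ["mozart"]):
    record_order(order)

print(item_counts.most_common(2))
print(also_bought("chopin"))
```

Nothing here needs a warehouse, which is the slide's point; the next slide lists what such touchpoint-only models give up.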


Remember reasons for DW


l Without a data warehouse
ä Only simple models can be implemented
ä Can’t integrate external data easily nor go through data
cleansing
ä Hard to use constructed features (e.g., number of
purchases from category X paid by Amex)
ä Lacks human validation and insight to business
ä Many prediction problems have “leaks” in the data
that may not be discovered in time (e.g., heavy spenders
pay more tax on purchases, so tax predicts purchase
amount)


Closing the Loop


[Diagram: Closing the loop. Labels: Consumers and Businesses; Touchpoints (Web Store, Call Center, Campaigns) with an OLTP store; Other Data Sources; Syndicated data (e.g., Experian/Acxiom); Data Warehouse; Analysis (Reporting, OLAP, Visualization, Data Mining)]

Closing the Loop by Humans


l Humans can close the loop
ä Analysis reveals comprehensible patterns
ä Humans generate hypotheses, test and validate
ä Humans take action and change interactions
– Offer new promotions
– Offer new products (e.g., analyze failed searches)
– Offer new cross-sells
– Change advertising strategy based on segments
– Execute e-mail and direct mail campaigns
ä May result in strategic impact on business decisions


Closing the Loop Automatically


l Automated closing of the loop
ä Optimization of certain processes (e.g., cross-sell offers)
ä Faster cycle (no human involvement required), but
requires tighter software integration of components and
rarely results in interesting strategic insight
ä Can use opaque models (e.g., Neural Networks,
Collaborative Filtering)
ä Legal issues (must not offer cigarettes to minors even
though they correlate with chewing gum)
l Each method of closing the loop has its
advantages/disadvantages

Agenda
l Introduction (45 min)
l Architecture and Data Flow (45 min)
ä Collecting the data
ä Building the warehouse
ä Closing the loop
l Mining Web Data (75 min)
ä Transformations
ä Reporting and Visualization
ä OLAP
ä Mining
l Summary (20 min)


l This slide intentionally (not) left blank


Transformations
l Creating a warehouse is not enough; you need to:
ä Make URLs more understandable (dynamic content, page titles)
ä Handle reverse DNS lookup (208.216.181.15 → www.amazon.com)
ä Sessionize (decide which requests belong to same session if you are
not using an application server). Commonly cookie-based
ä Identify crawlers/robots
ä Identify test users
ä Compute session-level attributes (number of pages, time spent,
session milestones)
ä Create customer attributes (repeat visitor, frequent purchaser,
high spender)
ä Use products and content attributes
ä Compute abstractions of existing attributes (e.g., product
hierarchies, referrers, browsers, regions)
ä Calculate date/time attributes
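
Sessionizing is the least obvious of these steps, so here is a minimal sketch. It assumes each request already carries a visitor key (for example a cookie value) and uses the 20-minute inactivity timeout given in the Definitions slide; the function name and sample clicks are hypothetical.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=20)  # inactivity timeout from the Definitions slide


def sessionize(requests, timeout=TIMEOUT):
    """Group (visitor_id, timestamp, url) requests into sessions.

    A new session starts whenever a visitor has been idle longer than `timeout`.
    """
    sessions = []
    open_session = {}  # visitor_id -> that visitor's most recent session
    for visitor, ts, url in sorted(requests, key=lambda r: (r[0], r[1])):
        current = open_session.get(visitor)
        if current is None or ts - current[-1][1] > timeout:
            current = []
            open_session[visitor] = current
            sessions.append(current)
        current.append((visitor, ts, url))
    return sessions


# Hypothetical clicks: c1 returns after an hour, so that visit is a second session.
t0 = datetime(2001, 4, 5, 9, 0)
clicks = [
    ("c1", t0, "/"),
    ("c1", t0 + timedelta(minutes=5), "/products"),
    ("c1", t0 + timedelta(minutes=60), "/"),
    ("c2", t0 + timedelta(minutes=1), "/search"),
]
sessions = sessionize(clicks)
print([len(s) for s in sessions])  # [2, 1, 1]
```

When no cookie is available, the visitor key would have to fall back to something weaker such as host IP plus browser string, with the accuracy limits noted in the Tracking Visitors slide.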

Dynamic Content
l Must rewrite the URL to increase understanding and facilitate
analysis of served content

http://www.music.com/tape/db=sd-7599/0,1,2,00.html → http://www.music.com/tape/classical/bach

http://www.music.com/generic.asp → http://www.music.com/tape/classical/bach
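
One way to do the rewrite is a lookup from the content key embedded in the URL to a readable path. The sketch below assumes such a table can be exported from the site's content database; `CONTENT_KEYS` and its single entry are hypothetical, built from the example URLs on this slide.

```python
from urllib.parse import urlsplit

# Hypothetical lookup table: content key -> human-readable path.
CONTENT_KEYS = {"sd-7599": "tape/classical/bach"}


def rewrite_url(url: str, content_keys=CONTENT_KEYS) -> str:
    """Replace an opaque dynamic URL with a readable one when the key is known."""
    parts = urlsplit(url)
    for segment in parts.path.split("/"):
        if segment.startswith("db="):
            friendly = content_keys.get(segment[len("db="):])
            if friendly:
                return f"{parts.scheme}://{parts.netloc}/{friendly}"
    return url  # unknown key, or nothing in the URL to go on


print(rewrite_url("http://www.music.com/tape/db=sd-7599/0,1,2,00.html"))
```

A URL such as generic.asp carries no key at all, so it passes through unchanged here; resolving it would need what the application server knew when it served the page.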


Crawler/Robots
l Crawlers are programs that visit your site
ä Search crawlers - Good
ä Shopping bots - Good
ä IE5 offline viewer
ä Performance assessment (e.g., Keynote)
ä E-mail harvesters - Evil
ä Students learning Perl scripts

l For understanding your customers, it is very
important to filter out crawlers
l They may account for 50% of sessions!

Techniques to Identify Robots


l Browser sends a USERAGENT string (e.g., keynote, google).
This requires large tables of USERAGENTs to be set up
l Bots commonly turn off images, have empty referrers
l Friendly bots will visit robots.txt file
l Page hit rate is too fast (although some crawl slowly to avoid
hurting the sites)
l Pattern is a depth-first or breadth-first search of site
l Bots never purchase (helps identify USERAGENT strings)
l Eliminate very long paths and unique path sequences
l Set up a trap (hidden link) and see who follows it
l Resource: http://bots.internet.com/search
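
Several of these heuristics can be combined into a per-session checklist. The sketch below is illustrative only: the `SessionSummary` fields, the token list, and the one-page-per-second threshold are assumptions, and a production filter would tune them against known traffic.

```python
from dataclasses import dataclass

# Illustrative tokens only; the slide notes that real tables are large.
KNOWN_BOT_TOKENS = ("keynote", "google", "crawler", "spider")


@dataclass
class SessionSummary:
    user_agent: str
    pages: int
    images: int
    duration_seconds: float
    has_referrer: bool
    requested_robots_txt: bool
    followed_trap_link: bool


def robot_signals(s: SessionSummary, max_pages_per_second: float = 1.0) -> list[str]:
    """Return the names of the heuristics that this session trips."""
    signals = []
    if any(token in s.user_agent.lower() for token in KNOWN_BOT_TOKENS):
        signals.append("known user agent")
    if s.requested_robots_txt:
        signals.append("requested robots.txt")
    if s.followed_trap_link:
        signals.append("followed hidden link")
    if s.pages > 1 and s.images == 0:
        signals.append("no images loaded")
    if not s.has_referrer:
        signals.append("empty referrer")
    if s.duration_seconds > 0 and s.pages / s.duration_seconds > max_pages_per_second:
        signals.append("page rate too fast")
    return signals


bot = SessionSummary("Googlebot/2.1", 500, 0, 100.0, False, True, False)
human = SessionSummary("Mozilla/4.0 (compatible; MSIE 5.0)", 6, 40, 300.0, True, False, False)
print(robot_signals(bot))
print(robot_signals(human))
```

No single signal is decisive (a bookmarked visit also has an empty referrer), so counting how many fire is safer than acting on any one of them.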


Test Users
l Every respectable site has a QA department
l Their users hit the site with different patterns
ä Their goal is to break the site, not to purchase
ä They’ll change URLs
ä They’ll surf quickly
ä They’ll click on random links

l Purchases by the QA team are recognized
and ignored by fulfillment center
l Must identify them
ä Requests from specific IP addresses
ä Use of special credit card numbers


Session-level Attributes
l Pages
ä Page views per session (deep vs. shallow)
ä Unique pages per session
ä Promotional vs. standard entry
l Time
ä Time spent per session
ä Average time per page
ä Fast vs. slow connection
l Session Milestones
ä Did they go through registration, when?
ä Did they look at the privacy statement?
ä Did they use search?
ä Did they start and/or complete checkout?
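
Most of these attributes fall out of a single pass over a session's page views. A minimal sketch, assuming the session is a time-ordered list of (timestamp, url) pairs; the URL prefixes used to detect milestones are hypothetical and site-specific.

```python
from datetime import datetime


def session_attributes(page_views):
    """Derive session-level attributes from a time-ordered list of (timestamp, url)."""
    times = [ts for ts, _ in page_views]
    urls = [url for _, url in page_views]
    # Time on the last page is unknown, so this is first-to-last request time.
    seconds = (times[-1] - times[0]).total_seconds()
    return {
        "page_views": len(urls),
        "unique_pages": len(set(urls)),
        "time_spent_seconds": seconds,
        "avg_seconds_per_page": seconds / len(urls),
        # Milestones, detected by hypothetical URL prefixes.
        "used_search": any(u.startswith("/search") for u in urls),
        "started_checkout": any(u.startswith("/checkout") for u in urls),
        "completed_checkout": any(u.startswith("/checkout/complete") for u in urls),
    }


views = [
    (datetime(2001, 4, 5, 9, 0, 0), "/"),
    (datetime(2001, 4, 5, 9, 1, 0), "/search?q=bach"),
    (datetime(2001, 4, 5, 9, 3, 0), "/"),
    (datetime(2001, 4, 5, 9, 4, 0), "/checkout/cart"),
]
attrs = session_attributes(views)
print(attrs)
```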


Customer Attributes
l Some attributes based on customer history
ä Initial vs. Repeat visitor/purchaser
ä Recent visitor/purchaser (Recency)
ä Frequent visitor/purchaser (Frequency)
ä Readers vs. browsers (time per page) (Duration)
ä Heavy spender (Monetary)
ä Original referrer

l Other attributes are created as hypotheses
ä Heavy purchaser of children’s products
ä Lunchtime visitor
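
The recency, frequency, and monetary attributes can be computed from order history in one pass. A minimal sketch, assuming orders arrive as (customer_id, order_date, amount) tuples; the customer ids and amounts are invented.

```python
from datetime import date


def rfm(orders, today):
    """Recency (days since last order), frequency (order count), monetary (total spent)."""
    acc = {}
    for customer, order_date, amount in orders:
        r = acc.setdefault(customer, {"last": order_date, "frequency": 0, "monetary": 0.0})
        r["last"] = max(r["last"], order_date)
        r["frequency"] += 1
        r["monetary"] += amount
    return {
        customer: {
            "recency_days": (today - r["last"]).days,
            "frequency": r["frequency"],
            "monetary": r["monetary"],
        }
        for customer, r in acc.items()
    }


# Invented order history.
orders = [
    ("c1", date(2001, 1, 10), 40.0),
    ("c1", date(2001, 3, 26), 60.0),
    ("c2", date(2000, 12, 1), 15.0),
]
profile = rfm(orders, today=date(2001, 4, 5))
print(profile["c1"])
```

Thresholds on these values (for example, what counts as a heavy spender) are business decisions and are left out of the sketch.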


Product and Content Attributes


l Generalization often has to happen at higher levels
than individual content URLs and product ids
l Products
ä Common attributes are color, size, and weight
ä Specific attributes for category (power consumption for
electrical appliances, inseam size for pants)
l Content
ä Common attributes are topic, version, and author
ä Specific attributes for content types (story and event for news
articles, photographer and length for videos)
l Harder problem: assign attributes to pages
showing collections of products (assortments) or
multiple content sets (portals)


Abstract Attributes
l Many attributes have too many values
ä There are over 100 colors for Jeans
ä There are hundreds of area codes and zip codes
ä There are hundreds of referring sites

l Higher-level abstractions must be created
l One common abstraction is to use the hierarchy
ä Organizations naturally organize products in a hierarchy
ä Products: jeans, Men’s Jeans, Levi’s, 505, button fly, …
ä Content: classified, auto classified, SUV auto classified, Pathfinders


Date/Time Attributes
l There are many date/time attributes
ä First session time
ä Registration time
ä Delivery time
l Most tools are poor at handling date/time
l Abstract attributes can be created
ä Day of week or month
(people get paid on Fridays or on the 1st and 15th)
ä Hour of day
(behavior is different in the morning than at night)
ä Weekend vs. Weekdays
ä Seasons
l Differences between dates are important for showing
trends
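
Because many tools handle raw timestamps poorly, the abstract attributes are usually computed up front. A minimal sketch; the season mapping assumes Northern Hemisphere meteorological seasons, which is a choice made here and not stated on the slide.

```python
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
SEASONS = ["winter", "spring", "summer", "autumn"]  # Northern Hemisphere assumption


def datetime_attributes(ts: datetime) -> dict:
    """Abstract a timestamp into attributes a mining tool can use directly."""
    return {
        "day_of_week": DAYS[ts.weekday()],
        "day_of_month": ts.day,
        "hour_of_day": ts.hour,
        "is_weekend": ts.weekday() >= 5,
        "season": SEASONS[(ts.month % 12) // 3],  # Dec-Feb, Mar-May, Jun-Aug, Sep-Nov
    }


def days_between(later: datetime, earlier: datetime) -> int:
    """Difference between two dates, e.g. registration to first purchase."""
    return (later - earlier).days


attrs = datetime_attributes(datetime(2001, 4, 5, 14, 30))
print(attrs)
print(days_between(datetime(2001, 4, 5), datetime(2001, 3, 1)))  # 35
```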

Tracking Visitors
Within one session
l Referring URLs
ä When traffic is due to a specific reason (search, ad, affiliate)
l Special URLs
ä www.kodak.com/go/freestuff
l Query Strings at the end of URLs
ä www.kodak.com?AdName=freestuff

Across sessions
l Host IP + Browser String
ä Proxies limit accuracy (e.g., AOL, WebTV)
http://webusagemining.com/sys-itmpl/webdataminingworkshop/
l Cookies
ä Stored on visitor’s browser on first visit to site
l Registration
ä Require login for every visit
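
For the query-string approach, the campaign name can be read straight off the entry URL. A minimal sketch using the slide's own example; `AdName` is the parameter shown on the slide, and the helper name is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def campaign_from_url(url: str, param: str = "AdName"):
    """Return the campaign tag carried in the URL's query string, if any."""
    if "://" not in url:
        url = "http://" + url  # the slide's examples omit the scheme
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None


print(campaign_from_url("www.kodak.com?AdName=freestuff"))  # freestuff
print(campaign_from_url("www.kodak.com/go/freestuff"))      # None
```

The special-URL variant (/go/freestuff) encodes the same information in the path, so it needs a path lookup instead of query parsing.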

Agenda
l Introduction (45 min)
l Architecture and Data Flow (45 min)
ä Collecting the data
ä Building the warehouse
ä Closing the loop
l Break (10 min)
l Mining Web Data (75 min)
ä Transformations
ä Reporting and Visualization
ä OLAP
ä Mining
l Summary (20 min)


Reports
l Traditional representation of data as tables
ä Elements may be changed by user (which columns appear)
ä Format may be changed by user (order of columns, color, etc.)
ä Once report has been generated, user typically cannot change it or
ask questions of it, without regenerating the report
l The most important tool for business users
The most unappreciated tool by companies
ä Many companies provide great analytics but miss basic reporting
ä WebTrends has simple log analysis but very clear and nice reports
l Examples: Actuate, AlphaBlox, Brio, Business
Objects, Crystal Decisions (Seagate), Microsoft Excel


Visualizations
l Tabular data can be hard to interpret
ä Provide simple bar charts and scatter plots
l Business users need to quickly see trends
ä Provide time-series graphs
l Avoid creating state-of-the-art visualizations that
only the creators can understand


Simple Bar Charts

Example of real data.


Height = session count
Color = duration
(cold to hot)


Simple Bar Chart II

Example of real data.


Height = session count
Color = duration
(cold to hot)

Tuesday and Wednesday are special.
What happened?

Common Web Reports


Heat Map Visualization

Example of real data.


Plot of every hour over
several weeks
Color = session count
(cold to hot)

Tue/Wed are not generally high, but holiday and
promotion made an impact.

Also note white downtime

Hierarchical Decomposition
Every node shows browser type
on X-axis
Height = number of sessions
Color = average order amount


On-Line Analytical Processing


l Transforms raw data to reflect dimensionality
"How much did we spend on health benefits, by
month; in our largest three divisions, in each state,
compared with plan?"
l Very fast flexible operations (e.g., sum, average) on
large amounts of data
l Two primary variations
ä Relational OLAP (ROLAP)
ä Multidimensional OLAP (MOLAP)
ä Hybrid OLAP solutions are emerging
l Resources:
www.olapreport.com
www.olapcouncil.org/whtpap.html


Relational vs. Multi-dimensional


Relational tables have records with fields
Customer Name    Customer #  Amount  Address        Region
Jack's Hardware  10456       103.2   40 Main St.    West
Value Stores     10114       97.2    18 Elm St.     Central
Housewares Inc.  11104       233.22  17 Main St.    East
Walter Lock      11230       57.2    6 Charles St.  West
A two-dimensional matrix with customer name going down and
a dimension (e.g., region) going across with a measure (e.g.,
amount spent) in the intersection is sparsely populated
Customer Dimension  Dimension: West  Central  East
Jack's Hardware     103.2
Value Stores                         97.2
Housewares Inc.                               233.22
Walter Lock         57.2
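
The move from records to a matrix can be sketched in a few lines: the relational rows become cells keyed by (customer, region), and most cells stay empty. The data is taken from the slide; the dictionary-based cube is an illustration, not how any particular OLAP product stores data.

```python
# Relational records from the slide: (name, customer #, amount, address, region).
rows = [
    ("Jack's Hardware", 10456, 103.2, "40 Main St.", "West"),
    ("Value Stores", 10114, 97.2, "18 Elm St.", "Central"),
    ("Housewares Inc.", 11104, 233.22, "17 Main St.", "East"),
    ("Walter Lock", 11230, 57.2, "6 Charles St.", "West"),
]

customers = sorted({name for name, *_ in rows})
regions = sorted({row[4] for row in rows})

# Two-dimensional "cube": the measure (amount) at each (customer, region) intersection.
cube = {}
for name, _, amount, _, region in rows:
    cube[(name, region)] = cube.get((name, region), 0.0) + amount


def cell(name, region):
    """Measure at one intersection, or None when the cell is empty."""
    return cube.get((name, region))


density = len(cube) / (len(customers) * len(regions))
print(cell("Jack's Hardware", "West"), cell("Jack's Hardware", "East"), f"{density:.0%}")
```

Only 4 of the 12 cells are filled, which is the sparsity the slide refers to.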


Relational vs. Multi-Dimensional II


This relational table has more than
one product per region and more than
one region per product. It lends itself
to a multidimensional representation
with products and regions.


OLAP
l Relational OLAP (ROLAP)
ä Query data directly from relational structure
ä Typically requires multi-way joins
ä Performance suffers with complexity of questions
ä Verdict: very flexible but doesn’t scale well
ä Examples: Business Objects, Cognos, MicroStrategy
l Multi-dimensional OLAP (MOLAP)
ä Build n-dimensional cubes from source data
ä Data access is n-dimensional lookup
ä Building cubes can be time intensive
ä Verdict: very fast but not very flexible
ä Examples: Hyperion, Microsoft, Oracle Express

Hybrid OLAP (HOLAP)


[Diagram: User Interface (MD Views, Cross Tabulations, Time Intelligence, Slice & Dice) on top of an Analysis Engine (Filtering, Sorting, Calculation, Consolidation), which sits on both an MDB and an RDBMS]

Tree Drill-Down
l Front-ends to MDDB (multi-dimensional
databases) provide easy access to data

Fig. provided by Knosys

OLAP Visualizations

l Front ends now provide powerful visualizations
that are very fast and easy to manipulate

Fig. provided by Knosys

OLAP Example
Case Study: How do visitor preferences vary by content?

l Why is pages/visit for politics relatively low?
l Theory: politics readers are high frequency and low pages/visit
l Let’s test theory: drill down on politics, show frequency

OLAP Example
Drilldown on “Politics”

l Answer: time/visit increases dramatically at high frequency
l Politics readers read instead of browse!
l From here, we could continue to drill down or drill back up.

Mining – Induction
l Analysis Type
ä Prediction, or business rules created by a person
l Sample Applications
ä Which product or banner should be displayed?
ä Which person is most likely to respond to an outbound email?
ä How likely is a visitor to return to the Web site?
ä Which customers are the heaviest spenders?
l Objections
ä Dynamic nature of Web data is difficult to model
ä Algorithms are not well understood by business users
l Example Companies
ä Accrue, Angoss, Broadbase, Blue Martini, E.piphany,
Microsoft analytical services, SAS


Mining – Segmentation
l Analysis Type
ä Cluster to discover groups of similar behavior or a similar profile
l Sample Applications
ä Find customer segments
ä Generate small number of different web sites or stores
ä Discover communities of visitors with similar interests
ä Identify substitute or cannibal products
l Objections
ä How well do customers fit in a particular group?
ä Hard to understand high-dimensional segments
l Example Companies
ä Accrue, ATG Scenario Server, Blue Martini


Mining – Associations
l Analysis Type
ä Link analysis for associations or time-based sequences
l Sample Applications
ä Shopping cart analysis
ä Up-sell and cross-sell
ä Path analysis
l Objections
ä Sheer number of rules makes interpretation difficult
ä With no holdout testing, it is difficult to know whether results will stand up over time
l Example Companies
ä Accrue, IBM, SGI, Vignette


Association Example
Recommend potential purchases based on basket contents

Driver Item      Recommendation   Confidence   Lift    Support
Arugula          Dill             57.1%         7.76   4.2%
                 Basil            44.4%         5.43   3.1%
Basil            Parsley          70.0%         7.39   7.4%
Colombian        Jamaica          50.0%         6.79   7.8%
Cool Breezer     Grape            75.0%        11.88   3.2%
Dill             Arugula          57.1%         7.76   4.2%
                 Basil            67.4%         5.43   3.8%
Pineapple        Grape            77.8%         7.39   7.4%
Yellow Pepper    Jalapeno         71.4%         4.85   5.3%
                 Granny Smith     57.1%         5.43   4.2%
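
The three measures in the table can be computed directly from basket data. The sketch below shows the usual definitions; the baskets and the `rule_metrics` helper are hypothetical and are not the data or tooling behind the slide. Support is taken here as the fraction of all baskets containing both items, which is an assumption, since some tools define it differently.

```python
def rule_metrics(baskets, driver, recommendation):
    """Return (support, confidence, lift) for the rule driver -> recommendation."""
    n = len(baskets)
    with_driver = sum(1 for b in baskets if driver in b)
    with_rec = sum(1 for b in baskets if recommendation in b)
    with_both = sum(1 for b in baskets if driver in b and recommendation in b)
    if n == 0 or with_driver == 0 or with_rec == 0:
        raise ValueError("both items must appear in at least one basket")
    support = with_both / n                # P(driver and recommendation)
    confidence = with_both / with_driver   # P(recommendation | driver)
    lift = confidence / (with_rec / n)     # confidence relative to the base rate
    return support, confidence, lift


# Hypothetical baskets, each a set of item names.
baskets = [
    {"arugula", "dill", "basil"},
    {"arugula", "dill"},
    {"arugula", "parsley"},
    {"basil", "parsley"},
    {"dill"},
    {"grape", "pineapple"},
    {"grape"},
    {"jalapeno"},
]
support, confidence, lift = rule_metrics(baskets, "arugula", "dill")
print(f"support={support:.1%} confidence={confidence:.1%} lift={lift:.2f}")
```

A lift above 1 means the recommendation appears with the driver item more often than its overall popularity would suggest.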


Mining – Path Analysis


l Analysis Type
ä Explore, understand, or predict visitors' navigation patterns through the Web site
ä Multiple analytic techniques: statistics, sequences, induction, clustering, compression
l Sample Applications
ä Designing a more efficient or user-friendly site
ä Discovering misleading, duplicative, or overlapping content
ä Understanding the effectiveness of referring links
l Objections
ä Most path analysis provides only simple reporting
l Example Companies
ä Nearly everyone


Most Frequent Path Report


Top Paths Through Site by Visits
Start Page: Products

Visits      %   Path from Start
   837  9.28%   1. Products
                   http://www.businesscomputing.com/products/
   111  1.23%   1. Products
                   http://www.businesscomputing.com/products/
                2. 110 Desktop Computer Specs
                   http://www.businesscomputing.com/products/pc110/
    67  0.74%   1. Products
                   http://www.businesscomputing.com/products/
                2. 330 XL Desktop Computer
                   http://www.businesscomputing.com/products/pc330xl/
    60  0.66%   1. Products
                   http://www.businesscomputing.com/products/
                2. Page Has No Title
                   http://www.businesscomputing.com/shoppingcart.htm
    47  0.52%   1. Products
                   http://www.businesscomputing.com/products/
                2. 110 Desktop Computer Specs
                   http://www.businesscomputing.com/products/pc110/
                3. 110 Desktop Computer
                   http://www.businesscomputing.com/products/pc110/intro.htm
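
A report like the one above can be approximated by counting page sequences per visit. The sketch below is a minimal illustration. The `top_paths` helper and the session data are hypothetical. It assumes each visit is already sessionized into an ordered list of pages, that paths are truncated to `max_depth` pages, and that percentages are taken over all visits. The reporting tool behind the slide may count differently.

```python
from collections import Counter


def top_paths(sessions, start_page, max_depth=3, top_n=5):
    """Return (path, visits, percent of all visits) for the most common paths."""
    total_visits = len(sessions)
    # Count each visit's path, truncated to max_depth pages, if it starts at start_page.
    counts = Counter(
        tuple(pages[:max_depth])
        for pages in sessions
        if pages and pages[0] == start_page
    )
    return [
        (path, visits, 100.0 * visits / total_visits)
        for path, visits in counts.most_common(top_n)
    ]


# Hypothetical sessions: one ordered list of page URLs per visit.
sessions = [
    ["/products/"],
    ["/products/"],
    ["/products/", "/products/pc110/"],
    ["/products/", "/products/pc110/", "/products/pc110/intro.htm"],
    ["/products/", "/products/pc110/"],
    ["/home/", "/products/"],
    ["/products/"],
    ["/products/", "/shoppingcart.htm"],
]
report = top_paths(sessions, "/products/")
for path, visits, pct in report:
    print(f"{visits:5d} {pct:6.2f}%  " + " -> ".join(path))
```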

Collaborative Filtering
l Analysis Type
ä Recommend a small number of products out of thousands
l Benefits
ä No need for a training set; algorithm bootstraps itself
ä Can be used directly against operational data store
ä Learning is incremental and should improve over time
l Objections
ä Time lag to gather data before recommendations are valid
ä Black box perception: Why is a recommendation made?
ä Difficult to produce a confidence interval in prediction
ä In practice, few examples lead to sparse data, so the recommendations are weak
l Example Companies
ä Like Minds, Net Perceptions
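
To make the idea concrete, here is a minimal sketch of user-based collaborative filtering with cosine similarity. It is one common formulation and not necessarily what the companies listed above implemented. The `cosine` and `recommend` helpers and the purchase data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse rating dicts (item -> score)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=2):
    """Rank items the user has not bought by similarity-weighted votes of other users."""
    scores = {}
    for other, items in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], items)
        if sim <= 0:
            continue  # users with nothing in common contribute nothing
        for item, score in items.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * score
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical purchase data: 1 means the customer bought the item.
ratings = {
    "ann": {"basil": 1, "parsley": 1, "dill": 1},
    "bob": {"basil": 1, "parsley": 1},
    "cat": {"grape": 1, "pineapple": 1},
    "dan": {"basil": 1, "grape": 1},
}
recommendations = recommend(ratings, "bob")
print(recommendations)
```

The sparse-data objection shows up even in this toy example: a customer with no purchases in common with anyone else (here "cat" relative to "bob") contributes nothing, and a brand-new customer would receive no recommendations at all.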


Teaser - Birth Dates


A bank discovered that almost 5% of their customers were born on the exact same date

How can that be explained?


Teaser - Gender Mystery


l A site has gender on the registration form
l Acxiom, a syndicated data provider, also provides gender
l A very large discrepancy was found between
ä Males according to the registration form and
ä Acxiom-provided data

Why?
Hint: Acxiom conflicted only with females, claiming some females are males, and never in the other direction.

Some images used herein were obtained from IMSI's MasterClips/Master Photo(C) Collection, 1895 Francisco Blvd East, San Rafael 94901-5506, USA

Teaser - Mysterious Birth Years


The KDD CUP 98 data contained anomalies for date of birth
[Georges and Milley, SIGKDD Explorations 2000]

[Figure: bar chart by birth year; x-axis: Year, 1900 to 1995; y-axis ticks from 200 to 2000]

l Spikes on years ending in zero (white dots on blue)
l Few individuals born prior to 1910
l Many more individuals who were born on even years (blue) than on odd years (red)

Why?

Summary

l Significant Return On Investment from analyzing e-commerce data. Killer domain
l Data collection is important. Design the site with analysis in mind
l Build a data warehouse (ETL, construct attributes, deal with bots)
l Analyze (reports, OLAP, visualization, algorithms)
l Close the loop. Experiment and improve.


Resources (I)
l The Data Webhouse Toolkit: Building the Web-Enabled Data Warehouse by Ralph Kimball and Richard Merz. ISBN: 0471376809 (Jan 2000)
l Mastering Data Mining: The Art and Science of Customer Relationship Management by Michael J. A. Berry and Gordon Linoff. ISBN: 0471331236
l KDNuggets, Software for Web Mining
http://www.kdnuggets.com/software/web.html
l WEBKDD - Workshops in Web Mining
http://robotics.Stanford.EDU/~ronnyk/WEBKDD2000/index.html
http://robotics.Stanford.EDU/~ronnyk/WEBKDD2001/index.html


Resources (II)
l Web Mining Research: A Survey
http://www.acm.org/sigs/sigkdd/explorations/issue2-1/contents.htm#Kosala
l Web Data Mining course at DePaul University by Bamshad Mobasher
http://maya.cs.depaul.edu/~classes/cs589/lecture.html
l Integrating E-commerce and Data Mining: Architecture and Challenges, WEBKDD'2000
http://robotics.Stanford.EDU/~ronnyk/ronnyk-bib.html
l Drinking from the Firehose: Converting Raw Web Traffic and E-Commerce Data Streams for Data Mining and Marketing Analysis by Rob Cooley
http://www.webusagemining.com/sys-tmpl/webdataminingworkshop/


Resources (III)
l An Ideal E-Commerce Architecture for Building Web Sites Supporting Analysis and Personalization
http://robotics.Stanford.EDU/~ronnyk/ronnyk-bib.html
l Analyzing Web Site Traffic, Sane Solutions
http://www.sane.com/products/NetTracker/whitepaper.pdf
l Web Mining, Accrue Software
http://www.accrue.com/forms/webmining.html
