Data Mining Slide
Scientific
NASA EOS project: 50 GB per hour
Environmental datasets
Examples of Data Mining Applications
[Figure: a training set of labeled records (Tid, Home Owner, Marital Status, Taxable Income, Default; e.g. 7 Yes Divorced 220K No, 8 No Single 85K Yes, 9 No Married 75K No, 10 No Single 90K Yes) is fed to a "Learn Model" step, and the resulting model is applied as a classifier.]
Example of a Decision Tree
[Figure: a decision tree induced from a table with attributes Tid, Home Owner (categorical), Marital Status (categorical), Taxable Income (continuous), and class label Default; the internal nodes are the splitting attributes.]
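A minimal sketch of inducing such a tree with scikit-learn follows. Rows 7-10 reuse the values visible in the training set above; rows 1-6, the depth limit, and the encoding choice are illustrative assumptions rather than the slide's exact data or algorithm.

    # Minimal sketch: inducing a tree like the slide's with scikit-learn.
    # Rows 7-10 reuse the values shown above; rows 1-6 are made-up fillers.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    records = pd.DataFrame({
        "HomeOwner":     ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
        "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                          "Married", "Divorced", "Single", "Married", "Single"],
        "TaxableIncome": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in $1000s
        "Default":       ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
    })

    # One-hot encode the categorical splitting attributes; income stays numeric.
    X = pd.get_dummies(records[["HomeOwner", "MaritalStatus", "TaxableIncome"]])
    y = records["Default"]

    tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))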
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers
likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
Type of business, where they stay, how much they earn, etc.
Use this information as input attributes to learn a classifier model (a sketch of this step follows below).
From [Berry & Linoff] Data Mining Techniques, 1997
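As a rough illustration of the approach (not the method described by Berry & Linoff), the sketch below trains a classifier on a synthetic earlier launch and mails only the highest-scoring prospects. The feature names, the model choice, and the 20% mailing budget are assumptions.

    # Sketch: score prospects with a classifier trained on a previous, similar
    # launch, then mail only the top-scoring fraction. Feature names, the model
    # choice, and the 20% mailing budget are illustrative assumptions.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    def make_customers(n):
        return pd.DataFrame({
            "income":         rng.normal(60, 20, n).clip(10, None),  # $1000s
            "age":            rng.integers(18, 80, n),
            "prior_contacts": rng.integers(0, 10, n),
        })

    past = make_customers(1000)
    # Synthetic {buy, don't buy} class attribute from the earlier launch.
    past["bought"] = ((past["income"] > 55) & (past["prior_contacts"] > 2)).astype(int)

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(past[["income", "age", "prior_contacts"]], past["bought"])

    # Score the current prospect list and mail only the top 20% by probability.
    prospects = make_customers(200)
    scores = model.predict_proba(prospects)[:, 1]
    target = prospects.assign(score=scores).nlargest(int(0.2 * len(prospects)), "score")
    print(f"mailing {len(target)} of {len(prospects)} prospects")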
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes:
when does a customer buy, what does he buy, how often does he pay on time, etc.
(A small classifier sketch follows below.)
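A compact sketch of that idea on entirely synthetic transactions; the features, the synthetic fraud rule, and the decision threshold are illustrative assumptions.

    # Sketch: learn to flag likely-fraudulent transactions. The features,
    # the synthetic "fraud" rule, and the 0.5 threshold are all assumptions.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n = 5000
    amount   = rng.exponential(80, n)     # transaction amount
    hour     = rng.integers(0, 24, n)     # time of day
    distance = rng.exponential(10, n)     # km from the billing address
    fraud    = ((amount > 250) & (distance > 25)).astype(int)   # rare class
    X = np.column_stack([amount, hour, distance])

    # class_weight="balanced" compensates for how rare the fraud class is.
    clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, fraud)
    flagged = clf.predict_proba(X)[:, 1] > 0.5
    print(f"{int(flagged.sum())} of {n} transactions flagged for review")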
[Figure: points grouped into clusters; intracluster distances are minimized while intercluster distances are maximized.]
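The two quantities in the figure can be made concrete with a tiny computation; the toy points below are an assumption.

    # Tiny illustration of both quantities on two made-up clusters.
    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(0)
    a = rng.normal([0, 0], 0.5, size=(20, 2))   # cluster A
    b = rng.normal([5, 5], 0.5, size=(20, 2))   # cluster B

    intra = (cdist(a, a).mean() + cdist(b, b).mean()) / 2   # within clusters
    inter = cdist(a, b).mean()                              # between clusters
    print(f"mean intracluster distance {intra:.2f}, intercluster {inter:.2f}")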
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be selected
as a market target to be reached with a distinct marketing
mix.
Approach:
Collect different attributes of customers based on their
geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters (a k-means sketch follows below).
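One possible k-means sketch of this segmentation step; the customer attributes and the choice of four segments are assumptions, not the slide's setup.

    # Sketch: k-means segmentation of customers. The attributes and the choice
    # of four segments are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(2)
    customers = np.column_stack([
        rng.normal(45, 12, 500),    # age
        rng.normal(55, 20, 500),    # income, $1000s
        rng.integers(0, 30, 500),   # purchases in the last year
    ])

    X = StandardScaler().fit_transform(customers)   # put attributes on one scale
    segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    # Each segment can now be profiled and targeted with its own marketing mix.
    for s in range(4):
        print(f"segment {s}: {int(np.sum(segments == s))} customers")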
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are
similar to each other based on the important
terms appearing in them.
Approach: Identify frequently occurring terms in each document, form a
similarity measure based on the frequencies of different terms, and use
it to cluster (see the sketch below).
Gain: Information Retrieval can utilize the
clusters to relate a new document or search term
to clustered documents.
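One plausible realization of this approach, using TF-IDF term weights and k-means; the toy corpus and k = 2 are illustrative assumptions.

    # Sketch: cluster documents by the terms they contain. The toy corpus and
    # k = 2 are illustrative assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "stocks fell as markets reacted to interest rate news",
        "the central bank raised interest rates again",
        "the home team won the championship game last night",
        "injured star player returns for the playoff game",
    ]

    # Term-frequency features, with very common English words filtered out.
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)   # e.g. the finance stories in one cluster, sports in the other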
Illustrating Document Clustering
Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in
these documents (after some word filtering).
[Table: number of articles and correctly placed articles per category; e.g. National: 273 articles, 36 correctly placed.]
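The similarity measure itself can be sketched directly as a count of shared words after filtering; the stop-word list below is a tiny illustrative stand-in for real word filtering.

    # Sketch of the similarity measure: count of shared words after filtering.
    STOP = {"the", "a", "of", "in", "and", "to"}

    def shared_words(doc1, doc2):
        words1 = {w for w in doc1.lower().split() if w not in STOP}
        words2 = {w for w in doc2.lower().split() if w not in STOP}
        return len(words1 & words2)

    print(shared_words("the mayor spoke to the city council",
                       "city council backs the mayor on the budget"))   # -> 3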
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
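A brute-force sketch that recovers rules like the two above from these five baskets; the minimum support (2/5) and confidence (0.6) thresholds are illustrative choices, and the slide does not specify which algorithm produced its rules.

    # Brute-force sketch: enumerate itemsets, keep those with support >= 2/5,
    # and print rules with confidence >= 0.6.
    from itertools import combinations

    baskets = [
        {"Bread", "Coke", "Milk"},
        {"Beer", "Bread"},
        {"Beer", "Coke", "Diaper", "Milk"},
        {"Beer", "Bread", "Diaper", "Milk"},
        {"Coke", "Diaper", "Milk"},
    ]

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(set(itemset) <= b for b in baskets) / len(baskets)

    items = sorted(set().union(*baskets))
    frequent = [frozenset(c) for k in (1, 2, 3)
                for c in combinations(items, k) if support(c) >= 0.4]

    for itemset in frequent:
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                confidence = support(itemset) / support(lhs)
                if confidence >= 0.6:
                    print(f"{set(lhs)} --> {set(itemset - lhs)} "
                          f"(confidence {confidence:.2f})")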
Association Rule Discovery: Application 1
[Figure: lossy approximation, original data vs. approximated data.]
Numerosity Reduction:
Reduce the volume of data
Parametric methods
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling (a sketch contrasting the parametric and non-parametric flavors follows below)
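To make the contrast concrete, here is a sketch of both flavors on synthetic data: a parametric fit that stores only two model parameters, and a non-parametric histogram. The data, the linear model, and the bin count are assumptions.

    # Sketch of both flavors on 10,000 synthetic values.
    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0, 10, 10_000)
    y = 3.0 * x + 1.0 + rng.normal(0, 0.5, x.size)        # the raw data

    # Parametric: assume a linear model and keep only its two parameters.
    slope, intercept = np.polyfit(x, y, deg=1)
    print(f"store 2 numbers instead of {y.size}: y ~ {slope:.2f}*x + {intercept:.2f}")

    # Non-parametric: replace the raw values with a 20-bin histogram.
    counts, edges = np.histogram(y, bins=20)
    print(f"store {counts.size} counts + {edges.size} bin edges instead of {y.size}")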
Clustering
Sampling
[Figure: simple random sampling of the raw data without replacement (SRSWOR) and with replacement (SRSWR).]
[Figure: the raw data reduced to a cluster/stratified sample.]
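A short sketch of the three schemes in the figures; the strata labels and the sample size of 100 are illustrative assumptions.

    # Sketch of SRSWOR, SRSWR, and stratified sampling with NumPy.
    import numpy as np

    rng = np.random.default_rng(4)
    raw = np.arange(10_000)                                    # the "raw data"
    strata = rng.choice(["young", "middle", "senior"], size=raw.size,
                        p=[0.5, 0.3, 0.2])

    srswor = rng.choice(raw, size=100, replace=False)          # SRSWOR
    srswr  = rng.choice(raw, size=100, replace=True)           # SRSWR

    # Stratified: draw from each stratum in proportion to its share of the data.
    stratified = np.concatenate([
        rng.choice(raw[strata == s],
                   size=max(1, int(100 * np.mean(strata == s))),
                   replace=False)
        for s in np.unique(strata)
    ])
    print(len(srswor), len(srswr), len(stratified))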