Data Mining
Tutorial
Gregory Piatetsky-Shapiro
KDnuggets
© 2006 KDnuggets
Outline
Introduction
Data Mining Tasks
Classification & Evaluation
Clustering
Application Examples
Trends leading to Data Flood
More data is generated:
Web, text, images …
Business transactions, calls, ...
Scientific data: astronomy, biology, etc.
More data is captured:
Storage technology is faster and cheaper
DBMSs can handle bigger databases
Largest Databases in 2005
Winter Corp. 2005 Commercial Database Survey:
1. Max Planck Inst. for Meteorology, 222 TB
2. Yahoo ~ 100 TB (Largest Data Warehouse)
3. AT&T ~ 94 TB
www.wintercorp.com/VLDB/2005_TopTen_Survey/TopTenWinners_2005.asp
Data Growth
In 2 years (2003 to 2005),
the size of the largest database TRIPLED!
Data Growth Rate
Twice as much information was created in 2002 as in 1999 (~30% annual growth rate)
Other growth rate estimates are even higher
Very little data will ever be looked at by a human
Knowledge Discovery is NEEDED to make sense and use of data.
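As a rough check on the growth figures quoted in these slides, the sketch below converts a total growth factor into a compound annual rate. The helper name annual_growth_rate is an assumption made for illustration; the inputs (doubling over three years, tripling over two) come from the slides.

```python
def annual_growth_rate(growth_factor: float, years: float) -> float:
    """Compound annual growth rate implied by a total growth factor over a span of years."""
    return growth_factor ** (1.0 / years) - 1.0


# "Twice as much information in 2002 as in 1999": factor 2 over 3 years.
doubling_rate = annual_growth_rate(2.0, 3)

# "Largest database tripled from 2003 to 2005": factor 3 over 2 years.
tripling_rate = annual_growth_rate(3.0, 2)

print(f"Information created: {doubling_rate:.1%} per year")  # 26.0% per year
print(f"Largest database:    {tripling_rate:.1%} per year")  # 73.2% per year
```

Taken at face value, doubling in three years works out to about 26% per year, in the neighborhood of the ~30% quoted, while the largest-database figure implies a much steeper rate.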
Knowledge Discovery Definition
Knowledge Discovery in Data is the
non-trivial process of identifying
valid
novel
potentially useful
and ultimately understandable patterns in data.
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press, 1996
Related Fields
[Diagram: Data Mining and Knowledge Discovery at the center, surrounded by Machine Learning, Visualization, Statistics, and Databases]
Statistics, Machine Learning and Data Mining
Statistics:
more theory-based
more focused on testing hypotheses
Machine Learning:
more heuristic
focused on improving performance of a learning agent
also looks at real-time learning and robotics, areas not part of data mining
Data Mining and Knowledge Discovery:
integrates theory and heuristics
focuses on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results
Distinctions are fuzzy
Knowledge Discovery Process flow, according to CRISP-DM
See www.crisp-dm.org for more information
[Diagram: CRISP-DM process flow with an added Monitoring step]
Continuous monitoring and improvement is an addition to CRISP
Historical Note:
Many Names of Data Mining
Data Fishing, Data Dredging (1960-): used by statisticians (as a bad name)
Data Mining (1990-): used in the DB community and business
Knowledge Discovery in Databases (1989-): used by the AI and Machine Learning community
Also: Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
Currently: Data Mining and Knowledge Discovery are used interchangeably
Data Mining Tasks
Some Definitions
Instance (also Item or Record):
an example, described by a number of attributes, e.g. a day can be described by temperature, humidity and cloud status
Attribute or Field:
measures an aspect of the Instance, e.g. temperature
Class (Label):
grouping of instances, e.g. days good for playing
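One way to picture these three terms is as a small data structure. The sketch below is only an illustration: the Day type, its field names, the units, and the three sample days are assumptions, not data from the tutorial.

```python
from dataclasses import dataclass, fields


@dataclass
class Day:
    """One instance (item, record): a single day."""
    temperature: float      # attribute (degrees Fahrenheit is an assumption)
    humidity: float         # attribute (percent is an assumption)
    cloud_status: str       # attribute
    good_for_playing: bool  # class label


# Hypothetical instances, invented for illustration only.
days = [
    Day(temperature=85.0, humidity=85.0, cloud_status="sunny", good_for_playing=False),
    Day(temperature=70.0, humidity=65.0, cloud_status="overcast", good_for_playing=True),
    Day(temperature=68.0, humidity=80.0, cloud_status="rainy", good_for_playing=True),
]

# Attributes are every field except the class label.
attribute_names = [f.name for f in fields(Day) if f.name != "good_for_playing"]

# The classes are the distinct label values seen in the data.
classes = {day.good_for_playing for day in days}

print(attribute_names)
print(classes)
```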
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Deviation Detection: finding changes
Estimation: predicting a continuous value
Link Analysis: finding relationships
…
Classification
Learn a method for predicting the instance class from
pre-labeled (classified) instances
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
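To make "learning from pre-labeled instances" concrete, here is a minimal sketch of a one-level decision tree (a decision stump) over a single nominal attribute. It is the simplest relative of the decision trees listed above, not a method prescribed by the tutorial. The function names and the labeled examples are hypothetical.

```python
from collections import Counter, defaultdict


def train_stump(instances, attribute):
    """Learn 'attribute value -> majority class' from (attributes, label) pairs."""
    counts = defaultdict(Counter)
    for attrs, label in instances:
        counts[attrs[attribute]][label] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    # Fall back to the overall majority class for attribute values never seen in training.
    default = Counter(label for _, label in instances).most_common(1)[0][0]
    return rule, default


def predict(model, attrs, attribute):
    """Predict the class of an unlabeled instance using the learned rule."""
    rule, default = model
    return rule.get(attrs[attribute], default)


# Hypothetical pre-labeled instances (label: is the day good for playing?).
labeled = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "sunny", "humidity": "normal"}, "no"),
    ({"outlook": "overcast", "humidity": "high"}, "yes"),
    ({"outlook": "overcast", "humidity": "normal"}, "yes"),
    ({"outlook": "rainy", "humidity": "normal"}, "yes"),
    ({"outlook": "rainy", "humidity": "high"}, "yes"),
    ({"outlook": "rainy", "humidity": "high"}, "no"),
]

model = train_stump(labeled, "outlook")
prediction = predict(model, {"outlook": "overcast", "humidity": "high"}, "outlook")
print(prediction)  # yes
```

A full decision tree learner would pick the best attribute at each node and recurse; the stump keeps only the first step so the train/predict split stays visible.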
Clustering
Find “natural” grouping of
instances given un-labeled data
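The slide does not name a clustering algorithm. As one common choice, the sketch below uses plain k-means (Lloyd's algorithm) on 2-D points. The sample points, the starting centroids, and the fixed iteration count are assumptions chosen to keep the example small and deterministic.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means (Lloyd's algorithm) on 2-D points with caller-supplied starting centroids."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, clusters


# Hypothetical unlabeled data: two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])

for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

No class labels are used anywhere: the grouping comes only from distances between instances, which is what separates clustering from classification.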
Association Rules &
Frequent Itemsets
Transactions:
TID  Produce
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL
Frequent Itemsets:
Milk, Bread (4)
Bread, Cereal (4)
Milk, Bread, Cereal (2)
…
Rules:
Milk => Bread (66%)
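The counts and the rule confidence can be recomputed directly from the nine transactions. The sketch below uses brute-force enumeration, which is fine for five items but is not an efficient algorithm such as Apriori. The function names are assumptions. Counting from the listed transactions gives 4 for {Milk, Bread}, 4 for {Bread, Cereal}, and 2 for {Milk, Bread, Cereal}, and Milk => Bread holds in 4 of the 6 transactions that contain Milk.

```python
from itertools import combinations

# The nine transactions from the slide.
transactions = [
    {"MILK", "BREAD", "EGGS"},
    {"BREAD", "SUGAR"},
    {"BREAD", "CEREAL"},
    {"MILK", "BREAD", "SUGAR"},
    {"MILK", "CEREAL"},
    {"BREAD", "CEREAL"},
    {"MILK", "CEREAL"},
    {"MILK", "BREAD", "CEREAL", "EGGS"},
    {"MILK", "BREAD", "CEREAL"},
]


def support_count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def frequent_itemsets(min_support, max_size=3):
    """Brute-force search: itemsets (as sorted tuples) with support >= min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            count = support_count(candidate)
            if count >= min_support:
                result[candidate] = count
    return result


def confidence(antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    return support_count(set(antecedent) | set(consequent)) / support_count(antecedent)


frequent = frequent_itemsets(min_support=2)
milk_bread = frequent[("BREAD", "MILK")]
milk_to_bread = confidence({"MILK"}, {"BREAD"})
print(milk_bread, f"{milk_to_bread:.1%}")
```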
Visualization & Data Mining
Visualizing the data to
facilitate human
discovery
Presenting the
discovered results in a
visually "nice" way
Summarization
Describe features of the
selected group
Use natural language
and graphics
Usually in combination with deviation detection or other methods
Example: "Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ..."
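A summary sentence of this kind can be generated from group statistics. The sketch below is a minimal illustration: the function name summarize_change and the per-patient values are hypothetical, so the resulting numbers differ from the 45.7 percent example on the slide, and the "because ..." explanation is left out since it would need a separate analysis.

```python
from statistics import mean


def summarize_change(metric, unit, before, after):
    """Describe how the average of a metric changed between two groups of values."""
    avg_before, avg_after = mean(before), mean(after)
    change = (avg_after - avg_before) / avg_before
    direction = "rose" if change >= 0 else "fell"
    return (
        f"Average {metric} {direction} {abs(change):.1%}, "
        f"from {avg_before:.1f} {unit} to {avg_after:.1f} {unit}."
    )


# Hypothetical lengths of stay (days) for two periods, invented for illustration.
stays_before = [4, 5, 3, 5, 4]  # mean 4.2
stays_after = [6, 7, 5, 6, 7]   # mean 6.2

sentence = summarize_change("length of stay", "days", stays_before, stays_after)
print(sentence)
```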
Data Mining Central Quest
Find true patterns
and avoid overfitting
(finding seemingly significant but really random patterns due to searching too many possibilities)
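The effect of searching too many possibilities can be shown with purely random data. In the sketch below every attribute and every class label is an independent coin flip, so no true pattern exists. The sizes (1000 attributes, 20 training instances, 1000 held-out instances) and the seed are arbitrary assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
N_ATTRIBUTES = 1000


def random_instances(n):
    """Instances whose binary attributes and class labels are independent coin flips."""
    return [
        ([rng.randint(0, 1) for _ in range(N_ATTRIBUTES)], rng.randint(0, 1))
        for _ in range(n)
    ]


def accuracy(instances, attribute_index):
    """Fraction of instances where the rule 'class = attribute value' is correct."""
    hits = sum(1 for attrs, label in instances if attrs[attribute_index] == label)
    return hits / len(instances)


train = random_instances(20)
holdout = random_instances(1000)

# Search all attributes for the one that best "predicts" the class on the training data.
best = max(range(N_ATTRIBUTES), key=lambda i: accuracy(train, i))
train_acc = accuracy(train, best)
holdout_acc = accuracy(holdout, best)

print(f"best attribute on training data: {train_acc:.0%} accurate")
print(f"same attribute on held-out data: {holdout_acc:.0%} accurate")
```

With this many candidates, the best attribute almost always looks strongly predictive on the 20 training instances, yet it falls back to roughly chance level on data it was not selected on. Checking a discovered pattern on held-out data is one standard guard against this.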