Data Mining: Concepts and Techniques: Jiawei Han and Micheline Kamber
July 1, 2015
Chapter 1. Introduction
Evolution of Sciences
Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
The Internet and the computing Grid make all these archives universally
accessible
Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online
Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology
1960s:
1970s:
1980s:
1990s:
2000s:
Alternative names
Data mining: core of the knowledge discovery process
Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, top layer to bottom layer: End User, Business Analyst, Data Analyst, DBA
Data Mining and related disciplines: Machine Learning, Pattern Recognition, Statistics, Algorithm, Visualization, Other Disciplines
High-dimensionality of data
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
Techniques utilized
Applications adapted
General functionality
Object-relational databases
Multimedia database
Text databases
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
Maximizing intra-class similarity and minimizing inter-class similarity
Outlier analysis
Outlier: Data object that does not comply with the general behavior
of the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera -> large SD memory
Periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
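The cluster analysis idea above (group unlabeled objects so that members of a cluster are similar to each other and dissimilar to other clusters) can be illustrated with a minimal k-means sketch. The house coordinates, the choice of two clusters, and the starting centroids are made-up assumptions for illustration only; this is not the implementation used by the authors.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid is the mean of its members; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical house locations (x, y); no class labels are given.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(houses, centroids=[houses[0], houses[3]])
```

Here the two groups of nearby houses end up in separate clusters. An object that sits far from every centroid would be a candidate outlier in the sense described above.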
Classification
#4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So
Stupid After All? Internat. Statist. Rev. 69, 385-398.
Statistical Learning
#8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent
patterns without candidate generation. In SIGMOD '00.
Link Mining
#9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a
large-scale hypertextual Web search engine. In WWW-7, 1998.
#10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a
hyperlinked environment. SODA, 1998.
Clustering
#11. K-Means: MacQueen, J. B., Some methods for
classification and analysis of multivariate observations, in Proc.
5th Berkeley Symp. Mathematical Statistics and Probability,
1967.
#12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996.
BIRCH: an efficient data clustering method for very large
databases. In SIGMOD '96.
Bagging and Boosting
#13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic
generalization of on-line learning and an application
to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.
Sequential Patterns
#16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and
association rule mining. KDD-98.
Rough Sets
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio,
stream, Web
User interaction
KDD Conferences
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Other related conferences
ACM SIGMOD
VLDB
(IEEE) ICDE
WWW, SIGIR
Journals
Statistics
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS,
etc.
Journals: Machine Learning, Artificial Intelligence, Knowledge and Information
Systems, IEEE-PAMI, etc.
Web and IR
Visualization
Recommended Reference Books
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data.
Morgan Kaufmann, 2002
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge
Discovery. Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006
D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer-Verlag, 2001
P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations. Morgan Kaufmann, 2nd ed., 2005
Summary
Supplementary Lecture Slides
Other Applications
Where does the data come from? Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
Resource planning
Competition
Medical insurance
Retail industry
Anti-terrorism
Interestingness measures
Approaches
First generate all the patterns and then filter out the
uninteresting ones
Generate only the interesting patterns: mining query
optimization
A Few Announcements
(Sept. 1)
Task-relevant data
Background knowledge
Schema hierarchy
Set-grouping hierarchy
Operation-derived hierarchy
Rule-based hierarchy
low_profit_margin (X) <= price(X, P1) and cost (X, P2) and
(P1 - P2) < $50
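As a rough illustration of the rule-based hierarchy above, the rule can be read as a predicate over an item's price and cost. The item names and values below are hypothetical, and the function is only a sketch of how such a rule might assign items to a higher-level concept.

```python
def low_profit_margin(price, cost, threshold=50.0):
    """Rule from the slide: low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50."""
    return (price - cost) < threshold

# Hypothetical items: name -> (price P1, cost P2), in dollars.
items = {"widget": (120.0, 90.0), "laptop": (900.0, 700.0)}

# Map each item to a concept in the rule-based hierarchy.
labels = {
    name: "low_profit_margin" if low_profit_margin(p1, p2) else "other"
    for name, (p1, p2) in items.items()
}
```

With these made-up values, the widget (margin $30) falls under low_profit_margin, while the laptop (margin $200) does not.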
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B), classification
reliability or accuracy, certainty factor, rule strength, rule
quality, discriminating weight, etc.
Utility
potential usefulness, e.g., support (association), noise
threshold (description)
Novelty
not previously known, surprising (used to remove
redundant rules, e.g., Illinois vs. Champaign rule
implication support ratio)
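The certainty and utility measures above can be made concrete with a small sketch that counts transactions. It follows the slide's notation, confidence P(A|B) = #(A and B) / #(B), and computes support as the fraction of transactions containing an itemset. The transactions are invented for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """P(A|B) = #(A and B) / #(B), counted over transactions."""
    a, b = set(a), set(b)
    n_b = sum(b <= t for t in transactions)
    n_ab = sum((a | b) <= t for t in transactions)
    return n_ab / n_b if n_b else 0.0

# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"camera", "sd_card"},
    {"camera", "sd_card", "tripod"},
    {"camera"},
    {"tripod"},
]

s = support(transactions, {"camera", "sd_card"})       # 2 of 4 transactions
c = confidence(transactions, {"sd_card"}, {"camera"})  # 2 of the 3 camera transactions
```

A rule would then be kept only if both values clear user-chosen thresholds, which is one simple way to filter out uninteresting patterns.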
Motivation
Design
Loose coupling
Knowledge-Base
Database or Data Warehouse Server
data cleaning, integration, and selection
Database, Data Warehouse, World-Wide Web, Other Info Repositories