Data Mining: Concepts and Techniques: Jiawei Han and Micheline Kamber
Definition:
Data mining is the process of discovering
interesting patterns and knowledge from
large amounts of data.
Extraction of interesting (non-trivial, implicit,
previously unknown, and potentially useful)
patterns or knowledge from huge amounts of
data.
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
Simple search and query processing
Inductive reasoning
(Deductive reasoning) expert systems
Inductive reasoning moves from specific instances to a
generalized conclusion, while deductive reasoning moves
from general principles that are known to be true to a
true and specific conclusion.
[Figure: Knowledge Discovery (KDD) process diagram. Recoverable labels: Databases, Data Integration, Data Cleaning, Task-relevant Data]
April 6, 2019 Data Mining: Concepts and Techniques 11
Knowledge Discovery (KDD) Process
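The stages named for the KDD process (databases, data integration, data cleaning, task-relevant data) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy records, the field names, and the helper functions `integrate`, `clean`, `select`, and `mine` are hypothetical, and the "mining" step is only a per-customer total standing in for a real algorithm.

```python
# Hypothetical toy sources; the record layout and field names are assumptions.
sales_db = [
    {"customer": "c1", "item": "milk", "amount": 3.0},
    {"customer": "c2", "item": "bread", "amount": None},  # missing value
    {"customer": "c1", "item": "milk", "amount": 3.0},    # exact duplicate
]
web_db = [
    {"customer": "c3", "item": "milk", "amount": 2.5},
]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [record for source in sources for record in source]


def clean(records):
    """Data cleaning: drop records with missing values and exact duplicates."""
    seen, cleaned = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if None in record.values() or key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


def select(records, item):
    """Selection: keep only the task-relevant records."""
    return [record for record in records if record["item"] == item]


def mine(records):
    """Stand-in for the mining step: total amount per customer."""
    totals = {}
    for record in records:
        customer = record["customer"]
        totals[customer] = totals.get(customer, 0.0) + record["amount"]
    return totals


task_relevant = select(clean(integrate(sales_db, web_db)), "milk")
patterns = mine(task_relevant)
print(patterns)
```

Each stage is a separate function so that a stage can be replaced (for example, a different cleaning policy) without touching the others.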
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web
[Figure: layered diagram of increasing potential to support business decisions. Recoverable layers: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting)]
[Figure: Data Mining surrounded by contributing disciplines: Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, Other Disciplines]
Outlier analysis
Outlier: Data object that does not comply with the general behavior
of the data
Noise or exception? Useful in fraud detection, rare events analysis
Periodicity analysis
Similarity-based analysis
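As an illustration of the outlier definition given above (a data object that does not comply with the general behavior of the data), here is a minimal sketch using a z-score rule. The z-score rule, the threshold of 2.0, the function name `zscore_outliers`, and the sample amounts are all assumptions chosen for illustration; the slides do not prescribe a particular detection method.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations from the mean (a simple, assumed outlier rule)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing deviates from the rest.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; one is far from the general behavior.
amounts = [20, 22, 19, 21, 23, 20, 500]
flagged = zscore_outliers(amounts)
print(flagged)
```

Whether a flagged value is noise to discard or an exception worth investigating (as in fraud detection) is a decision the rule itself cannot make.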
Mining methodology
Mining various and new kinds of knowledge
Multidimensional space: searching for interesting patterns among
combinations of dimensions (attributes).
Data mining: an interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling uncertainty, noise, or incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User interaction
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad-hoc mining
Presentation and visualization of data mining results
society?
How can we guard against its misuse?
Task-relevant data
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
Type of knowledge to be mined
Characterization, discrimination, association, classification,
prediction, clustering, outlier analysis, other data mining tasks
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered patterns
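The primitives listed above can be pictured as the fields of a single task specification. The following is a minimal sketch only: the class name `MiningTask`, every field name, and the example values (database, table, condition, thresholds) are hypothetical and are not part of any real system or of DMQL itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MiningTask:
    """Hypothetical container for the data mining primitives."""

    # Task-relevant data
    database: str                 # database or data warehouse name
    tables: List[str]             # database tables or data warehouse cubes
    condition: str                # condition for data selection
    attributes: List[str]         # relevant attributes or dimensions
    group_by: List[str]           # data grouping criteria
    # Type of knowledge to be mined, e.g. "association" or "clustering"
    knowledge_type: str
    # Background knowledge supplied by the user (free-form in this sketch)
    background: Dict[str, object] = field(default_factory=dict)
    # Pattern interestingness measurements, expressed as thresholds
    thresholds: Dict[str, float] = field(default_factory=dict)
    # Visualization/presentation of discovered patterns
    presentation: str = "table"


# Example values are invented purely for illustration.
task = MiningTask(
    database="sales_dw",
    tables=["purchases"],
    condition="amount > 100",
    attributes=["item", "customer_age"],
    group_by=["region"],
    knowledge_type="association",
    thresholds={"min_support": 0.05, "min_confidence": 0.6},
)
print(task.knowledge_type, task.thresholds)
```

Grouping the primitives in one object makes it explicit that a mining request is fully described only when all five parts are specified or defaulted.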
Primitive 3: Background Knowledge
Primitive 4: Pattern Interestingness Measurements
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty
e.g., confidence, P(A|B) = #(A and B) / #(B), classification
reliability or accuracy, certainty factor, rule strength, rule quality,
discriminating weight, etc.
Utility
potential usefulness, e.g., support (association), noise threshold
(description)
Novelty
not previously known, surprising (used to remove redundant
rules, e.g., Illinois vs. Champaign rule implication support ratio)
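The certainty and utility measures above can be made concrete with a small worked example. This sketch computes confidence exactly as written in the list, P(A|B) = #(A and B) / #(B), and support as the fraction of transactions containing an itemset. The function names and the four toy transactions are assumptions for illustration.

```python
def support_count(transactions, itemset):
    """#(itemset): number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(transactions, a, b):
    """P(A|B) = #(A and B) / #(B), following the formula in the text."""
    count_b = support_count(transactions, b)
    if count_b == 0:
        raise ValueError("B never occurs, so P(A|B) is undefined")
    return support_count(transactions, a | b) / count_b


def support(transactions, itemset):
    """Fraction of all transactions that contain itemset."""
    return support_count(transactions, itemset) / len(transactions)


# Hypothetical market-basket data, one set of items per transaction.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk"},
]

# bread occurs in 3 transactions; milk and bread together occur in 2.
conf_milk_given_bread = confidence(transactions, {"milk"}, {"bread"})
supp_milk_bread = support(transactions, {"milk", "bread"})
print(conf_milk_given_bread, supp_milk_bread)
```

Here the confidence is 2/3 and the support is 0.5; a rule would be reported only if both values meet user-chosen thresholds.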
Motivation
A DMQL can provide the ability to support ad-hoc and
interactive data mining
By providing a standardized language like SQL
The hope is to achieve an effect similar to the one SQL has had on
relational databases
Foundation for system development and evolution
Facilitate information exchange, technology transfer,
commercialization and wide acceptance
Design
DMQL is designed with the primitives described earlier
[Figure: data mining system architecture. Recoverable components: Pattern Evaluation; Data Mining Engine; Knowledge Base; Database or Data Warehouse Server]