Introduction To Data Mining
Reference book
• “Data Mining: Concepts and Techniques” by
Jiawei Han and Micheline Kamber
Data Mining Outline
– Introduction
– Related Concepts
– Data Mining Techniques
Topics discussed
Goal: Provide an overview of data mining.
Why Data Mining
• Credit ratings/targeted marketing:
– Given a database of 100,000 names, which persons are the least likely to
default on their credit cards?
– Identify likely responders to sales promotions
• Fraud detection
– Which types of transactions are likely to be fraudulent, given the transactional
history of a particular customer?
• Customer relationship management:
– Which of my customers are likely to be the most loyal?
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (implicit, previously unknown, and potentially
useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, information harvesting, business
intelligence, etc.
Data Mining Algorithm
• Objective: Fit Data to a Model
– Descriptive
– Predictive
• Preference – Technique to choose the best
model
• Search – Technique to search the data
– “Query”
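The three components above can be illustrated with a minimal Python sketch. It is not taken from the slides: the two candidate models (a constant and a least-squares line), the sum-of-squared-errors preference criterion, and the sample data are all assumptions chosen for brevity, and the "search" is read loosely as trying each candidate in turn.

```python
from statistics import mean

def fit_constant(xs, ys):
    """Model 1 (hypothetical candidate): always predict the mean of y."""
    c = mean(ys)
    return lambda x: c

def fit_line(xs, ys):
    """Model 2 (hypothetical candidate): least-squares straight line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def squared_error(model, xs, ys):
    """Preference criterion (assumed): smaller total squared error is better."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))

def choose_model(xs, ys, candidates):
    """Search: fit every candidate and keep the one the criterion prefers."""
    fitted = {name: fit(xs, ys) for name, fit in candidates.items()}
    best = min(fitted, key=lambda name: squared_error(fitted[name], xs, ys))
    return best, fitted[best]

# Illustrative data only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
best_name, best_model = choose_model(
    xs, ys, {"constant": fit_constant, "line": fit_line}
)
```

Scoring on the same data used for fitting tends to favor the more flexible model, so a real preference criterion would usually be evaluated on held-out data.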
Database Processing vs. Data Mining Processing
• Database processing
– Query: well defined, expressed in SQL
– Data: operational data
– Output: precise
• Data mining processing
– Query: poorly defined, no precise query language
– Data: not operational data
– Output: fuzzy
Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more
than $10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit
risks. (classification)
– Identify customers with similar buying habits.
(clustering)
– Find all items which are frequently purchased
with milk. (association rules)
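To make the last data mining query concrete, here is a small Python sketch that counts which items co-occur with milk. The baskets, the function name `items_bought_with`, and the `min_support` threshold are hypothetical; real association-rule miners are considerably more elaborate.

```python
from collections import Counter

def items_bought_with(target, baskets, min_support):
    """Count items appearing in the same basket as `target` and keep
    those seen in at least `min_support` such baskets."""
    with_target = [set(basket) for basket in baskets if target in basket]
    counts = Counter(
        item for basket in with_target for item in basket if item != target
    )
    return {item: n for item, n in counts.items() if n >= min_support}

# Illustrative transactions only.
baskets = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["beer", "chips"],
    ["milk", "cereal", "bread"],
    ["bread", "eggs"],
]
frequent = items_bought_with("milk", baskets, min_support=2)
```

Unlike the database query "find all customers who have purchased milk", the answer here depends on a threshold the analyst must choose, which is one reason the output is described as fuzzy.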
Data Mining Models and Tasks
Basic Data Mining Tasks
• Classification maps data into predefined groups or classes
– Supervised learning
– Pattern recognition
– Prediction
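As one possible illustration of mapping data into predefined classes, the sketch below uses a 1-nearest-neighbor rule. This is only one of many classification techniques and is chosen here for brevity; the features (income and debt, in thousands) and the labels are hypothetical.

```python
import math

def nearest_neighbor_classify(train, point):
    """Return the label of the training example closest to `point`.
    `train` is a list of (features, label) pairs with known classes,
    which is what makes this supervised learning."""
    _, label = min(train, key=lambda example: math.dist(example[0], point))
    return label

# Labeled examples: (income, debt) -> predefined class.
train = [
    ((60, 5), "good"),
    ((80, 10), "good"),
    ((20, 30), "poor"),
    ((25, 40), "poor"),
]
prediction = nearest_neighbor_classify(train, (30, 35))
```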
Basic Data Mining Tasks (cont’d)
• Link Analysis uncovers relationships among
data.
– Affinity (similarity) Analysis
– Association Rules
– Sequential Analysis determines sequential
patterns.
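Association rules are usually scored by support and confidence. The sketch below computes both using their standard definitions; the transactions and function names are illustrative assumptions, not material from the slides.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(itemset <= set(t) for t in transactions)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Illustrative transactions only.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter", "jam"},
]
rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
```

For the rule bread => butter in this data, support is 2/4 and confidence is 2/3.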
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD):
process of finding useful information and
patterns in data.
• Data Mining: Use of algorithms to extract the
information and patterns derived by the KDD
process.
KDD Process
(Figure: layers ordered by increasing potential to support business decisions. Recovered labels: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting).)
(Figure: techniques related to data mining. Recovered labels: Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis, Algorithm Design Techniques, Algorithm Analysis, Data Structures, Neural Networks, Decision Tree Algorithms.)
Why Not Traditional Data Analysis?
• Tremendous amount of data
– Algorithms must be highly scalable to handle data volumes such as terabytes
• High-dimensionality of data
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, multimedia, text and Web data
– Software programs, scientific simulations
• New and sophisticated applications
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data
– Multimedia database
– Text databases
– The World-Wide Web
KDD Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
KDD issues (cont’d)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application
Database Perspective on Data Mining
• Scalability
• Real World Data
• Updates
• Ease of Use
Related Concepts Outline
Goal: Examine some areas that are related to data mining.
• Database/OLTP Systems
• Fuzzy Sets and Logic
• Information Retrieval (Web Search Engines)
• Dimensional Modeling
• Data Warehousing
• OLAP/DSS
• Statistics
• Machine Learning
• Pattern Matching
DB & OLTP Systems
• Schema
– (ID,Name,Address,Salary,JobNo)
• Data Model
– ER
– Relational
• Transaction
• Query:
SELECT Name
FROM T
WHERE Salary > 100000
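The query can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the column types and the three sample rows are invented for illustration, and `ORDER BY ID` is added only so the result order is deterministic.

```python
import sqlite3

# In-memory database using the slide's schema; column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T (ID INTEGER PRIMARY KEY, Name TEXT, "
    "Address TEXT, Salary REAL, JobNo INTEGER)"
)
# Hypothetical rows for illustration.
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Smith", "12 Oak St", 120000, 10),
        (2, "Jones", "4 Elm St", 85000, 20),
        (3, "Lee", "9 Pine Ave", 150000, 10),
    ],
)
# The slide's query: a well-defined request with a precise answer.
names = [
    row[0]
    for row in conn.execute(
        "SELECT Name FROM T WHERE Salary > 100000 ORDER BY ID"
    )
]
conn.close()
```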
Fuzzy Sets and Logic
• Fuzzy Set: the set membership function is a real-valued function with
output in the range [0,1].
• f(x): Degree to which x is in F (a membership grade, not strictly a probability).
• 1-f(x): Degree to which x is not in F.
• EX:
– T = {x | x is a person and x is tall}
– Let f(x) be the degree to which x is tall
– Here f is the membership function
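A membership function for "tall" might look like the following Python sketch. The linear ramp and the 160 cm and 190 cm cut-offs are arbitrary assumptions; any function with output in [0,1] would qualify.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Membership f(x) for the fuzzy set of tall people: 0 at or below
    `low`, 1 at or above `high`, and a linear ramp in between."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

f_175 = tall_membership(175)   # membership in the set of tall people
not_tall_175 = 1 - f_175       # membership in the complement, 1 - f(x)
```

A crisp (non-fuzzy) set would instead force every person to be either fully in or fully out of the set.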
Fuzzy Sets
Information Retrieval
• Information Retrieval (IR): retrieving desired information from textual
data.
• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query:
Find all documents about “data mining”.
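A keyword-based query of this kind can be sketched in a few lines of Python. The documents, the tokenizer, and the rule that every query keyword must appear (without requiring the exact phrase) are assumptions made for illustration; production search engines add indexing and ranking.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(query, documents):
    """Return ids of documents containing every keyword in the query."""
    terms = tokenize(query)
    return [
        doc_id for doc_id, text in documents.items()
        if terms <= tokenize(text)
    ]

# Illustrative document collection.
documents = {
    "d1": "Data mining extracts patterns from large data sets.",
    "d2": "Mining for gold in Alaska.",
    "d3": "An introduction to data warehousing.",
}
matches = keyword_search("data mining", documents)
```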
Dimensional Modeling
• View data hierarchically, closer to the way business
executives might
• Useful in decision support systems and mining
• Dimension: collection of logically related attributes; axis for
modeling data.
• Facts: data stored
• Ex: Dimensions – products, locations, date
Facts – quantity, unit price
Relational View of Data
ProdID  LocID       Date    Quantity  UnitPrice
123     Dallas      022900  5         25
123     Houston     020100  10        20
150     Dallas      031500  1         100
150     Dallas      031500  5         95
150     Fort Worth  021000  5         80
150     Chicago     012000  20        75
200     Seattle     030100  5         50
300     Rochester   021500  200       5
500     Bradenton   022000  15        20
500     Chicago     012000  10        25
Dimensional Modeling Queries
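A typical dimensional query aggregates a fact along one dimension. The Python sketch below rolls up the rows from the Relational View of Data slide by product and by location. Treating Quantity × UnitPrice as revenue is an assumption, as is the helper name `roll_up`.

```python
from collections import defaultdict

# Rows copied from the relational view:
# (ProdID, LocID, Date, Quantity, UnitPrice)
sales = [
    (123, "Dallas", "022900", 5, 25),
    (123, "Houston", "020100", 10, 20),
    (150, "Dallas", "031500", 1, 100),
    (150, "Dallas", "031500", 5, 95),
    (150, "Fort Worth", "021000", 5, 80),
    (150, "Chicago", "012000", 20, 75),
    (200, "Seattle", "030100", 5, 50),
    (300, "Rochester", "021500", 200, 5),
    (500, "Bradenton", "022000", 15, 20),
    (500, "Chicago", "012000", 10, 25),
]

def roll_up(rows, key_index):
    """Sum Quantity * UnitPrice for each value of the chosen dimension
    (0 = product, 1 = location, 2 = date)."""
    totals = defaultdict(int)
    for row in rows:
        quantity, unit_price = row[3], row[4]
        totals[row[key_index]] += quantity * unit_price
    return dict(totals)

revenue_by_product = roll_up(sales, 0)
revenue_by_location = roll_up(sales, 1)
```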
Cube view of Data
Aggregation Hierarchies
Data Warehouse
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from the
organization’s operational database
– Supports information processing by providing a solid platform of consolidated,
historical data for analysis.
Data Warehouse—Subject-Oriented
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data
sources
– relational databases, flat files, on-line transaction records
Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly
longer than that of operational systems
– Operational database: current value data
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain “time
element”
Data Warehouse—Nonvolatile
• A physically separate store of data transformed from the
operational environment
• Operational update of data does not occur in the data
warehouse environment
– Does not require transaction processing, recovery, and concurrency
control mechanisms
– Requires only two operations in data accessing: initial loading of data and access of data
Data Warehouse vs. Operational DBMS
Operational vs. Informational
(Comparison table with columns Operational Data and Data Warehouse; rows not recovered.)
Statistics
Pattern Matching (Recognition)
• Pattern Matching: finds occurrences of a
predefined pattern in the data.
• Applications include speech recognition,
information retrieval, time series analysis.
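In its simplest exact form, finding occurrences of a predefined pattern is a sliding-window comparison, sketched below in Python. The function name and sample data are hypothetical, and applications such as speech recognition rely on approximate rather than exact matching.

```python
def find_occurrences(pattern, sequence):
    """Return every start index at which `pattern` occurs in `sequence`.
    Works for strings and for lists (for example, time series values)."""
    m = len(pattern)
    return [
        i for i in range(len(sequence) - m + 1)
        if sequence[i:i + m] == pattern
    ]

# Illustrative time series and text.
readings = [1, 2, 3, 1, 2, 3, 4, 1, 2]
positions = find_occurrences([1, 2, 3], readings)
text_positions = find_occurrences("ana", "bananas")
```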
DM vs. Related Topics
(Comparison table with columns Area, Query, Data, Results, Output; rows not recovered.)
Data Mining Techniques Outline
Some success stories: