Introduction To Data Mining


Reference book
• “Data Mining: Concepts and Techniques” by
Jiawei Han and Micheline Kamber

Data Mining Outline
– Introduction
– Related Concepts
– Data Mining Techniques

Topics discussed
Goal: Provide an overview of data mining.

• Define data mining
• Data mining vs. databases
• Basic data mining tasks
• Data mining development
• Data mining issues

Why Data Mining
• Credit ratings/targeted marketing:
– Given a database of 100,000 names, which persons are the least likely to
default on their credit cards?
– Identify likely responders to sales promotions
• Fraud detection
– Which types of transactions are likely to be fraudulent, given the transactional
history of a particular customer?
• Customer relationship management:
– Which of my customers are likely to be the most loyal?

Data mining helps extract such information.

Introduction
• Data is growing at a phenomenal rate
• Users expect more sophisticated (refined)
information
• How? Uncover hidden information through data mining.

What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, information harvesting, business intelligence, etc.
• Finding hidden information in a database
• Fit data to a model

Data Mining Algorithm
• Objective: Fit Data to a Model
– Descriptive
– Predictive
• Preference – Technique to choose the best
model
• Search – Technique to search the data
– “Query”

Database Processing vs. Data Mining Processing
• Database processing
– Query: well defined, expressed in SQL
– Data: operational data
– Output: precise
• Data mining
– Query: poorly defined, no precise query language
– Data: not operational data
– Output: fuzzy

Query Examples
• Database
– Find all credit applicants with last name of Smith.
– Identify customers who have purchased more than $10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
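The contrast between the two kinds of query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `baskets` data, the customer names, and the helper functions are hypothetical, and Jaccard similarity is just one possible way to compare buying habits.

```python
# Hypothetical purchase data: customer name -> set of items bought.
baskets = {
    "Smith": {"milk", "bread", "eggs"},
    "Jones": {"milk", "bread", "butter"},
    "Lee": {"wine", "cheese"},
}

def database_query(baskets, item):
    """Precise query: every customer who bought `item`."""
    return sorted(name for name, items in baskets.items() if item in items)

def jaccard(a, b):
    """Share of items two non-empty baskets have in common (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

def most_similar(baskets, name):
    """Mining-style query: the other customer whose basket is most similar."""
    others = (n for n in baskets if n != name)
    return max(others, key=lambda n: jaccard(baskets[name], baskets[n]))

milk_buyers = database_query(baskets, "milk")
closest_to_smith = most_similar(baskets, "Smith")
```

The first query has one exact answer. The second depends on the chosen similarity measure, which is why its output is described as fuzzy.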

Data Mining Models and Tasks

Basic Data Mining Tasks
• Classification maps data into predefined groups or classes.
– Supervised learning
– Pattern recognition
– Prediction
• Regression uses a model to predict a continuous value for a given input.
• Clustering groups similar data together into clusters.
– Unsupervised learning
– Segmentation
– Partitioning
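As a rough illustration of clustering, the sketch below runs a one-dimensional k-means (Lloyd's algorithm) on made-up spending figures. The data, the starting centers, and the function name `kmeans_1d` are assumptions chosen for the example, not part of the slides.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm in one dimension with caller-supplied starting centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to the index of its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer.
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(spend, centers=[0.0, 50.0])
```

No class labels are supplied. The two groups emerge from the data alone, which is what makes this unsupervised.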

Basic Data Mining Tasks (cont’d)
• Link Analysis uncovers relationships among data.
– Affinity (similarity) Analysis
– Association Rules
– Sequential Analysis determines sequential patterns.
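Association rules are usually scored by support and confidence. The following is a minimal sketch of those two measures on a hypothetical list of transactions. The data and function names are illustrative only.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions containing `antecedent` that also contain `consequent`."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"milk", "bread"}, transactions)          # rule: milk -> bread
rule_confidence = confidence({"milk"}, {"bread"}, transactions)
```

Here the rule "milk implies bread" holds in two of the three transactions that contain milk.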

Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD):
process of finding useful information and
patterns in data.
• Data Mining: Use of algorithms to extract the
information and patterns derived by the KDD
process.

KDD Process

• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in a meaningful manner.
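The five steps can be pictured as a chain of functions. The sketch below is a toy pipeline under stated assumptions: the record layout, the two sources, and every function body are hypothetical, and the "mining" step is deliberately trivial (a per-customer total).

```python
def select(sources):
    """Selection: gather records from several sources into one list."""
    return [record for source in sources for record in source]

def preprocess(records):
    """Preprocessing: drop records with a missing amount."""
    return [r for r in records if r.get("amount") is not None]

def transform(records):
    """Transformation: convert every amount to a float."""
    return [{**r, "amount": float(r["amount"])} for r in records]

def mine(records):
    """Data mining: here, simply total spending per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

def interpret(totals):
    """Interpretation/evaluation: rank customers for presentation."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Two hypothetical sources with inconsistent formats and a missing value.
store = [{"customer": "A", "amount": "20"}, {"customer": "B", "amount": None}]
web = [{"customer": "A", "amount": 5}, {"customer": "B", "amount": "7.5"}]
report = interpret(mine(transform(preprocess(select([store, web])))))
```

Data mining proper is only the fourth call in the chain. The surrounding steps belong to the wider KDD process.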

Data Mining and Business Intelligence
Layers, from highest to lowest potential to support business decisions (typical role in parentheses):
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Data Mining Development: Multiple Disciplines
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines

• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques

• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis

• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures

• Neural Networks
• Decision Tree Algorithms

Why Not Traditional Data Analysis?
• Tremendous amount of data
– Algorithms must be highly scalable to handle terabytes of data
• High-dimensionality of data
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, multimedia, text and Web data
– Software programs, scientific simulations
• New and sophisticated applications

Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data
– Multimedia database
– Text databases
– The World-Wide Web

KDD Issues
• Human Interaction
• Overfitting
• Outliers
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality

KDD issues (cont’d)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application

Database Perspective on Data Mining

• Scalability
• Real World Data
• Updates
• Ease of Use

Related Concepts Outline
Goal: Examine some areas that are related to data mining.
• Database/OLTP Systems
• Fuzzy Sets and Logic
• Information Retrieval (Web Search Engines)
• Dimensional Modeling
• Data Warehousing
• OLAP/DSS
• Statistics
• Machine Learning
• Pattern Matching

DB & OLTP Systems
• Schema
– (ID,Name,Address,Salary,JobNo)
• Data Model
– ER
– Relational
• Transaction
• Query:
SELECT Name
FROM T
WHERE Salary > 100000

DM: Only imprecise queries
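To show how precise such a query is, the sketch below runs it against an in-memory SQLite table using the schema above. The three rows, the column types, and the added ORDER BY clause (used only to make the result order predictable) are assumptions for the example.

```python
import sqlite3

# In-memory database with the schema from the slide; column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (ID INTEGER, Name TEXT, Address TEXT, Salary REAL, JobNo INTEGER)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ada", "1 Main St", 120000, 10),  # hypothetical rows
        (2, "Bob", "2 Oak Ave", 90000, 11),
        (3, "Cy", "3 Elm Rd", 150000, 10),
    ],
)
# Same query as above, plus ORDER BY so the output order is deterministic.
rows = conn.execute("SELECT Name FROM T WHERE Salary > 100000 ORDER BY Name").fetchall()
names = [name for (name,) in rows]
conn.close()
```

Every row either satisfies the condition or does not, so the answer is exact and repeatable.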

Fuzzy Sets and Logic
• Fuzzy Set: the set membership function is a real-valued function with output in the range [0,1].
• f(x): degree to which x is a member of F (loosely, the probability that x is in F).
• 1-f(x): degree to which x is not a member of F.
• EX:
– T = {x | x is a person and x is tall}
– Let f(x) be the degree to which x is tall
– Here f is the membership function

DM: Prediction and classification are fuzzy.
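A membership function for "tall" might look like the sketch below. The linear ramp and the 160 cm and 190 cm thresholds are arbitrary assumptions chosen for illustration.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Degree of membership in the fuzzy set 'tall', always within [0, 1]."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    # Linear ramp between the two (assumed) thresholds.
    return (height_cm - low) / (high - low)

def not_tall_membership(height_cm):
    """Complement: 1 - f(x)."""
    return 1.0 - tall_membership(height_cm)
```

A person of 175 cm would have membership 0.5 in both "tall" and "not tall", something a crisp set cannot express.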

Fuzzy Sets

Information Retrieval
• Information Retrieval (IR): retrieving desired information from textual
data.

• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query:
Find all documents about “data mining”.

DM: Similarity measures; mine text/Web data.
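One common similarity measure for keyword queries is cosine similarity over term counts. The sketch below ranks three hypothetical documents against the sample query. The documents, the whitespace tokenizer, and the function names are assumptions made for the example.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts using a naive whitespace tokenizer."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical document collection.
documents = {
    "d1": "data mining finds patterns in data",
    "d2": "web search engines index documents",
    "d3": "mining gold is hard work",
}
query = vectorize("data mining")
ranked = sorted(documents, key=lambda d: cosine(query, vectorize(documents[d])), reverse=True)
```

Document d3 shares only the word "mining" with the query, so it ranks below d1 but above d2. Keyword matching alone cannot tell that d3 is off topic.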

Dimensional Modeling
• View data hierarchically, much as business executives might
• Useful in decision support systems and mining
• Dimension: collection of logically related attributes; axis for
modeling data.
• Facts: data stored
• Ex: Dimensions – products, locations, date
Facts – quantity, unit price

DM: May view data as dimensional.

Relational View of Data
ProdID  LocID       Date    Quantity  UnitPrice
123     Dallas      022900  5         25
123     Houston     020100  10        20
150     Dallas      031500  1         100
150     Dallas      031500  5         95
150     Fort Worth  021000  5         80
150     Chicago     012000  20        75
200     Seattle     030100  5         50
300     Rochester   021500  200       5
500     Bradenton   022000  15        20
500     Chicago     012000  10        25

Dimensional Modeling Queries

• Roll Up: more general dimension
• Drill Down: more specific dimension
• Dimension (Aggregation) Hierarchy
• SQL uses aggregation
• Decision Support Systems (DSS): Computer
systems and tools to assist managers in
making decisions and solving problems.
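Roll up and drill down amount to aggregating the same facts at different levels of a dimension hierarchy. The sketch below uses the rows from the relational view of data. It assumes the Date column is in MMDDYY form, and the helper `total_quantity` is a hypothetical name.

```python
from collections import defaultdict

# (ProdID, LocID, Date, Quantity, UnitPrice); Date is assumed to be MMDDYY.
rows = [
    (123, "Dallas", "022900", 5, 25),
    (123, "Houston", "020100", 10, 20),
    (150, "Dallas", "031500", 1, 100),
    (150, "Dallas", "031500", 5, 95),
    (150, "Fort Worth", "021000", 5, 80),
    (150, "Chicago", "012000", 20, 75),
    (200, "Seattle", "030100", 5, 50),
    (300, "Rochester", "021500", 200, 5),
    (500, "Bradenton", "022000", 15, 20),
    (500, "Chicago", "012000", 10, 25),
]

def total_quantity(rows, key):
    """Sum the Quantity fact for each group produced by `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Drill down: quantity per city and month (more specific).
by_city_and_month = total_quantity(rows, key=lambda r: (r[1], r[2][:2]))
# Roll up: quantity per month across all cities (more general).
by_month = total_quantity(rows, key=lambda r: r[2][:2])
```

In SQL the same effect would come from GROUP BY with fewer or more columns.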

Cube view of Data

Aggregation Hierarchies

Data Warehouse
• Defined in many different ways, but not rigorously.
– A decision support database that is maintained separately from the
organization’s operational database
– Supports information processing by providing a solid platform of consolidated, historical data for analysis.
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• Data warehousing:
– The process of constructing and using data warehouses

Data Warehouse—Subject-Oriented

• Organized around major subjects, such as customer, product, sales
• Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous data
sources
– relational databases, flat files, on-line transaction records

• Data cleaning and data integration techniques are applied.
– Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
– When data is moved to the warehouse, it is converted.

Data Warehouse—Time Variant
• The time horizon for the data warehouse is significantly
longer than that of operational systems
– Operational database: current value data
– Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain “time
element”

Data Warehouse—Nonvolatile
• A physically separate store of data transformed from the
operational environment
• Operational update of data does not occur in the data
warehouse environment
– Does not require transaction processing, recovery, and concurrency
control mechanisms
– Requires only two operations in data accessing: initial loading of data and access of data

Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration,
accounting, etc.
• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical, consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
– Access patterns: update vs. read-only but complex queries

Operational vs. Informational
Feature       Operational Data    Data Warehouse
Application   OLTP                OLAP
Use           Precise Queries     Ad Hoc
Modification  Dynamic             Static
Orientation   Application         Business
Data          Operational Values  Integrated
Size          Gigabits            Terabits
Level         Detailed            Summarized
Access        Often               Less Often
Response      Few Seconds         Minutes

Statistics

• Simple descriptive models
• Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
• Data mining is targeted at business users

DM: Many data mining methods come from statistical techniques.

Machine Learning
• Machine Learning: area of AI that examines how to write programs that
can learn.
• Often used in classification and prediction
• Supervised Learning: learns by example.
• Unsupervised Learning: learns without knowledge of correct answers.
• Machine learning often deals with small static datasets.

DM: Uses many machine learning techniques.
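"Learns by example" can be made concrete with a one-nearest-neighbor classifier, one of the simplest supervised methods. The labelled points and the function name below are hypothetical. The two features might stand for any numeric attributes of a credit applicant.

```python
def nearest_neighbor(train, point):
    """Predict the label of the closest labelled example (1-NN, squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(train, key=lambda example: squared_distance(example[0], point))
    return label

# Hypothetical training examples: (features, known class).
train = [
    ((1.0, 1.0), "good"),
    ((1.2, 0.8), "good"),
    ((4.0, 4.2), "poor"),
    ((3.8, 4.0), "poor"),
]
prediction_a = nearest_neighbor(train, (1.1, 0.9))
prediction_b = nearest_neighbor(train, (4.1, 3.9))
```

The correct answers in `train` are what make this supervised. Remove the labels and only clustering remains possible.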

Pattern Matching (Recognition)
• Pattern Matching: finds occurrences of a
predefined pattern in the data.
• Applications include speech recognition,
information retrieval, time series analysis.

DM: Type of classification.
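For time series, one simple way to look for a predefined pattern is to encode the series as symbols and search the resulting string. The sketch below does this with a regular expression. The price series, the U/D/S encoding, and the "up, up, down" pattern are assumptions made for illustration.

```python
import re

def encode_moves(series):
    """Encode each step of a numeric series as U (up), D (down) or S (same)."""
    return "".join(
        "U" if b > a else "D" if b < a else "S"
        for a, b in zip(series, series[1:])
    )

# Hypothetical price series.
prices = [10, 11, 12, 11, 12, 13, 12, 12]
moves = encode_moves(prices)
# Start index of every non-overlapping "up, up, down" occurrence.
positions = [m.start() for m in re.finditer("UUD", moves)]
```

The pattern is fixed in advance. Discovering which patterns are worth looking for is the data mining part.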

DM vs. Related Topics
Area     Query     Data              Results  Output
DB/OLTP  Precise   Database          Precise  DB Objects or Aggregation
IR       Precise   Documents         Vague    Documents
OLAP     Analysis  Multidimensional  Precise  DB Objects or Aggregation
DM       Vague     Preprocessed      Vague    KDD Objects

Data Mining Techniques Outline

Goal: Provide an overview of basic data mining techniques
• Statistical
– Point Estimation
– Bayes Theorem
– Hypothesis Testing
– Regression and Correlation
• Similarity Measures
• Decision Trees
• Neural Networks
– Activation Functions
• Genetic Algorithms

Data mining applications
Some success stories:
• Text Mining
• Video Mining (Multimedia Mining)
• Privacy Preserving in Association Rule Mining
• Intrusion Detection
– Database Intrusion Detection
– Network Intrusion Detection
