Chapter 6 Data Mining

This document provides an introduction to data mining. It defines data mining as the process of discovering patterns and relationships in large datasets. The document outlines several data mining techniques including prediction, associations, and clustering. Prediction techniques include classification and regression. Association rule learning is used to discover relationships between variables. Clustering assigns objects to groups based on similarities. Examples of data mining applications in various industries are also provided.

Uploaded by

Jiawei Tan

Chapter 6

INTRODUCTION TO DATA MINING

Learning objectives:

 After this lesson, you will be able to:
 Explain what data mining is
 Describe the various techniques used in the data mining process
 Understand the KDD process model
 Describe the various phases of CRISP-DM
 Identify applications of data mining
Definition of Data mining
 Data mining is the process of discovering interesting knowledge, such as previously unknown patterns, associations or significant structures, from large amounts of data stored in databases, data warehouses or other information repositories.
 Another definition: data mining is an iterative process of creating predictive and descriptive models by uncovering previously unknown trends and patterns in vast amounts of data in order to support decision making.
 Data mining is a subset of Business Analytics.
 There is a need to turn data into useful information and knowledge for broad
applications including
 Market analysis
 Business management
 Decision support
 Customer segmentation and behavior
 Etc.
How does data mining work?
 Data mining builds models to discover patterns among the attributes present in the data set.
 Models are:
 Mathematical representations (from simple linear relationships to highly non-linear relationships) that identify patterns among the attributes of things, such as customers and products
 Some of these patterns are explanatory and others are predictive (foretelling future values of certain attributes)
Why Mine Data? Commercial Viewpoint
 Lots of data is being collected and warehoused
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/credit card transactions
 Computers have become cheaper and more powerful
 Competitive pressure is strong
 Provide better, customized services for an edge (e.g. in Customer Relationship Management)
What is (not) Data Mining?

What is not Data Mining?
– Looking up a phone number in a phone directory
– Querying a Web search engine for information about “Amazon”
What is Data Mining?
– Discovering that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Examples of data mining applications
 Temporal data: banking data can be mined for changing trends, which may aid in scheduling bank tellers according to the volume of customer traffic.
 Stock exchange data can be mined to uncover trends that could help in planning investment strategies.
 Computer network data streams can be mined to detect intrusions based on anomalies in message flows, which may be discovered by clustering, by dynamic construction of stream models or by comparing the current frequent patterns with those at a previous time.
 Spatial data: look for patterns that describe changes in metropolitan poverty rates based on city distances from major highways. Examining the relationships among a set of spatial objects can reveal which subsets of objects are spatially autocorrelated or associated.
Industry examples of DM applications
 Sales/ Marketing
 Identify buying patterns from customers
 Find the association among customer demographic characteristics
 Banking
 Credit card fraud detection
 Identify ‘loyal’ customers
 Insurance and Health Care
 Claims analysis, i.e. which medical procedures are claimed together
 Predict the customers who will buy new policies
 Transportation
 Determine the distribution schedules for the outlets
 Analyze loading patterns
 Medicine
 Characterize patient behavior in order to predict office visits
 Identify successful medical therapies for different diseases / illnesses
Take a break….
Watch a video

 Source of data mining

 https://www.youtube.com/watch?v=Y_JlkzzhAgw
Data Mining Tasks, Methods and Algorithms
Prediction
 Prediction refers to the act of telling about the future, taking into account experience, opinions and other relevant information.
 Depending on the nature of what is being predicted, prediction can be named more specifically as:
 Classification: the predicted thing, such as tomorrow’s forecast, is a class label such as “rainy” or “sunny”
 Regression: the predicted thing, such as tomorrow’s temperature, is a real number such as 65 °F
 Time-series forecasting: the data consist of values of the same variable captured and stored over time at regular intervals, such as a stock price
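To make the regression case concrete, here is a minimal sketch that fits a straight line by ordinary least squares and uses it to predict a real number. The temperature readings and the function names `fit_line` and `predict` are hypothetical and chosen only for illustration; they do not come from the slides.

```python
# Hypothetical data: yesterday's temperature (x) and the next day's
# temperature (y), both in degrees Fahrenheit.
YESTERDAY = [60, 62, 64, 66]
NEXT_DAY = [61, 63, 65, 67]

def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Regression output is a real number, not a class label."""
    slope, intercept = model
    return slope * x + intercept

model = fit_line(YESTERDAY, NEXT_DAY)
print(predict(model, 64))
```

A classifier, by contrast, would return a label such as “rainy” or “sunny” instead of a number.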
Prediction techniques
 Classification: assign a new data record to one of several predefined categories or classes. It is a form of supervised learning.
 Classification approaches normally use a training set in which all objects are already associated with known class labels.
 The classification algorithm learns from the training set and builds a model. The model is then used to classify new objects.
 This method has been used in customer segmentation, business modeling and credit analysis.
 For example, after starting a credit policy, the OurVideoStore managers could analyze the customers’ behaviour with respect to their credit and label the customers who received credit with one of three possible labels: “safe”, “risky” and “very risky”. The classification analysis would generate a model that could be used to accept or reject credit requests in the future.
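The train-then-classify workflow can be sketched with a nearest-centroid classifier, one simple technique among many and not necessarily the one a real credit system would use. The two features (number of late payments, share of the credit limit in use) and all records below are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical training set: (late_payments, credit_used_ratio) -> known label.
TRAINING_SET = [
    ((0, 0.10), "safe"),
    ((1, 0.20), "safe"),
    ((3, 0.50), "risky"),
    ((4, 0.60), "risky"),
    ((8, 0.90), "very risky"),
    ((9, 0.95), "very risky"),
]

def train(training_set):
    """Build the model: one centroid (mean feature vector) per class label."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (late, ratio), label in training_set:
        sums[label][0] += late
        sums[label][1] += ratio
        counts[label] += 1
    return {label: (s[0] / counts[label], s[1] / counts[label])
            for label, s in sums.items()}

def classify(model, record):
    """Assign a new record to the class whose centroid is closest."""
    return min(model, key=lambda label: math.dist(model[label], record))

model = train(TRAINING_SET)
print(classify(model, (1, 0.15)))
print(classify(model, (7, 0.80)))
```

The features are left unscaled to keep the sketch short; in practice they would usually be normalized so that no single attribute dominates the distance.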
Associations
 Association rule learning (or simply associations) is a popular and well-researched data mining technique for discovering interesting relationships among variables in large databases.
 Bar-code scanners capture purchase data, which makes it possible to use association rules to discover regularities among products.
 Types of associations:
 Link analysis: the linkage among many objects of interest is discovered automatically, such as the links between web pages and the referential relationships among groups of academic publication authors
Associations techniques
 Market-basket analysis: detect sets of attributes/items that frequently occur together or are correlated, e.g. 90% of the people who buy cookies also buy milk (60% of all grocery shoppers buy both)
 In data mining, association rules are useful for analyzing and predicting customer behavior. They play an important part in shopping basket data analysis, product clustering, catalog design and store layout.
 Sequence mining (categorical): discover sequences of events that commonly occur together, e.g. in a set of DNA sequences, ACGTC is followed by GTCA after a gap of 9, with 30% probability
 One thing comes after another, for example: when a flu outbreak happens, gloves will be in short supply
Association rules
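The two percentages in the cookies-and-milk example correspond to the usual measures of an association rule: support (the share of all baskets that contain both items) and confidence (the share of baskets with cookies that also contain milk). The sketch below computes both on a hypothetical set of 15 baskets constructed to reproduce the 60% and 90% figures; the basket contents are invented for illustration.

```python
# Hypothetical baskets: 9 with cookies and milk, 1 with cookies but no milk,
# 5 with neither.
BASKETS = (
    [{"cookies", "milk"}] * 9
    + [{"cookies", "bread"}]
    + [{"bread"}] * 5
)

def count(baskets, items):
    """Number of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket)

def support(baskets, items):
    """Share of all baskets that contain every item in `items`."""
    return count(baskets, items) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    both = count(baskets, set(antecedent) | set(consequent))
    return both / count(baskets, antecedent)

print(f"support    = {support(BASKETS, {'cookies', 'milk'}):.0%}")
print(f"confidence = {confidence(BASKETS, {'cookies'}, {'milk'}):.0%}")
```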
Clustering
 Clustering: a method of automatically assigning a set of objects to groups or segments based on their similarities.
 Unlike classification, in clustering the class labels are unknown.
 The clusters are established as the selected algorithm goes through the data set, identifying what things have in common based on their characteristics.
 Many clustering techniques are based on optimization.
 The goal of clustering is to create groups so that members within each group have maximum similarity and members across groups have minimum similarity.
Clustering techniques
 Cluster analysis is a means of identifying
classes of items so that items in a cluster have
more in common with each other than with
items in other clusters.
 Example: create customer segmentation based
on income, age, race, location, etc.
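As an illustration of customer segmentation, here is a minimal k-means sketch. K-means is one common clustering technique; the slides do not name a specific algorithm, so treat the choice as an assumption. The customers, given as (age, income in thousands), are hypothetical, and the initial centroids are simply the first k points so that the run is repeatable.

```python
import math

# Hypothetical customers: (age, annual income in thousands).
CUSTOMERS = [(25, 30), (27, 32), (24, 28), (52, 90), (55, 95), (50, 88)]

def assign(points, centroids):
    """Put each point in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, max_iterations=20):
    """Alternate between assigning points and recomputing centroids."""
    centroids = list(points[:k])  # deterministic start, for illustration only
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]   # keep the centroid of an empty cluster
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:     # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)

centroids, clusters = kmeans(CUSTOMERS, k=2)
for members in clusters:
    print(members)
```

No labels are supplied anywhere: the two segments emerge only from the similarity of the records, which is the difference from classification.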
Data Mining Techniques
 Outlier Analysis: find the record(s) that is (are)
the most different from the other records, i.e.,
find all outliers. Outliers are data elements that
cannot be grouped in a given class or cluster.
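One simple way to find records that differ most from the rest is a z-score test: flag values that lie more than a chosen number of standard deviations from the mean. This is only one of several outlier-detection approaches, and the transaction amounts and the threshold of 2.0 below are hypothetical choices for illustration.

```python
import statistics

# Hypothetical transaction amounts; one value is far from the others.
AMOUNTS = [100, 102, 98, 101, 99, 100, 500]

def find_outliers(values, z_threshold=2.0):
    """Return the values whose distance from the mean exceeds
    z_threshold population standard deviations."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:          # all values identical: nothing can be an outlier
        return []
    return [v for v in values if abs(v - mean) / spread > z_threshold]

print(find_outliers(AMOUNTS))
```

With very small samples the largest attainable z-score is limited, so the threshold has to be chosen with the sample size in mind.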
Example of using Data Mining
Data Mining versus Statistics
Data Mining:
– Starts with a loosely defined discovery statement and uses all existing data (i.e. observational and secondary data) to discover novel patterns and relationships
– Data sets in data mining are as “big” as possible
Statistics:
– Starts with a well-defined proposition and collects sample data (i.e. primary data) to test the hypothesis
– Looks for the right size of data (if more data are available than the statistical analysis requires, a sample of the data is usually used)
Data Visualization
Take a break…
watch a video
 How Facebook Data Mining, And Your Info, Is Influencing
The 2016 Election | TODAY
https://www.youtube.com/watch?v=i-rIYadXoms
Knowledge Discovery in Databases (KDD)
 Knowledge Discovery in Databases (KDD), also called Knowledge Discovery from Data, refers to the broad process of finding knowledge in data and emphasizes the "high-level" application of particular data mining methods.
 The unifying goal of the KDD process is to extract knowledge from data in the context of large databases, which is done by using data mining methods.
 KDD refers to the entire process of discovering useful knowledge from data.
 This process involves deciding what qualifies as knowledge by evaluating and possibly interpreting the patterns. It also includes the choice of encoding schemes, preprocessing, sampling and projections of the data prior to the data mining step.
KDD: A Definition

 KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
[Figure: data of 10^6 to 10^12 bytes (we never see the whole data set, so it is put in the memory of computers) → then run data mining algorithms → knowledge (What is the knowledge? How to represent and use it?)]
Knowledge Discovery Process
Steps in KDD process
Knowledge Discovery Process
 The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge.
 The iterative process consists of the following steps:
 Data cleaning: also known as data cleansing, a phase in which noisy and irrelevant data are removed from the collection and missing data are dealt with.
 Data integration: multiple data sources, often heterogeneous, may be combined into a common source.
 Data selection: the data relevant to the analysis are decided on and retrieved from the data collection.
 Data transformation: also known as data consolidation, a phase in which the selected data are transformed into forms appropriate for the mining procedure.
 Data mining: the crucial step in which clever techniques are applied to extract potentially useful patterns. It searches for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression and clustering.
 Pattern evaluation: truly interesting patterns representing knowledge are identified based on given measures.
 Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
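The steps above can be sketched as a small pipeline with one function per step. Everything here is a hypothetical illustration: the two store data sets, the function names, the choice of counting item pairs as the mining step and the minimum support of 0.5 are assumptions, not part of the KDD definition.

```python
from collections import Counter
from itertools import combinations

# Two hypothetical, heterogeneous sources; None marks a missing value.
STORE_A = [
    {"customer": "c1", "items": ["milk", "cookies"]},
    {"customer": "c2", "items": None},
    {"customer": "c3", "items": ["milk", "cookies", "bread"]},
]
STORE_B = [
    {"customer": "c4", "items": ["Milk", "Cookies"]},
    {"customer": "c5", "items": ["bread"]},
]

def clean(records):
    """Data cleaning: drop records whose item list is missing or empty."""
    return [r for r in records if r["items"]]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records):
    """Data selection: keep only the attribute relevant to the analysis."""
    return [r["items"] for r in records]

def transform(item_lists):
    """Data transformation: normalise case and turn each list into a set."""
    return [frozenset(item.lower() for item in items) for items in item_lists]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {pair: n / n_baskets for pair, n in pair_counts.items()
            if n / n_baskets >= min_support}

def present(patterns):
    """Knowledge representation: render the patterns for the user."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]

baskets = transform(select(integrate(clean(STORE_A), clean(STORE_B))))
report = present(evaluate(mine(baskets), len(baskets)))
print(report)
```

In a real project the process is iterative, so the result of one step often sends the analyst back to an earlier one.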
3 methodologies of the KDD model
 Fayyad et al. (Computer science)
 E.g., WEKA
 SEMMA (SAS) (Statistics)
 SAS Enterprise Miner
 CRISP-DM (SPSS, OHRA) (Business)
 SPSS
Methodology of KDD – CRISP-DM
 CRISP-DM
 Stands for Cross Industry Standard Process for
Data Mining
 A non-proprietary, documented, and freely
available data mining model.
 It was developed by industry leaders with input
from more than 200 data mining users and data
mining tool and service providers.
 It is an industry-, tool- and application-neutral
model.
 This model encourages best practices and offers
organizations the structure needed to realize
better, faster results from data mining.
Six phases in CRISP-DM
CRISP-DM (Elaborate view)
Six phases of CRISP-DM
1. Business Understanding
 This initial phase focuses on understanding the project objectives and
requirements from a business perspective, and then converting this
knowledge into a data mining problem definition, and a preliminary
plan designed to achieve the objectives.
 Example: “What are the common characteristics of the customers we have lost to our competitors recently?”
2. Data Understanding
 The data understanding phase starts with initial data collection and proceeds with activities
 To get familiar with the data,
 To identify data quality problems,
 To discover first insights into the data, or
 To detect interesting subsets to form hypotheses about hidden information.
3. Data Preparation
 The data preparation phase covers all activities to
construct the final dataset (data that will be fed into the
modeling tool(s)) from the initial raw data.
 Data preparation tasks are likely to be performed multiple
times, and not in any prescribed order. Tasks include table,
record, and attribute selection as well as transformation
and cleaning of data for modeling tools.
4. Modeling
 In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, several techniques can be applied to the same data mining problem type.
5. Evaluate Results
 The accuracy and generality of the model were dealt with in the previous evaluation steps. The degree to which the model meets the business objectives is assessed in this step.
 This step also seeks to determine whether there is some valid business reason why the model is deficient. If time and budget permit, the model(s) can also be tested in the real application, which is another option for evaluation.
6. Deployment
 Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained needs to be organized and presented in a way that the client can use.
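The modeling and evaluation phases can be illustrated with a minimal sketch: a single parameter is calibrated on training data and the resulting model is then checked on held-out test data. The churn records (number of complaints, whether the customer was lost), the threshold rule and the function names are all hypothetical and stand in for whatever technique a real project would select.

```python
# Hypothetical labelled records: (number of complaints, customer was lost?).
TRAIN = [(0, False), (1, False), (2, False), (3, True),
         (4, True), (5, True), (1, False), (4, True)]
TEST = [(0, False), (2, False), (3, True), (5, True), (2, True)]

def predict(threshold, complaints):
    """Model: a customer is predicted lost once complaints reach the threshold."""
    return complaints >= threshold

def accuracy(threshold, records):
    """Share of records for which the prediction matches the known outcome."""
    hits = sum(1 for x, lost in records if predict(threshold, x) == lost)
    return hits / len(records)

def calibrate(records):
    """Modeling: pick the threshold with the best accuracy on the training set."""
    candidates = sorted({x for x, _ in records})
    return max(candidates, key=lambda t: accuracy(t, records))

best_threshold = calibrate(TRAIN)
test_accuracy = accuracy(best_threshold, TEST)   # evaluation on unseen data
print(best_threshold, test_accuracy)
```

Accuracy on held-out data is only the technical half of the evaluation phase; whether the model answers the original business question still has to be judged separately.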
KDD vs. DM
 DM is a component of the KDD process that is
mainly concerned with means by which patterns
and models are extracted and enumerated from
the data
 DM is quite technical
 Knowledge discovery involves evaluation and
interpretation of the patterns and models to
make the decision of what constitutes
knowledge and what does not
 KDD requires a lot of domain understanding
 The DM and KDD are often used interchangeably
 Perhaps DM is a more common term in business
world, and KDD in academic world
The end.

Video: Data Mining and Business Intelligent

https://www.youtube.com/watch?v=peSNJ5bfjX0

How data mining works?

https://www.youtube.com/watch?v=W44q6qszdqY
