Chapter 6 Data Mining

This document provides an introduction to data mining. It defines data mining as the process of discovering patterns and relationships in large datasets. The document outlines several data mining techniques including prediction, associations, and clustering. Prediction techniques include classification and regression. Association rule learning is used to discover relationships between variables. Clustering assigns objects to groups based on similarities. Examples of data mining applications in various industries are also provided.

Uploaded by

Jiawei Tan

Chapter 6

INTRODUCTION TO DATA MINING


Learning objectives:

 After this lesson, you will be able to:
 Define data mining
 Describe the various techniques used in the data mining process
 Understand the KDD process model
 Describe the various phases of CRISP-DM
 Identify applications of data mining
Definition of data mining
 Data mining is the process of discovering interesting knowledge, such as previously unknown patterns, associations or significant structures, from large amounts of data stored in databases, data warehouses or other information repositories.
 Another definition: data mining is an iterative process of creating predictive and descriptive models by uncovering previously unknown trends and patterns in vast amounts of data in order to support decision making.
 Data mining is a subset of Business Analytics.
 There is a need to turn data into useful information and knowledge for broad applications, including:
 Market analysis
 Business management
 Decision support
 Customer segmentation and behavior
 Etc.
How does data mining work?

 Data mining builds models to discover patterns among the attributes present in the data set.
 Models are:
 Mathematical representations (from simple linear relationships to highly non-linear relationships) that identify patterns among the attributes of things, such as customers and products
 Some of these patterns are explanatory and others are predictive (foretelling future values of certain attributes)
Why Mine Data? Commercial Viewpoint
 Lots of data is being collected and warehoused
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/credit card transactions
 Computers have become cheaper and more powerful
 Competitive pressure is strong
 Provide better, customized services for an edge (e.g. in Customer Relationship Management)
What is (not) Data Mining?

What is not data mining?
 Looking up a phone number in a phone directory
 Querying a Web search engine for information about “Amazon”

What is data mining?
 Discovering that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
 Grouping together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Examples of data mining applications

 Temporal data: banking data can be mined for changing trends, which may aid in scheduling bank tellers according to the volume of customer traffic.
 Stock exchange data can be mined to uncover trends that could help in planning investment strategies.
 Computer network data streams can be mined to detect intrusions based on anomalies in message flows, which may be discovered by clustering, by dynamic construction of stream models, or by comparing the current frequent patterns with those at a previous time.
 Spatial data: look for patterns that describe changes in metropolitan poverty rates based on city distances from major highways. Examining the relationships among a set of spatial objects can reveal which subsets of objects are spatially autocorrelated or associated.
Industry examples of DM applications
 Sales/Marketing
 Identify buying patterns of customers
 Find associations among customer demographic characteristics
 Banking
 Credit card fraud detection
 Identify ‘loyal’ customers
 Insurance and Health Care
 Claims analysis, i.e. which medical procedures are claimed together
 Predict which customers will buy new policies
 Transportation
 Determine the distribution schedules for the outlets
 Analyze loading patterns
 Medicine
 Characterize patient behavior in order to predict office visits
 Identify successful medical therapies for different diseases / illnesses
Take a break...
Watch a video
 Source of data mining
 https://www.youtube.com/watch?v=Y_JlkzzhAgw
Data Mining Tasks, Methods and Algorithms
Prediction
 Prediction refers to the act of telling about the future by taking into account experience, opinions and other relevant information.
 Depending on the nature of what is being predicted, prediction can be named more specifically:
 Classification: the predicted thing, such as tomorrow’s forecast, is a class label such as “rainy” or “sunny”
 Regression: the predicted thing, such as tomorrow’s temperature, is a real number such as 65 °F
 Time series: the data consists of values of the same variable captured and stored over time at regular intervals, such as a stock price
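The difference between regression and classification can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the temperature readings are invented, the helper name fit_line is hypothetical, and ordinary least squares is just one of many regression techniques. The regression output is a real number; turning it into a label with a cut-off (here 65 °F, also an assumption) gives a class label instead.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: today's temperature and the next day's temperature (in °F).
today = [60.0, 62.0, 64.0, 66.0, 68.0]
next_day = [61.0, 63.0, 65.0, 67.0, 69.0]

slope, intercept = fit_line(today, next_day)

# Regression: the prediction is a real number.
predicted_temperature = slope * 65.0 + intercept

# Classification: the prediction is a class label (the cut-off is an assumption).
predicted_label = "warm" if predicted_temperature >= 65.0 else "cool"
```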
Prediction techniques
 Classification: assign a new data record to one of several predefined categories or classes. It is a form of supervised learning.
 Classification approaches normally use a training set where
all objects are already associated with known class labels.
 The classification algorithm learns from the training set
and builds a model. The model is used to classify new
objects.
 This method has been used in customer segmentation,
business modeling, and credit analysis.
 For example, after starting a credit policy, the
OurVideoStore managers could analyze the customers’
behaviours via their credit, and label accordingly the
customers who received credits with three possible labels
“safe”, “risky” and “very risky”. The classification analysis
would generate a model that could be used to either
accept or reject credit requests in the future
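The credit example above can be sketched as a small nearest-neighbour classifier. This is a minimal illustration, not the method the video store would necessarily use: the two features (number of late payments, outstanding balance in hundreds), the training records and the function name classify are all assumptions made for the example. In a nearest-neighbour classifier the "model" is simply the stored training set, and a new record receives the majority label of its k closest training records.

```python
import math
from collections import Counter

# Hypothetical training set: (late_payments, balance_in_hundreds) -> known label.
TRAINING_SET = [
    ((0, 1), "safe"),
    ((1, 2), "safe"),
    ((0, 3), "safe"),
    ((3, 5), "risky"),
    ((4, 6), "risky"),
    ((3, 7), "risky"),
    ((8, 12), "very risky"),
    ((9, 10), "very risky"),
    ((7, 13), "very risky"),
]


def classify(record, training_set, k=3):
    """Label a new record with the majority label of its k nearest neighbours."""
    neighbours = sorted(
        training_set, key=lambda item: math.dist(item[0], record)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Classify three new credit requests.
new_customer_labels = [
    classify((1, 1), TRAINING_SET),
    classify((4, 5), TRAINING_SET),
    classify((8, 11), TRAINING_SET),
]
```

A request classified as "very risky" could then be rejected, while "safe" requests are accepted.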
Associations
 Association, or association rule learning, is a popular and well-researched data mining technique for discovering interesting relationships among variables in large databases.
 With the help of bar-code scanners, systems capture the purchase data in which association rules can discover regularities among products.
 Types of associations:
 Link analysis: the linkage among many objects of interest is discovered automatically, such as the links between web pages and the referential relationships among groups of academic publication authors
Association techniques
 Market-basket analysis: detect sets of attributes/items that frequently have association relationships or correlations among them, e.g. 90% of the people who buy cookies also buy milk (60% of all grocery shoppers buy both)
 In data mining, association rules are useful for analyzing and predicting customer behavior. They play an important part in shopping basket data analysis, product clustering, catalog design and store layout.
 Sequence mining (categorical): discover sequences of events that commonly occur together, e.g. in a set of DNA sequences, ACGTC is followed by GTCA after a gap of 9, with 30% probability
 One thing comes after another, for example: when a flu outbreak occurs, gloves will be in short supply
Association rules
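The two percentages in the cookies-and-milk example correspond to the usual measures of an association rule: support (the share of all transactions that contain both items) and confidence (the share of cookie transactions that also contain milk). A minimal Python sketch follows; the 15 transactions are invented so that the numbers match the example, and the function names are assumptions.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing antecedent that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Invented data: 9 baskets with cookies and milk, 1 with cookies only, 5 with neither.
TRANSACTIONS = (
    [{"cookies", "milk"}] * 9
    + [{"cookies"}]
    + [{"bread"}] * 5
)

rule_support = support(TRANSACTIONS, {"cookies", "milk"})          # about 0.60
rule_confidence = confidence(TRANSACTIONS, {"cookies"}, {"milk"})  # about 0.90
```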
Clustering
 Clustering: a method of automatically assigning a set of objects into groups or segments based on their similarities.
 Unlike classification, in clustering the class labels are unknown.
 As the selected algorithm goes through the data set, identifying what things have in common based on their characteristics, the clusters are established.
 Clustering techniques include optimization.
 The goal of clustering is to create groups so that the members within each group have maximum similarity and the members across groups have minimum similarity.
Clustering techniques
 Cluster analysis is a means of identifying
classes of items so that items in a cluster have
more in common with each other than with
items in other clusters.
 Example: create customer segmentation based
on income, age, race, location, etc.
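As an illustration of customer segmentation, the sketch below runs k-means, one common clustering algorithm, on invented (age, income in thousands) pairs. The data, the function name k_means and the choice of the first k points as starting centroids are assumptions made to keep the example short and repeatable; real implementations usually pick the starting centroids more carefully and scale the attributes first.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point to the
    nearest centroid and moving each centroid to the mean of its members."""
    centroids = list(points[:k])  # assumption: first k points as starting centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Invented customers: (age, income in thousands).
CUSTOMERS = [(25, 30), (60, 90), (27, 32), (62, 95), (24, 28), (58, 88)]

labels, centroids = k_means(CUSTOMERS, k=2)
```

No class labels are supplied: the two segments (younger, lower income and older, higher income) emerge from the similarities in the data alone.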
Data Mining Techniques
 Outlier Analysis: find the record(s) that is (are)
the most different from the other records, i.e.,
find all outliers. Outliers are data elements that
cannot be grouped in a given class or cluster.
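One simple way to flag records that are "most different" is a z-score test: measure how many standard deviations each value lies from the mean and report those beyond a threshold. This is a sketch of that particular technique, not the only way to find outliers; the transaction amounts, the threshold of 2.0 and the function name find_outliers are assumptions.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values lying more than threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Invented transaction amounts: nine ordinary purchases and one unusual one.
amounts = [52, 48, 50, 51, 49, 50, 47, 53, 50, 500]

outliers = find_outliers(amounts)
```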
Example of using Data Mining
Data Mining versus Statistics

Data Mining
 Starts with a loosely defined discovery statement and uses all existing data (i.e. observational and secondary data) to discover novel patterns and relationships
 Data sets in data mining are as “big” as possible

Statistics
 Starts with a well-defined proposition and collects sample data (i.e. primary data) to test the hypothesis
 Looks for the right size of data (if the data is larger than the statistical analysis requires, a sample of the data is usually used)
Data Visualization
Take a break…
watch a video
 How Facebook Data Mining, And Your Info, Is Influencing
The 2016 Election | TODAY
https://www.youtube.com/watch?v=i-rIYadXoms
Knowledge Discovery in Databases (KDD)
 Knowledge Discovery in Databases (KDD), also called Knowledge Discovery from Data, refers to the broad process of finding knowledge in data and emphasizes the "high-level" application of particular data mining methods.
 The unifying goal of the KDD process is to extract knowledge from data in the context of large databases, which is done by using data mining methods.
 KDD refers to the entire process of discovering useful knowledge from data.
 This process involves deciding what qualifies as knowledge by evaluating and possibly interpreting the patterns. It also includes the choice of encoding schemes, preprocessing, sampling, and projections of the data prior to the data mining step.
KDD: A Definition

 KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.

[Figure: 10^6 to 10^12 bytes of data: we never see the whole data set, so we put it in the memory of computers. Then run data mining algorithms. What is the knowledge? How to represent and use it?]
Knowledge Discovery Process
Steps in KDD process
Knowledge Discovery Process
 The Knowledge Discovery in Databases process comprises a few steps leading from raw data collections to some form of new knowledge.
 The iterative process consists of the following steps:
 Data cleaning: also known as data cleansing, it is a phase in which noisy and irrelevant data are removed from the collection and missing data are dealt with.
 Data integration: at this stage, multiple data sources, often heterogeneous, may be combined into a common source.
 Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data collection.
 Data transformation: also known as data consolidation, it is a phase in which the selected data is transformed into forms appropriate for the mining procedure.
 Data mining: it is the crucial step in which clever techniques are applied to extract potentially useful patterns. It searches for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering.
 Pattern evaluation: in this step, the truly interesting patterns representing knowledge are identified based on given measures.
 Knowledge representation: the final phase, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
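The seven steps above can be sketched as a chain of small Python functions, one per step. Everything in this sketch is a toy stand-in chosen for illustration: the two store data sets, the field names, the min-max scaling used as the transformation, the "high/low spender" rule used as the mining step and the 25% share used as the interestingness measure are all assumptions, not part of the KDD definition.

```python
from collections import Counter

# Hypothetical raw data from two sources.
STORE_A = [
    {"customer": "ann", "age": 34, "spend": 120.0},
    {"customer": "bob", "age": None, "spend": 80.0},  # missing value
    {"customer": "cat", "age": 29, "spend": -5.0},    # noise: negative spend
]
STORE_B = [
    {"customer": "dan", "age": 61, "spend": 300.0},
    {"customer": "eve", "age": 58, "spend": 280.0},
    {"customer": "fay", "age": 31, "spend": 110.0},
]


def clean(records):
    """Data cleaning: drop records with missing or noisy values."""
    return [r for r in records if r["age"] is not None and r["spend"] >= 0]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [record for source in sources for record in source]


def select(records):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [(r["age"], r["spend"]) for r in records]


def transform(rows):
    """Data transformation: min-max scale every attribute to the range 0..1."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def mine(rows):
    """Data mining (toy stand-in): label each customer by scaled spend."""
    return ["high" if spend > 0.5 else "low" for _, spend in rows]


def evaluate(patterns, min_share=0.25):
    """Pattern evaluation: keep groups covering at least min_share of customers."""
    total = len(patterns)
    return {label: n / total for label, n in Counter(patterns).items()
            if n / total >= min_share}


def present(summary):
    """Knowledge representation: render the result for the user."""
    return [f"{label}: {share:.0%}" for label, share in sorted(summary.items())]


report = present(evaluate(mine(transform(select(
    integrate(clean(STORE_A), clean(STORE_B)))))))
```

In practice the process is iterative: a poor result at the evaluation step typically sends the analyst back to an earlier step, such as selection or transformation.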
Three methodologies of the KDD model
 Fayyad et al. (Computer science)
 E.g., WEKA
 SEMMA (SAS) (Statistics)
 SAS Enterprise Miner
 CRISP-DM (SPSS, OHRA) (Business)
 SPSS
Methodology of KDD: CRISP-DM
 CRISP-DM
 Stands for Cross Industry Standard Process for
Data Mining
 A non-proprietary, documented, and freely
available data mining model.
 It was developed by industry leaders with input
from more than 200 data mining users and data
mining tool and service providers.
 It is an industry-, tool- and application-neutral
model.
 This model encourages best practices and offers
organizations the structure needed to realize
better, faster results from data mining.
Six phases in CRISP-DM
CRISP-DM (Elaborate view)
Six phases of CRISP-DM
1. Business Understanding
 This initial phase focuses on understanding the project objectives and
requirements from a business perspective, and then converting this
knowledge into a data mining problem definition, and a preliminary
plan designed to achieve the objectives.
 Such as “What are the common characteristics of the customers we
have lost to our competitors recently?”
2. Data Understanding
 The data understanding phase starts with an initial data collection. It proceeds with activities to:
 Get familiar with the data
 Identify data quality problems
 Discover first insights into the data
 Detect interesting subsets to form hypotheses about hidden information
3. Data Preparation
 The data preparation phase covers all activities to
construct the final dataset (data that will be fed into the
modeling tool(s)) from the initial raw data.
 Data preparation tasks are likely to be performed multiple
times, and not in any prescribed order. Tasks include table,
record, and attribute selection as well as transformation
and cleaning of data for modeling tools.
4. Modeling
 In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, several techniques can be applied to the same data mining problem type.
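Calibrating a parameter can be sketched as trying several candidate values and keeping the one that performs best on the training data, then checking the result on records that were held back. The sketch below does this for a deliberately simple one-parameter model; the records, the candidate cut-offs and the function names are all assumptions made for illustration.

```python
# Hypothetical labelled records: (number_of_late_payments, label).
TRAIN = [(0, "safe"), (1, "safe"), (2, "safe"),
         (3, "risky"), (5, "risky"), (6, "risky")]
HOLDOUT = [(1, "safe"), (4, "risky"), (2, "safe"), (7, "risky")]


def make_rule(cutoff):
    """A one-parameter model: 'risky' once late payments reach the cutoff."""
    return lambda late_payments: "risky" if late_payments >= cutoff else "safe"


def accuracy(rule, records):
    """Share of records whose predicted label matches the known label."""
    return sum(rule(x) == label for x, label in records) / len(records)


def calibrate(records, candidate_cutoffs):
    """Pick the cutoff with the highest accuracy on the given records."""
    return max(candidate_cutoffs, key=lambda c: accuracy(make_rule(c), records))


best_cutoff = calibrate(TRAIN, range(0, 8))
holdout_accuracy = accuracy(make_rule(best_cutoff), HOLDOUT)
```

The same accuracy function could be used to compare several different techniques on the same problem before choosing one.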
5. Evaluate Results
 The accuracy and generality of the model were dealt with in the previous steps. This step assesses the degree to which the model meets the business objectives.
 This step also seeks to determine whether there is some business reason why the model is deficient. If time and budget permit, the model(s) can also be tested in the real application, which is another option for evaluation.
6. Deployment
 The creation of the model is not the end of the project. Although the purpose of the model is to increase knowledge of the data, the knowledge gained needs to be organized and presented in a way that the client can use.
KDD vs. DM
 DM is a component of the KDD process that is mainly concerned with the means by which patterns and models are extracted and enumerated from the data
 DM is quite technical
 Knowledge discovery involves evaluating and interpreting the patterns and models to decide what constitutes knowledge and what does not
 KDD requires a lot of domain understanding
 The terms DM and KDD are often used interchangeably
 DM is perhaps the more common term in the business world, and KDD in the academic world
The end.

Video: Data Mining and Business Intelligence
https://www.youtube.com/watch?v=peSNJ5bfjX0

Video: How data mining works?
https://www.youtube.com/watch?v=W44q6qszdqY
