Chapter 6 Data Mining
Data sets in data mining are as “big” as possible.
Statistics looks for the right size of data (if more data is available than the statistical analysis requires, usually a sample of the data is used).
Data Visualization
Take a break…
watch a video
How Facebook Data Mining, And Your Info, Is Influencing
The 2016 Election | TODAY
https://www.youtube.com/watch?v=i-rIYadXoms
Knowledge Discovery in Databases (KDD)
Knowledge Discovery from Data (KDD) refers to the broad
process of finding knowledge in data, and emphasizes the
"high-level" application of particular data mining methods.
The unifying goal of the KDD process is to extract knowledge
from data in the context of large databases. This is done by
using data mining methods.
KDD refers to the entire process of discovering useful
knowledge from data.
This process involves deciding what qualifies as
knowledge by evaluating and possibly interpreting the
patterns. It also includes the choice of encoding schemes,
preprocessing, sampling, and projections of the data prior
to the data mining step.
KDD: A Definition
10^6 to 10^12 bytes: we never see the whole data set, so we put it in the memory of computers.
What is the knowledge? How do we represent and use it?
Knowledge Discovery Process
Steps in KDD process
The Knowledge Discovery in Databases process comprises several steps
leading from raw data collections to some form of new knowledge.
The iterative process consists of the following steps:
Data cleaning: also known as data cleansing, this is the phase in which noisy and
irrelevant data are removed from the collection and missing data are dealt with.
Data integration: at this stage, multiple data sources, often heterogeneous, may be
combined into a common source.
Data selection: at this step, the data relevant to the analysis are decided on and
retrieved from the data collection.
Data transformation: also known as data consolidation, this is the phase in which the
selected data are transformed into forms appropriate for the mining procedure.
Data mining: this is the crucial step in which clever techniques are applied to extract
potentially useful patterns. It searches for patterns of interest in a particular
representational form or a set of such representations, including classification rules
or trees, regression, and clustering.
Pattern evaluation: in this step, the truly interesting patterns representing
knowledge are identified based on given measures.
Knowledge representation: this is the final phase, in which the discovered knowledge is
visually represented to the user. This essential step uses visualization techniques to
help users understand and interpret the data mining results.
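The steps above can be sketched as a small chain of functions. This is only an illustrative sketch in plain Python: the two data sources (`sales`, `profiles`), the field names, the validity rule, the 0.5 threshold, and the choice of "mean scaled amount per segment" as the mined pattern are all assumptions made for the example, not part of any KDD standard.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw sources (invented for illustration only).
sales = [
    {"customer": "a", "amount": 120.0},
    {"customer": "b", "amount": None},   # missing value
    {"customer": "c", "amount": 80.0},
    {"customer": "d", "amount": -5.0},   # noise: negative amount
    {"customer": "e", "amount": 200.0},
]
profiles = {
    "a": {"segment": "retail", "city": "X"},
    "b": {"segment": "retail", "city": "Y"},
    "c": {"segment": "online", "city": "X"},
    "d": {"segment": "online", "city": "Z"},
    "e": {"segment": "retail", "city": "Z"},
}

def clean(rows):
    """Data cleaning: drop rows with missing or clearly invalid amounts."""
    return [r for r in rows if r["amount"] is not None and r["amount"] >= 0]

def integrate(rows, lookup):
    """Data integration: combine both sources on the customer key."""
    return [{**r, **lookup[r["customer"]]} for r in rows if r["customer"] in lookup]

def select(rows, fields):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in rows]

def transform(rows):
    """Data transformation: min-max scale the amount into [0, 1]."""
    lo = min(r["amount"] for r in rows)
    hi = max(r["amount"] for r in rows)
    span = (hi - lo) or 1.0  # avoid division by zero when all amounts are equal
    return [{**r, "amount_scaled": (r["amount"] - lo) / span} for r in rows]

def mine(rows):
    """Data mining: a deliberately simple pattern, mean scaled amount per segment."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["segment"]].append(r["amount_scaled"])
    return {segment: mean(values) for segment, values in groups.items()}

def evaluate(patterns, threshold=0.5):
    """Pattern evaluation: keep patterns that pass an (assumed) interest measure."""
    return {seg: v for seg, v in patterns.items() if v >= threshold}

def represent(patterns):
    """Knowledge representation: a text bar chart, one line per pattern."""
    return [f"{seg}: {'#' * round(v * 10)} ({v:.2f})" for seg, v in sorted(patterns.items())]

prepared = transform(select(integrate(clean(sales), profiles), ["segment", "amount"]))
interesting = evaluate(mine(prepared))
report = represent(interesting)
print("\n".join(report))
```

In practice the process is iterative, so a result from `evaluate` would typically send the analyst back to an earlier function with different rules or fields.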
Three methodologies of the KDD model
Fayyad et al. (Computer science), e.g., WEKA
SEMMA (SAS) (Statistics), e.g., SAS Enterprise Miner
CRISP-DM (SPSS, OHRA) (Business), e.g., SPSS
Methodology of KDD: CRISP-DM
Stands for Cross Industry Standard Process for
Data Mining
A non-proprietary, documented, and freely
available data mining model.
It was developed by industry leaders with input
from more than 200 data mining users and data
mining tool and service providers.
It is an industry-, tool- and application-neutral
model.
This model encourages best practices and offers
organizations the structure needed to realize
better, faster results from data mining.
Six phases in CRISP-DM
CRISP-DM (Elaborate view)
Six phases of CRISP-DM
1. Business Understanding
This initial phase focuses on understanding the project objectives and
requirements from a business perspective, and then converting this
knowledge into a data mining problem definition, and a preliminary
plan designed to achieve the objectives.
An example of such a problem: “What are the common characteristics of
the customers we have lost to our competitors recently?”
2. Data Understanding
The data understanding phase starts with an initial data collection. It
proceeds with activities
▪ To get familiar with the data,
▪ To identify data quality problems,
▪ To discover first insights into the data, or
▪ To detect interesting subsets to form hypotheses for hidden information.
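These activities can be illustrated with a small profiling sketch in plain Python. The records, the attribute names (`age`, `plan`, `churned`), and both helper functions are hypothetical and invented for this example. `profile` supports getting familiar with the data and spotting quality problems, while `rate_by` gives a first insight into subsets worth a hypothesis.

```python
from statistics import mean

# Hypothetical initial data collection (values invented for illustration).
records = [
    {"age": 34, "plan": "basic", "churned": False},
    {"age": None, "plan": "premium", "churned": True},  # missing age
    {"age": 51, "plan": "basic", "churned": True},
    {"age": 29, "plan": "Basic", "churned": False},     # inconsistent casing
    {"age": 45, "plan": "premium", "churned": True},
]

def profile(rows):
    """Summarize each attribute: missing count, distinct values, numeric range."""
    summary = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        present = [v for v in values if v is not None]
        info = {"missing": len(values) - len(present), "distinct": len(set(present))}
        # Treat a column as numeric only if every present value is a non-boolean number.
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        if numeric and len(numeric) == len(present):
            info.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
        summary[col] = info
    return summary

def rate_by(rows, key, target):
    """Rate of a boolean target within each subset defined by key."""
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r[target])
    return {k: sum(v) / len(v) for k, v in groups.items()}

summary = profile(records)
churn_by_plan = rate_by(records, "plan", "churned")

for col, info in summary.items():
    print(col, info)
print(churn_by_plan)
```

Here the three distinct values reported for `plan` expose a data quality problem (`basic` and `Basic` are probably the same plan), and the churn rate per plan suggests a subset to investigate further.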
3. Data Preparation
The data preparation phase covers all activities to
construct the final dataset (data that will be fed into the
modeling tool(s)) from the initial raw data.
Data preparation tasks are likely to be performed multiple
times, and not in any prescribed order. Tasks include table,
record, and attribute selection as well as transformation
and cleaning of data for modeling tools.
4. Modeling
In this phase, various modeling techniques are selected and
applied, and their parameters are calibrated to optimal values.
Typically, several techniques can be applied to the same data
mining problem type.
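As a rough sketch of this phase, the snippet below applies two techniques to the same classification problem and calibrates one parameter for each. Everything here is an assumption made for illustration: the one-dimensional toy data, the two techniques (k-nearest neighbours and a single threshold rule), the candidate parameter values, and the use of accuracy on a held-out validation set as the calibration criterion.

```python
# Hypothetical labeled data as (feature, label) pairs; values are invented.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0),
         (6.0, 1), (6.5, 1), (7.0, 1), (7.5, 1)]
valid = [(1.2, 0), (2.2, 0), (3.0, 0), (5.5, 1), (6.8, 1), (7.2, 1)]

def knn_predict(x, data, k):
    """Technique 1: majority vote among the k nearest training points."""
    nearest = sorted(data, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def threshold_predict(x, t):
    """Technique 2: predict class 1 when the feature reaches threshold t."""
    return 1 if x >= t else 0

def accuracy(predict, data):
    """Fraction of examples in data that predict labels correctly."""
    return sum(predict(x) == y for x, y in data) / len(data)

def calibrate(make_model, candidates, data):
    """Return (best_param, best_accuracy); ties keep the earliest candidate."""
    best = None
    for param in candidates:
        acc = accuracy(make_model(param), data)
        if best is None or acc > best[1]:
            best = (param, acc)
    return best

results = {
    "knn": calibrate(lambda k: (lambda x: knn_predict(x, train, k)),
                     [1, 3, 5], valid),
    "threshold": calibrate(lambda t: (lambda x: threshold_predict(x, t)),
                           [2.0, 4.0, 6.0], valid),
}
# max() returns the first technique with the highest validation accuracy.
best_technique = max(results, key=lambda name: results[name][1])

for name, (param, acc) in results.items():
    print(f"{name}: best parameter={param}, validation accuracy={acc:.2f}")
```

Because the parameters are chosen on the validation set, the reported accuracy is optimistic. A separate test set would be needed for an honest estimate.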
5. Evaluate Results
The accuracy and generality of the model were dealt with in
the previous evaluation steps. The degree to which the
model meets the business objectives is assessed in this step.
This step also seeks to determine whether there is some valid
business reason why the model is deficient. If time and
budget permit, the model(s) can also be tested in the real
application, which is another option for evaluation.
6. Deployment
The creation of the model is not the end of the project.
Though the purpose of the model is to increase knowledge
of the data, the knowledge gained needs to be organized
and presented in such a way that the client can use it.
KDD vs. DM
DM is a component of the KDD process that is
mainly concerned with the means by which patterns
and models are extracted and enumerated from
the data
DM is quite technical
Knowledge discovery involves evaluation and
interpretation of the patterns and models to
decide what constitutes knowledge and what
does not
KDD requires a lot of domain understanding
The terms DM and KDD are often used interchangeably
Perhaps DM is the more common term in the business
world, and KDD in the academic world
The end.