Section 1: Data Mining
Main objective of this session
Aim:
• Introduce the data mining and multivariate analysis process.
• Discuss the SEMMA and CRISP-DM processes in general.
• Study the Sampling, Exploring and Modifying stages of the SEMMA process.
Learning outcomes
1. Understand the concept of data mining.
5. Understand the stages of Sampling, Exploring and Modifying in the SEMMA process.
– Multivariate statistics is a set of models, together with tests of how well the models fit the data, for the case where the data involve several dimensions or variables.
• Each row (n) represents an object on which the measurements are taken.
• Each column (p) represents one of the characteristics, variables or fields collected on each object.
1.2 Types of data:
• Qualitative
• Quantitative (cont.)
  – Continuous, e.g. weight in the range [0, 200] kg
Source: http://www.regentsprep.org/Regents/math/ALGEBRA/AD1/qualquant.htm
1.2 Components of Data Mining Algorithms
A data mining application consists of five components:
1. Task: what is the data mining application trying to do?
   – Visualization, density estimation, classification, clustering
2. Structure: what is the model or pattern we are trying to fit to the data?
   – Linear regression model, classification tree, hierarchical clustering
1.3 Data Mining Tasks
• Descriptive modelling
  – Describe all of the data, or the process that generates it
1.3 Data Mining Tasks (cont.)
• Predictive modelling
  – Predict the value of one variable given the values of the other variables
  – Classification: the predicted variable is categorical
  – Regression: the predicted variable is continuous
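As a toy illustration of the two predictive tasks, the sketch below fits a least-squares line (regression: continuous target) and derives a categorical label (classification) from the same data. All numbers and names here are invented for the example.

```python
# Toy data: one known variable (xs) and one target to predict (ys)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

# Regression: the predicted variable is continuous -> least-squares line y = a*x + b
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Classification: the predicted variable is categorical ("high" vs "low"),
# here produced by a simple threshold rule at the mean of y
labels = ["high" if y > my else "low" for y in ys]
```

The same inputs drive both tasks; only the type of the predicted variable changes.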
CRISP-DM AND SEMMA FOR DATA MINING
Cross-Industry Standard Process for Data Mining: CRISP-DM
• A complete view of the business process necessary to create an analytics model.
  – Born from the European Strategic Program on Research in Information Technology (ESPRIT).
Data Understanding
• This phase includes initial variable selection and data cleansing.
• Goal: to gain the first insights from the data, and to build the final data table (the input data for the models).
• The next step is the modelling stage; this can be better understood through the SEMMA process.
1.4 SEMMA Process
1.5 Sample Stage
– General population
  • Questionnaires
– Own customers
  • Customer transactions: loyalty cards, customer invoices
1.5 Sample Stage
Types of secondary data
– Geo-demographic data
  • Census of population (every 10 years in the UK)
– Random sample
  • Pick with gaps of 4, 5, 7, etc., pick the record with ID 45719, or keep a record if a random digit is above 7 and drop it if 7 or below (a 20% sample).
– Systematic sample
  • Choose data at regular intervals, e.g. every 10th record
  • Usually pseudo-random if not truly random
1.5 Sample Stage
Sampling and Data Collection
– Stratified sample
  • Split the population into strata, and agree the % wanted from each stratum (not necessarily the % in the population). A random sample is then taken from each stratum (this is the difference from quota sampling).
– Multi-stage sample
  • The sample is concentrated in a few geographic regions, so that it is easy to collect.
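The sampling schemes above can be sketched in a few lines; the population and the strata here are invented for illustration.

```python
import random

population = list(range(1, 101))          # 100 customer records, ids 1..100

# Systematic sample: choose at regular intervals, e.g. every 10th record
systematic = population[9::10]            # ids 10, 20, ..., 100

# Stratified sample: agree a % per stratum, then sample randomly within each
strata = {"young": population[:30], "old": population[30:]}
random.seed(42)                           # pseudo-random, as the slide notes
stratified = {name: random.sample(members, k=len(members) // 10)
              for name, members in strata.items()}
```

Note that the stratified sample draws randomly *within* each stratum, which is exactly the difference from quota sampling mentioned above.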
1.4 SEMMA Process
1.6 Explore Stage
Motivation
– Dirty, noisy data
• For example, Age= -2003
– Inconsistent data
  • Does a value of "0" mean an actual zero or a missing value?
– Incomplete data
• Income=?
• Measure of location
  – Given the set of weekly sales numbers 1, 3, 3, 3, 3, 4, 6, 6, 7, 14, how would you use one number to describe their location, i.e. are they selling in the 10s or the 100s?
• Mean
  – Take the average, i.e. (x1 + x2 + … + xn)/n
    • (1+3+3+3+3+4+6+6+7+14)/10 = 5
  – It is a measure of the expected value of the corresponding distribution
  – Affected strongly by outliers
1.6 Explore Stage
Descriptive statistics - measures of location
• Median
  – The middle value once the observations are ordered; for 1, 3, 3, 3, 3, 4, 6, 6, 7, 14 the median is (3+4)/2 = 3.5
• Mode
  – The value which occurs most frequently
  – So the four 3s mean 3 is the mode
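Python's statistics module reproduces the three location measures on the sales numbers from the slide:

```python
from statistics import mean, median, mode

sales = [1, 3, 3, 3, 3, 4, 6, 6, 7, 14]   # weekly sales from the slide

avg = mean(sales)       # (1+3+...+14)/10 = 5
mid = median(sales)     # middle of the ordered data: (3+4)/2 = 3.5
top = mode(sales)       # most frequent value: 3
```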
1.6 Explore Stage
Descriptive statistics - measures of spread
Given the set of weekly sales numbers 1, 3, 3, 3, 3, 4, 6, 6, 7, 14, how would you use one number to describe their spread, i.e. are there large variations in what is being sold?
– Range
  • Range = maximum value − minimum value = 14 − 1 = 13
– Quartile deviation
  • Q1 is the value with 25% of observations below it; Q3 has 75% below it; the quartile deviation is (Q3 − Q1)/2
1.6 Explore Stage
Descriptive statistics - measures of spread
– Variance
  • Population variance: σ² = Σ_{i=1..n} (xᵢ − µ)² / n
  • Sample variance divides by n − 1 instead of n
  • Agrees with the definition of variance for probability distributions
– Coefficient of variation
  • Coefficient of variation = (standard deviation)/(mean)
– Covariance
  • Cov(X, Y) = E((X − E(X)) (Y − E(Y))) = E(XY) − E(X)E(Y)
  • Sample covariance: Cov(x, y) = Σ_{i=1..n} (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
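The spread measures above, computed directly from their formulas on the same sales numbers:

```python
sales = [1, 3, 3, 3, 3, 4, 6, 6, 7, 14]
n = len(sales)
xbar = sum(sales) / n                                   # mean = 5

# Sample variance: sum of squared deviations divided by n - 1
var = sum((x - xbar) ** 2 for x in sales) / (n - 1)     # 120/9
sd = var ** 0.5                                         # standard deviation
cv = sd / xbar                                          # coefficient of variation
```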
1.6 Explore Stage
– Correlation
  • Corr(X, Y) = Cov(X, Y) / ( √Var(X) · √Var(Y) )
  • Correlation is unit invariant, and −1 ≤ Corr(X, Y) ≤ 1
  • If Corr(X, Y) = 1 then Y = aX + b (a > 0)
  • If Corr(X, Y) = −1 then Y = aX + b (a < 0)
  • If X and Y are independent then Corr(X, Y) = 0 (zero correlation does not by itself imply independence)
  • Sample correlation: Corr(x, y) = Σ_{i=1..n} (xᵢ − x̄)(yᵢ − ȳ) / √( Σ (xᵢ − x̄)² · Σ (yᵢ − ȳ)² )
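The sample covariance and correlation formulas above, on a pair of toy series where y is an exact increasing linear function of x, so the correlation must come out as 1:

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]          # ys = 2 * xs exactly

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Sample covariance and standard deviations (n - 1 denominators)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
sx = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
sy = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5

corr = cov / (sx * sy)         # 1, since Y = aX + b with a > 0
```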
1.6 Explore Stage
Graphics and charts
– Pie charts: good for nominal data (e.g. shares of Food categories)
– Bar charts: e.g. monthly sales Jan–Jun, or Sales counts grouped into bands 00–49, 50–99, 100–149, 150–199
1.6 Explore Stage
Graphics and charts
• Box plot
1.7 Modifying variables
The idea is to pre-process the data in order to deal with:
– Missing values.
1.7 Modifying variables
Missing Values
– Keep
  • The fact that a variable is missing can be important information!
– Delete
  • When there are too many missing values, removing the variable (or the observation) may be the best option.
1.7 Modifying variables
Imputation Procedures for Missing Values
– Advanced schemes
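Before any advanced scheme, the two simplest imputation procedures are mean imputation for a numeric field and mode imputation for a categorical one; a minimal sketch, with invented field names and values:

```python
from collections import Counter

incomes = [2500, None, 3100, 2800, None]          # None marks a missing value
statuses = ["rent", "own", None, "own", "own"]

# Mean imputation for the numeric variable
observed = [v for v in incomes if v is not None]
mean_income = sum(observed) / len(observed)       # (2500+3100+2800)/3 = 2800
incomes = [mean_income if v is None else v for v in incomes]

# Mode imputation for the categorical variable
mode_status = Counter(s for s in statuses if s is not None).most_common(1)[0][0]
statuses = [mode_status if s is None else s for s in statuses]
```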
1.7 Modifying variables
Outliers
– Types of outliers
  • Valid observation: e.g. salary of the boss, ratio variables
– Visually
  • histogram based
  • box plots
– z-score: zᵢ = (xᵢ − µ) / σ
[Figure: monthly share price (Close), 01/1985–12/2007, on the raw scale]
– Log transformation
  • If the share price is S(t), work with log S(t); it is used when there is clear evidence of such a relationship (at least graphically).
[Figure: the same series as logClose]
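z-score outlier detection from the formula above, on made-up data. Note that a single extreme value inflates σ itself, so a generous cut-off such as |z| > 3 can mask the outlier; a cut-off of 2 is used in this sketch.

```python
values = [20, 22, 21, 23, 19, 21, 200]            # 200 is the suspect point

n = len(values)
mu = sum(values) / n
sigma = (sum((x - mu) ** 2 for x in values) / n) ** 0.5

# z-score per observation; flag those with |z| above the cut-off
z_scores = [(x - mu) / sigma for x in values]
outliers = [x for x, z in zip(values, z_scores) if abs(z) > 2]
```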
• Coarse classification
Motivation
– Group values of categorical variables for robust analysis
[Figure: % Default against Age]
Source: Lyn Thomas, A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers, International Journal of Forecasting, 16, 149-172, 2000
1.7 Modifying variables
Data transformation
• Coarse classification
  – Group the values of a characteristic into classes (G1, G2, G3, G4, G5) by % Default
  – The chi-squared method helps to decide which is the best way to group.
  • Consider the following example (taken from the book Credit Scoring and Its Applications, by Thomas, Edelman and Crook, 2002):

Attribute  Owner  Rent Unfurnished  Rent Furnished  With parents  Other  No answer  Total
Goods      6000   1600              350             950           90     10         9000
Bads       300    400               140             100           50     10         1000

  • Goods are the people who never defaulted in the last year, and Bads the others.
  • If the odds were the same as in the whole population, the expected number of good owners would be (6000 + 300) × 9000/10000 = 5670.
  • Likewise, the expected number of bad renters would be (1600 + 400 + 350 + 140) × 1000/10000 = 249.
– Option 1 (Owner / Renters / Others):
  χ² = (6000 − 5670)²/5670 + (300 − 630)²/630 + (1950 − 2241)²/2241 + (540 − 249)²/249 + (1050 − 1089)²/1089 + (160 − 121)²/121 ≈ 583
– Option 2 (Owner / With parents / Others):
  χ² = (6000 − 5670)²/5670 + (300 − 630)²/630 + (950 − 945)²/945 + (100 − 105)²/105 + (2050 − 2385)²/2385 + (600 − 265)²/265 ≈ 662
– The higher the test statistic, the better the split (formally, compare with a chi-square distribution with k − 1 degrees of freedom for k classes of the characteristic).
1.7 Modifying variables

Option 1 (Owner / Renters / Others)
Observed:                                  Expected:
Attribute  Owner  Renters  Others  Total   Attribute  Owner  Renters  Others  Total
Goods      6000   1950     1050    9000    Goods      5670   2241     1089    9000
Bads       300    540      160     1000    Bads       630    249      121     1000
Total      6300   2490     1210    10000   Total      6300   2490     1210    10000

Option 2 (Owner / With parents / Others)
Observed:                                       Expected:
Attribute  Owner  With parents  Others  Total   Attribute  Owner  With parents  Others  Total
Goods      6000   950           2050    9000    Goods      5670   945           2385    9000
Bads       300    100           600     1000    Bads       630    105           265     1000
Total      6300   1050          2650    10000   Total      6300   1050          2650    10000
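The two chi-square statistics can be recomputed from the observed tables; the expected counts use the population odds of 9000 goods to 1000 bads, as above.

```python
def chi_square(goods, bads):
    """2 x k chi-square against expected counts at the population odds (90/10)."""
    chi2 = 0.0
    for g, b in zip(goods, bads):
        column_total = g + b
        eg, eb = 0.9 * column_total, 0.1 * column_total   # expected goods / bads
        chi2 += (g - eg) ** 2 / eg + (b - eb) ** 2 / eb
    return chi2

opt1 = chi_square([6000, 1950, 1050], [300, 540, 160])   # Owner/Renters/Others
opt2 = chi_square([6000, 950, 2050], [300, 100, 600])    # Owner/With parents/Others
```

opt1 comes out at roughly 583.9 and opt2 at roughly 662.9, matching the slide's 583 and 662; the larger statistic marks the better split.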
Deviance encoding (weight of evidence, WOE)
– The lower the weight of evidence (in favour of being a defaulter), the higher are the chances of default in this category.

Age    Num   %        Goods (No Default)  %        Bads (Defaults)  %        WOE
18-22  250   12.50%   194                 10.74%   56               28.87%   -0.98851
23-26  300   15.00%   246                 13.62%   54               27.84%   -0.71466
27-29  450   22.50%   405                 22.43%   45               23.20%   -0.03379
30-35  500   25.00%   475                 26.30%   25               12.89%   0.713427
35-44  350   17.50%   339                 18.77%   11               5.67%    1.197093
44+    150   7.50%    147                 8.14%    3                1.55%    1.660809
Total  2000  100.00%  1806                100.00%  194              100.00%
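The WOE column can be reproduced from the goods/bads counts in the table: per age band, WOE = ln(%Goods / %Bads), with the percentages taken over total goods and total bads respectively.

```python
from math import log

bands = {                      # age band: (goods, bads), from the table
    "18-22": (194, 56), "23-26": (246, 54), "27-29": (405, 45),
    "30-35": (475, 25), "35-44": (339, 11), "44+": (147, 3),
}
total_goods = sum(g for g, _ in bands.values())    # 1806
total_bads = sum(b for _, b in bands.values())     # 194

# WOE per band: log of the ratio of the column percentages
woe = {band: log((g / total_goods) / (b / total_bads))
       for band, (g, b) in bands.items()}
```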
1.10 Information Value
• Information Value (IV) is a measure of predictive power used to:
  – assess the appropriateness of the classing
  – select predictive variables
• IV = Σₖ (%Goodsₖ − %Badsₖ) × WOEₖ

Age    %Goods   %Bads    WOE       (%Goods − %Bads) × WOE
18-22  10.74%   28.87%   -0.98851  0.17915675
23-26  13.62%   27.84%   -0.71466  0.10158085
27-29  22.43%   23.20%   -0.03379  0.00026037
30-35  26.30%   12.89%   0.71343   0.09570358
35-44  18.77%   5.67%    1.19709   0.15682713
44+    8.14%    1.55%    1.66081   0.1094995
Total  100.00%  100.00%            IV = 0.64302817

– Rule of thumb:
  • < 0.02 : unpredictive
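The IV total of about 0.643 can be reproduced from the same goods/bads counts by summing (%Goods − %Bads) × WOE over the age bands:

```python
from math import log

bands = {"18-22": (194, 56), "23-26": (246, 54), "27-29": (405, 45),
         "30-35": (475, 25), "35-44": (339, 11), "44+": (147, 3)}
G = sum(g for g, _ in bands.values())   # 1806 goods in total
B = sum(b for _, b in bands.values())   # 194 bads in total

# IV = sum over bands of (%Goods - %Bads) * log(%Goods / %Bads)
iv = sum((g / G - b / B) * log((g / G) / (b / B)) for g, b in bands.values())
```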