Section 1: Data Mining
Main objective of this session
Aim:
• Introduce the data mining and multivariate analysis process.
• Discuss the SEMMA and CRISP-DM processes in general.
• Study the Sampling, Exploring and Modifying stages of the SEMMA process.
Learning outcomes
1. Understand the concept of data mining.
5. Understand the stages of Sampling, Exploring and Modifying in the SEMMA process.
– Multivariate statistics is a set of models, together with tests of how well the models fit the data, for the case where the data involve several dimensions or variables.
• Each row (n) represents an object on which the measurements are taken.
• Each column (p) represents one of the characteristics, variables or fields collected on each object.
1.2 Types of data:
• Qualitative
• Quantitative (cont.)
  – Continuous, e.g. weight in the range [0, 200] kg
Source: http://www.regentsprep.org/Regents/math/ALGEBRA/AD1/qualquant.htm
1.2 Components of Data Mining Algorithms
A data mining application consists of five components:
1. Task: what is the data mining application trying to do?
   – Visualization, density estimation, classification, clustering
2. Structure: what is the model or pattern we are trying to fit to the data?
   – Linear regression model, classification tree, hierarchical clustering
1.3 Data Mining Tasks
• Descriptive modelling
  – Describe all of the data, or the process that generates it
1.3 Data Mining Tasks (cont.)
• Predictive modelling
  – Predict the value of one variable given the values of the other variables
  – Classification: the predicted variable is categorical
  – Regression: the predicted variable is continuous
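As a toy illustration of the two predictive tasks, the sketch below fits a least-squares line (regression: continuous target) and derives a categorical label (classification) from the same data. All numbers and names here are invented for the example.

```python
# Toy data: one known variable (xs) and one target to predict (ys)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

# Regression: the predicted variable is continuous -> least-squares line y = a*x + b
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Classification: the predicted variable is categorical ("high" vs "low"),
# here produced by a simple threshold rule at the mean of y
labels = ["high" if y > my else "low" for y in ys]
```

The same inputs drive both tasks; only the type of the predicted variable changes.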
CRISP-DM AND SEMMA FOR DATA MINING
Cross-Industry Standard Process for Data Mining: CRISP-DM
• A complete view of the business process necessary to create an analytics model.
  – Born from the European Strategic Program on Research in Information Technology (ESPRIT).
Data Understanding
• This phase includes initial variable selection and data cleansing.
• Goal: to gain the first insights from the data, and to build the final data table (the input data for the models).
• The next step is the modelling stage; this can be better understood through the SEMMA process.
1.4 SEMMA Process
1.5 Sample Stage
– General population
  • Questionnaires
– Own customers
  • Customer transactions: loyalty cards, customer invoices
1.5 Sample Stage
Types of secondary data
– Geo-demographic data
  • Census of population (every 10 years in the UK)
– Random sample
  • Pick with gaps of 4, 5, 7, etc., pick the record with ID 45719, or keep a record if a random digit is above 7 and drop it if 7 or below (a 20% sample).
– Systematic sample
  • Choose data at regular intervals, e.g. every 10th record
  • Usually pseudo-random if not truly random
1.5 Sample Stage
Sampling and Data Collection
– Stratified sample
  • Split the population into strata, and agree the % wanted from each stratum (not necessarily the % in the population). A random sample is then taken from each stratum (this is the difference from quota sampling).
– Multi-stage sample
  • The sample is concentrated in a few geographic regions, so that it is easy to collect.
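The sampling schemes above can be sketched in a few lines; the population and the strata here are invented for illustration.

```python
import random

population = list(range(1, 101))          # 100 customer records, ids 1..100

# Systematic sample: choose at regular intervals, e.g. every 10th record
systematic = population[9::10]            # ids 10, 20, ..., 100

# Stratified sample: agree a % per stratum, then sample randomly within each
strata = {"young": population[:30], "old": population[30:]}
random.seed(42)                           # pseudo-random, as the slide notes
stratified = {name: random.sample(members, k=len(members) // 10)
              for name, members in strata.items()}
```

Note that the stratified sample draws randomly *within* each stratum, which is exactly the difference from quota sampling mentioned above.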
1.4 SEMMA Process
1.6 Explore Stage
Motivation
– Dirty, noisy data
• For example, Age= -2003
– Inconsistent data
  • Does a value of "0" mean an actual zero or a missing value?
– Incomplete data
• Income=?
• Measure of location
  – Given the set of weekly sales numbers 1, 3, 3, 3, 3, 4, 6, 6, 7, 14, how would you use one number to describe their location, i.e. are they selling in the 10s or the 100s?
• Mean
  – Take the average, i.e. (x1 + x2 + … + xn)/n
    • (1+3+3+3+3+4+6+6+7+14)/10 = 5
  – It is a measure of the expected value of the corresponding distribution
  – Affected strongly by outliers
1.6 Explore Stage
Descriptive statistics - measures of location
• Median
  – The middle value once the observations are ordered; for 1, 3, 3, 3, 3, 4, 6, 6, 7, 14 the median is (3+4)/2 = 3.5
• Mode
  – The value which occurs most frequently
  – So the four 3s mean 3 is the mode
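Python's statistics module reproduces the three location measures on the sales numbers from the slide:

```python
from statistics import mean, median, mode

sales = [1, 3, 3, 3, 3, 4, 6, 6, 7, 14]   # weekly sales from the slide

avg = mean(sales)       # (1+3+...+14)/10 = 5
mid = median(sales)     # middle of the ordered data: (3+4)/2 = 3.5
top = mode(sales)       # most frequent value: 3
```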
1.6 Explore Stage
Descriptive statistics - measures of spread
Given the set of weekly sales numbers 1, 3, 3, 3, 3, 4, 6, 6, 7, 14, how would you use one number to describe their spread, i.e. are there large variations in what is being sold?
– Range
  • Range = maximum value − minimum value = 14 − 1 = 13
– Quartile deviation
  • Q1 is the value with 25% of observations below it; Q3 has 75% below it; the quartile deviation is (Q3 − Q1)/2
1.6 Explore Stage
Descriptive statistics - measures of spread
– Variance
  • Population variance: σ² = Σ_{i=1..n} (xᵢ − µ)² / n
  • Sample variance divides by n − 1 instead of n
  • Agrees with the definition of variance for probability distributions
– Coefficient of variation
  • Coefficient of variation = (standard deviation)/(mean)
– Covariance
  • Cov(X, Y) = E((X − E(X)) (Y − E(Y))) = E(XY) − E(X)E(Y)
  • Sample covariance: Cov(x, y) = Σ_{i=1..n} (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
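The spread measures above, computed directly from their formulas on the same sales numbers:

```python
sales = [1, 3, 3, 3, 3, 4, 6, 6, 7, 14]
n = len(sales)
xbar = sum(sales) / n                                   # mean = 5

# Sample variance: sum of squared deviations divided by n - 1
var = sum((x - xbar) ** 2 for x in sales) / (n - 1)     # 120/9
sd = var ** 0.5                                         # standard deviation
cv = sd / xbar                                          # coefficient of variation
```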
1.6 Explore Stage
– Correlation
  • Corr(X, Y) = Cov(X, Y) / ( √Var(X) · √Var(Y) )
  • Correlation is unit invariant, and −1 ≤ Corr(X, Y) ≤ 1
  • If Corr(X, Y) = 1 then Y = aX + b (a > 0)
  • If Corr(X, Y) = −1 then Y = aX + b (a < 0)
  • If X and Y are independent then Corr(X, Y) = 0 (zero correlation does not by itself imply independence)
  • Sample correlation: Corr(x, y) = Σ_{i=1..n} (xᵢ − x̄)(yᵢ − ȳ) / √( Σ (xᵢ − x̄)² · Σ (yᵢ − ȳ)² )
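The sample covariance and correlation formulas above, on a pair of toy series where y is an exact increasing linear function of x, so the correlation must come out as 1:

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]          # ys = 2 * xs exactly

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Sample covariance and standard deviations (n - 1 denominators)
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
sx = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
sy = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5

corr = cov / (sx * sy)         # 1, since Y = aX + b with a > 0
```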
1.6 Explore Stage
Graphics and charts
– Pie charts: good for nominal data (e.g. shares of Food categories)
– Bar charts: e.g. monthly sales Jan–Jun, or Sales counts grouped into bands 00–49, 50–99, 100–149, 150–199
1.6 Explore Stage
Graphics and charts
• Box plot
1.7 Modifying variables
The idea is to pre-process the data in order to deal with:
– Missing values.
1.7 Modifying variables
Missing Values
– Keep
  • The fact that a variable is missing can be important information!
– Delete
  • When there are too many missing values, removing the variable (or the observation) may be the best option.
1.7 Modifying variables
Imputation Procedures for Missing Values
– Advanced schemes
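Before any advanced scheme, the two simplest imputation procedures are mean imputation for a numeric field and mode imputation for a categorical one; a minimal sketch, with invented field names and values:

```python
from collections import Counter

incomes = [2500, None, 3100, 2800, None]          # None marks a missing value
statuses = ["rent", "own", None, "own", "own"]

# Mean imputation for the numeric variable
observed = [v for v in incomes if v is not None]
mean_income = sum(observed) / len(observed)       # (2500+3100+2800)/3 = 2800
incomes = [mean_income if v is None else v for v in incomes]

# Mode imputation for the categorical variable
mode_status = Counter(s for s in statuses if s is not None).most_common(1)[0][0]
statuses = [mode_status if s is None else s for s in statuses]
```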
1.7 Modifying variables
Outliers
– Types of outliers
  • Valid observation: e.g. salary of the boss, ratio variables
– Visually
  • histogram based
  • box plots
– z-score: zᵢ = (xᵢ − µ) / σ
[Figure: monthly share price (Close), 01/1985–12/2007, on the raw scale]
– Log transformation
  • If the share price is S(t), work with log S(t); it is used when there is clear evidence of such a relationship (at least graphically).
[Figure: the same series as logClose]
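z-score outlier detection from the formula above, on made-up data. Note that a single extreme value inflates σ itself, so a generous cut-off such as |z| > 3 can mask the outlier; a cut-off of 2 is used in this sketch.

```python
values = [20, 22, 21, 23, 19, 21, 200]            # 200 is the suspect point

n = len(values)
mu = sum(values) / n
sigma = (sum((x - mu) ** 2 for x in values) / n) ** 0.5

# z-score per observation; flag those with |z| above the cut-off
z_scores = [(x - mu) / sigma for x in values]
outliers = [x for x, z in zip(values, z_scores) if abs(z) > 2]
```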
• Coarse classification
Motivation
– Group values of categorical variables for robust analysis
[Figure: % Default against Age]
Source: Lyn Thomas, A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers, International Journal of Forecasting, 16, 149-172, 2000
1.7 Modifying variables
Data transformation
• Coarse classification
  – Group the values of a characteristic into classes (G1, G2, G3, G4, G5) by % Default
  – The chi-squared method helps to decide which is the best way to group.
  • Consider the following example (taken from the book Credit Scoring and Its Applications, by Thomas, Edelman and Crook, 2002):

Attribute  Owner  Rent Unfurnished  Rent Furnished  With parents  Other  No answer  Total
Goods      6000   1600              350             950           90     10         9000
Bads       300    400               140             100           50     10         1000

  • Goods are the people who never defaulted in the last year, and Bads the others.
  • If the odds were the same as in the whole population, the expected number of good owners would be (6000 + 300) × 9000/10000 = 5670.
  • Likewise, the expected number of bad renters would be (1600 + 400 + 350 + 140) × 1000/10000 = 249.
– Option 1 (Owner / Renters / Others):
  χ² = (6000 − 5670)²/5670 + (300 − 630)²/630 + (1950 − 2241)²/2241 + (540 − 249)²/249 + (1050 − 1089)²/1089 + (160 − 121)²/121 ≈ 583
– Option 2 (Owner / With parents / Others):
  χ² = (6000 − 5670)²/5670 + (300 − 630)²/630 + (950 − 945)²/945 + (100 − 105)²/105 + (2050 − 2385)²/2385 + (600 − 265)²/265 ≈ 662
– The higher the test statistic, the better the split (formally, compare with a chi-square distribution with k − 1 degrees of freedom for k classes of the characteristic).
1.7 Modifying variables

Option 1 (Owner / Renters / Others)
Observed:                                  Expected:
Attribute  Owner  Renters  Others  Total   Attribute  Owner  Renters  Others  Total
Goods      6000   1950     1050    9000    Goods      5670   2241     1089    9000
Bads       300    540      160     1000    Bads       630    249      121     1000
Total      6300   2490     1210    10000   Total      6300   2490     1210    10000

Option 2 (Owner / With parents / Others)
Observed:                                       Expected:
Attribute  Owner  With parents  Others  Total   Attribute  Owner  With parents  Others  Total
Goods      6000   950           2050    9000    Goods      5670   945           2385    9000
Bads       300    100           600     1000    Bads       630    105           265     1000
Total      6300   1050          2650    10000   Total      6300   1050          2650    10000
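The two chi-square statistics can be recomputed from the observed tables; the expected counts use the population odds of 9000 goods to 1000 bads, as above.

```python
def chi_square(goods, bads):
    """2 x k chi-square against expected counts at the population odds (90/10)."""
    chi2 = 0.0
    for g, b in zip(goods, bads):
        column_total = g + b
        eg, eb = 0.9 * column_total, 0.1 * column_total   # expected goods / bads
        chi2 += (g - eg) ** 2 / eg + (b - eb) ** 2 / eb
    return chi2

opt1 = chi_square([6000, 1950, 1050], [300, 540, 160])   # Owner/Renters/Others
opt2 = chi_square([6000, 950, 2050], [300, 100, 600])    # Owner/With parents/Others
```

opt1 comes out at roughly 583.9 and opt2 at roughly 662.9, matching the slide's 583 and 662; the larger statistic marks the better split.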
Deviance encoding (weight of evidence, WOE)
– The lower the weight of evidence (in favour of being a defaulter), the higher are the chances of default in this category.

Age    Num   %        Goods (No Default)  %        Bads (Defaults)  %        WOE
18-22  250   12.50%   194                 10.74%   56               28.87%   -0.98851
23-26  300   15.00%   246                 13.62%   54               27.84%   -0.71466
27-29  450   22.50%   405                 22.43%   45               23.20%   -0.03379
30-35  500   25.00%   475                 26.30%   25               12.89%   0.713427
35-44  350   17.50%   339                 18.77%   11               5.67%    1.197093
44+    150   7.50%    147                 8.14%    3                1.55%    1.660809
Total  2000  100.00%  1806                100.00%  194              100.00%
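The WOE column can be reproduced from the goods/bads counts in the table: per age band, WOE = ln(%Goods / %Bads), with the percentages taken over total goods and total bads respectively.

```python
from math import log

bands = {                      # age band: (goods, bads), from the table
    "18-22": (194, 56), "23-26": (246, 54), "27-29": (405, 45),
    "30-35": (475, 25), "35-44": (339, 11), "44+": (147, 3),
}
total_goods = sum(g for g, _ in bands.values())    # 1806
total_bads = sum(b for _, b in bands.values())     # 194

# WOE per band: log of the ratio of the column percentages
woe = {band: log((g / total_goods) / (b / total_bads))
       for band, (g, b) in bands.items()}
```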
1.10 Information Value
• Information Value (IV) is a measure of predictive power used to:
  – assess the appropriateness of the classing
  – select predictive variables
• IV = Σₖ (%Goodsₖ − %Badsₖ) × WOEₖ

Age    %Goods   %Bads    WOE       (%Goods − %Bads) × WOE
18-22  10.74%   28.87%   -0.98851  0.17915675
23-26  13.62%   27.84%   -0.71466  0.10158085
27-29  22.43%   23.20%   -0.03379  0.00026037
30-35  26.30%   12.89%   0.71343   0.09570358
35-44  18.77%   5.67%    1.19709   0.15682713
44+    8.14%    1.55%    1.66081   0.1094995
Total  100.00%  100.00%            IV = 0.64302817

– Rule of thumb:
  • < 0.02 : unpredictive
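The IV total of about 0.643 can be reproduced from the same goods/bads counts by summing (%Goods − %Bads) × WOE over the age bands:

```python
from math import log

bands = {"18-22": (194, 56), "23-26": (246, 54), "27-29": (405, 45),
         "30-35": (475, 25), "35-44": (339, 11), "44+": (147, 3)}
G = sum(g for g, _ in bands.values())   # 1806 goods in total
B = sum(b for _, b in bands.values())   # 194 bads in total

# IV = sum over bands of (%Goods - %Bads) * log(%Goods / %Bads)
iv = sum((g / G - b / B) * log((g / G) / (b / B)) for g, b in bands.values())
```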