
Section 1: Introduction to data mining
Main objective of this session
Aim:
•Introduce the data mining and multivariate analysis process.
•Discuss the SEMMA and CRISP-DM processes in general.
•Study the Sampling, Exploring and Modifying stages of the SEMMA process.

Learning outcomes
1.Understand the concept of data mining.
2.Identify the different types of data.
3.Understand the SEMMA process.
4.Identify the main tasks associated with the data mining process.
5.Outline the main components of data mining algorithms.
6.Understand the stages of Sampling, Exploring and Modifying in the SEMMA process.
7.Identify the methods to code categorical variables.
8.Understand the concept of information value and how it can be calculated.
1.1 Introduction
– In this competitive world, decision makers have to be more assertive in their decisions. Systems that improve and support the decision process are therefore in high demand.

– Main applications of these techniques:

•Finance: Determine whether a customer is likely to default on their payments.
•Services (telecom, energy): How likely is a customer to switch operators (churn)?
•Marketing – Customer Relationship Management (CRM): Segment and identify potential clients. Ex: how likely is a group of customers to buy a product or service?
•Health: Determine the risk of suffering a particular disease. Ex: women who smoke are five times more likely to suffer breast cancer.
•Pharmaceutical industry: Determine whether a particular drug is an effective treatment for a particular disease.
Data Mining and Multivariate Statistics
– Data mining is the analysis of observational data sets to find relationships and to summarise the data in ways that are both understandable and useful to the owner in their decision process.

– Multivariate statistics is a set of models, together with tests of how well the models fit the data, for the case when the data involves several dimensions or variables.

– Data is facts and measurements collected together about one topic.

– Data sets are represented by an n x p data matrix.
• Each row (n) represents an object on which the measurements are taken.
• Each column (p) represents one of the characteristics (variables, fields) collected on each object.
1.2 Types of data
• Qualitative:
– Data that can be observed but not measured: colours, textures, smells, tastes, appearance, beauty, etc.
– Qualitative → Quality

• Quantitative:
– Data that can be measured: length, height, area, volume, weight, speed, time, etc.
– Quantitative → Quantity
– Nominal: outcomes are different categories with no relation between the categories. Ex: residential status: own home, rent, live with parents.
– Ordinal: still divided into categories, but the categories can be ranked. Ex: sweater size: small, medium, large.
– Cardinal: data which can be directly and precisely measured.
• Discrete – {0,1,2,3,...} – number of children
• Continuous – [0, 200] kg – weight

Source: http://www.regentsprep.org/Regents/math/ALGEBRA/AD1/qualquant.htm
1.2 Components of Data Mining Algorithms
A data mining application consists of five components:
1.Task: what is the data mining application trying to do?
– Visualisation, density estimation, classification, clustering.

2.Structure: what is the model or pattern we are trying to fit to the data?
– Linear regression model, classification tree, hierarchical clustering.

3.Measurement function: how do we judge the quality of our model or pattern?
– Goodness of fit: how well does it describe the existing data (like R²), or
– Prediction: how does it perform on data not yet collected / not used in model development?

4.Optimisation method: finding the best structure/model and the parameters of that model.
– This is usually what one thinks of as the data mining algorithm.

5.Data management technique: how to store/index/retrieve the data needed.
– For small data sets this is not important, but for massive data sets it is essential. Where the data is stored, and how often it needs to be accessed as part of the optimisation method, is critical to the application being feasible.
1.3 Data Mining Tasks

• Exploratory Data Analysis (EDA)
– Explore the data with no clear idea of what is being looked for.
– Visual and interactive:
• p low: use histograms, graphs, pie charts, coxcomb plots (Nightingale).
• p high: use projection onto lower dimensions. Principal component analysis gives the “most interesting” projections.

• Descriptive modelling
– Describe all of the data / the process that generates it.
1.3 Data Mining Tasks (cont.)
• Predictive modelling
– Predict the value of one variable given that one knows the other variables.
– Classification: if the predicted variable is categorical.
– Regression: if the predicted variable is continuous.

• Discovering patterns and rules
– Detect connections between variables in some parts of the data.
– Association rules: which variables occur together (basket analysis).
– Predictive rules: predict the values of an unknown variable.
CRISP-DM AND SEMMA FOR DATA MINING
Cross-Industry Standard Process for Data Mining: CRISP-DM
• A complete view of the business process necessary to create an analytics model.
– Born from the European Strategic Program on Research in Information Technology (ESPRIT).
– Five companies led the project development: SPSS (now IBM), Teradata, Daimler AG, NCR Corporation and OHRA.

• As of 2014, it is the industry standard for model development (KDnuggets, 2014).

• The modelling stages of the process parallel the SEMMA process.
Business Understanding
• The first step is to understand the dynamics of the organization!
– What are the objectives of the model?
– What are the goals of the organization?
– What are the technical requirements and human resources necessary? Do we have them?

• The output of this phase is a preliminary plan to perform the task.
Data Understanding
This phase includes initial variable selection and data cleansing.

• We are aiming to:
– Collect available data from all sources (variable selection).
– Cleanse the data:
• Filter outliers.
• Filter irrelevant variables.
• Delete/replace null values.
– Perform the first attribute filter.

• Goal: To gain the first insights from the data, and to obtain the final data tableau (the input data for the models).

• The next step is the modelling stage, which is better understood through the SEMMA process.
1.4 SEMMA Process

SAMPLE: Acquire an unbiased sample of the data which describes the situation. Define the “target” variables which capture the response of the situation.
EXPLORE: For each variable, get a feel for typical values; detect outliers; examine the inter-relationships between the different variables.
MODIFY: Data adjustment; treatment of outliers; adjust/take functions of the data to put it in its most useful form.
MODEL: Statistical techniques and models undertake the required data mining task on the data.
ASSESS: Determine how well the model fits the data (measurement function), and what confidence you should have in the results obtained.
1.5 Sample Stage
• Multiple data sources:

Primary data sources:
•Collected for the purpose of the research.
•Usually coming from internal sources such as a datamart or data warehouse.

Secondary sources:
•Exploit data already available for other purposes.
•Complementary data, usually coming from external sources.

Final goal: Select and integrate all the data sources to get the maximum possible amount of structured data.

• Factors to consider when making such a choice: cost, timing, appropriateness, representativeness.
1.5 Sample Stage
Types of primary data

– General population
• Questionnaires
• Panel sessions: get a group together to discuss a topic.
• Longitudinal panels: the same group meets over time to see changes in preferences and habits.

– Own customers
• Customer transactions: loyalty cards, customer invoices.
1.5 Sample Stage
Types of secondary data
– Geo-demographic data
• Census of population (every 10 years in the UK)
» Leads to classification by small regions (a postcode defines down to around 12-20 houses).
» Commercial companies mine this to produce classifications.
» MOSAIC, ACORN, FINPIN

– Government surveys
1.5 Sample Stage
Sampling and Data Collection
– Random sample
• Every member of the population has the same chance of being chosen, and the chance of being chosen is independent of who else is chosen.
– Often based on a random number source, i.e. random digits picked between 0 and 9 (RAND in Excel), say 45719.
» Pick with gaps of 4, 5, 7, etc., or pick the one with ID 45719, or pick if the digit is above 7 and not if 7 or below (20%).

– Systematic sample
• Choose data at regular intervals, e.g. every 10th.
– Usually pseudo-random if not really random.
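The two sampling schemes above can be sketched in a few lines of Python. The population of customer IDs and the sample size of 10 are hypothetical, chosen only for illustration:

```python
# A minimal sketch of random and systematic sampling on a hypothetical
# population of customer IDs (1..100).
import random

random.seed(42)  # fixed seed so the sketch is reproducible
population = list(range(1, 101))

# Simple random sample: every member has the same chance of being chosen.
random_sample = random.sample(population, 10)

# Systematic sample: a random start, then every 10th member.
start = random.randrange(10)
systematic_sample = population[start::10]

print(sorted(random_sample))
print(systematic_sample)
```

Note that the systematic sample is only pseudo-random: once the start is fixed, every selection is determined.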
1.5 Sample Stage
Sampling and Data Collection
– Stratified sample
• Split the population into strata, with the % wanted from each stratum agreed (not necessarily its % in the population). A random sample is taken from each stratum (this is the difference from a quota sample).
• Allows for over-representation of some groups (the rich for spending habits; the poor for default analysis).

– Multi-stage sample
• Used to get a sample concentrated in some geographic region, so it is easy to get at.
1.6 Explore Stage
Motivation
– Dirty, noisy data
• For example, Age = -2003

– Inconsistent data
• Does the value “0” mean an actual zero or a missing value?

– Incomplete data
• Income = ?

– Data integration and data merging problems
• Amounts in euros versus amounts in dollars.
1.6 Explore Stage
Descriptive statistics - measures of location

•Measure of location
– Given the set of weekly sales numbers 1,3,3,3,3,4,6,6,7,14, how would you use one number to describe their location, i.e. are they selling in the 10s or the 100s?

•Mean
– Take the average, i.e. (x1 + x2 + … + xn)/n =
• (1+3+3+3+3+4+6+6+7+14)/10 = 5
– It is a measure of the expected value of the corresponding distribution.
– It is affected strongly by outliers.
1.6 Explore Stage
Descriptive statistics - measures of location

•Median
– Order the data in increasing size; the median is the middle value (if there is an odd number of data points) or mid-way between the middle two values (if there is an even number of data points).
– 1,3,3,3,3,4,6,6,7,14: half way between 3 and 4, i.e. 3.5.
– A more typical result than the mean, but not so easy to use.

•Mode
– The mode is the value which occurs most frequently.
– So the four 3s mean 3 is the mode.
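The three measures of location can be checked directly on the sales data using Python's standard library:

```python
# Mean, median and mode of the weekly sales figures used above.
from statistics import mean, median, mode

sales = [1, 3, 3, 3, 3, 4, 6, 6, 7, 14]

print(mean(sales))    # 5   -- pulled upward by the outlier 14
print(median(sales))  # 3.5 -- midway between the middle values 3 and 4
print(mode(sales))    # 3   -- occurs four times
```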
1.6 Explore Stage
Descriptive statistics - measures of spread
Given the set of weekly sales numbers 1,3,3,3,3,4,6,6,7,14, how would you use one number to describe their spread, i.e. are there large variations in what is being sold?
–Range
• Range = maximum value − minimum value = 14 − 1 = 13
• Easy for ungrouped data, hard for grouped data.
• Affected by one or two extreme values.

–Quartile deviation
• Q1 is the value with 25% of observations below it; Q3 is the value with 75% of observations below it. The quartile deviation is (Q3 − Q1)/2.
1.6 Explore Stage
Descriptive statistics - measures of spread

– Variance
• If µ is the mean of the data, then the variance is

  Var = (1/n) ∑_{i=1}^{n} (xᵢ − µ)²

• The units are the squares of the original units.
• This agrees with the definition of variance for probability distributions.
• For the sales data (µ = 5):

  [(5−1)² + 4(5−3)² + (5−4)² + 2(5−6)² + (5−7)² + (5−14)²]/10 = 12

– Sample variance: if we do not know the mean but have estimated it from the sample, then we need to adjust the denominator:

  s² = (1/(n−1)) ∑_{i=1}^{n} (xᵢ − x̄)²
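The two variants of the variance can be reproduced on the sales data:

```python
# Population vs sample variance for the sales data above (mean = 5).
sales = [1, 3, 3, 3, 3, 4, 6, 6, 7, 14]
n = len(sales)
mu = sum(sales) / n  # 5.0

# Divide by n when the mean is known...
pop_var = sum((x - mu) ** 2 for x in sales) / n
# ...and by n-1 when the mean was estimated from the sample itself.
sample_var = sum((x - mu) ** 2 for x in sales) / (n - 1)

print(pop_var)     # 12.0, matching the calculation above
print(sample_var)  # 13.33...
```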
1.6 Explore Stage
Descriptive statistics - measures of spread

– Coefficient of variation
• Coefficient of variation = (standard deviation)/(mean)
• A measure of how dispersed the data is in relation to the mean.

– Coefficient of skewness = 3(mean − median)/(standard deviation)
• A measure of how symmetrical the data is.
1.6 Explore Stage
Descriptive statistics - measures of relationship
•Covariance: a measure of the linear relationship between two variables:
– A positive sign (+) says both go in the same direction.
– A negative sign (−) says they go in opposite directions.

  Cov(X,Y) = E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y)

•The sample covariance is

  Cov(x,y) = ∑_{i=1}^{n} (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

•Example: (X=1, Y=2), (X=2, Y=4), (X=3, Y=6)

  Sample Cov(X,Y) = [(−1)(−2) + (0)(0) + (1)(2)]/2 = 2

  (Dividing by n instead of n − 1 would give 1.33; with the n − 1 formula above the value is 2.)
1.6 Explore Stage
• Correlation is unit invariant:

  Corr(X,Y) = Cov(X,Y) / √(Var(X) Var(Y)),   −1 ≤ Corr(X,Y) ≤ 1

  If Corr(X,Y) = 1 then Y = aX + b (a > 0)
  If Corr(X,Y) = −1 then Y = aX + b (a < 0)
  If Corr(X,Y) = 0 then X and Y are uncorrelated (no linear relationship); independence implies zero correlation, but not conversely.

• Sample correlation:

  Corr(x,y) = ∑_{i=1}^{n} (xᵢ − x̄)(yᵢ − ȳ) / √( ∑_{i=1}^{n} (xᵢ − x̄)² · ∑_{i=1}^{n} (yᵢ − ȳ)² )
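The covariance example above can be extended to the sample correlation, confirming that the three points lie exactly on a line with positive slope:

```python
# Sample covariance and correlation for the points (1,2), (2,4), (3,6):
# a perfect positive linear relationship (Y = 2X).
from math import sqrt

x = [1, 2, 3]
y = [2, 4, 6]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)

cov = sxy / (n - 1)            # sample covariance, n-1 denominator
corr = sxy / sqrt(sxx * syy)   # unit-invariant, always in [-1, 1]

print(cov, corr)  # 2.0 1.0
```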
1.6 Explore Stage
Graphics and charts
– Pie charts – good for nominal data.
  [Pie chart: food choices – pizza, chicken, curry, salad, fish & chips, kebab]
1.6 Explore Stage
Graphics and charts
– Bar charts – good when the data can be split in different ways.
  [Bar chart: monthly spending (Jan-Jun) on Food, Petrol and Alcohol]
1.6 Explore Stage
Graphics and charts
– Histograms
– Frequency tables can be represented graphically as histograms.
  [Histogram: sales frequencies in bands 0-49, 50-99, 100-149, 150-199]
1.6 Explore Stage
Graphics and charts

•Box plot
– A box plot is a visual representation of five numbers:
• Median M: P(X≤M) = 0.50
• First quartile Q1: P(X≤Q1) = 0.25
• Third quartile Q3: P(X≤Q3) = 0.75
• Minimum and maximum
  [Diagram: box from Q1 to Q3 with M inside, whiskers extending to 1.5×IQR, outliers plotted beyond the whiskers]
1.7 Modifying variables
The idea is to pre-process the data in order to:

– Treat or eliminate outliers.

– Handle missing values.

– Resolve possible inconsistencies in the data.

– Capture nonlinearities between the target variable and the independent variables.
1.7 Modifying variables
Missing Values

– Keep, delete or replace.

– Keep
• The fact that a value is missing can be important information!
• Encode the variable in a special way (e.g. as a separate category during coarse classification).

– Delete
• When there are too many missing values, removing the variable or observation may be the best option.
1.7 Modifying variables
Imputation Procedures for Missing Values

– For continuous attributes
• Replace with the median/mean (the median is more robust to outliers).
• Replace with the median/mean of all instances of the same class.

– For ordinal/nominal attributes
• Replace with the modal value (= the most frequent category).
• Replace with the modal value of all instances of the same class.

– Advanced schemes
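The simple imputation rules above can be sketched as follows. The income and residential-status values are hypothetical, with `None` marking a missing entry:

```python
# Median imputation (continuous) and modal imputation (nominal) on
# hypothetical data; None marks a missing value.
from statistics import median, mode

income = [1200, 1500, None, 1800, None, 1350]          # continuous attribute
status = ["own", "rent", None, "own", "own", "rent"]   # nominal attribute

# Continuous: replace missing values with the median of the observed ones.
med = median(v for v in income if v is not None)
income_filled = [med if v is None else v for v in income]

# Nominal: replace missing values with the modal (most frequent) category.
mod = mode(v for v in status if v is not None)
status_filled = [mod if v is None else v for v in status]

print(income_filled)
print(status_filled)
```

Class-conditional imputation would work the same way, computing the median or mode separately within each class.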
1.7 Modifying variables
Outliers

– E.g., due to recording or data entry errors, or noise.

– Types of outliers
• Valid observation: e.g. the salary of the boss, ratio variables.
• Invalid observation: age = -2003

– Outliers can be hidden in one-dimensional views of the data (the multidimensional nature of data).

– Uni-variate outliers versus multivariate outliers.
1.7 Modifying variables
Uni-variate Outlier Detection Methods

– Visually
• histogram based
• box plots

– z-score

  zᵢ = (xᵢ − µ)/σ

– Measures how many standard deviations an observation is away from the mean for a specific variable.
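A sketch of z-score flagging on a hypothetical list of ages containing one invalid entry. Note that with very few observations a single extreme value inflates the standard deviation enough to hide itself (|z| can never exceed (n−1)/√n), so a reasonably sized sample is used here:

```python
# Flag observations more than 3 standard deviations from the mean.
from statistics import mean, stdev

ages = [22, 25, 27, 29, 30, 31, 33, 35, 36, 38,
        40, 42, 45, 47, 50, -2003]  # one invalid entry

mu = mean(ages)
sigma = stdev(ages)
z = [(x - mu) / sigma for x in ages]

outliers = [x for x, zi in zip(ages, z) if abs(zi) > 3]
print(outliers)  # only the invalid age is flagged
```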
1.7 Modifying variables
Treatment of Outliers
– For invalid outliers:
• E.g. age = 300 years.
• Treat as a missing value (keep, delete, replace).

– For valid outliers:
• Truncation based on z-scores:
– Replace all variable values having z-scores of > 3 by the mean + 3 times the standard deviation.
– Replace all variable values having z-scores of < −3 by the mean − 3 times the standard deviation.
• Truncation based on the IQR (more robust than z-scores).
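The z-score truncation rule can be sketched as a small clipping function. The salary figures (in thousands) are hypothetical, with one valid but extreme "boss salary":

```python
# Truncate valid outliers at mean +/- 3 standard deviations.
from statistics import mean, stdev

def truncate(values, k=3.0):
    """Clip each value to within k standard deviations of the mean."""
    mu, sigma = mean(values), stdev(values)
    lo, hi = mu - k * sigma, mu + k * sigma
    return [min(max(v, lo), hi) for v in values]

salaries = list(range(30, 45)) + [2000]  # thousands; last entry is the boss
capped = truncate(salaries)
print(capped[-1])  # the extreme salary is pulled back to mean + 3*sd
```

An IQR-based variant would clip at Q1 − 1.5·IQR and Q3 + 1.5·IQR instead, which is more robust because the quartiles themselves are not distorted by the outlier.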
1.7 Modifying variables
Data transformation

•Functions of variables
– They are used to treat nonlinearities in the relationship between the target variable and the independent variables. A transformation is used when there is clear evidence of such a relationship (at least graphically).

– Polynomials – so related to X², etc.

– Log of a variable: the continuous-time version of the rate of growth:

  Share price is S(t)
  Return = (S(t+1) − S(t))/S(t)
  S(t) = e^{rt} S(0) ⇔ log S(t) = rt + log S(0)

  [Charts: monthly closing prices, 1985-2008, raw (Close) and log-transformed (logClose)]
1.7 Modifying variables
Data transformation

•Coarse classification
– The transformation is used to treat nonlinearities in the relationship between the target variable and the independent variables when there is no clear evidence that this relationship can be proxied by a well-known analytical function.

Motivation
• Group the values of categorical variables for robust analysis.
• Capture non-linear effects of continuous variables.
1.7 Modifying variables
Data transformation

•Coarse classification
  [Chart: % default as a function of Age]

Lyn Thomas, A survey of credit and behavioural scoring: forecasting financial risk of lending to consumers, International Journal of Forecasting, 16, 149-172, 2000.
1.7 Modifying variables
Data transformation

•Coarse classification
  [Chart: % default as a function of Age, with the age axis split into bands G1, ..., G5]
– We replace AGE with the categorical variables G1, ..., G5.
1.7 Modifying variables
• Coarse classification
– The chi-squared method helps to decide which is the best way to group.
• Consider the following example (taken from the book Credit Scoring and Its Applications, by Thomas, Edelman and Crook, 2002).
• Goods are the people who never defaulted in the last year, and Bads otherwise.

Attribute  Owner  Rent Unfurnished  Rent Furnished  With parents  Other  No answer  Total
Goods      6000   1600              350             950           90     10         9000
Bads       300    400               140             100           50     10         1000

• Suppose we want three categories. Should we take
– Option 1: owners, renters, and others, or
– Option 2: owners, those living with parents, and others? (continued...)
1.7 Modifying variables
The Chi-squared Method
– Compare the observed frequencies with the theoretical frequencies assuming equal odds, using a chi-squared test statistic.

• Example: The theoretical number of good owners is (6000+300) × 9000/10000 = 5670.
• Likewise, the theoretical number of bad renters, given that the odds are the same as in the whole population, is (1600+400+350+140) × 1000/10000 = 249.

– Option 1:
  χ² = (6000−5670)²/5670 + (300−630)²/630 + (1950−2241)²/2241 + (540−249)²/249 + (1050−1089)²/1089 + (160−121)²/121 ≈ 583.9

– Option 2:
  χ² = (6000−5670)²/5670 + (300−630)²/630 + (950−945)²/945 + (100−105)²/105 + (2050−2385)²/2385 + (600−265)²/265 ≈ 662.9

– The higher the test statistic, the better the split (formally, compare with a chi-squared distribution with k−1 degrees of freedom for k classes of the characteristic).
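The two chi-squared statistics above can be reproduced with a short function that builds the equal-odds expected counts from the grouped Goods/Bads totals:

```python
# Chi-squared comparison of the two coarse-classification options,
# using the Goods/Bads counts from the residential-status example.
def chi_squared(goods, bads):
    """Chi-squared statistic against equal-odds expected counts."""
    total_goods, total_bads = sum(goods), sum(bads)
    total = total_goods + total_bads
    stat = 0.0
    for g, b in zip(goods, bads):
        col = g + b
        exp_g = col * total_goods / total  # e.g. 6300 * 9000/10000 = 5670
        exp_b = col * total_bads / total
        stat += (g - exp_g) ** 2 / exp_g + (b - exp_b) ** 2 / exp_b
    return stat

# Option 1: owners / renters / others
opt1 = chi_squared([6000, 1950, 1050], [300, 540, 160])
# Option 2: owners / with parents / others
opt2 = chi_squared([6000, 950, 2050], [300, 100, 600])

print(round(opt1, 1), round(opt2, 1))  # 583.9 662.9 -- option 2 is the better split
```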
1.7 Modifying variables

Attribute  Owner  Rent Unfurnished  Rent Furnished  With parents  Other  No answer  Total
Goods      6000   1600              350             950           90     10         9000
Bads       300    400               140             100           50     10         1000
Total      6300   2000              490             1050          140    20         10000

Option 1 (observed)
Attribute  Owner  Renters  Others  Total
Goods      6000   1950     1050    9000
Bads       300    540      160     1000
Total      6300   2490     1210    10000

Option 1 (theoretical distribution)
Attribute  Owner  Renters  Others  Total
Goods      5670   2241     1089    9000
Bads       630    249      121     1000
Total      6300   2490     1210    10000

Quadratic differences (Goods): 19.2   37.8   1.4   (sum 58.4)
Quadratic differences (Bads):  172.9  340.1  12.6  (sum 525.5)
Chi-square: 583.9, degrees of freedom: 2, p-value: 1.61E-127

Option 2 (observed)
Attribute  Owner  With parents  Others  Total
Goods      6000   950           2050    9000
Bads       300    100           600     1000
Total      6300   1050          2650    10000

Option 2 (theoretical distribution)
Attribute  Owner  With parents  Others  Total
Goods      5670   945           2385    9000
Bads       630    105           265     1000
Total      6300   1050          2650    10000

Quadratic differences (Goods): 19.2   0.0  47.1   (sum 66.3)
Quadratic differences (Bads):  172.9  0.2  423.5  (sum 596.6)
Chi-square: 662.9, degrees of freedom: 2, p-value: 1.15E-144
1.8 Coding categorical variables
Dummy encoding (also called reference cell coding):

                     Input 1  Input 2  Input 3
Purpose=car          1        0        0
Purpose=real estate  0        1        0
Purpose=other        0        0        1

Deviation encoding:

                     Input 1  Input 2  Input 3
Purpose=car          1        0        0
Purpose=real estate  0        1        0
Purpose=other        -1       -1       -1

Warning: different codings produce different model coefficients.
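The two coding schemes in the tables above can be sketched as small helper functions (the function names are illustrative, not from any particular library):

```python
# Dummy vs deviation encoding for the Purpose variable above.
def dummy_encode(value, categories):
    """One indicator column per category, as in the first table."""
    return [1 if value == c else 0 for c in categories]

def deviation_encode(value, categories):
    """Like dummy coding, but the last category is coded as all -1s."""
    if value == categories[-1]:
        return [-1] * len(categories)
    return [1 if value == c else 0 for c in categories]

cats = ["car", "real estate", "other"]
print(dummy_encode("car", cats))        # [1, 0, 0]
print(deviation_encode("other", cats))  # [-1, -1, -1]
```

Either scheme carries the same information, which is why the fitted coefficients differ between codings while the model's predictions do not.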
1.9 Weight of evidence
• Consider the following analysis of Age and its relationship with the “Default” event.

– The weight of evidence (WOE) measures how each age group is related to the default event:

  WOE_k = ln( %Goods_k / %Bads_k )

– The lower the weight of evidence (in favour of being a defaulter), the higher the chances of default in this category.

Age    Num   %        Goods (No Default)  %        Bads (Defaults)  %        WOE
18-22  250   12.50%   194                 10.74%   56               28.87%   -0.98851
23-26  300   15.00%   246                 13.62%   54               27.84%   -0.71466
27-29  450   22.50%   405                 22.43%   45               23.20%   -0.03379
30-35  500   25.00%   475                 26.30%   25               12.89%   0.713427
35-44  350   17.50%   339                 18.77%   11               5.67%    1.197093
44+    150   7.50%    147                 8.14%    3                1.55%    1.660809
Total  2000  100.00%  1806                100.00%  194              100.00%
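The WOE column can be recomputed directly from the Goods/Bads counts in the table:

```python
# Weight of evidence per age band, from the counts in the table above.
from math import log

# (goods, bads) per band: 18-22, 23-26, 27-29, 30-35, 35-44, 44+
bands = [(194, 56), (246, 54), (405, 45), (475, 25), (339, 11), (147, 3)]
total_goods = sum(g for g, _ in bands)  # 1806
total_bads = sum(b for _, b in bands)   # 194

woe = [log((g / total_goods) / (b / total_bads)) for g, b in bands]
print([round(w, 4) for w in woe])  # ~ -0.9885 for 18-22 up to ~ 1.6608 for 44+
```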
1.10 Information Value
• Information Value (IV) is a measure of predictive power used to:
– assess the appropriateness of the classing
– select predictive variables

  IV = ∑_k (%Goods_k − %Bads_k) × WOE_k

Age    %Goods   %Bads    WOE       Term
18-22  10.74%   28.87%   -0.98851  0.17915675
23-26  13.62%   27.84%   -0.71466  0.10158085
27-29  22.43%   23.20%   -0.03379  0.00026037
30-35  26.30%   12.89%   0.71343   0.09570358
35-44  18.77%   5.67%    1.19709   0.15682713
44+    8.14%    1.55%    1.66081   0.1094995
Total  100.00%  100.00%  IV = 0.64302817

– Rule of thumb:
• < 0.02: unpredictive
• 0.02 – 0.1: weak
• 0.1 – 0.3: medium
• 0.3+: strong
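Summing the per-band terms reproduces the IV of the age variable from the raw counts:

```python
# Information value of the Age classing, from the WOE table's counts.
from math import log

# (goods, bads) per band: 18-22, 23-26, 27-29, 30-35, 35-44, 44+
bands = [(194, 56), (246, 54), (405, 45), (475, 25), (339, 11), (147, 3)]
total_goods = sum(g for g, _ in bands)  # 1806
total_bads = sum(b for _, b in bands)   # 194

iv = 0.0
for g, b in bands:
    pg, pb = g / total_goods, b / total_bads
    iv += (pg - pb) * log(pg / pb)

print(round(iv, 3))  # 0.643 -- "strong" by the rule of thumb
```

Note that each term (pg − pb)·ln(pg/pb) is non-negative, so IV can only grow as bands separate Goods from Bads more sharply.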
