Summer Course: Data Mining: Regression Analysis
Presenter: Georgi Nalbantov
August 2009
Structure
Regression analysis: definition and examples
Classical Linear Regression
LASSO and Ridge Regression (linear and nonlinear)
Nonparametric (local) regression estimation:
kNN for regression, Decision trees, Smoothers
Support Vector Regression (linear and nonlinear)
Variable/feature selection (AIC, BIC, R^2-adjusted)
Feature Selection, Dimensionality Reduction, and
Clustering in the KDD Process
U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
Common Data Mining tasks
Clustering: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA
Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS
Regression: Classical Linear Regression, Ridge Regression, NN, CART
[Figure: three scatter plots over features X1 and X2, illustrating the clustering, classification (classes + and -), and regression tasks]
Linear regression analysis: examples
The Regression task
Given data on $n$ explanatory variables and 1 explained variable, where the explained variable can take real values in $\mathbb{R}$, find a function that gives the best fit:

Given: $(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^n \times \mathbb{R}$

Find: $f : \mathbb{R}^n \to \mathbb{R}$

Best function = the one for which the expected error on unseen data $(x_{m+1}, y_{m+1}), \ldots, (x_{m+k}, y_{m+k})$ is minimal.
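A minimal Python sketch of this task on synthetic data (the data-generating model, the sizes m = 80, k = 20, n = 3, and the coefficients are assumptions for illustration): fit f on the m seen pairs, then estimate the expected error on the k unseen pairs.

```python
# Sketch of the regression task on synthetic data (assumed for illustration):
# fit f on m seen pairs, then estimate the expected error on k unseen pairs.
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 80, 20, 3                          # seen pairs, unseen pairs, features
X = rng.normal(size=(m + k, n))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=m + k)

X_seen, y_seen = X[:m], y[:m]                # (x_1, y_1), ..., (x_m, y_m)
X_new, y_new = X[m:], y[m:]                  # (x_{m+1}, y_{m+1}), ...

# Fit f (here a linear least-squares fit) on the seen data only
beta, *_ = np.linalg.lstsq(np.c_[np.ones(m), X_seen], y_seen, rcond=None)

# Estimate the expected error on unseen data
pred = np.c_[np.ones(k), X_new] @ beta
print("MSE on unseen data:", np.mean((y_new - pred) ** 2))
```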
Classical Linear Regression (OLS)
Explanatory and Response Variables are Numeric
The relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (a straight line)
Model:

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma)$$

$\beta_1 > 0$: positive association
$\beta_1 < 0$: negative association
$\beta_1 = 0$: no association
Classical Linear Regression (OLS)
Task: minimize the sum of squared errors:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right)^2, \qquad \hat{y} = \hat\beta_0 + \hat\beta_1 x$$

$\beta_0$: mean response when $x = 0$ (y-intercept)
$\beta_1$: change in mean response when $x$ increases by 1 unit (slope)
$\beta_0$ and $\beta_1$ are unknown population parameters
$\beta_0 + \beta_1 x$: mean response when the explanatory variable takes on the value $x$
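As a concrete illustration, a short Python sketch of this minimization; the closed-form solution $\hat\beta_1 = S_{xy}/S_{xx}$, $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$ is the standard one, and the toy data are an assumption.

```python
# Closed-form OLS for simple linear regression: the SSE minimizer is
# beta1_hat = S_xy / S_xx and beta0_hat = y_bar - beta1_hat * x_bar.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # toy data (assumed)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

S_xx = np.sum((x - x.mean()) ** 2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))

beta1 = S_xy / S_xx                          # slope estimate
beta0 = y.mean() - beta1 * x.mean()          # intercept estimate
y_hat = beta0 + beta1 * x
SSE = np.sum((y - y_hat) ** 2)
print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}, SSE = {SSE:.4f}")
```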
Classical Linear Regression (OLS)
Estimated coefficients (SPSS output; dependent variable: SCORE):

             Unstandardized        Standardized
Model        B        Std. Error   Beta           t        Sig.
(Constant)   89.124   7.048                       12.646   .000
LSD_CONC     -9.009   1.503        -.937          -5.994   .002

Fitted equation: $\hat{y} = \hat\beta_0 + \hat\beta_1 x$

Parameter: slope in the population model ($\beta_1$)
Estimator: least squares estimate $\hat\beta_1$
Estimated standard error: $\hat\sigma_{\hat\beta_1} = s / \sqrt{S_{xx}}$, where

$$s^2 = \frac{\sum (y - \hat{y})^2}{n - 2} = \frac{SSE}{n - 2}, \qquad S_{xx} = \sum (x - \bar{x})^2$$

Methods of making inference regarding the population:
Hypothesis tests (2-sided or 1-sided)
Confidence intervals
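A sketch of this inference in Python (SciPy assumed available; the toy data repeat the earlier sketch):

```python
# Inference on the slope: s^2 = SSE/(n-2), se(beta1_hat) = s/sqrt(S_xx),
# t = beta1_hat / se(beta1_hat), tested against H0: beta1 = 0.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # same toy data as above (assumed)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

S_xx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
beta0 = y.mean() - beta1 * x.mean()
SSE = np.sum((y - (beta0 + beta1 * x)) ** 2)

s2 = SSE / (n - 2)                           # estimate of sigma^2
se_beta1 = np.sqrt(s2 / S_xx)                # standard error of the slope
t_stat = beta1 / se_beta1                    # H0: beta1 = 0
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)        # 2-sided test
ci = beta1 + np.array([-1, 1]) * stats.t.ppf(0.975, n - 2) * se_beta1
print(f"t = {t_stat:.2f}, p = {p_val:.4f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```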
Classical Linear Regression (OLS)
Coefficient of determination ($r^2$): the proportion of variation in $y$ explained by the regression on $x$:

$$r^2 = \frac{S_{yy} - SSE}{S_{yy}}, \qquad 0 \le r^2 \le 1,$$

where $S_{yy} = \sum (y - \bar{y})^2$ and $SSE = \sum (y - \hat{y})^2$.
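Computed on the same assumed toy data:

```python
# r^2 = (S_yy - SSE) / S_yy: share of the variation in y explained by x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # toy data (assumed)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

S_yy = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)
r2 = (S_yy - SSE) / S_yy                     # always between 0 and 1
print(f"r^2 = {r2:.4f}")
```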
Classical Linear Regression (OLS):
Multiple regression
Numeric Response variable (y)
p Numeric predictor variables
Model:

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$$

Partial regression coefficients: $\beta_i$ is the effect (on the mean response) of increasing the $i$-th predictor variable by 1 unit, holding all other predictors constant.
Classical Linear Regression (OLS):
Ordinary Least Squares estimation
Population model for the mean response:

$$E(Y \mid x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

Least squares fitted (predicted) equation, minimizing SSE:

$$\hat{Y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p, \qquad SSE = \sum \left( Y - \hat{Y} \right)^2$$
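A brief scikit-learn sketch of a multiple-regression fit on synthetic data (the data and coefficients are assumptions):

```python
# Multiple regression: fit Y_hat = b0 + b1*x1 + ... + bp*xp by least squares.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                # p = 3 predictors
y = 2.0 + X @ np.array([1.0, -0.5, 3.0]) + rng.normal(scale=0.2, size=100)

ols = LinearRegression().fit(X, y)
print("intercept:", ols.intercept_)          # beta0_hat
print("partial coefficients:", ols.coef_)    # beta_i, other predictors held fixed
print("SSE:", np.sum((y - ols.predict(X)) ** 2))
```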
Classical Linear Regression (OLS):
Ordinary Least Squares estimation
Model:

$$\hat{Y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p$$

OLS estimation:

$$\min \; SSE = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$$

LASSO estimation:

$$\min \; \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Ridge regression estimation:

$$\min \; \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
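A sketch of all three estimators with scikit-learn; its `alpha` parameter plays the role of $\lambda$ (and its LASSO objective rescales the SSE term by $1/(2n)$), and the data and penalty values below are assumptions:

```python
# Ridge adds lambda * sum(beta_j^2), LASSO adds lambda * sum(|beta_j|) to SSE.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("LASSO", Lasso(alpha=0.1))]:
    model.fit(X, y)
    print(f"{name:6s} coefficients: {np.round(model.coef_, 3)}")
# Ridge shrinks all coefficients; LASSO can set some exactly to zero.
```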
LASSO and Ridge estimation of model coefficients
[Figure: coefficient paths for LASSO and Ridge as a function of sum(|beta|)]
Nonparametric (local) regression estimation:
k-NN, Decision trees, smoothers
How to Choose k or h?
When k or h is small, single instances matter; bias is small, variance is
large (undersmoothing): High complexity
As k or h increases, we average over more instances and variance
decreases but bias increases (oversmoothing): Low complexity
Cross-validation is used to fine-tune k or h, as in the sketch below.
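A cross-validation sketch for choosing k with scikit-learn (the synthetic data are an assumption):

```python
# Choosing k by cross-validation: small k = low bias / high variance,
# large k = smoother fit / higher bias.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

for k in (1, 5, 15, 50):
    cv_mse = -cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                              cv=5, scoring="neg_mean_squared_error").mean()
    print(f"k = {k:3d}: 5-fold CV MSE = {cv_mse:.4f}")
```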
Linear Support Vector Regression
[Figure: three fits of Expenditures vs. Age: a "suspiciously smart" case (overfitting), the compromise case found by SVR (good generalisation), and a "lazy" case (underfitting)]

The thinner the tube, the more complex the model.

[Figure: tubes of biggest, small, and middle-sized area around the same Expenditures vs. Age data; the points on the tube boundary are the support vectors]
Nonlinear Support Vector Regression
Map the data into a higher-dimensional space:

[Figure: Expenditures vs. Age data before and after the mapping to a higher-dimensional feature space]
Nonlinear Support Vector Regression: Technicalities
The SVR function: $f(x) = w^\top \phi(x) + b$

To find the unknown parameters of the SVR function, solve (the standard $\varepsilon$-insensitive formulation):

$$\min_{w,\, b,\, \xi,\, \xi^*} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)$$

Subject to:

$$y_i - (w^\top \phi(x_i) + b) \le \varepsilon + \xi_i, \qquad (w^\top \phi(x_i) + b) - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0$$

How to choose $C$, $\varepsilon$, and the kernel? RBF kernel: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$. Find $C$, $\varepsilon$, and $\gamma$ from a cross-validation procedure.
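A minimal scikit-learn sketch of $\varepsilon$-insensitive SVR with an RBF kernel; the data and the hyperparameter values $C = 10$, $\varepsilon = 0.15$, $\gamma = 0.5$ are assumptions:

```python
# epsilon-insensitive SVR with an RBF kernel; C, epsilon, gamma are the
# hyperparameters referred to on the slide.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.15, gamma=0.5)
svr.fit(X, y)
print("number of support vectors:", len(svr.support_))
print("prediction at x = 5:", svr.predict([[5.0]]))
```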
SVR Technicalities: Model Selection
Do 5-fold cross-validation to find $C$ and $\gamma$ for several fixed values of $\varepsilon$.
[Figure: contour and surface plots of the 5-fold CV_MSE over $C$ and $\gamma$ for $\varepsilon = 0.15$]
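A sketch of this model-selection step with scikit-learn's GridSearchCV; the grid values and data are assumptions, with $\varepsilon$ held at 0.15 as on the slide:

```python
# 5-fold cross-validation over C and gamma for a fixed epsilon = 0.15,
# mirroring the CV_MSE surface shown on the slide.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

grid = GridSearchCV(SVR(kernel="rbf", epsilon=0.15),
                    param_grid={"C": [0.1, 1, 5, 10, 15],
                                "gamma": [0.005, 0.01, 0.02, 0.1]},
                    cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)
print("best (C, gamma):", grid.best_params_)
print("best CV MSE:", -grid.best_score_)
```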
SVR Study: Model Training, Selection and Prediction

[Figure: CVMSE (IR*, HR*, CR*); true returns (red) and raw predictions (blue)]
SVR: Individual Effects

[Figure: four estimated individual-effect curves on SP500: credit spread, VIX, VIX futures, and the 3-month Treasury bill]
SVR Technicalities: SVR vs. OLS

Performance on the test set (Holiday Data, Expenditures per Observation):

SVR ($\varepsilon = 0.15$): MSE = 0.04
OLS: MSE = 0.23

[Figure: test-set observations and fitted Expenditures for SVR and for the OLS solution]
Technical Note:
Number of Training Errors vs. Model Complexity
[Figure: number of training errors and test errors as functions of model complexity, with functions ordered in increasing complexity; the best trade-off lies between the minimum number of training errors and low model complexity]

MATLAB video here
Variable selection for regression
Akaike Information Criterion (AIC). Final prediction error (in the standard form for a linear model with Gaussian errors and $d$ fitted parameters):

$$AIC = n \ln\!\left(\frac{SSE}{n}\right) + 2d$$

The model with the smallest AIC is preferred.
Variable selection for regression
Bayesian Information Criterion (BIC), also known as the Schwarz criterion. Final prediction error (same notation as for AIC):

$$BIC = n \ln\!\left(\frac{SSE}{n}\right) + d \ln n$$

Since $\ln n > 2$ once $n \ge 8$, BIC penalizes model size more heavily and tends to choose simpler models than AIC.
Variable selection for regression
$R^2$-adjusted (penalizes $R^2$ for the number of predictors $p$):

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$
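A sketch computing all three criteria for one fitted linear model (standard formulas; the synthetic data are an assumption):

```python
# AIC, BIC, and adjusted R^2 for a Gaussian linear model with d fitted
# parameters (p coefficients plus the intercept).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

model = LinearRegression().fit(X, y)
SSE = np.sum((y - model.predict(X)) ** 2)
d = p + 1                                    # coefficients + intercept

aic = n * np.log(SSE / n) + 2 * d
bic = n * np.log(SSE / n) + d * np.log(n)    # log(n) > 2 for n >= 8: stiffer penalty
r2 = 1 - SSE / np.sum((y - y.mean()) ** 2)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"AIC = {aic:.1f}, BIC = {bic:.1f}, adjusted R^2 = {r2_adj:.4f}")
```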
Conclusion / Summary / References

Summary:
Classical Linear Regression
LASSO and Ridge Regression (linear and nonlinear)
Nonparametric (local) regression estimation: kNN for regression, Decision trees, Smoothers
Support Vector Regression (linear and nonlinear)
Variable/feature selection (AIC, BIC, R^2-adjusted)

References:
Alpaydin, 2004
Bishop, 2006
Hastie et al., 2001
Smola and Schoelkopf, 2003
http://www-stat.stanford.edu/~tibs/lasso.html
(any introductory statistical/econometric book)