Summer Course: Data Mining: Regression Analysis
Presenter: Georgi Nalbantov
August 2009
Structure
Regression analysis: definition and examples
Classical Linear Regression
LASSO and Ridge Regression (linear and nonlinear)
Nonparametric (local) regression estimation:
kNN for regression, Decision trees, Smoothers
Support Vector Regression (linear and nonlinear)
Variable/feature selection (AIC, BIC, R^2-adjusted)
Feature Selection, Dimensionality Reduction, and
Clustering in the KDD Process
U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
Common Data Mining tasks
Clustering: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA
Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS
Regression: Classical Linear Regression, Ridge Regression, NN, CART
[Figure: three scatter plots over features X1 and X2, illustrating the clustering, classification (classes + and -), and regression tasks]
Linear regression analysis: examples
The Regression task
Given data on $n$ explanatory variables and 1 explained variable, where the explained variable can take real values in $\mathbb{R}$, find a function that gives the best fit:

Given: $(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^n \times \mathbb{R}$

Find: $f : \mathbb{R}^n \to \mathbb{R}$

Best function = the one for which the expected error on unseen data $(x_{m+1}, y_{m+1}), \ldots, (x_{m+k}, y_{m+k})$ is minimal.
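A minimal Python sketch of this task on synthetic data (the data-generating model, the sizes m = 80, k = 20, n = 3, and the coefficients are assumptions for illustration): fit f on the m seen pairs, then estimate the expected error on the k unseen pairs.

```python
# Sketch of the regression task on synthetic data (assumed for illustration):
# fit f on m seen pairs, then estimate the expected error on k unseen pairs.
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 80, 20, 3                          # seen pairs, unseen pairs, features
X = rng.normal(size=(m + k, n))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=m + k)

X_seen, y_seen = X[:m], y[:m]                # (x_1, y_1), ..., (x_m, y_m)
X_new, y_new = X[m:], y[m:]                  # (x_{m+1}, y_{m+1}), ...

# Fit f (here a linear least-squares fit) on the seen data only
beta, *_ = np.linalg.lstsq(np.c_[np.ones(m), X_seen], y_seen, rcond=None)

# Estimate the expected error on unseen data
pred = np.c_[np.ones(k), X_new] @ beta
print("MSE on unseen data:", np.mean((y_new - pred) ** 2))
```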
Classical Linear Regression (OLS)
Explanatory and Response Variables are Numeric
The relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (a straight line)
Model:

$$Y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma)$$

$\beta_1 > 0$: positive association
$\beta_1 < 0$: negative association
$\beta_1 = 0$: no association
Classical Linear Regression (OLS)
Task: minimize the sum of squared errors:

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left( y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right)^2, \qquad \hat{y} = \hat\beta_0 + \hat\beta_1 x$$

$\beta_0$: mean response when $x = 0$ (y-intercept)
$\beta_1$: change in mean response when $x$ increases by 1 unit (slope)
$\beta_0$ and $\beta_1$ are unknown population parameters
$\beta_0 + \beta_1 x$: mean response when the explanatory variable takes on the value $x$
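As a concrete illustration, a short Python sketch of this minimization; the closed-form solution $\hat\beta_1 = S_{xy}/S_{xx}$, $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$ is the standard one, and the toy data are an assumption.

```python
# Closed-form OLS for simple linear regression: the SSE minimizer is
# beta1_hat = S_xy / S_xx and beta0_hat = y_bar - beta1_hat * x_bar.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # toy data (assumed)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

S_xx = np.sum((x - x.mean()) ** 2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))

beta1 = S_xy / S_xx                          # slope estimate
beta0 = y.mean() - beta1 * x.mean()          # intercept estimate
y_hat = beta0 + beta1 * x
SSE = np.sum((y - y_hat) ** 2)
print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}, SSE = {SSE:.4f}")
```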
Classical Linear Regression (OLS)
Estimated coefficients (SPSS output; dependent variable: SCORE):

             Unstandardized        Standardized
Model        B        Std. Error   Beta           t        Sig.
(Constant)   89.124   7.048                       12.646   .000
LSD_CONC     -9.009   1.503        -.937          -5.994   .002

Fitted equation: $\hat{y} = \hat\beta_0 + \hat\beta_1 x$

Parameter: slope in the population model ($\beta_1$)
Estimator: least squares estimate $\hat\beta_1$
Estimated standard error: $\hat\sigma_{\hat\beta_1} = s / \sqrt{S_{xx}}$, where

$$s^2 = \frac{\sum (y - \hat{y})^2}{n - 2} = \frac{SSE}{n - 2}, \qquad S_{xx} = \sum (x - \bar{x})^2$$

Methods of making inference regarding the population:
Hypothesis tests (2-sided or 1-sided)
Confidence intervals
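A sketch of this inference in Python (SciPy assumed available; the toy data repeat the earlier sketch):

```python
# Inference on the slope: s^2 = SSE/(n-2), se(beta1_hat) = s/sqrt(S_xx),
# t = beta1_hat / se(beta1_hat), tested against H0: beta1 = 0.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # same toy data as above (assumed)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

S_xx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
beta0 = y.mean() - beta1 * x.mean()
SSE = np.sum((y - (beta0 + beta1 * x)) ** 2)

s2 = SSE / (n - 2)                           # estimate of sigma^2
se_beta1 = np.sqrt(s2 / S_xx)                # standard error of the slope
t_stat = beta1 / se_beta1                    # H0: beta1 = 0
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)        # 2-sided test
ci = beta1 + np.array([-1, 1]) * stats.t.ppf(0.975, n - 2) * se_beta1
print(f"t = {t_stat:.2f}, p = {p_val:.4f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```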
Classical Linear Regression (OLS)
Coefficient of determination ($r^2$): the proportion of variation in $y$ explained by the regression on $x$:

$$r^2 = \frac{S_{yy} - SSE}{S_{yy}}, \qquad 0 \le r^2 \le 1,$$

where $S_{yy} = \sum (y - \bar{y})^2$ and $SSE = \sum (y - \hat{y})^2$.
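Computed on the same assumed toy data:

```python
# r^2 = (S_yy - SSE) / S_yy: share of the variation in y explained by x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # toy data (assumed)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

S_yy = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)
r2 = (S_yy - SSE) / S_yy                     # always between 0 and 1
print(f"r^2 = {r2:.4f}")
```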
Classical Linear Regression (OLS):
Multiple regression
Numeric Response variable (y)
p Numeric predictor variables
Model:

$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$$

Partial regression coefficients: $\beta_i$ is the effect (on the mean response) of increasing the $i$-th predictor variable by 1 unit, holding all other predictors constant.
Classical Linear Regression (OLS):
Ordinary Least Squares estimation
Population model for the mean response:

$$E(Y \mid x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

Least squares fitted (predicted) equation, minimizing SSE:

$$\hat{Y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p, \qquad SSE = \sum \left( Y - \hat{Y} \right)^2$$
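A brief scikit-learn sketch of a multiple-regression fit on synthetic data (the data and coefficients are assumptions):

```python
# Multiple regression: fit Y_hat = b0 + b1*x1 + ... + bp*xp by least squares.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))                # p = 3 predictors
y = 2.0 + X @ np.array([1.0, -0.5, 3.0]) + rng.normal(scale=0.2, size=100)

ols = LinearRegression().fit(X, y)
print("intercept:", ols.intercept_)          # beta0_hat
print("partial coefficients:", ols.coef_)    # beta_i, other predictors held fixed
print("SSE:", np.sum((y - ols.predict(X)) ** 2))
```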
Classical Linear Regression (OLS):
Ordinary Least Squares estimation
Model:

$$\hat{Y} = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p$$

OLS estimation:

$$\min \; SSE = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$$

LASSO estimation:

$$\min \; \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

Ridge regression estimation:

$$\min \; \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$
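A sketch of all three estimators with scikit-learn; its `alpha` parameter plays the role of $\lambda$ (and its LASSO objective rescales the SSE term by $1/(2n)$), and the data and penalty values below are assumptions:

```python
# Ridge adds lambda * sum(beta_j^2), LASSO adds lambda * sum(|beta_j|) to SSE.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("LASSO", Lasso(alpha=0.1))]:
    model.fit(X, y)
    print(f"{name:6s} coefficients: {np.round(model.coef_, 3)}")
# Ridge shrinks all coefficients; LASSO can set some exactly to zero.
```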
LASSO and Ridge estimation of model coefficients
[Figure: coefficient paths for LASSO and Ridge as a function of sum(|beta|)]
Nonparametric (local) regression estimation:
k-NN, Decision trees, smoothers
How to Choose k or h?
When k or h is small, single instances matter; bias is small, variance is
large (undersmoothing): High complexity
As k or h increases, we average over more instances and variance
decreases but bias increases (oversmoothing): Low complexity
Cross-validation is used to fine-tune k or h, as in the sketch below.
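A cross-validation sketch for choosing k with scikit-learn (the synthetic data are an assumption):

```python
# Choosing k by cross-validation: small k = low bias / high variance,
# large k = smoother fit / higher bias.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

for k in (1, 5, 15, 50):
    cv_mse = -cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                              cv=5, scoring="neg_mean_squared_error").mean()
    print(f"k = {k:3d}: 5-fold CV MSE = {cv_mse:.4f}")
```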
Linear Support Vector Regression
[Figure: three fits of Expenditures vs. Age: a "suspiciously smart" case (overfitting), the compromise case found by SVR (good generalisation), and a "lazy" case (underfitting)]

The thinner the tube, the more complex the model.

[Figure: tubes of biggest, small, and middle-sized area around the same Expenditures vs. Age data; the points on the tube boundary are the support vectors]
Nonlinear Support Vector Regression
Map the data into a higher-dimensional space:

[Figure: Expenditures vs. Age data before and after the mapping to a higher-dimensional feature space]
Nonlinear Support Vector Regression: Technicalities
The SVR function: $f(x) = w^\top \phi(x) + b$

To find the unknown parameters of the SVR function, solve (the standard $\varepsilon$-insensitive formulation):

$$\min_{w,\, b,\, \xi,\, \xi^*} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*)$$

Subject to:

$$y_i - (w^\top \phi(x_i) + b) \le \varepsilon + \xi_i, \qquad (w^\top \phi(x_i) + b) - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0$$

How to choose $C$, $\varepsilon$, and the kernel? RBF kernel: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$. Find $C$, $\varepsilon$, and $\gamma$ from a cross-validation procedure.
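A minimal scikit-learn sketch of $\varepsilon$-insensitive SVR with an RBF kernel; the data and the hyperparameter values $C = 10$, $\varepsilon = 0.15$, $\gamma = 0.5$ are assumptions:

```python
# epsilon-insensitive SVR with an RBF kernel; C, epsilon, gamma are the
# hyperparameters referred to on the slide.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.15, gamma=0.5)
svr.fit(X, y)
print("number of support vectors:", len(svr.support_))
print("prediction at x = 5:", svr.predict([[5.0]]))
```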
SVR Technicalities: Model Selection
Do 5-fold cross-validation to find $C$ and $\gamma$ for several fixed values of $\varepsilon$.
[Figure: contour and surface plots of the 5-fold CV_MSE over $C$ and $\gamma$ for $\varepsilon = 0.15$]
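A sketch of this model-selection step with scikit-learn's GridSearchCV; the grid values and data are assumptions, with $\varepsilon$ held at 0.15 as on the slide:

```python
# 5-fold cross-validation over C and gamma for a fixed epsilon = 0.15,
# mirroring the CV_MSE surface shown on the slide.
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)

grid = GridSearchCV(SVR(kernel="rbf", epsilon=0.15),
                    param_grid={"C": [0.1, 1, 5, 10, 15],
                                "gamma": [0.005, 0.01, 0.02, 0.1]},
                    cv=5, scoring="neg_mean_squared_error")
grid.fit(X, y)
print("best (C, gamma):", grid.best_params_)
print("best CV MSE:", -grid.best_score_)
```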
SVR Study: Model Training, Selection and Prediction

[Figure: CVMSE (IR*, HR*, CR*); true returns (red) and raw predictions (blue)]
SVR: Individual Effects

[Figure: four estimated individual-effect curves on SP500: credit spread, VIX, VIX futures, and the 3-month Treasury bill]
SVR Technicalities: SVR vs. OLS

Performance on the test set (Holiday Data, Expenditures per Observation):

SVR ($\varepsilon = 0.15$): MSE = 0.04
OLS: MSE = 0.23

[Figure: test-set observations and fitted Expenditures for SVR and for the OLS solution]
Technical Note:
Number of Training Errors vs. Model Complexity
[Figure: number of training errors and test errors as functions of model complexity, with functions ordered in increasing complexity; the best trade-off lies between the minimum number of training errors and low model complexity]

MATLAB video here
Variable selection for regression
Akaike Information Criterion (AIC). Final prediction error (in the standard form for a linear model with Gaussian errors and $d$ fitted parameters):

$$AIC = n \ln\!\left(\frac{SSE}{n}\right) + 2d$$

The model with the smallest AIC is preferred.
Variable selection for regression
Bayesian Information Criterion (BIC), also known as the Schwarz criterion. Final prediction error (same notation as for AIC):

$$BIC = n \ln\!\left(\frac{SSE}{n}\right) + d \ln n$$

Since $\ln n > 2$ once $n \ge 8$, BIC penalizes model size more heavily and tends to choose simpler models than AIC.
Variable selection for regression
$R^2$-adjusted (penalizes $R^2$ for the number of predictors $p$):

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$
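A sketch computing all three criteria for one fitted linear model (standard formulas; the synthetic data are an assumption):

```python
# AIC, BIC, and adjusted R^2 for a Gaussian linear model with d fitted
# parameters (p coefficients plus the intercept).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

model = LinearRegression().fit(X, y)
SSE = np.sum((y - model.predict(X)) ** 2)
d = p + 1                                    # coefficients + intercept

aic = n * np.log(SSE / n) + 2 * d
bic = n * np.log(SSE / n) + d * np.log(n)    # log(n) > 2 for n >= 8: stiffer penalty
r2 = 1 - SSE / np.sum((y - y.mean()) ** 2)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"AIC = {aic:.1f}, BIC = {bic:.1f}, adjusted R^2 = {r2_adj:.4f}")
```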
Conclusion / Summary / References

Summary:
Classical Linear Regression
LASSO and Ridge Regression (linear and nonlinear)
Nonparametric (local) regression estimation: kNN for regression, Decision trees, Smoothers
Support Vector Regression (linear and nonlinear)
Variable/feature selection (AIC, BIC, R^2-adjusted)

References:
Alpaydin, 2004
Bishop, 2006
Hastie et al., 2001
Smola and Schoelkopf, 2003
http://www-stat.stanford.edu/~tibs/lasso.html
(any introductory statistical/econometric book)