Lecture 3
Tarun Ramadorai
I This is one of the most important and useful methods in the machine
learning toolkit, as we will see.
1. For k = 0, ..., K − 1:
1.1 Consider all models with k + 1 regressors, i.e., those that augment the current model with one additional predictive variable.
1.2 Pick the best of these models using your preferred criterion (say R2).
1.3 Either keep going until you hit a pre-specified number of regressors, or pick the model size k + 1 that minimizes cross-validated prediction error (or maximizes adjusted R2); see the sketch below.
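I A minimal sketch of this greedy forward search in Python (my own illustration, assuming a numeric feature matrix X and target y; the function and variable names are hypothetical):

    # Forward stepwise selection: greedily add the regressor that most improves R2.
    # Assumes X is an (N x K) numpy array and y an (N,) numpy array.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def forward_stepwise(X, y, max_features=None):
        K = X.shape[1]
        max_features = max_features or K
        selected, remaining = [], list(range(K))
        path = []  # (selected regressors, in-sample R2) at each step
        while remaining and len(selected) < max_features:
            # Steps 1.1-1.2: try adding each remaining regressor, keep the best by R2
            scores = []
            for j in remaining:
                cols = selected + [j]
                r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
                scores.append((r2, j))
            best_r2, best_j = max(scores)
            selected.append(best_j)
            remaining.remove(best_j)
            path.append((list(selected), best_r2))
        # Step 1.3: in practice, choose the model size on this path by
        # cross-validated prediction error or adjusted R2.
        return path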
This table shows the results of the regression

    R^i_t = c^i + ∑_{k=1}^{K} λ^i_k F_{k,t} + u^i_t

for the eight HFR indexes during the full sample period from January 1990 to June 2000. The table shows the intercept (C), the statistically significant (at the five percent level) slope coefficients on the various buy-and-hold and option-based risk factors, and the adjusted R2 (Adj-R2). The buy-and-hold risk factors are the Russell 3000 index (RUS), the lagged Russell 3000 index (LRUS), the MSCI excluding the US index (MXUS), the MSCI Emerging Markets index (MEM), the Fama-French Size and Book-to-Market factors (SMB & HML), the Momentum factor (MOM), the Salomon Brothers Government and Corporate Bond index (SBG), the Salomon Brothers World Government Bond index (SBW), the Lehman High Yield Composite index (LHY), the Federal Reserve Bank Competitiveness-Weighted Dollar index (FRBI), the Goldman Sachs Commodity index (GSCI), and the change in the default spread in basis points (DEFSPR). The option-based risk factors include the at-the-money and out-of-the-money call and put options on the S&P 500 Composite index (SPCa/o and SPPa/o), where the subscripts a and o refer to at-the-money and out-of-the-money respectively.
Event Arbitrage | Restructuring | Event Driven | Relative Value Arbitrage | Convertible Arbitrage | Equity Hedge | Equity Non-Hedge | Short Selling
(each column reports the selected factors and their coefficients λ)
C 0.04 C 0.43 C 0.20 C 0.38 C 0.24 C 0.99 C 0.56 C -0.07
SPPo -0.92 SPPo -0.63 SPPo -0.94 SPPo -0.64 SPPa -0.27 RUS 0.41 RUS 0.75 SPCo -1.38
SMB 0.15 SMB 0.24 SMB 0.31 MOM -0.08 LRUS 0.10 SMB 0.33 SMB 0.58 RUS -0.69
HML 0.08 HML 0.12 HML 0.12 SMB 0.17 SMB 0.05 HML -0.08 MEM 0.05 SMB -0.77
LRUS 0.06 RUS 0.17 HML 0.08 MEM 0.03 GSCI 0.08 HML 0.40
LHY 0.13 MEM 0.06 MXUS 0.04 SBG 0.16
FRBI 0.27
MEM 0.09
Adj-R2 44.04 Adj-R2 65.57 Adj-R2 73.38 Adj-R2 52.17 Adj-R2 40.51 Adj-R2 72.53 Adj-R2 91.63 Adj-R2 82.02
1. For k = K, K − 1, ..., 1:
1.1 Consider all models with k − 1 regressors, i.e., those that delete one regressor at a time from the full model.
1.2 Pick the best of these models using your preferred criterion (say R2).
1.3 Either keep going until you hit a pre-specified number of regressors, or pick the model size k that minimizes cross-validated prediction error (or maximizes adjusted R2); see the sketch below.
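I A sketch of the same idea using scikit-learn's SequentialFeatureSelector, which scores candidate models by cross-validation rather than in-sample R2 (assumes arrays X and y are already defined; the hyperparameters are illustrative):

    # Backward stepwise (greedy deletion) with cross-validated scoring.
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    sfs = SequentialFeatureSelector(
        LinearRegression(),
        n_features_to_select=5,    # or a pre-specified number of regressors to keep
        direction="backward",      # start from the full model, drop one regressor at a time
        scoring="neg_mean_squared_error",
        cv=5,
    )
    sfs.fit(X, y)
    kept = sfs.get_support(indices=True)   # indices of the surviving regressors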
[Figure: Adjusted-R2 Distribution. Two panels, OLS (left) and OLS + LASSO (right); vertical axis Pr[ Adj. R2 = x ], horizontal axis adjusted R2 from 0% to 20%; annotated values 10.05% and 8.17%.]
I Note what the objective function is saying: there is a penalty (the squared l2 norm of the coefficients) that punishes having many large coefficients.
I This has the result of shrinking all the βk estimates towards zero.
I However, because of the penalty function, the Ridge Regression estimates are not scale-equivariant the way OLS estimates are: rescaling a predictor changes its coefficient by more than just the inverse rescaling. So it is best to first standardize the predictors before estimating (turn them into mean zero, variance one by demeaning and dividing by the in-sample standard deviation).
I Note also that the Ridge Regression shrinks the parameters towards zero smoothly: if the regressors are orthogonal to one another and normalized as above, all βk coefficients are shrunk by the same factor 1/(1 + λ), so no coefficient is ever set exactly to zero and all K predictors stay in the model.
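I A small numerical check of this shrinkage factor (a sketch with simulated orthonormal regressors; scikit-learn's alpha plays the role of λ here):

    # With orthonormal regressors (X'X = I), ridge equals OLS scaled by 1/(1 + lambda).
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    N, K, lam = 500, 5, 2.0
    X, _ = np.linalg.qr(rng.standard_normal((N, K)))   # columns are orthonormal
    beta = np.array([3.0, -2.0, 1.0, 0.5, 0.0])
    y = X @ beta + 0.1 * rng.standard_normal(N)

    b_ols = LinearRegression(fit_intercept=False).fit(X, y).coef_
    b_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

    print(b_ridge / b_ols)       # every ratio equals 1 / (1 + lam)
    print(1.0 / (1.0 + lam))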
I The LASSO does not have this drawback. The estimator is the
solution to the optimization problem:
∑_{i=1}^{N} ( Yi − β0 − ∑_{k=1}^{K} Xik βk )² + λ ∑_{k=1}^{K} |βk|
= RSS + λ ∑_{k=1}^{K} |βk| .
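I A sketch of fitting this with scikit-learn (note that sklearn's Lasso scales the RSS by 1/(2N), so its alpha corresponds to λ/(2N) in the objective above; the simulated data here are purely illustrative):

    # LASSO fit; scikit-learn minimizes (1/(2N)) * RSS + alpha * ||beta||_1,
    # so alpha = lambda / (2N) matches the objective written above.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    N, K = 200, 10
    X = rng.standard_normal((N, K))
    beta_true = np.r_[2.0, -1.5, 1.0, np.zeros(K - 3)]   # sparse true coefficients
    y = X @ beta_true + rng.standard_normal(N)

    lam = 40.0
    fit = Lasso(alpha=lam / (2 * N)).fit(X, y)
    print(fit.coef_)   # several coefficients are set exactly to zero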
I Useful intuition for the LASSO, when the RHS variables in the LASSO
regression are uncorrelated and have unit variance:
I If β̂k^ols is the OLS estimator, and β̂k^LASSO the corresponding LASSO estimator, then:
β̂k^LASSO = sign(β̂k^ols) · max(0, |β̂k^ols| − λ)
I If the OLS coefficient is small in absolute value relative to λ, then β̂k^LASSO = 0.
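I A short numpy sketch of this soft-thresholding rule, contrasted with the ridge factor 1/(1 + λ), which shrinks but never zeroes (the coefficient values are hypothetical):

    # LASSO soft-thresholding of OLS estimates (uncorrelated, unit-variance regressors).
    import numpy as np

    def lasso_from_ols(b_ols, lam):
        return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

    b_ols = np.array([2.5, -1.2, 0.3, -0.05])   # hypothetical OLS coefficients
    lam = 0.5
    print(lasso_from_ols(b_ols, lam))   # [2.0, -0.7, 0.0, -0.0]: small ones become exactly zero
    print(b_ols / (1.0 + lam))          # ridge shrinks all of them but keeps every one nonzero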
Figure 2. Note that the LASSO could select stock n’s lagged returns when trying to forecast stock n’s future returns. In principle the LASSO could identify precisely the same predictors as we use in some of the OLS regressions from Subsection 2.1.
Why Does LASSO Deliver Zero Coefficients?
Source: Intro to Stat. Learning
I The ellipses are iso-RSS curves; the shaded areas are the constraint regions for the LASSO (left) and Ridge (right).
I But we will also consider how the LASSO helps with causal inference
and not just prediction, later in the course.
I More generally, you can find the optimal decision rule (in the training dataset) as the one that minimizes the risk, i.e., the expected loss (or, in most applications, the empirical risk, which is the average of the loss function over the training dataset); a short sketch follows.
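I A minimal sketch of computing the empirical risk under squared-error loss (the arrays and the loss choice are illustrative):

    # Empirical risk: the average of the loss function over the training dataset.
    import numpy as np

    def empirical_risk(y, y_hat, loss=lambda a, b: (a - b) ** 2):
        return np.mean(loss(y, y_hat))

    y = np.array([1.0, 0.0, 2.0])
    y_hat = np.array([0.8, 0.1, 1.5])
    print(empirical_risk(y, y_hat))   # mean squared error on this toy training sample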
I One approach is to shrink the predicted probabilities towards the base rate in the training dataset.
I Note: the good news is that Python has isotonic regression built in: sklearn.isotonic.IsotonicRegression!
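I A sketch of using it to calibrate raw scores (assumes 1-D arrays scores and y_true of model outputs and 0/1 outcomes; in practice you would fit on a held-out calibration sample rather than the training data):

    # Map raw classifier scores to calibrated probabilities with a monotone,
    # piecewise-constant (isotonic) fit.
    from sklearn.isotonic import IsotonicRegression

    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(scores, y_true)            # scores: raw model outputs, y_true: 0/1 labels
    calibrated = iso.predict(scores)   # calibrated probabilities in [0, 1]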
Fact
“Our experiments show that boosting full decision trees usually yields better
models than boosting weaker stumps. Unfortunately, our results also show that
boosting to directly optimize log-loss, or applying Logistic Correction to models
boosted with exponential loss, is only effective when boosting weak models such
as stumps. Neither of these methods is effective when boosting full decision trees.
Significantly better performance is obtained by boosting full decision trees with
exponential loss, and then calibrating their predictions using either Platt Scaling
or Isotonic Regression. Calibration with Platt Scaling or Isotonic Regression is
so effective that after calibration boosted decision trees predict better probabilities
than any other learning method we have compared them to, including neural
nets, bagged trees, random forests, and calibrated SVMs.”
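I A sketch of that recipe in scikit-learn: boost full (unpruned) decision trees with AdaBoost's exponential loss, then calibrate with isotonic regression (assumes training arrays X and y; the hyperparameters are illustrative):

    # Boost full decision trees, then calibrate the scores with isotonic regression.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.calibration import CalibratedClassifierCV

    boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=None), n_estimators=200)
    calibrated = CalibratedClassifierCV(boosted, method="isotonic", cv=5)
    calibrated.fit(X, y)                       # X, y: training features and 0/1 labels
    probs = calibrated.predict_proba(X)[:, 1]  # calibrated class-1 probabilities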
[Figure: calibration (reliability) plot. Vertical axes: fraction of positives and a histogram of prediction counts (up to 1,500,000); horizontal axis: mean predicted value (0.0 to 1.0).]