Lecture 3
Tarun Ramadorai
I This is one of the most important and useful methods in the machine
learning toolkit, as we will see.
1. For k = 0, ..., K − 1:
1.1 Consider all models with k + 1 regressors, i.e., those that augment the current model with one additional predictive variable.
1.2 Pick the best of these models using your preferred criterion (say R2).
1.3 Either keep going until you hit a pre-specified number of regressors, or pick the model size k + 1 that minimizes cross-validated prediction error (or maximizes adjusted R2); see the sketch below.
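I A minimal sketch of this greedy forward search in Python (my own illustration, assuming a numeric feature matrix X and target y; the function and variable names are hypothetical):

    # Forward stepwise selection: greedily add the regressor that most improves R2.
    # Assumes X is an (N x K) numpy array and y an (N,) numpy array.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def forward_stepwise(X, y, max_features=None):
        K = X.shape[1]
        max_features = max_features or K
        selected, remaining = [], list(range(K))
        path = []  # (selected regressors, in-sample R2) at each step
        while remaining and len(selected) < max_features:
            # Steps 1.1-1.2: try adding each remaining regressor, keep the best by R2
            scores = []
            for j in remaining:
                cols = selected + [j]
                r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
                scores.append((r2, j))
            best_r2, best_j = max(scores)
            selected.append(best_j)
            remaining.remove(best_j)
            path.append((list(selected), best_r2))
        # Step 1.3: in practice, choose the model size on this path by
        # cross-validated prediction error or adjusted R2.
        return path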
This table shows the results of the regression

    R^i_t = c^i + ∑_{k=1}^{K} λ^i_k F_{k,t} + u^i_t

for the eight HFR indexes during the full sample period from January 1990 to June 2000. The table shows the intercept (C), the statistically significant (at the five percent level) slope coefficients on the various buy-and-hold and option-based risk factors, and the adjusted R2 (Adj-R2). The buy-and-hold risk factors are the Russell 3000 index (RUS), the lagged Russell 3000 index (LRUS), the MSCI excluding the US index (MXUS), the MSCI Emerging Markets index (MEM), the Fama-French Size and Book-to-Market factors (SMB & HML), the Momentum factor (MOM), the Salomon Brothers Government and Corporate Bond index (SBG), the Salomon Brothers World Government Bond index (SBW), the Lehman High Yield Composite index (LHY), the Federal Reserve Bank Competitiveness-Weighted Dollar index (FRBI), the Goldman Sachs Commodity index (GSCI), and the change in the default spread in basis points (DEFSPR). The option-based risk factors include the at-the-money and out-of-the-money call and put options on the S&P 500 Composite index (SPCa/o and SPPa/o), where the subscripts a and o refer to at-the-money and out-of-the-money respectively.
Event Arbitrage | Restructuring | Event Driven | Relative Value Arbitrage | Convertible Arbitrage | Equity Hedge | Equity Non-Hedge | Short Selling
(each column reports the selected factors and their coefficients λ)
C 0.04 C 0.43 C 0.20 C 0.38 C 0.24 C 0.99 C 0.56 C -0.07
SPPo -0.92 SPPo -0.63 SPPo -0.94 SPPo -0.64 SPPa -0.27 RUS 0.41 RUS 0.75 SPCo -1.38
SMB 0.15 SMB 0.24 SMB 0.31 MOM -0.08 LRUS 0.10 SMB 0.33 SMB 0.58 RUS -0.69
HML 0.08 HML 0.12 HML 0.12 SMB 0.17 SMB 0.05 HML -0.08 MEM 0.05 SMB -0.77
LRUS 0.06 RUS 0.17 HML 0.08 MEM 0.03 GSCI 0.08 HML 0.40
LHY 0.13 MEM 0.06 MXUS 0.04 SBG 0.16
FRBI 0.27
MEM 0.09
Adj-R2 44.04 Adj-R2 65.57 Adj-R2 73.38 Adj-R2 52.17 Adj-R2 40.51 Adj-R2 72.53 Adj-R2 91.63 Adj-R2 82.02
1. For k = K, K − 1, ..., 1:
1.1 Consider all models with k − 1 regressors, i.e., those that delete one regressor at a time from the full model.
1.2 Pick the best of these models using your preferred criterion (say R2).
1.3 Either keep going until you hit a pre-specified number of regressors, or pick the model size k that minimizes cross-validated prediction error (or maximizes adjusted R2); see the sketch below.
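I A sketch of the same idea using scikit-learn's SequentialFeatureSelector, which scores candidate models by cross-validation rather than in-sample R2 (assumes arrays X and y are already defined; the hyperparameters are illustrative):

    # Backward stepwise (greedy deletion) with cross-validated scoring.
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    sfs = SequentialFeatureSelector(
        LinearRegression(),
        n_features_to_select=5,    # or a pre-specified number of regressors to keep
        direction="backward",      # start from the full model, drop one regressor at a time
        scoring="neg_mean_squared_error",
        cv=5,
    )
    sfs.fit(X, y)
    kept = sfs.get_support(indices=True)   # indices of the surviving regressors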
[Figure: Adjusted-R2 Distribution. Two panels, OLS (left) and OLS + LASSO (right); vertical axis Pr[ Adj. R2 = x ], horizontal axis adjusted R2 from 0% to 20%; annotated values 10.05% and 8.17%.]
I Note what the objective function is saying: there is a penalty (the squared l2 norm of the coefficients) that punishes having many large coefficients.
I This has the result of shrinking all the βk estimates towards zero.
I However, because of the penalty function, the Ridge Regression estimates are not scale-equivariant the way OLS estimates are: rescaling a predictor changes its coefficient by more than just the inverse rescaling. So it is best to first standardize the predictors before estimating (turn them into mean zero, variance one by demeaning and dividing by the in-sample standard deviation).
I Note also that the Ridge Regression shrinks the parameters towards zero smoothly: if the regressors are orthogonal to one another and normalized as above, all βk coefficients are shrunk by the same factor 1/(1 + λ), so no coefficient is ever set exactly to zero and all K predictors stay in the model.
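I A small numerical check of this shrinkage factor (a sketch with simulated orthonormal regressors; scikit-learn's alpha plays the role of λ here):

    # With orthonormal regressors (X'X = I), ridge equals OLS scaled by 1/(1 + lambda).
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    N, K, lam = 500, 5, 2.0
    X, _ = np.linalg.qr(rng.standard_normal((N, K)))   # columns are orthonormal
    beta = np.array([3.0, -2.0, 1.0, 0.5, 0.0])
    y = X @ beta + 0.1 * rng.standard_normal(N)

    b_ols = LinearRegression(fit_intercept=False).fit(X, y).coef_
    b_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

    print(b_ridge / b_ols)       # every ratio equals 1 / (1 + lam)
    print(1.0 / (1.0 + lam))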
I The LASSO does not have this drawback. The estimator is the
solution to the optimization problem:
∑_{i=1}^{N} ( Yi − β0 − ∑_{k=1}^{K} Xik βk )² + λ ∑_{k=1}^{K} |βk|
= RSS + λ ∑_{k=1}^{K} |βk| .
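I A sketch of fitting this with scikit-learn (note that sklearn's Lasso scales the RSS by 1/(2N), so its alpha corresponds to λ/(2N) in the objective above; the simulated data here are purely illustrative):

    # LASSO fit; scikit-learn minimizes (1/(2N)) * RSS + alpha * ||beta||_1,
    # so alpha = lambda / (2N) matches the objective written above.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    N, K = 200, 10
    X = rng.standard_normal((N, K))
    beta_true = np.r_[2.0, -1.5, 1.0, np.zeros(K - 3)]   # sparse true coefficients
    y = X @ beta_true + rng.standard_normal(N)

    lam = 40.0
    fit = Lasso(alpha=lam / (2 * N)).fit(X, y)
    print(fit.coef_)   # several coefficients are set exactly to zero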
I Useful intuition for the LASSO, when the RHS variables in the LASSO
regression are uncorrelated and have unit variance:
I If β̂k^ols is the OLS estimator, and β̂k^LASSO the corresponding LASSO estimator, then:
β̂k^LASSO = sign(β̂k^ols) · max(0, |β̂k^ols| − λ)
I If the OLS coefficient is small in absolute value relative to λ, then β̂k^LASSO = 0.
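I A short numpy sketch of this soft-thresholding rule, contrasted with the ridge factor 1/(1 + λ), which shrinks but never zeroes (the coefficient values are hypothetical):

    # LASSO soft-thresholding of OLS estimates (uncorrelated, unit-variance regressors).
    import numpy as np

    def lasso_from_ols(b_ols, lam):
        return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

    b_ols = np.array([2.5, -1.2, 0.3, -0.05])   # hypothetical OLS coefficients
    lam = 0.5
    print(lasso_from_ols(b_ols, lam))   # [2.0, -0.7, 0.0, -0.0]: small ones become exactly zero
    print(b_ols / (1.0 + lam))          # ridge shrinks all of them but keeps every one nonzero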
Figure 2. Note that the LASSO could select stock n’s lagged returns when trying to forecast stock n’s future returns. In principle the LASSO could identify precisely the same predictors as we use in some of the OLS regressions from Subsection 2.1.
Why Does LASSO Deliver Zero Coefficients?
Source: Intro to Stat. Learning
I The ellipses are iso-RSS curves; the shaded areas are the constraint regions for the LASSO (left) and Ridge (right).
I But we will also consider how the LASSO helps with causal inference
and not just prediction, later in the course.
I More generally, you can find the optimal decision rule (in the training dataset) as the one that minimizes the risk, i.e., the expected loss (or, in most applications, the empirical risk, which is the average of the loss function over the training dataset); a short sketch follows.
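I A minimal sketch of computing the empirical risk under squared-error loss (the arrays and the loss choice are illustrative):

    # Empirical risk: the average of the loss function over the training dataset.
    import numpy as np

    def empirical_risk(y, y_hat, loss=lambda a, b: (a - b) ** 2):
        return np.mean(loss(y, y_hat))

    y = np.array([1.0, 0.0, 2.0])
    y_hat = np.array([0.8, 0.1, 1.5])
    print(empirical_risk(y, y_hat))   # mean squared error on this toy training sample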
I One approach is to shrink the predicted probabilities towards the base rate in the training dataset.
I Note: the good news is that Python has isotonic regression built in: sklearn.isotonic.IsotonicRegression!
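I A sketch of using it to calibrate raw scores (assumes 1-D arrays scores and y_true of model outputs and 0/1 outcomes; in practice you would fit on a held-out calibration sample rather than the training data):

    # Map raw classifier scores to calibrated probabilities with a monotone,
    # piecewise-constant (isotonic) fit.
    from sklearn.isotonic import IsotonicRegression

    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(scores, y_true)            # scores: raw model outputs, y_true: 0/1 labels
    calibrated = iso.predict(scores)   # calibrated probabilities in [0, 1]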
Fact
“Our experiments show that boosting full decision trees usually yields better
models than boosting weaker stumps. Unfortunately, our results also show that
boosting to directly optimize log-loss, or applying Logistic Correction to models
boosted with exponential loss, is only effective when boosting weak models such
as stumps. Neither of these methods is effective when boosting full decision trees.
Significantly better performance is obtained by boosting full decision trees with
exponential loss, and then calibrating their predictions using either Platt Scaling
or Isotonic Regression. Calibration with Platt Scaling or Isotonic Regression is
so effective that after calibration boosted decision trees predict better probabilities
than any other learning method we have compared them to, including neural
nets, bagged trees, random forests, and calibrated SVMs.”
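I A sketch of that recipe in scikit-learn: boost full (unpruned) decision trees with AdaBoost's exponential loss, then calibrate with isotonic regression (assumes training arrays X and y; the hyperparameters are illustrative):

    # Boost full decision trees, then calibrate the scores with isotonic regression.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.calibration import CalibratedClassifierCV

    boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=None), n_estimators=200)
    calibrated = CalibratedClassifierCV(boosted, method="isotonic", cv=5)
    calibrated.fit(X, y)                       # X, y: training features and 0/1 labels
    probs = calibrated.predict_proba(X)[:, 1]  # calibrated class-1 probabilities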
[Figure: calibration (reliability) plot. Vertical axes: fraction of positives and a histogram of prediction counts (up to 1,500,000); horizontal axis: mean predicted value (0.0 to 1.0).]