Advanced Machine Learning
Loss Function and Regularization
Amit Sethi
Electrical Engineering, IIT Bombay
Learning outcomes for the lecture
• Write expressions for common loss functions
• Match loss functions to qualitative objectives
• List advantages and disadvantages of loss functions
Contents
• Revisiting MSE and L2 regularization
• How L1 regularization leads to sparsity
• Other losses inspired by L1 and L2
• Hinge loss leads to a small number of support vectors
• Link between logistic regression and cross entropy
Assumptions behind MSE loss
• MSE is the square of RMSE
• RMSE equals the standard deviation of the error when the mean error is zero
• For least-squares regression with an intercept (bias) term, the mean of the residuals is zero at the optimum
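For concreteness, writing the residuals as $e_i = y_i - \hat{y}_i$ (notation mine, not from the slide):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} e_i^2}, \qquad \mathrm{std}(e) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \bigl(e_i - \bar{e}\bigr)^2},$$

so the two coincide exactly when the mean residual $\bar{e}$ is zero.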
Regularization in regression
• Why regularize?
– Reduce variance, at the cost of bias
– Increase test (validation) accuracy
– Get interpretable models
• How to regularize?
– Shrink coefficients
– Reduce features
Regularization is constraining a model
• How to regularize?
– Reduce the number of parameters
• Share weights in structure
– Constrain parameters to be small
– Encourage sparsity of output in loss
• Most commonly Tikhonov (or L2, or ridge)
regularization (a.k.a. weight decay)
– Penalty on sums of squares of individual weights
$$J = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2 + \frac{\lambda}{2}\sum_{j=1}^{n} w_j^2; \qquad f(x_i) = \sum_{j=0}^{n} w_j x_{ij}$$
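A minimal sketch of minimizing this objective in closed form (the function name ridge_fit and the toy data are mine, not from the lecture; for brevity the bias weight $w_0$ is penalized here too, which ridge implementations usually avoid):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of (1/N)*||y - Xw||^2 + (lam/2)*||w||^2."""
    N, d = X.shape
    # Setting the gradient to zero: (2/N) X^T (Xw - y) + lam*w = 0,
    # i.e. (X^T X + (N*lam/2) I) w = X^T y.
    A = X.T @ X + (N * lam / 2.0) * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

# Toy usage: the first column of ones plays the role of x_{i0} for the bias w_0.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.1))   # close to w_true, slightly shrunk
```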
Coefficient shrinkage using ridge
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
L2-regularization visualized
Contents
• Revisiting MSE and L2 regularization
• How L1 regularization leads to sparsity
• Other losses inspired by L1 and L2
• Hinge loss leads to a small number of support vectors
• Link between logistic regression and cross entropy
Subset selection
• Set the coefficients with lowest absolute value
to zero
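A minimal sketch of this kind of hard thresholding (my own illustration; the cited paper compares subset selection against ridge and the lasso):

```python
import numpy as np

def hard_threshold(w, k):
    """Keep the k largest-magnitude coefficients, set the rest to zero."""
    out = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-k:]   # indices of the k largest |w_j|
    out[keep] = w[keep]
    return out

print(hard_threshold(np.array([0.2, -3.0, 0.05, 1.1]), k=2))   # [ 0.  -3.   0.   1.1]
```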
Level sets of Lq norm of coefficients
Which one is ridge? Subset selection? Lasso?
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Other forms of regularization
• L1-regularization (sparsity-inducing norm)
– Penalty on the sum of absolute values of the weights
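A minimal sketch (scikit-learn assumed; the data is synthetic and mine) of the qualitative difference: the L1 penalty drives most coefficients exactly to zero, while the L2 penalty only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]              # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print("lasso nonzeros:", int(np.sum(lasso.coef_ != 0)))   # typically just a few
print("ridge nonzeros:", int(np.sum(ridge.coef_ != 0)))   # all 20, merely shrunk
```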
Lasso coeff paths with decreasing λ
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Compare to coeff shrinkage path of
ridge
Source: Sci-kit learn tutorial
Contents
• Revisiting MSE and L2 regularization
• How L1 regularization leads to sparsity
• Other losses inspired by L1 and L2
• Hinge loss leads to a small number of support vectors
• Link between logistic regression and cross entropy
Smoothly Clipped Absolute Deviation
(SCAD) Penalty
Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
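A minimal sketch of the SCAD penalty itself (my own implementation from the piecewise definition, with a = 3.7 as commonly suggested; treat it as illustrative rather than the paper's code): it matches the L1 penalty near zero, then tapers, and becomes constant for large coefficients, so large coefficients are not shrunk.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """Piecewise SCAD penalty: linear near 0, quadratic taper, then flat."""
    t = np.abs(theta)
    small  = t <= lam
    middle = (t > lam) & (t <= a * lam)
    large  = t > a * lam
    out = np.empty_like(t, dtype=float)
    out[small]  = lam * t[small]
    out[middle] = (2 * a * lam * t[middle] - t[middle] ** 2 - lam ** 2) / (2 * (a - 1))
    out[large]  = lam ** 2 * (a + 1) / 2          # constant: no further penalty growth
    return out

print(scad_penalty(np.array([0.1, 1.0, 5.0]), lam=0.5))
```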
Thresholding in three cases: no alteration of large coefficients by SCAD and hard thresholding
Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
Motivation for elastic net
• The p >> n problem and grouped selection
– Microarrays: p > 10,000 and n < 100
– For genes sharing the same biological “pathway”, the correlations among them can be high
• Lasso limitations
– If p > n, the lasso selects at most n variables before it saturates, so the number of selected variables is bounded by the sample size
– Grouped variables: the lasso fails to do grouped selection; it tends to select one variable from a group and ignore the others
Source: Elastic net, by Zou and Hastie
Elastic net: Use both L1 and L2 penalties
Source: Elastic net, by Zou and Hastie
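For reference, the naive elastic-net objective can be written by simply combining the two penalties discussed above (notation mine):

$$\hat{\beta} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$$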
Geometry of elastic net
Source: Elastic net, by Zou and Hastie
Elastic net selects correlated variables
as “group”
Source: Elastic net, by Zou and Hastie
Elastic net selects correlated variables as
“group” and stabilizes the coefficient paths
Source: Elastic net, by Zou and Hastie
Why does the L2 penalty keep the coefficients of a group together?
• Try to think of an example with correlated variables (a sketch follows below)
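A minimal sketch (scikit-learn assumed; data synthetic and mine): with two perfectly correlated copies of the same feature, the L2 penalty splits the weight evenly across the copies, whereas the L1 penalty typically concentrates it on one copy.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])                 # two identical (perfectly correlated) columns
y = 2.0 * x + 0.1 * rng.normal(size=200)

print("ridge:", Ridge(alpha=1.0).fit(X, y).coef_)   # roughly [1, 1]: weight shared
print("lasso:", Lasso(alpha=0.1).fit(X, y).coef_)   # typically one coefficient near 2, the other 0
```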
This analysis can be generalized to linear SVMs.
Source: Elastic SCAD SVM, by Becker, Toedt, Lichter and Benner, in BMC Bioinformatics2011
A family of loss functions
Source: “A General and Adaptive Robust Loss Function” Jonathan T. Barron, ArXiv 2017
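A minimal sketch (my own implementation from the paper's description, not checked against its reference code) of the general robust loss $\rho(x, \alpha, c)$, which interpolates between familiar losses through a shape parameter $\alpha$ and a scale $c$: $\alpha = 2$ gives a scaled L2 loss, $\alpha = 1$ a smooth-L1 (Charbonnier-like) loss, $\alpha = 0$ the Cauchy loss, $\alpha = -2$ Geman-McClure, and $\alpha \to -\infty$ the Welsch loss.

```python
import numpy as np

def general_robust_loss(x, alpha, c=1.0):
    """Barron-style robust loss; alpha = 2, 0, -inf are limiting cases handled explicitly."""
    z = (x / c) ** 2
    if alpha == 2.0:
        return 0.5 * z
    if alpha == 0.0:
        return np.log1p(0.5 * z)
    if np.isneginf(alpha):
        return 1.0 - np.exp(-0.5 * z)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

# Smaller alpha penalizes large residuals less, i.e. is more robust to outliers.
for a in [2.0, 1.0, 0.0, -2.0, -np.inf]:
    print(a, general_robust_loss(np.array([0.5, 5.0]), a))
```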
Contents
• Revisiting MSE and L2 regularization
• How L1 regularization leads to sparsity
• Other losses inspired by L1 and L2
• Hinge loss leads to a small number of support vectors
• Link between logistic regression and cross entropy
What is hinge loss?
• A surrogate loss function for the 0–1 loss
• Other surrogate losses are possible
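For reference, the standard expressions (textbook definitions, not read off the slide's figure), for a label $y \in \{-1, +1\}$ and classifier score $f(x)$:

$$\ell_{0\text{-}1}\bigl(y, f(x)\bigr) = \mathbb{1}\bigl[\,y\,f(x) \le 0\,\bigr], \qquad \ell_{\text{hinge}}\bigl(y, f(x)\bigr) = \max\bigl(0,\; 1 - y\,f(x)\bigr)$$

The hinge loss is exactly zero once the margin $y\,f(x)$ exceeds 1, so only points on or inside the margin contribute to the solution, which is why few training points end up as support vectors.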
Contents
• Revisiting MSE and L2 regularization
• How L1 regularization leads to sparsity
• Other losses inspired by L1 and L2
• Hinge loss leads to a small number of support vectors
• Link between logistic regression and cross entropy
Why logistic regression and BCE
• Let us assume a Bernoulli distribution
• $P(x) = \mu^{x}(1-\mu)^{1-x}$
• An exponential-family distribution has the form
• $P(x \mid \theta, \varphi) = \exp\!\left\{ \dfrac{\theta x - b(\theta)}{a(\varphi)} + c(x, \varphi) \right\}$
• So, the Bernoulli can be re-written as
• $P(x) = \exp\!\left\{ x \log\dfrac{\mu}{1-\mu} + \log(1-\mu) \right\}$
• The log-odds of success is $\theta = \log\dfrac{\mu}{1-\mu}$
• So, $\mu = \dfrac{1}{1 + e^{-\theta}}$ (the logistic / sigmoid function)
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan
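A minimal sketch (function names are mine) connecting this to binary cross-entropy: with $\mu = \sigma(\theta)$, the negative log-likelihood of a Bernoulli label is exactly the BCE loss minimized by logistic regression.

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def bce(x, mu):
    # Negative log of P(x) = mu^x * (1 - mu)^(1 - x)
    return -(x * np.log(mu) + (1 - x) * np.log(1 - mu))

theta = 0.7                       # log-odds (the natural parameter)
mu = sigmoid(theta)
print(bce(1, mu), -np.log(mu))    # identical: BCE is the Bernoulli negative log-likelihood
```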
Generative vs. discriminative
Generative
• Belief network A is more modular
– Class-conditional densities are likely to be local, characteristic functions of the objects being classified, invariant to the nature and number of the other classes
• More “natural”
– Deciding what kind of object to generate and then generating it from a recipe
• More efficient to estimate, if the model is correct
Discriminative
• More robust
– Does not need a precise model specification, so long as it is from the exponential family
• Requires fewer parameters
– $O(n)$ as opposed to $O(n^2)$
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Losses for ranking and metric learning
• Margin loss
• Cosine similarity
• Ranking
– Point-wise
– Pair-wise
• $\varphi(z) = (1-z)_+$ (hinge), $e^{-z}$ (exponential), or $\log(1 + e^{-z})$ (logistic)
– List-wise
Source: “Ranking Measures and Loss Functions in Learning to Rank” Chen et al, NIPS 2009
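A minimal sketch (scores and names are mine) of a pair-wise ranking loss: for a pair where the first item should rank above the second, apply a surrogate $\varphi$ to the score difference $z = s_i - s_j$; here $\varphi$ is the hinge $(1-z)_+$, and the exponential and logistic surrogates from the slide are used the same way.

```python
import numpy as np

def pairwise_hinge(s_preferred, s_other, margin=1.0):
    """Hinge surrogate on score differences: penalize pairs ranked in the wrong order or within the margin."""
    z = s_preferred - s_other
    return np.maximum(0.0, margin - z)

s_preferred = np.array([2.0, 0.1, 1.5])   # scores of items that should rank higher
s_other     = np.array([0.5, 0.4, 1.6])   # scores of items that should rank lower
print(pairwise_hinge(s_preferred, s_other).mean())
```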
Dropout: Drop a unit out to prevent
co-adaptation
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Why dropout?
• Makes other features unreliable, which breaks co-adaptation
• Equivalent to adding noise
• Trains an exponential number of thinned architectures ($O(2^n)$) inside one architecture
• Averages those architectures at run time
– Is this a good method for averaging?
– How about Bayesian averaging?
– Practically, this works well too
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Model averaging
• The average (expected) output should be the same at training and test time
• Alternatively (inverted dropout; see the sketch below):
– use w/p at training time
– use w at testing time
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
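A minimal sketch (framework-free; the function name is mine) of the “w/p at training time” variant, often called inverted dropout: each unit is kept with probability p and the surviving activations are scaled by 1/p during training, so the weights can be used unchanged at test time.

```python
import numpy as np

def dropout_forward(h, p_keep, train=True, rng=None):
    """Inverted dropout on activations h: scale by 1/p_keep at training time, identity at test time."""
    if not train:
        return h                                  # test time: use activations/weights unchanged
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) < p_keep           # keep each unit with probability p_keep
    return (h * mask) / p_keep                    # rescale so the expected output matches h

h = np.ones((2, 5))
print(dropout_forward(h, p_keep=0.8))             # surviving entries become 1/0.8 = 1.25
```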
Difference between non-DO and DO
features
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Indeed, DO leads to sparse activation
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
There is a sweet spot with DO, even if
you increase the number of neurons
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.