Advanced Machine Learning
Loss Function and Regularization
Amit Sethi
Electrical Engineering, IIT Bombay
Learning outcomes for the lecture
• Write expressions for common loss functions
• Match loss functions to qualitative objectives
• List advantages and disadvantages of loss functions
Contents
• Revisiting MSE and L2 regularization
• How L1 regularization leads to sparsity
• Other losses inspired by L1 and L2
• Hinge loss leads to a small number of support vectors
• Link between logistic regression and cross entropy
Assumptions behind MSE loss
• MSE is the square of RMSE
• RMSE equals the standard deviation of the error when the mean error is zero
• For least-squares regression with an intercept (bias) term, the mean of the residuals is zero at the optimum
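For concreteness, writing the residuals as $e_i = y_i - \hat{y}_i$ (notation mine, not from the slide):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} e_i^2}, \qquad \mathrm{std}(e) = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \bigl(e_i - \bar{e}\bigr)^2},$$

so the two coincide exactly when the mean residual $\bar{e}$ is zero.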
Regularization in regression
• Why regularize?
– Reduce variance, at the cost of bias
– Increase test (validation) accuracy
– Get interpretable models
• How to regularize?
– Shrink coefficients
– Reduce features
Regularization is constraining a model
• How to regularize?
– Reduce the number of parameters
• Share weights in structure
– Constrain parameters to be small
– Encourage sparsity of output in loss
• Most commonly Tikhonov (or L2, or ridge)
regularization (a.k.a. weight decay)
– Penalty on sums of squares of individual weights
$$J = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2 + \frac{\lambda}{2}\sum_{j=1}^{n} w_j^2; \qquad f(x_i) = \sum_{j=0}^{n} w_j x_{ij}$$
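A minimal sketch of minimizing this objective in closed form (the function name ridge_fit and the toy data are mine, not from the lecture; for brevity the bias weight $w_0$ is penalized here too, which ridge implementations usually avoid):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of (1/N)*||y - Xw||^2 + (lam/2)*||w||^2."""
    N, d = X.shape
    # Setting the gradient to zero: (2/N) X^T (Xw - y) + lam*w = 0,
    # i.e. (X^T X + (N*lam/2) I) w = X^T y.
    A = X.T @ X + (N * lam / 2.0) * np.eye(d)
    return np.linalg.solve(A, X.T @ y)

# Toy usage: the first column of ones plays the role of x_{i0} for the bias w_0.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.1))   # close to w_true, slightly shrunk
```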
Coefficient shrinkage using ridge
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
L2-regularization visualized
Contents
• Revisiting MSE and L2 regularization
• How L1 regularization leads to sparsity
• Other losses inspired by L1 and L2
• Hinge loss leads to a small number of support vectors
• Link between logistic regression and cross entropy
Subset selection
• Set the coefficients with lowest absolute value
to zero
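A minimal sketch of this kind of hard thresholding (my own illustration; the cited paper compares subset selection against ridge and the lasso):

```python
import numpy as np

def hard_threshold(w, k):
    """Keep the k largest-magnitude coefficients, set the rest to zero."""
    out = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-k:]   # indices of the k largest |w_j|
    out[keep] = w[keep]
    return out

print(hard_threshold(np.array([0.2, -3.0, 0.05, 1.1]), k=2))   # [ 0.  -3.   0.   1.1]
```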
Level sets of Lq norm of coefficients
Which one is ridge? Subset selection? Lasso?
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Other forms of regularization
• L1-regularization (sparsity-inducing norm)
– Penalty on the sum of absolute values of the weights
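A minimal sketch (scikit-learn assumed; the data is synthetic and mine) of the qualitative difference: the L1 penalty drives most coefficients exactly to zero, while the L2 penalty only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]              # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print("lasso nonzeros:", int(np.sum(lasso.coef_ != 0)))   # typically just a few
print("ridge nonzeros:", int(np.sum(ridge.coef_ != 0)))   # all 20, merely shrunk
```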
Lasso coeff paths with decreasing λ
Source: Regression Shrinkage and Selection via the Lasso, by Robert Tibshirani, Journal of Royal Stat. Soc., 1996
Compare to coeff shrinkage path of
ridge
Source: Sci-kit learn tutorial
Contents
• Revisiting MSE and L2 regularization
• How L1 regularization leads to sparsity
• Other losses inspired by L1 and L2
• Hinge loss leads to a small number of support vectors
• Link between logistic regression and cross entropy
Smoothly Clipped Absolute Deviation
(SCAD) Penalty
Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
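A minimal sketch of the SCAD penalty itself (my own implementation from the piecewise definition, with a = 3.7 as commonly suggested; treat it as illustrative rather than the paper's code): it matches the L1 penalty near zero, then tapers, and becomes constant for large coefficients, so large coefficients are not shrunk.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """Piecewise SCAD penalty: linear near 0, quadratic taper, then flat."""
    t = np.abs(theta)
    small  = t <= lam
    middle = (t > lam) & (t <= a * lam)
    large  = t > a * lam
    out = np.empty_like(t, dtype=float)
    out[small]  = lam * t[small]
    out[middle] = (2 * a * lam * t[middle] - t[middle] ** 2 - lam ** 2) / (2 * (a - 1))
    out[large]  = lam ** 2 * (a + 1) / 2          # constant: no further penalty growth
    return out

print(scad_penalty(np.array([0.1, 1.0, 5.0]), lam=0.5))
```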
Thresholding in three cases: no alteration of large coefficients by SCAD and hard thresholding
Source: Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties, by Fan and Li, Journal of Am. Stat. Assoc., 2001
Motivation for elastic net
• The p >> n problem and grouped selection
– Microarrays: p > 10,000 and n < 100
– For genes sharing the same biological “pathway”, the correlations among them can be high
• Lasso limitations
– If p > n, the lasso selects at most n variables before it saturates, so the number of selected variables is bounded by the sample size
– Grouped variables: the lasso fails to do grouped selection; it tends to select one variable from a group and ignore the others
Source: Elastic net, by Zou and Hastie
Elastic net: Use both L1 and L2 penalties
Source: Elastic net, by Zou and Hastie
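For reference, the naive elastic-net objective can be written by simply combining the two penalties discussed above (notation mine):

$$\hat{\beta} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$$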
Geometry of elastic net
Source: Elastic net, by Zou and Hastie
Elastic net selects correlated variables
as “group”
Source: Elastic net, by Zou and Hastie
Elastic net selects correlated variables as
“group” and stabilizes the coefficient paths
Source: Elastic net, by Zou and Hastie
Why does the L2 penalty keep the coefficients of a group together?
• Try to think of an example with correlated variables (a sketch follows below)
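A minimal sketch (scikit-learn assumed; data synthetic and mine): with two perfectly correlated copies of the same feature, the L2 penalty splits the weight evenly across the copies, whereas the L1 penalty typically concentrates it on one copy.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, x])                 # two identical (perfectly correlated) columns
y = 2.0 * x + 0.1 * rng.normal(size=200)

print("ridge:", Ridge(alpha=1.0).fit(X, y).coef_)   # roughly [1, 1]: weight shared
print("lasso:", Lasso(alpha=0.1).fit(X, y).coef_)   # typically one coefficient near 2, the other 0
```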
This analysis can be generalized to linear SVMs.
Source: Elastic SCAD SVM, by Becker, Toedt, Lichter and Benner, in BMC Bioinformatics2011
A family of loss functions
Source: “A General and Adaptive Robust Loss Function” Jonathan T. Barron, ArXiv 2017
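A minimal sketch (my own implementation from the paper's description, not checked against its reference code) of the general robust loss $\rho(x, \alpha, c)$, which interpolates between familiar losses through a shape parameter $\alpha$ and a scale $c$: $\alpha = 2$ gives a scaled L2 loss, $\alpha = 1$ a smooth-L1 (Charbonnier-like) loss, $\alpha = 0$ the Cauchy loss, $\alpha = -2$ Geman-McClure, and $\alpha \to -\infty$ the Welsch loss.

```python
import numpy as np

def general_robust_loss(x, alpha, c=1.0):
    """Barron-style robust loss; alpha = 2, 0, -inf are limiting cases handled explicitly."""
    z = (x / c) ** 2
    if alpha == 2.0:
        return 0.5 * z
    if alpha == 0.0:
        return np.log1p(0.5 * z)
    if np.isneginf(alpha):
        return 1.0 - np.exp(-0.5 * z)
    b = abs(alpha - 2.0)
    return (b / alpha) * ((z / b + 1.0) ** (alpha / 2.0) - 1.0)

# Smaller alpha penalizes large residuals less, i.e. is more robust to outliers.
for a in [2.0, 1.0, 0.0, -2.0, -np.inf]:
    print(a, general_robust_loss(np.array([0.5, 5.0]), a))
```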
Contents
• Revisiting MSE and L2 regularization
• How L1 regularization leads to sparsity
• Other losses inspired by L1 and L2
• Hinge loss leads to a small number of support vectors
• Link between logistic regression and cross entropy
What is hinge loss?
• A surrogate loss function for the 0–1 loss
• Other surrogate losses are possible
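For reference, the standard expressions (textbook definitions, not read off the slide's figure), for a label $y \in \{-1, +1\}$ and classifier score $f(x)$:

$$\ell_{0\text{-}1}\bigl(y, f(x)\bigr) = \mathbb{1}\bigl[\,y\,f(x) \le 0\,\bigr], \qquad \ell_{\text{hinge}}\bigl(y, f(x)\bigr) = \max\bigl(0,\; 1 - y\,f(x)\bigr)$$

The hinge loss is exactly zero once the margin $y\,f(x)$ exceeds 1, so only points on or inside the margin contribute to the solution, which is why few training points end up as support vectors.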
Contents
• Revisiting MSE and L2 regularization
• How L1 regularization leads to sparsity
• Other losses inspired by L1 and L2
• Hinge loss leads to a small number of support vectors
• Link between logistic regression and cross entropy
Why logistic regression and BCE
• Let us assume a Bernoulli distribution
• $P(x) = \mu^{x}(1-\mu)^{1-x}$
• An exponential-family distribution has the form
• $P(x \mid \theta, \varphi) = \exp\!\left\{ \dfrac{\theta x - b(\theta)}{a(\varphi)} + c(x, \varphi) \right\}$
• So, the Bernoulli can be re-written as
• $P(x) = \exp\!\left\{ x \log\dfrac{\mu}{1-\mu} + \log(1-\mu) \right\}$
• The log-odds of success is $\theta = \log\dfrac{\mu}{1-\mu}$
• So, $\mu = \dfrac{1}{1 + e^{-\theta}}$ (the logistic / sigmoid function)
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan
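A minimal sketch (function names are mine) connecting this to binary cross-entropy: with $\mu = \sigma(\theta)$, the negative log-likelihood of a Bernoulli label is exactly the BCE loss minimized by logistic regression.

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def bce(x, mu):
    # Negative log of P(x) = mu^x * (1 - mu)^(1 - x)
    return -(x * np.log(mu) + (1 - x) * np.log(1 - mu))

theta = 0.7                       # log-odds (the natural parameter)
mu = sigmoid(theta)
print(bce(1, mu), -np.log(mu))    # identical: BCE is the Bernoulli negative log-likelihood
```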
Generative vs. discriminative
Generative
• Belief network A is more modular
– Class-conditional densities are likely to be local, characteristic functions of the objects being classified, invariant to the nature and number of the other classes
• More “natural”
– Deciding what kind of object to generate and then generating it from a recipe
• More efficient to estimate, if the model is correct
Discriminative
• More robust
– Does not need a precise model specification, so long as it is from the exponential family
• Requires fewer parameters
– $O(n)$ as opposed to $O(n^2)$
Source: Why the logistic function? A tutorial discussion on probabilities and neural networks, by Michael I. Jordan ftp://psyche.mit.edu/pub/jordan/uai.ps
Losses for ranking and metric learning
• Margin loss
• Cosine similarity
• Ranking
– Point-wise
– Pair-wise
• $\varphi(z) = (1-z)_+$ (hinge), $e^{-z}$ (exponential), or $\log(1 + e^{-z})$ (logistic)
– List-wise
Source: “Ranking Measures and Loss Functions in Learning to Rank” Chen et al, NIPS 2009
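A minimal sketch (scores and names are mine) of a pair-wise ranking loss: for a pair where the first item should rank above the second, apply a surrogate $\varphi$ to the score difference $z = s_i - s_j$; here $\varphi$ is the hinge $(1-z)_+$, and the exponential and logistic surrogates from the slide are used the same way.

```python
import numpy as np

def pairwise_hinge(s_preferred, s_other, margin=1.0):
    """Hinge surrogate on score differences: penalize pairs ranked in the wrong order or within the margin."""
    z = s_preferred - s_other
    return np.maximum(0.0, margin - z)

s_preferred = np.array([2.0, 0.1, 1.5])   # scores of items that should rank higher
s_other     = np.array([0.5, 0.4, 1.6])   # scores of items that should rank lower
print(pairwise_hinge(s_preferred, s_other).mean())
```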
Dropout: Drop a unit out to prevent
co-adaptation
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Why dropout?
• Makes other features unreliable, which breaks co-adaptation
• Equivalent to adding noise
• Trains an exponential number of thinned architectures ($O(2^n)$) inside one architecture
• Averages those architectures at run time
– Is this a good method for averaging?
– How about Bayesian averaging?
– Practically, this works well too
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Model averaging
• The average (expected) output should be the same at training and test time
• Alternatively (inverted dropout; see the sketch below):
– use w/p at training time
– use w at testing time
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
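A minimal sketch (framework-free; the function name is mine) of the “w/p at training time” variant, often called inverted dropout: each unit is kept with probability p and the surviving activations are scaled by 1/p during training, so the weights can be used unchanged at test time.

```python
import numpy as np

def dropout_forward(h, p_keep, train=True, rng=None):
    """Inverted dropout on activations h: scale by 1/p_keep at training time, identity at test time."""
    if not train:
        return h                                  # test time: use activations/weights unchanged
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) < p_keep           # keep each unit with probability p_keep
    return (h * mask) / p_keep                    # rescale so the expected output matches h

h = np.ones((2, 5))
print(dropout_forward(h, p_keep=0.8))             # surviving entries become 1/0.8 = 1.25
```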
Difference between non-DO and DO
features
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
Indeed, DO leads to sparse activation
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.
There is a sweet spot with DO, even if
you increase the number of neurons
Source: “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, by Srivastava, Hinton, et al. in JMLR 2014.