
Lecture Notes

Advanced Regression
In this module, you were introduced to the concepts of the advanced regression framework. You learnt how to deal with problems where the target variable y is non-linearly related to the predictor variables X. You were introduced to the concept of regularization in regression models, and we discussed the two regularized regression models, Ridge and Lasso, at length. The concept of the hyperparameter (λ) was also described in the context of regularization, along with its impact on the built model.

Generalized Regression
In linear regression, you encountered problems where the target variable y was linearly related to the predictor variables X. But what if the relationship is not linear? Let's see how we can use generalised regression to tackle such problems.



You should follow these two steps while building any model:
1. Carry out exploratory data analysis by examining scatter plots of explanatory and dependent
variables.
2. Choose an appropriate set of functions which seem to fit the plot well, build models using them,
and compare the results.
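As a minimal illustration of these two steps (a sketch only, using NumPy and matplotlib on made-up data), we plot the scatter of y against x and then compare a straight-line fit with a quadratic fit:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data with a non-linear (quadratic) relationship between x and y
x = np.linspace(0, 10, 50)
y = 2 + 0.5 * x ** 2 + np.random.normal(scale=2, size=x.size)

# Step 1: examine the scatter plot of the dependent vs the explanatory variable
plt.scatter(x, y, label="data")

# Step 2: try candidate functions (here a straight line and a quadratic) and compare
for degree in (1, 2):
    coeffs = np.polyfit(x, y, deg=degree)        # least-squares fit of that degree
    plt.plot(x, np.polyval(coeffs, x), label=f"degree {degree} fit")

plt.legend()
plt.show()
```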

Feature Engineering

While constructing a non-linear regression model, instead of using the raw explanatory variables in their current form, we create functions of the explanatory variables that best explain the data points. These functions capture the non-linearity in the data.

The derived features could be combinations of two or more attributes and/or transformations of
individual attributes. These combinations and transformations could be linear or non-linear.

Note that a linear combination of two attributes x1 and x2 allows only two operations - multiplying by a constant and adding the results. For example, 3x1 + 5x2 is a linear combination, whereas 2x1x2 is a non-linear combination.
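For instance, a short sketch (assuming pandas and NumPy, with hypothetical attribute columns x1 and x2) that derives a linear combination, a non-linear interaction term and a non-linear transformation:

```python
import numpy as np
import pandas as pd

# Hypothetical raw attributes x1 and x2
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0]})

df["lin_comb"] = 3 * df["x1"] + 5 * df["x2"]   # linear combination: 3*x1 + 5*x2
df["interaction"] = 2 * df["x1"] * df["x2"]    # non-linear combination: 2*x1*x2
df["log_x1"] = np.log(df["x1"])                # non-linear transformation of x1
```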

Generalized Regression Framework

We also saw several functions commonly used in regression and how an nth-degree polynomial can be expressed as a linear combination of features.

The next step is to find out the coefficients of such models mathematically, i.e. to fit the model. Let's see
how we can do that.
In generalised regression models, the basic algorithm remains the same as in linear regression - we compute the values of the coefficients which result in the least possible error (best fit). The only difference is that we now use the derived features instead of the raw attributes.


The term 'linear' in linear regression refers to the linearity in the coefficients, i.e. the target variable y is linearly related to the model coefficients. It does not require that y should be linearly related to the raw attributes: the feature functions of the attributes could be linear or non-linear.
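A minimal sketch of this idea using scikit-learn (an assumed tool choice; the lecture does not prescribe a library): the model below is still linear in its coefficients, but it is fit on polynomial feature functions of x rather than on x itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data: y is cubic in x, so a straight line in x fits poorly
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 1 - 2 * x.ravel() + 0.5 * x.ravel() ** 3 + rng.normal(scale=0.5, size=100)

# Linear regression on the derived polynomial features [x, x^2, x^3]
model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())
model.fit(x, y)

# Coefficients of the feature functions x, x^2 and x^3
print(model.named_steps["linearregression"].coef_)
```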



In a linear combination of features, the following operations can be performed:

1. We can multiply the feature functions by constants, for example, a1·f1(X) and a2·f2(X).

2. We can add those terms (but not multiply, divide, exponentiate etc.), for example, a1·f1(X) + a2·f2(X).

Expressions
We can express the regression equation as a dot product of two vectors - 1. a vector of all the coefficients a = (a1, a2, ..., ak) and 2. a vector of the features f(X) = (f1(X), f2(X), ..., fk(X)):

y = a · f(X) = a1·f1(X) + a2·f2(X) + ... + ak·fk(X)

Next, we sum up the squared errors between the predicted and actual response values and minimise this residual sum of errors to get the optimal coefficients:

RSS = Σi (yi − a · f(Xi))²



To summarise the key points:
1. We first created a feature matrix F of dimension n x k, where n is the number of data points in the training dataset and k is the number of features.
2. We then identify the coefficients that correspond to the best-fit regression model by minimising the residual sum of errors over this feature matrix.



As our goal is to minimise the loss function, we differentiate it with respect to the coefficients and equate the derivative to zero; this yields the closed-form solution a = (FᵀF)⁻¹Fᵀy, where F is the n x k feature matrix.
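A sketch of this result in code (with NumPy, on a hypothetical feature matrix F and response vector y):

```python
import numpy as np

# Hypothetical feature matrix F (n x k) and response vector y (n,)
rng = np.random.default_rng(0)
F = rng.normal(size=(100, 3))
true_a = np.array([2.0, -1.0, 0.5])
y = F @ true_a + rng.normal(scale=0.1, size=100)

# Closed-form least-squares coefficients: a = (F^T F)^(-1) F^T y
a_hat = np.linalg.solve(F.T @ F, F.T @ y)
print(a_hat)   # should be close to [2.0, -1.0, 0.5]
```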

Regularized Regression

A predictive model has to be as simple as possible, but no simpler. There is an important relationship
between the complexity of a model and its usefulness in a learning context because of the following
reasons:
• Simpler models are usually more generic and are more widely applicable (are generalizable)
• Simpler models require fewer training samples for effective training than the more complex ones

Regularization is a process used to create an optimally complex model, i.e. a model which is as simple as
possible while performing well on the training data.

Through regularization, the algorithm designer tries to strike the delicate balance between keeping
the model simple, yet not making it too naive to be of any use.
Plain regression does not account for model complexity - it only tries to minimise the error (e.g. MSE), even if that results in arbitrarily complex (large) coefficients. In regularized regression, on the other hand, the objective function has two parts - the error term and the regularization term.

Ridge Regression

In ridge regression, an additional term, the sum of the squares of the coefficients, is added to the cost function along with the error term: Cost = RSS + λ·Σj aj².
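A minimal sketch with scikit-learn's Ridge estimator (an assumed tool choice; its alpha argument plays the role of the hyperparameter λ) on made-up data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Hypothetical regression data with 10 predictors
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# alpha corresponds to the regularisation hyperparameter lambda
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)

print(ridge.coef_)   # coefficients are shrunk towards (but not exactly to) zero
```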



Lasso Regression

In the case of lasso regression, a regularisation term, the sum of the absolute values of the coefficients, is added instead: Cost = RSS + λ·Σj |aj|.

Ridge regression and Lasso regression are the two most commonly used regularised regression methods. Both methods are used to make the regression model simpler while balancing the bias-variance trade-off.
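A matching sketch for Lasso (again assuming scikit-learn and made-up data); with a sufficiently large λ (alpha), several coefficients are driven exactly to zero:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Hypothetical data where only a few of the 10 predictors are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0)

lasso = Lasso(alpha=5.0)   # alpha plays the role of lambda
lasso.fit(X, y)

print(lasso.coef_)   # several coefficients are exactly 0 -> variable selection
```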



Difference between Ridge and Lasso Regression

You learnt that both Ridge and Lasso regularize the coefficients by reducing them in value, essentially causing shrinkage of the coefficients. Ridge and Lasso perform different kinds of shrinkage, the extent of which depends on the value of the hyperparameter λ. In the process of shrinkage, Lasso shrinks some of the variable coefficients exactly to 0, thus performing variable selection.

Thus, the key observation here is that at the optimum solution (the point where the sum of the error and regularisation terms is minimum), the corresponding regularization contour and the error contour must 'touch' each other tangentially and not 'cross'. In the contour plot discussed in the lecture, the 'blue stars' highlight the touch points between the error contours and the lasso regularization contours, and the 'green stars' highlight the touch points between the error contours and the ridge regularization contours. The plot illustrates that, because of the 'corners' in the lasso contours (unlike ridge regression), the touch points are more likely to lie on one or more of the axes, which means the corresponding coefficients become exactly zero. Hence, lasso regression also serves as a variable selection method, whereas ridge regression does not.



Model Selection Parameters

While creating the best model for any problem statement, we end up choosing from a set of models which would
give us the least test error. Hence, the test error, and not only the training error, needs to be estimated in order to
select the best model. This can be done in the following two ways.

1. Use metrics which take into account both model fit and simplicity. They penalise the model for being too complex (i.e. for overfitting), and thus are more representative of the unseen 'test error'. Some examples of such metrics are Mallow's Cp, Adjusted R², AIC and BIC.

2. Estimate the test error via a validation set or a cross-validation approach.


In the validation-set approach, we estimate the test error by training the model on a training set and evaluating it on an unseen validation set. In the n-fold cross-validation approach, we take the mean of the errors obtained by training the model on all folds except the kth fold and testing it on the kth fold, where k varies from 1 to n.
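For example, a minimal 5-fold cross-validation sketch (assuming scikit-learn and made-up data) that averages the fold-wise mean squared errors of a linear model:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 5-fold cross-validation: train on 4 folds, test on the held-out fold, repeat
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")
print("mean cross-validated MSE:", -np.mean(scores))
```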

Let's look into these metrics one by one:

1. Mallow's Cp

2. AIC (Akaike information criterion)

3. BIC (Bayesian information criterion)

4. Adjusted R²
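The lecture states the formula for each metric. One common formulation (for a model with d predictors fit on n observations, where σ̂² is an estimate of the error variance; the exact expressions used in the lecture may differ by constant factors that do not affect model comparison) is:

Cp = (RSS + 2·d·σ̂²) / n
AIC ∝ RSS + 2·d·σ̂²
BIC ∝ RSS + log(n)·d·σ̂²
Adjusted R² = 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)]

Here RSS is the residual sum of squares and TSS is the total sum of squares.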

AIC and BIC are defined for models fit by maximum likelihood estimation. Notice that as we increase the number of predictors d, the penalty terms in Cp, AIC and BIC all increase while the RSS decreases. Hence, the lower the value of Cp, AIC or BIC, the better the fit of the model; the higher the Adjusted R², the better the fit of the model.



Best Subset Selection

Now, we will look at the different methods of choosing the best set of predictors that shall give the least test error.

Features' subset selection can be performed using two different methods:

1. Best Subset Selection

A brief explanation of the Best Subset Selection algorithm (run on a dataset with p features) is as follows: You start with d=0 features, i.e. a null model M0 with no features. Then, for each value of d, you consider every model that contains some combination of d features and select the one which results in the least RSS (or largest R²). This gives you a model Md with d features. Continue this iteration, increasing the value of d by one, till you reach d=p, giving the models M0, M1, M2,.....,Mp.

Out of all these models M0, M1, M2,.....,Mp, select the best one, as measured by a metric such as Cp, AIC, BIC, Adjusted R² or mean cross-validated error.

We can see that the total number of models that need to be analysed for Best Subset Selection is 2^p, where p is the total number of predictors. If we have 20 predictors, the total number of models is 2^20 = 1,048,576, which is over a million. Hence, it becomes computationally infeasible to perform best subset selection when the number of predictors is greater than around 40.
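A brute-force sketch of this procedure (assuming scikit-learn, NumPy and made-up data; only practical for a small number of predictors):

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Hypothetical data with p = 6 predictors
X, y = make_regression(n_samples=100, n_features=6, n_informative=3, noise=10.0, random_state=0)
p = X.shape[1]

best_per_size = {}                                   # M_1, ..., M_p (M_0 is the null model)
for d in range(1, p + 1):
    best_rss, best_subset = np.inf, None
    # Consider every combination of d features and keep the one with the least RSS
    for subset in combinations(range(p), d):
        cols = list(subset)
        model = LinearRegression().fit(X[:, cols], y)
        rss = np.sum((y - model.predict(X[:, cols])) ** 2)
        if rss < best_rss:
            best_rss, best_subset = rss, cols
    best_per_size[d] = (best_subset, best_rss)

# The final choice among M_0, ..., M_p would then be made using Cp, AIC, BIC,
# Adjusted R^2 or mean cross-validated error rather than RSS alone.
print(best_per_size)
```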



2. Stepwise Selection

Forward Stepwise Selection

A brief explanation of the forward selection algorithm (run on a dataset with p features) is as follows: You start with d=0 features, i.e. a null model M0 with no features. Now, out of the (p-d) remaining features, you identify the one additional feature which (when added to the model Md) results in the least RSS (or largest R²). This gives you a model Md+1 with one additional feature. Continue this iteration, increasing the value of d by one, till you reach d=p, giving the models M0, M1, M2,.....,Mp.

Out of all these models M0, M1, M2,.....,Mp, select the best one, as measured by a metric such as Cp, AIC, BIC, Adjusted R² or mean cross-validated error.
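A minimal sketch of the forward stepwise procedure along these lines (assuming scikit-learn, NumPy and made-up data); note that scikit-learn also provides sklearn.feature_selection.SequentialFeatureSelector for forward and backward selection.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=6, n_informative=3, noise=10.0, random_state=0)
p = X.shape[1]

selected = []                      # start from the null model M0
remaining = list(range(p))
path = []                          # records M1, M2, ..., Mp

for d in range(p):
    # Add the single remaining feature that gives the least RSS
    rss_for = {}
    for j in remaining:
        cols = selected + [j]
        model = LinearRegression().fit(X[:, cols], y)
        rss_for[j] = np.sum((y - model.predict(X[:, cols])) ** 2)
    best_j = min(rss_for, key=rss_for.get)
    selected.append(best_j)
    remaining.remove(best_j)
    path.append((list(selected), rss_for[best_j]))   # model M_(d+1)

# The final model among M0, ..., Mp would again be chosen with Cp, AIC, BIC,
# Adjusted R^2 or mean cross-validated error.
print(path)
```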

Backward Stepwise Selection

The backward selection algorithm is the opposite of the forward one - rather than starting with d=0 features and adding a feature in each iteration, you start with d=p features (a model Mp, with all the features as predictors) and remove one feature in every iteration - the one whose removal minimises the error (or maximises R²) - giving the models Mp, Mp−1, Mp−2,............., M0.

Out of all these models Mp, Mp−1, Mp−2,............., M0, select the best one, as measured by a metric such as Cp, AIC, BIC, Adjusted R² or mean cross-validated error.



We can see that the total number of models that need to be analysed for Forward Stepwise Selection is
1+p(p+1)/2 where p is the total number of predictors. It is the same for Backward Stepwise Selection also.

So, if the number of predictors is 40, there are just 821 models that need to be analysed, which is significantly fewer than 2^40. In this way, it is better than Best Subset Selection, but it also has limitations.

Stepwise Selection does not ensure that we have chosen the best model. If we start Forward Stepwise Selection with predictor X1, then the best model with 2 predictors becomes X1 and X2, since X2 is the next best predictor to add. But the overall best model with 2 predictors may be X2 and X3. The same issue can arise with Backward Stepwise Selection.

Though Forward Stepwise Selection can be applied when n < p, where n is the number of observations, Backward Stepwise Selection cannot be applied in that case, as a full model cannot be fit when n < p.



Disclaimer: All content and material on the UpGrad website is copyrighted material, either belonging to UpGrad or
its bonafide contributors and is purely for the dissemination of education. You are permitted to access print and
download extracts from this site purely for your own education only and on the following basis:

• You can download this document from the website for self-use only.
• Any copies of this document, in part or full, saved to disc or to any other storage medium may only be used
for subsequent, self-viewing purposes or to print an individual extract or copy for non-commercial personal
use only.
• Any further dissemination, distribution, reproduction, copying of the content of the document herein or the
uploading thereof on other websites or use of content for any other commercial/unauthorized purposes in
any way which could infringe the intellectual property rights of UpGrad or its contributors, is strictly
prohibited.
• No graphics, images or photographs from any accompanying text in this document will be used separately
for unauthorised purposes.
• No material in this document will be modified, adapted or altered in any way.
• No part of this document or UpGrad content may be reproduced or stored in any other web site or included
in any public or private electronic retrieval system or service without UpGrad’s prior written permission.
• Any rights not expressly granted in these terms are reserved.

© Copyright 2018. UpGrad Education Pvt. Ltd. All rights reserved
