Advanced Regression
In this module, you were introduced to the concepts of the advanced regression framework. You learnt
how to deal with problems where the target variable y is non-linearly related to the predictor variables X.
You were introduced to the concept of regularization in regression models. We discussed at length the
two regularized regression models, namely Ridge and Lasso. The concept of the hyperparameter (λ) was
also described in the context of regularization, along with its impact on the built model.
Generalized Regression
In linear regression, you had encountered problems where the target variable y was linearly related to the
predictor variables X. But what if the relationship is not linear? Let's see how we can use generalised
regression to tackle such problems.
Feature Engineering
While constructing a non-linear regression model, instead of using the raw explanatory variables in their
current form, we create functions of the explanatory variables that best explain the data points. These
functions capture the non-linearity in the data.
The derived features could be combinations of two or more attributes and/or transformations of
individual attributes. These combinations and transformations could be linear or non-linear.
Note that a linear combination of two attributes x1 and x2 allows only two operations - multiplying by a
constant and adding the results. For example, 3x1 + 5x2 is a linear combination, whereas 2x1x2 is a non-
linear combination.
We also saw several functions commonly used in regression and how an n-degree polynomial can be
expressed as a linear combination of features.
The next step is to find out the coefficients of such models mathematically, i.e. to fit the model. Let's see
how we can do that.
In generalised regression models, the basic algorithm remains the same as in linear regression - we compute
the values of the coefficients which result in the least possible error (best fit). The only difference is that we
now use the derived features instead of the raw attributes:
1. We can multiply each feature by a constant coefficient
2. We can add those terms together (but not multiply, divide, exponentiate etc.)
For example, y = a1·f1(x) + a2·f2(x) + ... + ak·fk(x), where f1, ..., fk are the derived features.
Expressions
We can express the regression equation as a dot product of two vectors - a vector of all the
coefficients and a vector of the features, i.e. y = a · f(x).
Next, we compute the errors between the predicted and actual values of the response variable and
minimise the residual sum of squares, RSS = Σ (yi − ŷi)², to get the optimal coefficients.
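As a concrete sketch of this idea (not taken from the lecture), the Python snippet below uses scikit-learn to derive cubic polynomial features from a single raw attribute and then fits them by ordinary least squares; the synthetic data and the choice of degree 3 are assumptions made purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# synthetic data in which y depends non-linearly on a single attribute x (assumed for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * x[:, 0] ** 3 - 2 * x[:, 0] + rng.normal(scale=1.0, size=100)

# derive the features x, x^2, x^3 from the raw attribute
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(x)

# the model is still linear in its coefficients, so ordinary least squares applies
model = LinearRegression().fit(X_poly, y)
print(np.round(model.coef_, 2), round(model.intercept_, 2))

Because the model is linear in its coefficients (only the features change), exactly the same least-squares machinery as in linear regression is used to fit it.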
Regularized Regression
A predictive model has to be as simple as possible, but no simpler. There is an important relationship
between the complexity of a model and its usefulness in a learning context because of the following
reasons:
• Simpler models are usually more generic and are more widely applicable (are generalizable)
• Simpler models require fewer training samples for effective training than the more complex ones
Regularization is a process used to create an optimally complex model, i.e. a model which is as simple as
possible while performing well on the training data.
Through regularization, the algorithm designer tries to strike the delicate balance between keeping
the model simple, yet not making it too naive to be of any use.
Plain linear regression does not account for model complexity - it only tries to minimize the error (e.g. MSE),
even if that results in arbitrarily complex (large) coefficients. In regularized regression, on the other hand,
the objective function has two parts - the error term and the regularization term.
Ridge Regression
In ridge regression, an additional term - the sum of the squares of the coefficients - is added to the cost
function along with the error term.
In lasso regression, the regularisation term added is the sum of the absolute values of the coefficients.
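Writing aj for the coefficients and λ for the regularisation hyperparameter (the lecture slides may use different symbols), the two cost functions can be written as:

\text{Ridge: } \min_{a}\ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \;+\; \lambda \sum_{j=1}^{k} a_j^2

\text{Lasso: } \min_{a}\ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \;+\; \lambda \sum_{j=1}^{k} |a_j|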
These are the two commonly used regularised regression methods - Ridge regression and Lasso regression. Both
methods are used to make the regression model simpler while balancing the bias-variance trade-off.
You learnt that both Ridge and Lasso regularise the coefficients by reducing them in value, essentially
causing shrinkage of the coefficients. The amount of shrinkage each performs depends on the value of the
hyperparameter λ. In the process of shrinkage, Lasso shrinks some of the coefficients exactly to 0,
thus also performing variable selection.
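A minimal sketch of this behaviour using scikit-learn is given below; the synthetic dataset, the alpha values (scikit-learn's name for λ) and the exact coefficient pattern are assumptions made for illustration.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# synthetic data in which only 3 of the 10 predictors are truly informative (assumed setup)
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # alpha is the regularisation hyperparameter λ
lasso = Lasso(alpha=5.0).fit(X, y)

print("OLS:  ", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))   # coefficients shrunk, but typically all non-zero
print("Lasso:", np.round(lasso.coef_, 2))   # several coefficients typically shrunk exactly to 0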
Thus, the key observation here is that at the optimum solution for α (the point where the sum of the error and
regularisation terms is minimum), the corresponding regularisation contour and the error contour must 'touch' each
other tangentially and not 'cross'. In the contour plot shown in the lecture, the 'blue stars' highlight the touch points
between the error contours and the lasso regularisation contours, while the 'green stars' highlight the touch points
between the error contours and the ridge regularisation contours. The plot illustrates that, because of the 'corners'
in the lasso contours (unlike ridge regression), the touch points are more likely to lie on one or more of the axes,
which means the coefficients corresponding to the other axes become exactly zero. Hence, lasso regression also
serves as a variable selection method, whereas ridge regression does not.
While creating the best model for any problem statement, we end up choosing, from a set of candidate models, the
one which gives us the least test error. Hence, the test error, and not only the training error, needs to be estimated in
order to select the best model. This can be done in the following two ways.
1. Use metrics which take into account both model fit and simplicity. They penalise the model for being too
complex (i.e. for overfitting), and thus are more representative of the unseen 'test error'. Some examples of
such metrics are Mallow's Cp, AIC, BIC and Adjusted R².
2. Estimate the test error directly, by holding out a validation set or by using cross-validation.
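For reference, one common formulation of these criteria - for a least-squares model with n observations, d predictors, residual sum of squares RSS, total sum of squares TSS and estimated error variance σ̂² - is shown below; the exact scaling constants vary across textbooks, so treat these as representative rather than as the lecture's exact formulas.

C_p = \frac{1}{n}\left(\mathrm{RSS} + 2d\hat{\sigma}^2\right)

\mathrm{AIC} \propto \mathrm{RSS} + 2d\hat{\sigma}^2 \quad\text{(for least-squares fits)}

\mathrm{BIC} \propto \mathrm{RSS} + \log(n)\, d\hat{\sigma}^2

\text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n-d-1)}{\mathrm{TSS}/(n-1)}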
AIC and BIC are defined for models fit by maximum likelihood estimation. Notice that as we increase the
number of predictors d, the penalty terms in Cp, AIC and BIC all increase while the RSS decreases. Hence, the lower
the value of Cp, AIC or BIC, the better the fit of the model; the higher the Adjusted R², the better the fit of the model.
Now, we will look at the different methods of choosing the best set of predictors that shall give the least test error.
A brief explanation of the Best Subset Selection algorithm (run on a dataset with p features) is as follows (please
refer to the image below): You start with d=0 features, i.e. a null model M0 with no features. Now, as you increase d,
you consider every possible combination of d features, fit a model on each, and select the one which results in the
least RSS (or largest R²). This gives you a model Md with d features. Continue this iteration, increasing the value of d
by one, till you reach d=p, giving the models M0, M1, M2,.....,Mp.
Out of all these models M0, M1, M2,.....,Mp, select the best one, as measured by a criterion such as Cp, AIC, BIC,
Adjusted R² or mean cross-validated error.
We can see that the total number of models that need to be analysed for Best Subset Selection is 2^p, where p is the
total number of predictors. If we have 20 predictors, the total number of models is 2^20 = 1,048,576, which is over a
million. Hence, it becomes computationally infeasible to perform best subset selection for a number of predictors
greater than 40.
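A compact Python sketch of this procedure (using training RSS to pick the best model of each size, as described above) might look like the following; the helper name and the use of scikit-learn's LinearRegression are assumptions for illustration.

import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset_selection(X, y):
    # For each model size d, fit every combination of d features and keep the one
    # with the least training RSS; only feasible for small p since 2^p models are fit.
    n, p = X.shape
    best_per_size = {}
    for d in range(1, p + 1):
        best_rss, best_combo = np.inf, None
        for combo in itertools.combinations(range(p), d):
            cols = list(combo)
            model = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - model.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_combo = rss, cols
        best_per_size[d] = (best_combo, best_rss)
    # The final choice among M1..Mp should use Cp, AIC, BIC, Adjusted R^2 or
    # cross-validation, not the training RSS itself.
    return best_per_size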
A brief explanation of the forward selection algorithm (run on a dataset with p features) is as follows (please refer to
the flowchart below): You start with d=0 features, i.e. a null model M0 with no features. Now, out of the (p-d)
remaining features, you identify one additional feature which (when added to the model Md) results in the least RSS
(or largest R²). This gives you a model Md+1 with one additional feature. Continue this iteration by increasing the
value of d by one till you reach d=p and find the models M0, M1, M2,.....,Mp.
Out of all these models M0, M1, M2,.....,Mp, select the best one, as measured by a criterion such as Cp, AIC, BIC,
Adjusted R² or mean cross-validated error.
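A corresponding sketch of forward stepwise selection is given below; again, the function name is hypothetical and training RSS is used to pick the feature added at each step, with the final choice among M1,....,Mp left to Cp, AIC, BIC, Adjusted R² or cross-validation.

import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise_selection(X, y):
    # Start from the null model M0; at each step add the single remaining feature
    # whose inclusion gives the lowest training RSS, producing M1, M2, ..., Mp.
    n, p = X.shape
    selected, remaining, models = [], list(range(p)), {}
    for step in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:
            cols = selected + [j]
            model = LinearRegression().fit(X[:, cols], y)
            rss = np.sum((y - model.predict(X[:, cols])) ** 2)
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        models[step + 1] = (list(selected), best_rss)
    # As with best subset, choose among M1..Mp with Cp, AIC, BIC, Adjusted R^2 or CV.
    return models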
The backward selection algorithm is the opposite of the forward one - rather than starting with d=0 features and
adding a feature in each iteration, you start with d=p features (a model Mp, with all the features as predictors) and
remove a feature in every iteration - the one whose removal results in the least RSS (or largest R²) - and so find the
models Mp, Mp−1, Mp−2,............., M0.
Out of all these models Mp, Mp−1, Mp−2,............., M0, select the best one, as measured by a criterion such as Cp,
AIC, BIC, Adjusted R² or mean cross-validated error.
So, if the number of predictors is 40, there are just 1 + p(p+1)/2 = 821 models that need to be analysed, which is
significantly fewer than 2^40. In this way, stepwise selection is computationally better than Best Subset Selection,
but it also has limitations.
Stepwise Selection does not guarantee that we have chosen the best model. If Forward Stepwise Selection starts off
with predictor X1, then the best model with 2 predictors becomes {X1, X2}, since X2 is the next best predictor to add.
But the truly best model with 2 predictors may be {X2, X3}. The same issue can arise with Backward Stepwise
Selection. Also, though Forward Stepwise Selection can be applied when n < p (where n is the number of
observations), Backward Stepwise Selection cannot, as a full model cannot be fit when n < p.