
UNIT-2

REGRESSION
Linear regression is one of the most basic machine learning models. It is like the 'hello world'
program of machine learning. Linear regression is used when there is a linear relationship
between the input variables and the output variable. That means we can calculate the values of the output
variable using some kind of linear combination of the input variables. If there is only one input
variable, we call it 'Single Variable Linear Regression' or 'Univariate Linear Regression'.
In the case of more than one input variable, we call it 'Multi Variable Linear Regression'
or 'Multivariate Linear Regression'.

Objective Of Linear Model


Every machine learning model generalizes the relationship between the input variables and the
output variables. In the case of linear regression, since the relationship is linear, this generalization can
be represented by a simple line function. Consider the example below: input values are
plotted on the X axis and output values are plotted on the Y axis.

Since there are only a few data points, we can easily eyeball them and draw the best fit line, which
generalizes the relationship between the input and output variables for us.
Since this line generalizes the relationship between input and output values, for a prediction on a
given input value we can simply locate that value on the line, and the Y coordinate of that point gives us the
prediction.

Create Hypothesis Function


The linear model's hypothesis function is nothing but the line function. The equation of a line is
y = mx + b
where m = slope/gradient, x = input, b = Y intercept.
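
A minimal sketch of this hypothesis function in Python; the slope, intercept and input values are illustrative:

def predict(x, m, b):
    # Hypothesis function of the linear model: y = mx + b
    return m * x + b

# Example: slope 2, intercept 1, input 3 -> prediction 7
print(predict(3, m=2, b=1))
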
Least Squares Method
Line of Best Fit
Our aim is to calculate the values m (slope) and b (y-intercept) in the equation of a line :
y = mx + b.
Where:
y = how far up
x = how far along
m = Slope or Gradient (how steep the line is)
b = the Y Intercept (where the line crosses the Y axis).

Steps
To find the line of best fit for N points:

Step 1: For each (x,y) point calculate x² and xy

Step 2: Sum all x, y, x² and xy, which gives us Σx, Σy, Σx² and Σxy (Σ means "sum up")

Step 3: Calculate Slope m:

m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)

(N is the number of points.)

Step 4: Calculate Intercept b:

b = (Σy − m Σx) / N

Step 5: Assemble the equation of a line

y = mx + b

Example: Sam found how many hours of sunshine vs how many ice creams were sold at the shop
from Monday to Friday:

"x" "y"
Hours of Ice Creams
Sunshine Sold
2 4
3 5
5 7
7 10
9 15

Let us find the best m (slope) and b (y-intercept) that suits that data, y = mx + b.

Step 1: For each (x,y) calculate x² and xy:

x   y    x²    xy
2   4    4     8
3   5    9     15
5   7    25    35
7   10   49    70
9   15   81    135

Step 2: Sum x, y, x² and xy (gives us Σx, Σy, Σx² and Σxy):

x   y    x²    xy
2   4    4     8
3   5    9     15
5   7    25    35
7   10   49    70
9   15   81    135
Σx: 26   Σy: 41   Σx²: 168   Σxy: 263

Also N (number of data values) = 5

Step 3: Calculate Slope m:

m = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)

  = (5 × 263 − 26 × 41) / (5 × 168 − 26²)

  = (1315 − 1066) / (840 − 676)

  = 249 / 164 = 1.5183...

Step 4: Calculate Intercept b:

b = (Σy − m Σx) / N

  = (41 − 1.5183 × 26) / 5

  = 0.3049...

Step 5: Assemble the equation of a line:

y = mx + b

y = 1.518x + 0.305

Let's see how it works out:

x   y    y = 1.518x + 0.305   error
2   4          3.34           −0.66
3   5          4.86           −0.14
5   7          7.89            0.89
7   10        10.93            0.93
9   15        13.97           −1.03

Plotting the (x,y) points and the line y = 1.518x + 0.305 on a graph shows a nice fit.

Sam hears the weather forecast which says "we expect 8 hours of sun tomorrow", so he uses the
above equation to estimate that he will sell

y = 1.518 × 8 + 0.305 = 12.45 Ice Creams.
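
The steps above can be written directly in Python. This minimal sketch applies the least squares formulas to the ice cream data and reproduces the slope, intercept and prediction computed above:

# Least squares fit for the ice cream example
xs = [2, 3, 5, 7, 9]       # hours of sunshine
ys = [4, 5, 7, 10, 15]     # ice creams sold

N = len(xs)
sum_x  = sum(xs)
sum_y  = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

m = (N * sum_xy - sum_x * sum_y) / (N * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / N

print(m, b)       # approximately 1.518 and 0.305
print(m * 8 + b)  # prediction for 8 hours of sun, approximately 12.45
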

Mean Absolute Error(MAE)

MAE is a very simple metric which calculates the absolute difference between actual and
predicted values. To better understand, take an example: you have input data and output data
and use Linear Regression, which draws a best-fit line.

Now you must find the MAE of your model, which is basically the mistake made by the model,
known as an error. Find the difference between the actual value and the predicted value, which is
the absolute error, but we must find the mean absolute error over the complete dataset. So, sum all the
absolute errors and divide them by the total number of observations, and this is MAE. We aim for a
minimum MAE because this is a loss.
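
A minimal sketch of MAE in Python, using the actual and predicted values from the table above as illustrative inputs:

import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Mean of the absolute differences between actual and predicted values
    return np.mean(np.abs(np.array(y_true) - np.array(y_pred)))

print(mean_absolute_error([4, 5, 7, 10, 15], [3.34, 4.86, 7.89, 10.93, 13.97]))
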

Advantages of MAE

 The MAE you get is in the same unit as the output variable.
 It is the most robust to outliers.

Disadvantages of MAE

 The graph of MAE is not differentiable at zero, so we have to apply optimizers like
Gradient Descent, which require a differentiable loss.

Mean Squared Error(MSE)

MSE is a widely used and very simple metric, with a small change from mean absolute error.

Mean squared error means finding the squared difference between the actual and predicted
values. So, above we were finding the absolute difference, and here we are finding the squared
difference. What does MSE actually represent? It represents the squared distance between the actual
and predicted values. We square the differences to avoid the cancellation of negative terms, and this is the
benefit of MSE.
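
A minimal MSE sketch in Python, reusing the same illustrative actual and predicted values as in the MAE example above:

import numpy as np

def mean_squared_error(y_true, y_pred):
    # Mean of the squared differences; squaring avoids cancellation of negative errors
    return np.mean((np.array(y_true) - np.array(y_pred)) ** 2)

print(mean_squared_error([4, 5, 7, 10, 15], [3.34, 4.86, 7.89, 10.93, 13.97]))
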

Advantages of MSE

The graph of MSE is differentiable, so you can easily use it as a loss function.

Disadvantages of MSE

 The value you get after calculating MSE is in a squared unit of the output. For example, if the
output variable is in metres (m), then after calculating MSE the value we get is in metres
squared.
 If you have outliers in the dataset, then MSE penalizes the outliers the most and the calculated
MSE is bigger. So, in short, it is not robust to outliers, which was an advantage of MAE.

Root Mean Squared Error(RMSE)

As is clear from the name itself, RMSE is simply the square root of the mean squared error.

Advantages of RMSE
 The output value you get is in the same unit as the required output variable which makes
interpretation of loss easy.

Disadvantages of RMSE

 It is not that robust to outliers as compared to MAE. For computing RMSE we have to
apply the NumPy square root function over MSE.
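
A minimal sketch in Python: as noted above, RMSE is obtained by applying the NumPy square root to MSE:

import numpy as np

def root_mean_squared_error(y_true, y_pred):
    mse = np.mean((np.array(y_true) - np.array(y_pred)) ** 2)
    return np.sqrt(mse)  # the square root brings the error back to the unit of the output
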

R Squared (R2)

R2 score is a metric that tells you how well your model performed, rather than the loss in an absolute
sense. In contrast, MAE and MSE depend on the context, as we have seen, whereas the R2 score is
independent of context.

So, with the help of R squared we have a baseline model to compare against, which none of the
other metrics provides. It is similar to the threshold in classification problems, which is fixed at 0.5.
So basically R squared calculates how much better the regression line is than a mean line. Hence,
R squared is also known as the Coefficient of Determination, or sometimes also known as Goodness of Fit.
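
R squared can be computed by comparing the squared error of the regression line with the squared error of the mean line. A minimal sketch in Python, assuming the usual definition R² = 1 − SS_res / SS_tot:

import numpy as np

def r_squared(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the regression line
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the mean line
    return 1 - ss_res / ss_tot
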

Linear Regression using Gradient Descent Algorithm

Gradient descent is an optimization algorithm used to minimize some function by iteratively
moving in the direction of steepest descent, as defined by the negative of the gradient. In machine
learning, we use gradient descent to update the parameters of our model. When there is more than
one input, you can optimize the values of the coefficients by iteratively minimizing the
error of the model on your training data. This is called Gradient Descent and works by starting with
random values for each coefficient. The sum of squared errors is calculated over the pairs of input
and output variables.
A learning rate is used as a scale factor, and the
coefficients are updated in the direction that minimizes the error. The process is repeated until a
minimum sum of squared errors is achieved or no further improvement is possible. When using this
method, the learning rate alpha determines the size of the improvement step to take on each iteration of the
procedure. In practice, Gradient Descent is useful when there is a large dataset, either in number of
rows or number of columns.
The Gradient Descent algorithm determines the values of m and c such that the line corresponding to
those values is the best fitting line / gives minimum error.
First we need to calculate the loss function. The loss function is defined as the error in our
predicted value of m and c. Our aim is to minimize this error to obtain the most accurate values of m
and c. To calculate the loss, the Mean Squared Error function is used:
· The difference between the actual and predicted y value is found.
· The difference is squared.
· Then, the mean of the squares over every value of x is calculated.
Now, we have to minimize the loss with respect to "m" and "c". To minimize these parameters, the Gradient
Descent Algorithm is used.

1. Initially, let m = 0 and c = 0, and choose a learning rate L, which controls how much the value of "m"
changes with each step. The smaller the L, the greater the accuracy; L = 0.001 gives good accuracy.

2. Calculate the partial derivative of the loss function with respect to "m", and plug in the current values of
x, y, m and c to get the derivative D_m:

D_m = (−2/n) Σ x (y − y_pred)

3. Similarly, calculate the derivative with respect to c:

D_c = (−2/n) Σ (y − y_pred)

4. Now, update the current values of m and c:

m = m − L × D_m
c = c − L × D_c

5. We repeat this process until the loss function is very small, i.e. ideally 0% error (100% accuracy). A code
sketch of these steps is shown below.
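
A minimal sketch of these steps in Python, reusing the ice cream data from earlier as illustrative training data:

import numpy as np

x = np.array([2, 3, 5, 7, 9], dtype=float)
y = np.array([4, 5, 7, 10, 15], dtype=float)

m, c = 0.0, 0.0   # step 1: start from zero
L = 0.001         # learning rate
n = len(x)

for _ in range(100000):
    y_pred = m * x + c
    D_m = (-2 / n) * np.sum(x * (y - y_pred))  # step 2: derivative of the MSE loss wrt m
    D_c = (-2 / n) * np.sum(y - y_pred)        # step 3: derivative wrt c
    m = m - L * D_m                            # step 4: update m
    c = c - L * D_c                            # and update c

print(m, c)  # converges towards the least squares solution (about 1.518 and 0.305)
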

Multivariate Regression
Multivariate Regression is a supervised machine learning algorithm involving multiple data
variables for analysis. A Multivariate regression is an extension of multiple regression with
one dependent variable and multiple independent variables. Based on the number of
independent variables, we try to predict the output.
Multivariate regression tries to find out a formula that can explain how factors in variables
respond simultaneously to changes in others.
There are numerous areas where multivariate regression can be used. Let’s look at some
examples to understand multivariate regression better.

1. Paula wants to estimate the price of a house. She will collect details such as the
location of the house, the number of bedrooms, the size in square feet, and whether amenities
are available or not. Based on these details, the price of the house can be predicted, along
with how the variables are interrelated.
2. An agriculture scientist wants to predict the total crop yield expected for the
summer. He collects details of the expected amount of rainfall, the fertilizers to be
used, and the soil conditions. By building a multivariate regression model, the scientist
can predict the crop yield. Along with the crop yield, the scientist also tries to understand
the relationships among the variables.
3. If an organization wants to know how much it has to pay a new hire, it will
take into account many details such as education level, years of experience, job
location, and whether the candidate has a niche skill or not. Based on this information, the salary of an
employee can be predicted, along with how these variables help in estimating the salary.
4. Economists can use Multivariate regression to predict the GDP growth of a state or
a country based on parameters like total amount spent by consumers, import
expenditure, total gains from exports, total savings, etc.
5. A company wants to predict the electricity bill of an apartment, the details needed
here are the number of flats, the number of appliances in usage, the number of
people at home, etc. With the help of these variables, the electricity bill can be
predicted.
Mathematical Equation
The simple regression linear model represents a straight line meaning y is a function of x.
When we have an extra dimension (z), the straight line becomes a plane.
Here, the plane is the function that expresses y as a function of x and z. The linear regression
equation can now be expressed as:
y = m1.x + m2.z + c
y is the dependent variable, that is, the variable that needs to be predicted.
x is the first independent variable. It is the first input.

m1 is the slope of x. It lets us know the angle of the line (x).
z is the second independent variable. It is the second input.
m2 is the slope of z. It helps us to know the angle of the line (z).
c is the intercept. A constant that finds the value of y when x and z are 0.

The equation for a model with two input variables can be written as:
y = β0 + β1.x1 + β2.x2
What if there are three variables as inputs? Human visualizations can be only three
dimensions. In the machine learning world, there can be n number of dimensions. The
equation for a model with three input variables can be written as:
y = β0 + β1.x1 + β2.x2 + β3.x3
Below is the generalized equation for the multivariate regression model-
y = β0 + β1.x1 + β2.x2 +….. + βn.xn
Where n represents the number of independent variables, β0~ βn represents the coefficients
and x1~xn, is the independent variable.
The multivariate model helps us in understanding and comparing coefficients across the
output. Here, a smaller cost function makes the multivariate linear regression a better model.
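
A minimal sketch of the generalized hypothesis y = β0 + β1.x1 + ... + βn.xn in Python; the sample inputs and coefficient values are illustrative, not taken from the examples above:

import numpy as np

def predict(X, beta):
    # X: matrix of inputs (one row per sample), beta: [β0, β1, ..., βn]
    return beta[0] + X @ beta[1:]

X = np.array([[2.0, 1.0],
              [3.0, 4.0]])         # two samples, two input variables
beta = np.array([0.5, 1.2, -0.3])  # β0, β1, β2 (illustrative values)
print(predict(X, beta))
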

Cost Function
The cost function is a function that assigns a cost to samples when the model differs from the
observed data. This equation is the sum of the squares of the differences between the predicted
values and the actual values, divided by twice the length of the dataset. A smaller mean squared
error implies a better performance. Here, the cost is the sum of squared errors.

Cost of Multiple Linear Regression:

J = (1 / 2N) × Σ (y_pred − y_actual)²
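
A minimal sketch of this cost in Python, following the description above (sum of squared differences divided by twice the dataset length):

import numpy as np

def cost(y_true, y_pred):
    # Sum of squared errors divided by twice the number of samples
    n = len(y_true)
    return np.sum((np.array(y_pred) - np.array(y_true)) ** 2) / (2 * n)
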

Steps of Multivariate Regression analysis

Steps involved for Multivariate regression analysis are feature selection and feature
engineering, normalizing the features, selecting the loss function and hypothesis, set
hypothesis parameters, minimize the loss function, testing the hypothesis, and generating the
regression model.

 Feature selection-

The selection of features is an important step in multivariate regression. Feature selection is
also known as variable selection. It is important for us to pick significant variables for
better model building.

 Normalizing Features-

We need to scale the features as this maintains the general distribution and ratios in the data. This will
lead to an efficient analysis. The value of each feature can also be changed.

 Select Loss function and Hypothesis-

The loss function quantifies the error, meaning how much the hypothesis
prediction deviates from the actual values. Here, the hypothesis is the predicted value from the
feature/variable.

 Set Hypothesis Parameters-

The hypothesis parameter needs to be set in such a way that it reduces the loss function and
predicts well.

 Minimize the Loss Function-

The loss function needs to be minimized by using a loss minimization algorithm on the
dataset, which will help in adjusting hypothesis parameters. After the loss is minimized, it
can be used for further action. Gradient descent is one of the algorithms commonly used for
loss minimization.

 Test the hypothesis function-

The hypothesis function needs to be checked as well, since it is what predicts the values. Once this
is done, it has to be tested on test data.

Polynomial Regression

o Polynomial Regression is a regression algorithm that models the relationship between a
dependent variable (y) and an independent variable (x) as an nth degree polynomial. The Polynomial
Regression equation is given below:

y = b0 + b1x + b2x² + b3x³ + ...... + bnxⁿ

o It is also called a special case of Multiple Linear Regression in ML, because we add
some polynomial terms to the Multiple Linear Regression equation to convert it into
Polynomial Regression.
o It is a linear model with some modification in order to increase the accuracy.
o The dataset used in Polynomial Regression for training is of a non-linear nature.
o It makes use of a linear regression model to fit complicated and non-linear functions
and datasets.

o Consider a dataset which is arranged non-linearly. If we try
to cover it with a linear model, then we can clearly see that it hardly covers any data
point. On the other hand, a curve is suitable to cover most of the data points, which is what
the Polynomial model does.
o Hence, if the dataset is arranged in a non-linear fashion, then we should use the
Polynomial Regression model instead of Simple Linear Regression. A code sketch follows below.
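
A minimal sketch of polynomial regression in Python using NumPy's polyfit; the degree (2) and the data are illustrative choices:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 3.9, 9.1, 15.8, 25.2])  # roughly quadratic, illustrative data

# Fit y = b0 + b1*x + b2*x^2 by least squares on the polynomial terms
coeffs = np.polyfit(x, y, deg=2)  # returns [b2, b1, b0]
y_hat = np.polyval(coeffs, x)     # predictions of the fitted curve

print(coeffs)
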

Regularization Techniques

As model complexity increases, our model tries to learn the details and the
noise in the training data too well, which ultimately results in poor performance on unseen data.
In other words, as the complexity of the model increases,
the training error reduces but the testing error doesn't.

How does Regularization help reduce Overfitting?

Let's consider a neural network which is overfitting on the training data.

Assume that our regularization coefficient is so high that some of the weight matrices are nearly
equal to zero. This will result in a much simpler, nearly linear network and slight underfitting of the training data.
Such a large value of the regularization coefficient is not that useful. We need to optimize the
value of the regularization coefficient in order to obtain a well-fitted model.

Different Regularization Techniques

L2 & L1 regularization

L1 and L2 are the most common types of regularization. These update the general cost function

by adding another term known as the regularization term.

Cost function = Loss (say, binary cross entropy) + Regularization term

Due to the addition of this regularization term, the values of weight matrices decrease because it

assumes that a neural network with smaller weight matrices leads to simpler models. Therefore,

it will also reduce overfitting to quite an extent.

However, this regularization term differs in L1 and L2.

In L2, we have:

Cost function = Loss + λ × Σ w²

Here, lambda is the regularization parameter. It is the hyperparameter whose value is optimized

for better results. L2 regularization is also known as weight decay as it forces the weights to

decay towards zero (but not exactly zero).

In L1, we have:

Cost function = Loss + λ × Σ |w|

In this, we penalize the absolute value of the weights. Unlike L2, the weights may be reduced to
zero here. Hence, it is very useful when we are trying to compress our model. Otherwise, we
usually prefer L2 over it.
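
A minimal sketch of the two regularization terms in Python; w is the weight vector and lam stands for lambda (both names are illustrative):

import numpy as np

def l2_penalty(w, lam):
    # L2 / weight decay term: lambda times the sum of squared weights
    return lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    # L1 term: lambda times the sum of absolute weights; can drive weights to exactly zero
    return lam * np.sum(np.abs(w))

# Regularized cost = loss + penalty, e.g.:
# cost = loss(y_true, y_pred) + l2_penalty(w, lam)
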

Dropout

This is one of the most interesting types of regularization techniques. It also produces very
good results and is consequently the most frequently used regularization technique in the field of
deep learning.

To understand dropout, consider a standard fully connected neural network.

So what does dropout do? At every iteration, it randomly selects some nodes and removes them
along with all of their incoming and outgoing connections.
So each iteration has a different set of nodes and this results in a different set of outputs. It can

also be thought of as an ensemble technique in machine learning.

Ensemble models usually perform better than a single model as they capture more randomness.

Similarly, dropout also performs better than a normal neural network model.
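
A minimal sketch of the dropout idea using a random mask in NumPy; the keep probability of 0.5 is an illustrative choice, and deep learning frameworks provide this as a built-in layer:

import numpy as np

def dropout(activations, keep_prob=0.5):
    # Randomly zero out nodes; scale the survivors so the expected sum stays unchanged
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob

a = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout(a))  # a different subset of nodes is dropped on every call
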

Generalization, Overfitting, and Under-fitting

Consider that we are designing a machine learning model. A model is said to be a good
machine-learning model if it generalizes any new input data from the problem domain in a
proper way. This helps us to make predictions on future data that the model has never seen.

Whenever we work on a data set to predict or classify a problem, we tend to measure accuracy by
running the model first on the training set, then on the test set. If the accuracy is satisfactory, we
tend to increase the prediction accuracy either by adding or removing data features, by
feature selection, or by applying feature engineering to our machine-learning model. But sometimes
our model may still give poor results. This can be explained by overfitting and underfitting, which
are majorly responsible for the poor performance of machine learning algorithms.

A statistical model or a machine-learning algorithm is said to be underfitting when it cannot
capture the underlying trend of the data. We want the model to learn from the training data, but
we don't want it to learn too much (i.e. too many patterns). One solution could be to stop the
training earlier. However, this could lead the model to not learn enough patterns from the
training data, and possibly not even capture the dominant trend. Underfitting (i.e. high bias) is
just as bad for the generalization of the model as overfitting. With high bias, the model might not have
enough flexibility in terms of line fitting, resulting in a simplistic line that does not generalize
well.

When we run our training algorithm on the data set, we allow the overall cost (i.e. the distance from
each point to the line) to become smaller with more iterations. Leaving this training algorithm to
run for a long time leads to a minimal overall cost. However, this means that the line will be fit to all
the points (including noise), catching secondary patterns that may not be needed for the
generalizability of the model.


Bias: It tells us how close our predictive model is to the training data after averaging the predicted
values. Generally, simple algorithms have high bias, which helps them to learn fast and makes them easy to
understand, but they are less flexible. This loses the ability to predict complex problems, so the model fails to
capture the underlying trend. This results in underfitting of our model.
Variance: It is defined as the deviation of predictions; in simple terms, it is the amount that tells us how much
the predicted value will be affected when the data points change or different data is used, for the same model.
Ideally, the predicted value which we get from the model should remain the same even when changing from one
training data set to another, but if the model has high variance, then the model's predicted values are affected
by the particular dataset.
