Module 3


Linear Models
Linear Regression

• Linear Regression is a machine learning algorithm based on supervised learning.
• Linear regression predicts the value of a dependent variable (y) from a given
independent variable (x). This regression technique therefore finds a linear relationship between x (input)
and y (output); hence the name Linear Regression.
• The hypothesis of Linear Regression is
• y = m·x + c
• The model obtains the best-fit regression line by finding the best m (slope) and
c (intercept) values.
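As a quick illustration (not part of the original slides), a minimal sketch that fits m and c with scikit-learn on made-up data:

```python
# Minimal sketch: fit y = m*x + c with scikit-learn (data values are illustrative assumptions).
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # independent variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])             # dependent variable (roughly y = 2x)

model = LinearRegression().fit(x, y)
print("slope m:", model.coef_[0])        # best-fit slope
print("intercept c:", model.intercept_)  # best-fit intercept
```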
Least Square Method

• The least squares method is a statistical procedure that finds
the best fit for a set of data points by minimizing the sum
of the squared offsets (residuals) of the points from the fitted
curve.
• Least squares regression is used to predict the behavior
of dependent variables.
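For the simple line y = m·x + c, the least squares estimates have a standard closed form (added here for reference):

```latex
m = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
c = \bar{y} - m\,\bar{x}
```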
Cost Function

• The model aims to predict the y value such that the difference (error) between the predicted value
and the true value is minimal.
• The cost function (J) of Linear Regression is the Root Mean Squared Error (RMSE) between the
predicted y value (pred) and the true y value (y).
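Written out, following the slide's RMSE definition:

```latex
J(m, c) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{pred}_i - y_i\right)^2},
\qquad \mathrm{pred}_i = m\,x_i + c
```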

GRADIENT DESCENT IN
LINEAR REGRESSION

• An algorithm that minimizes a loss function by iteratively optimizing the
parameters
• New value = old value – step size
• New value = old value – learning rate × slope (gradient)
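A minimal sketch of this update rule for y = m·x + c; the data, learning rate, and iteration count are illustrative assumptions:

```python
# Minimal gradient descent sketch for y = m*x + c with an MSE loss (illustrative values).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, c = 0.0, 0.0          # initial parameter values
learning_rate = 0.01     # alpha: the step-size multiplier
n = len(x)

for _ in range(1000):
    pred = m * x + c
    # Gradients of the MSE cost with respect to m and c
    dm = (2.0 / n) * np.sum((pred - y) * x)
    dc = (2.0 / n) * np.sum(pred - y)
    # new value = old value - learning rate * slope
    m -= learning_rate * dm
    c -= learning_rate * dc

print("m:", m, "c:", c)
```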
What is a Cost Function? (Linear Regression)

Alpha (α) – Learning Rate

• If the learning rate is too high, we might OVERSHOOT the minimum and keep bouncing around
without ever reaching it.
• If the learning rate is too small, training may take too long to converge.
• Blog to study:
• https://www.analyticsvidhya.com/blog/2021/08/understanding-gradient-descent-algorithm-and-the-maths-behind-it/
Plotting the Gradient Descent Algorithm
• When we have a single parameter (theta), we can plot the cost (the dependent
variable) on the y-axis and theta on the x-axis, as in the sketch below. If there are
two parameters, we can go with a 3-D plot, with cost on one axis and the two
parameters (thetas) along the other two axes.
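A small illustrative sketch of such a single-parameter plot (the data and the range of theta values are assumptions):

```python
# Sketch: plot cost J(theta) against a single parameter theta (illustrative data).
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x                               # assume the true slope is 2, intercept 0

thetas = np.linspace(0.0, 4.0, 100)       # candidate slope values
costs = [np.mean((t * x - y) ** 2) for t in thetas]   # MSE cost for each theta

plt.plot(thetas, costs)
plt.xlabel("theta (slope)")
plt.ylabel("cost J(theta)")
plt.title("Cost vs. a single parameter")
plt.show()
```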
Regularization in ML

• Problem: Overfitting
• Solution: Regularization. This is a form of regression that constrains/regularizes or shrinks the
coefficient estimates towards zero. In other words, this technique discourages learning a
more complex or flexible model, so as to avoid the risk of overfitting.
• A simple Linear Regression:
• y = mx + c
Regression Regularization

• In particular, regularization is implemented to avoid overfitting of the data, especially
when there is a large gap between train-set and test-set performance.
• With regularization, the number of features used in training is kept constant, yet the
magnitude of the coefficients (β), as seen in the equation, is reduced.
Intuition for Regression

• While there are quite a number of predictors, RM and RAD
have the largest coefficients. The implication is
that housing prices will be driven most significantly by these
two features, leading to overfitting, where generalizable
patterns have not been learned.
• There are different ways of reducing model complexity and
preventing overfitting in linear models. These include ridge
and lasso regression models.

Image of coefficients used to predict house prices (shown on the original slide).


Ridge Regression

• The RSS is modified by adding a shrinkage quantity.
• λ is the tuning parameter that decides how much we want to penalize the flexibility of
our model.
• Ridge adds the sum of the squares of the coefficients to the optimization objective. Thus, ridge
regression optimizes the following:
• Objective = RSS + α * (sum of squares of coefficients)
• The penalty term used by this method is known as the L2 norm.
• Ridge performs L2 regularization, i.e. it adds a penalty equivalent to the square of the magnitude of
the coefficients.
Ridge Regression

• Objective = RSS + α * (sum of squares of coefficients)
• α = 0:
• The objective becomes the same as simple linear regression.
• We’ll get the same coefficients as simple linear regression.
• α = ∞:
• The coefficients will be zero. Why? Because of the infinite weight on the squares of the coefficients, anything other than
zero will make the objective infinite.
• 0 < α < ∞:
• The magnitude of α decides the weight given to the different parts of the objective.
• The coefficients will be somewhere between 0 and those of simple linear regression.
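A minimal sketch of how α shrinks ridge coefficients in scikit-learn; the data and the particular alpha values are illustrative assumptions:

```python
# Sketch: effect of alpha on ridge coefficients (illustrative random data).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=100)

print("OLS:              ", LinearRegression().fit(X, y).coef_)
for alpha in [0.01, 1.0, 100.0]:
    # Larger alpha -> stronger shrinkage of the coefficient magnitudes
    print(f"Ridge alpha={alpha:>6}:", Ridge(alpha=alpha).fit(X, y).coef_)
```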
LASSO
Least Absolute Shrinkage and Selection Operator

• This variation differs from ridge regression only in how it penalizes high coefficients.
• It uses |βj| (the modulus) instead of the squares of β as its penalty. In statistics, this is known as
the L1 norm.
• Lasso regression performs L1 regularization, i.e. it adds the sum of the absolute values
of the coefficients to the optimization objective. Thus, lasso regression optimizes the
following:
• Objective = RSS + α * (sum of absolute values of coefficients)
LASSO

• Like ridge, α can take various values. Let’s iterate through them briefly:
• α = 0: same coefficients as simple linear regression
• α = ∞: all coefficients zero (same logic as before)
• 0 < α < ∞: coefficients between 0 and those of simple linear regression
• Alpha (α) can be any real-valued number between zero and infinity; the larger the value,
the more aggressive the penalization.
• Lasso regression shrinks the coefficients and helps to reduce model complexity and
multi-collinearity.
Key Takeaway for Lasso: Lasso Regression for Model
Selection

• Because coefficients are shrunk towards a mean of zero, the less important
features in a dataset are eliminated when penalized.
• The shrinkage of these coefficients, based on the alpha value provided, leads to a form
of automatic feature selection, as input variables are removed in an effective way.
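A minimal sketch (on illustrative random data) of lasso driving the weak coefficients to exactly zero, which is what enables this feature selection:

```python
# Sketch: lasso zeroing out weak coefficients (illustrative random data).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features truly matter here.
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
print("lasso coefficients:", lasso.coef_)   # the last three should be (near) zero
```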
Why Lasso can be Used for Model Selection, but not Ridge
Regression
• The elliptical contours (red circles) are the cost
functions for each.
• Since the lasso constraint region takes a diamond shape in
the plot, each time the elliptical contours intersect
one of its corners, at least one of the coefficients becomes zero. This
is impossible in the ridge regression model, as its constraint region
forms a circular shape; values can therefore
be shrunk close to zero, but never exactly to zero.
Conclusion
❑ The cost functions for ridge and lasso regression are similar. However,
ridge regression takes the square of the coefficients while lasso takes their
magnitude (absolute value).
❑ Lasso regression can be used for automatic feature selection, as the
geometry of its constrained region allows coefficient values to shrink to exactly zero.
❑ An alpha value of zero in either the ridge or lasso model gives results
identical to simple linear regression.
❑ The larger the alpha value, the more aggressive the penalization.
What is Hyperplane ??
• For a linearly separable dataset having n features, a hyperplane is basically an (n – 1)-dimensional
subspace used for separating the dataset into two sets, each set containing data points belonging to a
different class.
• For example, for a dataset having two features X and Y (therefore lying in a 2-dimensional space), the
separating hyperplane is a line (a 1-dimensional subspace).
• Similarly, for a dataset in 3 dimensions, we have a 2-dimensional separating hyperplane, and so
on.
• In machine learning, the Support Vector Machine (SVM) is a non-probabilistic, linear, binary classifier
used for classifying data by learning a hyperplane that separates the data.
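In symbols (standard notation, added here for reference), a hyperplane is the set of points satisfying

```latex
\mathbf{w} \cdot \mathbf{x} + b = 0
```

where w is the weight (normal) vector and b the bias; a point x is then classified by the sign of w·x + b.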
What is SVM ?

• The foundations of SVM were developed by Vladimir Vapnik (with Alexey Chervonenkis) in the 1960s–70s.
• Vapnik envisaged that coming up with a decision boundary that tries to maximize the margin
between the two classes would give great results and overcome the problem of overfitting.
• In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the
number of features you have). Then, we perform classification by finding the hyperplane that
best differentiates the two classes.
• Later, the kernel method was introduced, which made it possible to solve non-linear problems using SVM.
Find the Right Hyperplane: Scenarios
Margin (maximum)

Our objective is to find a plane that has the maximum
margin, i.e. the maximum distance between data points of
both classes.

Support vectors are the data points that lie closest to the
hyperplane and influence its position and orientation.
The margins are calculated using these data points, known as
support vectors; in other words, support vectors are the points
near the hyperplane that help in orienting it.
Algorithm
Intuition of SVM
• In SVM, we take the output of the linear function: if that output is greater than or equal to 1, we
identify the point with one class, and if it is less than or equal to -1, we identify it with the other class.

• Step 1: The SVM algorithm predicts the classes. One of the classes is identified as 1 while the
other is identified as -1.
Step 2

• Convert the problem into a mathematical equation involving unknowns. These unknowns
are then found by converting the problem into an optimization problem.
• As optimization problems always aim at maximizing or minimizing something while
tweaking the unknowns, in the case of the SVM classifier a loss function
known as the hinge loss is used and tweaked to find the maximum margin.
Step 3: Loss function

• If the cost function is zero, no class is predicted incorrectly.
• The problem is that there is a trade-off between maximizing the margin and the loss generated
if the margin is pushed too far. To bring these concepts together, a
regularization parameter is added (see the objective written out below).
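The regularized hinge-loss objective referred to here is usually written as follows (standard form, added for reference; λ is the regularization parameter and the labels yᵢ are ±1):

```latex
J(\mathbf{w}, b) = \lambda \lVert \mathbf{w} \rVert^2
+ \frac{1}{n} \sum_{i=1}^{n} \max\!\big(0,\; 1 - y_i\,(\mathbf{w} \cdot \mathbf{x}_i + b)\big)
```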
Step 4: Partial derivative

• We take partial derivatives of this objective with respect to the weights to find the gradients. Using the
gradients, we can update our weights.
Understanding the loss function/svm/kernels in svm

• https://www.geeksforgeeks.org/hinge-loss-relationship-with-support-vector-machines/
• https://iq.opengenus.org/hinge-loss-for-svm/
• Kernel SVM
• https://www.analyticsvidhya.com/blog/2021/10/support-vector-machinessvm-a-complete-guide-for-beginners/
Step 5: Update weight
• When there is no misclassification, i.e. our model correctly predicts the class of a data
point, we only have to update the gradient from the regularization parameter.
• When there is a misclassification, i.e. our model makes a mistake on the prediction of the
class of a data point, we include the loss along with the regularization parameter to
perform the gradient update (see the sketch below).
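A minimal sketch of this two-case update (sub-gradient descent on the hinge loss); the toy data, learning rate, and λ are illustrative assumptions:

```python
# Sketch: sub-gradient descent for a linear SVM with hinge loss (illustrative values).
import numpy as np

# Tiny linearly separable toy set; labels must be +1 / -1.
X = np.array([[2.0, 3.0], [3.0, 3.5], [1.0, 1.0], [0.5, 0.2]])
y = np.array([1, 1, -1, -1])

w = np.zeros(X.shape[1])
b = 0.0
lr, lam = 0.01, 0.01          # learning rate and regularization strength (assumed)

for _ in range(1000):
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) >= 1:
            # Correctly classified with margin: only the regularization gradient.
            w -= lr * (2 * lam * w)
        else:
            # Misclassified / inside the margin: include the hinge-loss gradient too.
            w -= lr * (2 * lam * w - yi * xi)
            b -= lr * (-yi)

print("w:", w, "b:", b)
```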
Introduction to kernels

• When we can easily separate the data with a hyperplane by drawing a straight line, we use a Linear
SVM.
• When we cannot separate the data with a straight line, we use a Non-Linear SVM. For this, we
have kernel functions.
• A kernel transforms the data into another dimension so that the data can be classified.
• For example, it can transform two variables x and y into three variables by adding a third variable z.
Kernel Trick
The datasets you will be working on, or are
currently working on, might not always be
linear. One approach to handling
nonlinear datasets is to add more
features, such as polynomial features; in
some cases, this can result in a linearly
separable dataset.
Consider the left plot in Figure 1: it
represents a simple dataset with just one
feature x1. This dataset is not linearly
separable, as you can see. But if you add
a second feature x2 = (x1)², the resulting
2-D dataset is perfectly linearly separable, as in the sketch below.
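A minimal sketch of that idea on made-up 1-D data; the values and classifier settings are illustrative assumptions:

```python
# Sketch: adding x2 = x1**2 turns a non-separable 1-D dataset into a separable 2-D one.
import numpy as np
from sklearn.svm import LinearSVC

x1 = np.array([-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1, 1, 0, 0, 0, 0, 0, 1, 1])   # class 1 only at the extremes

X = np.column_stack([x1, x1 ** 2])          # add the second feature x2 = x1^2
clf = LinearSVC(C=10.0, max_iter=10000).fit(X, y)
print("training accuracy:", clf.score(X, y))  # expected 1.0: now linearly separable
```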
Kernel Trick

Kernel functions (a quadratic function is one example):
• Polynomial Kernel
• Sigmoid Kernel
• RBF Kernel
Polynomial Kernel

• A polynomial kernel is a kind of SVM kernel that uses a polynomial function to map the
data into a higher-dimensional space. It does this by applying a polynomial function to the dot product
of the data points in the original space.
• The important terms to note are x1, x2, x1², x2², and x1·x2. When these new terms are
computed, the non-linear dataset is converted into a higher dimension that has the
features x1², x2², and x1·x2.
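For reference, the general polynomial kernel and its degree-2 expansion (standard formulas, added here) show where those terms come from:

```latex
K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} + c)^{d},
\qquad
(x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2\,x_1 x_2\, y_1 y_2
```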
RBF Kernel

• The squared Euclidean distance between two points is multiplied
by the gamma parameter, and the exponential of the (negated) result is taken.
• where:
1. γ = 1 / (2σ²), and ‘σ’ is the variance, our hyperparameter
2. ||X₁ - X₂|| is the Euclidean (L₂-norm) distance
between the two points X₁ and X₂
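Written out (standard form of the RBF kernel, added for reference):

```latex
K(X_1, X_2) = \exp\!\big(-\gamma\,\lVert X_1 - X_2 \rVert^2\big),
\qquad \gamma = \frac{1}{2\sigma^2}
```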
“Gamma” parameter in the RBF kernel

• It controls the width of the Gaussian function used to map the input data into a higher-dimensional space.
• A small value of gamma means that the influence of each training example reaches relatively far, and the
decision boundary becomes smoother and more nearly linear.
• Conversely, a larger value of gamma means that the influence of each training example is relatively local,
and the decision boundary becomes more curved or nonlinear.
• Choosing the optimal value of gamma depends on the complexity of the dataset and the number of training
examples.
• If gamma is too small, there is a risk of underfitting the data, while if it is too high, there is a risk of
overfitting.
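A minimal sketch of how gamma is set in scikit-learn's SVC; the dataset and the particular gamma values are illustrative assumptions:

```python
# Sketch: comparing RBF-kernel SVMs with different gamma values (illustrative data).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for gamma in [0.01, 1.0, 100.0]:          # small -> smoother boundary, large -> more curved
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    print(f"gamma={gamma:>6}: training accuracy = {clf.score(X, y):.2f}")
```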
