Module 3
Linear Models
Linear Regression
• The model aims to predict the value of y such that the error between the predicted value and the true value is minimized.
• The cost function (J) of Linear Regression is the Root Mean Squared Error (RMSE) between the predicted y value (pred) and the true y value (y).
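As a quick illustration, here is a minimal NumPy sketch of that cost; the arrays and values are only illustrative, not taken from the slides.

import numpy as np

def rmse_cost(y_true, y_pred):
    # Root Mean Squared Error between true and predicted values
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Illustrative values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4])
print(rmse_cost(y_true, y_pred))  # small value -> predictions close to the true values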
Gradient Descent in Linear Regression
• Alpha (α): the learning rate
• If the learning rate is too high, we might OVERSHOOT the minimum and keep bouncing around without ever reaching it.
• If the learning rate is too small, training may take too long to converge (see the sketch after the reading links below).
• Blog to study:
• https://www.analyticsvidhya.com/blog/2021/08/understanding-gradient-descent-algorithm-and-the-maths-behind-it/
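To make the role of the learning rate concrete, here is a minimal sketch of batch gradient descent for y = m*x + c on the MSE cost; the data, learning rate, and epoch count are illustrative assumptions.

import numpy as np

def gradient_descent(x, y, alpha=0.01, epochs=5000):
    # Batch gradient descent for y = m*x + c using the MSE cost
    m, c = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        y_pred = m * x + c
        # Gradients of the MSE cost with respect to m and c
        dm = (-2.0 / n) * np.sum(x * (y - y_pred))
        dc = (-2.0 / n) * np.sum(y - y_pred)
        # Update step scaled by the learning rate (alpha)
        m -= alpha * dm
        c -= alpha * dc
    return m, c

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                 # data generated from m = 2, c = 1
print(gradient_descent(x, y))     # should approach (2.0, 1.0)

With this toy data, a much larger alpha (say 0.5) makes the updates diverge, while a very small alpha needs many more epochs, matching the two bullets above.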
Plotting the Gradient Descent Algorithm
• When we have a single parameter (theta), we can plot the cost (the dependent variable) on the y-axis and theta on the x-axis. If there are two parameters, we can go with a 3-D plot, with cost on one axis and the two parameters (thetas) along the other two axes.
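A minimal matplotlib sketch of the single-parameter case; the data and theta grid are illustrative.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative data: y = 2*x with no intercept, so the only parameter is theta
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

thetas = np.linspace(-1.0, 5.0, 100)
costs = [np.mean((y - t * x) ** 2) for t in thetas]  # MSE cost for each theta

plt.plot(thetas, costs)                 # cost on the y-axis, theta on the x-axis
plt.xlabel("theta")
plt.ylabel("cost J(theta)")
plt.title("Cost as a function of a single parameter")
plt.show()

For two parameters, the same idea extends to a 3-D surface plot (for example with mpl_toolkits.mplot3d and plot_surface), with cost on one axis and the two thetas on the other two.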
Regularization in ML
• Problem: Overfitting
• Solution: Regularization. This is a form of regression that constrains (regularizes) or shrinks the coefficient estimates towards zero. In other words, this technique discourages learning an overly complex or flexible model, so as to avoid the risk of overfitting.
• A simple Linear Regression: Y = mx + c
Regression Regularization
• Lasso differs from ridge regression only in how it penalizes large coefficients.
• It uses |βj| (the modulus) instead of the square of β as its penalty. In statistics, this is known as the L1 norm.
• Lasso regression performs L1 regularization, i.e. it adds a factor of the sum of the absolute values of the coefficients to the optimization objective. Thus, lasso regression optimizes the following:
• Objective = RSS + α * (sum of absolute values of coefficients)
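A minimal NumPy sketch of that objective; the data, coefficients, and alpha are illustrative, and the intercept is left unpenalized, as is conventional.

import numpy as np

def lasso_objective(X, y, coef, intercept, alpha):
    # Objective = RSS + alpha * (sum of absolute values of the coefficients)
    residuals = y - (X @ coef + intercept)
    rss = np.sum(residuals ** 2)
    l1_penalty = alpha * np.sum(np.abs(coef))
    return rss + l1_penalty

# Illustrative numbers
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])
y = np.array([5.0, 4.0, 11.0])
coef = np.array([1.0, 2.0])
print(lasso_objective(X, y, coef, intercept=0.0, alpha=0.5))  # RSS = 0, penalty = 1.5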
LASSO
• As with ridge, α can take various values. Let us briefly go through them:
• α = 0: same coefficients as simple linear regression
• α = ∞: all coefficients zero (same logic as before)
• 0 < α < ∞: coefficients lie between zero and those of simple linear regression
Alpha (α) can be any real-valued number between zero and infinity; the larger the value, the more aggressive the penalization.
Lasso regression shrinks the coefficients and helps to reduce model complexity and multicollinearity.
Key Takeaway for Lasso: Lasso Regression for Model Selection
• Because the coefficients are shrunk towards zero, the less important features in a dataset are eliminated when penalized.
• The shrinkage of these coefficients, based on the alpha value provided, leads to a form of automatic feature selection, as input variables are removed in an effective way (see the sketch below).
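A small scikit-learn sketch of this automatic feature selection on synthetic data; the dataset and alpha values are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of them actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for alpha in [0.01, 1.0, 10.0, 100.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    n_zero = np.sum(model.coef_ == 0)
    print(f"alpha={alpha:>6}: {n_zero} of {len(model.coef_)} coefficients are exactly zero")
# As alpha grows, more coefficients are driven to exactly zero,
# i.e. more input variables are effectively removed from the model.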
Why Lasso Can Be Used for Model Selection, but Not Ridge Regression
• The elliptical contours (the red circles) are the cost-function contours for each model.
• Since the lasso constraint region has a diamond shape in the plot, whenever the elliptical contours touch one of its corners, at least one of the coefficients becomes exactly zero. This is impossible in the ridge regression model, whose constraint region is circular: its coefficients can be shrunk close to zero, but never to exactly zero (see the sketch below).
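The geometric argument can be checked empirically; here is a minimal sketch comparing Ridge and Lasso on the same data (the dataset and alpha value are illustrative).

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=5.0, random_state=1)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=10.0).fit(X, y)

print("Ridge coefficients exactly zero:", np.sum(ridge.coef_ == 0))
print("Lasso coefficients exactly zero:", np.sum(lasso.coef_ == 0))
# Ridge shrinks coefficients close to zero but (in general) not to exactly zero;
# Lasso's L1 penalty can zero some of them out entirely.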
Conclusion
❑ The cost functions for ridge and lasso regression are similar. However, ridge regression penalizes the square of the coefficients while lasso penalizes their magnitude (absolute value).
❑ Lasso regression can be used for automatic feature selection, as the geometry of its constrained region allows coefficient values to shrink to exactly zero.
❑ An alpha value of zero in either the ridge or the lasso model gives the same results as simple linear regression.
❑ The larger the alpha value, the more aggressive the penalization.
What is a Hyperplane?
• For a linearly separable dataset having n features, a hyperplane is basically an (n – 1)-dimensional subspace used for separating the dataset into two sets, each set containing data points belonging to a different class.
• For example, for a dataset having two features X and Y (and therefore lying in a 2-dimensional space), the separating hyperplane is a line (a 1-dimensional subspace).
• Similarly, for a dataset in 3 dimensions, the separating hyperplane is 2-dimensional, and so on.
• In machine learning, Support Vector Machine (SVM) is a non-probabilistic, linear, binary classifier
used for classifying data by learning a hyperplane separating the data.
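A tiny sketch of the 2-feature case: the hyperplane w · x + b = 0 is a line, and the sign of w · x + b tells which side of it a point lies on; w, b, and the points are illustrative.

import numpy as np

# Hyperplane w . x + b = 0 in 2D (a line); illustrative parameters
w = np.array([1.0, -1.0])
b = 0.0

def side(point):
    # Returns +1 or -1 depending on which side of the hyperplane the point lies
    return int(np.sign(w @ point + b))

print(side(np.array([2.0, 1.0])))   # +1
print(side(np.array([1.0, 3.0])))   # -1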
What is SVM?
• Step 1: The SVM algorithm predicts the classes: one of the classes is labelled +1 while the other is labelled -1.
Step 2
• Convert the problem into a mathematical equation involving unknowns. These unknowns are then found by converting the problem into an optimization problem.
• Since optimization problems always aim at maximizing or minimizing something while tweaking the unknowns, in the case of the SVM classifier a loss function known as the hinge loss is used and minimized to find the maximum-margin hyperplane.
Step 3: Loss function
• We take partial derivatives of the loss with respect to the weights to find the gradients. Using the gradients, we can update our weights (see the sketch below).
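A minimal NumPy sketch of the regularized hinge loss and its (sub)gradient with respect to the weights; the function names, the regularization constant lam, and the data are illustrative, and the bias term is omitted for brevity.

import numpy as np

def hinge_loss(w, X, y, lam):
    # Regularized hinge loss: lam * ||w||^2 + mean(max(0, 1 - y * (X @ w)))
    margins = y * (X @ w)
    return lam * np.dot(w, w) + np.mean(np.maximum(0.0, 1.0 - margins))

def hinge_gradient(w, X, y, lam):
    # Subgradient of the regularized hinge loss with respect to w
    margins = y * (X @ w)
    violating = (margins < 1).astype(float)       # 1 where the margin is violated
    grad_data = -(violating * y)[:, None] * X     # per-sample subgradients
    return 2 * lam * w + grad_data.mean(axis=0)   # mean matches the loss definition

# Illustrative data: labels must be +1 / -1
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0])
w = np.zeros(2)
print(hinge_loss(w, X, y, lam=0.01))      # 1.0: every point violates the margin at w = 0
print(hinge_gradient(w, X, y, lam=0.01))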
Understanding the loss function/svm/kernels in svm
• https://www.geeksforgeeks.org/hinge-loss-relationship-with-support-vector-machines/
• https://iq.opengenus.org/hinge-loss-for-svm/
• Kernel SVM
• https://www.analyticsvidhya.com/blog/2021/10/support-vector-machinessvm-a-complete-guide-for-beginners/
Step 5: Update weight
• When there is no misclassification, i.e. our model correctly predicts the class of the data point, we only have to update the weights using the gradient of the regularization term.
• When there is a misclassification, i.e. our model makes a mistake on the prediction of the class of the data point, we include the loss term along with the regularization term to perform the gradient update (as in the sketch below).
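A minimal sketch of those two update cases, using per-sample (sub)gradient descent on the hinge loss; the learning rate lr, regularization constant lam, epoch count, and data are illustrative assumptions, and the bias term is again omitted.

import numpy as np

def sgd_svm(X, y, lr=0.001, lam=0.01, epochs=200):
    # Train a linear SVM with per-sample subgradient updates on the hinge loss
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) >= 1:
                # Correct with enough margin: only the regularization term updates w
                w -= lr * (2 * lam * w)
            else:
                # Margin violated (or misclassified): include the hinge-loss gradient too
                w -= lr * (2 * lam * w - yi * xi)
    return w

# Illustrative linearly separable data with labels +1 / -1
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = sgd_svm(X, y)
print(np.sign(X @ w))   # should match y for this toy data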
Introduction to kernels
• When we can easily separate the data with a hyperplane by drawing a straight line, we have a Linear SVM.
• When we cannot separate the data with a straight line, we use a Non-Linear SVM; for this, we have kernel functions.
• A kernel transforms the data into another dimension so that the data can be classified.
• For example, it can transform two variables x and y into three variables by adding a third variable z.
Kernel Trick
The datasets you will be working on, or are currently working on, might not always be linearly separable. One approach to handling nonlinear datasets is to add more features, such as polynomial features; in some cases this can result in a linearly separable dataset.
Consider the left plot in Figure 1: it represents a simple dataset with just one feature x1. As you can see, this dataset is not linearly separable. But if you add a second feature x2 = (x1)², the resulting 2D dataset is perfectly linearly separable (see the sketch below).
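A tiny numeric sketch of that idea; the 1-D points and labels are illustrative, with the positive class sitting between the negatives.

import numpy as np

# 1-D data: no single threshold on x1 separates the two classes
x1 = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y  = np.array([-1, -1, 1, 1, 1, -1, -1])

# Add a second feature x2 = x1^2: in the (x1, x2) plane the classes become
# separable by the horizontal line x2 = 2.5
X = np.column_stack([x1, x1 ** 2])
pred = np.where(X[:, 1] < 2.5, 1, -1)
print((pred == y).all())   # True: the lifted data is linearly separable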
Kernel Trick
• A polynomial kernel is a kind of SVM kernel that uses a polynomial function to map the data into a higher-dimensional space. It does this implicitly: the kernel value is computed from the dot product of the data points in the original space, yet it equals a dot product in the new polynomial feature space.
• The important terms to note are x1, x2, x1^2, x2^2, and x1 * x2. By forming these new terms, the non-linear dataset is mapped to another dimension with the features x1^2, x2^2, and x1 * x2 (see the sketch below).
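A minimal sketch of a degree-2 polynomial kernel computed directly from the original vectors, plus the equivalent scikit-learn classifier; the constant c, the degree, the data, and the labels are illustrative.

import numpy as np
from sklearn.svm import SVC

def polynomial_kernel(a, b, c=1.0, d=2):
    # K(a, b) = (a . b + c)^d : a dot product in the original space, raised to degree d
    return (np.dot(a, b) + c) ** d

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
print(polynomial_kernel(a, b))   # (1*3 + 2*4 + 1)^2 = 144.0

# The same idea inside an SVM classifier with a degree-2 polynomial kernel
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])
clf = SVC(kernel="poly", degree=2, coef0=1.0).fit(X, y)
print(clf.predict([[3.0, 2.0]]))   # predicted class for a new point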
RBF Kernel
• K(X₁, X₂) = exp(−γ · ||X₁ − X₂||²), where γ = 1 / (2σ²)
• ‘σ’ is the variance and our hyperparameter
• ||X₁ − X₂|| is the Euclidean (L₂-norm) distance between the two points X₁ and X₂
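A minimal sketch of that kernel; the points and σ are illustrative.

import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    # K(x1, x2) = exp(-gamma * ||x1 - x2||^2) with gamma = 1 / (2 * sigma^2)
    gamma = 1.0 / (2.0 * sigma ** 2)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

x1 = np.array([1.0, 2.0])
x2 = np.array([2.0, 3.0])
print(rbf_kernel(x1, x2))        # exp(-1.0) ~ 0.3679
print(rbf_kernel(x1, x1))        # identical points give the maximum value 1.0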
“Gamma” parameter in the RBF kernel
• It controls the width of the Gaussian function used to map the input data into a higher-dimensional space.
• A small value of gamma corresponds to a wide Gaussian, so each training example influences a large region and the decision boundary becomes smoother and closer to linear.
• Conversely, a larger value of gamma corresponds to a narrow Gaussian, so each training example only influences its close neighbourhood and the decision boundary becomes more curved and nonlinear.
• Choosing the optimal value of gamma depends on the complexity of the dataset and the number of training examples.
• If gamma is too small, there is a risk of underfitting the data, while if it is too high, there is a risk of overfitting (see the sketch below).
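A small scikit-learn sketch of that trade-off on a synthetic nonlinear dataset; the dataset, the split, and the gamma values are illustrative assumptions.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    print(f"gamma={gamma:>6}: train acc={clf.score(X_train, y_train):.2f}, "
          f"test acc={clf.score(X_test, y_test):.2f}")
# A very small gamma tends to give a smooth, nearly linear boundary (underfitting),
# while a very large gamma fits the training points closely but generalizes worse (overfitting).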