ML unit-2 ppt
Unit-II
Linear regression
• Regression is essentially finding a relationship (or association) between the
dependent variable (Y) and the independent variable(s) (X), i.e. finding the function ‘f’
for the association Y = f(X).
• Linear regression is a statistical model that is used to predict a continuous
dependent variable from one or more independent variables
• It is called "linear" because the model is based on the idea that the relationship between
the dependent and independent variables is linear.
• In a linear regression model, the independent variables are referred to as the predictors
and the dependent variable is referred to as the response.
• The goal is to find the "best" line that fits the data. The "best" line is
the one that minimizes the sum of the squared differences between the
observed responses in the dataset and the responses predicted by the line.
• For example, if you were using linear regression to model the relationship
between the temperature outside and the number of ice cream cones sold at
an ice cream shop, you could use the model to predict how many ice cream
cones you would sell on a hot day given the temperature outside.
• The simple linear regression model is Ŷ = a + bX, where ‘a’ and ‘b’ are the intercept
and slope of the straight line, respectively.
• The value of the intercept indicates the value of Y when X = 0. It is known as
‘the intercept or Y intercept’ because it specifies where the straight line crosses
the vertical or Y-axis.
Slope of the simple linear regression
model
• Slope of a straight line represents how much the line in a graph
changes in the vertical direction (Y-axis) over a change in the
horizontal direction (X-axis)
• Rise is the change along the Y-axis
• Run is the change along the X-axis
• Slope = Rise / Run (a small numerical illustration is given below)
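As a small numerical illustration (the point values below are hypothetical), the slope of a line through two points is the rise divided by the run, and together with the intercept it gives the line Ŷ = a + bX:

```python
# Hypothetical points on a straight line: (x1, y1) and (x2, y2)
x1, y1 = 2.0, 5.0
x2, y2 = 6.0, 13.0

rise = y2 - y1          # change along the Y-axis
run = x2 - x1           # change along the X-axis
b = rise / run          # slope = rise / run  -> 2.0
a = y1 - b * x1         # intercept: value of Y when X = 0 -> 1.0

x_new = 4.0
y_hat = a + b * x_new   # prediction from the line Y = a + bX -> 9.0
print(f"slope={b}, intercept={a}, prediction at x=4: {y_hat}")
```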
Ordinary Least Squares (OLS)
algorithm
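The slide content for the OLS algorithm is not reproduced above; as a minimal sketch (with hypothetical data), the closed-form OLS estimates for simple linear regression are b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and a = ȳ − b·x̄, which minimise the sum of squared differences between the observed and predicted responses:

```python
import numpy as np

# Hypothetical (X, Y) observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.0, 9.9])

x_bar, y_bar = x.mean(), y.mean()

# OLS estimates that minimise the sum of squared residuals
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
a = y_bar - b * x_bar                                             # intercept

y_hat = a + b * x
sse = np.sum((y - y_hat) ** 2)   # sum of squared errors being minimised
print(f"a = {a:.3f}, b = {b:.3f}, SSE = {sse:.4f}")
```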
Exercise Problem
• A college professor believes that if the grade for internal examination
is high in a class, the grade for external examination will also be high.
A random sample of 15 students in that class was selected, and the
data is given as,
Multiple Linear Regression
• In a multiple regression model, two or more independent variables,
i.e. predictors are involved.
• Example: A model which can predict the correct value of a real estate
if it has certain standard inputs such as area (sq. m.) of the property,
location, floor, number of years since purchase, amenities available
etc., as independent variables.
• Consider the following example of a multiple linear regression model with two
predictor variables, namely X1 and X2:
Ŷ = a + b1X1 + b2X2
• The model describes a plane in the three-dimensional space of Ŷ, X1,
and X2. Parameter ‘a’ is the intercept of this plane. Parameters ‘b1’
and ‘b2’ are referred to as partial regression coefficients.
• Parameter b1 represents the change in the mean response
corresponding to a unit change in X1 when X2 is held constant.
• Parameter b2 represents the change in the mean response
corresponding to a unit change in X2 when X1 is held constant (a small
fitting sketch is given below).
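As a sketch of how such a two-predictor model can be fitted by least squares (the data values below are hypothetical), numpy's lstsq solves for a, b1, and b2:

```python
import numpy as np

# Hypothetical observations with two predictors X1 and X2
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([6.1, 6.9, 11.2, 11.8, 17.0, 16.8])

# Design matrix with a leading column of ones for the intercept 'a'
A = np.column_stack([np.ones_like(X1), X1, X2])

# Least squares solution of A @ [a, b1, b2] ≈ Y
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coef
print(f"Y_hat = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")
```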
While finding the best-fit line, we can also fit a polynomial or curvilinear equation to the
data instead of a straight line; such models are known as polynomial and curvilinear
regression, respectively (a small polynomial-fitting sketch is given below).
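For instance, a second-degree polynomial regression can be fitted with numpy.polyfit; this is only an illustrative sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data following a curved (non-linear) trend
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 4.8, 9.2, 16.5, 25.7])

# Fit y ≈ c2*x^2 + c1*x + c0 (degree-2 polynomial regression)
c2, c1, c0 = np.polyfit(x, y, deg=2)

y_hat = np.polyval([c2, c1, c0], x)
print(f"y_hat = {c2:.2f}*x^2 + {c1:.2f}*x + {c0:.2f}")
```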
Assumptions in Regression Analysis
• linear relationship between the dependent and independent variables
• Regression line can be valid only over a limited range of data. If the line is
extended (outside the range of extrapolation), it may only lead to wrong
predictions.
• The values of the error (ε) are independent and are not related to any
values of X
• The number of observations (n) is greater than the number of parameters (k) to
be estimated, i.e. n > k.
• normally distributed error component
• no multicollinearity, i.e. the independent variables should not be strongly correlated,
otherwise the regression coefficients become unstable
• no heteroskedasticity, i.e. the variance of the residuals must be constant across the
predicted values (a small residual check is sketched below)
Given the above assumptions, the OLS estimator is the Best Linear Unbiased
Estimator (BLUE); this result is known as the Gauss–Markov Theorem.
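As an illustrative sketch (not taken from the slides; the data are hypothetical), the residuals of a fitted line can be inspected to check the zero-mean and constant-variance error assumptions:

```python
import numpy as np

# Hypothetical data and a fitted simple linear regression
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.2, 4.1, 5.9, 8.3, 9.8, 12.1, 14.2, 15.8])
b, a = np.polyfit(x, y, deg=1)          # slope, intercept

residuals = y - (a + b * x)

# Zero-mean errors: the average residual should be close to 0
print("mean residual:", residuals.mean())

# Rough homoskedasticity check: compare residual spread in the
# lower and upper halves of the X range
half = len(x) // 2
print("std (low X): ", residuals[:half].std())
print("std (high X):", residuals[half:].std())
```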
Problems in Regression Analysis
1. Multicollinearity
2. Heteroskedasticity
Multicollinearity
• Two or more independent variables are strongly correlated with one
another.
• The problem that arises is that the influence of the individual variables cannot be
clearly separated, and the regression equation becomes unstable.
• To detect multicollinearity, each independent variable is regressed on the other
independent variables, as shown in the sketch below.
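A sketch of this detection idea, using hypothetical data: each predictor is regressed on the remaining predictors, and a high R² from that auxiliary regression (often summarised as the Variance Inflation Factor, VIF = 1/(1 − R²), a standard measure not shown on the slide) signals multicollinearity.

```python
import numpy as np

# Hypothetical predictors; X2 is almost a multiple of X1 (strong correlation)
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = 2.0 * X1 + np.array([0.05, -0.02, 0.03, -0.04, 0.01, 0.02])
X3 = np.array([3.0, 1.0, 4.0, 1.5, 5.0, 2.0])

def r_squared(target, others):
    """Regress `target` on `others` (plus intercept) and return R^2."""
    A = np.column_stack([np.ones_like(target)] + list(others))
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2 = r_squared(X1, [X2, X3])
print(f"R^2 of X1 on X2, X3: {r2:.4f}, VIF = {1.0 / (1.0 - r2):.1f}")
```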
Heteroskedasticity
• Heteroskedasticity arises when the variance of the error terms is not constant across
observations, violating the assumption
var(u_i | X) = σ² and cov(u_i, u_j | X) = 0 for i ≠ j,
where ‘var’ represents the variance, ‘cov’ represents the covariance, ‘u’ represents
the error terms, and ‘X’ represents the independent variables.
This assumption is more commonly written as var(u_i) = σ², i.e. constant variance of the errors.
Improving Accuracy of the Linear
Regression Model
• Accuracy refers to how close the estimate is to the actual value.
• Prediction refers to the continuous estimation of the value.
• Bias and variance are analogous to accuracy and prediction, respectively:
High bias = low accuracy (not close to the real value)
High variance = low prediction (values are scattered)
Low bias = high accuracy (close to the real value)
Low variance = high prediction (values are close to each other)
• For a regression model that is highly accurate and highly predictive, the
overall error of the model will be low, implying a low bias (high
accuracy) and a low variance (high prediction) - highly preferable
• Similarly, if the variance increases (low prediction), the spread of our
data points increases, which results in less accurate prediction. As the
bias increases (low accuracy), the error between our predicted value
and the observed values increases.
• Balancing out bias and variance is essential in a regression model. In the
linear regression model, it is assumed that the number of observations
(n) is greater than the number of parameters (k) to be estimated, i.e. n >
k; in that case, the least squares estimates tend to have low
variance and hence will perform well on test observations.
• However, if the number of observations (n) is not much larger than the number of
parameters (k), then there can be high variability in the least squares fit, resulting in
overfitting and leading to poor predictions. If k > n, then linear regression
is not usable.
• Accuracy of linear regression can be improved using the following
three methods:
1. Shrinkage Approach
2. Subset Selection
3. Dimensionality (Variable) Reduction
Shrinkage (Regularization) approach
• This approach involves fitting a model involving all predictors. However,
the estimated coefficients are shrunken towards zero relative to the
least squares estimates.
• This shrinkage (also known as regularization) has the effect of reducing
the overall variance. Some of the coefficients may also be estimated to
be exactly zero, thereby indirectly performing variable selection.
• The two best-known techniques for shrinking the regression
coefficients towards zero are
1. ridge regression
2. lasso (Least Absolute Shrinkage and Selection Operator)
Ridge regression (L2 regularization)
• It modifies over-fitted or under-fitted models by adding a penalty
equivalent to the sum of the squares of the magnitudes of the
coefficients.
• Ridge Regression performs regularization by shrinking the coefficients
present.
L2 cost = Σ(y − ŷ)² + α · (sum of squares of the coefficients)
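A minimal sketch of this penalised objective (the data below are hypothetical), using the standard closed-form ridge solution b = (XᵀX + αI)⁻¹Xᵀy on centred data with an unpenalised intercept:

```python
import numpy as np

# Hypothetical data with two correlated predictors
X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 8.1], [5.0, 9.8]])
y = np.array([3.2, 6.1, 9.0, 12.2, 14.9])

# Centre X and y so the intercept can be handled separately (and unpenalised)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

alpha = 1.0  # penalty strength on the sum of squared coefficients

# Closed-form ridge solution: b = (Xc'Xc + alpha*I)^-1 Xc' yc
n_features = Xc.shape[1]
b = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
a = y.mean() - X.mean(axis=0) @ b   # intercept recovered after centring

y_hat = a + X @ b
penalised_loss = np.sum((y - y_hat) ** 2) + alpha * np.sum(b ** 2)
print("coefficients:", b, "intercept:", a)
print("penalised loss:", penalised_loss)
```

Larger values of α shrink the coefficients more strongly towards zero, reducing the variance of the model at the cost of some bias.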