Chap-11 Correlation and Regression
[Figure: example scatter plots - one plotting Revenue (billions) on the y-axis, and one plotting Final Grade (30-100) against x values from 0 to 15]
Correlation Coefficient
The correlation coefficient r ranges from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).
[Figure: example scatter plots for (a) r = 0.50, (b) r = 0.90, (c) r = 1.00, (d) r = -0.50, (e) r = -0.90 and (f) r = -1.00]
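To make this concrete, here is a minimal sketch of computing r with NumPy; the data values are hypothetical and chosen only for illustration.

import numpy as np

# Hypothetical data: hours studied (x) vs. exam score (y)
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([52, 55, 61, 60, 68, 74, 77, 85], dtype=float)

# Pearson r from the covariance / standard-deviation definition
r = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(round(r, 3))               # close to +1: strong positive linear relationship

# np.corrcoef returns the same value directly
print(np.corrcoef(x, y)[0, 1])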
Regression: Line of Best Fit
[Figure: scatter plot with the line of best fit, marking an observed value and the corresponding predicted value on the line]
Regression Line Equation
A line as represented in Algebra and in Statistics
(a) Algebra of a line: y = mx + c, where m is the slope and c is the y-intercept.
Example: y = 0.5x + 5, with slope m = Δy/Δx = 2/4 = 0.5.
(b) Statistical notation for a regression line: y' = a + bx, where a is the y'-intercept and b is the slope.
Example: y' = 5 + 0.5x, with slope b = Δy'/Δx = 2/4 = 0.5.
Regression Line Equation
[Figure: scatter plot of Final grade (30-100) versus Number of absences (0-15), with the regression line y' = 102.493 - 3.622x]
Regression Model
• The regression model is y = β0 + β1x + ε.
• Data about x and y are obtained from a sample.
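A minimal sketch of estimating the regression line from sample data by ordinary least squares. The absence/grade values below are illustrative (they are not listed on the slides); with them, the fit should come out close to the line y' = 102.493 - 3.622x from the earlier example.

import numpy as np

# Illustrative data: number of absences (x) vs. final grade (y)
x = np.array([6, 2, 15, 9, 12, 5, 8], dtype=float)
y = np.array([82, 86, 43, 74, 58, 90, 78], dtype=float)

# Least-squares estimates of the slope b and intercept a of y' = a + bx
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print("intercept a =", round(a, 3), " slope b =", round(b, 3))

# Predicted final grade for a student with 10 absences
print(round(a + b * 10, 1))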
Multiple Regression
[Diagram: several independent variables - (Gender) x2, (Experience) x3 and (Age) x4 - pointing to the dependent variable y (Income)]
[Diagram: model with a simultaneous relationship - Price of wheat <-> Quantity of wheat produced]
Multiple Regression
Example:
Student   GPA (x1)   Age (x2)   State board score (y)
A         3.2        22         550
B         2.7        27         570
C         2.5        24         525
D         3.4        28         670
E         2.2        23         490
The multiple regression equation obtained from the data is
y' = -44.81 + 87.64x1 + 14.533x2
If GPA = 3.0 and Age = 25, then the predicted State board score is
y' = -44.81 + 87.64(3.0) + 14.533(25) ≈ 581.44
In matrix form, the multiple regression model is y = Xβ + ε, where y is the n×1 vector of observed responses, X is the n×(p+1) design matrix whose first column consists of 1s, β = (β0, β1, …, βp) is the vector of coefficients and ε is the n×1 vector of errors.
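A minimal sketch of fitting this model to the five-student table with NumPy's least-squares solver (standing in for the normal-equation algebra); the resulting coefficients can be compared with the slide's equation.

import numpy as np

# Five-student example: GPA (x1), Age (x2), State board score (y)
x1 = np.array([3.2, 2.7, 2.5, 3.4, 2.2])
x2 = np.array([22, 27, 24, 28, 23], dtype=float)
y  = np.array([550, 570, 525, 670, 490], dtype=float)

# Design matrix X with a leading column of 1s for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve y = X beta + e in the least-squares sense
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, b1, b2 =", np.round(beta, 3))

# Predicted State board score for GPA = 3.0 and Age = 25
print(round(float(beta @ np.array([1.0, 3.0, 25.0])), 2))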
Errors & Residuals
• A statistical error is the difference between an observation and its expected value, which is based on the entire population. Values for the entire population are usually unobservable, e.g. the mean height of all human beings.
• A residual is the difference between an observation and its estimated value, where the estimate is based on a sample, e.g. the mean height of a randomly chosen sample of human beings.
• Residual Sum of Squares (RSS), also known as the Sum of Squared Residuals (SSR), is the sum of the squares of the residuals.
• It is a measure of the discrepancy between the data and the
estimation model and is used as an optimality criterion in
parameter selection and model selection.
• In a standard simple linear regression model, yi = a + bxi + ei, and
RSS = Σi ei² = Σi (yi − (α + βxi))²
where α is the estimated value of the intercept a and β is the estimated value of the slope b.
• Minimizing the RSS function is a building block of supervised
learning algorithms, and in the field of machine learning this
function is referred to as the cost function.
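A small sketch of RSS used as a cost function: it evaluates the fit of a candidate line y' = α + βx, and the least-squares line (obtained here with np.polyfit) should give a smaller value than an arbitrary guess. The data are hypothetical.

import numpy as np

def rss(alpha, beta, x, y):
    # Residual sum of squares for the line y' = alpha + beta * x
    residuals = y - (alpha + beta * x)
    return np.sum(residuals ** 2)

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Compare RSS at an arbitrary guess with RSS at the least-squares fit
beta_hat, alpha_hat = np.polyfit(x, y, 1)   # polyfit returns (slope, intercept)
print(rss(0.0, 1.0, x, y))
print(rss(alpha_hat, beta_hat, x, y))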
Graphical Example Explaining Residual
• When you perform simple linear regression (or any other type of regression analysis), you get a line of best fit,
ŷ = a + bx
[Figure: regression line through the data, showing predicted values ŷi on the line, observed values yi, and the residuals ei = (yi − ŷi)]
R-Square & Adjusted R Square
• R-Square determines how much of the total variation in
Y (dependent variable) is explained by the variation in X
(independent variable).
R² = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)²
• The value of R-square is always between 0 and 1, where 0 means that the model does not explain any of the variability in the target variable (Y) and 1 means that it explains all of the variability in the target variable.
• The Adjusted R-Square is a modified form of R-Square that has been adjusted for the number of predictors in the model. It incorporates the model's degrees of freedom. The adjusted R-Square only increases if a new term improves the model accuracy.
Adjusted R² = 1 − (1 − R²)(N − 1) / (N − p − 1)
where R² = sample R-square value, N = total sample size and p = number of predictors.
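The sketch below computes R-square and adjusted R-square from these formulas on hypothetical data with N = 6 observations and p = 2 predictors.

import numpy as np

# Hypothetical data: N = 6 observations, p = 2 predictors
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 6.8, 11.1, 10.9])

# Least-squares fit of y = b0 + b1*x1 + b2*x2
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

# R-square = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Adjusted R-square = 1 - (1 - R^2)(N - 1) / (N - p - 1)
N, p = len(y), X.shape[1]
adj_r2 = 1 - (1 - r2) * (N - 1) / (N - p - 1)
print(round(r2, 4), round(adj_r2, 4))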
Regularization
• Overfitting refers to a model that corresponds too closely or exactly to a particular set of training data but fails to fit new data or predict future observations reliably.
• An overfitted model is a statistical model that contains more parameters than can be justified by the data.
• Regularization is a very important technique in machine learning for preventing overfitting.
• Mathematically speaking, it adds a regularization term to the loss function so that the coefficients cannot fit the training data so perfectly that the model overfits.
• The L1-norm loss function minimizes the sum (S) of the absolute differences between the target values (yi) and the estimated values (f(xi)):
S = Σi |yi − f(xi)|
• The L2-norm loss function minimizes the sum (S) of the squares of the differences between the target values (yi) and the estimated values (f(xi)):
S = Σi (yi − f(xi))²
• Regularization adds a regularizer R(f) to the loss function, so the quantity to be minimized becomes
Σi V(f(xi), yi) + λ R(f)
where V is the underlying loss function (e.g. the L1 or L2 loss above) and λ is a parameter that controls the importance of the regularizer.
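As an illustration, here is a minimal sketch of this regularized loss with squared error as V and either an L1 or an L2 regularizer; the data, weights and λ value are hypothetical.

import numpy as np

def regularized_loss(w, X, y, lam, penalty="l2"):
    # Loss = sum_i V(f(x_i), y_i) + lambda * R(f), with V = squared error and f(x) = X @ w
    residuals = y - X @ w
    data_loss = np.sum(residuals ** 2)
    if penalty == "l2":
        reg = np.sum(w ** 2)        # L2 regularizer (Ridge-style)
    else:
        reg = np.sum(np.abs(w))     # L1 regularizer (LASSO-style)
    return data_loss + lam * reg

# Hypothetical design matrix, targets, weights and lambda
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.2, 1.1])
print(regularized_loss(w, X, y, lam=0.1, penalty="l2"))
print(regularized_loss(w, X, y, lam=0.1, penalty="l1"))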
Ridge & LASSO
Ridge and LASSO (Least Absolute Shrinkage and Selection Operator) regression are powerful techniques generally used for creating compact models in the presence of a "large" number of features. Problems with a large number of features:
• They increase the tendency of a model to overfit (as few as 10 variables can cause overfitting).
• They cause computational challenges. With modern systems, this situation might arise with millions or billions of features.
Ridge & LASSO
Both Ridge and LASSO work by penalizing the magnitude of the feature coefficients along with minimizing the error between predicted and actual observations. The key difference is in how they assign the penalty to the coefficients.
Ridge Regression:
Performs L2 regularization, i.e. adds a penalty equivalent to the square of the magnitude of the coefficients:
Minimization objective = Least Squares Objective + α * (sum of squares of the coefficients)
LASSO Regression:
Performs L1 regularization, i.e. adds a penalty equivalent to the absolute value of the magnitude of the coefficients:
Minimization objective = Least Squares Objective + α * (sum of absolute values of the coefficients)
Ridge & LASSO (Cont.)
• Both Ridge and LASSO penalize the beta coefficients so that we can identify the important variables (all of them in the case of Ridge, only a few in the case of LASSO).
• If the two penalty types are mixed, a mixing ratio of α = 0 gives pure Ridge and α = 1 gives pure LASSO.
• The major advantage of Ridge regression is coefficient shrinkage, which reduces model complexity. It is mainly used to prevent overfitting, but it is not very useful for reducing the number of features.
• Along with shrinking coefficients, LASSO also performs feature selection: some of the coefficients become exactly zero, which means that the corresponding features are excluded from the model. It is therefore useful for modelling cases where the number of features is in the millions or more.
Let us go for a practical demonstration…
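As a starting point, the sketch below uses scikit-learn's Ridge and Lasso on synthetic data; the dataset, the α values and the random seed are assumptions chosen for illustration. It shows Ridge shrinking all coefficients while Lasso drives some of them exactly to zero.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# Synthetic data: 100 samples, 10 features, only the first 3 actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=0.5, size=100)

for name, model in [("OLS", LinearRegression()),
                    ("Ridge (alpha=1.0)", Ridge(alpha=1.0)),
                    ("Lasso (alpha=0.1)", Lasso(alpha=0.1))]:
    model.fit(X, y)
    coefs = np.round(model.coef_, 2)
    print(f"{name:18s} non-zero coefficients: {np.count_nonzero(coefs)}  {coefs}")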