Regression Analysis
Contents
• What is Regression?
• Why Regression?
• Linear Regression
– Linear Regression algorithm using least square method
– Evaluation of method
• Multiple Linear Regression
• Logistic Regression
What is Regression?
• Regression is a supervised learning algorithm in
Machine Learning terminology
• An important tool in Predictive Analytics
• Regression analysis is a predictive modeling technique
that investigates the relationship between a
dependent variable and one or more independent variables.
• It fits a line over a set of data points that most
closely matches the overall shape of the data.
• The regression relates changes in the dependent
variable on the Y axis to changes in the explanatory
variable on the X axis.
• Regression is a tool for finding existence of an association
relationship between a dependent variable (Y) and one or more
independent variables (X1, X2, …, Xn) in a study.
• The relationship can be linear or non-linear.
• A dependent variable (response variable) “measures an
outcome of a study (also called outcome variable)”.
• An independent variable (explanatory variable) “explains
changes in a response variable”.
Types of Regression
[Diagram: types of regression, split by the number of predictors:
one independent variable (simple regression) vs. more than one
independent variable (multiple regression)]
Most Common Regression Algorithms
● Simple linear regression
● Multiple linear regression
● Polynomial regression
● Multivariate adaptive regression splines
● Logistic regression
● Maximum likelihood estimation (equivalent to least squares
under normally distributed errors)
Use cases of Regression
• Predictive analytics
• Operation efficiency
• Supporting decisions
• Correcting errors
• New insights
• House Price Predictions
• Trend forecasting
– E.g., what will the price of gold be in the next six months?
• Finding associations among attributes:
– E.g., Mediclaim agencies: the effect of age on claims
Linear Regression
• Linear regression: It is a linear approach to modelling the
relationship between a scalar response and one or more
explanatory variables (also known as dependent and independent
variables).
• The case of one explanatory variable is called simple linear
regression; for more than one, the process is called multiple
linear regression.
• In linear regression, the relationships are modeled using linear
predictor functions whose unknown
model parameters are estimated from the data.
• Linear regression models are often fitted using the least
squares approach.
Simple Linear Regression
• One of the easiest algorithms in machine learning.
• Simple Linear regression: It is a statistical model that
attempts to show the relationship between two variables
through the linear equation.
• Data is modeled using a straight line (Y = mX + c)
• Correlation between X and Y variables
Simple Linear Regression: Understanding
[Figure: scatter plot with fitted line showing a positive (+ve)
relationship between speed of vehicle (dependent variable, Y axis)
and distance travelled in a fixed duration of time (independent
variable, X axis); m = slope of the line, c = y-intercept of the line]
Slopes of Simple Linear Regression Model
Slope = (Y2 − Y1) / (X2 − X1) = ΔY / ΔX = Change in Y / Change in X
Example:
(X1, Y1) = (−3, −2) and (X2, Y2) = (2, 2)
Rise = (Y2 − Y1) = (2 − (−2)) = 2 + 2 = 4
Run = (X2 − X1) = (2 − (−3)) = 2 + 3 = 5
Slope = Rise/Run = 4/5 = 0.8
[Figures: linear positive slope, linear negative slope,
curvilinear positive slope, curvilinear negative slope]
Relations in Regression
[Figures: linear positive slope, linear negative slope, curvilinear
positive slope, curvilinear negative slope]
Simple Linear Regression:
Least Square Method
• How to find the best Regression Line?
• Our challenge is to determine the values of m and c that give the
minimum error for the given dataset. We do this using the Least
Squares method.
• Loss function (sum of squared errors) for the line y = mx + c:
L(m, c) = Σi (yi − (m·xi + c))²
• For minimum loss, we take the partial derivatives of L with respect
to m and c, equate them to 0, and solve for m and c.
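Carrying out that minimization gives the familiar closed-form estimates; a short sketch of the derivation in terms of the sample means x̄ and ȳ:

```latex
\frac{\partial L}{\partial m} = -2\sum_i x_i\bigl(y_i - (m x_i + c)\bigr) = 0,
\qquad
\frac{\partial L}{\partial c} = -2\sum_i \bigl(y_i - (m x_i + c)\bigr) = 0
\;\Rightarrow\;
m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
c = \bar{y} - m\,\bar{x}
```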
Simple Linear Regression:
Least Square Method (Example)
• A method to predict the best-fit line.
Simple Linear Regression
• Measure of goodness of fit: R² (coefficient of determination)
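R² compares the residual sum of squares (SSE) to the total sum of squares (SST); a minimal sketch (the helper name r_squared is illustrative, not from the slides):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SSE/SST."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)          # residual (error) sum of squares
    sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - sse / sst

print(r_squared([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]))  # ≈ 0.98
```

An R² of 1 means the line explains all variation in Y; values near 0 mean it explains almost none.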
OLS algorithm
● Step 1: Calculate the mean of X and Y
● Step 2: Calculate the errors of X and Y
● Step 3: Get the product
● Step 4: Get the summation of the products
● Step 5: Square the difference of X
● Step 6: Get the sum of the squared difference
● Step 7: Divide output of step 4 by output of step 6 to calculate ‘b’
● Step 8: Calculate ‘a’ using the value of ‘b’
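The eight steps above can be sketched in plain Python (the function name ols_fit is illustrative):

```python
def ols_fit(x, y):
    # Step 1: means of X and Y
    mx, my = sum(x) / len(x), sum(y) / len(y)
    # Step 2: deviations (errors) of X and Y from their means
    dx = [xi - mx for xi in x]
    dy = [yi - my for yi in y]
    # Steps 3-4: products of the deviations, then their sum
    sxy = sum(dxi * dyi for dxi, dyi in zip(dx, dy))
    # Steps 5-6: squared deviations of X, then their sum
    sxx = sum(dxi * dxi for dxi in dx)
    # Step 7: slope b = output of step 4 / output of step 6
    b = sxy / sxx
    # Step 8: intercept a from b and the two means
    a = my - b * mx
    return a, b

a, b = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])  # points on the line y = 1 + 2x
print(a, b)  # → 1.0 2.0
```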
Example of Simple Linear Regression
Calculation summary:
Sum of X = 299
Sum of Y = 852
Mean of X, Mx = 19.93
Mean of Y, My = 56.8
Error in Simple Regression
Y = (a + bX) + ε
[Figure: example of simple regression, scatter plot and regression line]
Sum of squares of residuals: SSE = Σi (Yi − Ŷi)²
A residual is the distance between the predicted point (on the
regression line) and the actual point, as depicted in the figure.
Multiple Linear Regression
• Two or more independent variables, i.e. predictors are involved in the
model.
• In the example of simple linear regression, we considered the Price of a
Property as the dependent variable and the Area of the Property (in sq.
m.) as the predictor variable.
• If we consider the Price of a Property (in $) as the dependent variable and
the Area of the Property (in sq. m.), location, floor, number of years since
purchase, and amenities available as the independent variables, we can
form a multiple regression equation as shown below:
Price = a + b1·(Area) + b2·(Location) + b3·(Floor) + b4·(Years since purchase) + b5·(Amenities) + ε
• The simple linear regression model and the multiple regression
model assume that the dependent variable is continuous.
• The following expression describes the equation involving the
relationship with two predictor variables, namely X1 and X2:
Ŷ = a + b1·X1 + b2·X2
• The model describes a plane in the three-dimensional space of
Ŷ, X1, and X2.
• Parameter ‘a’ is the intercept of this plane. Parameters ‘b1’ and
‘b2’ are referred to as partial regression coefficients.
• Parameter b1 represents the change in the mean response
corresponding to a unit change in X1 when X2 is held constant.
• Parameter b2 represents the change in the mean response
corresponding to a unit change in X2 when X1 is held constant.
• Consider the following example of a multiple linear regression
model with two predictor variables, namely X1 and X2.
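A minimal sketch of fitting such a two-predictor plane with NumPy's least-squares solver; the property data below is made up purely for illustration:

```python
import numpy as np

# Hypothetical toy data (values made up): property price driven by
# two predictors, area (X1) and floor (X2)
area  = np.array([50.0, 80.0, 120.0, 60.0, 100.0])
floor = np.array([1.0, 3.0, 5.0, 2.0, 4.0])
price = 20000 + 1500 * area + 3000 * floor      # exact plane, no noise

# Design matrix with a leading column of ones for the intercept 'a'
X = np.column_stack([np.ones_like(area), area, floor])
a, b1, b2 = np.linalg.lstsq(X, price, rcond=None)[0]
print(a, b1, b2)  # recovers approximately 20000, 1500, 3000
```

Because the toy prices lie exactly on a plane, least squares recovers the intercept and both partial regression coefficients almost exactly.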
The multiple regression estimating equation when there are ‘n’ predictor
variables is as follows:
Ŷ = a + b1·X1 + b2·X2 + … + bn·Xn
While finding the best-fit line, we can also fit a polynomial or
curvilinear relationship; these are known as polynomial and
curvilinear regression, respectively.
Assumptions in Regression Analysis
1. The dependent variable (Y) can be calculated / predicted as a
linear function of a specific set of independent variables (X’s) plus an
error term (ε).
2. The number of observations (n) is greater than the number of
parameters (k) to be estimated, i.e. n > k.
3. Relationships determined by regression are only relationships of
association based on the data set, and not necessarily of cause and
effect.
4. The regression line is valid only over a limited range of data. If the
line is extrapolated beyond that range, it may lead to wrong
predictions.
5. If the business conditions change and the business assumptions
underlying the regression model are no longer valid, then the past
data set will no longer be able to predict future trends.
6. The variance of the error term is the same for all values of X
(homoskedasticity).
7. The error term (ε) is normally distributed. This also means that the
mean of the error (ε) has an expected value of 0.
8. The values of the error (ε) are independent of each other and are not
related to the values of X. This means that the error for one particular
(X, Y) pair is unrelated to the error for any other (X, Y) pair.
Given the above assumptions, the OLS estimator is the Best Linear
Unbiased Estimator (BLUE); this result is known as the Gauss-Markov
Theorem.
Main Problems in Regression Analysis
• Two primary problems: Multicollinearity and heteroskedasticity
Multicollinearity
• Two variables are perfectly collinear if there is an exact linear
relationship between them.
• Multicollinearity is the situation in which the degree of correlation is
not only between the dependent variable and the independent
variable, but there is also a strong correlation within (among) the
independent variables themselves.
• A multiple regression equation can make good predictions when
there is multicollinearity, but it is difficult for us to determine how
the dependent variable will change if each independent variable is
changed one at a time.
• When multicollinearity is present, it increases the standard errors
of the coefficients.
• One way to gauge multicollinearity is to calculate the Variance
Inflation Factor (VIF), which assesses how much the variance of
an estimated regression coefficient increases if the predictors are
correlated.
• If no factors are correlated, the VIFs will be equal to 1.
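The VIF of predictor j is 1 / (1 − Rj²), where Rj² comes from regressing Xj on the remaining predictors. A sketch of that computation on synthetic data (function name and data are ours):

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add an intercept
        coef = np.linalg.lstsq(A, y, rcond=None)[0]
        resid = y - A @ coef
        r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
        factors.append(1.0 / (1.0 - r2))            # VIF_j = 1 / (1 - R_j^2)
    return factors

# Synthetic predictors: x3 is nearly a copy of x1, x2 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # large VIFs for x1 and x3, VIF near 1 for x2
```

A common rule of thumb treats VIF values above 5 or 10 as a sign of problematic multicollinearity.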
• The assumption of no perfect collinearity states that there is no
exact linear relationship among the independent variables.
• This assumption implies two aspects of the data on the
independent variables.
• First, none of the independent variables, other than the variable
associated with the intercept term, can be a constant.
• Second, variation in the X’s is necessary.
• In general, the more variation in the independent variables, the
better will be the OLS estimates in terms of identifying the impacts
of the different independent variables on the dependent variable.
Heteroskedasticity
• Refers to the changing variance of the error term.
• If the variance of the error term is not constant across data sets,
there will be erroneous predictions.
• In general, for a regression equation to make accurate predictions,
the error terms should be independent and identically (normally)
distributed (iid).
Improving Accuracy of the Linear Regression Model
• Accuracy refers to how close an estimate is to the actual value,
whereas prediction refers to continuously estimating the value.
High bias = low accuracy (not close to the real value)
High variance = low prediction (values are scattered)
Low bias = high accuracy (close to the real value)
Low variance = high prediction (values are close to each other)
• We want a regression model that is highly accurate and highly
predictive, so that the overall error of the model is low, implying
low bias (high accuracy) and low variance (high prediction). This
is highly preferable.
Accuracy of linear regression can be improved using the following
three methods:
1. Shrinkage Approach
2. Subset Selection
3. Dimensionality (Variable) Reduction
Polynomial Regression Model
• Extension of the simple linear model: extra predictors are
obtained by raising each of the original predictors to a power
(e.g., squaring or cubing them).
• This approach provides a simple way to fit a non-linear curve to
the data. For example, for degree 3:
Y = a + b1·X + b2·X² + b3·X³ + ε
• Let us use the below data set of (X, Y) for a degree-3 polynomial.
• As you can observe, the regression line is slightly curved for
polynomial degree 3 with the above 15 data points.
• The regression line will curve further if we increase the polynomial
degree.
• In the extreme case, the regression line will overfit, passing
through all the original values of X.
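A degree-3 fit can be sketched with NumPy; since the slide's 15 (X, Y) points are not reproduced here, we generate 15 similar synthetic points from a known cubic instead:

```python
import numpy as np

# Hypothetical data: 15 points from a cubic with a little noise
rng = np.random.default_rng(42)
x = np.linspace(-2, 2, 15)
y = 1 + 2 * x - x**2 + 0.5 * x**3 + 0.1 * rng.normal(size=15)

# Degree-3 polynomial fit: linear regression on the predictors x, x^2, x^3
coeffs = np.polyfit(x, y, deg=3)   # coefficients, highest power first
y_hat = np.polyval(coeffs, x)
print(coeffs)                      # close to [0.5, -1.0, 2.0, 1.0]
```

Raising the degree makes the curve bend more; at the extreme it chases the noise and overfits, as noted above.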
What is Logistic Regression?
• Logistic regression is a Classification algorithm.
• Logistic Regression is all about predicting binary variables,
not predicting continuous variables.
• Logistic regression models estimate how the probability of an
event may be affected by one or more explanatory variables.
• Logistic regression is a technique used for predicting “class
probability”, that is the probability that the case belongs to a
particular class.
Use cases of Logistic Regression
• Mail[Spam / Not Spam]
• Transaction [Fraudulent / Normal]
• Tumor [Malignant / Benign]
• Sentiment Analysis [Positive / Negative]
• Weather Prediction [Rain / No Rain]
• Medical Diagnosis [Fit / Ill]
Linear and Logistic Regression
[Figure: the straight line of linear regression vs. the logistic
(sigmoid, S-shaped) curve of logistic regression]
Logistic Regression Curve
[Figure: the logistic regression curve, an S-shaped sigmoid bounded
between 0 and 1]
Some fundamentals terms of Logistic Regression
• The probability that an event will occur is the fraction of times you expect
to see that event in many trials. If the probability of an event occurring
is Y, then the probability of the event not occurring is 1 − Y.
Probabilities always range between 0 and 1.
• The odds are defined as the probability that the event will occur divided
by the probability that the event will not occur. Unlike probability, the
odds are not constrained to lie between 0 and 1 but can take any value
from zero to infinity.
• If the probability of success is P, then the odds of that event are:
Odds = P / (1 − P)
• The logit function is the logarithmic transformation of the logistic function.
It is defined as the natural logarithm of the odds:
logit(P) = ln(P / (1 − P)) = ln(Odds)
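The definitions above can be sketched directly (the helper names odds and logit are ours):

```python
import math

def odds(p):
    """Odds = P / (1 - P)."""
    return p / (1.0 - p)

def logit(p):
    """Logit = natural log of the odds."""
    return math.log(odds(p))

print(odds(0.8))   # ≈ 4.0: the event is 4 times as likely to occur as not
print(logit(0.5))  # → 0.0: even odds map to a logit of zero
```

Note that odds run from 0 to infinity while the logit runs over the whole real line, which is what lets logistic regression model it linearly.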
Math behind Logistic Regression
• Logistic regression assumes the log-odds (logit) is linear in X:
ln(p / (1 − p)) = a + bX
• Exponentiating both sides gives the odds: p / (1 − p) = e^(a+bX)
• Solving for p yields the logistic (sigmoid) function:
p = e^(a+bX) / (1 + e^(a+bX)) = 1 / (1 + e^(−(a+bX)))
• Let us say we have a model that can predict whether a person is
male or female on the basis of their height.
• Given a height of 150 cm, we need to predict whether the person
is male or female.
• We know that the coefficients are a = −100 and b = 0.6.
• Using the above equation, we can calculate the probability of male
given a height of 150 cm, or more formally P(male|height = 150):
P(male|height = 150) = 1 / (1 + e^(−(−100 + 0.6×150))) = 1 / (1 + e^10) ≈ 0.00005,
or a probability of near zero that the person is a male.
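Plugging the slide's coefficients into the logistic function confirms the near-zero probability:

```python
import math

a, b = -100.0, 0.6                   # coefficients given on the slide
height = 150.0
z = a + b * height                   # linear predictor: -100 + 0.6*150 = -10
p_male = 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function
print(p_male)                        # ≈ 4.54e-05, near zero: predict female
```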
Linear vs Logistic Regression
Basis                | Linear Regression           | Logistic Regression
Core concept         | Data is modeled using a     | Data is modeled using a
(modeling of data)   | straight line.              | logistic (sigmoid) function.
Used with            | Continuous variable         | Categorical variable
Output/prediction    | Value of the variable       | Probability of occurrence
                     |                             | of an event
Problem solved       | Regression                  | Classification
Accuracy             | Loss, R², adjusted R²,      | Accuracy, Precision, Recall,
(goodness of fit)    | etc.                        | F1 score, ROC curve,
                     |                             | Confusion matrix, etc.
• The basic difference is the type of function used for mapping:
– Linear: continuous X -> continuous Y
– Logistic: continuous X -> binary Y (used for deciding category or
true/false decisions on the data)
Parameter Estimation by
Maximum Likelihood Method
● The coefficients in a logistic regression are estimated using a process called Maximum
Likelihood Estimation (MLE).
● Likelihood function:
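For binary outcomes yᵢ ∈ {0, 1} with pᵢ = P(yᵢ = 1 | xᵢ), the likelihood maximized by MLE takes the standard Bernoulli form:

```latex
L(\beta_0, \beta_1) = \prod_{i=1}^{n} p_i^{\,y_i}\,(1 - p_i)^{1 - y_i},
\qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}
```

Taking logs gives the log-likelihood Σi [yi ln pi + (1 − yi) ln(1 − pi)], whose score equations have no closed-form solution and are solved iteratively.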
• The probability density function for binary logistic regression is
given by the Bernoulli form:
f(yᵢ) = pᵢ^yᵢ (1 − pᵢ)^(1 − yᵢ), yᵢ ∈ {0, 1}
The above system of equations is solved iteratively to
estimate β0 and β1.