Session 4 - Multiple Linear Regression
WHERE IS IT USED?
A few examples of MLR are as follows:
The treatment cost of a cardiac patient may depend on factors such as age, past
medical history, body weight, blood pressure, and so on.
Salary of MBA students at the time of graduation may depend on factors such as their
academic performance, prior work experience, communication skills, and so on.
Market share of a brand may depend on factors such as price, promotion expenses,
competitors’ prices, etc.
MULTIPLE LINEAR REGRESSION
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_2^2 + \cdots + \beta_k x_k + \varepsilon$
An important task in multiple regression is to estimate the beta values (β1, β2, β3, etc.).
LINEAR OR NON-LINEAR REGRESSION?
$Y = \beta_0 + \dfrac{1}{\beta_1 + \beta_2 X_1} + X_2^{\beta_3} + \varepsilon$
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1 X_2 + \beta_3 X_2^2$
The first model is non-linear in the parameters, whereas the second is linear in the parameters even though it contains interaction and quadratic terms in the predictors.
DEFINE THE FUNCTIONAL FORM OF RELATIONSHIP
For better predictive ability (model accuracy), it is important to specify the correct functional form of the relationship between the dependent variable and the independent variables.
Scatter plots may assist the modeller to define the right functional form.
[Scatter plots: linear relationship between X1 and Y1; log-linear relationship between X2 and Y2]
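As a rough illustration of using scatter plots to choose a functional form, here is a minimal matplotlib sketch on simulated data (the variable names x1, x2, y and the relationships are placeholders, not the course data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Simulated data: y depends linearly on x1 and log-linearly on x2
x1 = rng.uniform(0, 10, 100)
x2 = rng.uniform(1, 100, 100)
y = 3.0 * x1 + 5.0 * np.log(x2) + rng.normal(0, 1.5, 100)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x1, y)
axes[0].set(xlabel="x1", ylabel="y", title="Linear relationship")
axes[1].scatter(np.log(x2), y)
axes[1].set(xlabel="log(x2)", ylabel="y", title="Log-linear relationship")
plt.tight_layout()
plt.show()
```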
REGRESSION: MATRIX REPRESENTATION
Y = Xβ + ε
$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix}
1 & x_{11} & x_{21} & \cdots & x_{k1} \\
1 & x_{12} & x_{22} & \cdots & x_{k2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{1n} & x_{2n} & \cdots & x_{kn}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$
MULTIPLE LINEAR REGRESSION
The equation that describes how the mean value of y is related to x1, x2, . . ., xk is:
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$
ESTIMATED MULTIPLE REGRESSION EQUATION
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$, where $b_0, b_1, \ldots, b_k$ are the sample estimates of $\beta_0, \beta_1, \ldots, \beta_k$ and $\hat{y}$ is the predicted value of the dependent variable.
ESTIMATION PROCESS
LEAST SQUARES METHOD
Minimize the sum of the squares of the deviations between the observed values of the dependent variable $y_i$ and the predicted values of the dependent variable $\hat{y}_i$.
Provides the Best Linear Unbiased Estimate (BLUE), that is, $E(\hat{\beta} - \beta) = 0$, where $\beta$ is the population parameter and $\hat{\beta}$ is the estimated parameter value from the sample.
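A minimal NumPy sketch of the least squares idea on simulated data: build a design matrix with an intercept column and find the coefficients that minimize the sum of squared deviations (all names and values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
y = 2.0 + 1.5 * X1 - 0.8 * X2 + rng.normal(scale=0.5, size=n)

# Design matrix with a column of ones for the intercept (beta_0)
X = np.column_stack([np.ones(n), X1, X2])

# np.linalg.lstsq minimizes sum((y - X @ beta)**2) over beta
beta_hat, sse, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("Estimated coefficients:", beta_hat)
print("Sum of squared errors:", sse)
```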
MLR ASSUMPTIONS
The assumptions made in the multiple linear regression model are as follows:
MLR ASSUMPTIONS
The variance of the residuals, Var(εi|Xi), is constant for all values of Xi. When the
variance of the residuals is constant for different values of Xi, it is called
homoscedasticity. A non-constant variance of residuals is called heteroscedasticity.
There is no high correlation between the independent variables in the model (high correlation between independent variables is called multi-collinearity). Multi-collinearity can destabilize the model and can result in incorrect estimation of the regression parameters.
HAT MATRIX
The regression coefficients are given by $\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}$, so the predicted values are $\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y} = \mathbf{H}\mathbf{Y}$.
The matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}$ is called the hat matrix. It is also known as the influence matrix, since it describes the influence of each observation on the predicted values of the response variable.
The hat matrix plays a crucial role in identifying the outliers and influential observations in the sample.
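A small NumPy sketch (simulated data) of the closed-form estimate and the hat matrix; the diagonal entries of H are the leverage values commonly used to flag influential observations:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y          # beta_hat = (X'X)^-1 X'y
H = X @ XtX_inv @ X.T                 # hat matrix: y_hat = H y
y_hat = H @ y

leverage = np.diag(H)                 # influence of each observation on its own fit
print("beta_hat:", beta_hat)
print("high-leverage points:", np.where(leverage > 2 * (k + 1) / n)[0])
```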
FRAMEWORK FOR BUILDING MULTIPLE LINEAR REGRESSION (MLR)
EXAMPLE
The table below lists the television rating points (CTRP), promotion expenses (P), and revenue (R) for 38 observations. Build an MLR model to predict revenue using CTRP and promotion expenses.
Serial CTRP P R Serial CTRP P R
1 133 111600 1197576 20 156 104400 1326360
2 111 104400 1053648 21 119 136800 1162596
3 129 97200 1124172 22 125 115200 1195116
4 117 79200 987144 23 130 115200 1134768
5 130 126000 1283616 24 123 151200 1269024
6 154 108000 1295100 25 128 97200 1118688
7 149 147600 1407444 26 97 122400 904776
8 90 104400 922416 27 124 208800 1357644
9 118 169200 1272012 28 138 93600 1027308
10 131 75600 1064856 29 137 115200 1181976
11 141 133200 1269960 30 129 118800 1221636
12 119 133200 1064760 31 97 129600 1060452
13 115 176400 1207488 32 133 100800 1229028
14 102 180000 1186284 33 145 147600 1406196
15 129 133200 1231464 34 149 126000 1293936
16 144 147600 1296708 35 122 108000 1056384
17 153 122400 1320648 36 120 194400 1415316
18 96 158400 1102704 37 128 176400 1338060
19 104 165600 1184316 38 117 172800 1457400
EXAMPLE
The MLR model is given by $R = \beta_0 + \beta_1 \, CTRP + \beta_2 \, P + \varepsilon$.
[Model Summary table: R, R-Square, Adjusted R-Square, Std. Error of the Estimate]
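As a sketch, such a model might be fitted in Python with statsmodels; for brevity only the first few rows of the table above are used, so the estimates will differ from the full-sample output discussed next:

```python
import pandas as pd
import statsmodels.formula.api as smf

# First few rows of the table above (all 38 rows would be used in practice)
df = pd.DataFrame({
    "CTRP": [133, 111, 129, 117, 130, 154, 149, 90],
    "P":    [111600, 104400, 97200, 79200, 126000, 108000, 147600, 104400],
    "R":    [1197576, 1053648, 1124172, 987144, 1283616, 1295100, 1407444, 922416],
})

model = smf.ols("R ~ CTRP + P", data=df).fit()
print(model.summary())   # reports R-square, adjusted R-square, coefficients, t- and p-values
```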
COEFFICIENTS
For every one-unit increase in CTRP, the revenue increases by 5931.850 when the variable promotion is kept constant, and for every one-unit increase in promotion, the revenue increases by 3.136 when CTRP is kept constant.
Note that television-rating point is likely to change when the amount spent on promotion is changed.
STANDARDIZED REGRESSION CO-EFFICIENT
The coefficient value for CTRP is 5931.85 and the coefficient for promotion
spend is 3.136. However, this does not mean that CTRP has more influence on
the revenue compared to promotion expenses.
The reason is that the unit of measurement for CTRP is different from the unit of
measurement for promotion.
We have to derive standardized regression coefficients to compare the impact of
different explanatory variables that have different units of measurement.
Since the regression coefficients cannot be compared directly due to differences in scale and units of measurement of the variables, one has to normalize the data to compare the regression coefficients and their impact on the response variable.
STANDARDIZED REGRESSION CO-EFFICIENT
A regression model can be built on the standardized dependent variable and the standardized independent variables; the resulting regression coefficients are then known as standardized regression coefficients.
The standardized regression coefficient can also be calculated using the
following formula:
Standardized Beta $= \hat{\beta}_i \times \dfrac{S_{X_i}}{S_Y}$
Where SXi is the standard deviation of the explanatory variable Xi and SY is the
standard deviation of the response variable Y.
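A short sketch of both routes to standardized coefficients, reusing the same illustrative subset of the revenue data (column names CTRP, P, R follow the table above):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same illustrative subset of the revenue data as in the earlier sketch
df = pd.DataFrame({
    "CTRP": [133, 111, 129, 117, 130, 154, 149, 90],
    "P":    [111600, 104400, 97200, 79200, 126000, 108000, 147600, 104400],
    "R":    [1197576, 1053648, 1124172, 987144, 1283616, 1295100, 1407444, 922416],
})
model = smf.ols("R ~ CTRP + P", data=df).fit()

# Route 1: rescale the unstandardized coefficients by S_Xi / S_Y
std_betas = model.params[["CTRP", "P"]] * df[["CTRP", "P"]].std() / df["R"].std()
print(std_betas)

# Route 2: refit on z-scored variables; the slopes are the standardized coefficients
z = (df - df.mean()) / df.std()
print(smf.ols("R ~ CTRP + P", data=z).fit().params)
```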
STANDARDIZED REGRESSION CO-EFFICIENT
For the revenue model, $\hat{\beta}_1 = 5931.850$ (CTRP) and $\hat{\beta}_2 = 3.136$ (promotion).
Standardized Beta $= \hat{\beta}_i \times \dfrac{S_{X_i}}{S_Y}$
STANDARDIZED REGRESSION CO-EFFICIENT
For a one standard deviation change in the explanatory variable, the standardized regression coefficient captures the number of standard deviations by which the response variable will change.
For example, when CTRP is changed by one standard deviation, revenue
will change by 0.732 standard deviations.
Similarly, when promotion changes by one standard deviation, revenue will
change by 0.736 standard deviations.
That is, the variable promotion has slightly higher impact on the revenue
compared to CTRP.
REGRESSION MODELS WITH QUALITATIVE VARIABLES
REGRESSION MODELS WITH QUALITATIVE VARIABLES
The data in the table below give the salary and educational qualification of 30 randomly chosen people in Mumbai (representative rows shown). Build a regression model to establish the relationship between salary earned and educational qualification.
S. No.   Education code   HS   UG   PG   Salary
1        1                1    0    0    9800
11       2                0    1    0    17200
19       3                0    0    1    18500
27       4                0    0    0    7700
REGRESSION MODELS WITH QUALITATIVE VARIABLES
The model is $Y = \beta_0 + \beta_1 \, HS + \beta_2 \, UG + \beta_3 \, PG + \varepsilon$, where HS, UG, and PG are the dummy variables corresponding to the categories high school, under-graduate, and post-graduate, respectively.
The fourth category (none), for which we did not create an explicit dummy variable, is called the base category. In this equation, when HS = UG = PG = 0, the value of Y is $\beta_0$, which corresponds to the education category “none”.
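A sketch of building such a dummy-variable model in Python; the data frame, category labels, and salary values below are made up for illustration and only mimic the structure of the table:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
levels = ["none", "high_school", "under_graduate", "post_graduate"]
base = {"none": 7500, "high_school": 13000, "under_graduate": 17000, "post_graduate": 19000}

# Hypothetical data standing in for the 30-person salary table
df = pd.DataFrame({"education": rng.choice(levels, size=30)})
df["salary"] = df["education"].map(base) + rng.normal(0, 800, size=30)

# Create HS/UG/PG dummies; 'none' is left out and becomes the base category
dummies = pd.get_dummies(df["education"]).drop(columns="none").astype(float)
fit = sm.OLS(df["salary"], sm.add_constant(dummies)).fit()
print(fit.params)  # constant ~ mean of base category; other coefficients are differences from it
```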
REGRESSION MODELS WITH QUALITATIVE VARIABLES
Coefficients
Model                 B (Unstd.)   Std. Error   Beta (Std.)   t-value   p-value
(Constant)            7383.333     1184.793                   6.232     0.000
High-School (HS)      5437.667     1498.658     0.505         3.628     0.001
Under-Graduate (UG)   9860.417     1567.334     0.858         6.291     0.000
INTERPRETATION OF REGRESSION COEFFICIENTS OF CATEGORICAL VARIABLES
The constant 7383.333 is the average salary of the base category (“none”). The coefficient of HS (5437.667) is the difference in average salary between the high-school category and the base category, and the coefficient of UG (9860.417) is the difference in average salary between the under-graduate category and the base category.
INTERACTION VARIABLES IN REGRESSION MODELS
An interaction variable is the product of two explanatory variables (for example, Gender × WE). It is included in the model when the effect of one explanatory variable on the response is believed to depend on the value of another explanatory variable.
EXAMPLE
Salary, gender, and work experience (WE) of 30 workers in a firm.
Female: Gender = 1; Male: Gender = 0; and WE is the work experience in number of years.
Build a regression model by including an interaction variable between gender and work experience.
S. No. Gender WE Salary S. No. Gender WE Salary
1 1 2 6800 16 0 2 22100
2 1 3 8700 17 0 1 20200
3 1 1 9700 18 0 1 17700
4 1 3 9500 19 0 6 34700
5 1 4 10100 20 0 7 38600
6 1 6 9800 21 0 7 39900
7 0 2 14500 22 0 7 38300
8 0 3 19100 23 0 3 26900
9 0 4 18600 24 0 4 31800
10 0 2 14200 25 1 5 8000
11 0 4 28000 26 1 5 8700
12 0 3 25700 27 1 3 6200
13 0 1 20350 28 1 3 4100
14 0 4 30400 29 1 2 5000
15 0 1 19400 30 1 1 4800
SOLUTION
Let the regression model be:
Y = β0 + β1 × Gender + β2 × WE + β3 × Gender × WE
The output for the regression model including the interaction variable is given in the table below.
[Regression output table: Model, Unstandardized Coefficients, Standardized Coefficients, t, Sig.]
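One way to fit this interaction model is with a statsmodels formula, where Gender * WE expands to Gender + WE + Gender:WE; the sketch below uses only the first ten rows of the worker table for brevity:

```python
import pandas as pd
import statsmodels.formula.api as smf

# First ten rows of the worker table above (the full model would use all 30)
workers = pd.DataFrame({
    "Gender": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "WE":     [2, 3, 1, 3, 4, 6, 2, 3, 4, 2],
    "Salary": [6800, 8700, 9700, 9500, 10100, 9800, 14500, 19100, 18600, 14200],
})

# Gender * WE expands to Gender + WE + Gender:WE (the interaction term)
fit = smf.ols("Salary ~ Gender * WE", data=workers).fit()
print(fit.params)
```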
VALIDATION OF MULTIPLE REGRESSION MODEL
The following measures and tests are carried out to validate a multiple linear regression model:
Coefficient of multiple determination (R-square) and adjusted R-square.
t-test for the statistical significance of individual explanatory variables.
F-test for the overall model, and partial F-test for portions of the model.
Residual analysis, including checks for heteroscedasticity and auto-correlation.
Variance inflation factor (VIF) to check for multi-collinearity.
VALIDATION OF MULTIPLE REGRESSION MODEL
F-test to check the statistical significance of the overall model at a given
significance level (α) or at (1 − α) 100% confidence level.
$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST} = 1 - \dfrac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \overline{Y})^2}$
Problems with R-squared statistic
The R-squared statistic isn’t perfect. In fact, it suffers from a major flaw: its value never decreases, no matter how many variables we add to the regression model.
That is, even if we are adding redundant variables to the data, the value of R-
squared does not decrease. It either remains the same or increases with the
addition of new independent variables.
This clearly does not make sense because some of the independent variables
might not be useful in determining the target variable. Adjusted R-squared deals
with this issue.
CO-EFFICIENT OF MULTIPLE DETERMINATION (R-SQUARE) AND
ADJUSTED R-SQUARE
SSE is the sum of squares of errors and SST is the sum of squares of total
deviation. In case of MLR, SSE will decrease as the number of explanatory
variables increases, and SST remains constant.
So, it is possible, that R-square will increase even when there is no
statistically significant relationship between the explanatory variable and
the response variable.
To counter this, R2 value is adjusted by normalizing both SSE and SST with the
corresponding degrees of freedom. The adjusted R-square is given by
Adjusted R-Square $= 1 - \dfrac{SSE/(n - k - 1)}{SST/(n - 1)}$
where n is the number of observations and k is the number of explanatory variables in the model.
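A small helper that applies the two formulas above to observed and predicted values (the function name and sample numbers are illustrative):

```python
import numpy as np

def r_square(y, y_hat, k):
    """Return (R-square, adjusted R-square) for k explanatory variables."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)             # sum of squared errors
    sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
    r2 = 1 - sse / sst
    adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
    return r2, adj_r2

print(r_square([3, 5, 7, 9], [2.8, 5.3, 6.9, 9.1], k=1))
```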
CO-EFFICIENT OF MULTIPLE DETERMINATION (R-SQUARE) AND
ADJUSTED R-SQUARE
STATISTICAL SIGNIFICANCE OF INDIVIDUAL VARIABLES IN
MLR – T-TEST
Checking the statistical significance of individual variables is achieved through a t-test. Note that the estimate of the regression coefficients is given by:
$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}$
This means the estimated value of the regression coefficient is a linear function of the response variable. Since we assume that the residuals follow a normal distribution, Y follows a normal distribution, and the estimate of the regression coefficient also follows a normal distribution. Since the standard deviation of the regression coefficient is estimated from the sample, we use a t-test.
STATISTICAL SIGNIFICANCE OF INDIVIDUAL VARIABLES IN
MLR – T-TEST
The null and alternative hypotheses for an individual independent variable Xi and the dependent variable Y are given, respectively, by
H0: There is no relationship between the independent variable Xi and the dependent variable Y
HA: There is a relationship between the independent variable Xi and the dependent variable Y
Equivalently,
H0: βi = 0
HA: βi ≠ 0
The test statistic is $t = \hat{\beta}_i / S_e(\hat{\beta}_i)$.
VALIDATION OF OVERALL REGRESSION MODEL – F-TEST
The F-test checks the statistical significance of the overall model:
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$
$H_A:$ Not all $\beta_i$ are equal to zero
$F = \dfrac{MSR}{MSE} = \dfrac{SSR/k}{SSE/(n - k - 1)}$
VALIDATION OF PORTIONS OF A MLR MODEL – PARTIAL F-
TEST
$\text{Partial } F = \dfrac{(SSE_R - SSE_F)/(k - r)}{MSE_F}$
where $SSE_R$ is the error sum of squares of the reduced model with r explanatory variables, and $SSE_F$ and $MSE_F$ are the error sum of squares and mean square error of the full model with k explanatory variables.
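A sketch of a partial F-test in statsmodels: fit the reduced and the full model and compare them with anova_lm, which reports the same partial F statistic and its p-value (data and variable names are simulated placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
n = 80
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n), "x3": rng.normal(size=n)})
df["y"] = 1 + 2 * df["x1"] + 0.5 * df["x2"] + rng.normal(scale=1.0, size=n)

reduced = smf.ols("y ~ x1", data=df).fit()            # r explanatory variables
full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()     # k explanatory variables

# Partial F = ((SSE_R - SSE_F)/(k - r)) / MSE_F
print(anova_lm(reduced, full))
```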
RESIDUAL ANALYSIS IN MULTIPLE LINEAR REGRESSION
MULTI-COLLINEARITY AND VARIANCE INFLATION FACTOR
Multi-collinearity can have the following impact on the model:
The standard error of the estimate of a regression coefficient may be inflated, which may lead to failing to reject the null hypothesis in the t-test and thus to dropping a statistically significant explanatory variable from the model.
The t-statistic value is $\hat{\beta} / S_e(\hat{\beta})$.
If $S_e(\hat{\beta})$ is inflated, then the t-value will be underestimated, resulting in a high p-value that may lead to failing to reject the null hypothesis.
Thus, it is possible that a statistically significant explanatory variable may
be labelled as statistically insignificant due to the presence of multi-
collinearity.
IMPACT OF MULTICOLLINEARITY
The sign of a regression coefficient may be reversed, that is, instead of a negative value for the regression coefficient, we may have a positive regression coefficient, and vice versa.
Adding/removing a variable or even an observation may result in large
variation in regression coefficient estimates.
MULTICOLLINEARITY: EXAMPLE
VARIANCE INFLATION FACTOR (VIF)
Consider the model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$. The variance inflation factor measures how much the variance of an estimated regression coefficient is inflated because of the correlation between the explanatory variables.
VARIANCE INFLATION FACTOR (VIF)
$VIF = \dfrac{1}{1 - R_{12}^2}$
The value $1 - R_{12}^2$ is called the tolerance ($R_{12}^2$ is the R-square value for the regression model $X_1 = \alpha_0 + \alpha_1 X_2$).
$\sqrt{VIF}$ is the factor by which the t-statistic is deflated. So, the actual t-value is given by
$t_{actual} = \dfrac{\hat{\beta}_1}{S_e(\hat{\beta}_1)} \times \sqrt{VIF}$
VARIANCE INFLATION FACTOR (VIF)
There will be some correlation between explanatory variables in almost all cases, thus
the value of VIF is likely to be more than one.
The threshold value for VIF is 4. VIF value of greater than 4 requires further
investigation to assess the impact of multi-collinearity.
Before building the multiple regression models, it is advised to check the correlation
between different explanatory variables for potential multi-collinearity.
A VIF value equal to 4 implies that the t-statistic value is deflated by a factor of 2 (that is, $\sqrt{4}$), and thus there will be a significant increase in the corresponding p-value.
The serious impact of multi-collinearity is that it can change the sign of the
regression coefficient (for example, instead of positive, the model may have negative
regression coefficient for a predictor and vice versa).
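A sketch of computing VIF for each explanatory variable with statsmodels on simulated data (x2 is deliberately constructed to be highly correlated with x1):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)   # deliberately correlated with x1
x3 = rng.normal(size=n)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing X_i on the other predictors
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```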
REMEDIES FOR HANDLING MULTI-COLLINEARITY
Remove one of the correlated variables from the model. For example, you may remove a variable that is either difficult or expensive to collect.
Another approach suggested by researchers is to use centered variables, that is, use $(X_i - \overline{X}_i)$ instead of $X_i$.
When there are many variables in the data, we can use Principal Component Analysis (PCA) to avoid multi-collinearity. PCA will create orthogonal components and thus remove potential multi-collinearity.
In recent years, authors have used advanced regression models such as Ridge regression and LASSO regression to handle multi-collinearity.
RIDGE AND LASSO REGRESSION
Ridge and LASSO regression work by penalizing the magnitude of the coefficients of features while minimizing the error between predicted and actual observations. These are called ‘regularization’ techniques.
Ridge Regression:
1. Performs L2 regularization, i.e., adds penalty equivalent to the square of the magnitude of
coefficients
2. Minimization objective = LS Obj + α * (sum of square of coefficients)
Lasso Regression:
1. Performs L1 regularization, i.e., adds penalty equivalent to the absolute value of the magnitude
of coefficients
2. Minimization objective = LS Obj + α * (sum of the absolute value of coefficients)
Here, LS Obj refers to the ‘least squares objective,’ i.e., the linear regression objective without
regularization.
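A minimal scikit-learn sketch of both regularized models on simulated data; alpha is the penalty weight (the α in the objectives above), and the features are scaled first because the penalties depend on coefficient magnitude:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(11)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)   # L1 penalty: can zero them out

print("Ridge:", ridge[-1].coef_)
print("Lasso:", lasso[-1].coef_)
```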
While time series data is data collected over time, there are different types of data that describe how and when that data was recorded. For example:
Time series data is data that is recorded over consistent intervals of time.
Cross-sectional data consists of several variables recorded at the same
time.
Pooled data is a combination of both time series data and cross-sectional
data.
TIME SERIES DATA
Example: Retail Sales
AUTO-CORRELATION
Autocorrelation is simply the correlation of the data with itself. Instead of measuring the correlation between two different random variables, we measure the correlation of a random variable with itself, hence the name auto-correlation.
For time-series, the autocorrelation is the correlation of that time series at two
different points in time (also known as lags).
It is the correlation between successive error terms in a time-series data.
Consider a time-series model as defined below:
$Y_t = \beta_0 + \beta_1 X_t + \varepsilon_t$
AUTO-CORRELATION
Autocorrelation $= \dfrac{\mathrm{Cov}(\varepsilon_t, \varepsilon_{t-1})}{\sqrt{\mathrm{Var}(\varepsilon_t)\,\mathrm{Var}(\varepsilon_{t-1})}}$, the correlation between successive error terms $\varepsilon_t$ and $\varepsilon_{t-1}$.
DURBIN-WATSON TEST FOR AUTO-CORRELATION
The Durbin−Watson statistic is $D = \dfrac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$, where $e_t$ is the residual at time period t.
The Durbin−Watson test has two critical values, DL and DU. The inference of the test can be made based on the following conditions:
If D < DL, then the errors are positively correlated.
If D > DU, then there is no evidence for positive auto-correlation.
If DL < D < DU, the Durbin−Watson test is inconclusive.
If (4 − D) < DL, then the errors are negatively correlated.
If (4 − D) > DU, there is no evidence for negative auto-correlation.
If DL < (4 − D) < DU, the test is inconclusive.
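A sketch of computing the Durbin−Watson statistic for a fitted model's residuals with statsmodels (simulated data with a deliberately time-dependent error); a value near 2 suggests no auto-correlation, and the critical values DL and DU come from Durbin−Watson tables:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(21)
t = np.arange(120)
df = pd.DataFrame({"x": rng.normal(size=120)})
df["y"] = 5 + 2 * df["x"] + np.sin(t / 6) + rng.normal(scale=0.3, size=120)  # time-dependent error

fit = smf.ols("y ~ x", data=df).fit()
D = durbin_watson(fit.resid)
print("Durbin-Watson D =", round(D, 3))   # compare with D_L and D_U from the tables
```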
RESIDUAL PLOT
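A sketch of a basic residual plot (residuals versus fitted values) on simulated data; a patternless band around zero supports the linearity and constant-variance assumptions, while a funnel shape or a systematic trend signals problems:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 80)
y = 4 + 2.5 * x + rng.normal(scale=1.0, size=80)

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals vs fitted values: a patternless band around zero supports the model assumptions
plt.scatter(fit.fittedvalues, fit.resid, s=15)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residual plot")
plt.show()
```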