Session 4 - Multiple Linear Regression


REGRESSION ANALYSIS

MULTIPLE LINEAR REGRESSION

 In this chapter we continue our study of regression analysis by considering situations involving two or more independent variables.
 Multiple regression analysis enables us to consider more factors and thus obtain better estimates than are possible with simple linear regression.

2
WHERE IS IT USED?
A few examples of MLR are as follows:

 The treatment cost of a cardiac patient may depend on factors such as age, past
medical history, body weight, blood pressure, and so on.

 Salary of MBA students at the time of graduation may depend on factors such as their
academic performance, prior work experience, communication skills, and so on.

 Market share of a brand may depend on factors such as price, promotion expenses,
competitors’ prices, etc.

3
MULTIPLE LINEAR REGRESSION

 Multiple linear regression means linear in regression parameters (beta values).


The following are examples of multiple linear regression:

Y = β0 + β1x1 + β2x2 + ... + βkxk + ε

Y = β0 + β1x1 + β2x2 + β3x1x2 + β4x2² + ... + βkxk + ε

 An important task in multiple regression is to estimate the beta values (β1, β2, β3, etc.).

5
LINEAR OR NON-LINEAR REGRESSION ???

Y = β0 + 1 / (β1 + β2X1) + X2^β3 + ε

Y = β0 + β1X1 + β2X1X2 + β3X2²

6
DEFINE THE FUNCTIONAL FORM OF RELATIONSHIP
 For better predictive ability (model accuracy) it is important to specify the correct
functional form between the dependent variable and the independent variable.

 Scatter plots may assist the modeller to define the right functional form.
(Scatter plots: linear relationship between X1 and Y1; log-linear relationship between X2 and Y2)

7
REGRESSION: MATRIX REPRESENTATION

Y = Xβ + ε, which in matrix form is

$$
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix}
1 & x_{11} & x_{21} & \cdots & x_{k1} \\
1 & x_{12} & x_{22} & \cdots & x_{k2} \\
\vdots & \vdots & \vdots &  & \vdots \\
1 & x_{1n} & x_{2n} & \cdots & x_{kn}
\end{bmatrix}
\cdot
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
$$

8
MULTIPLE LINEAR REGRESSION

The equation that describes how the mean value of y is related to x1, x2, . . . xk
is:

E(y) = β0 + β1x1 + β2x2 + . . . + βkxk

9
ESTIMATED MULTIPLE REGRESSION EQUATION

ŷ = b0 + b1x1 + b2x2 + . . . + bkxk

A simple random sample is used to compute sample statistics b0, b1, b2, . . . , bk that are used as the point estimators of the parameters β0, β1, β2, . . . , βk.

10
ESTIMATION PROCESS

Regression model: y = β0 + β1x1 + β2x2 + . . . + βkxk + ε
Regression equation: E(y) = β0 + β1x1 + β2x2 + . . . + βkxk
Unknown parameters: β0, β1, β2, . . . , βk

Sample data on (x1, x2, . . . , xk, y) are used to compute the sample statistics b0, b1, b2, . . . , bk, which give the estimated regression equation ŷ = b0 + b1x1 + b2x2 + . . . + bkxk. The sample statistics b0, b1, b2, . . . , bk provide estimates of the parameters β0, β1, β2, . . . , βk.

11
LEAST SQUARES METHOD

 Least Squares Criterion

 Minimize the sum of the squares of the deviations between the observed values of the dependent variable yi and the predicted values of the dependent variable ŷi.

 Provides the Best Linear Unbiased Estimate (BLUE), that is, E(β̂ − β) = 0, where β is the population parameter and β̂ is the estimated parameter value from the sample.

12
MLR ASSUMPTIONS

The assumptions that are made in multiple linear regression model are as follows:

 The regression model is linear in the parameters.


 The explanatory variable, X, is assumed to be non-stochastic (that is, X is
deterministic).
 The conditional expected value of the residuals, E(εi ), is zero.
 In time series data, the residuals are uncorrelated, that is, Cov(εi, εj) = 0 for all i ≠ j.

13
MLR ASSUMPTIONS

 The residuals, εi, follow a normal distribution.

 The variance of the residuals, Var(εi|Xi), is constant for all values of Xi. When the
variance of the residuals is constant for different values of Xi, it is called
homoscedasticity. A non-constant variance of residuals is called heteroscedasticity.

 There is no high correlation between independent variables in the model (called multi-
collinearity). Multi-collinearity can destabilize the model and can result in incorrect
estimation of the regression parameters.

14
HAT MATRIX

The regression coefficients β̂ are given by β̂ = (XᵀX)⁻¹XᵀY

The estimated values of the response variable are Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY

In the above equation, the predicted value of the dependent variable Ŷi is a linear function of Yi. The equation can be written as

Ŷ = HY, where H = X(XᵀX)⁻¹Xᵀ

 H is called the hat matrix, also known as the influence matrix, since it describes the influence of each observation on the predicted values of the response variable.
 The hat matrix plays a crucial role in identifying outliers and influential observations in the sample.
15
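The closed-form estimate and the hat matrix above can be illustrated with a short NumPy sketch; the small X and y arrays below are made-up illustrative values, not data from these slides.

```python
import numpy as np

# Design matrix with a leading column of ones for the intercept (illustrative values)
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 5.0, 7.0],
              [1.0, 7.0, 2.0]])
y = np.array([10.0, 12.0, 20.0, 18.0])

# beta_hat = (X'X)^(-1) X'y, solved without forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^(-1) X'; fitted values are y_hat = H y
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y

print(beta_hat)
print(np.diag(H))   # diagonal of H = leverages, i.e., influence of each observation
```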
FRAMEWORK FOR BUILDING MULTIPLE LINEAR REGRESSION (MLR)

16
EXAMPLE

 The Cumulative Television Rating Points (CTRP) of a television program, the money spent on promotion (denoted as P), and the advertisement revenue (in Indian rupees, denoted as R) generated over a one-month period are provided for 38 different television programs.
 Develop a multiple regression model to understand the relationship between the advertisement revenue (R), as the response or outcome variable, and promotion (P) and CTRP, as the predictor or explanatory variables.

17
Serial CTRP P R Serial CTRP P R
1 133 111600 1197576 20 156 104400 1326360
2 111 104400 1053648 21 119 136800 1162596
3 129 97200 1124172 22 125 115200 1195116
4 117 79200 987144 23 130 115200 1134768
5 130 126000 1283616 24 123 151200 1269024
6 154 108000 1295100 25 128 97200 1118688
7 149 147600 1407444 26 97 122400 904776
8 90 104400 922416 27 124 208800 1357644
9 118 169200 1272012 28 138 93600 1027308
10 131 75600 1064856 29 137 115200 1181976
11 141 133200 1269960 30 129 118800 1221636
12 119 133200 1064760 31 97 129600 1060452
13 115 176400 1207488 32 133 100800 1229028
14 102 180000 1186284 33 145 147600 1406196
15 129 133200 1231464 34 149 126000 1293936
16 144 147600 1296708 35 122 108000 1056384
17 153 122400 1320648 36 120 194400 1415316
18 96 158400 1102704 37 128 176400 1338060
19 104 165600 1184316 38 117 172800 1457400
EXAMPLE
The MLR model is given by

R (Advertisement Revenue) = β0 + β1 × CTRP + β2 × P + ε

The regression coefficients can be estimated using OLS (ordinary least squares) estimation. The output for the above regression model is provided in the tables below.

Model Summary
Model   R       R-Square   Adjusted R-Square   Std. Error of the Estimate
1       0.912   0.832      0.822               57548.382

19
COEFFICIENTS

Model      Unstandardized Coefficients    Standardized Coefficients    t        Sig.
           B            Std. Error        Beta
Constant   41008.840    90958.920                                      0.451    0.655
CTRP       5931.850     576.622           0.732                        10.287   0.000
P          3.136        0.303             0.736                        10.344   0.000

The regression model after estimation of the parameters is given by


R = 41008.84 + 5931.850 CTRP + 3.136 P

 For every one-unit increase in CTRP, the revenue increases by 5931.850 when promotion is kept constant, and for every one-unit increase in promotion the revenue increases by 3.136 when CTRP is kept constant.
 Note that the television rating points are likely to change when the amount spent on promotion is changed.
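As a sketch of how this model could be estimated in practice, the snippet below fits R on CTRP and P with statsmodels; it assumes a pandas DataFrame named df holding the 38 rows above with columns 'CTRP', 'P', and 'R'.

```python
import statsmodels.api as sm

# df is an assumed DataFrame with columns 'CTRP', 'P', and 'R' (the data table above)
X = sm.add_constant(df[['CTRP', 'P']])   # add the intercept column
model = sm.OLS(df['R'], X).fit()
print(model.summary())                   # coefficients, t-values, R-square, F-test
```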
STANDARDIZED REGRESSION CO-EFFICIENT

 The coefficient value for CTRP is 5931.85 and the coefficient for promotion
spend is 3.136. However, this does not mean that CTRP has more influence on
the revenue compared to promotion expenses.
 The reason is that the unit of measurement for CTRP is different from the unit of
measurement for promotion.
 We have to derive standardized regression coefficients to compare the impact of
different explanatory variables that have different units of measurement.
 Since the regression coefficients can not be compared directly due to differences
in scale and units of measurement of variables, one has to normalize the data to
compare the regression coefficients and their impact on the response variable.

21
STANDARDIZED REGRESSION CO-EFFICIENT
 A regression model can be built on standardized dependent variable and
standardized independent variables, the resulting regression coefficients are
then known as standardized regression coefficients.
 The standardized regression coefficient can also be calculated using the
following formula:
Standardized Beta = β̂ × (SXi / SY)
 Where SXi is the standard deviation of the explanatory variable Xi and SY is the
standard deviation of the response variable Y.

22
STANDARDIZED REGRESSION CO-EFFICIENT

S_Revenue = 136527.88    S_CTRP = 16.85    S_Promotion = 32052.62

β1 = 5931.850    β2 = 3.136

Standardized Beta = β̂ × (SXi / SY)

 Standardized regression coefficient for CTRP = 5931.850× 16.85 /136527.88 = 0.732


 Standardized regression coefficient for Promotion = 3.136× 32052.62 /136527.88=
0.736

23
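A sketch of the same rescaling in code, assuming the df and fitted model from the earlier statsmodels example:

```python
# Standardized beta = unstandardized beta * (S_X / S_Y)
std_beta_ctrp = model.params['CTRP'] * df['CTRP'].std() / df['R'].std()
std_beta_p = model.params['P'] * df['P'].std() / df['R'].std()
print(std_beta_ctrp, std_beta_p)   # approximately 0.732 and 0.736, as in the table
```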
STANDARDIZED REGRESSION CO-EFFICIENT

 For one standard deviation change in the explanatory variable, the standard
regression coefficient captures the number of standard deviations by which the
response variable will change.
 For example, when CTRP is changed by one standard deviation, revenue
will change by 0.732 standard deviations.
 Similarly, when promotion changes by one standard deviation, revenue will
change by 0.736 standard deviations.
 That is, the variable promotion has slightly higher impact on the revenue
compared to CTRP.

24
REGRESSION MODELS WITH QUALITATIVE VARIABLES

 In MLR, many predictor variables are likely to be qualitative or categorical


variables.
 Since categorical variables are not measured on an interval or ratio scale, we cannot include them directly in the model; including them directly would result in model misspecification.
 We have to pre-process the categorical variables using dummy variables
for building a regression model.

25
REGRESSION MODELS WITH QUALITATIVE VARIABLES
 The data in Table provides salary and educational qualifications of 30 randomly chosen people in
Mumbai. Build a regression model to establish the relationship between salary earned and their
educational qualifications.

S. No. Education Salary S. No. Education Salary S. No. Education Salary

1 1 9800 11 2 17200 21 3 21000


2 1 10200 12 2 17600 22 3 19400
3 1 14200 13 2 17650 23 3 18800
4 1 21000 14 2 19600 24 3 21000
5 1 16500 15 2 16700 25 4 6500
6 1 19210 16 2 16700 26 4 7200
7 1 9700 17 2 17500 27 4 7700
8 1 11000 18 2 15000 28 4 5600
9 1 7800 19 3 18500 29 4 8000
10 1 8800 20 3 19700 30 4 9300
1: HS, 2: UG, 3: PG, 4: None 26
REGRESSION MODELS WITH QUALITATIVE VARIABLES
Note that, if we build a model Y = β 0 + β1 × Education, it will be incorrect. We have to use 3
dummy variables since there are 4 categories for educational qualification. Data has to be pre-
processed using 3 dummy variables (HS, UG and PG) as shown in Table.

Pre-processed data (sample)

Observation   Education   High School (HS)   Under-Graduate (UG)   Post-Graduate (PG)   Salary
1             1           1                  0                     0                    9800
11            2           0                  1                     0                    17200
19            3           0                  0                     1                    18500
27            4           0                  0                     0                    7700

27
REGRESSION MODELS WITH QUALITATIVE VARIABLES

 The corresponding regression model is as follows:


Y = β0 + β1 × HS + β2 × UG + β3 × PG

 where HS, UG, and PG are the dummy variables corresponding to the
categories high school, under-graduate, and post-graduate, respectively.
 The fourth category (none) for which we did not create an explicit dummy
variable is called the base category. In Eq, when HS = UG = PG = 0, the value
of Y is β0, which corresponds to the education category, “none”.

28
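One possible way to build the dummy variables and fit this model with pandas and statsmodels is sketched below; salary_df is an assumed DataFrame holding the integer 'Education' codes and 'Salary' from the table.

```python
import pandas as pd
import statsmodels.api as sm

# salary_df is an assumed DataFrame with columns 'Education' (1-4) and 'Salary'
labels = {1: 'HS', 2: 'UG', 3: 'PG', 4: 'None'}
dummies = pd.get_dummies(salary_df['Education'].map(labels), dtype=float)

# Keep only HS, UG, PG: the 'None' category is the base category
X = sm.add_constant(dummies[['HS', 'UG', 'PG']])
model = sm.OLS(salary_df['Salary'], X).fit()
print(model.params)   # intercept = mean salary of the base category
```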
REGRESSION MODELS WITH QUALITATIVE VARIABLES

Coefficients
Model                  Unstandardized Coefficients    Standardized Coefficients   t-value   p-value
                       B            Std. Error        Beta
(Constant)             7383.333     1184.793                                      6.232     0.000
High-School (HS)       5437.667     1498.658          0.505                       3.628     0.001
Under-Graduate (UG)    9860.417     1567.334          0.858                       6.291     0.000
Post-Graduate (PG)     12350.000    1675.550          0.972                       7.371     0.000

The corresponding regression equation is given by


Y = 7383.33 + 5437.667 × HS + 9860.417 × UG + 12350.00 × PG
Note that all the dummy variables are statistically significant at α = 0.01, since their p-values are less than 0.01.

29
INTERPRETATION OF REGRESSION COEFFICIENTS OF CATEGORICAL
VARIABLES

 In regression model with categorical variables, the regression coefficient


corresponding to a specific category represents the change in the value of Y
from the base category value (β0).
 For example, when HS = UG = PG = 0, the value of Y = 7383.33. In this case,
the base category is the education ‘none’. That is, when education category is
none, the average salary is 7383.33.
 When education category is ‘HS’, we get Y = 7383.333 + 5437.667 = 12821.00
 That is, 5437.667 is the shift or deviation from the base category for category
‘HS’ (education category high school)

30
INTERACTION VARIABLES IN REGRESSION MODELS

 Interaction variables are variables included in the regression model that are the product of two independent variables (such as X1X2).

 Usually the interaction variables involve a continuous and a categorical variable.

 The inclusion of interaction variables helps to check for the existence of a conditional relationship between the dependent variable and two independent variables.

31
EXAMPLE
Salary, gender, and work experience (WE) of 30 workers in a firm.
Female: Gender = 1; Male: Gender = 0; and WE is the work experience in number of years.
Build a regression model by including an interaction variable between gender and work experience.
S. No. Gender WE Salary S. No. Gender WE Salary
1 1 2 6800 16 0 2 22100
2 1 3 8700 17 0 1 20200
3 1 1 9700 18 0 1 17700
4 1 3 9500 19 0 6 34700
5 1 4 10100 20 0 7 38600
6 1 6 9800 21 0 7 39900
7 0 2 14500 22 0 7 38300
8 0 3 19100 23 0 3 26900
9 0 4 18600 24 0 4 31800
10 0 2 14200 25 1 5 8000
11 0 4 28000 26 1 5 8700
12 0 3 25700 27 1 3 6200
13 0 1 20350 28 1 3 4100
14 0 4 30400 29 1 2 5000
15 0 1 19400 30 1 1 4800 33
SOLUTION
Let the regression model be:
Y = β0 + β1 × Gender + β2 × WE + β3 × Gender × WE
 The output for the regression model including interaction variable is given in
Table
Model        Unstandardized Coefficients    Standardized Coefficients    t        Sig.
             B             Std. Error       Beta
(Constant)   13443.895     1539.893                                      8.730    0.000
Gender       −7757.751     2717.884         −0.348                       −2.854   0.008
WE           3523.547      383.643          0.603                        9.184    0.000
Gender*WE    −2913.908     744.214          −0.487                       −3.915   0.001
34
SOLUTION
The regression equation is given by
Y = 13443.895 − 7757.751 Gender + 3523.547 WE − 2913.908 Gender × WE
The equation can be written as:
 For Female (Gender = 1):
Y = 13443.895 − 7757.751 + (3523.547 − 2913.908) WE
 For Male (Gender = 0):
Y = 13443.895 + 3523.547 WE
That is, the change in salary for a female worker when WE increases by one year is 609.639, while for a male worker it is 3523.547. In other words, salary for male workers increases at a higher rate than for female workers. Interaction variables are an important class of derived variables in regression model building.

35
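A sketch of this interaction model using the statsmodels formula interface; wf is an assumed DataFrame with the 'Salary', 'Gender' (0/1), and 'WE' columns from the table.

```python
import statsmodels.formula.api as smf

# Gender:WE is the interaction term (product of the two predictors)
model = smf.ols('Salary ~ Gender + WE + Gender:WE', data=wf).fit()
print(model.params)   # intercept, Gender, WE, and Gender:WE coefficients
```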
VALIDATION OF MULTIPLE REGRESSION MODEL

The following measures and tests are carried out to validate a multiple linear
regression model:

 Coefficient of multiple determination (R-Square) and Adjusted R-Square, which


can be used to judge the overall fitness of the model.

 t-test to check the existence of statistically significant relationship between the


response variable and individual explanatory variable at a given significance
level (α) or at (1 − α) 100% confidence level.

36
VALIDATION OF MULTIPLE REGRESSION MODEL
 F-test to check the statistical significance of the overall model at a given
significance level (α) or at (1 − α) 100% confidence level.

 Conduct a residual analysis to check whether the normality, homoscedasticity


assumptions have been satisfied. Also, check for any pattern in the residual plots to
check for correct model specification.

 Check for presence of multi-collinearity (strong correlation between independent


variables) that can destabilize the regression model.

 Check for auto-correlation in case of time-series data.


37
CO-EFFICIENT OF MULTIPLE DETERMINATION (R-SQUARE) AND
ADJUSTED R-SQUARE

 As in the case of simple linear regression, R-square measures the proportion of


variation in the dependent variable explained by the model. The co-efficient of
multiple determination (R-Square or R2) is given by

R² = SSR / SST = 1 − SSE / SST = 1 − Σ(Yi − Ŷi)² / Σ(Yi − Ȳ)²

where the sums run over i = 1 to n.

38
Problems with R-squared statistic
 The R-squared statistic isn’t perfect. In fact, it suffers from a major flaw. Its value
never decreases no matter the number of variables we add to our regression
model.
 That is, even if we are adding redundant variables to the data, the value of R-
squared does not decrease. It either remains the same or increases with the
addition of new independent variables.
 This clearly does not make sense because some of the independent variables
might not be useful in determining the target variable. Adjusted R-squared deals
with this issue.

39
CO-EFFICIENT OF MULTIPLE DETERMINATION (R-SQUARE) AND
ADJUSTED R-SQUARE
 SSE is the sum of squares of errors and SST is the sum of squares of total
deviation. In case of MLR, SSE will decrease as the number of explanatory
variables increases, and SST remains constant.
 So, it is possible, that R-square will increase even when there is no
statistically significant relationship between the explanatory variable and
the response variable.
 To counter this, R2 value is adjusted by normalizing both SSE and SST with the
corresponding degrees of freedom. The adjusted R-square is given by

Adjusted R-Square = 1 − [SSE / (n − k − 1)] / [SST / (n − 1)]
40
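A small sketch of this formula in code; the example call reproduces the adjusted R-square reported for the CTRP/promotion model.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-square = 1 - (SSE/(n-k-1)) / (SST/(n-1)), expressed via R-square."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.832, n=38, k=2))   # ~0.822, matching the example output
```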
CO-EFFICIENT OF MULTIPLE DETERMINATION (R-SQUARE) AND
ADJUSTED R-SQUARE

 While R-square is a non-decreasing function, adjusted R-square is not.


 So, when a new variable is added, it is worth checking whether the adjusted R² increases along with R².
 R-Square and adjusted R-square values for Example are 0.832 and 0.822, respectively.
 The adjusted R-square value is always less than or equal to the R-square
value.
 No increase in adjusted R-square after adding a new predictor variable may indicate that the newly added variable is not statistically significant, or that it does not explain any variation in the response variable beyond what is already explained by the variables present in the model.

41
STATISTICAL SIGNIFICANCE OF INDIVIDUAL VARIABLES IN
MLR – T-TEST
 Checking the statistical significance of individual variables is achieved through t-
test. Note that the estimate of regression coefficient is given by Eq:


β̂ = (XᵀX)⁻¹XᵀY

 This means the estimated value of regression coefficient is a linear function of the
response variable. Since we assume that the residuals follow normal distribution,
Y follows a normal distribution and the estimate of regression coefficient also
follows a normal distribution. Since the standard deviation of the regression
coefficient is estimated from the sample, we use a t-test.
42
STATISTICAL SIGNIFICANCE OF INDIVIDUAL VARIABLES IN
MLR – T-TEST
The null and alternative hypotheses in the case of individual independent variable and the
dependent variable Y is given, respectively, by
 H0: There is no relationship between independent variable Xi and dependent variable Y

 HA: There is a relationship between independent variable Xi and dependent variable Y

Alternatively,
 H0: β i = 0

 HA: βi ≠ 0

The corresponding test statistic is given by

t = (β̂i − 0) / Se(β̂i) = β̂i / Se(β̂i)
43
VALIDATION OF OVERALL REGRESSION MODEL – F-TEST

Analysis of Variance (ANOVA) is used to validate the overall regression model. If


there are k independent variables in the model, then the null and the alternative
hypotheses are, respectively, given by
H0: β 1 = β 2 = β 3 = … = β k = 0
H1: Not all β’s are zero.

44
VALIDATION OF OVERALL REGRESSION MODEL – F-TEST

The null and alternative hypothesis for F-test is given by


H0: There is no statistically significant relationship between Y and any of the
explanatory variables (i.e., all regression coefficients are zero).
H1: Not all regression coefficients are zero
 Alternatively:
H0: β 1 = β 2 = β 3 = … = β k = 0
H1: Not all β’s are zero.

 The F-statistic is given by

F = Regression mean square (MSR) / Mean square error (MSE) = [SSR / k] / [SSE / (n − k − 1)]
45
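A sketch of the overall F-statistic computed from the ANOVA decomposition, assuming model is the statsmodels fit from the earlier CTRP/promotion example.

```python
from scipy import stats

k = 2                     # number of explanatory variables
n = int(model.nobs)

# In statsmodels, ess is the explained (regression) sum of squares and ssr is the
# sum of squared residuals, so F = (SSR_reg/k) / (SSE/(n-k-1)) becomes:
F = (model.ess / k) / (model.ssr / (n - k - 1))
p_value = stats.f.sf(F, k, n - k - 1)
print(F, p_value)         # should agree with model.fvalue and model.f_pvalue
```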
VALIDATION OF PORTIONS OF A MLR MODEL – PARTIAL F-
TEST
The objective of the partial F-test is to check whether the additional variables (Xr+1, Xr+2, …, Xk) in the full model are statistically significant.
The corresponding partial F-test has the following null and alternative
hypotheses:
 H0: βr+1 = βr+2 = … = βk = 0
 H1: Not all βr+1, βr+2, …, βk are zero
 The partial F-test statistic is given by

Partial F = [(SSER − SSEF) / (k − r)] / MSEF
VALIDATION OF PORTIONS OF A MLR MODEL – PARTIAL F-
TEST

Partial F = [(SSER − SSEF) / (k − r)] / MSEF

 SSER is the sum of squared errors of the reduced model
 SSEF is the sum of squared errors of the full model
 MSEF is the mean squared error of the full model, and (k − r) is the difference in the number of variables between the full model and the reduced model.
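A small helper sketching the partial F computation; reduced and full are assumed to be statsmodels OLS fits of the reduced (r predictors) and full (k predictors) models on the same data.

```python
def partial_f(sse_reduced: float, sse_full: float, k: int, r: int, n: int) -> float:
    """Partial F = ((SSE_R - SSE_F) / (k - r)) / MSE_F, with MSE_F = SSE_F / (n - k - 1)."""
    mse_full = sse_full / (n - k - 1)
    return ((sse_reduced - sse_full) / (k - r)) / mse_full

# Example usage (hypothetical model sizes):
# F = partial_f(reduced.ssr, full.ssr, k=5, r=3, n=int(full.nobs))
```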
RESIDUAL ANALYSIS IN MULTIPLE LINEAR REGRESSION
 Residual analysis is important for checking assumptions about normal distribution of residuals,
homoscedasticity, and the functional form of a regression model.
 If the residuals do not follow normal distribution, then we cannot trust the p-values of t-test
and F-test since for the statistic to follow t-distribution and F-distribution, the residuals should
follow normal or approximate normal distribution.
 There are many reasons why residuals may not be normal; one such case is misspecification of
functional form of regression, that is, we may have used linear model instead of log-linear or
log-log model.

48
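A sketch of the residual checks described above, assuming model is the statsmodels fit from the earlier example.

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

resid = model.resid

# Q-Q plot of residuals against the normal distribution (normality check)
stats.probplot(resid, dist="norm", plot=plt)

# Residuals vs fitted values: funnel shapes suggest heteroscedasticity,
# systematic curvature suggests a misspecified functional form
plt.figure()
plt.scatter(model.fittedvalues, resid)
plt.axhline(0, color='grey')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
```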
RESIDUAL ANALYSIS IN MULTIPLE LINEAR REGRESSION

49
MULTI-COLLINEARITY AND VARIANCE INFLATION FACTOR
Multi-collinearity can have the following impact on the model:
 The standard error of the estimate of a regression coefficient may be inflated, which may lead to a failure to reject the null hypothesis in the t-test, that is, to dropping a statistically significant explanatory variable.
 The t-statistic value is β̂ / Se(β̂).
 If Se(β̂) is inflated, then the t-value will be underestimated, resulting in a high p-value that may lead to failing to reject the null hypothesis.
 Thus, it is possible that a statistically significant explanatory variable may
be labelled as statistically insignificant due to the presence of multi-
collinearity.
50
IMPACT OF MULTICOLLINEARITY

 The sign of the regression coefficient may be different, that is, instead of negative
value for regression coefficient, we may have a positive regression
coefficient and vice versa.
 Adding/removing a variable or even an observation may result in large
variation in regression coefficient estimates.

51
MULTICOLLINEARITY: EXAMPLE

52
VARIANCE INFLATION FACTOR (VIF)

 Variance inflation factor (VIF) measures the magnitude of multi-collinearity. Let


us consider a regression model with two explanatory variables defined as follows:

Y = β0 + β1X1 + β2X2

 To find whether there is multi-collinearity, we develop a regression model between


the two explanatory variables as follows:

X1 = α0 + α1X2

53
VARIANCE INFLATION FACTOR (VIF)

 Variance inflation factor (VIF) is then given by:

VIF = 1 / (1 − R²₁₂)

 The value 1 − R²₁₂ is called the tolerance (R²₁₂ is the R-square value for the regression model X1 = α0 + α1X2).
 √VIF is the factor by which the t-statistic is deflated. So, the actual t-value is given by

t_actual = [β̂1 / Se(β̂1)] × √VIF

54
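A sketch of how VIF could be computed with statsmodels, assuming X is a design matrix with a constant column such as the one built in the earlier example.

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X is an assumed DataFrame design matrix, e.g. sm.add_constant(df[['CTRP', 'P']])
for i, name in enumerate(X.columns):
    if name == 'const':
        continue   # VIF of the intercept column is not meaningful
    print(name, variance_inflation_factor(X.values, i))
```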
VARIANCE INFLATION FACTOR (VIF)
 There will be some correlation between explanatory variables in almost all cases, thus
the value of VIF is likely to be more than one.
 The threshold value for VIF is 4. VIF value of greater than 4 requires further
investigation to assess the impact of multi-collinearity.
 Before building the multiple regression models, it is advised to check the correlation
between different explanatory variables for potential multi-collinearity.
 A VIF value equal to 4 implies that the t-statistic value is deflated by a factor of 2, and thus there will be a significant increase in the corresponding p-value.
 The serious impact of multi-collinearity is that it can change the sign of the
regression coefficient (for example, instead of positive, the model may have negative
regression coefficient for a predictor and vice versa).
55
REMEDIES FOR HANDLING MULTI-COLLINEARITY

 Remove one of the variables from the model. For example, you may remove a variable that is either difficult or expensive to collect.
 Another approach suggested by researchers is to use centered variables, that is, use (Xi − X̄i) instead of Xi.
 When there are many variables in the data, we can use Principal Component Analysis (PCA) to avoid multi-collinearity.
 PCA creates orthogonal components and thus removes potential multi-collinearity. In recent years, authors have used advanced regression models such as Ridge regression and LASSO regression to handle multi-collinearity.

56
RIDGE AND LASSO REGRESSION
 They work by penalizing the magnitude of coefficients of features and minimizing the error
between predicted and actual observations. These are called ‘regularization’ techniques.
 Ridge Regression:

1. Performs L2 regularization, i.e., adds penalty equivalent to the square of the magnitude of
coefficients
2. Minimization objective = LS Obj + α * (sum of square of coefficients)

 Lasso Regression:

1. Performs L1 regularization, i.e., adds penalty equivalent to the absolute value of the magnitude
of coefficients
2. Minimization objective = LS Obj + α * (sum of the absolute value of coefficients)

 Here, LS Obj refers to the ‘least squares objective,’ i.e., the linear regression objective without
regularization.
57
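A sketch of ridge and lasso fits with scikit-learn; X_num and y are assumed arrays (or DataFrames) of numeric predictors and the response, and alpha plays the role of the penalty weight α above.

```python
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Standardize predictors before penalizing, so the penalty treats all coefficients equally
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_num, y)
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X_num, y)

print(ridge.named_steps['ridge'].coef_)
print(lasso.named_steps['lasso'].coef_)   # L1 penalty may shrink some coefficients to exactly 0
```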
 While time series data is data collected over time, there are different types of
data that describe how and when that time data was recorded. For example:
 Time series data is data that is recorded over consistent intervals of time.
 Cross-sectional data consists of several variables recorded at the same
time.
 Pooled data is a combination of both time series data and cross-sectional
data.

58
TIME SERIES DATA
 Example: Retail Sales

59
AUTO-CORRELATION

 Autocorrelation is the correlation of the data with itself. So, instead of measuring the correlation between two random variables, we are measuring the correlation of a random variable with itself, hence the name auto-correlation.
 For time-series, the autocorrelation is the correlation of that time series at two
different points in time (also known as lags).
 It is the correlation between successive error terms in a time-series data.
Consider a time-series model as defined below:

Yt = β0 + β1Xt + εt
60
AUTO-CORRELATION

 What is ‘k’th lag?


A time series (y) with ‘k’th lag is its version that is ‘t-k’ periods behind in time.
A time series with lag (k=1) is a version of the original time series that is 1 period
behind in time, i.e. y(t-1).

Autocorrelation at lag k = Σ (yt − ȳ)(yt−k − ȳ) / Σ (yt − ȳ)²

61
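A minimal sketch of the lag-k sample autocorrelation, assuming y is a sequence of time-series observations.

```python
import numpy as np

def autocorr(y, k: int) -> float:
    """Sample autocorrelation at lag k (k >= 1)."""
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    num = np.sum((y[k:] - y_bar) * (y[:-k] - y_bar))   # covariance-type term at lag k
    den = np.sum((y - y_bar) ** 2)                     # total variation
    return num / den

# Example: autocorr(sales_series, k=1) gives the lag-1 autocorrelation.
```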
DURBIN-WATSON TEST FOR AUTO-CORRELATION

 Durbin−Watson is a hypothesis test to check the existence of auto-correlation.


 Let ρ be the correlation between error terms (εt, εt−1). The null and alternative
hypotheses are stated below:
H0: ρ = 0
H1: ρ ≠ 0

 The Durbin−Watson statistic, D, for correlation between errors of one lag is given by

D = Σ(i = 2 to n) (ei − ei−1)² / Σ(i = 1 to n) ei² ≅ 2 [1 − Σ(i = 2 to n) ei ei−1 / Σ(i = 1 to n) ei²]

62
DURBIN-WATSON TEST FOR AUTO-CORRELATION

The Durbin−Watson test has two critical values, DL and DU. The inference of the
test can be made based on the following conditions:
If D < DL, then the errors are positively correlated.
If D > DU, then there is no evidence for positive auto-correlation.
If DL < D < DU, the Durbin−Watson test is inconclusive.
If (4 − D) < DL, then the errors are negatively correlated.
If (4 − D) > DU, there is no evidence for negative auto-correlation.
If DL < (4 − D) < DU, the test is inconclusive.

63
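A sketch of the Durbin-Watson statistic computed from fitted residuals, assuming model is a statsmodels OLS fit on time-series data.

```python
from statsmodels.stats.stattools import durbin_watson

# D is computed from the residuals of the fitted model
d = durbin_watson(model.resid)
print(d)   # values near 2 suggest no first-order autocorrelation
```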
RESIDUAL PLOT

64
