Multiple Regression


Basic Business Statistics

10th Edition
Chapter 14
Introduction to Multiple Regression


Learning Objectives
In this chapter, you learn:

- How to develop a multiple regression model
- How to interpret the regression coefficients
- How to determine which independent variables to include in the regression model
- How to determine which independent variables are more important in predicting a dependent variable
- How to use categorical variables in a regression model
- How to predict a categorical dependent variable using logistic regression

The Multiple Regression Model

Idea: examine the linear relationship between one dependent variable (Y) and two or more independent variables (Xj).

Multiple regression model with k independent variables:

  Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

where β0 is the Y-intercept, β1, …, βk are the population slopes, and εi is the random error.

Multiple Regression Equation


The coefficients of the multiple regression model are estimated using sample data.

Multiple regression equation with k independent variables:

  Ŷi = b0 + b1X1i + b2X2i + … + bkXki

where Ŷi is the estimated (or predicted) value of Y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients.
In this chapter we use software (Minitab) to obtain the regression slope coefficients and other regression summary measures.

Multiple Regression Equation (continued)

Two-variable model:

  Ŷ = b0 + b1X1 + b2X2

[Figure: the fitted regression surface is a plane above the (X1, X2) plane; b1 is the slope for variable X1 and b2 is the slope for variable X2.]

Example: Two Independent Variables

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand.

- Dependent variable: Pie sales (units per week)
- Independent variables: Price (in $), Advertising (in $100s)

Data are collected for 15 weeks.

Pie Sales Example

  Week   Pie Sales   Price ($)   Advertising ($100s)
    1       350        5.50            3.3
    2       460        7.50            3.3
    3       350        8.00            3.0
    4       430        8.00            4.5
    5       350        6.80            3.0
    6       380        7.50            4.0
    7       430        4.50            3.0
    8       470        6.40            3.7
    9       450        7.00            3.5
   10       490        5.00            4.0
   11       340        7.20            3.5
   12       300        7.90            3.2
   13       440        5.90            4.0
   14       450        5.00            3.5
   15       300        7.00            2.7

Multiple regression equation:

  Sales = b0 + b1(Price) + b2(Advertising)

Using Minitab

Stat | Regression | Regression

Enter the appropriate variables for Response and Predictors.

Minitab Multiple Regression Output

Regression Analysis: Pie Sales versus Price, Advertising

The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor       Coef  SE Coef      T      P
Constant       306.5    114.3   2.68  0.020
Price         -24.98    10.83  -2.31  0.040
Advertising    74.13    25.97   2.85  0.014

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

Analysis of Variance

Source          DF     SS     MS     F      P
Regression       2  29460  14730  6.54  0.012
Residual Error  12  27033   2253
Total           14  56493

Source       DF  Seq SS
Price         1   11100
Advertising   1   18360

The Multiple Regression Equation

  Sales = 306.5 - 24.98(Price) + 74.13(Advertising)

where Sales is in number of pies per week, Price is in $, and Advertising is in $100s.

- b1 = -24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising.
- b2 = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, net of the effects of changes due to price.
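The slides obtain these coefficients from Minitab; as a rough cross-check, the same fit can be reproduced in Python with statsmodels. A minimal sketch, assuming only the 15-week pie sales data tabulated above:

    import numpy as np
    import statsmodels.api as sm

    # Pie sales data from the 15-week table above
    sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300])
    price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
    advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

    # Design matrix with an intercept column
    X = sm.add_constant(np.column_stack([price, advertising]))
    model = sm.OLS(sales, X).fit()

    print(model.params)      # expected to be close to [306.5, -24.98, 74.13]
    print(model.rsquared)    # expected to be close to 0.521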

Using the Equation to Make Predictions

Predict sales for a week in which the selling price is $5.50 and advertising is $350:

  Sales = 306.5 - 24.98(Price) + 74.13(Advertising)
        = 306.5 - 24.98(5.50) + 74.13(3.5)
        = 428.6

Predicted sales is 428.6 pies.

Note that Advertising is in $100s, so $350 means that X2 = 3.5.

Predictions in Minitab

Select Stat | Regression | Regression, then click on the Options box.

- Enter the desired value for each independent variable.
- Check the Confidence limits and Prediction limits boxes.

Predictions in Minitab (continued)

Minitab output:

Predicted Values for New Observations

New Obs    Fit  SE Fit          95% CI          95% PI
      1  428.6    17.2  (391.1, 466.1)  (318.6, 538.6)

Values of Predictors for New Observations

New Obs  Price  Advertising
      1   5.50         3.50

- Fit is the predicted Y value for the input values Price = 5.50, Advertising = 3.50.
- The 95% CI is a confidence interval for the mean Y value, given these Xs.
- The 95% PI is a prediction interval for an individual Y value, given these Xs.
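The same interval estimates can be obtained in Python with statsmodels. A sketch, reusing the fitted `model` from the earlier snippet:

    # New observation: constant term, Price = 5.50, Advertising = 3.5
    new_x = [[1.0, 5.50, 3.5]]

    pred = model.get_prediction(new_x)
    print(pred.summary_frame(alpha=0.05))
    # 'mean_ci_lower'/'mean_ci_upper' give the 95% CI for the mean response;
    # 'obs_ci_lower'/'obs_ci_upper' give the 95% PI for an individual response.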

Coefficient of Multiple Determination

Reports the proportion of total variation in Y explained by all X variables taken together:

  r² = SSR / SST = regression sum of squares / total sum of squares

Coefficient of Multiple Determination (continued)

From the Minitab output shown earlier (Regression Analysis: Pie Sales versus Price, Advertising):

  R-Sq = 52.1%   (SSR = 29460, SST = 56493)

  r² = SSR / SST = 29460 / 56493 = .521

52.1% of the variation in pie sales is explained by the variation in price and advertising.

Adjusted r²

- r² never decreases when a new X variable is added to the model.
  - This can be a disadvantage when comparing models.
- What is the net effect of adding a new variable?
  - We lose a degree of freedom when a new X variable is added.
  - Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?

Adjusted r² (continued)

Shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:

  r²adj = 1 - (1 - r²) · (n - 1) / (n - k - 1)

(where n = sample size, k = number of independent variables)

- Penalizes excessive use of unimportant independent variables.
- Smaller than r².
- Useful in comparing models.
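A quick numerical check of this formula for the pie sales model, using the rounded values from the output above (a sketch; small rounding differences from Minitab are expected):

    def adjusted_r2(r2: float, n: int, k: int) -> float:
        """Adjusted r-squared: penalizes r2 for the number of predictors k."""
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(r2=0.521, n=15, k=2))  # roughly 0.44, matching R-Sq(adj) = 44.2%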

Adjusted r² (continued)

From the Minitab output shown earlier:

  R-Sq(adj) = 44.2%,  so  r²adj = .442

44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables.

Is the Model Significant?

F Test for Overall Significance of the Model

- Shows if there is a linear relationship between all of the X variables considered together and Y.
- Use the F-test statistic.
- Hypotheses:
    H0: β1 = β2 = … = βk = 0  (no linear relationship)
    H1: at least one βj ≠ 0   (at least one independent variable affects Y)

F Test for Overall Significance

Test statistic:

  F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))

where F has k numerator degrees of freedom and (n - k - 1) denominator degrees of freedom.

F Test for Overall Significance (continued)

From the Analysis of Variance portion of the Minitab output shown earlier:

  Source          DF     SS     MS     F      P
  Regression       2  29460  14730  6.54  0.012
  Residual Error  12  27033   2253
  Total           14  56493

  F = MSR / MSE = 14730 / 2253 = 6.54

with 2 and 12 degrees of freedom; the P-value for the F test is 0.012.

F Test for Overall Significance (continued)

  H0: β1 = β2 = 0
  H1: β1 and β2 not both zero
  α = .05,  df1 = 2,  df2 = 12

Critical value: F.05 = 3.885

Test statistic: F = MSR / MSE = 6.54

Decision: since the F test statistic falls in the rejection region (6.54 > 3.885, p-value 0.012 < .05), reject H0.

Conclusion: there is evidence that at least one independent variable affects Y.
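The critical value and p-value for this F test can be verified in Python with scipy (a sketch):

    from scipy import stats

    f_stat, df1, df2 = 6.54, 2, 12

    f_crit = stats.f.ppf(0.95, df1, df2)    # upper-tail critical value at alpha = .05
    p_value = stats.f.sf(f_stat, df1, df2)  # upper-tail p-value

    print(f_crit)   # about 3.885
    print(p_value)  # about 0.012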

Are Individual Variables Significant?

- Use t tests of the individual variable slopes.
- Shows if there is a linear relationship between the variable Xj and Y.
- Hypotheses:
    H0: βj = 0  (no linear relationship)
    H1: βj ≠ 0  (linear relationship does exist between Xj and Y)

Are Individual Variables Significant? (continued)

  H0: βj = 0  (no linear relationship)
  H1: βj ≠ 0  (linear relationship does exist between Xj and Y)

Test statistic:

  t = (bj - 0) / Sbj        (df = n - k - 1)

Are Individual Variables Significant? (continued)

From the coefficient table of the Minitab output shown earlier:

  Predictor       Coef  SE Coef      T      P
  Price         -24.98    10.83  -2.31  0.040
  Advertising    74.13    25.97   2.85  0.014

- The t-value for Price is t = -2.31, with p-value 0.040.
- The t-value for Advertising is t = 2.85, with p-value 0.014.

Inferences about the Slope: t Test Example

  H0: βj = 0
  H1: βj ≠ 0

From the Minitab output:

  Predictor       Coef  SE Coef      T      P
  Price         -24.98    10.83  -2.31  0.040
  Advertising    74.13    25.97   2.85  0.014

  d.f. = 15 - 2 - 1 = 12,  α = .05,  tα/2 = 2.1788

Decision: the test statistic for each variable falls in the rejection region (|t| > 2.1788; both p-values < .05), so reject H0 for each variable.

Conclusion: there is evidence that both Price and Advertising affect pie sales at α = .05.
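The critical value and two-tail p-values can be checked in Python (a sketch, using the rounded t statistics from the output):

    from scipy import stats

    df = 15 - 2 - 1                          # n - k - 1 = 12
    t_crit = stats.t.ppf(1 - 0.05 / 2, df)
    print(t_crit)                            # about 2.1788

    for name, t_stat in [("Price", -2.31), ("Advertising", 2.85)]:
        p = 2 * stats.t.sf(abs(t_stat), df)  # two-tail p-value
        print(name, round(p, 3))             # roughly 0.040 and 0.015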

Confidence Interval Estimate for the Slope

Confidence interval for the population slope βj:

  bj ± t(n-k-1) · Sbj        where t has (n - k - 1) d.f.

  Predictor       Coef  SE Coef
  Price         -24.98    10.83
  Advertising    74.13    25.97

  d.f. = 15 - 2 - 1 = 12,  α = .05,  tα/2 = 2.1788

Example: form a 95% confidence interval for the effect of changes in price (X1) on pie sales:

  -24.98 ± (2.1788)(10.83)

So the interval is (-48.576, -1.374).

(This interval does not contain zero, so price has a significant effect on sales.)

Confidence Interval Estimate for the Slope (continued)

Confidence interval for the population slope βj.

Example: form a 95% confidence interval for the effect of changes in price (X1) on pie sales:

  -24.98 ± (2.1788)(10.83)

So the interval is (-48.576, -1.374).

Interpretation: weekly sales are estimated to be reduced by between 1.37 and 48.58 pies for each $1 increase in the selling price.
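statsmodels reports the same intervals directly (a sketch, reusing the fitted `model` from the earlier snippet):

    # 95% confidence intervals for the intercept and each slope
    print(model.conf_int(alpha=0.05))
    # The Price row should be close to (-48.58, -1.37)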

Assumptions of Regression
Use the acronym LINE:

- Linearity: the underlying relationship between X and Y is linear.
- Independence of Errors: error values are statistically independent.
- Normality of Error: error values (ε) are normally distributed for any given value of X.
- Equal Variance (Homoscedasticity): the probability distribution of the errors has constant variance.

Residual Analysis

The residual for observation i, ei, is the difference between its observed and predicted value:

  ei = Yi - Ŷi

Check the assumptions of regression by examining the residuals:

- Examine for the linearity assumption.
- Evaluate the independence assumption.
- Evaluate the normal distribution assumption.
- Examine for constant variance for all levels of X (homoscedasticity).

Graphical Analysis of Residuals

Can plot residuals vs. X.

Residual Analysis for Linearity

[Figure: residual plots vs. X illustrating a "Not Linear" pattern and a "Linear" pattern.]

Residual Analysis for Independence

[Figure: residual plots illustrating "Not Independent" and "Independent" error patterns.]

Residual Analysis for Normality

A normal probability plot of the residuals can be used to check for normality.

[Figure: normal probability plot of the residuals (Percent vs. Residual).]

Residual Analysis for Equal Variance

[Figure: residual plots vs. X illustrating non-constant variance and constant variance.]

Examples of Minitab Plots

[Figure: Minitab residual plots.]
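The slides produce these diagnostic plots in Minitab; a rough equivalent in Python with matplotlib is sketched below, reusing `model` from the earlier snippet:

    import matplotlib.pyplot as plt
    import scipy.stats as stats

    residuals = model.resid          # observed minus predicted values
    fitted = model.fittedvalues

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))

    # Residuals vs. fitted values: check linearity and constant variance
    axes[0].scatter(fitted, residuals)
    axes[0].axhline(0, color="gray")
    axes[0].set_title("Residuals vs. fitted")

    # Residuals in week order: check independence
    axes[1].plot(range(1, len(residuals) + 1), residuals, marker="o")
    axes[1].axhline(0, color="gray")
    axes[1].set_title("Residuals by week")

    # Normal probability plot: check normality
    stats.probplot(residuals, plot=axes[2])
    axes[2].set_title("Normal probability plot")

    plt.tight_layout()
    plt.show()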

Measuring Autocorrelation: The Durbin-Watson Statistic

- Used when data are collected over time to detect whether autocorrelation is present.
- Autocorrelation exists if residuals in one time period are related to residuals in another period.

Autocorrelation

Autocorrelation is correlation of the errors (residuals) over time.

[Figure: residuals plotted over time show a cyclic pattern, not random scatter; cyclical patterns are a sign of positive autocorrelation.]

Autocorrelation violates the regression assumption that residuals are random and independent.

The Durbin-Watson Statistic

The Durbin-Watson statistic is used to test for autocorrelation:

  H0: residuals are not correlated
  H1: positive autocorrelation is present

  D = Σ(i=2 to n) (ei - e(i-1))² / Σ(i=1 to n) ei²

- The possible range is 0 ≤ D ≤ 4.
- D should be close to 2 if H0 is true.
- D less than 2 may signal positive autocorrelation; D greater than 2 may signal negative autocorrelation.
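The statistic is easy to compute from the residuals; a sketch in Python (statsmodels also provides `durbin_watson` directly). Here it is applied to the pie sales residuals only to illustrate the computation; the slide's own example uses a different 25-week time series:

    import numpy as np
    from statsmodels.stats.stattools import durbin_watson

    def durbin_watson_manual(e: np.ndarray) -> float:
        """D = sum of squared successive differences / sum of squared residuals."""
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    e = np.asarray(model.resid)                        # residuals from the fitted model
    print(durbin_watson_manual(e), durbin_watson(e))   # both give the same value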

Testing for Positive Autocorrelation

  H0: positive autocorrelation does not exist
  H1: positive autocorrelation is present

- Calculate the Durbin-Watson test statistic D (it can be found using Excel or Minitab).
- Find the values dL and dU from the Durbin-Watson table (for sample size n and number of independent variables k).
- Decision rule: reject H0 if D < dL; do not reject H0 if D > dU; the test is inconclusive if dL ≤ D ≤ dU.

Testing for Positive Autocorrelation (continued)

Suppose we have the following time series data. Is there autocorrelation?

[Figure: time series data and Minitab regression output.]

Example with n = 25:

  Durbin-Watson Calculations
  Sum of squared differences of residuals   3296.18
  Sum of squared residuals                  3279.98
  Durbin-Watson statistic                   1.00494

  D = Σ(i=2 to n) (ei - e(i-1))² / Σ(i=1 to n) ei² = 3296.18 / 3279.98 = 1.00494

Testing for Positive Autocorrelation (continued)

- Here, n = 25 and there is k = 1 independent variable.
- Using the Durbin-Watson table, dL = 1.29 and dU = 1.45.
- D = 1.00494 < dL = 1.29, so reject H0 and conclude that significant positive autocorrelation exists.
- Therefore the linear model is not the appropriate model to forecast sales.

Decision: reject H0 since D = 1.00494 < dL = 1.29.

Influence Analysis

Several methods can be used to measure the influence of individual observations:

- The hat matrix elements hi
- The Studentized deleted residuals ti
- Cook's distance statistic Di

The Hat Matrix Elements hi

- The hat matrix diagonal element for observation i, denoted hi, reflects the possible influence of Xi on the regression equation.
- If potentially influential observations are present, you may need to reevaluate the need to keep them in the model.
- Rule: if hi > 2(k + 1)/n, then Xi is an influential observation and is a candidate for removal from the model.

The Studentized Deleted Residuals ti

- The Studentized deleted residual measures the difference of each Yi from the value predicted by a model that includes all observations except observation i.
- It is expressed as a t statistic.

The Studentized Deleted Residuals ti (continued)

The Studentized deleted residual:

  ti = ei · sqrt[ (n - k - 1) / (SSE·(1 - hi) - ei²) ]

where
  ei  = the residual for observation i
  k   = number of independent variables
  SSE = error sum of squares
  hi  = hat matrix diagonal element for observation i

If ti > t(n-k-2) or ti < -t(n-k-2) (using a two-tail test at α = 0.10), then observation i is highly influential and a candidate for removal.

Cook's Distance Statistic Di

- Cook's distance statistic Di is based on both hi and the Studentized residual.
- It is used to decide whether an observation flagged by either the hi or ti criterion is unduly affecting the model.

Cook's Distance Statistic Di (continued)

Cook's Di statistic:

  Di = [ ei² / (k·MSE) ] · [ hi / (1 - hi)² ]

where
  ei  = the residual for observation i
  k   = number of independent variables
  MSE = mean square error
  hi  = hat matrix diagonal element for observation i

If Di > F(k+1, n-k-1) at α = 0.05, then the observation is highly influential on the regression equation and is a candidate for removal.
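statsmodels collects all three influence measures in one object; a sketch, reusing the fitted `model` from the earlier snippet (note that its internal formulas may differ slightly from the textbook versions above):

    influence = model.get_influence()

    hat_diag = influence.hat_matrix_diag                  # hat matrix diagonal elements h_i
    student_del = influence.resid_studentized_external    # Studentized deleted residuals t_i
    cooks_d, _ = influence.cooks_distance                 # Cook's distance D_i (and p-values)

    n, k = 15, 2
    flagged = hat_diag > 2 * (k + 1) / n                  # rule of thumb from the slide above
    print(list(zip(range(1, n + 1), flagged)))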

Collinearity

- Collinearity: high correlation exists among two or more independent variables.
- This means the correlated variables contribute redundant information to the multiple regression model.

Collinearity (continued)

Including two highly correlated independent variables can adversely affect the regression results:

- No new information is provided.
- It can lead to unstable coefficients (large standard errors and low t-values).
- Coefficient signs may not match prior expectations.

Some Indications of Strong Collinearity

- Incorrect signs on the coefficients.
- A large change in the value of a previous coefficient when a new variable is added to the model.
- A previously significant variable becomes non-significant when a new independent variable is added.
- The estimate of the standard deviation of the model increases when a variable is added to the model.

Detecting Collinearity (Variance Inflationary Factor)

VIFj is used to measure collinearity:

  VIFj = 1 / (1 - R²j)

where R²j is the coefficient of determination of variable Xj with all other X variables.

If VIFj > 5, Xj is highly correlated with the other independent variables.
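A sketch of the same check in Python, using statsmodels' variance_inflation_factor on the pie sales design matrix `X` built in the first snippet:

    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # X was built earlier as [constant, price, advertising]
    for j, name in enumerate(["Price", "Advertising"], start=1):
        vif = variance_inflation_factor(X, j)   # skip column 0 (the constant)
        print(name, round(vif, 2))              # both should be near 1.0, well below 5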

Example: Pie Sales

Recall the pie sales data (15 weeks of Pie Sales, Price, and Advertising, shown in the table earlier) and the multiple regression equation developed earlier:

  Sales = b0 + b1(Price) + b2(Advertising)

Detecting Collinearity in Minitab

Stat | Regression | Regression
Click on the Options box, then select the Variance inflationary factors box.

Output for the pie sales example:

Regression Analysis: Pie Sales versus Price, Advertising

The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor       Coef  SE Coef      T      P  VIF
Constant       306.5    114.3   2.68  0.020
Price         -24.98    10.83  -2.31  0.040  1.0
Advertising    74.13    25.97   2.85  0.014  1.0

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

Both VIFs are < 5: there is no evidence of collinearity between Price and Advertising.

Chapter Summary

- Developed the multiple regression model
- Tested the significance of the multiple regression model
- Discussed adjusted r²
- Discussed using residual plots to check model assumptions
- Tested individual regression coefficients