Multiple Regression


Basic Business Statistics

10th Edition
Chapter 14
Introduction to Multiple Regression


Learning Objectives
In this chapter, you learn:

- How to develop a multiple regression model
- How to interpret the regression coefficients
- How to determine which independent variables to include in the regression model
- How to determine which independent variables are more important in predicting a dependent variable
- How to use categorical variables in a regression model
- How to predict a categorical dependent variable using logistic regression

The Multiple Regression Model

Idea: examine the linear relationship between one dependent variable (Y) and two or more independent variables (Xj).

Multiple regression model with k independent variables:

  Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

where β0 is the Y-intercept, β1, …, βk are the population slopes, and εi is the random error.

Multiple Regression Equation


The coefficients of the multiple regression model are estimated using sample data.

Multiple regression equation with k independent variables:

  Ŷi = b0 + b1X1i + b2X2i + … + bkXki

where Ŷi is the estimated (or predicted) value of Y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients.
In this chapter we use software (Minitab) to obtain the regression slope coefficients and other regression summary measures.

Multiple Regression Equation (continued)

Two-variable model:

  Ŷ = b0 + b1X1 + b2X2

[Figure: the fitted regression surface is a plane above the (X1, X2) plane; b1 is the slope for variable X1 and b2 is the slope for variable X2.]

Example: Two Independent Variables

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand.

- Dependent variable: Pie sales (units per week)
- Independent variables: Price (in $), Advertising (in $100s)

Data are collected for 15 weeks.

Pie Sales Example

  Week   Pie Sales   Price ($)   Advertising ($100s)
    1       350        5.50            3.3
    2       460        7.50            3.3
    3       350        8.00            3.0
    4       430        8.00            4.5
    5       350        6.80            3.0
    6       380        7.50            4.0
    7       430        4.50            3.0
    8       470        6.40            3.7
    9       450        7.00            3.5
   10       490        5.00            4.0
   11       340        7.20            3.5
   12       300        7.90            3.2
   13       440        5.90            4.0
   14       450        5.00            3.5
   15       300        7.00            2.7

Multiple regression equation:

  Sales = b0 + b1(Price) + b2(Advertising)

Using Minitab

Stat | Regression | Regression

Enter the appropriate variables for Response and Predictors.

Minitab Multiple Regression Output

Regression Analysis: Pie Sales versus Price, Advertising

The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor       Coef  SE Coef      T      P
Constant       306.5    114.3   2.68  0.020
Price         -24.98    10.83  -2.31  0.040
Advertising    74.13    25.97   2.85  0.014

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

Analysis of Variance

Source          DF     SS     MS     F      P
Regression       2  29460  14730  6.54  0.012
Residual Error  12  27033   2253
Total           14  56493

Source       DF  Seq SS
Price         1   11100
Advertising   1   18360

The Multiple Regression Equation

  Sales = 306.5 - 24.98(Price) + 74.13(Advertising)

where Sales is in number of pies per week, Price is in $, and Advertising is in $100s.

- b1 = -24.98: sales will decrease, on average, by 24.98 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising.
- b2 = 74.13: sales will increase, on average, by 74.13 pies per week for each $100 increase in advertising, net of the effects of changes due to price.
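The slides obtain these coefficients from Minitab; as a rough cross-check, the same fit can be reproduced in Python with statsmodels. A minimal sketch, assuming only the 15-week pie sales data tabulated above:

    import numpy as np
    import statsmodels.api as sm

    # Pie sales data from the 15-week table above
    sales = np.array([350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300])
    price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40, 7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
    advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7, 3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])

    # Design matrix with an intercept column
    X = sm.add_constant(np.column_stack([price, advertising]))
    model = sm.OLS(sales, X).fit()

    print(model.params)      # expected to be close to [306.5, -24.98, 74.13]
    print(model.rsquared)    # expected to be close to 0.521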

Using the Equation to Make Predictions

Predict sales for a week in which the selling price is $5.50 and advertising is $350:

  Sales = 306.5 - 24.98(Price) + 74.13(Advertising)
        = 306.5 - 24.98(5.50) + 74.13(3.5)
        = 428.6

Predicted sales is 428.6 pies.

Note that Advertising is in $100s, so $350 means that X2 = 3.5.

Predictions in Minitab

Select Stat | Regression | Regression, then click on the Options box.

- Enter the desired value for each independent variable.
- Check the Confidence limits and Prediction limits boxes.

Predictions in Minitab (continued)

Minitab output:

Predicted Values for New Observations

New Obs    Fit  SE Fit          95% CI          95% PI
      1  428.6    17.2  (391.1, 466.1)  (318.6, 538.6)

Values of Predictors for New Observations

New Obs  Price  Advertising
      1   5.50         3.50

- Fit is the predicted Y value for the input values Price = 5.50, Advertising = 3.50.
- The 95% CI is a confidence interval for the mean Y value, given these Xs.
- The 95% PI is a prediction interval for an individual Y value, given these Xs.
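The same interval estimates can be obtained in Python with statsmodels. A sketch, reusing the fitted `model` from the earlier snippet:

    # New observation: constant term, Price = 5.50, Advertising = 3.5
    new_x = [[1.0, 5.50, 3.5]]

    pred = model.get_prediction(new_x)
    print(pred.summary_frame(alpha=0.05))
    # 'mean_ci_lower'/'mean_ci_upper' give the 95% CI for the mean response;
    # 'obs_ci_lower'/'obs_ci_upper' give the 95% PI for an individual response.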

Coefficient of Multiple Determination

Reports the proportion of total variation in Y explained by all X variables taken together:

  r² = SSR / SST = regression sum of squares / total sum of squares

Coefficient of Multiple Determination (continued)

From the Minitab output shown earlier (Regression Analysis: Pie Sales versus Price, Advertising):

  R-Sq = 52.1%   (SSR = 29460, SST = 56493)

  r² = SSR / SST = 29460 / 56493 = .521

52.1% of the variation in pie sales is explained by the variation in price and advertising.

Adjusted r²

- r² never decreases when a new X variable is added to the model.
  - This can be a disadvantage when comparing models.
- What is the net effect of adding a new variable?
  - We lose a degree of freedom when a new X variable is added.
  - Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?

Adjusted r² (continued)

Shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:

  r²adj = 1 - (1 - r²) · (n - 1) / (n - k - 1)

(where n = sample size, k = number of independent variables)

- Penalizes excessive use of unimportant independent variables.
- Smaller than r².
- Useful in comparing models.
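A quick numerical check of this formula for the pie sales model, using the rounded values from the output above (a sketch; small rounding differences from Minitab are expected):

    def adjusted_r2(r2: float, n: int, k: int) -> float:
        """Adjusted r-squared: penalizes r2 for the number of predictors k."""
        return 1 - (1 - r2) * (n - 1) / (n - k - 1)

    print(adjusted_r2(r2=0.521, n=15, k=2))  # roughly 0.44, matching R-Sq(adj) = 44.2%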

Adjusted r² (continued)

From the Minitab output shown earlier:

  R-Sq(adj) = 44.2%,  so  r²adj = .442

44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables.

Is the Model Significant?

F Test for Overall Significance of the Model

- Shows if there is a linear relationship between all of the X variables considered together and Y.
- Use the F-test statistic.
- Hypotheses:
    H0: β1 = β2 = … = βk = 0  (no linear relationship)
    H1: at least one βj ≠ 0   (at least one independent variable affects Y)

F Test for Overall Significance

Test statistic:

  F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))

where F has k numerator degrees of freedom and (n - k - 1) denominator degrees of freedom.

F Test for Overall Significance (continued)

From the Analysis of Variance portion of the Minitab output shown earlier:

  Source          DF     SS     MS     F      P
  Regression       2  29460  14730  6.54  0.012
  Residual Error  12  27033   2253
  Total           14  56493

  F = MSR / MSE = 14730 / 2253 = 6.54

with 2 and 12 degrees of freedom; the P-value for the F test is 0.012.

F Test for Overall Significance (continued)

  H0: β1 = β2 = 0
  H1: β1 and β2 not both zero
  α = .05,  df1 = 2,  df2 = 12

Critical value: F.05 = 3.885

Test statistic: F = MSR / MSE = 6.54

Decision: since the F test statistic falls in the rejection region (6.54 > 3.885, p-value 0.012 < .05), reject H0.

Conclusion: there is evidence that at least one independent variable affects Y.
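The critical value and p-value for this F test can be verified in Python with scipy (a sketch):

    from scipy import stats

    f_stat, df1, df2 = 6.54, 2, 12

    f_crit = stats.f.ppf(0.95, df1, df2)    # upper-tail critical value at alpha = .05
    p_value = stats.f.sf(f_stat, df1, df2)  # upper-tail p-value

    print(f_crit)   # about 3.885
    print(p_value)  # about 0.012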

Are Individual Variables Significant?

- Use t tests of the individual variable slopes.
- Shows if there is a linear relationship between the variable Xj and Y.
- Hypotheses:
    H0: βj = 0  (no linear relationship)
    H1: βj ≠ 0  (linear relationship does exist between Xj and Y)

Are Individual Variables Significant? (continued)

  H0: βj = 0  (no linear relationship)
  H1: βj ≠ 0  (linear relationship does exist between Xj and Y)

Test statistic:

  t = (bj - 0) / Sbj        (df = n - k - 1)

Are Individual Variables Significant? (continued)

From the coefficient table of the Minitab output shown earlier:

  Predictor       Coef  SE Coef      T      P
  Price         -24.98    10.83  -2.31  0.040
  Advertising    74.13    25.97   2.85  0.014

- The t-value for Price is t = -2.31, with p-value 0.040.
- The t-value for Advertising is t = 2.85, with p-value 0.014.

Inferences about the Slope: t Test Example

  H0: βj = 0
  H1: βj ≠ 0

From the Minitab output:

  Predictor       Coef  SE Coef      T      P
  Price         -24.98    10.83  -2.31  0.040
  Advertising    74.13    25.97   2.85  0.014

  d.f. = 15 - 2 - 1 = 12,  α = .05,  tα/2 = 2.1788

Decision: the test statistic for each variable falls in the rejection region (|t| > 2.1788; both p-values < .05), so reject H0 for each variable.

Conclusion: there is evidence that both Price and Advertising affect pie sales at α = .05.
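The critical value and two-tail p-values can be checked in Python (a sketch, using the rounded t statistics from the output):

    from scipy import stats

    df = 15 - 2 - 1                          # n - k - 1 = 12
    t_crit = stats.t.ppf(1 - 0.05 / 2, df)
    print(t_crit)                            # about 2.1788

    for name, t_stat in [("Price", -2.31), ("Advertising", 2.85)]:
        p = 2 * stats.t.sf(abs(t_stat), df)  # two-tail p-value
        print(name, round(p, 3))             # roughly 0.040 and 0.015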

Confidence Interval Estimate for the Slope

Confidence interval for the population slope βj:

  bj ± t(n-k-1) · Sbj        where t has (n - k - 1) d.f.

  Predictor       Coef  SE Coef
  Price         -24.98    10.83
  Advertising    74.13    25.97

  d.f. = 15 - 2 - 1 = 12,  α = .05,  tα/2 = 2.1788

Example: form a 95% confidence interval for the effect of changes in price (X1) on pie sales:

  -24.98 ± (2.1788)(10.83)

So the interval is (-48.576, -1.374).

(This interval does not contain zero, so price has a significant effect on sales.)

Confidence Interval Estimate for the Slope (continued)

Confidence interval for the population slope βj.

Example: form a 95% confidence interval for the effect of changes in price (X1) on pie sales:

  -24.98 ± (2.1788)(10.83)

So the interval is (-48.576, -1.374).

Interpretation: weekly sales are estimated to be reduced by between 1.37 and 48.58 pies for each $1 increase in the selling price.
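statsmodels reports the same intervals directly (a sketch, reusing the fitted `model` from the earlier snippet):

    # 95% confidence intervals for the intercept and each slope
    print(model.conf_int(alpha=0.05))
    # The Price row should be close to (-48.58, -1.37)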

Assumptions of Regression
Use the acronym LINE:

- Linearity: the underlying relationship between X and Y is linear.
- Independence of Errors: error values are statistically independent.
- Normality of Error: error values (ε) are normally distributed for any given value of X.
- Equal Variance (Homoscedasticity): the probability distribution of the errors has constant variance.

Residual Analysis

The residual for observation i, ei, is the difference between its observed and predicted value:

  ei = Yi - Ŷi

Check the assumptions of regression by examining the residuals:

- Examine for the linearity assumption.
- Evaluate the independence assumption.
- Evaluate the normal distribution assumption.
- Examine for constant variance for all levels of X (homoscedasticity).

Graphical Analysis of Residuals

Can plot residuals vs. X.

Residual Analysis for Linearity

[Figure: residual plots vs. X illustrating a "Not Linear" pattern and a "Linear" pattern.]

Residual Analysis for Independence

[Figure: residual plots illustrating "Not Independent" and "Independent" error patterns.]

Residual Analysis for Normality

A normal probability plot of the residuals can be used to check for normality.

[Figure: normal probability plot of the residuals (Percent vs. Residual).]

Residual Analysis for Equal Variance

[Figure: residual plots vs. X illustrating non-constant variance and constant variance.]

Examples of Minitab Plots

[Figure: Minitab residual plots.]
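The slides produce these diagnostic plots in Minitab; a rough equivalent in Python with matplotlib is sketched below, reusing `model` from the earlier snippet:

    import matplotlib.pyplot as plt
    import scipy.stats as stats

    residuals = model.resid          # observed minus predicted values
    fitted = model.fittedvalues

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))

    # Residuals vs. fitted values: check linearity and constant variance
    axes[0].scatter(fitted, residuals)
    axes[0].axhline(0, color="gray")
    axes[0].set_title("Residuals vs. fitted")

    # Residuals in week order: check independence
    axes[1].plot(range(1, len(residuals) + 1), residuals, marker="o")
    axes[1].axhline(0, color="gray")
    axes[1].set_title("Residuals by week")

    # Normal probability plot: check normality
    stats.probplot(residuals, plot=axes[2])
    axes[2].set_title("Normal probability plot")

    plt.tight_layout()
    plt.show()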

Measuring Autocorrelation: The Durbin-Watson Statistic

- Used when data are collected over time to detect whether autocorrelation is present.
- Autocorrelation exists if residuals in one time period are related to residuals in another period.

Autocorrelation

Autocorrelation is correlation of the errors (residuals) over time.

[Figure: residuals plotted over time show a cyclic pattern, not random scatter; cyclical patterns are a sign of positive autocorrelation.]

Autocorrelation violates the regression assumption that residuals are random and independent.

The Durbin-Watson Statistic

The Durbin-Watson statistic is used to test for autocorrelation:

  H0: residuals are not correlated
  H1: positive autocorrelation is present

  D = Σ(i=2 to n) (ei - e(i-1))² / Σ(i=1 to n) ei²

- The possible range is 0 ≤ D ≤ 4.
- D should be close to 2 if H0 is true.
- D less than 2 may signal positive autocorrelation; D greater than 2 may signal negative autocorrelation.
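The statistic is easy to compute from the residuals; a sketch in Python (statsmodels also provides `durbin_watson` directly). Here it is applied to the pie sales residuals only to illustrate the computation; the slide's own example uses a different 25-week time series:

    import numpy as np
    from statsmodels.stats.stattools import durbin_watson

    def durbin_watson_manual(e: np.ndarray) -> float:
        """D = sum of squared successive differences / sum of squared residuals."""
        return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

    e = np.asarray(model.resid)                        # residuals from the fitted model
    print(durbin_watson_manual(e), durbin_watson(e))   # both give the same value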

Testing for Positive Autocorrelation

  H0: positive autocorrelation does not exist
  H1: positive autocorrelation is present

- Calculate the Durbin-Watson test statistic D (it can be found using Excel or Minitab).
- Find the values dL and dU from the Durbin-Watson table (for sample size n and number of independent variables k).
- Decision rule: reject H0 if D < dL; do not reject H0 if D > dU; the test is inconclusive if dL ≤ D ≤ dU.

Testing for Positive Autocorrelation (continued)

Suppose we have the following time series data. Is there autocorrelation?

[Figure: time series data and Minitab regression output.]

Example with n = 25:

  Durbin-Watson Calculations
  Sum of squared differences of residuals   3296.18
  Sum of squared residuals                  3279.98
  Durbin-Watson statistic                   1.00494

  D = Σ(i=2 to n) (ei - e(i-1))² / Σ(i=1 to n) ei² = 3296.18 / 3279.98 = 1.00494

Testing for Positive Autocorrelation (continued)

- Here, n = 25 and there is k = 1 independent variable.
- Using the Durbin-Watson table, dL = 1.29 and dU = 1.45.
- D = 1.00494 < dL = 1.29, so reject H0 and conclude that significant positive autocorrelation exists.
- Therefore the linear model is not the appropriate model to forecast sales.

Decision: reject H0 since D = 1.00494 < dL = 1.29.

Influence Analysis

Several methods can be used to measure the influence of individual observations:

- The hat matrix elements hi
- The Studentized deleted residuals ti
- Cook's distance statistic Di

The Hat Matrix Elements hi

- The hat matrix diagonal element for observation i, denoted hi, reflects the possible influence of Xi on the regression equation.
- If potentially influential observations are present, you may need to reevaluate the need to keep them in the model.
- Rule: if hi > 2(k + 1)/n, then Xi is an influential observation and is a candidate for removal from the model.

The Studentized Deleted Residuals ti

- The Studentized deleted residual measures the difference of each Yi from the value predicted by a model that includes all observations except observation i.
- It is expressed as a t statistic.

The Studentized Deleted Residuals ti (continued)

The Studentized deleted residual:

  ti = ei · sqrt[ (n - k - 1) / (SSE·(1 - hi) - ei²) ]

where
  ei  = the residual for observation i
  k   = number of independent variables
  SSE = error sum of squares
  hi  = hat matrix diagonal element for observation i

If ti > t(n-k-2) or ti < -t(n-k-2) (using a two-tail test at α = 0.10), then observation i is highly influential and a candidate for removal.

Cook's Distance Statistic Di

- Cook's distance statistic Di is based on both hi and the Studentized residual.
- It is used to decide whether an observation flagged by either the hi or ti criterion is unduly affecting the model.

Cook's Distance Statistic Di (continued)

Cook's Di statistic:

  Di = [ ei² / (k·MSE) ] · [ hi / (1 - hi)² ]

where
  ei  = the residual for observation i
  k   = number of independent variables
  MSE = mean square error
  hi  = hat matrix diagonal element for observation i

If Di > F(k+1, n-k-1) at α = 0.05, then the observation is highly influential on the regression equation and is a candidate for removal.
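statsmodels collects all three influence measures in one object; a sketch, reusing the fitted `model` from the earlier snippet (note that its internal formulas may differ slightly from the textbook versions above):

    influence = model.get_influence()

    hat_diag = influence.hat_matrix_diag                  # hat matrix diagonal elements h_i
    student_del = influence.resid_studentized_external    # Studentized deleted residuals t_i
    cooks_d, _ = influence.cooks_distance                 # Cook's distance D_i (and p-values)

    n, k = 15, 2
    flagged = hat_diag > 2 * (k + 1) / n                  # rule of thumb from the slide above
    print(list(zip(range(1, n + 1), flagged)))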

Collinearity

- Collinearity: high correlation exists among two or more independent variables.
- This means the correlated variables contribute redundant information to the multiple regression model.

Collinearity (continued)

Including two highly correlated independent variables can adversely affect the regression results:

- No new information is provided.
- It can lead to unstable coefficients (large standard errors and low t-values).
- Coefficient signs may not match prior expectations.

Some Indications of Strong Collinearity

- Incorrect signs on the coefficients.
- A large change in the value of a previous coefficient when a new variable is added to the model.
- A previously significant variable becomes non-significant when a new independent variable is added.
- The estimate of the standard deviation of the model increases when a variable is added to the model.

Detecting Collinearity (Variance Inflationary Factor)

VIFj is used to measure collinearity:

  VIFj = 1 / (1 - R²j)

where R²j is the coefficient of determination of variable Xj with all other X variables.

If VIFj > 5, Xj is highly correlated with the other independent variables.
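A sketch of the same check in Python, using statsmodels' variance_inflation_factor on the pie sales design matrix `X` built in the first snippet:

    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # X was built earlier as [constant, price, advertising]
    for j, name in enumerate(["Price", "Advertising"], start=1):
        vif = variance_inflation_factor(X, j)   # skip column 0 (the constant)
        print(name, round(vif, 2))              # both should be near 1.0, well below 5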

Example: Pie Sales

Recall the pie sales data (15 weeks of Pie Sales, Price, and Advertising, shown in the table earlier) and the multiple regression equation developed earlier:

  Sales = b0 + b1(Price) + b2(Advertising)

Detecting Collinearity in Minitab

Stat | Regression | Regression
Click on the Options box, then select the Variance inflationary factors box.

Output for the pie sales example:

Regression Analysis: Pie Sales versus Price, Advertising

The regression equation is
Pie Sales = 307 - 25.0 Price + 74.1 Advertising

Predictor       Coef  SE Coef      T      P  VIF
Constant       306.5    114.3   2.68  0.020
Price         -24.98    10.83  -2.31  0.040  1.0
Advertising    74.13    25.97   2.85  0.014  1.0

S = 47.4634   R-Sq = 52.1%   R-Sq(adj) = 44.2%

Both VIFs are < 5: there is no evidence of collinearity between Price and Advertising.

Chapter Summary

- Developed the multiple regression model
- Tested the significance of the multiple regression model
- Discussed adjusted r²
- Discussed using residual plots to check model assumptions
- Tested individual regression coefficients