6 Multiple Regression
Learning Objectives
Simple regression model (one X variable): Ŷ = b0 + b1X
Multiple regression model (two X variables): Ŷ = b0 + b1X1 + b2X2
In a simple regression model, the least-squares estimators minimize the sum of squared errors from the estimated regression line. In a multiple regression model, the least-squares estimators minimize the sum of squared errors from the estimated regression plane.
Population model with k independent variables: Yi = β0 + β1X1i + β2X2i + … + βkXki + εi
Multiple Regression Equation
The coefficients of the multiple regression model
are estimated using sample data
Multiple regression equation with k independent variables:
Ŷ = b0 + b1X1 + b2X2 + … + bkXk

where Ŷ is the estimated (or predicted) value of Y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients.

Example with two independent variables:

Ŷ = b0 + b1X1 + b2X2

[Figure: fitted regression plane over the (X1, X2) plane, with b1 the slope for variable X1 and b2 the slope for variable X2.]
Multiple Regression Equation: Two-Variable Example
A distributor of frozen dessert pies wants to evaluate factors thought to influence demand. Dependent variable: pie sales; independent variables: price and advertising (n = 15 weeks).
ANOVA         df    SS          MS          F         Significance F
Regression     2    29460.027   14730.013   6.53861   0.01201
Residual      12    27033.306    2252.776
Total         14    56493.333
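As a sketch of how output like this can be produced (the original 15 weekly observations are not reproduced here, so the data below are hypothetical stand-ins, as are the coefficient values used to simulate them), statsmodels fits the two-variable model:

```python
# Minimal sketch: fit a two-variable regression with statsmodels.
# The data are hypothetical stand-ins for the pie-sales example.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
price = rng.uniform(5.0, 9.0, size=15)        # pie price in $
advertising = rng.uniform(2.5, 4.5, size=15)  # advertising spend
sales = 300 - 25 * price + 60 * advertising + rng.normal(0, 45, size=15)

X = sm.add_constant(np.column_stack([price, advertising]))
model = sm.OLS(sales, X).fit()

print(model.params)                  # b0, b1, b2
print(model.fvalue, model.f_pvalue)  # overall F test, as in the ANOVA table
print(model.summary())               # coefficients, ANOVA, R-squared
```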
Coefficient of Multiple Determination

r²Y.12…k = SSR/SST, the proportion of the variation in Y explained by all the X variables taken together.

Adjusted r²

r²adj = 1 − (1 − r²Y.12…k) · (n − 1)/(n − k − 1)
(where n = sample size, k = number of independent variables)
• Penalizes excessive use of unimportant independent variables
• Smaller than r²
• Useful in comparing models
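A one-line helper (illustrative, not from the original slides) shows the formula in code and reproduces the adjusted value from the pie-sales output:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted r-squared: penalizes adding unimportant variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Pie-sales example: r2 = .52148 with n = 15 observations, k = 2 predictors.
print(round(adjusted_r2(0.52148, 15, 2), 5))  # ~0.44173 (0.44172 up to rounding)
```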
Adjusted r²: Pie Sales Example
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error      47.46341
Observations        15

r²adj = .44172: 44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and number of independent variables.
F-Test for Overall Significance

Tests the overall significance of the model:

H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0

F = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))

where F has k (numerator) and n − k − 1 (denominator) degrees of freedom.

For the pie-sales example (ANOVA table above): F = 14730.013 / 2252.776 = 6.539, with Significance F = .0120 < .05, so the model as a whole is significant at α = .05.
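A minimal check of the F computation, using the SSR and SSE values from the ANOVA table above (scipy supplies the upper-tail F probability):

```python
from scipy import stats

SSR, SSE = 29460.027, 27033.306  # from the ANOVA table above
n, k = 15, 2

MSR = SSR / k
MSE = SSE / (n - k - 1)
F = MSR / MSE                    # 14730.013 / 2252.776 ~ 6.5386
p = stats.f.sf(F, k, n - k - 1)  # upper-tail area ~ 0.0120
print(F, p)
```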
[Figure: sample observations and the fitted regression plane in (X1, X2, Y) space; the residual for observation i is the vertical distance ei = (Yi − Ŷi).]

The best-fitting linear regression equation, Ŷ, is found by minimizing the sum of squared errors, Σe².
Multiple Regression Assumptions

The errors ei = (Yi − Ŷi) are assumed to satisfy:
• The errors are independent
• The errors are normally distributed
• Errors have an equal variance

These assumptions are checked by plotting the residuals:
• Residuals vs. Ŷi
• Residuals vs. X1i
• Residuals vs. X2i
• Residuals vs. time (if time series data)
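A sketch of these residual plots with matplotlib, reusing the hypothetical pie-sales data from the earlier fitting sketch:

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# Hypothetical data, as in the earlier fitting sketch.
rng = np.random.default_rng(0)
price = rng.uniform(5.0, 9.0, size=15)
advertising = rng.uniform(2.5, 4.5, size=15)
sales = 300 - 25 * price + 60 * advertising + rng.normal(0, 45, size=15)
model = sm.OLS(sales, sm.add_constant(np.column_stack([price, advertising]))).fit()

# One panel per diagnostic: residuals vs. fitted Y-hat, X1, and X2.
fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
for ax, x, label in zip(axes, [model.fittedvalues, price, advertising],
                        ["fitted Y-hat", "X1 (price)", "X2 (advertising)"]):
    ax.scatter(x, model.resid)
    ax.axhline(0, linestyle="--")
    ax.set_xlabel(label)
    ax.set_ylabel("residual")
plt.tight_layout()
plt.show()
```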
Individual Variables: Tests of Hypothesis

Use t-tests of individual variable slopes. This shows whether there is a linear relationship between the variable Xj and Y.

Hypotheses:
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (linear relationship does exist between Xj and Y)
Test statistic:

t = (bj − 0) / Sbj   (df = n − k − 1)
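A quick check of a reported t-test, converting the Price t-value from the example that follows into its two-tailed p-value (scipy's t distribution; df = n − k − 1 = 12):

```python
from scipy import stats

# Two-tailed p-value for the reported Price slope: t = -2.306, df = 12.
df = 15 - 2 - 1
p_price = 2 * stats.t.sf(abs(-2.306), df)
print(round(p_price, 4))  # ~0.0398, matching the pie-sales output
```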
Example: from the pie-sales regression output (n = 15, so df = 12):

t-value for Price: t = −2.306, with p-value .0398
t-value for Advertising: t = 2.855, with p-value .0145
Decision: Reject H0 for each variable (each |t| falls in a rejection region beyond ±tα/2, with α/2 = .025 in each tail; equivalently, each p-value < .05).

Conclusion: There is evidence that both Price and Advertising affect pie sales at α = .05.
Note: ta² = F1,a, where a = degrees of freedom.
Coefficient of Partial Determination for a k-Variable Model

r²Yj.(all variables except j)

measures the proportion of the variation in Y explained by Xj, given that all the other independent variables are already included in the model.
Dummy Variables

Ŷ = b0 + b1X1 + b2X2

Let:
Y = pie sales
X1 = price
X2 = holiday (X2 = 1 if a holiday occurred during the week; X2 = 0 if there was no holiday that week)

No Holiday (X2 = 0): Ŷ = b0 + b1X1 + b2(0) = b0 + b1X1
Holiday (X2 = 1):    Ŷ = b0 + b1X1 + b2(1) = (b0 + b2) + b1X1

[Figure: two parallel lines against X1 (Price), the Holiday line shifted up by b2.]

If H0: β2 = 0 is rejected, then "Holiday" has a significant effect on pie sales.
Dummy Variable Example
Sales = 300 − 30(Price) + 15(Holiday)

Sales: number of pies sold per week
Price: pie price in $
Holiday: 1 if a holiday occurred during the week, 0 if no holiday occurred

Interpretation: holding price constant, sales are estimated to be 15 pies per week higher in a holiday week than in a non-holiday week.
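A tiny sketch of using this fitted equation for prediction (the $5.00 price is an illustrative value, not from the slides):

```python
def predicted_sales(price: float, holiday: int) -> float:
    """Fitted equation from the slide: Sales = 300 - 30*Price + 15*Holiday."""
    return 300 - 30 * price + 15 * holiday

# At a price of $5.00, a holiday week adds 15 pies per week:
print(predicted_sales(5.0, holiday=0))  # 150.0
print(predicted_sales(5.0, holiday=1))  # 165.0
```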
Interaction Between Independent Variables
• Hypothesizes interaction between pairs of X variables
• Response to one X variable may vary at different levels of another X variable
Ŷ = b0 + b1X1 + b2X2 + b3X3
  = b0 + b1X1 + b2X2 + b3(X1X2)   (where X3 = X1X2)
Effect of Interaction

Given: Y = β0 + β1X1 + β2X2 + β3X1X2 + ε

Without interaction (β3 = 0), the effect of X1 on Y is measured by β1 alone; with interaction, the effect of X1 on Y is β1 + β3X2, so it depends on the level of X2.
Interaction Example
Suppose X2 is a dummy variable and the estimated regression equation is Ŷ = 1 + 2X1 + 3X2 + 4X1X2.

X2 = 1: Ŷ = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
X2 = 0: Ŷ = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1

[Figure: the two fitted lines plotted against X1 over 0 to 1.5, with Y ranging from 0 to 12.]

The slopes are different: the effect of X1 on Y depends on the value of X2.
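The two fitted lines can be verified numerically; a short sketch of the estimated equation shows the slope in X1 changing from 2 to 6 with the dummy:

```python
def y_hat(x1: float, x2: int) -> float:
    """Estimated interaction model from the slide: 1 + 2*X1 + 3*X2 + 4*X1*X2."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

# Slope in X1 is 2 when X2 = 0, but 2 + 4 = 6 when X2 = 1:
print(y_hat(1, 0) - y_hat(0, 0))  # 2  (slope with X2 = 0)
print(y_hat(1, 1) - y_hat(0, 1))  # 6  (slope with X2 = 1)
```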
Significance of Interaction Term
A partial F-test for the contribution of a variable can be performed to see whether the addition of an interaction term improves the model.
Simultaneous Contribution of Independent Variables

Use a partial F-test for the simultaneous contribution of multiple variables to the model. Let m variables be an additional set of variables added simultaneously. To test the hypothesis that the set of m variables improves the model, compare the SSE of the model without them (reduced model) to the SSE of the model with them (full model) using the partial F statistic defined later in this section.
Lecture Summary

In this lecture, we have developed the multiple regression model and its least-squares estimates, measures of fit, the overall F-test and individual t-tests, dummy variables, interaction terms, multicollinearity, and variable selection methods.

SST = SSR + SSE

R² = SSR/SST = 1 − SSE/SST

Example: s = 1.911, R-sq = 96.1%, R-sq(adj) = 95.0%
Outliers and Influential Observations

[Figure: a scatter of data with an influential observation, a point with a large value of xi lying far from the main cluster, illustrating the possible relation in the region between the available cluster of data and the far point. Some of the possible data between the original cluster and the far point could reveal a more appropriate curvilinear relationship (seen only when the in-between data are known).]
Curvilinear relationships can be modeled with polynomial terms:

Ŷ = b0 + b1X + b2X²   (b2 < 0)
Ŷ = b0 + b1X + b2X² + b3X³

[Figure: a quadratic and a cubic fitted curve plotted against X1.]
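A brief sketch of fitting such a quadratic with numpy (hypothetical data; np.polyfit is one of several ways to estimate polynomial terms):

```python
import numpy as np

# Hypothetical curved data; polyfit returns coefficients from the
# highest power down, so a degree-2 fit yields [b2, b1, b0].
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 30)
y = 5 + 4 * x - 0.3 * x**2 + rng.normal(0.0, 1.0, 30)

b2, b1, b0 = np.polyfit(x, y, deg=2)
print(b0, b1, b2)                # b2 should come out near -0.3, i.e. b2 < 0
y_hat = b0 + b1 * x + b2 * x**2  # fitted curvilinear values
```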
Multicollinearity
[Figure: four diagrams of the information overlap between x1 and x2.]
• Orthogonal X variables provide information from independent sources. No multicollinearity.
• Perfectly collinear X variables provide identical information content. No regression.
• Some degree of collinearity: problems with regression depend on the degree of collinearity.
• A high degree of negative collinearity also causes problems with regression.
Effects of Multicollinearity
• Variances of regression coefficients are inflated.
• Magnitudes of regression coefficients may be different from what are expected.
• Signs of regression coefficients may not be as expected.
• Adding or removing variables produces large changes in coefficients.
• Removing a data point may cause large changes in coefficient estimates or signs.
• In some cases, the F ratio may be significant while the t ratios are not.
Variance Inflation Factor (VIF)

VIFh = 1 / (1 − Rh²)

where Rh² is the R² from regressing Xh on all the other X variables.

[Figure: VIF plotted against Rh² from 0.0 to 1.0; the VIF rises sharply toward 50 and beyond as Rh² approaches 1.]
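A sketch of the VIF computation via the auxiliary-regression definition above (hypothetical data; statsmodels also ships a ready-made variance_inflation_factor in statsmodels.stats.outliers_influence):

```python
import numpy as np
import statsmodels.api as sm

def vif(X: np.ndarray, j: int) -> float:
    """VIF for column j: regress X_j on the other columns and
    apply VIF_j = 1 / (1 - R_h^2)."""
    others = np.delete(X, j, axis=1)
    r2 = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
    return 1.0 / (1.0 - r2)

# Hypothetical design matrix with two highly correlated predictors:
rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=50)  # nearly collinear with x1
X = np.column_stack([x1, x2])
print(vif(X, 0), vif(X, 1))                # both well above 1
```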
Partial F test:

H0: β3 = β4 = 0
H1: β3 and β4 not both 0

Partial F statistic:

F = [(SSER − SSEF) / r] / MSEF

with (r, n − (k + 1)) degrees of freedom, where SSER is the sum of squared errors of the reduced model, SSEF is the sum of squared errors of the full model, MSEF is the mean square error of the full model [MSEF = SSEF/(n − (k + 1))], and r is the number of variables dropped from the full model.
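A small helper implementing the partial F statistic above (the numeric inputs in the example call are hypothetical):

```python
from scipy import stats

def partial_f(sse_reduced: float, sse_full: float,
              r: int, n: int, k: int) -> tuple[float, float]:
    """Partial F statistic for dropping r variables from a full model
    with k predictors; returns (F, p-value)."""
    mse_full = sse_full / (n - (k + 1))
    F = (sse_reduced - sse_full) / r / mse_full
    p = stats.f.sf(F, r, n - (k + 1))
    return F, p

# Hypothetical values: dropping X3, X4 (r = 2) from a k = 4 model, n = 30.
print(partial_f(sse_reduced=480.0, sse_full=400.0, r=2, n=30, k=4))
```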
Variable Selection Methods
• Stepwise procedures
  Forward selection
  • Add one variable at a time to the model, on the basis of its F statistic
  Backward elimination
  • Remove one variable at a time, on the basis of its F statistic
  Stepwise regression
  • Adds variables to the model and subtracts variables from the model, on the basis of the F statistic
Stepwise Regression

[Flowchart, partially recovered: compute the F statistic for each variable not in the model and add the most significant one; then ask "Is there a variable in the model with p-value > Pout?" and, if so, remove that variable; repeat until no variable can be added or removed.]
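A minimal forward-selection sketch under the scheme above, using slope p-values as the entry criterion (for a single added variable this is equivalent to its partial F test); the function name and threshold are illustrative:

```python
import numpy as np
import statsmodels.api as sm

def forward_select(X: np.ndarray, y: np.ndarray, p_in: float = 0.05) -> list[int]:
    """Forward selection sketch: repeatedly add the candidate column whose
    slope test has the smallest p-value, while that p-value is below p_in."""
    chosen: list[int] = []
    remaining = list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            cols = sm.add_constant(X[:, chosen + [j]])
            # p-value of the newly added slope (last coefficient)
            pvals[j] = sm.OLS(y, cols).fit().pvalues[-1]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_in:
            break  # no remaining variable clears the entry threshold
        chosen.append(best)
        remaining.remove(best)
    return chosen
```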