6 Multiple Regression


Multiple Regression

Dr. George Menexes


Aristotle University of Thessaloniki
School of Agriculture, Lab of Agronomy


Learning Objectives

In this lecture, you learn:


 How to develop a multiple regression model
 How to interpret the regression coefficients
 How to determine which independent variables to
include in the regression model
 How to determine which independent variables are
most important in predicting a dependent variable
 How to use categorical variables in a regression
model
Simple and Multiple Least-Squares Regression

Simple regression: Ŷ = b0 + b1X (a fitted line in the X-Y plane)
Multiple regression with two predictors: Ŷ = b0 + b1X1 + b2X2 (a fitted plane over the X1-X2 plane)

In a simple regression model, the least-squares estimators minimize the sum of squared errors from the estimated regression line. In a multiple regression model, the least-squares estimators minimize the sum of squared errors from the estimated regression plane.

The Multiple Regression Model

Idea: examine the linear relationship between one dependent variable (Y) and two or more independent variables (Xi).

Multiple regression model with k independent variables:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

where β0 is the Y-intercept, β1, …, βk are the population slopes, and εi is the random error.
Multiple Regression Equation

The coefficients of the multiple regression model are estimated using sample data.

Multiple regression equation with k independent variables:

Ŷi = b0 + b1X1i + b2X2i + … + bkXki

where b0 is the estimated intercept, b1, …, bk are the estimated slope coefficients, and Ŷi is the estimated (or predicted) value of Y.

In this lecture we will always use Excel to obtain the regression slope coefficients and other regression summary measures.

Multiple Regression Equation

Example with two independent variables: Ŷ = b0 + b1X1 + b2X2

[Figure: the fitted regression plane over the X1-X2 plane, with one slope for variable X1 and another slope for variable X2.]
Multiple Regression Equation
2 Variable Example
 A distributor of frozen dessert pies wants to
evaluate factors thought to influence demand

 Dependent variable: Pie sales (units per week)


 Independent variables: Price (in $)
Advertising ($100’s)
 Data are collected for 15 weeks

Multiple Regression Equation
2 Variable Example

Week   Pie Sales   Price ($)   Advertising ($100s)
  1      350         5.50            3.3
  2      460         7.50            3.3
  3      350         8.00            3.0
  4      430         8.00            4.5
  5      350         6.80            3.0
  6      380         7.50            4.0
  7      430         4.50            3.0
  8      470         6.40            3.7
  9      450         7.00            3.5
 10      490         5.00            4.0
 11      340         7.20            3.5
 12      300         7.90            3.2
 13      440         5.90            4.0
 14      450         5.00            3.5
 15      300         7.00            2.7

Multiple regression equation:
Sales = b0 + b1(Price) + b2(Advertising) = b0 + b1X1 + b2X2
where X1 = Price and X2 = Advertising
Multiple Regression Equation
2 Variable Example, Excel
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

Sales = 306.526 - 24.975(X1) + 74.131(X2)

ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888

Multiple Regression Equation


2 Variable Example
Sales = 306.526 - 24.975(X1 ) + 74.131(X 2 )
where
Sales is in number of pies per week
Price is in $
Advertising is in $100’s.

b1 = -24.975: sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising.

b2 = 74.131: sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price.
Multiple Regression Equation
2 Variable Example
Predict sales for a week in which the selling price is
$5.50 and advertising is $350:

Sales = 306.526 - 24.975(X1 ) + 74.131(X2 )


= 306.526 - 24.975 (5.50) + 74.131(3.5)
= 428.62

Note that Advertising is in $100's, so $350 means that X2 = 3.5.
Predicted sales: 428.62 pies.
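
As an aside (not part of the original Excel workflow), the same fit can be reproduced programmatically. The sketch below assumes Python with numpy and statsmodels, re-enters the 15-week data from the table above, and reproduces this prediction; the printed values should match the Excel output up to rounding.

```python
import numpy as np
import statsmodels.api as sm

# Weekly data from the table above: price ($), advertising ($100s), pie sales.
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                        3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])

# Design matrix with an intercept column, then X1 = price, X2 = advertising.
X = sm.add_constant(np.column_stack([price, advertising]))
model = sm.OLS(sales, X).fit()

print(model.params)     # expected to be close to [306.526, -24.975, 74.131]
print(model.rsquared)   # expected to be close to 0.52148

# Prediction at price = $5.50 and advertising = $350 (X2 = 3.5).
print(model.predict([[1.0, 5.50, 3.5]]))   # close to 428.62 pies
```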

Coefficient of
Multiple Determination

 Reports the proportion of total variation in Y


explained by all X variables taken together

r² = SSR / SST = regression sum of squares / total sum of squares
Coefficient of
Multiple Determination (Excel)
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

r² = SSR / SST = 29460.0 / 56493.3 = 0.52148

52.1% of the variation in pie sales is explained by the variation in price and advertising.
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888

Adjusted r2

 r2 never decreases when a new X variable is added


to the model
 This can be a disadvantage when comparing
models
 What is the net effect of adding a new variable?
 We lose a degree of freedom when a new X
variable is added
 Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?
Adjusted r2
 Shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:

r²adj = 1 − (1 − r²Y.12…k) · [(n − 1) / (n − k − 1)]

(where n = sample size, k = number of independent variables)
 Penalizes excessive use of unimportant independent
variables
 Smaller than r2
 Useful in comparing models
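
Both formulas can be checked numerically. A minimal sketch in Python, using the sums of squares reported in the Excel output for the pie-sales model:

```python
# Values taken from the Excel ANOVA output for the pie-sales model.
SSR, SST = 29460.027, 56493.333   # regression and total sums of squares
n, k = 15, 2                      # sample size, number of independent variables

r2 = SSR / SST
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(r2, 5))       # 0.52148
print(round(r2_adj, 5))   # 0.44172
```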

Adjusted r2
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

r²adj = 0.44172

44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and the number of independent variables.
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888
F-Test for Overall Significance

 F-Test for Overall Significance of the Model


 Shows if there is a linear relationship between all of
the X variables considered together and Y
 Use F test statistic
 Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)

F-Test for Overall Significance


 Test statistic:

F = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))

where F has k numerator degrees of freedom and (n − k − 1) denominator degrees of freedom.
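
A quick numerical check of this statistic, assuming Python with scipy for the F distribution and the sums of squares from the Excel ANOVA table:

```python
from scipy import stats

# Sums of squares from the Excel ANOVA table for the pie-sales model.
SSR, SSE = 29460.027, 27033.306
n, k = 15, 2

MSR = SSR / k              # 14730.013
MSE = SSE / (n - k - 1)    # 2252.776
F = MSR / MSE              # about 6.5386

p_value = stats.f.sf(F, k, n - k - 1)   # "Significance F", about 0.0120
print(F, p_value)
```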
F-Test for Overall Significance
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

F = MSR / MSE = 14730.0 / 2252.8 = 6.5386

The p-value for the F-test is reported as Significance F = 0.01201.
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888

F-Test for Overall Significance

H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = .05, df1 = 2, df2 = 12
Critical value: F.05 = 3.885

Test statistic: F = MSR / MSE = 6.5386

Decision: the F test statistic falls in the rejection region (6.5386 > 3.885, and p-value = .012 < .05), so reject H0.

Conclusion: there is evidence that at least one independent variable affects Y.
Residuals in Multiple Regression

Two variable model: Ŷ = b0 + b1X1 + b2X2

Residual for sample observation i: ei = Yi − Ŷi

[Figure: a sample observation Yi plotted above the fitted plane at (x1i, x2i); the residual is the vertical distance between Yi and the predicted value Ŷi.]

The best-fitting linear regression equation, Ŷ, is found by minimizing the sum of squared errors, Σe².

Multiple Regression Assumptions

Errors (residuals) from the regression model:

ei = Yi − Ŷi

Assumptions:
 The errors are independent
 The errors are normally distributed
 The errors have equal variance
Multiple Regression Assumptions

 These residual plots are used in multiple regression:
 Residuals vs. Ŷi (the predicted values)
 Residuals vs. X1i
 Residuals vs. X2i
 Residuals vs. time (if time series data)

Use the residual plots to check for


violations of regression assumptions

Individual Variables
Tests of Hypothesis
 Use t-tests of individual variable slopes
 Shows if there is a linear relationship
between the variable Xi and Y
 Hypotheses:
H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (a linear relationship does exist between Xi and Y)
Individual Variables
Tests of Hypothesis
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (a linear relationship does exist between Xj and Y)

 Test statistic (df = n − k − 1):

t = (bj − 0) / Sbj
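
A minimal numerical sketch of this t test for the Price slope, assuming Python with scipy and using the coefficient and standard error from the Excel output:

```python
from scipy import stats

# Coefficient and standard error for Price from the Excel output.
b_price, se_price = -24.97509, 10.83213
df = 15 - 2 - 1                         # n - k - 1 = 12

t = (b_price - 0) / se_price            # about -2.3057
p_value = 2 * stats.t.sf(abs(t), df)    # two-sided p-value, about 0.0398
print(t, p_value)
```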

Individual Variables
Tests of Hypothesis
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

t-value for Price: t = -2.306, with p-value .0398
t-value for Advertising: t = 2.855, with p-value .0145

ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888
Individual Variables
Tests of Hypothesis
H0: βj = 0    H1: βj ≠ 0

              Coefficients   Standard Error   t Stat     P-value
Price          -24.97509       10.83213       -2.30565   0.03979
Advertising     74.13096       25.96732        2.85478   0.01449

d.f. = 15 - 2 - 1 = 12, α = .05, tα/2 = 2.1788

The test statistic for each variable falls in the rejection region (p-values < .05).

Decision: reject H0 for each variable.
Conclusion: there is evidence that both Price and Advertising affect pie sales at α = .05.

Confidence Interval Estimate


for the Slope
Confidence interval for the population slope βi:

bi ± t(n−k−1) · Sbi    (where t has n − k − 1 degrees of freedom)

              Coefficients   Standard Error
Intercept      306.52619      114.25389
Price          -24.97509       10.83213
Advertising     74.13096       25.96732

Here, t has (15 − 2 − 1) = 12 d.f.

Example: form a 95% confidence interval for the effect of changes in price (X1) on pie sales, holding constant the effects of advertising:
-24.975 ± (2.1788)(10.832), so the interval is (-48.576, -1.374).
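
The same interval can be reproduced numerically; a sketch assuming Python with scipy:

```python
from scipy import stats

b, se = -24.97509, 10.83213     # Price coefficient and its standard error
df = 15 - 2 - 1                 # n - k - 1 = 12

t_crit = stats.t.ppf(0.975, df)                  # about 2.1788
lower, upper = b - t_crit * se, b + t_crit * se
print(lower, upper)                              # about (-48.576, -1.374)
```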
Confidence Interval Estimate
for the Slope
Confidence interval for the population slope i

Coefficients Standard Error … Lower 95% Upper 95%


Intercept 306.52619 114.25389 … 57.58835 555.46404
Price -24.97509 10.83213 … -48.57626 -1.37392
Advertising 74.13096 25.96732 … 17.55303 130.70888

Example: Excel output also reports these interval endpoints:


Weekly sales are estimated to be reduced by between 1.37 and 48.58 pies for each $1 increase in the selling price, holding constant the effects of advertising.

Testing Portions of the Multiple


Regression Model
 Contribution of a Single Independent Variable Xj

SSR(Xj | all variables except Xj)


= SSR (all variables) – SSR(all variables except Xj)

 Measures the contribution of Xj in explaining the total


variation in Y (SST)
Testing Portions of the Multiple
Regression Model
Contribution of a Single Independent Variable Xj, assuming
all other variables are already included
(consider here a 3-variable model):

SSR(X1 | X2 and X3)


= SSR (all variables) – SSR(X2 and X3)

SSR(all variables) comes from the ANOVA section of the regression for Ŷ = b0 + b1X1 + b2X2 + b3X3, and SSR(X2 and X3) comes from the ANOVA section of the regression for Ŷ = b0 + b2X2 + b3X3.

This difference measures the contribution of X1 in explaining SST.

The Partial F-Test Statistic


Consider the hypothesis test:
H0: variable Xj does not significantly improve the
model after all other variables are included
H1: variable Xj significantly improves the model after
all other variables are included
Test using the F-test statistic (with 1 and n − k − 1 d.f.):

F = SSR(Xj | all variables except j) / MSE(all)
Testing Portions of Model:
Example
Example: Frozen dessert pies

Test at the α = .05 level to determine whether the price variable significantly improves the model, given that advertising is included.

Testing Portions of Model:


Example
H0: X1 (price) does not improve the model with X2 (advertising) included
H1: X1 does improve the model
α = .05, df = 1 and 12
F critical value = 4.75

(For X1 and X2)
ANOVA          df      SS             MS
Regression      2      29460.02687    14730.01343
Residual       12      27033.30647     2252.775539
Total          14      56493.33333

(For X2 only)
ANOVA          df      SS
Regression      1      17484.22249
Residual       13      39009.11085
Total          14      56493.33333
Testing Portions of Model:
Example
(For X1 and X2)
ANOVA          df      SS             MS
Regression      2      29460.02687    14730.01343
Residual       12      27033.30647     2252.775539
Total          14      56493.33333

(For X2 only)
ANOVA          df      SS
Regression      1      17484.22249
Residual       13      39009.11085
Total          14      56493.33333

F = SSR(X1 | X2) / MSE(all) = (29,460.03 − 17,484.22) / 2,252.78 = 5.316

Conclusion: since F = 5.316 exceeds the critical value of 4.75, reject H0; adding X1 does improve the model.
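
A numerical sketch of this partial F test, assuming Python with scipy and using the two ANOVA tables above:

```python
from scipy import stats

SSR_all = 29460.02687   # regression SS with X1 and X2
SSR_x2 = 17484.22249    # regression SS with X2 only
MSE_all = 2252.775539   # MSE of the full model
df_error = 12

F = (SSR_all - SSR_x2) / MSE_all           # about 5.316
F_crit = stats.f.ppf(0.95, 1, df_error)    # about 4.75
p_value = stats.f.sf(F, 1, df_error)       # about 0.0398, matching the t-test for Price
print(F, F_crit, p_value)
```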

Relationship Between Test


Statistics
 The partial F test statistic developed in this section
and the t test statistic are both used to determine the
contribution of an independent variable to a multiple
regression model.
 The hypothesis tests associated with these two
statistics always result in the same decision (that is,
the p-values are identical).

t²(a) = F(1, a), where a = degrees of freedom
Coefficient of Partial Determination for a k Variable Model

r²Yj.(all variables except j)
  = SSR(Xj | all variables except j) / [SST − SSR(all variables) + SSR(Xj | all variables except j)]

Measures the proportion of variation in the dependent variable


that is explained by Xj while controlling for (holding constant)
the other independent variables
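
Using the pie-sales sums of squares from the earlier slides, a short numerical sketch of this formula:

```python
SST = 56493.333            # total sum of squares
SSR_all = 29460.027        # SSR with X1 (price) and X2 (advertising)
SSR_x2_only = 17484.222    # SSR with X2 only

SSR_x1_given_x2 = SSR_all - SSR_x2_only   # contribution of X1 given X2

r2_partial = SSR_x1_given_x2 / (SST - SSR_all + SSR_x1_given_x2)
print(round(r2_partial, 3))   # about 0.307: price explains ~30.7% of the
                              # variation in sales not explained by advertising
```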

Using Dummy Variables

 A dummy variable is a categorical


independent variable with two levels:
 yes or no, on or off, male or female
 coded as 0 or 1
 Assumes equal slopes for other variables
 If more than two levels, the number of
dummy variables needed is (number of levels
- 1)
Dummy Variable Example

Ŷ = b0 + b1X1 + b2X2

Let:
Y = pie sales
X1 = price
X2 = holiday (X2 = 1 if a holiday occurred during the week)
(X2 = 0 if there was no holiday that week)

Dummy Variable Example

Ŷ = b0 + b1X1 + b2(1) = (b0 + b2) + b1X1    (Holiday)
Ŷ = b0 + b1X1 + b2(0) = b0 + b1X1           (No Holiday)

[Figure: plotting sales (Y) against price (X1) gives two parallel lines with the same slope b1 but different intercepts: b0 + b2 for holiday weeks (X2 = 1) and b0 for non-holiday weeks (X2 = 0).]

If H0: β2 = 0 is rejected, then "Holiday" has a significant effect on pie sales.
Dummy Variable Example
Sales = 300 - 30(Price) + 15(Holiday)

Sales: number of pies sold per week
Price: pie price in $
Holiday: 1 if a holiday occurred during the week, 0 if no holiday occurred

b2 = 15: on average, sales were 15 pies greater in weeks with a holiday than in weeks without a holiday, given the same price.
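
A minimal sketch of fitting a dummy-variable model in Python with statsmodels; the ten weeks of data below are hypothetical and are not the data behind the slide's estimated equation:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: ten weeks of prices, a 0/1 holiday indicator, and sales.
price   = np.array([5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 5.0, 6.0, 7.0, 8.0])
holiday = np.array([1,   0,   0,   1,   0,   0,   1,   0,   1,   0])
sales   = np.array([450, 400, 390, 420, 360, 340, 470, 410, 430, 330])

X = sm.add_constant(np.column_stack([price, holiday]))
model = sm.OLS(sales, X).fit()

print(model.params)    # b0, b1 (common price slope), b2 (intercept shift for holidays)
print(model.pvalues)   # a small p-value for b2 suggests "Holiday" affects sales
```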

Interaction Between
Independent Variables
 Hypothesizes interaction between pairs of X
variables
 Response to one X variable may vary at different levels
of another X variable

 Contains a two-way cross product term:

Ŷ = b0 + b1X1 + b2X2 + b3X3 = b0 + b1X1 + b2X2 + b3(X1X2)
Effect of Interaction
 Given: Y = β0 + β1X1 + β2X2 + β3X1X2 + ε

 Without the interaction term, the effect of X1 on Y is measured by β1.
 With the interaction term, the effect of X1 on Y is measured by β1 + β3X2.
 The effect changes as X2 changes.

Interaction Example

Suppose X2 is a dummy variable and the estimated regression equation is Ŷ = 1 + 2X1 + 3X2 + 4X1X2

X2 = 1:  Ŷ = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
X2 = 0:  Ŷ = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1

[Figure: the two lines plotted for X1 between 0 and 1.5.]

The slopes are different: the effect of X1 on Y depends on the value of X2.
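
A small numerical check of the two lines, assuming Python with numpy:

```python
import numpy as np

def y_hat(x1, x2):
    """Estimated equation from the slide: Y-hat = 1 + 2*X1 + 3*X2 + 4*X1*X2."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

x1 = np.array([0.0, 0.5, 1.0, 1.5])
print(y_hat(x1, 0))   # [1. 2. 3. 4.]    -> line 1 + 2*X1 (slope 2)
print(y_hat(x1, 1))   # [4. 7. 10. 13.]  -> line 4 + 6*X1 (slope 6)
```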
Significance of Interaction
Term
 Can perform a partial F-test for the
contribution of a variable to see if the
addition of an interaction term improves the
model

 Multiple interaction terms can be included


 Use a partial F-test for the simultaneous
contribution of multiple variables to the model

Simultaneous Contribution of
Independent Variables
 Use partial F-test for the simultaneous contribution
of multiple variables to the model
 Let m variables be an additional set of variables added
simultaneously
 To test the hypothesis that the set of m variables
improves the model:

F = {[SSR(all) − SSR(all except new set of m variables)] / m} / MSE(all)

(where F has m and n − k − 1 d.f.)


Lecture Summary
In this lecture, we have

 Developed the multiple regression model


 Tested the significance of the multiple
regression model
 Discussed adjusted r2
 Discussed using residual plots to check
model assumptions

Lecture Summary
In this lecture, we have

 Tested individual regression coefficients


 Tested portions of the regression model
 Used dummy variables
 Evaluated interaction effects
Some Special Topics

The F Test of a Multiple


Regression Model
A statistical test for the existence of a linear relationship between Y and any or all of the independent variables X1, X2, ..., Xk:

H0: β1 = β2 = ... = βk = 0
H1: Not all the βi (i = 1, 2, ..., k) are equal to 0

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square
Regression            SSR              k                    MSR = SSR / k
Error                 SSE              n − (k + 1)          MSE = SSE / (n − (k + 1))
Total                 SST              n − 1                MST = SST / (n − 1)

F Ratio = MSR / MSE
Decomposition of the Sum of Squares and
the Adjusted Coefficient of Determination

SST = SSR + SSE

R² = SSR / SST = 1 − SSE / SST

The adjusted multiple coefficient of determination, R̄², is the coefficient of determination with the SSE and SST divided by their respective degrees of freedom:

R̄² = 1 − [SSE / (n − (k + 1))] / [SST / (n − 1)]

Example: s = 1.911   R-sq = 96.1%   R-sq(adj) = 95.0%

Investigating the Validity of the Regression: Outliers and Influential Observations

[Figure, left (Outliers): an outlier pulls the fitted line away from the regression line obtained without it.]
[Figure, right (Influential Observations): a point with a large value of xi lies far from a cluster of data that shows no relationship on its own; the regression line fitted when all data are included is determined largely by that point.]
Possible Relation in the Region between the Available Cluster of Data and the Far Point

[Figure: the original cluster of data, a point with a large value of xi, and some of the possible data between the original cluster and the far point; a more appropriate curvilinear relationship is seen when the in-between data are known.]

Prediction in Multiple Regression

A (1 − α)100% prediction interval for a value of Y given values of the Xi:

ŷ ± t(α/2, n−(k+1)) · √[s²(ŷ) + MSE]

A (1 − α)100% prediction interval for the conditional mean of Y given values of the Xi:

ŷ ± t(α/2, n−(k+1)) · s[Ê(Y)]
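
In practice these intervals are usually obtained from software. A sketch assuming Python with statsmodels (its get_prediction method reports both intervals), refitting the pie-sales model from earlier in the lecture:

```python
import numpy as np
import statsmodels.api as sm

# Pie-sales data from earlier in the lecture.
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                        3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])

X = sm.add_constant(np.column_stack([price, advertising]))
res = sm.OLS(sales, X).fit()

new_x = [[1.0, 5.50, 3.5]]            # intercept, price $5.50, advertising $350
frame = res.get_prediction(new_x).summary_frame(alpha=0.05)

# obs_ci_*  -> 95% prediction interval for an individual value of Y
# mean_ci_* -> 95% interval for the conditional mean of Y
print(frame[["mean", "obs_ci_lower", "obs_ci_upper",
             "mean_ci_lower", "mean_ci_upper"]])
```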
Polynomial Regression

One-variable polynomial regression model:

Y = β0 + β1X + β2X² + β3X³ + ... + βmX^m + ε

where m is the degree of the polynomial - the highest power of X appearing in the equation. The degree of the polynomial is the order of the model.

[Figure: example fits of ŷ = b0 + b1X (straight line), ŷ = b0 + b1X + b2X² with b2 < 0 (quadratic), and ŷ = b0 + b1X + b2X² + b3X³ (cubic).]
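
A minimal sketch of fitting a cubic polynomial (m = 3) by least squares on powers of X, assuming Python with numpy and synthetic data:

```python
import numpy as np

# Synthetic data following a cubic trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 2 + 1.5 * x - 0.8 * x**2 + 0.05 * x**3 + rng.normal(0, 1.0, x.size)

# Design matrix with columns 1, X, X^2, X^3, then ordinary least squares.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
coeffs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)   # estimates of b0, b1, b2, b3

# np.polyfit(x, y, deg=3) gives the same fit (coefficients in reverse order).
```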

Nonlinear Models and Transformations

The multiplicative model:

Y = β0 · X1^β1 · X2^β2 · X3^β3 · ε

The logarithmic transformation:

log Y = log β0 + β1 log X1 + β2 log X2 + β3 log X3 + log ε
Transformations: Exponential Model

The exponential model:

Y = β0 · e^(β1X1) · ε

The logarithmic transformation:

log Y = log β0 + β1X1 + log ε
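
A minimal sketch of estimating the multiplicative model via the logarithmic transformation, assuming Python with numpy and synthetic data (natural logarithms are used):

```python
import numpy as np

# Synthetic data from a multiplicative model Y = b0 * X1^b1 * X2^b2 * eps.
rng = np.random.default_rng(1)
x1 = rng.uniform(1, 10, 50)
x2 = rng.uniform(1, 10, 50)
y = 3.0 * x1**0.8 * x2**-0.5 * np.exp(rng.normal(0, 0.1, 50))

# log Y = log b0 + b1 log X1 + b2 log X2 + log eps  -> linear in the logs.
X = np.column_stack([np.ones(50), np.log(x1), np.log(x2)])
coeffs, _, _, _ = np.linalg.lstsq(X, np.log(y), rcond=None)

print(np.exp(coeffs[0]), coeffs[1], coeffs[2])   # roughly 3.0, 0.8, -0.5
```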

Multicollinearity

[Figure: four diagrams of the predictor space spanned by x1 and x2.]

Orthogonal X variables provide information from independent sources: no multicollinearity.
Perfectly collinear X variables provide identical information content: no regression is possible.
Some degree of collinearity: problems with the regression depend on the degree of collinearity.
A high degree of negative collinearity also causes problems with the regression.
Effects of Multicollinearity

• Variances of regression coefficients are inflated.
• Magnitudes of regression coefficients may be different from what are expected.
• Signs of regression coefficients may not be as expected.
• Adding or removing variables produces large changes in coefficients.
• Removing a data point may cause large changes in coefficient estimates or signs.
• In some cases, the F ratio may be significant while the t ratios are not.

Variance Inflation Factor

The variance inflation factor associated with Xh:

VIF(Xh) = 1 / (1 − R²h)

where R²h is the R² value obtained for the regression of Xh on the other independent variables.

Relationship between VIF and R²h

[Figure: VIF plotted against R²h; VIF stays near 1 for small R²h and rises steeply (toward 50, 100 and beyond) as R²h approaches 1.]
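
A sketch of computing VIFs, assuming Python with statsmodels; the predictors are synthetic, with x2 constructed to be highly collinear with x1:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors: x2 is built to be highly correlated with x1.
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for idx, name in enumerate(["x1", "x2", "x3"], start=1):   # column 0 is the constant
    print(name, variance_inflation_factor(X, idx))
# x1 and x2 should show large VIFs (well above 5); x3 should be close to 1.
```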
Variance Inflation Factor (VIF)

Observation: the VIF values for the variables Lend and Price are both greater than 5. This indicates that some degree of multicollinearity exists with respect to these two variables.

Partial F Tests and Variable Selection Methods

Full model:
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

Reduced model:
Y = β0 + β1X1 + β2X2 + ε

Partial F test:
H0: β3 = β4 = 0
H1: β3 and β4 not both 0

Partial F statistic (with r and n − (k + 1) d.f.):

F = [(SSE_R − SSE_F) / r] / MSE_F

where SSE_R is the sum of squared errors of the reduced model, SSE_F is the sum of squared errors of the full model, MSE_F is the mean square error of the full model [MSE_F = SSE_F / (n − (k + 1))], and r is the number of variables dropped from the full model.
Variable Selection Methods

• Stepwise procedures
  Forward selection
  • Add one variable at a time to the model, on the basis of its F statistic
  Backward elimination
  • Remove one variable at a time, on the basis of its F statistic
  Stepwise regression
  • Adds variables to the model and subtracts variables from the model, on the basis of the F statistic

Stepwise Regression

1. Compute the F statistic for each variable not in the model.
2. Is there at least one variable with p-value < Pin? If not, stop.
3. Enter the most significant (smallest p-value) variable into the model.
4. Calculate the partial F for all variables in the model.
5. Is there a variable with p-value > Pout? If so, remove that variable. Return to step 1.
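
One possible reading of the entry step of this procedure as code; the sketch below (assuming Python with numpy and statsmodels) implements forward selection on p-values only and omits the removal (Pout) step for brevity:

```python
import numpy as np
import statsmodels.api as sm

def forward_select(y, candidates, p_in=0.05):
    """candidates: dict mapping a name to a 1-D predictor array.
    Returns the names entered into the model, in order of entry."""
    selected, remaining = [], dict(candidates)
    while remaining:
        pvals = {}
        for name, col in remaining.items():
            cols = [candidates[s] for s in selected] + [col]
            X = sm.add_constant(np.column_stack(cols))
            pvals[name] = sm.OLS(y, X).fit().pvalues[-1]   # p-value of the candidate
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_in:      # no remaining variable is significant enough
            break
        selected.append(best)
        remaining.pop(best)
    return selected
```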
Influential Points

 Outliers (univariate, multivariate)


 Leverage Points (Distances)
 Influence Statistics

Influential Points continued…


Distances

 Mahalanobis: A measure of how much a case's values on the


independent variables differ from the average of all cases. A
large Mahalanobis distance identifies a case as having extreme
values on one or more of the independent variables.
 Cook’s: A measure of how much the residuals of all cases would change if a particular case were excluded from the calculation of the regression coefficients. A large Cook's D indicates that excluding a case from the computation of the regression statistics changes the coefficients substantially.
 Leverage values: Measures the influence of a point on the fit
of the regression. The centered leverage ranges from 0 (no
influence on the fit) to (N-1)/N.

Influence Statistics (1)


 DfBeta(s): The difference in beta value is the
change in the regression coefficient that results from
the exclusion of a particular case. A value is
computed for each term in the model, including the
constant.
 Std. DfBeta(s): Standardized difference in beta
value. The change in the regression coefficient that
results from the exclusion of a particular case. You
may want to examine cases with absolute values
greater than 2 divided by the square root of N, where
N is the number of cases. A value is computed for
each term in the model, including the constant.
 DfFit: The difference in fit value is the change in the
predicted value that results from the exclusion of a
particular case.
Influence Statistics (2)
 Std. DfFit: Standardized difference in fit value. The change in the predicted value that results from the exclusion of a particular case. You may want to examine standardized values whose absolute value exceeds 2 times the square root of p/N, where p is the number of independent variables in the equation and N is the number of cases.
 Covariance Ratio: The ratio of the determinant of
the covariance matrix with a particular case excluded
from the calculation of the regression coefficients to
the determinant of the covariance matrix with all
cases included. If the ratio is close to 1, the case
does not significantly alter the covariance matrix.
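
These diagnostics can be obtained from most regression software. A sketch assuming Python with statsmodels, whose influence object exposes leverage, Cook's distance, DfBetas, and DfFit; the data are synthetic:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data; any fitted OLS model exposes the same diagnostics.
rng = np.random.default_rng(3)
x1 = rng.normal(size=30)
x2 = rng.normal(size=30)
y = 5 + 2 * x1 - 1 * x2 + rng.normal(size=30)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()
infl = res.get_influence()

leverage = infl.hat_matrix_diag            # leverage (hat) values
cooks_d, cooks_p = infl.cooks_distance     # Cook's distance and associated p-values
dfbetas = infl.dfbetas                     # standardized DfBeta, one column per coefficient
dffits, dffits_cutoff = infl.dffits        # standardized DfFit and its 2*sqrt(p/N) cutoff

flagged = np.where(np.abs(dffits) > dffits_cutoff)[0]
print("cases worth a closer look:", flagged)
```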
