6 Multiple Regression


Multiple Regression

Dr. George Menexes


Aristotle University of Thessaloniki
School of Agriculture, Lab of Agronomy


Learning Objectives

In this lecture, you learn:


 How to develop a multiple regression model
 How to interpret the regression coefficients
 How to determine which independent variables to
include in the regression model
 How to determine which independent variables are
most important in predicting a dependent variable
 How to use categorical variables in a regression
model
Simple and Multiple Least-Squares Regression

Simple regression: Ŷ = b0 + b1X (a fitted line in the X-Y plane)
Multiple regression with two predictors: Ŷ = b0 + b1X1 + b2X2 (a fitted plane over the X1-X2 plane)

In a simple regression model, the least-squares estimators minimize the sum of squared errors from the estimated regression line. In a multiple regression model, the least-squares estimators minimize the sum of squared errors from the estimated regression plane.

The Multiple Regression Model

Idea: examine the linear relationship between one dependent variable (Y) and two or more independent variables (Xi).

Multiple regression model with k independent variables:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

where β0 is the Y-intercept, β1, …, βk are the population slopes, and εi is the random error.
Multiple Regression Equation

The coefficients of the multiple regression model are estimated using sample data.

Multiple regression equation with k independent variables:

Ŷi = b0 + b1X1i + b2X2i + … + bkXki

where b0 is the estimated intercept, b1, …, bk are the estimated slope coefficients, and Ŷi is the estimated (or predicted) value of Y.

In this lecture we will always use Excel to obtain the regression slope coefficients and other regression summary measures.

Multiple Regression Equation

Example with two independent variables: Ŷ = b0 + b1X1 + b2X2

[Figure: the fitted regression plane over the X1-X2 plane, with one slope for variable X1 and another slope for variable X2.]
Multiple Regression Equation
2 Variable Example
 A distributor of frozen dessert pies wants to
evaluate factors thought to influence demand

 Dependent variable: Pie sales (units per week)


 Independent variables: Price (in $)
Advertising ($100’s)
 Data are collected for 15 weeks

Multiple Regression Equation
2 Variable Example

Week   Pie Sales   Price ($)   Advertising ($100s)
  1      350         5.50            3.3
  2      460         7.50            3.3
  3      350         8.00            3.0
  4      430         8.00            4.5
  5      350         6.80            3.0
  6      380         7.50            4.0
  7      430         4.50            3.0
  8      470         6.40            3.7
  9      450         7.00            3.5
 10      490         5.00            4.0
 11      340         7.20            3.5
 12      300         7.90            3.2
 13      440         5.90            4.0
 14      450         5.00            3.5
 15      300         7.00            2.7

Multiple regression equation:
Sales = b0 + b1(Price) + b2(Advertising) = b0 + b1X1 + b2X2
where X1 = Price and X2 = Advertising
Multiple Regression Equation
2 Variable Example, Excel
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

Sales = 306.526 - 24.975(X1) + 74.131(X2)

ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888

Multiple Regression Equation


2 Variable Example
Sales = 306.526 - 24.975(X1 ) + 74.131(X 2 )
where
Sales is in number of pies per week
Price is in $
Advertising is in $100’s.

b1 = -24.975: sales will decrease, on average, by 24.975 pies per week for each $1 increase in selling price, net of the effects of changes due to advertising.

b2 = 74.131: sales will increase, on average, by 74.131 pies per week for each $100 increase in advertising, net of the effects of changes due to price.
Multiple Regression Equation
2 Variable Example
Predict sales for a week in which the selling price is
$5.50 and advertising is $350:

Sales = 306.526 - 24.975(X1 ) + 74.131(X2 )


= 306.526 - 24.975 (5.50) + 74.131(3.5)
= 428.62

Note that Advertising is in $100's, so $350 means that X2 = 3.5.
Predicted sales: 428.62 pies.
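
As an aside (not part of the original Excel workflow), the same fit can be reproduced programmatically. The sketch below assumes Python with numpy and statsmodels, re-enters the 15-week data from the table above, and reproduces this prediction; the printed values should match the Excel output up to rounding.

```python
import numpy as np
import statsmodels.api as sm

# Weekly data from the table above: price ($), advertising ($100s), pie sales.
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                        3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])

# Design matrix with an intercept column, then X1 = price, X2 = advertising.
X = sm.add_constant(np.column_stack([price, advertising]))
model = sm.OLS(sales, X).fit()

print(model.params)     # expected to be close to [306.526, -24.975, 74.131]
print(model.rsquared)   # expected to be close to 0.52148

# Prediction at price = $5.50 and advertising = $350 (X2 = 3.5).
print(model.predict([[1.0, 5.50, 3.5]]))   # close to 428.62 pies
```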

Coefficient of
Multiple Determination

 Reports the proportion of total variation in Y


explained by all X variables taken together

r² = SSR / SST = regression sum of squares / total sum of squares
Coefficient of
Multiple Determination (Excel)
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

r² = SSR / SST = 29460.0 / 56493.3 = 0.52148

52.1% of the variation in pie sales is explained by the variation in price and advertising.
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888

Adjusted r2

 r2 never decreases when a new X variable is added


to the model
 This can be a disadvantage when comparing
models
 What is the net effect of adding a new variable?
 We lose a degree of freedom when a new X
variable is added
 Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?
Adjusted r2
 Shows the proportion of variation in Y explained by all X variables, adjusted for the number of X variables used:

r²adj = 1 − (1 − r²Y.12…k) · [(n − 1) / (n − k − 1)]

(where n = sample size, k = number of independent variables)
 Penalizes excessive use of unimportant independent
variables
 Smaller than r2
 Useful in comparing models
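
Both formulas can be checked numerically. A minimal sketch in Python, using the sums of squares reported in the Excel output for the pie-sales model:

```python
# Values taken from the Excel ANOVA output for the pie-sales model.
SSR, SST = 29460.027, 56493.333   # regression and total sums of squares
n, k = 15, 2                      # sample size, number of independent variables

r2 = SSR / SST
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(r2, 5))       # 0.52148
print(round(r2_adj, 5))   # 0.44172
```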

Adjusted r2
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

r²adj = 0.44172

44.2% of the variation in pie sales is explained by the variation in price and advertising, taking into account the sample size and the number of independent variables.
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888
F-Test for Overall Significance

 F-Test for Overall Significance of the Model


 Shows if there is a linear relationship between all of
the X variables considered together and Y
 Use F test statistic
 Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)

F-Test for Overall Significance


 Test statistic:

F = MSR / MSE = (SSR / k) / (SSE / (n − k − 1))

where F has k numerator degrees of freedom and (n − k − 1) denominator degrees of freedom.
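
A quick numerical check of this statistic, assuming Python with scipy for the F distribution and the sums of squares from the Excel ANOVA table:

```python
from scipy import stats

# Sums of squares from the Excel ANOVA table for the pie-sales model.
SSR, SSE = 29460.027, 27033.306
n, k = 15, 2

MSR = SSR / k              # 14730.013
MSE = SSE / (n - k - 1)    # 2252.776
F = MSR / MSE              # about 6.5386

p_value = stats.f.sf(F, k, n - k - 1)   # "Significance F", about 0.0120
print(F, p_value)
```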
F-Test for Overall Significance
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

F = MSR / MSE = 14730.0 / 2252.8 = 6.5386

The p-value for the F-test is reported as Significance F = 0.01201.
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888

F-Test for Overall Significance

H0: β1 = β2 = 0
H1: β1 and β2 not both zero
α = .05, df1 = 2, df2 = 12
Critical value: F.05 = 3.885

Test statistic: F = MSR / MSE = 6.5386

Decision: the F test statistic falls in the rejection region (6.5386 > 3.885, and p-value = .012 < .05), so reject H0.

Conclusion: there is evidence that at least one independent variable affects Y.
Residuals in Multiple Regression

Two variable model: Ŷ = b0 + b1X1 + b2X2

Residual for sample observation i: ei = Yi − Ŷi

[Figure: a sample observation Yi plotted above the fitted plane at (x1i, x2i); the residual is the vertical distance between Yi and the predicted value Ŷi.]

The best-fitting linear regression equation, Ŷ, is found by minimizing the sum of squared errors, Σe².

Multiple Regression Assumptions

Errors (residuals) from the regression model:

ei = Yi − Ŷi

Assumptions:
 The errors are independent
 The errors are normally distributed
 The errors have equal variance
Multiple Regression Assumptions

 These residual plots are used in multiple regression:
 Residuals vs. Ŷi (the predicted values)
 Residuals vs. X1i
 Residuals vs. X2i
 Residuals vs. time (if time series data)

Use the residual plots to check for


violations of regression assumptions

Individual Variables
Tests of Hypothesis
 Use t-tests of individual variable slopes
 Shows if there is a linear relationship
between the variable Xi and Y
 Hypotheses:
H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (a linear relationship does exist between Xi and Y)
Individual Variables
Tests of Hypothesis
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (a linear relationship does exist between Xj and Y)

 Test statistic (df = n − k − 1):

t = (bj − 0) / Sbj
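
A minimal numerical sketch of this t test for the Price slope, assuming Python with scipy and using the coefficient and standard error from the Excel output:

```python
from scipy import stats

# Coefficient and standard error for Price from the Excel output.
b_price, se_price = -24.97509, 10.83213
df = 15 - 2 - 1                         # n - k - 1 = 12

t = (b_price - 0) / se_price            # about -2.3057
p_value = 2 * stats.t.sf(abs(t), df)    # two-sided p-value, about 0.0398
print(t, p_value)
```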

Individual Variables
Tests of Hypothesis
Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error       47.46341
Observations         15

t-value for Price: t = -2.306, with p-value .0398
t-value for Advertising: t = 2.855, with p-value .0145

ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%


Intercept 306.52619 114.25389 2.68285 0.01993 57.58835 555.46404
Price -24.97509 10.83213 -2.30565 0.03979 -48.57626 -1.37392
Advertising 74.13096 25.96732 2.85478 0.01449 17.55303 130.70888
Individual Variables
Tests of Hypothesis
H0: βj = 0    H1: βj ≠ 0

              Coefficients   Standard Error   t Stat     P-value
Price          -24.97509       10.83213       -2.30565   0.03979
Advertising     74.13096       25.96732        2.85478   0.01449

d.f. = 15 - 2 - 1 = 12, α = .05, tα/2 = 2.1788

The test statistic for each variable falls in the rejection region (p-values < .05).

Decision: reject H0 for each variable.
Conclusion: there is evidence that both Price and Advertising affect pie sales at α = .05.

Confidence Interval Estimate


for the Slope
Confidence interval for the population slope βi:

bi ± t(n−k−1) · Sbi    (where t has n − k − 1 degrees of freedom)

              Coefficients   Standard Error
Intercept      306.52619      114.25389
Price          -24.97509       10.83213
Advertising     74.13096       25.96732

Here, t has (15 − 2 − 1) = 12 d.f.

Example: form a 95% confidence interval for the effect of changes in price (X1) on pie sales, holding constant the effects of advertising:
-24.975 ± (2.1788)(10.832), so the interval is (-48.576, -1.374).
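
The same interval can be reproduced numerically; a sketch assuming Python with scipy:

```python
from scipy import stats

b, se = -24.97509, 10.83213     # Price coefficient and its standard error
df = 15 - 2 - 1                 # n - k - 1 = 12

t_crit = stats.t.ppf(0.975, df)                  # about 2.1788
lower, upper = b - t_crit * se, b + t_crit * se
print(lower, upper)                              # about (-48.576, -1.374)
```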
Confidence Interval Estimate
for the Slope
Confidence interval for the population slope i

Coefficients Standard Error … Lower 95% Upper 95%


Intercept 306.52619 114.25389 … 57.58835 555.46404
Price -24.97509 10.83213 … -48.57626 -1.37392
Advertising 74.13096 25.96732 … 17.55303 130.70888

Example: Excel output also reports these interval endpoints:


Weekly sales are estimated to be reduced by between 1.37 and 48.58 pies for each $1 increase in the selling price, holding constant the effects of advertising.

Testing Portions of the Multiple


Regression Model
 Contribution of a Single Independent Variable Xj

SSR(Xj | all variables except Xj)


= SSR (all variables) – SSR(all variables except Xj)

 Measures the contribution of Xj in explaining the total


variation in Y (SST)
Testing Portions of the Multiple
Regression Model
Contribution of a Single Independent Variable Xj, assuming
all other variables are already included
(consider here a 3-variable model):

SSR(X1 | X2 and X3)


= SSR (all variables) – SSR(X2 and X3)

SSR(all variables) comes from the ANOVA section of the regression for Ŷ = b0 + b1X1 + b2X2 + b3X3, and SSR(X2 and X3) comes from the ANOVA section of the regression for Ŷ = b0 + b2X2 + b3X3.

This difference measures the contribution of X1 in explaining SST.

The Partial F-Test Statistic


Consider the hypothesis test:
H0: variable Xj does not significantly improve the
model after all other variables are included
H1: variable Xj significantly improves the model after
all other variables are included
Test using the F-test statistic (with 1 and n − k − 1 d.f.):

F = SSR(Xj | all variables except j) / MSE(all)
Testing Portions of Model:
Example
Example: Frozen dessert pies

Test at the α = .05 level to determine whether the price variable significantly improves the model, given that advertising is included.

Testing Portions of Model:


Example
H0: X1 (price) does not improve the model with X2 (advertising) included
H1: X1 does improve the model
α = .05, df = 1 and 12
F critical value = 4.75

(For X1 and X2)
ANOVA          df      SS             MS
Regression      2      29460.02687    14730.01343
Residual       12      27033.30647     2252.775539
Total          14      56493.33333

(For X2 only)
ANOVA          df      SS
Regression      1      17484.22249
Residual       13      39009.11085
Total          14      56493.33333
Testing Portions of Model:
Example
(For X1 and X2)
ANOVA          df      SS             MS
Regression      2      29460.02687    14730.01343
Residual       12      27033.30647     2252.775539
Total          14      56493.33333

(For X2 only)
ANOVA          df      SS
Regression      1      17484.22249
Residual       13      39009.11085
Total          14      56493.33333

F = SSR(X1 | X2) / MSE(all) = (29,460.03 − 17,484.22) / 2,252.78 = 5.316

Conclusion: since F = 5.316 exceeds the critical value of 4.75, reject H0; adding X1 does improve the model.
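
A numerical sketch of this partial F test, assuming Python with scipy and using the two ANOVA tables above:

```python
from scipy import stats

SSR_all = 29460.02687   # regression SS with X1 and X2
SSR_x2 = 17484.22249    # regression SS with X2 only
MSE_all = 2252.775539   # MSE of the full model
df_error = 12

F = (SSR_all - SSR_x2) / MSE_all           # about 5.316
F_crit = stats.f.ppf(0.95, 1, df_error)    # about 4.75
p_value = stats.f.sf(F, 1, df_error)       # about 0.0398, matching the t-test for Price
print(F, F_crit, p_value)
```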

Relationship Between Test


Statistics
 The partial F test statistic developed in this section
and the t test statistic are both used to determine the
contribution of an independent variable to a multiple
regression model.
 The hypothesis tests associated with these two
statistics always result in the same decision (that is,
the p-values are identical).

t²(a) = F(1, a), where a = degrees of freedom
Coefficient of Partial Determination for a k Variable Model

r²Yj.(all variables except j)
  = SSR(Xj | all variables except j) / [SST − SSR(all variables) + SSR(Xj | all variables except j)]

Measures the proportion of variation in the dependent variable


that is explained by Xj while controlling for (holding constant)
the other independent variables
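
Using the pie-sales sums of squares from the earlier slides, a short numerical sketch of this formula:

```python
SST = 56493.333            # total sum of squares
SSR_all = 29460.027        # SSR with X1 (price) and X2 (advertising)
SSR_x2_only = 17484.222    # SSR with X2 only

SSR_x1_given_x2 = SSR_all - SSR_x2_only   # contribution of X1 given X2

r2_partial = SSR_x1_given_x2 / (SST - SSR_all + SSR_x1_given_x2)
print(round(r2_partial, 3))   # about 0.307: price explains ~30.7% of the
                              # variation in sales not explained by advertising
```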

Using Dummy Variables

 A dummy variable is a categorical


independent variable with two levels:
 yes or no, on or off, male or female
 coded as 0 or 1
 Assumes equal slopes for other variables
 If more than two levels, the number of
dummy variables needed is (number of levels
- 1)
Dummy Variable Example

Ŷ = b0 + b1X1 + b2X2

Let:
Y = pie sales
X1 = price
X2 = holiday (X2 = 1 if a holiday occurred during the week)
(X2 = 0 if there was no holiday that week)

Dummy Variable Example

Ŷ = b0 + b1X1 + b2(1) = (b0 + b2) + b1X1    (Holiday)
Ŷ = b0 + b1X1 + b2(0) = b0 + b1X1           (No Holiday)

[Figure: plotting sales (Y) against price (X1) gives two parallel lines with the same slope b1 but different intercepts: b0 + b2 for holiday weeks (X2 = 1) and b0 for non-holiday weeks (X2 = 0).]

If H0: β2 = 0 is rejected, then "Holiday" has a significant effect on pie sales.
Dummy Variable Example
Sales = 300 - 30(Price) + 15(Holiday)

Sales: number of pies sold per week
Price: pie price in $
Holiday: 1 if a holiday occurred during the week, 0 if no holiday occurred

b2 = 15: on average, sales were 15 pies greater in weeks with a holiday than in weeks without a holiday, given the same price.
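
A minimal sketch of fitting a dummy-variable model in Python with statsmodels; the ten weeks of data below are hypothetical and are not the data behind the slide's estimated equation:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: ten weeks of prices, a 0/1 holiday indicator, and sales.
price   = np.array([5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 5.0, 6.0, 7.0, 8.0])
holiday = np.array([1,   0,   0,   1,   0,   0,   1,   0,   1,   0])
sales   = np.array([450, 400, 390, 420, 360, 340, 470, 410, 430, 330])

X = sm.add_constant(np.column_stack([price, holiday]))
model = sm.OLS(sales, X).fit()

print(model.params)    # b0, b1 (common price slope), b2 (intercept shift for holidays)
print(model.pvalues)   # a small p-value for b2 suggests "Holiday" affects sales
```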

Interaction Between
Independent Variables
 Hypothesizes interaction between pairs of X
variables
 Response to one X variable may vary at different levels
of another X variable

 Contains a two-way cross product term:

Ŷ = b0 + b1X1 + b2X2 + b3X3 = b0 + b1X1 + b2X2 + b3(X1X2)
Effect of Interaction
 Given: Y = β0 + β1X1 + β2X2 + β3X1X2 + ε

 Without the interaction term, the effect of X1 on Y is measured by β1.
 With the interaction term, the effect of X1 on Y is measured by β1 + β3X2.
 The effect changes as X2 changes.

Interaction Example

Suppose X2 is a dummy variable and the estimated regression equation is Ŷ = 1 + 2X1 + 3X2 + 4X1X2

X2 = 1:  Ŷ = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
X2 = 0:  Ŷ = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1

[Figure: the two lines plotted for X1 between 0 and 1.5.]

The slopes are different: the effect of X1 on Y depends on the value of X2.
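
A small numerical check of the two lines, assuming Python with numpy:

```python
import numpy as np

def y_hat(x1, x2):
    """Estimated equation from the slide: Y-hat = 1 + 2*X1 + 3*X2 + 4*X1*X2."""
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

x1 = np.array([0.0, 0.5, 1.0, 1.5])
print(y_hat(x1, 0))   # [1. 2. 3. 4.]    -> line 1 + 2*X1 (slope 2)
print(y_hat(x1, 1))   # [4. 7. 10. 13.]  -> line 4 + 6*X1 (slope 6)
```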
Significance of Interaction
Term
 Can perform a partial F-test for the
contribution of a variable to see if the
addition of an interaction term improves the
model

 Multiple interaction terms can be included


 Use a partial F-test for the simultaneous
contribution of multiple variables to the model

Simultaneous Contribution of
Independent Variables
 Use partial F-test for the simultaneous contribution
of multiple variables to the model
 Let m variables be an additional set of variables added
simultaneously
 To test the hypothesis that the set of m variables
improves the model:

F = {[SSR(all) − SSR(all except new set of m variables)] / m} / MSE(all)

(where F has m and n − k − 1 d.f.)


Lecture Summary
In this lecture, we have

 Developed the multiple regression model


 Tested the significance of the multiple
regression model
 Discussed adjusted r2
 Discussed using residual plots to check
model assumptions

Lecture Summary
In this lecture, we have

 Tested individual regression coefficients


 Tested portions of the regression model
 Used dummy variables
 Evaluated interaction effects
Some Special Topics

The F Test of a Multiple


Regression Model
A statistical test for the existence of a linear relationship between Y and any or all of the independent variables X1, X2, ..., Xk:

H0: β1 = β2 = ... = βk = 0
H1: Not all the βi (i = 1, 2, ..., k) are equal to 0

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square
Regression            SSR              k                    MSR = SSR / k
Error                 SSE              n − (k + 1)          MSE = SSE / (n − (k + 1))
Total                 SST              n − 1                MST = SST / (n − 1)

F Ratio = MSR / MSE
Decomposition of the Sum of Squares and
the Adjusted Coefficient of Determination

SST = SSR + SSE

R² = SSR / SST = 1 − SSE / SST

The adjusted multiple coefficient of determination, R̄², is the coefficient of determination with the SSE and SST divided by their respective degrees of freedom:

R̄² = 1 − [SSE / (n − (k + 1))] / [SST / (n − 1)]

Example: s = 1.911   R-sq = 96.1%   R-sq(adj) = 95.0%

Investigating the Validity of the Regression: Outliers and Influential Observations

[Figure, left (Outliers): an outlier pulls the fitted line away from the regression line obtained without it.]
[Figure, right (Influential Observations): a point with a large value of xi lies far from a cluster of data that shows no relationship on its own; the regression line fitted when all data are included is determined largely by that point.]
Possible Relation in the Region between the Available Cluster of Data and the Far Point

[Figure: the original cluster of data, a point with a large value of xi, and some of the possible data between the original cluster and the far point; a more appropriate curvilinear relationship is seen when the in-between data are known.]

Prediction in Multiple Regression

A (1 − α)100% prediction interval for a value of Y given values of the Xi:

ŷ ± t(α/2, n−(k+1)) · √[s²(ŷ) + MSE]

A (1 − α)100% prediction interval for the conditional mean of Y given values of the Xi:

ŷ ± t(α/2, n−(k+1)) · s[Ê(Y)]
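
In practice these intervals are usually obtained from software. A sketch assuming Python with statsmodels (its get_prediction method reports both intervals), refitting the pie-sales model from earlier in the lecture:

```python
import numpy as np
import statsmodels.api as sm

# Pie-sales data from earlier in the lecture.
price = np.array([5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
                  7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00])
advertising = np.array([3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
                        3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7])
sales = np.array([350, 460, 350, 430, 350, 380, 430, 470,
                  450, 490, 340, 300, 440, 450, 300])

X = sm.add_constant(np.column_stack([price, advertising]))
res = sm.OLS(sales, X).fit()

new_x = [[1.0, 5.50, 3.5]]            # intercept, price $5.50, advertising $350
frame = res.get_prediction(new_x).summary_frame(alpha=0.05)

# obs_ci_*  -> 95% prediction interval for an individual value of Y
# mean_ci_* -> 95% interval for the conditional mean of Y
print(frame[["mean", "obs_ci_lower", "obs_ci_upper",
             "mean_ci_lower", "mean_ci_upper"]])
```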
Polynomial Regression

One-variable polynomial regression model:

Y = β0 + β1X + β2X² + β3X³ + ... + βmX^m + ε

where m is the degree of the polynomial - the highest power of X appearing in the equation. The degree of the polynomial is the order of the model.

[Figure: example fits of ŷ = b0 + b1X (straight line), ŷ = b0 + b1X + b2X² with b2 < 0 (quadratic), and ŷ = b0 + b1X + b2X² + b3X³ (cubic).]
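
A minimal sketch of fitting a cubic polynomial (m = 3) by least squares on powers of X, assuming Python with numpy and synthetic data:

```python
import numpy as np

# Synthetic data following a cubic trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 2 + 1.5 * x - 0.8 * x**2 + 0.05 * x**3 + rng.normal(0, 1.0, x.size)

# Design matrix with columns 1, X, X^2, X^3, then ordinary least squares.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
coeffs, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)   # estimates of b0, b1, b2, b3

# np.polyfit(x, y, deg=3) gives the same fit (coefficients in reverse order).
```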

Nonlinear Models and Transformations

The multiplicative model:

Y = β0 · X1^β1 · X2^β2 · X3^β3 · ε

The logarithmic transformation:

log Y = log β0 + β1 log X1 + β2 log X2 + β3 log X3 + log ε
Transformations: Exponential Model

The exponential model:

Y = β0 · e^(β1X1) · ε

The logarithmic transformation:

log Y = log β0 + β1X1 + log ε
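
A minimal sketch of estimating the multiplicative model via the logarithmic transformation, assuming Python with numpy and synthetic data (natural logarithms are used):

```python
import numpy as np

# Synthetic data from a multiplicative model Y = b0 * X1^b1 * X2^b2 * eps.
rng = np.random.default_rng(1)
x1 = rng.uniform(1, 10, 50)
x2 = rng.uniform(1, 10, 50)
y = 3.0 * x1**0.8 * x2**-0.5 * np.exp(rng.normal(0, 0.1, 50))

# log Y = log b0 + b1 log X1 + b2 log X2 + log eps  -> linear in the logs.
X = np.column_stack([np.ones(50), np.log(x1), np.log(x2)])
coeffs, _, _, _ = np.linalg.lstsq(X, np.log(y), rcond=None)

print(np.exp(coeffs[0]), coeffs[1], coeffs[2])   # roughly 3.0, 0.8, -0.5
```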

Multicollinearity

[Figure: four diagrams of the predictor space spanned by x1 and x2.]

Orthogonal X variables provide information from independent sources: no multicollinearity.
Perfectly collinear X variables provide identical information content: no regression is possible.
Some degree of collinearity: problems with the regression depend on the degree of collinearity.
A high degree of negative collinearity also causes problems with the regression.
Effects of Multicollinearity

• Variances of regression coefficients are inflated.
• Magnitudes of regression coefficients may be different from what are expected.
• Signs of regression coefficients may not be as expected.
• Adding or removing variables produces large changes in coefficients.
• Removing a data point may cause large changes in coefficient estimates or signs.
• In some cases, the F ratio may be significant while the t ratios are not.

Variance Inflation Factor

The variance inflation factor associated with Xh:

VIF(Xh) = 1 / (1 − R²h)

where R²h is the R² value obtained for the regression of Xh on the other independent variables.

Relationship between VIF and R²h

[Figure: VIF plotted against R²h; VIF stays near 1 for small R²h and rises steeply (toward 50, 100 and beyond) as R²h approaches 1.]
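
A sketch of computing VIFs, assuming Python with statsmodels; the predictors are synthetic, with x2 constructed to be highly collinear with x1:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors: x2 is built to be highly correlated with x1.
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for idx, name in enumerate(["x1", "x2", "x3"], start=1):   # column 0 is the constant
    print(name, variance_inflation_factor(X, idx))
# x1 and x2 should show large VIFs (well above 5); x3 should be close to 1.
```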
Variance Inflation Factor (VIF)

Observation: the VIF values for the variables Lend and Price are both greater than 5. This indicates that some degree of multicollinearity exists with respect to these two variables.

Partial F Tests and Variable Selection Methods

Full model:
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

Reduced model:
Y = β0 + β1X1 + β2X2 + ε

Partial F test:
H0: β3 = β4 = 0
H1: β3 and β4 not both 0

Partial F statistic (with r and n − (k + 1) d.f.):

F = [(SSE_R − SSE_F) / r] / MSE_F

where SSE_R is the sum of squared errors of the reduced model, SSE_F is the sum of squared errors of the full model, MSE_F is the mean square error of the full model [MSE_F = SSE_F / (n − (k + 1))], and r is the number of variables dropped from the full model.
Variable Selection Methods

• Stepwise procedures
  Forward selection
  • Add one variable at a time to the model, on the basis of its F statistic
  Backward elimination
  • Remove one variable at a time, on the basis of its F statistic
  Stepwise regression
  • Adds variables to the model and subtracts variables from the model, on the basis of the F statistic

Stepwise Regression

1. Compute the F statistic for each variable not in the model.
2. Is there at least one variable with p-value < Pin? If not, stop.
3. Enter the most significant (smallest p-value) variable into the model.
4. Calculate the partial F for all variables in the model.
5. Is there a variable with p-value > Pout? If so, remove that variable. Return to step 1.
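
One possible reading of the entry step of this procedure as code; the sketch below (assuming Python with numpy and statsmodels) implements forward selection on p-values only and omits the removal (Pout) step for brevity:

```python
import numpy as np
import statsmodels.api as sm

def forward_select(y, candidates, p_in=0.05):
    """candidates: dict mapping a name to a 1-D predictor array.
    Returns the names entered into the model, in order of entry."""
    selected, remaining = [], dict(candidates)
    while remaining:
        pvals = {}
        for name, col in remaining.items():
            cols = [candidates[s] for s in selected] + [col]
            X = sm.add_constant(np.column_stack(cols))
            pvals[name] = sm.OLS(y, X).fit().pvalues[-1]   # p-value of the candidate
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_in:      # no remaining variable is significant enough
            break
        selected.append(best)
        remaining.pop(best)
    return selected
```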
Influential Points

 Outliers (univariate, multivariate)


 Leverage Points (Distances)
 Influence Statistics

Influential Points continued…


Distances

 Mahalanobis: A measure of how much a case's values on the


independent variables differ from the average of all cases. A
large Mahalanobis distance identifies a case as having extreme
values on one or more of the independent variables.
 Cook’s: A measure of how much the residuals of all cases would change if a particular case were excluded from the calculation of the regression coefficients. A large Cook's D indicates that excluding a case from the computation of the regression statistics changes the coefficients substantially.
 Leverage values: Measures the influence of a point on the fit
of the regression. The centered leverage ranges from 0 (no
influence on the fit) to (N-1)/N.

Influence Statistics (1)


 DfBeta(s): The difference in beta value is the
change in the regression coefficient that results from
the exclusion of a particular case. A value is
computed for each term in the model, including the
constant.
 Std. DfBeta(s): Standardized difference in beta
value. The change in the regression coefficient that
results from the exclusion of a particular case. You
may want to examine cases with absolute values
greater than 2 divided by the square root of N, where
N is the number of cases. A value is computed for
each term in the model, including the constant.
 DfFit: The difference in fit value is the change in the
predicted value that results from the exclusion of a
particular case.
Influence Statistics (2)
 Std. DfFit: Standardized difference in fit value. The change in the predicted value that results from the exclusion of a particular case. You may want to examine standardized values whose absolute value exceeds 2 times the square root of p/N, where p is the number of independent variables in the equation and N is the number of cases.
 Covariance Ratio: The ratio of the determinant of
the covariance matrix with a particular case excluded
from the calculation of the regression coefficients to
the determinant of the covariance matrix with all
cases included. If the ratio is close to 1, the case
does not significantly alter the covariance matrix.
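
These diagnostics can be obtained from most regression software. A sketch assuming Python with statsmodels, whose influence object exposes leverage, Cook's distance, DfBetas, and DfFit; the data are synthetic:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data; any fitted OLS model exposes the same diagnostics.
rng = np.random.default_rng(3)
x1 = rng.normal(size=30)
x2 = rng.normal(size=30)
y = 5 + 2 * x1 - 1 * x2 + rng.normal(size=30)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()
infl = res.get_influence()

leverage = infl.hat_matrix_diag            # leverage (hat) values
cooks_d, cooks_p = infl.cooks_distance     # Cook's distance and associated p-values
dfbetas = infl.dfbetas                     # standardized DfBeta, one column per coefficient
dffits, dffits_cutoff = infl.dffits        # standardized DfFit and its 2*sqrt(p/N) cutoff

flagged = np.where(np.abs(dffits) > dffits_cutoff)[0]
print("cases worth a closer look:", flagged)
```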
