CHAPTER 2

MULTIPLE LINEAR REGRESSION


So far, the simple regression model only able to examine the relationship between
two variables. However, the model may have large unexplained variation, that is
having low R2 which indicate poor fit to the data. The poor fir may well be due to
the fact that the response variable, y depends on not just x but a few other factors as
well. When used alone, x fails to be a good predictor of y since the effect of those
other influencing variables have not been taken into the modeling. Naturally, adding
more factors to the model would explain more variation in the response variable.
Probabilistic model that is used to examine the relationship between two or more
independent variable and a dependent variable is called multiple regression model.
Multiple regression model should be a better model for predicting the dependent
variable.

2.1 MULTIPLE REGRESSION MODELS


Suppose that the yield (in tons) of a crop depends not only on the amount of fertilizer
but also on the rainfall and the average temperature. A multiple regression model that
might describe this relationship is

y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ε

where y denotes the yield, x₁ denotes the amount of fertilizer, x₂ denotes the average
rainfall and x₃ denotes the average temperature. This is a multiple linear regression
model with three independent variables. The term linear is used because the equation is a
linear function of the unknown parameters β₀, β₁, β₂ and β₃.

The parameter β₀ is the intercept of the regression plane. If the range of the data
includes x₁ = x₂ = x₃ = 0, then β₀ is the mean of y when x₁ = x₂ = x₃ = 0. Otherwise β₀
has no physical interpretation. The parameter β₁ indicates the expected change in the
response variable y per unit change in x₁ when x₂ and x₃ are held constant.
Similarly, β₂ measures the expected change in y per unit change in x₂ when x₁ and
x₃ are held constant.

In general, the response (dependent) variable y may be related to k independent or predictor
variables. The model

y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₖxₖ + ε

is called a multiple linear regression model with k predictors. The parameters
βⱼ, j = 1, 2, ..., k, are called the regression coefficients. This model describes a
hyperplane in the k-dimensional space of the independent variables xⱼ. The
parameter βⱼ represents the expected change in the response y per unit change in xⱼ
when all of the remaining independent variables xᵢ (i ≠ j) are held constant. For
this reason the parameters βⱼ, j = 1, 2, ..., k, are often called partial regression
coefficients.
Consider the following estimated two-predictor model relating performance at
university to performance at high school and an entrance test score:

UGPA = 1.29 + 0.453 HSGPA + 0.094 ENTS

Since no one who attends college has either a zero high school GPA or a zero on the
university entrance test, the intercept in this equation is not meaningful. Holding
ENTS fixed, an additional point on HSGPA is associated with an additional 0.453 of a point
on the university GPA. That is, if we choose two students A and B with the same ENTS
score, but the high school GPA of student A is one point higher than the high school GPA
of student B, then we would expect student A to have a university GPA 0.453 points higher
than that of student B.
Multiple linear regression models are often used as empirical models or
approximating functions. That is, the true functional relationship between y and
x1 , x2 ,..., xk is unknown, but over certain ranges of the independent variables, the
linear regression model is an adequate approximation to the true unknown function.
The variable ε is the error term, containing factors other than x₁, x₂, ..., xₖ that affect
y. The independent variables can themselves be functions of other variables, such as higher
order terms, interactions between variables, and coded/dummy variables. For example,

CMOX = β₀ + β₁TAR + β₂TAR² + β₃FIL + ε

2.1.1 ESTIMATION OF MODEL PARAMETERS


The method of least squares can be used to estimate the regression coefficients in
a multiple regression model. Suppose that n > k observations are available, and let
yᵢ denote the ith observed response and xᵢⱼ denote the ith observation or level of the
independent variable xⱼ. The error term ε in the model is assumed to have a normal
distribution with E(ε) = 0 and Var(ε) = σ², and the errors are assumed to be uncorrelated.

The sample multiple regression model above may be rewritten as

yᵢ = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + ⋯ + βₖxᵢₖ + εᵢ = β₀ + Σⱼ βⱼxᵢⱼ + εᵢ,   i = 1, 2, ..., n

where the sum over j runs from 1 to k.

The method of ordinary least squares chooses the estimates β̂ⱼ to minimize the sum
of squared residuals. That is, given n observations on y and the xⱼ, the estimates β̂ⱼ are
chosen simultaneously to make the least squares function

S(β₀, β₁, ..., βₖ) = Σᵢ εᵢ² = Σᵢ ( yᵢ − β₀ − Σⱼ βⱼxᵢⱼ )²

as small as possible (sums over i run from 1 to n, sums over j from 1 to k). The function S
must be minimized with respect to β₀, β₁, ..., βₖ. The least squares estimators of
β₀, β₁, ..., βₖ must satisfy

∂S/∂β₀ evaluated at (β̂₀, β̂₁, ..., β̂ₖ):   −2 Σᵢ ( yᵢ − β̂₀ − Σⱼ β̂ⱼxᵢⱼ ) = 0

and

∂S/∂βⱼ evaluated at (β̂₀, β̂₁, ..., β̂ₖ):   −2 Σᵢ xᵢⱼ ( yᵢ − β̂₀ − Σⱼ β̂ⱼxᵢⱼ ) = 0,   j = 1, 2, ..., k

giving the least squares normal equations

n β̂₀ + β̂₁ Σ xᵢ₁ + β̂₂ Σ xᵢ₂ + ⋯ + β̂ₖ Σ xᵢₖ = Σ yᵢ
β̂₀ Σ xᵢ₁ + β̂₁ Σ xᵢ₁² + β̂₂ Σ xᵢ₁xᵢ₂ + ⋯ + β̂ₖ Σ xᵢ₁xᵢₖ = Σ xᵢ₁yᵢ
β̂₀ Σ xᵢ₂ + β̂₁ Σ xᵢ₁xᵢ₂ + β̂₂ Σ xᵢ₂² + ⋯ + β̂ₖ Σ xᵢ₂xᵢₖ = Σ xᵢ₂yᵢ
  ⋮
β̂₀ Σ xᵢₖ + β̂₁ Σ xᵢ₁xᵢₖ + β̂₂ Σ xᵢ₂xᵢₖ + ⋯ + β̂ₖ Σ xᵢₖ² = Σ xᵢₖyᵢ

where all sums run from i = 1 to n.

Note that there are p = k + 1 normal equations, one for each of the unknown
regression coefficients. The solution to the normal equations gives the least squares
estimators β̂₀, β̂₁, ..., β̂ₖ. It is more convenient to deal with multiple regression
models if they are expressed in matrix notation. This allows a very compact display
of the model, data, and results. In matrix notation, the regression model for k
independent variables can be written as

y = Xβ + ε

where

y = [y₁, y₂, ..., yₙ]′,   β = [β₀, β₁, ..., βₖ]′,   ε = [ε₁, ε₂, ..., εₙ]′

and X is the matrix whose ith row is (1, xᵢ₁, xᵢ₂, ..., xᵢₖ):

    X = [ 1   x₁₁   x₁₂   ⋯   x₁ₖ ]
        [ 1   x₂₁   x₂₂   ⋯   x₂ₖ ]
        [ ⋮    ⋮     ⋮          ⋮  ]
        [ 1   xₙ₁   xₙ₂   ⋯   xₙₖ ]

3
In general, y is an ( n x 1) vector of the observations, X is an ( n  p ) matrix of the
independent variables, β is a ( p  1) vector of the regression coefficients, and ε is
an ( n  1) vector of random errors.

The OLS estimation finds the vector of least squares estimators, β̂, that minimizes

S(β) = Σᵢ εᵢ² = ε′ε = (y − Xβ)′(y − Xβ)
     = y′y − β′X′y − y′Xβ + β′X′Xβ
     = y′y − 2β′X′y + β′X′Xβ

The least squares estimator must satisfy

∂S/∂β evaluated at β̂:   −2X′y + 2X′Xβ̂ = 0,   that is,   X′Xβ̂ = X′y

To solve these normal equations, multiply both sides by the inverse of X′X. Thus, the
least squares estimator of β is given by

β̂ = (X′X)⁻¹X′y

provided that the inverse matrix (X′X)⁻¹ exists. The (X′X)⁻¹ matrix will always exist
if the independent variables are linearly independent, that is, if no column of the X
matrix is a linear combination of the other columns.
The matrix XX is a ( p x p ) symmetric matrix and Xy is a ( p  1) column vector.
Note the special structure of the XX matrix. The diagonal elements of XX , and
the off-diagonal elements are the sums of cross products of the elements in the
columns of X . Furthermore, note that the elements of Xy are the sums of cross
products of the columns of X and the observations yi . That is:

 n n n
  ˆ   n 
 n  xi1  xi 2 L  ik   0    yi 
x
 i =1 i =1 i =1     i =1 
 n n n n  ˆ   n 
  xi1  xi21  xi1xi 2 L  x i1 xik    
 1  x i1 yi 
 i =1 i =1 i =1 i =1    i =1 
 n n n n   =  n 
  xi 2  xi 2 xi1  xi22 L  i 2 ik   2   i 2 i 
x x ˆ x y
 i =1 i =1 i =1 i =1     i =1 
 M M M M     M 
 n   M  n 
 n n n
2   
  xik  xik xi1  xik xi 2 L  xik   ˆk    xik yi 
 i =1 i =1 i =1 i =1   i =1 

For even moderately sized n and k, solving the normal equations by hand calculation
is tedious. Fortunately, for practical purposes, modern statistical software can solve
these equations, even for large n and k, in a fraction of a second.
The vector of fitted values ŷᵢ corresponding to the observed values yᵢ is given by

ŷ = Xβ̂ = X(X′X)⁻¹X′y = Hy

The (n × n) matrix H = X(X′X)⁻¹X′ is usually called the hat matrix. The n residuals
may be conveniently written in matrix notation as

e = y − ŷ

which can be written as e = y − Xβ̂ = y − Hy = (I − H)y. Note that both H and I − H
are symmetric and idempotent matrices, with tr(H) = p and tr(I − H) = n − p.
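As a rough numerical illustration (not from the text), the properties of the hat matrix can be checked directly in Python with NumPy; the small design matrix below is made up purely for this sketch.

    import numpy as np

    # Made-up data: intercept column plus two regressors (n = 5, p = 3)
    X = np.array([[1.0, 2.0, 50.0],
                  [1.0, 3.0, 40.0],
                  [1.0, 5.0, 65.0],
                  [1.0, 7.0, 30.0],
                  [1.0, 9.0, 80.0]])
    y = np.array([12.0, 14.0, 20.0, 21.0, 30.0])

    H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix H = X (X'X)^(-1) X'
    y_hat = H @ y                                 # fitted values
    e = (np.eye(len(y)) - H) @ y                  # residuals e = (I - H) y

    print(np.allclose(H, H.T), np.allclose(H @ H, H))   # H is symmetric and idempotent
    print(np.trace(H), np.trace(np.eye(len(y)) - H))    # tr(H) = p = 3, tr(I - H) = n - p = 2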

Like a simple linear regression model, a multiple linear regression model is based
on certain assumptions. The major assumptions for the multiple regression model are:
1) The probability distribution of the error has a mean of zero.
2) The errors are independent. In addition, these errors are normally distributed and have a constant standard deviation.
3) The independent variables are not linearly related.
4) There is no linear association between the error and each independent variable.

Example: The Delivery Time Data

A soft drink bottler is interested in predicting the amount of time required by a route
driver to service the vending machines in an outlet. The industrial engineer has
suggested that the two most important variables affecting the delivery time (y) are the
number of cases of product stocked (x₁) and the distance walked by the route driver
(x₂). The multiple linear regression model to be fitted is

y = β₀ + β₁x₁ + β₂x₂ + ε
Obs.   Delivery Time (min), y   No. of Cases, x1   Distance (feet), x2      Obs.   Delivery Time (min), y   No. of Cases, x1   Distance (feet), x2
1 16.68 7 560 14 19.75 6 462
2 11.50 3 220 15 24.00 9 448
3 12.03 3 340 16 29.00 10 776
4 14.88 4 80 17 15.35 6 200
5 13.75 6 150 18 19.00 7 132
6 18.11 7 330 19 9.50 3 36
7 8.00 2 110 20 35.10 17 770
8 17.83 7 210 21 17.90 10 140
9 79.24 30 1460 22 52.32 26 810
10 21.50 5 605 23 18.75 9 450
11 40.33 16 688 24 19.83 8 635
12 21.00 10 215 25 10.75 4 150
13 13.50 4 255

In matrix form,

    X = [ 1    7   560 ]        y = [ 16.68 ]
        [ 1    3   220 ]            [ 11.50 ]
        [ 1    3   340 ]            [ 12.03 ]
        [ ⋮    ⋮     ⋮ ]            [   ⋮   ]
        [ 1    8   635 ]            [ 19.83 ]
        [ 1    4   150 ]            [ 10.75 ]
The XX matrix is
1 7 560 
 1 1 1  1 3 220   25 219 10232 
XX =  7 3 4    =  219 3,055 133899 
     
560 220 150     10232 133899 6725688 
1 4 150 
and the Xy vector is
16.68
 1 1 1  11.50   559.60 

XX = 7 3 4    =  7375.44 
     
560 220 150    
 337072.00 
10.75 

The least squares estimator of β is β̂ = (X′X)⁻¹X′y, or

    [ β̂₀ ]   [    25      219     10232 ]⁻¹ [    559.60 ]
    [ β̂₁ ] = [   219     3055    133899 ]   [   7375.44 ]
    [ β̂₂ ]   [ 10232   133899   6725688 ]   [ 337072.00 ]

             [  0.11321528   −0.00444859   −0.00008367 ] [    559.60 ]   [ 2.34123115 ]
           = [ −0.00444859    0.00274378   −0.00004786 ] [   7375.44 ] = [ 1.61590712 ]
             [ −0.00008367   −0.00004786    0.00000123 ] [ 337072.00 ]   [ 0.01438483 ]

The least squares fit (with the regression coefficients reported to five decimal places)
is

ŷ = 2.34123 + 1.61591x₁ + 0.01438x₂
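As a cross-check (not part of the original text), this fit can be reproduced with NumPy from the delivery time data listed in the table above; the variable names below are our own.

    import numpy as np

    # Delivery time data (obs. 1-25): y = delivery time (min), x1 = cases, x2 = distance (feet)
    y  = np.array([16.68, 11.50, 12.03, 14.88, 13.75, 18.11,  8.00, 17.83, 79.24, 21.50,
                   40.33, 21.00, 13.50, 19.75, 24.00, 29.00, 15.35, 19.00,  9.50, 35.10,
                   17.90, 52.32, 18.75, 19.83, 10.75])
    x1 = np.array([ 7,  3,  3,  4,  6,  7,  2,  7, 30,  5, 16, 10,  4,  6,  9, 10,  6,
                    7,  3, 17, 10, 26,  9,  8,  4], dtype=float)
    x2 = np.array([560, 220, 340,  80, 150, 330, 110, 210, 1460, 605, 688, 215, 255,
                   462, 448, 776, 200, 132,  36, 770, 140, 810, 450, 635, 150], dtype=float)

    X = np.column_stack([np.ones_like(y), x1, x2])   # 25 x 3 design matrix with intercept

    XtX = X.T @ X                                    # X'X
    Xty = X.T @ y                                    # X'y
    beta_hat = np.linalg.solve(XtX, Xty)             # solve the normal equations X'X b = X'y

    print(beta_hat)   # approximately [2.34123, 1.61591, 0.01438]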

2.1.2 PROPERTIES OF THE LEAST SQUARES ESTIMATORS


The variance properties of β̂ are expressed by the covariance matrix

Cov(β̂) = E[ (β̂ − E(β̂)) (β̂ − E(β̂))′ ]

which is a (p × p) symmetric matrix whose jth diagonal element is the variance of
β̂ⱼ and whose (i, j)th off-diagonal element is the covariance between β̂ᵢ and β̂ⱼ. It
can be proven that the covariance matrix of β̂ is given by

Cov(β̂) = σ²(X′X)⁻¹

Therefore, if we let C = (X′X)⁻¹, the variance of β̂ⱼ is σ²Cⱼⱼ and the covariance
between β̂ᵢ and β̂ⱼ is σ²Cᵢⱼ.

2.1.3 ESTIMATION OF σ²
As in simple linear regression, an estimator of σ² may be developed from the
residual sum of squares

SSE = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ eᵢ² = e′e

Substituting e = y − Xβ̂ gives

SSE = (y − Xβ̂)′(y − Xβ̂)
    = y′y − β̂′X′y − y′Xβ̂ + β̂′X′Xβ̂
    = y′y − 2β̂′X′y + β̂′X′Xβ̂

Since X′Xβ̂ = X′y, this last equation becomes

SSE = y′y − β̂′X′y

It can be shown that the residual sum of squares has (n − p) degrees of freedom
associated with it, since p parameters are estimated in the regression model. The
residual mean square provides the estimate

σ̂² = MSE = SSE / (n − p)

Example: Delivery Time Data

Estimate the error variance σ² for the multiple regression model fit to the soft drink
delivery time data. Note that

y′y = Σᵢ yᵢ² = 18310.6290

β̂′X′y = [2.34123115   1.61590712   0.01438483] [559.60   7375.44   337072.00]′ = 18076.9030

The residual sum of squares is

SSE = y′y − β̂′X′y = 18310.6290 − 18076.9030 = 233.7260

Therefore, the estimate of σ² is the residual mean square

σ̂² = SSE / (n − p) = 233.7260 / (25 − 3) = 10.6239
Recall that the estimate of σ² is model dependent. The estimate of σ² for a simple
regression model involving only one independent variable, cases (x₁), is 17.4841,
which is considerably larger than the estimate for the two-variable multiple
regression model. Which estimate is correct? Both estimates are in a sense correct,
but they depend heavily on the choice of model. Which model is correct or better? Since
σ² is the variance of the errors, it is usually preferable to have a model with a smaller
residual mean square.
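A minimal sketch (ours, not the text's) of this calculation in Python, using only the summary quantities quoted above:

    n, p = 25, 3                 # 25 observations, p = k + 1 = 3 parameters
    yty = 18310.6290             # y'y
    bXty = 18076.9030            # beta_hat' X'y

    SSE = yty - bXty             # residual sum of squares
    MSE = SSE / (n - p)          # residual mean square, the estimate of sigma^2
    print(SSE, MSE)              # about 233.726 and 10.624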

2.2 HYPOTHESIS TESTING IN MULTIPLE LINEAR REGRESSION


Several hypothesis testing procedures prove useful for addressing the following questions:
1. What is the overall adequacy of the model?
2. Which specific independent variables seem important?
The formal tests require that the random errors be independent and follow a normal
distribution with mean E(εᵢ) = 0 and variance Var(εᵢ) = σ².

2.2.1 TEST FOR SIGNIFICANCE OF REGRESSION
The test for significance of regression is a test to determine if there is a linear
relationship between the response y and any of the independent variables
x1, x2 , ..., xk . This procedure is often thought of as an overall or global test of
model adequacy. The appropriate hypotheses are

H0: 1 =  2 = ... =  k = 0 vs. H1:  k  0 for at least one j

Rejection of this null hypothesis implies that at least one of the independent variables
x1, x2 , ..., xk contributes significantly to the model.

The test procedure is a generalization of the analysis of variance used in simple linear
regression. The total sum of squares SST is partitioned into a sum of squares due to
regression, SSR, and a residual sum of squares, SSE. Thus,

SST = SSR + SSE

It can be shown that if the null hypothesis is true, then SSR/σ² follows a χ² distribution
with k degrees of freedom, that is, with as many degrees of freedom as there are independent
variables in the model. It can also be shown that SSE/σ² follows a χ² distribution with
n − k − 1 degrees of freedom and that SSE and SSR are independent. The F statistic is given by

F = (SSR / k) / (SSE / (n − k − 1)) = MSR / MSE

Therefore, to test the hypothesis H₀: β₁ = β₂ = ... = βₖ = 0, compute the test statistic
F and reject H₀ if F > F_{α, k, n−k−1}.

A computational formula for SSE is found by starting with

SSE = y′y − β̂′X′y

and since

SST = Σᵢ yᵢ² − (Σᵢ yᵢ)²/n = y′y − (Σᵢ yᵢ)²/n

the above equation may be written as

SSE = [ y′y − (Σᵢ yᵢ)²/n ] − [ β̂′X′y − (Σᵢ yᵢ)²/n ] = SST − SSR

Therefore, the regression sum of squares is given as

SSR = β̂′X′y − (Σᵢ yᵢ)²/n
The test procedure is usually summarized in an analysis of variance (ANOVA) table as below.

Table: ANOVA for Significance of Regression in Multiple Regression
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Regression            SSR              k                    MSR           MSR/MSE
Residual              SSE              n − k − 1            MSE
Total                 SST              n − 1

Example: The Delivery Time Data

Test the significance of the multiple regression model.

SST = y′y − (Σᵢ yᵢ)²/n = 18310.6290 − (559.60)²/25 = 5784.5426

SSR = β̂′X′y − (Σᵢ yᵢ)²/n = 18076.9030 − (559.60)²/25 = 5550.8166

SSE = SST − SSR = y′y − β̂′X′y = 233.7260

ANOVA: Test for Significance of Regression for Delivery Time

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F        p-value
Regression            5550.8166        2                    2775.4083     261.24   4.7 × 10⁻¹⁶
Residual              233.7260         22                   10.6239
Total                 5784.5426        24

The analysis of variance is shown in the table above. To test
H₀: β₁ = β₂ = ... = βₖ = 0, the F statistic is

F = MSR / MSE = 2775.4083 / 10.6239 = 261.24

Since the p-value is extremely small, it can be concluded that delivery time is
linearly related to delivery volume and/or distance. However, this does not
necessarily imply that the relationship found is an appropriate one for predicting
delivery time as a function of volume and distance. Further tests of model adequacy
are required.
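As an illustrative check (not from the text), the F statistic and its p-value can be recomputed from the ANOVA quantities with SciPy; this assumes SciPy is available.

    from scipy import stats

    SSR, SSE = 5550.8166, 233.7260
    k, n = 2, 25                           # two regressors, 25 observations

    MSR = SSR / k
    MSE = SSE / (n - k - 1)
    F = MSR / MSE                          # about 261.24

    p_value = stats.f.sf(F, k, n - k - 1)  # upper-tail area of an F(2, 22) distribution
    F_crit = stats.f.ppf(0.95, k, n - k - 1)

    print(F, p_value, F > F_crit)          # p-value is extremely small, so H0 is rejected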

2.2.2 TEST ON INDIVIDUAL REGRESSION COEFFICIENTS


Once the F-test has determined that at least one of the independent variables is
important, a logical question becomes which one(s) is (are) the significant factor(s)
affecting the dependent variable.
The hypotheses for testing the significance of any individual regression coefficient
are

H₀: βⱼ = 0   vs.   H₁: βⱼ ≠ 0

If H₀: βⱼ = 0 is not rejected, this indicates that the independent variable xⱼ does not
significantly affect the dependent variable and perhaps can be excluded from
the model. The test statistic for this hypothesis is

t = β̂ⱼ / se(β̂ⱼ) = β̂ⱼ / √(σ̂²Cⱼⱼ)

where Cⱼⱼ is the diagonal element of (X′X)⁻¹ corresponding to β̂ⱼ. The null
hypothesis H₀: βⱼ = 0 is rejected if |t| > t_{α/2, n−k−1}. Note that this is really a partial or
marginal test, because the regression coefficient β̂ⱼ depends on all of the other
independent variables xᵢ (i ≠ j) that are in the model. Thus, this is a test of the
contribution of xⱼ given the other independent variables in the model.

Example: The Delivery Time Data

The hypotheses are H₀: β₁ = 0 vs. H₁: β₁ ≠ 0 and H₀: β₂ = 0 vs. H₁: β₂ ≠ 0.

The main diagonal elements of (X′X)⁻¹ corresponding to β₁ and β₂ are
C₁₁ = 0.00274378 and C₂₂ = 0.00000123 respectively, so the corresponding t
statistics are

t₁ = β̂₁ / √(σ̂²C₁₁) = 1.61590712 / √((10.6239)(0.00274378)) = 9.46

t₂ = β̂₂ / √(σ̂²C₂₂) = 0.01438483 / √((10.6239)(0.00000123)) = 3.98

Since the critical value is t_{0.025, 22} = 2.074, both H₀: β₁ = 0 and H₀: β₂ = 0 are rejected,
and we conclude that both x₁ and x₂ contribute significantly to the model; that is, both
CASES and DISTANCE significantly affect the delivery TIME.
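A brief sketch (ours, not the text's) of these two t tests with SciPy, using the quantities reported above:

    from math import sqrt
    from scipy import stats

    mse = 10.6239                          # residual mean square (estimate of sigma^2)
    beta_hat = {"cases": 1.61590712, "distance": 0.01438483}
    C_jj     = {"cases": 0.00274378, "distance": 0.00000123}   # diagonal of (X'X)^(-1)
    df = 22                                # n - k - 1 = 25 - 2 - 1

    t_crit = stats.t.ppf(0.975, df)        # about 2.074
    for name in beta_hat:
        se = sqrt(mse * C_jj[name])        # standard error of the coefficient
        t = beta_hat[name] / se
        p = 2 * stats.t.sf(abs(t), df)     # two-sided p-value
        print(name, round(t, 2), round(p, 6), abs(t) > t_crit)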

2.2.3 CONFIDENCE INTERVAL ON REGRESSION COEFFICIENTS
Confidence intervals on individual regression coefficients and confidence intervals
on the mean response at specific levels of the independent variables play the same
important role in multiple regression that they do in simple linear regression.

To construct confidence intervals for the regression coefficients, we need the assumption
that the errors εᵢ are normally and independently distributed with mean zero and variance
σ². Therefore, the observations yᵢ are normally and independently distributed with mean
β₀ + Σⱼ βⱼxᵢⱼ (the sum running from j = 1 to k) and variance σ². Since the least squares
estimator β̂ is a linear combination of the observations, it follows that β̂ is normally
distributed with mean vector β and covariance matrix σ²(X′X)⁻¹. This implies that
the marginal distribution of any regression coefficient β̂ⱼ is normal with mean βⱼ
and variance σ²Cⱼⱼ, where Cⱼⱼ is the jth diagonal element of the (X′X)⁻¹ matrix.
Consequently, each of the statistics

(β̂ⱼ − βⱼ) / √(σ̂²Cⱼⱼ),   j = 1, 2, ..., k

is distributed as t with (n − p) degrees of freedom, where σ̂² is the estimate of the
error variance.

The 100(1 − α)% confidence interval for the regression coefficient βⱼ, j = 1, 2, ..., k,
is

β̂ⱼ − t_{α/2, n−p} √(σ̂²Cⱼⱼ)  ≤  βⱼ  ≤  β̂ⱼ + t_{α/2, n−p} √(σ̂²Cⱼⱼ)

Example: The Delivery Time Data

Given √(σ̂²C₁₁) = √((10.6239)(0.00274378)) = 0.17073 and √(σ̂²C₂₂) = √((10.6239)(0.00000123)) = 0.00361,
the 95% confidence intervals for the parameters β₁ and β₂ are calculated as follows:

β̂₁ − t_{0.025, 22} √(σ̂²C₁₁)  ≤  β₁  ≤  β̂₁ + t_{0.025, 22} √(σ̂²C₁₁)
1.61591 − (2.074)(0.17073)  ≤  β₁  ≤  1.61591 + (2.074)(0.17073)
1.26181  ≤  β₁  ≤  1.97001

β̂₂ − t_{0.025, 22} √(σ̂²C₂₂)  ≤  β₂  ≤  β̂₂ + t_{0.025, 22} √(σ̂²C₂₂)
0.01438 − (2.074)(0.00361)  ≤  β₂  ≤  0.01438 + (2.074)(0.00361)
0.00688  ≤  β₂  ≤  0.02188

Notice that in each case the value 0 falls outside the confidence interval, supporting
the rejection of the null hypothesis in the corresponding test on the individual coefficient.
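A short sketch (ours) of the same intervals in Python with SciPy; the inputs are the quantities quoted above.

    from math import sqrt
    from scipy import stats

    mse, df = 10.6239, 22
    t_crit = stats.t.ppf(0.975, df)                    # about 2.074 for a 95% interval

    coefs = {"beta1 (cases)":    (1.61590712, 0.00274378),
             "beta2 (distance)": (0.01438483, 0.00000123)}

    for name, (b, Cjj) in coefs.items():
        half_width = t_crit * sqrt(mse * Cjj)          # t * sqrt(sigma^2 hat * Cjj)
        print(name, round(b - half_width, 5), round(b + half_width, 5))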

2.3 MULTIPLE COEFFICIENT OF DETERMINATION, R², AND ITS ADJUSTED VERSION, R̄²
In the two-variable case, R² measures the goodness of fit of the regression model; that
is, it gives the proportion or percentage of the total variation in the dependent
variable y that can be explained by the single independent variable x. This notion
of R² extends easily to regression models containing more than two
variables. Thus, in a three-variable model, R², known as the multiple coefficient of
determination, measures the proportion of the variation in y that can be explained by
the independent variables x₁ and x₂. The fit of the model is said to be better the
closer R² is to 1.

An important property of R² is that it is a nondecreasing function of the number of
independent (explanatory) variables present in the model. As the number of
independent variables increases, R² almost inevitably increases and never decreases;
that is, an additional x (independent variable) will not decrease R². Recall that

R² = SSR / SST = Σᵢ (ŷᵢ − ȳ)² / Σᵢ (yᵢ − ȳ)² = 1 − SSE / SST = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²

Although Σᵢ (yᵢ − ȳ)² is independent of the number of x variables in the model, the
SSE, Σᵢ (yᵢ − ŷᵢ)², depends on the number of independent variables in the model. It is
clear that as the number of x variables increases, Σᵢ (yᵢ − ŷᵢ)² is likely to decrease and
hence R² will increase. As such, in comparing two regression models with
different numbers of independent variables, one should be wary of choosing the
model with the highest R².

To compare two R² values, one must take into account the number of independent
variables present in each model. This can be done by considering an
alternative coefficient of determination, known as the adjusted R² (denoted R̄²),
defined as follows:

R̄² = 1 − (SSE / (n − k − 1)) / (SST / (n − 1)) = 1 − (1 − R²)(n − 1) / (n − k − 1)

The term adjusted refers to the adjustment made for the degrees of freedom
associated with the sums of squares SSE and SST. It should be clear from the
equation above that, for k ≥ 1,

i) R̄² ≤ R², which implies that as the number of independent variables increases, R̄² increases by less than the unadjusted R² does;
ii) R̄² can be negative; if R̄² is negative in a practical application, its value is taken as zero;
iii) the plot of R̄² against k has a turning point.

The main attraction of R̄² is that it imposes a penalty for adding extra
independent variables to the model. As such, when comparing two models with
different (moderate) numbers of variables, R̄² should be used rather than R².
An interesting algebraic fact is the following: if we add an extra independent variable
to the regression model, R̄² increases if and only if the t-statistic of the new
variable is greater than one in absolute value.

Example: Delivery Time Data

ANOVA Table
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Regression            5550.8166        2                    2775.4083     261.24
Residual              233.7260         22                   10.6239
Total                 5784.5426        24

R² = SSR / SST = 5550.82 / 5784.54 = 0.9596, or 95.96%

R̄² = 1 − (SSE / (n − k − 1)) / (SST / (n − 1)) = 1 − (233.73 / 22) / (5784.54 / 24) = 0.9559, or 95.59%
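A minimal sketch (ours) of the R² and adjusted R̄² calculations from the ANOVA sums of squares:

    SSR, SSE, SST = 5550.8166, 233.7260, 5784.5426
    n, k = 25, 2

    R2 = SSR / SST                                      # about 0.9596
    R2_adj = 1 - (SSE / (n - k - 1)) / (SST / (n - 1))  # about 0.9559
    print(round(R2, 4), round(R2_adj, 4))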

2.4 MEAN RESPONSE AND PREDICTION OF NEW OBSERVATION


Think of the following two scenarios.
Suppose the population regression model relating food expenditure to monthly income and
the number of children in a household is

y_food = β₀ + β₁x₁ + β₂x₂ + ε

where x₁ is monthly income and x₂ is the number of children. Since E(ε) = 0, the point
estimate of the mean food expenditure for all households with a monthly income of RM5.5
thousand and 3 children is the estimated value of E(y_food | x₁ = 5.5, x₂ = 3) = β₀ + β₁(5.5) + β₂(3).

However, if we take a second sample of the same number of households from the
same population, the point estimate of E(y_food | x₁ = 5.5, x₂ = 3) is expected to be
different from that of the first sample. All possible samples of the same size taken
from the same population will give different point estimates. Therefore, a confidence
interval for E(y_food | x₁ = 5.5, x₂ = 3) will be a more reliable estimate than the point
estimate.
The second scenario is the same as above, but the aim is to predict the food expenditure
of one particular household with a monthly income of RM5.5 thousand and 3
children. The point estimate is the same as before, but the interval is not: it is wider
(see below) and is called a prediction interval.
The prediction interval for predicting a single value of y at a certain value of x is
always larger/wider than the confidence interval for estimating the mean value of y
at that same value of x.

Define the vector x₀ = [1, x₀₁, x₀₂, ..., x₀ₖ]′. The fitted value or mean
response at this point is ŷ_mr = x₀′β̂, with variance

Var(ŷ_mr) = σ² x₀′(X′X)⁻¹x₀

Therefore, a 100(1 − α) percent confidence interval on the mean response at the
point x₀ is given by

ŷ_mr − t_{α/2, n−p} √(σ̂² x₀′(X′X)⁻¹x₀)  ≤  E(y | x₀)  ≤  ŷ_mr + t_{α/2, n−p} √(σ̂² x₀′(X′X)⁻¹x₀)

Meanwhile, a point prediction of a new observation at the point x₀ = [1, x₀₁, x₀₂, ..., x₀ₖ]′
has a 100(1 − α) percent prediction interval given by

ŷ_npr − t_{α/2, n−p} √(σ̂² (1 + x₀′(X′X)⁻¹x₀))  ≤  y_npr  ≤  ŷ_npr + t_{α/2, n−p} √(σ̂² (1 + x₀′(X′X)⁻¹x₀))

The extra width comes from the additional σ² term inside the square root and is due to the
error in predicting a particular value, in contrast to estimating the mean
response for all members of the population.

Example: Delivery Time Data

Suppose the soft drink bottler would like to construct a 95% confidence interval
on the mean delivery time for an outlet requiring x₁ = 8 cases and where the distance is
x₂ = 275 feet. Therefore, x₀ = [1, 8, 275]′.

The fitted value at this point is

ŷ_mr = ŷ_npr = x₀′β̂ = [1   8   275] [2.34123   1.61591   0.01438]′ = 19.22 minutes

The variance of ŷ_mr is estimated by

σ̂² x₀′(X′X)⁻¹x₀ = 10.6239 [1   8   275] [  0.11321528   −0.00444859   −0.00008367 ] [   1 ]
                                         [ −0.00444859    0.00274378   −0.00004786 ] [   8 ]
                                         [ −0.00008367   −0.00004786    0.00000123 ] [ 275 ]
                = 10.6239 (0.05346) = 0.56794

Therefore, a 95% confidence interval on the mean delivery time at this point is
given as

19.22 − 2.074 √(10.6239 (0.05346))  ≤  E(y | x₀)  ≤  19.22 + 2.074 √(10.6239 (0.05346))
19.22 − 2.074 √0.56794  ≤  E(y | x₀)  ≤  19.22 + 2.074 √0.56794
17.66  ≤  E(y | x₀)  ≤  20.78

Meanwhile, a 95% prediction interval on a new delivery time observation at this point is
given as

19.22 − 2.074 √(10.6239 (1 + 0.05346))  ≤  y_npr  ≤  19.22 + 2.074 √(10.6239 (1 + 0.05346))
12.28  ≤  y_npr  ≤  26.16

It is very obvious that, in this case, the prediction interval for a new observation is
much wider than the confidence interval for the mean response.
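As an illustrative sketch (not from the text), both intervals can be computed with NumPy and SciPy from the quantities reported above; (X′X)⁻¹ is taken from the earlier example and the variable names are ours.

    import numpy as np
    from scipy import stats

    beta_hat = np.array([2.34123115, 1.61590712, 0.01438483])
    XtX_inv = np.array([[ 0.11321528, -0.00444859, -0.00008367],
                        [-0.00444859,  0.00274378, -0.00004786],
                        [-0.00008367, -0.00004786,  0.00000123]])
    mse, df = 10.6239, 22
    t_crit = stats.t.ppf(0.975, df)

    x0 = np.array([1.0, 8.0, 275.0])                 # intercept, 8 cases, 275 feet
    y_hat = x0 @ beta_hat                            # about 19.22 minutes

    se_mean = np.sqrt(mse * (x0 @ XtX_inv @ x0))     # std. error of the mean response
    se_pred = np.sqrt(mse * (1 + x0 @ XtX_inv @ x0)) # std. error for a new observation

    print(y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)   # about (17.66, 20.78)
    print(y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)   # about (12.28, 26.16)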

2.5 SOME CONSIDERATIONS IN THE USE OF REGRESSION


Regression analysis is widely used and, unfortunately, frequently misused. There
are several common abuses of regression that should be mentioned:
1. Regression models are intended as interpolation equations over the range of the
regressor variable(s) used to fit the model. As observed previously, we must be
careful if we extrapolate outside of this range.
2. The disposition of the x values plays an important role in the least-squares fit.
While all points have equal weight in determining the height of the line, the slope
is more strongly influenced by the remote values of x. For example, consider the
data in Figure 3.6. The slope in the least-squares fit depends heavily on either or
both of the points A and B. Furthermore, the remaining data would give a very
different estimate of the slope if A and B were deleted. Situations such as this
often require corrective action, such as further analysis and possible deletion of
the unusual points, estimation of the model parameters with some technique that
is less seriously influenced by these points than least-squares, or restricting the
model, possibly by introducing further regressors.

Figure 3.6 Two influential observations        Figure 3.7 A point remote in x-space

A somewhat different situation is illustrated in Figure 3.7, where one of the 18
observations is very remote in x-space. In this example, the slope is largely
determined by the extreme point. If this point is deleted, the slope estimate is
probably zero. Because of the gap between the two clusters of points, we really
have only two distinct units of information with which to fit the model. Thus, there are
effectively far fewer than the apparent 16 degrees of freedom for error. Situations such
as these seem to occur fairly often in practice. In general, we should be aware that in
some data sets one point (or a small cluster of points) may control key model
properties.

3. Outliers or bad values can seriously disturb the least-squares fit. For example,
consider the data in Figure 3.8. Observation A seems to be an “outlier” or “bad
value” because it falls far from the line implied by the rest of the data. If this point
is really an outlier, then the estimate of the intercept may be incorrect and the
residual mean square may be an inflated estimate of σ². On the other hand, the
data point may not be a bad value and may be a highly useful piece of evidence
concerning the process under investigation. Methods for detecting and dealing
with outliers are discussed more completely in Chapter 4.

Figure 3.8 An outlier
4. As mentioned in Chapter 1, just because a regression analysis has indicated a
   strong relationship between two variables, this does not imply that the variables
   are related in any causal sense. Causality implies necessary correlation, but
   regression analysis can only address correlation; it cannot address the issue of
   necessity. Thus, our expectations of discovering cause-and-effect relationships
   from regression should be modest.
As an example of a “nonsense” relationship between two variables, consider the
data in Table 3.7. This table presents the number of certified mental defectives in
the United Kingdom per 10,000 of estimated population (y), the number of radio
receiver licenses issued (x₁), and the first name of the President of the United
States (x₂) for the years 1924-1937. We can show that the regression equation
relating y to x₁ is

ŷ = 4.582 + 2.204x₁

The t-statistic for testing H₀: β₁ = 0 for this model is t₀ = 27.312 (thus a very small
p-value), and the coefficient of determination is R² = 0.9842. That is, 98.42% of
the variability in the data is explained by the number of radio receiver licenses
issued. Clearly this is a nonsense relationship, as it is highly unlikely that the
number of mental defectives in the population is functionally related to the
number of radio receiver licenses issued. The reason for this strong statistical
relationship is that y and x₁ are monotonically related (two sequences of numbers
are monotonically related if, as one sequence increases, the other always
either increases or decreases). In this example, y is increasing because diagnostic
procedures for mental disorders became more refined over the years
represented in the study, and x₁ is increasing because of the emergence and low-
cost availability of radio technology over those years.
Table 3.7
Columns: Year; Number of Certified Mental Defectives per 10,000 of Estimated Population in the U.K. (y); Number of Radio Receiver Licenses Issued (millions) in the U.K. (x1); First Name of the President of the U.S. (x2)
1924 8 1.350 Calvin
1925 8 1.960 Calvin
1926 9 2.270 Calvin
1927 10 2.483 Calvin
1928 11 2.730 Calvin
1929 11 3.091 Calvin
1930 12 3.647 Herbert
1931 16 4.620 Herbert
1932 18 5.497 Herbert
1933 19 6.260 Herbert
1934 20 7.012 Franklin
1935 21 7.618 Franklin
1936 22 8.131 Franklin
1937 23 8.593 Franklin
Source: Kendall and Yule [1950] and Tufte [1974].

Any two sequences of numbers that are monotonically related will exhibit similar
properties. To illustrate this further, suppose we regress y on the number of letters
in the first name of the U.S. President in the corresponding year. The model is

ŷ = −26.442 + 5.900x₂

with t₀ = 8.996 (thus a small p-value) and R² = 0.8709. Clearly this is a nonsense
relationship as well.

5. In some applications of regression the value of the regressor variable x required


to predict y is unknown. For example, consider predicting maximum daily load
on an electric power generation system from a regression model relating the load
to the maximum daily temperature. To predict tomorrow’s maximum load we
must first predict tomorrow’s maximum temperature. Consequently, the
prediction of maximum load is conditional on the temperature forecast. The
accuracy of the maximum load forecast depends on the accuracy of the
temperature forecast. This must be considered when evaluating model
performance.
6. When using multiple regression, occasionally we find an apparent contradiction
of intuition or theory when one or more of the regression coefficients seems to
have the wrong (unexpected) sign. For example, the problem situation may imply
that a particular regression coefficient should be positive, while the actual
estimate of the parameter is negative. This wrong sign problem can be
disconcerting, as it is usually difficult to explain a negative estimate of a
parameter to the model user when that user believes that the coefficient should
be positive. The regression coefficients may have the wrong sign for the
following reasons:
i) the range of some of the regressors is too small;
ii) important regressors have not been included in the model;
iii) multicollinearity is present;
iv) computational errors have been made.
