Chapter 2
The parameter β₀ is the intercept of the regression plane. If the range of the data includes x₁ = x₂ = x₃ = 0, then β₀ is the mean of y when x₁ = x₂ = x₃ = 0. Otherwise β₀ has no physical interpretation. The parameter β₁ indicates the expected change in the response variable y per unit change in x₁ when x₂ and x₃ are held constant. Similarly, β₂ measures the expected change in y per unit change in x₂ when x₁ and x₃ are held constant.
In general, the parameter βⱼ measures the expected change in y per unit change in xⱼ when all of the remaining independent variables xᵢ (i ≠ j) are held constant. For this reason the parameters βⱼ, j = 1, 2, …, k, are often called partial regression coefficients.
Consider the following estimated regression with two predictor variables, relating performance at university (UGPA) to performance at high school (HSGPA) and entrance test score (ENTS):
UGPA = 1.29 + 0.453 HSGPA + 0.094 ENTS
Since no one who attends university has either a zero high school GPA or a zero on the university entrance test, the intercept in this equation is not meaningful. Holding ENTS fixed, an additional point on HSGPA is associated with an additional 0.453 of a point on the university GPA. That is, if we choose two students A and B with the same ENTS score, but the high school GPA of student A is one point higher than that of student B, then we would expect student A to have a university GPA 0.453 points higher than that of student B.
Multiple linear regression models are often used as empirical models or approximating functions. That is, the true functional relationship between y and x₁, x₂, …, xₖ is unknown, but over certain ranges of the independent variables the linear regression model is an adequate approximation to the true unknown function. The variable ε is the error term, containing factors other than x₁, x₂, …, xₖ that affect y. The independent variables can themselves be functions of other variables, such as higher-order terms, interactions between variables, and coded/dummy variables. For example:
CMOX = β₀ + β₁TAR + β₂TAR² + β₃FIL + ε
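As a minimal sketch (with made-up values, not taken from the text), the following Python snippet shows how a design matrix containing a squared term and a 0/1 dummy variable could be assembled:

```python
import numpy as np

# Hypothetical values (for illustration only)
tar = np.array([10.0, 12.5, 15.0, 8.0, 11.0])   # a quantitative regressor (e.g. TAR)
fil = np.array([0, 1, 0, 1, 1])                  # a 0/1 dummy variable (e.g. FIL)

# Columns of the design matrix: intercept, TAR, TAR^2 (higher-order term), FIL (dummy)
X = np.column_stack([np.ones_like(tar), tar, tar**2, fil])
print(X)
```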
The method of ordinary least squares chooses the estimates β̂ⱼ to minimize the sum of squared residuals. That is, given n observations on y and the xⱼ, the estimates β̂ⱼ are chosen simultaneously to make the least squares function
\[
S(\beta_0, \beta_1, \ldots, \beta_k) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \right)^2
\]
as small as possible. The function S must be minimized with respect to β₀, β₁, …, βₖ. The least squares estimators of β₀, β₁, …, βₖ must satisfy
\[
\left.\frac{\partial S}{\partial \beta_0}\right|_{\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k} = -2 \sum_{i=1}^{n} \left( y_i - \hat\beta_0 - \sum_{j=1}^{k} \hat\beta_j x_{ij} \right) = 0
\]
and
\[
\left.\frac{\partial S}{\partial \beta_j}\right|_{\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k} = -2 \sum_{i=1}^{n} x_{ij} \left( y_i - \hat\beta_0 - \sum_{j=1}^{k} \hat\beta_j x_{ij} \right) = 0, \qquad j = 1, 2, \ldots, k
\]
Simplifying these equations yields the least squares normal equations:
\[
\begin{aligned}
n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_{i1} + \hat\beta_2 \sum_{i=1}^{n} x_{i2} + \cdots + \hat\beta_k \sum_{i=1}^{n} x_{ik} &= \sum_{i=1}^{n} y_i \\
\hat\beta_0 \sum_{i=1}^{n} x_{i1} + \hat\beta_1 \sum_{i=1}^{n} x_{i1}^2 + \hat\beta_2 \sum_{i=1}^{n} x_{i1}x_{i2} + \cdots + \hat\beta_k \sum_{i=1}^{n} x_{i1}x_{ik} &= \sum_{i=1}^{n} x_{i1}y_i \\
\hat\beta_0 \sum_{i=1}^{n} x_{i2} + \hat\beta_1 \sum_{i=1}^{n} x_{i1}x_{i2} + \hat\beta_2 \sum_{i=1}^{n} x_{i2}^2 + \cdots + \hat\beta_k \sum_{i=1}^{n} x_{i2}x_{ik} &= \sum_{i=1}^{n} x_{i2}y_i \\
&\;\;\vdots \\
\hat\beta_0 \sum_{i=1}^{n} x_{ik} + \hat\beta_1 \sum_{i=1}^{n} x_{i1}x_{ik} + \hat\beta_2 \sum_{i=1}^{n} x_{i2}x_{ik} + \cdots + \hat\beta_k \sum_{i=1}^{n} x_{ik}^2 &= \sum_{i=1}^{n} x_{ik}y_i
\end{aligned}
\]
Note that there are p = k + 1 normal equations, one for each of the unknown regression coefficients. The solution to the normal equations gives the least squares estimators β̂₀, β̂₁, …, β̂ₖ. It is more convenient to deal with multiple regression models if they are expressed in matrix notation. This allows a very compact display of the model, data, and results. In matrix notation, the regression model for k independent variables can be written as:
y = Xβ + ε
where
\[
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}, \quad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad
\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\]
In general, y is an (n × 1) vector of the observations, X is an (n × p) matrix of the independent variables, β is a (p × 1) vector of the regression coefficients, and ε is an (n × 1) vector of random errors.
The OLS estimation finds the vector of least squares estimators, β̂, that minimizes:
\[
S(\boldsymbol\beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)
\]
Minimizing S(β) leads to the normal equations in matrix form, X′Xβ̂ = X′y. To solve the normal equations, multiply both sides by the inverse of X′X. Thus, the least squares estimator of β is given by:
\[
\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
\]
provided that the inverse matrix (X′X)⁻¹ exists. The (X′X)⁻¹ matrix will always exist if the independent variables are linearly independent, that is, if no column of the X matrix is a linear combination of the other columns. Written out in full, the normal equations are:
\[
\begin{bmatrix}
n & \sum_{i=1}^{n} x_{i1} & \sum_{i=1}^{n} x_{i2} & \cdots & \sum_{i=1}^{n} x_{ik} \\
\sum_{i=1}^{n} x_{i1} & \sum_{i=1}^{n} x_{i1}^2 & \sum_{i=1}^{n} x_{i1}x_{i2} & \cdots & \sum_{i=1}^{n} x_{i1}x_{ik} \\
\sum_{i=1}^{n} x_{i2} & \sum_{i=1}^{n} x_{i2}x_{i1} & \sum_{i=1}^{n} x_{i2}^2 & \cdots & \sum_{i=1}^{n} x_{i2}x_{ik} \\
\vdots & \vdots & \vdots & & \vdots \\
\sum_{i=1}^{n} x_{ik} & \sum_{i=1}^{n} x_{ik}x_{i1} & \sum_{i=1}^{n} x_{ik}x_{i2} & \cdots & \sum_{i=1}^{n} x_{ik}^2
\end{bmatrix}
\begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \\ \vdots \\ \hat\beta_k \end{bmatrix}
=
\begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_{i1}y_i \\ \sum_{i=1}^{n} x_{i2}y_i \\ \vdots \\ \sum_{i=1}^{n} x_{ik}y_i \end{bmatrix}
\]
For even moderately sized n and k, solving the normal equations by hand is tedious. Fortunately, for practical purposes, modern statistical software can solve these equations, even for large n and k, in a fraction of a second.
The vector of fitted values ŷᵢ corresponding to the observed values yᵢ is given by:
\[
\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol\beta} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y}
\]
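A small numerical sketch of these formulas, using synthetic data (not the textbook example; all variable names here are illustrative), is given below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n observations, k regressors plus an intercept column
n, k = 8, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # n x p design matrix, p = k + 1
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Least squares estimator beta_hat = (X'X)^(-1) X'y
# (np.linalg.solve is preferred to forming the inverse explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^(-1) X' and fitted values y_hat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

print(beta_hat)                              # close to beta_true
print(np.allclose(y_hat, X @ beta_hat))      # both ways of computing fitted values agree
```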
Like a simple linear regression model, a multiple linear regression model is based on certain assumptions. The major assumptions for the multiple regression model are:
1) The probability distribution of the error has a mean of zero.
2) The errors are independent. In addition, these errors are normally distributed and have a constant standard deviation.
3) The independent variables are not linearly related.
4) There is no linear association between the error and each independent variable.
Consider the delivery time data with n = 25 observations, where x₁ is the delivery volume, x₂ is the delivery distance, and y is the delivery time. The X matrix (columns: 1, x₁, x₂) and the y vector are:

X =  1   7   560        y =  16.68
     1   3   220             11.50
     1   3   340             12.03
     1   4    80             14.88
     1   6   150             13.75
     1   7   330             18.11
     1   2   110              8.00
     1   7   210             17.83
     1  30  1460             79.24
     1   5   605             21.50
     1  16   688             40.33
     1  10   215             21.00
     1   4   255             13.50
     1   6   462             19.75
     1   9   448             24.00
     1  10   776             29.00
     1   6   200             15.35
     1   7   132             19.00
     1   3    36              9.50
     1  17   770             35.10
     1  10   140             17.90
     1  26   810             52.32
     1   9   450             18.75
     1   8   635             19.83
     1   4   150             10.75
The X′X matrix is
\[
\mathbf{X}'\mathbf{X} =
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ 7 & 3 & \cdots & 4 \\ 560 & 220 & \cdots & 150 \end{bmatrix}
\begin{bmatrix} 1 & 7 & 560 \\ 1 & 3 & 220 \\ \vdots & \vdots & \vdots \\ 1 & 4 & 150 \end{bmatrix}
=
\begin{bmatrix} 25 & 219 & 10232 \\ 219 & 3055 & 133899 \\ 10232 & 133899 & 6725688 \end{bmatrix}
\]
and the X′y vector is
\[
\mathbf{X}'\mathbf{y} =
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ 7 & 3 & \cdots & 4 \\ 560 & 220 & \cdots & 150 \end{bmatrix}
\begin{bmatrix} 16.68 \\ 11.50 \\ \vdots \\ 10.75 \end{bmatrix}
=
\begin{bmatrix} 559.60 \\ 7375.44 \\ 337072.00 \end{bmatrix}
\]
The least squares estimator of β is β̂ = (X′X)⁻¹X′y, or
\[
\hat{\boldsymbol\beta} =
\begin{bmatrix} 25 & 219 & 10232 \\ 219 & 3055 & 133899 \\ 10232 & 133899 & 6725688 \end{bmatrix}^{-1}
\begin{bmatrix} 559.60 \\ 7375.44 \\ 337072.00 \end{bmatrix}
=
\begin{bmatrix} 0.11321528 & -0.00444859 & -0.00008367 \\ -0.00444859 & 0.00274378 & -0.00004786 \\ -0.00008367 & -0.00004786 & 0.00000123 \end{bmatrix}
\begin{bmatrix} 559.60 \\ 7375.44 \\ 337072.00 \end{bmatrix}
=
\begin{bmatrix} 2.34123115 \\ 1.61590712 \\ 0.01438483 \end{bmatrix}
\]
The least squares fit (with the regression coefficients rounded to three decimal places) is
ŷ = 2.341 + 1.616x₁ + 0.014x₂
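As a quick check, the same estimates can be reproduced by solving the normal equations numerically; the sketch below simply reuses the X′X matrix and X′y vector shown above:

```python
import numpy as np

# X'X and X'y as computed above for the delivery time data
XtX = np.array([[25.0,     219.0,      10232.0],
                [219.0,    3055.0,     133899.0],
                [10232.0,  133899.0,   6725688.0]])
Xty = np.array([559.60, 7375.44, 337072.00])

beta_hat = np.linalg.solve(XtX, Xty)
print(beta_hat)   # approximately [2.3412, 1.6159, 0.0144]
```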
The covariance matrix of the least squares estimator is
\[
\operatorname{Cov}(\hat{\boldsymbol\beta}) = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}
\]
2.1.3 ESTIMATION OF σ²
As in simple linear regression, an estimator of σ² may be developed from the residual sum of squares
\[
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 = \mathbf{e}'\mathbf{e}
\]
\[
\begin{aligned}
\mathrm{SSE} &= (\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}) \\
&= \mathbf{y}'\mathbf{y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} - \mathbf{y}'\mathbf{X}\hat{\boldsymbol\beta} + \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} \\
&= \mathbf{y}'\mathbf{y} - 2\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} + \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta}
\end{aligned}
\]
Since X′Xβ̂ = X′y, this last equation becomes SSE = y′y − β̂′X′y.
It can be shown that the residual sum of squares has (n − p) degrees of freedom associated with it, since p parameters are estimated in the regression model. The estimate of σ² is the residual mean square:
\[
\hat\sigma^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n - p}
\]
For the delivery time data,
\[
\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} =
\begin{bmatrix} 2.34123115 & 1.61590712 & 0.01438483 \end{bmatrix}
\begin{bmatrix} 559.60 \\ 7375.44 \\ 337072.00 \end{bmatrix}
= 18{,}076.90304
\]
The residual sum of squares is
\[
\mathrm{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} = 18310.6290 - 18076.9030 = 233.7260
\]
so the residual mean square is σ̂² = MSE = 233.7260/(25 − 3) = 10.6239.
2.2.1 TEST FOR SIGNIFICANCE OF REGRESSION
The test for significance of regression is a test to determine whether there is a linear relationship between the response y and any of the independent variables x₁, x₂, …, xₖ. This procedure is often thought of as an overall or global test of model adequacy. The appropriate hypotheses are
H₀: β₁ = β₂ = ⋯ = βₖ = 0   versus   H₁: βⱼ ≠ 0 for at least one j
Rejection of this null hypothesis implies that at least one of the independent variables
x1, x2 , ..., xk contributes significantly to the model.
The test procedure is a generalization of the analysis of variance used in simple linear
regression. The total sum of squares SST is partitioned into a sum of squares due to
regression, SSR , and a residual sum of squares, SSE . Thus,
SST = SSR + SSE
It can be shown that if the null hypothesis is true, then SSR/σ² follows a χ² distribution with k degrees of freedom, the same number as the number of independent variables in the model. It can also be shown that SSE/σ² follows a χ² distribution with n − k − 1 degrees of freedom, and that SSE and SSR are independent. The F statistic is given by:
\[
F = \frac{\mathrm{SSR}/k}{\mathrm{SSE}/(n - k - 1)} = \frac{\mathrm{MSR}}{\mathrm{MSE}}
\]
Therefore, to test the hypothesis H₀: β₁ = β₂ = ⋯ = βₖ = 0, compute the test statistic F and reject H₀ if F > F₍α, k, n−k−1₎.
The error sum of squares may be written as
\[
\mathrm{SSE} = \left[\mathbf{y}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right] - \left[\hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right]
\]
that is, SSE = SST − SSR.
Therefore, the regression sum of squares is given as:
\[
\mathrm{SSR} = \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}
\]
The test procedure is usually summarized in an analysis of variance (ANOVA) table as below.

Table: ANOVA for Significance of Regression in Multiple Regression

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F
Regression            SSR              k                    MSR           MSR/MSE
Residual              SSE              n − k − 1            MSE
Total                 SST              n − 1
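For the delivery time example, the ANOVA quantities can be assembled from the totals computed earlier in the chapter; a sketch (using scipy for the p-value) is:

```python
from scipy import stats

# Quantities taken from earlier in the chapter (delivery time data)
n, k   = 25, 2
yty    = 18310.6290      # y'y
bXty   = 18076.9030      # beta_hat' X'y
sum_y  = 559.60          # sum of the y_i

SST = yty - sum_y**2 / n
SSE = yty - bXty
SSR = SST - SSE

MSR = SSR / k
MSE = SSE / (n - k - 1)
F   = MSR / MSE
p_value = stats.f.sf(F, k, n - k - 1)   # upper-tail probability

print(SSR, SSE, MSE)   # roughly 5550.8, 233.7, 10.6
print(F, p_value)      # F is very large, so the p-value is extremely small
```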
Since the p-value is extremely small, it can be concluded that delivery time is linearly related to delivery volume and/or distance. However, this does not necessarily imply that the relationship found is an appropriate one for predicting delivery time as a function of volume and distance. Further tests of model adequacy are required.
2.2.3 CONFIDENCE INTERVAL ON REGRESSION COEFFICIENTS
Confidence intervals on individual regression coefficients and confidence intervals on the mean response at specific levels of the independent variables play the same important role in multiple regression that they do in simple linear regression.
To construct confidence intervals for the regression coefficients, we require the assumption that the errors εᵢ are normally and independently distributed with mean zero and variance σ². The observations yᵢ are then normally and independently distributed with mean β₀ + Σⱼ₌₁ᵏ βⱼxᵢⱼ and variance σ². Since the least squares estimator β̂ is a linear combination of the observations, it is also normally distributed, with mean vector β and covariance matrix σ²(X′X)⁻¹. Hence a 100(1 − α)% confidence interval for the coefficient βⱼ is
\[
\hat\beta_j - t_{\alpha/2,\, n-p}\sqrt{\hat\sigma^2 C_{jj}} \;\le\; \beta_j \;\le\; \hat\beta_j + t_{\alpha/2,\, n-p}\sqrt{\hat\sigma^2 C_{jj}}
\]
where Cⱼⱼ is the jth diagonal element of (X′X)⁻¹.
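A sketch of the 95% intervals for the delivery time coefficients, using β̂, the diagonal elements of (X′X)⁻¹, and the MSE reported earlier in this chapter, is given below:

```python
import numpy as np
from scipy import stats

# Values copied from earlier in the chapter (delivery time data)
beta_hat = np.array([2.34123115, 1.61590712, 0.01438483])
C_jj     = np.array([0.11321528, 0.00274378, 0.00000123])   # diagonal of (X'X)^(-1)
MSE, n, p = 10.6239, 25, 3

t_crit = stats.t.ppf(0.975, n - p)            # about 2.074
half_width = t_crit * np.sqrt(MSE * C_jj)

for b, hw in zip(beta_hat, half_width):
    print(f"{b - hw:10.5f}  to  {b + hw:10.5f}")
```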
Notice that for each case, the value zero falls outside the confidence interval, supporting the rejection of the null hypothesis when testing the individual coefficients.
The coefficient of determination R² measures the proportion of the variation in y explained by the regression:
\[
R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{\sum_{i=1}^{n}(\hat y_i - \bar y)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat y_i)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2}
\]
The error sum of squares, SSE = Σᵢ(yᵢ − ŷᵢ)², depends on the number of independent variables in the model. It is clear that as the number of x variables increases, Σᵢ(yᵢ − ŷᵢ)² is likely to decrease, and hence R² will increase. As such, in comparing two regression models with different numbers of independent variables, one should be wary of choosing the model with the highest R².
To compare two R² values, one must take into account the number of independent variables that are present in the model. This can be done by considering an alternative coefficient of determination, known as the "adjusted" R² or R̄², defined as follows:
\[
\bar R^2 = 1 - \frac{\mathrm{SSE}/(n - k - 1)}{\mathrm{SST}/(n - 1)} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}
\]
The term adjusted refers to the adjustment made for the degrees of freedom associated with the sums of squares SSE and SST. It should be clear from the equation above that for k ≥ 1,
i) R̄² ≤ R², which implies that as the number of independent variables increases, R̄² increases by less than the unadjusted R²;
ii) R̄² can be negative. If R̄² is negative in a practical application, its value is taken as zero;
iii) the plot of R̄² against k has a turning point.
A short computational sketch of R² and R̄² is given after this list.
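A minimal sketch, using the SSE and SST values obtainable from the delivery time computations earlier in the chapter:

```python
def r_squared(sse: float, sst: float, n: int, k: int) -> tuple[float, float]:
    """Return (R^2, adjusted R^2) from the sums of squares."""
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, r2_adj

# Delivery time data: SSE = 233.726 and SST = 18310.629 - 559.6**2/25 = 5784.5426
print(r_squared(233.726, 5784.5426, 25, 2))   # roughly (0.960, 0.956)
```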
However, if we take a second sample of the same number of households from the same population, the point estimate of E(y_food | x₁ = 5.5, x₂ = 3) is expected to be different from that of the first sample. All possible samples of the same size taken from the same population will give different point estimates. Therefore, a confidence interval for E(y_food | x₁ = 5.5, x₂ = 3) will be a more reliable estimate than the point estimate.
The second scenario is the same as above, but the aim is to predict the food expenditure for one particular household with a monthly income of RM5.5 thousand and 3 children. The point estimate is the same as before, but the interval is not: it is wider (see below) and is called a prediction interval. The prediction interval for predicting a single value of y at a given value of x is always wider than the confidence interval for estimating the mean value of y at that value of x.
\[
\hat y_{npr} - t_{\alpha/2,\, n-p}\sqrt{\hat\sigma^2\left(1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\right)} \;\le\; y_{npr} \;\le\; \hat y_{npr} + t_{\alpha/2,\, n-p}\sqrt{\hat\sigma^2\left(1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\right)}
\]
The extra width of this interval is due to the error in predicting a particular value, in contrast to the error of zero in predicting the mean response for all members of the population.
The variance of ŷ_mr is estimated by σ̂² x₀′(X′X)⁻¹x₀.
Therefore, a 95% confidence interval on the mean delivery time at this point is given as
\[
19.22 - 2.074\sqrt{10.6239\,(0.05346)} \;\le\; y_{mr} \;\le\; 19.22 + 2.074\sqrt{10.6239\,(0.05346)}
\]
\[
19.22 - 2.074\sqrt{0.56794} \;\le\; y_{mr} \;\le\; 19.22 + 2.074\sqrt{0.56794}
\]
\[
17.66 \;\le\; y_{mr} \;\le\; 20.78
\]
Meanwhile, a 95% prediction interval on the delivery time for a new observation at this point is given as
\[
19.22 - 2.074\sqrt{10.6239\,(1 + 0.05346)} \;\le\; y_{npr} \;\le\; 19.22 + 2.074\sqrt{10.6239\,(1 + 0.05346)}
\]
\[
12.28 \;\le\; y_{npr} \;\le\; 26.16
\]
It is very obvious that, in this case, the prediction interval for a new observation is much wider than the confidence interval for the mean response.
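Both intervals can be reproduced directly from the quantities in the worked example above (a sketch):

```python
import math

# From the worked example: fitted value, MSE, x0'(X'X)^(-1) x0, and t critical value
y_hat, MSE, lev, t_crit = 19.22, 10.6239, 0.05346, 2.074

ci_half = t_crit * math.sqrt(MSE * lev)         # mean response (confidence interval)
pi_half = t_crit * math.sqrt(MSE * (1 + lev))   # new observation (prediction interval)

print(y_hat - ci_half, y_hat + ci_half)   # about 17.66 to 20.78
print(y_hat - pi_half, y_hat + pi_half)   # about 12.28 to 26.16
```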
Figure 3.6 Two influential observations. Figure 3.7 A point remote in x-space.
3. Outliers or bad values can seriously disturb the least-squares fit. For example,
consider the data in Figure 3.8. Observation A seems to be an “outlier” or “bad
value” because it falls far from the line implied by the rest of the data. If this point
is really an outlier, then the estimate of the intercept may be incorrect and the
residual mean square may be an inflated estimate of σ 2 . On the other hand, the
data point may not be a bad value and may be a highly useful piece of evidence
concerning the process under investigation. Methods for detecting and dealing
with outliers are discussed more completely in Chapter 4.
Figure 3.8 An outlier
4. As mentioned in Chapter 1, just because a regression analysis has indicated a strong relationship between two variables, this does not imply that the variables are related in any causal sense. Causality implies correlation, but correlation alone cannot establish causality; regression can demonstrate association, but it cannot address the issue of necessity. Thus, our expectations of discovering cause-and-effect relationships from regression should be modest.
As an example of a “nonsense” relationship between two variables, consider the
data in Table 3.7. This table presents the number of certified mental defectives in
the United Kingdom per 10,000 of estimated population (y), the number of radio
receiver licenses issued ( x1 ), and the first name of the President of the United
States ( x2 ) for the years 1924-1937. We can show that the regression equation
relating y to x1 is
ŷ = 4.582 + 2.204x₁
The t-statistic for testing H₀: β₁ = 0 for this model is t₀ = 27.312 (hence a very small p-value), and the coefficient of determination is R² = 0.9842. That is, 98.42% of
the variability in the data is explained by the number of radio receiver licenses
issued. Clearly this is a nonsense relationship, as it is highly unlikely that the
number of mental defectives in the population is functionally related to the
number of radio receiver licenses issued. The reason for this strong statistical relationship is that y and x₁ are monotonically related (two sequences of numbers are monotonically related if, as one sequence increases, the other always either increases or always decreases). In this example, y is increasing because diagnostic procedures for mental disorders became more refined over the years represented in the study, and x₁ is increasing because of the emergence and low-cost availability of radio technology over those years.
Table 3.7

Year   Number of Certified Mental Defectives per 10,000 of Estimated Population in U.K. (y)   Number of Radio Receiver Licenses Issued (millions) in the U.K. (x₁)   First Name of President of the U.S. (x₂)
1924 8 1.350 Calvin
1925 8 1.960 Calvin
1926 9 2.270 Calvin
1927 10 2.483 Calvin
1928 11 2.730 Calvin
1929 11 3.091 Calvin
1930 12 3.647 Herbert
1931 16 4.620 Herbert
1932 18 5.497 Herbert
1933 19 6.260 Herbert
1934 20 7.012 Franklin
1935 21 7.618 Franklin
1936 22 8.131 Franklin
1937 23 8.593 Franklin
Source: Kendall and Yule [1950] and Tufte [1974].
Any two sequences of numbers that are monotonically related will exhibit similar
properties. To illustrate this further, suppose we regress y on the number of letters
in the first name of the U.S. President in the corresponding year. The model is
ŷ = −26.442 + 5.900x₂
with t₀ = 8.996 (hence a small p-value) and R² = 0.8709. Clearly this is a nonsense
relationship as well.
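Both fits can be reproduced from the data in Table 3.7; a sketch for the regression of y on x₁ is shown below (the x₂ fit is analogous, using the name lengths 6, 7, and 8 as the regressor):

```python
import numpy as np

# Table 3.7: radio licenses (x1, millions) and mental defectives per 10,000 (y), 1924-1937
x1 = np.array([1.350, 1.960, 2.270, 2.483, 2.730, 3.091, 3.647,
               4.620, 5.497, 6.260, 7.012, 7.618, 8.131, 8.593])
y  = np.array([8, 8, 9, 10, 11, 11, 12, 16, 18, 19, 20, 21, 22, 23], dtype=float)

X = np.column_stack([np.ones_like(x1), x1])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

resid = y - X @ beta_hat
r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print(beta_hat)   # roughly [4.58, 2.20], matching y_hat = 4.582 + 2.204 x1
print(r2)         # roughly 0.984
```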