
© HYON-JUNG KIM, 2017

1 Regression analysis - basics

Regression, one of the most widely used statistical techniques, estimates relationships among
variables. Regression models provide a very flexible framework for describing and testing
hypotheses about relationships between explanatory variables and a response variable.

Typically, a regression analysis is used for the following purposes:

(1) modeling the relationship between variables.

(2) prediction of the target variable (forecasting).

(3) testing of hypotheses.

The basis of regression analysis is the linear model. The model can be characterized as
follows. We have n sets of observations {X1i , X2i , ..., Xpi , Yi }, i = 1, ..., n, which represent
a random sample from a larger population. It is assumed that these observations satisfy a
linear relationship

Yi = β0 + β1 X1i + β2 X2i + · · · + βp Xpi + ǫi ,   i = 1, . . . , n

where the β coefficients are unknown parameters, and the ǫi are random error terms.

Note: By a linear model, it is meant that the model is linear in the parameters.

A linear model may not be a true representation of reality, but it often provides a useful
approximation that works well over a wide range of circumstances.

George Box: “All models are wrong, but some are useful.”

John Tukey: “Embrace your data, not your models.”

Ref. deterministic (or functional) relationships vs statistical relationships

eg. Fahrenheit = (9/5) × Celsius + 32

Circumference = π × diameter

E = ms² ,  E: energy, m: mass, s: speed of light


Height and weight, driving speed and gas mileage (statistical relationships)

I. Simple linear regression:

a statistical method that allows us to summarize and study relationships between two
continuous (quantitative) variables.

One variable (X) is regarded as the predictor, explanatory, or independent variable.

The other variable, denoted (Y ), is the response, outcome, or dependent variable.

Simple linear model (SLM):

Yi = β0 + β1 Xi + ǫi ,   ǫi ∼ uncorrelated (0, σ²)

The first step in any analysis is to look at the data; in the regression context, that means
looking at histograms and a scatter plot. Estimating the unknown parameters β0 and β1
corresponds to putting a straight line through the point cloud in the scatter plot.

Q. “What is the best fitting line?”

The standard approach is the least squares regression, where the estimates are chosen
to minimize

Σ_{i=1}^n (Yi − β0 − β1 Xi)² .
This is a standard calculus problem, and was solved for the first time either by Legendre in
1805, or by Gauss in 1794 (Legendre published first). It can be shown that the least squares
estimates satisfy

β̂0 = Ȳ − β̂1 X̄     and     β̂1 = Σ_i (Xi − X̄)(Yi − Ȳ) / Σ_i (Xi − X̄)² = Sxy / Sxx

- Special notation for simplicity:

Syy = Σ_i (Yi − Ȳ)² = Σ Yi² − (Σ Yi)²/n

Sxx = Σ_i (Xi − X̄)² = Σ Xi² − (Σ Xi)²/n

Sxy = Σ_i (Xi − X̄)(Yi − Ȳ) = Σ Xi Yi − (Σ Xi)(Σ Yi)/n
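These formulas translate directly into a few lines of R. The sketch below uses a small made-up data set purely to illustrate the computation and checks the result against lm().

# A small hypothetical data set, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

Sxx <- sum((x - mean(x))^2)                 # Sxx
Sxy <- sum((x - mean(x)) * (y - mean(y)))   # Sxy

b1 <- Sxy / Sxx                # slope estimate, beta1-hat
b0 <- mean(y) - b1 * mean(x)   # intercept estimate, beta0-hat
c(b0, b1)

coef(lm(y ~ x))                # lm() gives the same least squares estimates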


Terminology

• The least squares parameter (regression coefficients) estimates: β̂0 , β̂1

• The predicted (or fitted) values : Ŷi = β̂0 + β̂1 Xi for i = 1, ..., n

• The residuals for data point i: ei = Yi − Ŷi

• The squared prediction error for data point i: ei² = (Yi − Ŷi)²

• Sum of Squared Error (SSE):

SSE = Σ_{i=1}^n (Yi − Ŷi)²

cf. MSE = SSE/(n − 2) : Mean Squared Error

• Because the formulas for β̂0 and β̂1 are derived using the least squares criterion, the
resulting equation is often referred to as the “least squares regression line”, or simply the
“least squares line”. It is also sometimes called the estimated regression equation.

We need a few assumptions in order to justify using the least squares regression:

(a) the expected value of the errors is zero. i.e. E(ǫi ) = 0 for all i. (or the mean of the
response E(Yi ) at each value of the predictor Xi is a linear function of Xi .)

(b) errors are uncorrelated with each other (independent).

(c) errors are normally distributed (confidence/prediction intervals or hypothesis tests).

(d) the variance of the errors is constant (equal). i.e. Var(ǫi ) = σ 2 for all i.

Gauss-Markov theorem: under the above conditions of the simple linear model (SLM) the
least squares estimators are unbiased and have minimum variance among all unbiased linear
estimators.


Example. Effects of ozone pollution on soybean yield

Four dose levels of ozone and the resulting mean seed yield of soybeans are given. The
dose of ozone is the average concentration (parts per million, ppm) during the growing
season. Yield is reported in grams per plant.

X Y
Ozone (ppm) Yield (gm/plt)
.02 242
.07 237
.11 231
.15 201

Yi      Ŷi      ei      ei²
242
237
231
201
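The table can be filled in with a short R sketch (the four observations are those listed above):

ozone <- c(0.02, 0.07, 0.11, 0.15)     # X
yield <- c(242, 237, 231, 201)         # Y

fit <- lm(yield ~ ozone)               # simple linear model
coef(fit)                              # beta0-hat and beta1-hat

data.frame(Y      = yield,
           fitted = fitted(fit),       # Y-hat_i
           e      = resid(fit),        # e_i = Y_i - Y-hat_i
           e2     = resid(fit)^2)      # e_i^2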


1.1 INTERVAL ESTIMATES and HYPOTHESES TESTS

In order to construct confidence intervals or test hypotheses it is necessary to make some


assumptions about the distribution of ǫi , i = 1, ..., n. The usual assumption is that

ǫi ∼ Normal(0, σ²),

i.e. the ǫi are normally distributed, independent of all other ǫj , all with the same variance
σ² and mean zero.

Since the random errors are unknown, the residuals can be used to get an estimate of
the error variance:

(1/(n − p)) Σ_{i=1}^n ei² = (1/(n − p)) Σ_{i=1}^n (Yi − Ŷi)² = MSE

Properties of least squares estimates

We consider the simple linear regression model Yi = β0 + β1 Xi + ǫi , for i = 1, ..., n, where
ǫi ∼ i.i.d. Normal(0, σ²).

(a) E(β̂0) = β0 and E(β̂1) = β1 ; unbiased

(b) Var(β̂0) = ( 1/n + X̄²/Sxx ) σ²

(c) Var(β̂1) = σ²/Sxx

(d) Cov(β̂0 , β̂1) = −σ² X̄/Sxx

(e) E(MSE) = σ² , where SSE/σ² ∼ χ²_{n−2}

(f) Both β̂0 and β̂1 are normally distributed. (MSE is independent of both β̂0 and β̂1 .)

Since β̂1 ∼ Normal(β1 , σ²/Sxx),

t* = [ (β̂1 − β1) √Sxx / σ ] / √[ ((n−2)MSE/σ²) / (n−2) ] = (β̂1 − β1) / √(MSE/Sxx) ∼ t_{n−2} .

Then, the 100(1 − α)% confidence interval for β1 is β̂1 ± t_{1−α/2, n−2} √(MSE/Sxx).


Under the normality assumption, in general, the 100(1 − α)% confidence interval for βk is

β̂k ± t_{1−α/2, n−p} s(β̂k),   k = 1, . . . , p

where s(β̂k ) is a standard error (estimate) of β̂k .

To test a one-sided hypothesis: H0 : βk ≥ 0 vs. Ha : βk < 0

Reject H0 if

t* = (β̂k − 0)/s(β̂k) < t_{α, n−p} ;   otherwise do not reject H0 .

 
A 100(1 − α)% confidence interval for σ² is given by

( (n−2)MSE / χ²_{n−2, 1−α/2} ,  (n−2)MSE / χ²_{n−2, α/2} ).

CONFIDENCE INTERVALS FOR THE MEAN OF Y : One very useful application of


the preceding result is to the problem of estimating the mean value of Y at a fixed value of
X, say, X0 . In our straight-line regression setting,

E(Y) at X = X0 is β0 + β1 X0 ,

which is just a linear combination of β0 and β1 with l′ = (1, X0). Then, it is estimated by
l′β̂ = β̂0 + β̂1 X0 and the variance of l′β̂ is given by

σ² ( 1/n + (X0 − X̄)²/Sxx )   (verify!)

Thus, a 100(1 − α)% confidence interval for the mean response of Y at a fixed value X0 is
given by

(β̂0 + β̂1 X0) ± t_{1−α/2, n−2} √[ MSE ( 1/n + (X0 − X̄)²/Sxx ) ].

EXTRAPOLATION: It is sometimes desired to estimate E(Y) at X = X0 , i.e. β0 + β1 X0 ,

based on the fit of the straight line for values of X0 outside the range of X values used in the
data. This is called extrapolation, and can be quite dangerous. In order for our inferences
to be valid, we must believe that the straight line relationship holds for X values outside the
range where we have observed data. In some situations, this may be reasonable; in others,
we may have no theoretical basis for making such a claim without data to support it.


PREDICTION INTERVALS FOR a FUTURE Y : Sometimes we are interested in the


actual value of Y we might observe when X = X0 . On the surface, this may sound like
the same problem as above, but they are, indeed, very different. For example, consider a
stockbroker who would like to learn about the value of a stock based on previous data. In
this setting, the stockbroker would like to predict or forecast the actual value of the stock,
say, Y0 that might be observed when X = X0 . On the other hand, the stockbroker probably
does not care about what might happen “on the average” at some future time; that is, he/she
is probably not concerned with estimating E(Y) at X = X0 , i.e. β0 + β1 X0 . In the context of our
model, we are interested in the future observation Y0 = β0 + β1 X0 + ǫ0 where ǫ0 is the error
associated with Y0 that makes it differ from the mean β0 + β1 X0 . Note also that we wish to
predict a random quantity, not a fixed parameter.

The error in prediction, Ŷ0 − Y0 , is normally distributed; more precisely,

Ŷ0 − Y0 ∼ N( 0, σ² ( 1 + 1/n + (X0 − X̄)²/Sxx ) ).

Then, a 100(1 − α)% prediction interval for Y0 is

Ŷ0 ± t_{1−α/2, n−2} √[ MSE ( 1 + 1/n + (X0 − X̄)²/Sxx ) ].

REMARK: In the simple linear regression case, the F* test using F* = Σ_{i=1}^n (Ŷi − Ȳ)² / MSE is the
same as the t* test for H0 : β1 = 0 vs. Ha : β1 ≠ 0. In general, the t* statistic is the square
root of F* for one-degree-of-freedom (two-sided) hypothesis tests. So if a two-sided test is
desired, either t* or F* gives identical results (the numerator of F* has one degree of freedom).

The rejection rule is: Reject H0 if F* > F_{1−α, df1, df2} ; otherwise do not reject H0 .

( df1 = p − 1, df2 = n − p ),  p = number of regression parameters.

Example. Effects of ozone pollution on soybean yield (- continued)

a) Give the 95% confidence interval for β0 .

b) Test H0 : β1 = 0 vs. Ha : β1 ≠ 0.

c) Find the 95% prediction interval for the soybean yield at X0 = .02.
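For reference, parts a)–c) can be checked in R, continuing the ozone sketch above (the objects ozone, yield and fit are the ones defined there):

confint(fit, level = 0.95)    # a) 95% confidence intervals for beta0 and beta1
summary(fit)$coefficients     # b) t statistics and p-values for H0: beta = 0

# c) 95% prediction interval for a new yield observed at ozone = 0.02
predict(fit, newdata = data.frame(ozone = 0.02), interval = "prediction", level = 0.95)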


1.2 Analysis of Variance approach

SUMS OF SQUARES

In a regression framework we have a set of values of Y , whose variability we want to


explain. We are interested in how much the variables in X help to explain Y .

The total variation of Y , corrected only for the mean (SST), is a measure of the variation
of Y around the mean Ȳ . The total variation can be split up into two component sums of
squares, sum of squares due to regression (SSR) and sum of squares due to errors (SSE).
SSR is the variation of the fitted points around the mean and SSE is the residual variation
in Y that is not explained by the regression curve.

Σ_{i=1}^n (Yi − Ȳ)² = Σ_{i=1}^n (Ŷi − Ȳ)² + Σ_{i=1}^n (Yi − Ŷi)²

This kind of partitioning is called the analysis of variance because the total sum of
squares (SST) is the sum of squares that is used to compute the variance of Y if no model
is fit. It is this variance that is being “analyzed” or partitioned in the name of analysis of
variance (ANOVA).

Typically this breakdown of sums of squares is put into an analysis of variance table: Let
p denote the number of parameters to be estimated in a regression model.

Source        degrees of freedom    Sum of Squares                              Mean Square         F

Regression    p − 1                 SSR = Σ_i (Ŷi − Ȳ)² = Σ_i Ŷi² − n(Ȳ)²       MSR = SSR/(p−1)     MSR/MSE
Error         n − p                 SSE = Σ_i (Yi − Ŷi)²                        MSE = SSE/(n−p)
Total         n − 1                 SST = Σ_i (Yi − Ȳ)² = Σ_i Yi² − n(Ȳ)²

The Coefficient of Determination: R²

This immediately implies that a good regression is one with a large R², where

R² = Σ_{i=1}^n (Ŷi − Ȳ)² / Σ_{i=1}^n (Yi − Ȳ)² = SSR/SST = 1 − SSE/SST .


The R2 value (called the coefficient of determination) measures the proportion of variability
in Y accounted for by the regression. Values closer to 1 indicate a strong regression, while
values closer to 0 indicate a weaker one. When it is small, it may be that there is a lot of
random inherent variation in the data.

Sometimes a slightly adjusted value of R², which is designed to offset an upward bias in
it, is reported; the adjusted R² has the form

Ra² = R² − (1/(n − 2)) (1 − R²).
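In R, the ANOVA decomposition, R², and the adjusted R² are available directly from the fitted object (continuing the ozone sketch):

anova(fit)                    # sums of squares, mean squares and the F test
summary(fit)$r.squared        # R^2 = SSR/SST
summary(fit)$adj.r.squared    # adjusted R^2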

The Coefficient of Correlation

In observational data situations, often, we do not choose fixed values of X; rather, we


merely observe the pair of random variables (X, Y ). Then, there may exist a bivariate
distribution of the random vector (X, Y ) with the (population) correlation coefficient ρ
indicating their linear association:
ρ = Cov(X, Y) / √( Var(X) Var(Y) ).

The sample correlation coefficient, defined as

r (or rXY) = Sxy / √(Sxx Syy),

is ±√R² with the sign depending on the sign of the estimated slope coefficient.

Warnings

1. A large R2 value does not imply that the estimated regression line fits the data well.
Another function might better describe the trend in the data.

2. One data point (or a few data points) can greatly affect the R2 value.

3. Association (or correlation) does not imply causation.

4. Correlations that are based on rates or averages tend to overstate the strength of an
association.

5. A “statistically significant” R2 value does not imply that the slope is meaningfully
different from 0. A large r-squared value does not necessarily imply useful predictions.


1.3 Matrix approach to Simple Linear Models

Reminder about Matrices

i) AB ≠ BA   ii) (AB)′ = B′A′ (transpose)

iii) Rank and dependence

In general, k columns (of a matrix) U1 , U2 , . . . , Uk are linearly dependent if there exist


λ1 , λ2 , . . . , λk such that 1) λ1 U1 + λ2 U2 + . . . + λk Uk = Φ where Φ is a column of 0’s and 2)
at least one of the λ’s is not 0. Thus, k columns are linearly independent if the only linear
combination of them which will produce the zero vector is the linear combination with all
λ’s 0.

The rank of a matrix is the maximum number of linearly independent columns which
can be selected from the columns of the matrix. Note that if the rank of a matrix is 1, then
there is one column such that all other columns are direct multiples.

For any matrix A, the rank of A is the same as the rank of A′ A. The row rank of any
matrix is always equal to the column rank.

iv) Inverse of a matrix (A−1 )

The inverse of an n × n matrix A is an n × n matrix B such that AB = In where In is


an n × n identity matrix. Such a matrix B will exist only if A is of rank n. In this case it is
also true that BA = In .

RANDOM VECTORS

Let Y = (Y1 , Y2 , ..., Yn )′ be a random vector with the probability density function (pdf)
denoted by f (y) (describing how Y1 , Y2 , ..., Yn are jointly distributed).

Then, E(Y ) = (E(Y1), E(Y2), ..., E(Yn))′ and the variance-covariance matrix of Y is

    [ Var(Y1)       Cov(Y1, Y2)   · · ·   Cov(Y1, Yn) ]
    [ Cov(Y2, Y1)   Var(Y2)       · · ·   Cov(Y2, Yn) ]
    [     ...           ...        ...        ...     ]
    [ Cov(Yn, Y1)   Cov(Yn, Y2)   · · ·   Var(Yn)     ]


defined by E[(Y − E(Y ))(Y − E(Y ))′ ] = Var(Y ). Since Cov(Yi , Yj ) = Cov(Yj , Yi ), it
follows that the variance-covariance matrix is symmetric.

The covariance of two random vectors, Y (n×1) and Z (m×1), is given by

              [ Cov(Y1, Z1)   Cov(Y1, Z2)   · · ·   Cov(Y1, Zm) ]
Cov(Y , Z) =  [ Cov(Y2, Z1)   Cov(Y2, Z2)   · · ·   Cov(Y2, Zm) ]
              [     ...           ...        ...        ...     ]
              [ Cov(Yn, Z1)   Cov(Yn, Z2)   · · ·   Cov(Yn, Zm) ]   (n × m)

BASIC RESULTS

Suppose that A, B are n × m matrices of constants and that c is a vector of constants. Let
V be the variance-covariance matrix of Y .

i) E(AY ) = A E(Y )

ii) Var(AY + c) = A Var(Y )A′ = AV A′ .

iii) Cov(AY , BZ) = A Cov(Y , Z)B ′


iv) Y′AY = ΣΣ aij Yi Yj is a quadratic form and

E(Y′AY) = µ′Aµ + tr(AV ),

where E(Y ) = µ and tr(W) = Σ_{i=1}^n wii is the trace of W (n×n).

Recall the expectation of a linear combination of random variables Uk , k = 1, · · · , m :

E( Σ_{k=1}^m ak Uk ) = Σ_{k=1}^m ak E(Uk)

The variance of a linear combination of random variables is

Var( Σ_{k=1}^m ak Uk ) = Σ_{k=1}^m Σ_{l=1}^m ak al Cov(Uk , Ul)
                       = Σ_{k=1}^m ak² Var(Uk) + ΣΣ_{l≠k} ak al Cov(Uk , Ul)

Definition: The random vector Y = (Y1 , Y2 , ..., Yn )′ is said to have a multivariate normal distribution
with mean µ and variance-covariance matrix V if its (joint) probability density function is


given by

fY(y) = [ 1 / ( (2π)^{n/2} |V|^{1/2} ) ] exp{ −(1/2) (y − µ)′ V⁻¹ (y − µ) },

for all y ∈ Rⁿ. Shorthand notation for this statement is Y ∼ MVN(µ, V ).

• If Y ∼ MVN(µ, V ), then marginally, each Yi ∼ Normal(µi , Var(Yi )).

• If A is a matrix of constants and c is a vector of constants,

AY + c ∼ MVN (Aµ + c, AV A′ ).

Regression Model in Matrix Notation

For more complicated models it is useful to write the model in matrix notation. When
we expand out the simple linear model, Yi = β0 + β1 Xi + ǫi , i = 1, . . . , n :

Y1 = β0 + β1 X1 + ǫ1
Y2 = β0 + β1 X2 + ǫ2
⋮
Yn = β0 + β1 Xn + ǫn

This can be expressed as

[ Y1 ]   [ 1   X1 ]            [ ǫ1 ]
[ Y2 ] = [ 1   X2 ]  [ β0 ]  + [ ǫ2 ]
[ ⋮  ]   [ ⋮   ⋮  ]  [ β1 ]    [ ⋮  ]
[ Yn ]   [ 1   Xn ]            [ ǫn ]

or

Y = Xβ + ǫ

The normal equations can be written in matrix notation as well. They are

X ′ Xβ = X ′ Y


or

[ n      Σ Xi  ] [ β0 ]   [ Σ Yi    ]
[ Σ Xi   Σ Xi² ] [ β1 ] = [ Σ Xi Yi ]

Thus, if the design matrix X has full column rank, then the normal equations can be
solved by pre-multiplying both sides by (X′ X)−1 .

β̂ = (X′ X)−1 X ′ Y

You should verify for yourselves that in the simple linear regression case

β̂ = (X′X)⁻¹ X′Y = ( Ȳ − β̂1 X̄ ,  Sxy/Sxx )′

Recall that the random vector ǫ is what makes the above a statistical model (rather than a
purely deterministic relationship). One of the assumptions of the ordinary regression model
is that ǫi ∼ uncorrelated (0, σ²).

We also see that Y is a random vector (its randomness comes from ǫ), while each Xi is a
known constant.

The least squares estimates of the parameters are random vectors since they are linear
combinations of Y.

So we can find the mean and covariance matrix of β̂ as

E(β̂) = E{(X′ X)−1 X ′ Y } = (X′ X)−1 X ′ E{Y }

= (X′ X)−1 X ′ Xβ

= β

Thus, the least squares estimates for β are unbiased.

The variance-covariance matrix of β̂ is

Var(β̂) = Var{(X′ X)−1 X ′ Y }

= (X′ X)−1 X ′ Var{Y }X(X′ X)−1

= (X′ X)−1 X ′ σ 2 In X(X′ X)−1 , when Cov(ǫi , ǫj ) = 0 and Var(ǫi ) = σ 2 for all i, j

= (X′ X)−1 σ 2


The predicted values or estimated values of Y are called Ŷ :

Ŷ = Xβ̂ = X(X′X)⁻¹X′Y = HY ,   where H = X(X′X)⁻¹X′ is called the hat matrix.

- H : symmetric, idempotent (HH = H).

The residuals are deviations of the observed points from the fitted regression line. They
are important for assessing the fit of the model and for computing an estimate of the variance.

e = Y − Ŷ = Y − Xβ̂ = Y − HY = (I − H)Y

Properties of the residuals that are used in checking the model assumptions:

1. The sum of squares of the residuals is minimized by the criterion for least squares.

2. X′e = 0 (the residuals are orthogonal to the columns of the X matrix). This means that

a) For a model with an intercept (a column of ones) the sum of the residuals is zero because

1′e = Σ_{i=1}^n ei = 0

b) The residuals are orthogonal to every independent variable in the model:

Σ_{i=1}^n Xik ei = 0
i=1

This means that if you tried to regress the residuals on X you would get a slope of zero
and the scatter plot of e vs X would be centered about zero.
3. The residuals are also orthogonal to Ŷ since Ŷ′e = β̂′X′e = 0. Notice that the
residuals are not orthogonal to the observed values of Y .

Note that the ANOVA sums of squares are all quadratic forms.

i) SST = Y′Y − (1/n) Y′11′Y = Y′[ I − (1/n) 11′ ]Y

ii) SSR = β̂′X′Y − (1/n) Y′11′Y = Y′[ H − (1/n) 11′ ]Y

iii) SSE = Y′Y − β̂′X′Y = Y′[ I − H ]Y


We can also test a hypothesis about a linear combination of parameters, l′β :

l′β̂ ∼ Normal( l′β, l′(X′X)⁻¹l σ² ).

A two-sided test H0 : l′β = M vs. Ha : l′β ≠ M can be done as follows.

Reject H0 if |t*| = | l′β̂ − M | / s(l′β̂) > t_{1−α/2, n−p} ,   where s(l′β̂) = √( l′(X′X)⁻¹l MSE ).

(Recap) Under the normality assumption, the 100(1 − α)% confidence interval for βk is

β̂k ± t_{1−α/2, n−p} s(β̂k),   k = 1, 2,   where s(β̂k) = √( [(X′X)⁻¹]kk MSE ).

Simple example. DATA (X, Y): (4,5), (7,7), (5,6), (4,6). Assume that σ² = 0.25 is known.

[ 5 ]   [ 1  4 ]            [ ǫ1 ]
[ 7 ] = [ 1  7 ]  [ β0 ]  + [ ǫ2 ]
[ 6 ]   [ 1  5 ]  [ β1 ]    [ ǫ3 ]
[ 6 ]   [ 1  4 ]            [ ǫ4 ]

We find the estimated parameters using the formula β̂ = (X′X)⁻¹X′Y = (3.5, 0.5)′ (Verify!)

The variance-covariance matrix of the parameter estimates is

(0.25) [  4.4167  −0.8333 ]  =  [  1.1042  −0.2083 ]
       [ −0.8333   0.1667 ]     [ −0.2083   0.0417 ]

(1) Test that the slope is 0:  t = 0.5/√0.0417 = 0.5/0.2041 = 2.45  (2 degrees of freedom)

(2) Test that the intercept is 0:  (2 degrees of freedom)

(3) Estimate the mean value of Y at X = 6 and give a 95% confidence interval.

We are estimating β0 + 6β1 = [1, 6] (β0 , β1)′ ,  and l′β̂ =

Var̂(l′β̂) = l′ Var̂(β̂) l = [1, 6] [  1.1042  −0.2083 ] [ 1 ]  = 0.1042.
                                  [ −0.2083   0.0417 ] [ 6 ]

The 95% confidence interval for our little example is

Note that the 97.5% quantile of the t distribution with 2 d.f. = 4.30.
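The “Verify!” steps of this example can be reproduced with base R matrix algebra; the sketch below follows the numbers given above (σ² = 0.25 treated as known, and the t quantile quoted in the notes).

X <- cbind(1, c(4, 7, 5, 4))              # design matrix (intercept column plus X)
Y <- c(5, 7, 6, 6)

XtXinv   <- solve(t(X) %*% X)             # (X'X)^{-1}
beta.hat <- XtXinv %*% t(X) %*% Y         # should equal (3.5, 0.5)'
V.beta   <- 0.25 * XtXinv                 # Var(beta-hat) with sigma^2 = 0.25 known

l   <- c(1, 6)                            # estimating E(Y) at X = 6
est <- drop(t(l) %*% beta.hat)            # l' beta-hat
se  <- sqrt(drop(t(l) %*% V.beta %*% l))  # sqrt(0.1042)
est + c(-1, 1) * qt(0.975, df = 2) * se   # interval using the t(2) quantile as in the notes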


1.4 DIAGNOSTICS and MODEL EVALUATION

When a simple linear model is selected for an application, one cannot be certain in advance
that the model is appropriate to the dataset in question. Note that all the tests, intervals,
predictions, etc., are based on believing that the assumptions of the regression hold. To assess
whether the assumptions underlying the model seem reasonable, we study the residuals with
graphical analyses. That is, we analyze the residuals to see if they support the assumptions
of linearity, independence, normality and equal variances.

When conducting a residual analysis, we make the following plots (an R sketch for producing them follows the list):

(1) plot of the residuals versus the fitted values.

- The residuals should be scattered randomly around the 0 line.

- The residuals should roughly form a horizontal band around the 0 line, suggesting that
the variances of the error terms are equal.

(2) Residuals vs. predictor plot

- The interpretation is identical to that for a “residuals vs. fits plot.”

- If the predictor on the X axis is a new and different predictor, the residuals vs. predictor
plot can help to determine whether the predictor should be added to the model (and hence
a multiple regression model used instead).

(3) Normal probability plot of the residuals

- constructed by plotting the n ordered residuals against the n ordered quantiles from
the standard normal distribution.

- This assesses the apparent normality of the residuals; the plot should look like a straight
line (roughly).

- Isolated points represent unusual observations, while a curved line indicates that the
errors are probably not normally distributed, and tests and intervals might not be trust-
worthy.


(4) Residuals versus time

-To check if your data has a time structure to it. If not, there should be no apparent
pattern.

[Figure: sketches of residual plots ei versus i — no pattern (OK), autocorrelation, variance increases, linear trend, quadratic trend]
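A sketch of how the four plots can be produced in R for an lm() fit (continuing with the fit and ozone objects from the earlier example):

plot(fitted(fit), resid(fit)); abline(h = 0)   # (1) residuals vs. fitted values
plot(ozone, resid(fit)); abline(h = 0)         # (2) residuals vs. predictor
qqnorm(resid(fit)); qqline(resid(fit))         # (3) normal probability plot of residuals
plot(resid(fit), type = "b"); abline(h = 0)    # (4) residuals in time/index order
# plot(fit) also produces a standard set of diagnostic plots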

OUTLIERS: An outlier is an observation that is unusually small or large; i.e. outliers are
the data points that do not fit well with the pattern of the rest of the data. In straight-line
regression, an outlier might be an observation that falls far off the apparent approximate
straight line trajectory followed by the remaining observations. Practitioners often “toss
out” such anomalous points, which may or may not be a good idea. If it is clear that an
outlier is the result of a mishap or a gross recording error, then this may be acceptable. On
the other hand, if no such basis may be identified, the outlier may, in fact, be a genuine
response; in this case, it contains information about the process under study, and may be
reflecting a legitimate phenomenon. Then, “throwing out” an outlier may lead to misleading
conclusions, because a legitimate feature is being ignored.

STRATEGIES TO DEAL WITH OUTLIERS: What should we do if an outlier (or out-


liers) are identified? Unfortunately, there is no clear-cut answer! However, here are some
suggestions:

a. Delete outliers and redo the analysis (new outliers may surface).


b. Sometimes the purpose of the experiment is just to identify the outliers. In this case,
there is no need to redo the analysis.

c. Check the experimental circumstances surrounding the data collection for the outlying
cases.

d. Report the analysis both with and without the outliers and let the reader decide.

LEVERAGES: To identify outliers, we should consider first looking at the residual plot of
ei versus Ŷi . Recall the property of residuals:

ei = Yi − Ŷi ∼ N( 0, σ²(1 − hii) )

where

hii = 1/n + (Xi − X̄)²/Sxx ,  which is called the leverage of the ith case.
So the residuals (unlike the errors) do not have constant variance and are slightly correlated!
Observations where Xi is far away from X will have large values of hii . However, not all
observations with large leverages are necessarily outliers.

Leverage is an index of the importance of an observation to a regression analysis: large


deviations from mean are influential.

STUDENTIZED RESIDUALS: To account for the different variances among residuals, we
consider “studentizing” the residuals (i.e., dividing by an estimate of their standard deviation).

The (internally) studentized residuals, defined as

ri = ei / √( MSE (1 − hii) ),

have E(ri) = 0 and Var(ri) = 1, so that the studentized residuals have a constant variance
regardless of the location of the X’s. Values of |ri| larger than 3 or so should cause concern.
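Both quantities are built into R (a sketch, reusing the fit object):

hatvalues(fit)    # leverages h_ii
rstandard(fit)    # internally studentized residuals r_i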

TRANSFORMATIONS: Transforming the data is a way of handling violations of the usual


assumptions. In the regression context, this may be done in a number of ways. One way
is to invoke an appropriate transformation, and then postulate a regression model on the


transformed scale. Sometimes it may be that, although the data do exhibit constant variance
on the original scale, they may be analyzed better on some transformed scale. It is also
important to remember that if a transformation is used, the resulting inferences apply to
this transformed scale (and no longer to the original scale). One should remember that
transforming the data may fix one problem, but it may create other violations of the model.

Another approach is to proceed with a regression method known as weighted-least squares.


In a weighted regression analysis, different responses are given different weights depending
on their variances (to be discussed later).

In general, transforming your data almost always involves lots of trial and error.

BOX-COX TRANSFORMATION: The power transformation

g(Y) = { log(Y),  λ = 0
       { Y^λ ,    λ ≠ 0

was suggested by Box and Cox (1964). The log and square root transformations are special
cases with λ = 0 and λ = 1/2, respectively. Approximate 100(1 − α) percent confidence
intervals for λ are available.
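In R, the boxcox() function in the MASS package profiles the likelihood over λ and marks an approximate 95% confidence interval (a sketch, reusing the fit object; the response must be positive):

library(MASS)
boxcox(fit, lambda = seq(-2, 2, by = 0.1))   # profile log-likelihood for lambda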

OTHER TRANSFORMATIONS (mainly for simple linear regression models)

Some guidelines: You may want to try transformations of the Y -variable when there is
evidence of non-normality and/or non-constant variance problems in one or more residual
plots. Try transformations of the X-variable(s) (e.g., X −1 , X 2 , ln (X)) when there are strong
nonlinear trends in one or more residual plots. If either of the above trials does not work,
consider transforming both the response Y and the predictor X (e.g. ln (Y ), ln (X)).

If the error variances are unequal, try ‘stabilizing the variance’ by transforming Y :

- When the response is a Poisson count, take the √Y transformation.

- For a binomial proportion, use the ‘arcsine transformation’ sin⁻¹(√y), or more commonly
do logistic regression.

- When the response isn’t anything special but showing unequal variances, use ln(Y ), or 1/Y .


2 Multiple Regression

While the straight-line model serves as an adequate description for many situations, more
often than not, researchers who are engaged in model building consider more than just one
predictor variable X. In fact, it is often the case that the researcher has a set of p − 1
candidate predictor variables, say, X1 , X2 , ..., Xp−1 , and desires to model Y as a function of
one or more of these p − 1 variables. To accommodate this situation, we must extend our
linear regression model to handle more than one predictor variable.

MULTIPLE REGRESSION SETTING: Consider an experiment in which n observations are


collected on the response variable Y and p − 1 predictor variables X1 , X2 , ..., Xp−1 .

Individual    Y     X1     X2     ...   Xp−1

1             Y1    X11    X12    ...   X1,p−1
2             Y2    X21    X22    ...   X2,p−1
⋮
n             Yn    Xn1    Xn2    ...   Xn,p−1

To describe Y as a function of the p − 1 independent variables X1 , X2 , ..., Xp−1 , we posit


the multiple linear regression model

Yi = β0 + β1 Xi1 + ... + βp−1 Xip−1 + ǫi

for i = 1, 2, ..., n, where n > p and ǫi ∼ N (0, σ 2 ). The values β0 , β1 , ..., βp−1 are regression
coefficients as before and we assume that X1 , X2 , ..., Xp−1 are all fixed. The random errors
ǫi ’s are still assumed to be independent and have a normal distribution with mean zero and
a common variance σ 2 . Then,
Y = Xβ + ǫ,

where ǫ ∼ MVN (0, σ 2 I).

In the multiple regression setting, because of the potentially large number of predictors, it
is more efficient to use matrices to define the regression model and the subsequent analyses.


Note that the matrix X is called the design matrix, since it contains all of the predictor
variable information.

The least squares method is used to estimate the regression parameters β0 , β1 , ..., βp−1
for which we can easily find closed-form solutions in terms of matrices and vectors.

Provided that X′X is full rank, the least-squares estimator of β is

β̂ = (X′X)⁻¹ X′Y

NOTE: For the least-squares estimator β̂ to be unique, we need X to be of full column
rank; i.e., r(X) = p. That is, there are no linear dependencies among the columns of X. If
r(X) < p, then, since r(X′X) = r(X), the matrix X′X does not have an inverse. In this case,
the normal equations cannot be solved uniquely.

To avoid the more technical details of working with non-full rank matrices we will assume
that X is of full rank, unless otherwise stated.

Example. The taste of matured cheese is related to the concentration of several chemicals
in the final product. In a study of cheddar cheese from the LaTrobe Valley of Victoria,
Australia, samples of cheese were analyzed for their chemical composition and were subjected
to taste tests. Overall taste scores were obtained by combining the scores from several tasters.
Data were collected on the following variables:

Y = taste test score (TASTE)

X1 = concentration of acetic acid (ACETIC)

X2 = concentration of hydrogen sulfide (H2S)

X3 = concentration of lactic acid (LACTIC)

Variables ACETIC and H2S are both on the (natural) log scale. The variable LACTIC
has not been transformed. Table below contains concentrations of the various chemicals in
n = 30 specimens of mature cheddar cheese and the observed taste score.

Suppose that the researchers postulate that each of the three chemical composition co-
variates X1 , X2 , and X3 are important in describing the taste. In this case, they might


TASTE ACETIC H2S LACTIC TASTE ACETIC H2S LACTIC


12.3 4.543 3.135 0.86 40.9 6.365 9.588 1.74
20.9 5.159 5.043 1.53 15.9 4.787 3.912 1.16
39.0 5.366 5.438 1.57 6.4 5.412 4.700 1.49
47.9 5.759 7.496 1.81 18.0 5.247 6.174 1.63
5.6 4.663 3.807 0.99 38.9 5.438 9.064 1.99
25.9 5.697 7.601 1.09 14.0 4.564 4.949 1.15
37.3 5.892 8.726 1.29 15.2 5.298 5.220 1.33
21.9 6.078 7.966 1.78 32.0 5.455 9.242 1.44
18.1 4.898 3.850 1.29 56.7 5.855 10.20 2.01
21.0 5.242 4.174 1.58 16.8 5.366 3.664 1.31
34.9 5.740 6.142 1.68 11.6 6.043 3.219 1.46
57.2 6.446 7.908 1.90 26.5 6.458 6.962 1.72
0.7 4.477 2.996 1.06 0.7 5.328 3.912 1.25
25.9 5.236 4.942 1.30 13.4 5.802 6.685 1.08
54.9 6.151 6.752 1.52 5.5 6.176 4.787 1.25

initially consider the following regression model

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + ǫi

for i = 1, 2, ..., 30. Are there other predictor variables that influence taste not considered here?
Alternatively, what if not all of X1 , X2 , and X3 are needed in the model? For example, if
the acetic acid concentration (X1 ) is not helpful in describing taste, then we might consider
a smaller model which excludes it; i.e.,

Yi = β0 + β2 Xi2 + β3 Xi3 + ǫi

for i = 1, 2, ..., 30. The goal of any regression modeling problem should be to identify each
of the important predictors, and then find the simplest model that does the best job.

The full model, Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + ǫi in matrix notation is Y = Xβ + ǫ,


   
where Y = (12.3, 20.9, 39.0, . . . , 5.5)′ ,

    [ 1  4.543  3.135  0.86 ]
    [ 1  5.159  5.043  1.53 ]
X = [ 1  5.366  5.438  1.57 ]     and  β = (β0 , β1 , β2 , β3)′ .
    [ ⋮    ⋮      ⋮     ⋮   ]
    [ 1  6.176  4.787  1.25 ]

We compute

        [  30       164.941   178.254    43.260 ]               [  736.000 ]
X′X =   [ 164.941   916.302  1001.806   240.879 ]   and  X′Y =  [ 4194.442 ]
        [ 178.254  1001.806  1190.343   269.113 ]               [ 5130.932 ]
        [  43.260   240.879   269.113    65.052 ]               [ 1162.065 ]
Thus, the least squares estimate of β for these data is given by

β̂ = (X′X)⁻¹ X′Y = (−28.877, 0.328, 3.912, 19.670)′ .

The least-squares regression equation becomes

Ŷi = −28.877 + 0.328 Xi1 + 3.912 Xi2 + 19.670 Xi3

or, in terms of the variable names,

estimated TASTEi = −28.877 + 0.328 ACETICi + 3.912 H2Si + 19.670 LACTICi .

Using the first model, the ANOVA table for the cheese data is shown below.

Source        DF    Sum of Squares    Mean Squares    F       Pr > F
Regression     3    4994.509          1664.836        16.22   < 0.0001
Error         26    2668.378           102.629
Total         29    7662.887

The F statistic is used to test H0 : β1 = β2 = β3 = 0 vs. H1 : β1 , β2 , β3 not all zero. Since the
P-value for the test is so small, we would conclude that at least one of X1 , X2 , or X3 is
important in describing taste. The coefficient of determination is R² ≈ 0.652. Thus, about
65 percent of the variability in the taste data is explained by X1 , X2 , and X3 , assuming the
model is true.
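The fit can be reproduced in R. The sketch below assumes the data have been read into a data frame called cheese with columns TASTE, ACETIC, H2S and LACTIC (the same names used later in these notes).

ch.lm <- lm(TASTE ~ ACETIC + H2S + LACTIC, data = cheese)

coef(ch.lm)       # compare with (-28.877, 0.328, 3.912, 19.670)
summary(ch.lm)    # coefficient table, R^2 (about 0.652) and the overall F test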


2.1 Sampling Distributions and Inference for the parameters

Sampling Distribution of β̂

Since β̂ is a linear combination of Y , β̂ has a multivariate normal distribution, just as ǫ and Y do:

β̂ ∼ MVN( β, σ²(X′X)⁻¹ ).

Now (X′X)⁻¹ is a p × p matrix, and so is Var(β̂), the variance-covariance matrix of β̂.

IMPLICATIONS:

(a) E(β̂k) = βk , for k = 0, 1, ..., p − 1 ; unbiased.

(b) Var(β̂k) = [(X′X)⁻¹]kk σ² , for k = 0, 1, ..., p − 1.
The value [(X′X)⁻¹]kk represents the kth diagonal element of the (X′X)⁻¹ matrix.

(c) Cov(β̂k , β̂l) = [(X′X)⁻¹]kl σ² , for k ≠ l.
The value [(X′X)⁻¹]kl is the entry in the kth row and lth column of the (X′X)⁻¹ matrix.

(d) Marginally, β̂k ∼ N( βk , [(X′X)⁻¹]kk σ² ), for k = 0, 1, ..., p − 1.

(e) E(MSE) = σ² , where

SSE/σ² = MSE(n − p)/σ² ∼ χ²_{n−p} .

CONFIDENCE INTERVALS:

The 100(1 − α)% confidence interval for βk is β̂k ± t_{1−α/2, n−p} s(β̂k), k = 0, . . . , p − 1, where
s(β̂k) is the standard error of β̂k , i.e., s(β̂k) = √( [(X′X)⁻¹]kk MSE ).

Since (β̂ − β)′X′X(β̂ − β) / σ² ∼ χ²_p is independent of MSE,

(β̂ − β)′X′X(β̂ − β) / (p MSE) ∼ (χ²_p / p) / (χ²_{n−p} / (n − p)) ≡ F_{p, n−p}

Thus, a 100(1 − α)% (simultaneous) confidence region for β can be formed as

(β̂ − β)′X′X(β̂ − β) ≤ p MSE F_{1−α, (p, n−p)} .

These regions are ellipsoidally shaped and cannot be easily visualized due to high dimensionality.


CONFIDENCE INTERVAL FOR THE MEAN RESPONSE: Given a new set of predictors
X0 , what is the estimated mean response? We call it Ŷ0 = X0′β̂ , and Var(X0′β̂) =
X0′(X′X)⁻¹X0 σ² . So a 100(1 − α)% confidence interval for the average of the responses
with given X0 is

Ŷ0 ± t_{1−α/2, n−p} √( MSE X0′(X′X)⁻¹X0 ).

PREDICTION INTERVAL OF A SINGLE FUTURE RESPONSE FOR X0 : Recall that
a future observation is predicted to be Ŷ0 = X0′β̂ (where we don’t know what the future
response will turn out to be). Then, a 100(1 − α)% prediction interval for a single future
response is

Ŷ0 ± t_{1−α/2, n−p} √( MSE [ 1 + X0′(X′X)⁻¹X0 ] ).

HYPOTHESIS TESTS:

For testing H0 : βk = βk,0 vs. H1 : βk ≠ βk,0 , reject H0 if

|t*| = | β̂k − βk,0 | / √( [(X′X)⁻¹]kk MSE ) > t_{1−α/2, n−p} ;

otherwise, do not reject H0 .

Example (cheese data continued). Considering the full model,

a) Find the estimate of the variance-covariance matrix of β̂.


 
We first compute

            [  3.795  −0.760   0.087  −0.071 ]
(X′X)⁻¹ =   [ −0.760   0.194  −0.020  −0.128 ]
            [  0.087  −0.020   0.015  −0.046 ]
            [ −0.071  −0.128  −0.046   0.726 ]
b) To assess the importance of the hydrogen sulfide concentration and its influence on
taste of cheese, we can test H0 : β2 = 0 vs. H1 : β2 ≠ 0.

c) H0 : β1 = β2 = β3 = 0 vs. H1 : β1 , β2 , β3 not all zero.

Note that if H0 is rejected, we do not know which one or how many of the βk ’s are
nonzero; we only know that at least one is.


SUMS OF SQUARES AS QUADRATIC FORMS: Recall that earlier (in the ANOVA section)
the corrected sums of squares were partitioned and expressed in terms of random vectors. To
enhance the understanding of these sums of squares, we often express them in terms of quadratic forms.

SST = Y′( I − (1/n) 11′ )Y = Y′( I − (1/n) J )Y

SSR = Y′( H − (1/n) J )Y

SSE = Y′( I − H )Y

Recall that any n × n real symmetric matrix A determines a quadratic form such that

Y′AY = Σ_{i=1}^n Σ_{j=1}^n aij Yi Yj .

Def. The trace of an n × n square matrix A is tr(A) = Σ_{i=1}^n Aii .

Lemma: E(Y′AY) = tr(AΣ) + µ′Aµ , where E(Y ) = µ and Var(Y ) = Σ.

EXPECTED MEAN SQUARES

We have seen that MSE is unbiased for σ² and this holds whether or not H0 : β1 = . . . =
βp−1 = 0 is true. On the other hand,

E(SSR) = E[ Y′(H − n⁻¹11′)Y ]
       = tr[ (H − n⁻¹11′) σ²I ] + (Xβ)′(H − n⁻¹11′)Xβ
       = (p − 1)σ² , only when H0 is true.

E(MSR) = E( SSR/(p − 1) ) = σ² , only when H0 is true.

NOTE: When H0 is true, both MSR and MSE are estimating the same quantity, and the
F statistic should be close to one. When H0 is not true, (Xβ)′(H − n⁻¹11′)Xβ > 0,
and hence MSR is estimating something larger than σ². In this case, we would expect F to
be larger than one. This gives an intuitive explanation of why the F statistic tends to be large
when H0 is not true. The name ‘Analysis of Variance’ comes from this kind of comparison:
we conduct hypothesis tests by comparing different estimators of the variance. The extra
term (Xβ)′(H − n⁻¹11′)Xβ is called a noncentrality parameter.


Source        degrees of freedom    Sum of Squares               Mean Square

Regression    p − 1                 β̂′X′Y − (1/n) Y′11′Y        SSR/(p − 1)
Error         n − p                 Y′Y − β̂′X′Y                 SSE/(n − p)
Total         n − 1                 Y′Y − (1/n) Y′11′Y

Example (cheese data continued). Considering the full model,

d) Find a 95 percent confidence interval for the mean taste rating when concentration of
acetic acid, hydrogen sulfide, lactic acid are 5,6,1, respectively.

e) Find a 95 percent prediction interval for a particular taste rating score when concen-
tration of acetic acid, hydrogen sulfide, lactic acid are 5,6,1, respectively.
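Parts d) and e) can be computed with predict() in R (a sketch, reusing the ch.lm fit and the cheese data frame assumed earlier):

new <- data.frame(ACETIC = 5, H2S = 6, LACTIC = 1)   # the new X0

# d) 95% confidence interval for the mean taste rating at X0
predict(ch.lm, newdata = new, interval = "confidence", level = 0.95)

# e) 95% prediction interval for a single new taste rating at X0
predict(ch.lm, newdata = new, interval = "prediction", level = 0.95)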

Coefficient of Determination, R-squared, and Adjusted R-squared

As in simple linear regression, R² = SSR/SST represents the proportion of variation
in Y (about its mean) ‘explained’ by the multiple linear regression model with predictors.
However, unlike in simple linear regression R2 always increases (or stays the same) as more
predictors are added to a multiple linear regression model, even if the predictors added are
unrelated to the response variable. Thus, by itself, R2 cannot be used to help us identify
which predictors should be included in a model. An alternative measure, adjusted R2 , does
not necessarily increase as more predictors are added, and can be used to help us identify
which predictors should be excluded in a model.

Interpretation of the Model Parameters

Each βi coefficient represents the change in the mean response, E(Y ), per unit increase
in the associated predictor variable when all the other predictors are held constant.

In the case of two predictors, the estimated regression equation yields a plane (as opposed
to a line in the simple linear regression setting). For more than two predictors, the estimated
regression equation yields a hyperplane.


2.2 General linear tests - REDUCED vs. FULL MODEL

Often we want to test a hypothesis about several parameters simultaneously (but not all of
them as in the case of testing regression relation). For example, in the model
Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi1 Xi2 + β4 Xi2² + ǫi

we might want to test the hypothesis that all second order terms are zero; i.e. β3 = β4 = 0.
It is not a good idea to do t∗ test for each parameter because of the correlations among
the regressor (independent) variables. These t∗ tests for each parameter test the following
hypotheses:

H0 : β3 = 0 given that β0 , β1 , β2 and β4 are not zero, or

H0 : β4 = 0 given that β0 , β1 , β2 and β3 are in the model.

If Xi1 Xi2 and Xi2² are highly correlated, then dropping Xi2² may not have a noticeable
impact. Likewise, if Xi2² is kept in the model, dropping Xi1 Xi2 may not have a big effect. It is
entirely possible to do two individual t∗ tests and conclude that neither variable is required.
However, it can be that the pair would significantly improve the fit of the model, if you test
the two variables simultaneously.

A common way to test H0 : β3 = β4 = 0 vs. Ha : not both are zero, is to fit the “reduced”
model
REDUCED: Yi = β0 + β1 Xi1 + β2 Xi2 + ǫi

Then compare the residual sum of squares from the reduced model with the residual sum of
squares from the full model. The test statistic is

F* = [ SSE(reduced) − SSE(full) ] / (df_red − df_full) / MSE(full)

The test criterion is: reject H0 if F* > F_{1−α, (df_red − df_full , df_full)} ; otherwise do not reject H0 .
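In R, this reduced-versus-full comparison is exactly what anova() does when given two nested lm() fits. The sketch below simulates a small hypothetical data set only so that the code is runnable; the model formulas mirror the example above.

set.seed(1)                                        # hypothetical data, for illustration only
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(50)

full    <- lm(y ~ x1 + x2 + I(x1 * x2) + I(x2^2), data = dat)
reduced <- lm(y ~ x1 + x2, data = dat)

anova(reduced, full)   # F* = [SSE(red) - SSE(full)] / (df_red - df_full) / MSE(full)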

This type of general linear test includes the F test for a regression relation and is a very
flexible method of testing hypotheses. It is also possible to test any linear hypotheses where
the reduced model is nested within the full model. For example, if
FULL: Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi1² + ǫi


and the hypothesis to be tested is H0 : β3 = 0 and β1 = β2 , then we can write the reduced
model as

REDUCED: Yi = β0 + β1 (Xi1 + Xi2) + ǫi

so that the design matrix has two columns:

    [ 1   X11 + X12 ]
X = [ ⋮       ⋮     ]
    [ 1   Xn1 + Xn2 ]

The parameters of this model are obtained by putting restrictions on the parameters of the
full model, so we say that this model is nested within the full model. The test statistic has
2 and n − 4 degrees of freedom.

GENERAL LINEAR TESTS in another way

The general linear test statistic can be computed another way which does not require
fitting the reduced model. Any linear hypothesis can be stated in the form H0 : Cβ = h.
For example, to test the hypothesis in the above model, write

C = [ 0  0  0  1 ]     and     h = [ 0 ]
    [ 0  1 −1  0 ]                 [ 0 ]

Then

H0 : Cβ = [ 0  0  0  1 ] (β0 , β1 , β2 , β3)′ = [ β3      ] = [ 0 ]
          [ 0  1 −1  0 ]                        [ β1 − β2 ]   [ 0 ]

The (extra) sum of squares due to fitting the full model over the reduced model is

SSE(reduced) − SSE(full) = (Cβ̂ − h)′ ( C(X′X)⁻¹C′ )⁻¹ (Cβ̂ − h)

The F* test statistic is then

F* = (Cβ̂ − h)′ ( C(X′X)⁻¹C′ )⁻¹ (Cβ̂ − h) / df / MSE(full)

The degrees of freedom (df) for the numerator is the rank of C(X′X)⁻¹C′.

This way of writing the F* statistic gives a different perspective on the test. Notice that

Var̂(Cβ̂ − h) = C(X′X)⁻¹C′ MSE


Therefore, F ∗ is (C β̂−h)′ times the inverse of its estimated covariance matrix times (C β̂−h).
This is a similar idea to a t∗ statistic. We are essentially standardizing the difference C β̂ − h
based on its covariance matrix. If df = 1, so that C β̂ is a scalar, then F ∗ = (t∗ )2 .

In this test, we are testing H0 : Cβ = h using a variance estimate that theoretically


includes only random error. The numerator and denominator of the F ∗ ratio must be inde-
pendent quantities that both estimate σ 2 under H0 but estimate different quantities under
Ha . Sometimes, if another estimate of σ² is available, it may be used in the denominator
instead of MSE(full), with the corresponding degrees of freedom.
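The Cβ = h form of the test can be computed directly with base R matrix algebra. The sketch continues the hypothetical dat from the previous sketch, fitting the full model of this example (with Xi1²) so that C and h match the hypothesis H0: β3 = 0 and β1 = β2.

full2 <- lm(y ~ x1 + x2 + I(x1^2), data = dat)   # coefficients (beta0, beta1, beta2, beta3)

X    <- model.matrix(full2)
bhat <- coef(full2)
MSE  <- summary(full2)$sigma^2
C    <- rbind(c(0, 0, 0, 1),                     # beta3 = 0
              c(0, 1, -1, 0))                    # beta1 - beta2 = 0
h    <- c(0, 0)

XtXinv <- solve(t(X) %*% X)
d      <- C %*% bhat - h
SSdiff <- drop(t(d) %*% solve(C %*% XtXinv %*% t(C)) %*% d)       # SSE(reduced) - SSE(full)
df1    <- qr(C %*% XtXinv %*% t(C))$rank                          # numerator df
Fstar  <- (SSdiff / df1) / MSE
c(Fstar, pf(Fstar, df1, df.residual(full2), lower.tail = FALSE))  # F* and its p-value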

EXTRA SUMS OF SQUARES

The general linear test is very flexible and can be used to test all sorts of hypotheses.
Certain hypotheses are standard and are often included in an analysis of variance table.
That is, we can partition the total sums of squares into sums of squares that test different
hypotheses. The difference in SSE between a full model and a reduced model is called the
“extra sum of squares” due to fitting the additional parameters.

Example: For a model with three independent variables Xi , i = 1, 2, 3 (each an n × 1 vector),

Y = [ 1 | X1 | X2 | X3 ] (β0 , β1 , β2 , β3)′ + ǫ.

Let SSE(X1 , X2 , X3 ) be the error sum of squares for full model, i.e. SSE(FULL). Define

SSE(X1 ) = error sum of squares for model Yi = β0 + β1 Xi1 + ǫi

SSE(X2 ) = error sum of squares for model Yi = β0 + β2 Xi2 + ǫi

SSE(X3 ) = error sum of squares for model Yi = β0 + β2 Xi3 + ǫi

SSE(X1 , X2 ) = error sum of squares for model Yi = β0 + β1 Xi1 + β2 Xi2 + ǫi

SSE(X1 , X3 ) = error sum of squares for model Yi = β0 + β1 Xi1 + β2 Xi3 + ǫi

and SSE(X2 , X3 ) = error sum of squares for model Yi = β0 + β1 Xi2 + β2 Xi3 + ǫi


If you want to test whether β3 is needed, given that β1 and β2 are already in the model, then
the extra sums of squares are defined as

R(X3 |X1 , X2 ) = SSE(X1 , X2 )−SSE(X1 , X2 , X3 ).

The “R” stands for the reduction in the error sums of squares due to fitting the additional
term. The F ∗ test for β3 = 0 is

F* = R(X3 | X1 , X2) / (df_red − df_full) / MSE(full) = [ SSE(reduced) − SSE(full) ] / (df_red − df_full) / MSE(full)

which is the same test as we had before but expressed in different notation.

The extra sums of squares can also be written as a difference of regression sums of squares.
For example,

R(X3 | X1 , X2) = SSE(X1 , X2) − SSE(X1 , X2 , X3)
               = [ SST − SSR(X1 , X2) ] − [ SST − SSR(X1 , X2 , X3) ]
               = SSR(X1 , X2 , X3) − SSR(X1 , X2)

Sequential Sums of Squares or “Type I sums of squares” (in SAS)

They are extra sums of squares for adding each term given that preceding terms in the model
statement are already in the model. For the model with 3 independent variables (in SAS:
MODEL Y= X1 X2 X3 ;) the Type I sums of squares are:

SOURCE   df      SS
X1       1       R(X1)
X2       1       R(X2 | X1)
X3       1       R(X3 | X1 , X2)
error    n − 4   SSE(X1 , X2 , X3)

These sequential sums of squares add up to SSR(X1 , X2 , X3):

R(X1) + R(X2 | X1) + R(X3 | X1 , X2)
= SSR(X1) + [ SSR(X1 , X2) − SSR(X1) ] + [ SSR(X1 , X2 , X3) − SSR(X1 , X2) ] = SSR(X1 , X2 , X3)


The sequential sums of squares depend on the order that you write the independent
variables in the model statement. For example, the model statement: MODEL Y= X3 X2 X1
would give

SOURCE df SS
X3 1 R(X3 )
X2 1 R(X2 |X3 )
X1 1 R(X1 |X2 , X3 )
error n−4 SSE(X1 , X2 , X3 )

Partial Sums of Squares or “Type II” sums of squares (in SAS)

These are extra sums of squares for adding a term to the model given that all the other
independent variables in the model statement are already in the model. The order of terms
in the model makes no difference. For either of the following model statements

MODEL Y= X1 X2 X3        MODEL Y= X3 X2 X1

the partial sums of squares are the same:

SOURCE   df      SS
X1       1       R(X1 | X2 , X3)
X2       1       R(X2 | X1 , X3)
X3       1       R(X3 | X1 , X2)
error    n − 4   SSE(X1 , X2 , X3)

But these sums of squares do not add up to the total regression sum of squares. In other
words, this is not a partition of the regression sum of squares. The only case in which the
Type II sums of squares do add up to SSR(X1 , X2 , X3) is when the three independent
variables are orthogonal to each other. This does not happen often in the regression setting,
but designed experiments are usually set up to take advantage of this.

Note that if X1 and X2 are uncorrelated, then SSR(X1 ) = R(X1 |X2 ) and SSR(X2 ) =
R(X2 |X1 ).


Example: DATA

 Y   X0   X1   X2
−2    1   −2    1
−3    1    1    0
 0    1    2    0
 2    1    1    3

Y′Y = 17 ,  SST = 17 − 9/4 = 14.75

Regress Y on X0 only:    X′X = 4 ,  (X′X)⁻¹ = 1/4 ,  X′Y = −3 ,  β̂ = −3/4    (SSE = 14.75)

Regress Y on X0 , X1:    X′X = [ 4  2 ; 2  10 ] ,  (X′X)⁻¹ = [ 10/36  −2/36 ; −2/36  4/36 ] ,
                         X′Y = (−3, 3)′ ,  β̂ = (−1.0, 0.5)′    (SSE = 12.50)

Regress Y on X0 , X1 , X2:    X′X = [ 4  2  4 ; 2  10  1 ; 4  1  10 ] ,
                              (X′X)⁻¹ = [ 0.467  −0.076  −0.179 ; −0.076  0.113  0.019 ; −0.179  0.019  0.170 ] ,
                              X′Y = (−3, 3, 4)′ ,  β̂ = (−2.344, 0.642, 1.274)′    (SSE = 2.9481)

SSE(X2) = 6.5833

R(X2 | X1) = 9.552 = SSR(X1 , X2) − SSR(X1)

SOURCE   TYPE I SSq (sequential)      TYPE II SSq (partial)
X1       R(X1) = 2.25                 R(X1 | X2) = ?
X2       R(X2 | X1) = 9.552           R(X2 | X1) = 9.552

? = 3.6353

(last row: Type I and Type II are equal)


Example: (cheese data)

a) We want to know whether or not the variables X1 (ACETIC) and X2 (H2S) should be
added to the model. Does the smaller model do just as well at describing the data as the
full model?
Analysis of Variance: Reduced Model
Source DF Sum of Squares Mean Squares F Pr > F

Regression 1 3800.4 3800.4 27.55 < 0.0001


Error 28 3862.5 137.9
Total 29 7662.887

Analysis of Variance: Full Model


Source DF Sum of Squares Mean Squares F Pr > F

Regression 3 4994.509 1664.836 16.22 < 0.0001


Error 26 2668.378 102.629
Total 29 7662.887

b) The ‘lm’ in R produces the following sequential sums of squares for the cheese data:

> ch.lm = lm(TASTE ~ ACETIC + H2S + LACTIC, data = cheese)

Obtain the following sequential sums of squares to check whether or not each variable
should be added to the model (given the preceding terms): R(X1 ), R(X2 |X1 ), R(X3 |X1 , X2 )

State the appropriate test hypotheses for each F-value computed in the ANOVA table.

Source DF Sum of Squares Mean Squares F Pr > F

ACETIC 1 2314.14 2314.14 22.55 < 0.0001


H2S 1 2147.11 2147.11 20.92 0.0001
LACTIC 1 533.26 533.26 5.20 0.0311


c) Suppose that we had used the different ordering of model:

> ch.lm1 = lm(TASTE ~ H2S + LACTIC + ACETIC, data = cheese)

Compute R(X1 ), R(X2 |X1 ), R(X3 |X1 , X2 ). What is the major difference between results of
b) and c) due to different ordering?

Source DF Sum of Squares Mean Squares F Pr > F

H2S 1 4376.8 4376.8 42.65 < 0.0001


LACTIC 1 617.1 617.1 6.01 0.02123
ACETIC 1 0.6 0.6 0.0054 0.94193

d) In R, > drop1(ch.lm, test = "F") gives the partial sums of squares.

After adjusting for the effects of acetic and lactic concentrations, do we have significant
evidence that the hydrogen sulfide concentration is important in describing taste?

Source DF Sum of Squares Mean Squares F Pr > F

ACETIC 1 0.56 0.56 0.0054 0.9419


H2S 1 1007.69 1007.69 9.8187 0.0042
LACTIC 1 533.26 533.26 5.1959 0.0311

The Lack of Fit Test

When we want to determine whether a specified regression function adequately fits the
data, we can conduct a lack of fit test. However, it is important to note that the test requires
repeated observations (replications) for at least one of the values of the predictors (X). This
test is also based on decomposing sums of squares (due to Errors) and the test procedure
can be derived in the same way as testing the full vs. reduced model.
F* = [ SSLF / ((n − p) − (n − c)) ] / MSPE = [ SSLF / (c − p) ] / MSPE

where p is the number of regression parameters and c is the number of distinct X values,
SSLF denotes the lack of fit sum of squares (SSLF = SSE − SSPE), and SSPE (thus MSPE
for the corresponding mean square) stands for the sum of squares due to pure error.


2.3 Qualitative Independent Variables

Types of variables

• Qualitative variables : Numerical measurements on the phenomena of interest are not


possible. Rather, the observations are categorical.

e.g. gender (female, male), Company status (private, public), Treatment (yes, no),
blood pressure rating (low, average, high)

• Quantitative variables: The observations are in the form of numerical values.

e.g. age, income, temperature, number of defectives, etc.

Qualitative or “classification” variables can be included as explanatory variables in regression


models by using indicator variables (also called ‘dummy’ or ‘binary’ variables).

Example: On average, do smoking mothers have babies with lower birth weight?

Response (Y ): birth weight in grams of baby


X1 : length of gestation in weeks, X2 : Smoking status of mother (smoker or non-smoker)

Then, a first order model with one binary and one quantitative predictor appears to be a
natural model to formulate for these data:

Yi = β0 + β1 Xi1 + β2 Xi2 + ǫi ,   where Xi2 = 1 if mother i smokes and Xi2 = 0 otherwise.

Q. Why not just fit two separate regression functions one for the smokers and one for the
non-smokers?

The combined regression model assumes that the slope for the two groups are equal and
that the variances of the error terms are equal. Then, it is better to use as much data
as possible to estimate standard errors of regression coefficients for testing and confidence
intervals.

Pooling your data and fitting the combined regression function allows you to easily and
efficiently answer research questions concerning the binary predictor variable.
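In R, the binary predictor is handled by coding it as a 0/1 indicator or by declaring it a factor, in which case lm() builds the indicator automatically. The sketch uses hypothetical column names for the birth weight example (a data frame 'births' with weight in grams, gestation in weeks, and smoker coded "yes"/"no").

births$smoker <- factor(births$smoker, levels = c("no", "yes"))   # "no" as the baseline
fit.sm <- lm(weight ~ gestation + smoker, data = births)
summary(fit.sm)   # the 'smokeryes' coefficient plays the role of beta2, the smoking effect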


Example. If we are interested in quantifying the relationship between total population (of
a metropolitan area) and number of active physicians it may be important to take (4)
geographic regions into account. The indicator variables can be easily used to identify each
of 4 regions as follows:
 
X1 = 1 if region 1, 0 otherwise        X2 = 1 if region 2, 0 otherwise
X3 = 1 if region 3, 0 otherwise        X4 = 1 if region 4, 0 otherwise

Then, we can fit a model for the effect of total population on the number of physicians
with a separate intercept for each region:

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + β4 Xi4 + β5 (Pop)i + ǫi

For n = 10 observations, the design matrix would look something like this:

    [ 1  1  0  0  0  X1,5  ]   (region 1)
    [ 1  1  0  0  0  X2,5  ]   (region 1)
    [ 1  0  1  0  0  X3,5  ]   (region 2)
    [ 1  0  1  0  0  X4,5  ]   (region 2)
X = [ 1  0  0  1  0  X5,5  ]   (region 3)
    [ 1  0  0  1  0  X6,5  ]   (region 3)
    [ 1  0  0  1  0  X7,5  ]   (region 3)
    [ 1  0  0  0  1  X8,5  ]   (region 4)
    [ 1  0  0  0  1  X9,5  ]   (region 4)
    [ 1  0  0  0  1  X10,5 ]   (region 4)

and β = (β0 , β1 , β2 , β3 , β4 , β5)′ .
This is not of full rank, so in order to compute parameter estimates we need to either
put restrictions on the parameters or redefine some of the independent variables. When the


design matrix is not of full rank, we are trying to estimate too many parameters. One simple
way to deal with this problem is to drop one column. In the above, columns 2 through 5 of
X matrix add up to the first column. If we drop the last of these columns, X4 , the result is
the following reparameterized model:

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + β4 (Pop)i + ǫi

This model is essentially the same as before, but the parameters have different meanings. For
each region there is a separate intercept and all regions have a common slope.

region1: Yi = β0 + β1 Xi1 + β4 (Pop)i + ǫi

region2: Yi = β0 + β2 Xi2 + β4 (Pop)i + ǫi

region3: Yi = β0 + β3 Xi3 + β4 (Pop)i + ǫi

region4: Yi = β0 + β4 (Pop)i + ǫi

In this parameterization β0 is not the average intercept for all regions. Instead it is the
intercept for region 4. The other parameters such as β1 can be interpreted as the difference
between the intercept for that region and the intercept for region 4.

To test whether all regions have the same intercept we simply test H0 : β1 = β2 = β3 = 0
using a general linear test. Reject H0 if
F* = [ R(X1 , X2 , X3 | Pop) / 3 ] / MSE(full) > F_{1−α}(3, n − 5)
The model with separate intercepts but common slopes may not be realistic. You may need
separate slopes as well. You can fit a model that has entirely distinct lines for each region
using indicator variables as well. One parameterization for this type of model just builds on
our separate intercepts model:

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + β4 (Pop)i + β5 Xi1 (Pop)i + β6 Xi2 (Pop)i + β7 Xi3 (Pop)i + ǫi

In this model β0 is the intercept for region 4 and β4 is the slope for region 4. The terms
β5 ,β6 and β7 give the difference between each region’s slope and the slope for region 4.

Side notes: Fitting the full model this way gives the same slopes and intercepts for the four
regions as if you had fit four separate regression lines. The MSE from the full model is the
pooled variance estimate from the individual regressions:

MSE = (SSE1 + SSE2 + SSE3 + SSE4)/(n1 + n2 + n3 + n4 − 8).

If the variances are different for the different groups, then it is best to do separate regressions
so that you get separate variance estimates. Sometimes pointing out that the variances are
different is just as important as determining whether the slopes are different.

If the variances are similar for the different groups, or if a transformation will make the
variance fairly homogeneous, then it is more efficient to fit the single regression model than
to fit four separate regressions.

There are other ways of dealing with the non-full rank design matrix. In particular, if
we code the data with 1, 0, or −1 for each region as follows,

Z1 = 1 if region 1, 0 if region 2 or 3, −1 if region 4;
Z2 = 1 if region 2, 0 if region 1 or 3, −1 if region 4;
Z3 = 1 if region 3, 0 if region 1 or 2, −1 if region 4;

the resulting design matrix for this model is

X =
  1   1   0   0  X1,5
  1   1   0   0  X2,5
  1   0   1   0  X3,5
  1   0   1   0  X4,5
  1   0   0   1  X5,5
  1   0   0   1  X6,5
  1   0   0   1  X7,5
  1  −1  −1  −1  X8,5
  1  −1  −1  −1  X9,5
  1  −1  −1  −1  X10,5

and β = (β0, β1, β2, β3, β4)′.

The equations for each region are

region1: Yi = β0 + β1 + β4 (Pop)i + ǫi

region2: Yi = β0 + β2 + β4 (Pop)i + ǫi

region3: Yi = β0 + β3 + β4 (Pop)i + ǫi

region4: Yi = β0 − (β1 + β2 + β3 ) + β4 (Pop)i + ǫi

Interaction Effects

A regression model contains interaction effects if the response function is not additive and
cannot be separated into distinct functions of each of the individual predictors. Two predic-
tors interact if the effect on the response variable of one predictor depends on the value of
the other.

Example. Some researchers were interested in comparing the effectiveness of three treatments
(A,B,C) for severe depression and collected a random sample of n = 36 severely depressed
individuals.

Yi = measure of the effectiveness of the treatment for individual i

Xi1 = age (in years) of individual i

Xi2 = 1 if individual i received treatment A and 0, if not.

Xi3 = 1 if individual i received treatment B and 0, if not.

Y 56 41 40 28 55 25 46 71 48 63 52 62
50 45 58 46 58 34 65 55 57 59 64 61
62 36 69 47 73 64 60 62 71 62 70 71
AGE 21 23 30 19 28 23 33 67 42 33 33 56
45 43 38 37 43 27 43 45 48 47 48 53
58 29 53 29 58 66 67 63 59 51 67 63
TRT A B B C A C B C B A A C
C B A C B C A B B C A A
B C A B A B B A C C A C

As the data indicates that X1 and X2 interact and that X1 and X3 interact, we formulate
the model as:

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + β4 Xi1 Xi2 + β5 Xi1 Xi3 + ǫi .

The estimated regression function is

Ŷ = 6.21 + 1.0334 AGE + 41.30 X2 + 22.71 X3 − 0.703 AGE·X2 − 0.510 AGE·X3

When we plug the possible values for X2 and X3 into the estimated regression function, we
obtain three best-fitting lines through the data, one for each treatment (A, B, and C).

Source     DF   Sum of Squares   Mean Squares      F       Pr > F
AGE         1        3424.4         3424.4       222.29   < 0.0001
X2          1         803.8          803.8        52.18   < 0.0001
X3          1           1.2            1.2         0.077    0.7831
AGE·X2      1         375.0          375.0        24.34   < 0.0001
AGE·X3      1         328.4          328.4        21.32   < 0.0001

a) For every age, is there a difference in the mean effectiveness of the three treatments? That
is, test whether the regression functions are identical regardless of treatment:
H0 : β2 = β3 = β4 = β5 = 0 vs. H1 : at least one of these parameters is not 0.

b) Does the effect of age on the treatment's effectiveness depend on the treatment?
H0 : β4 = β5 = 0 vs. H1 : at least one of these parameters is not zero.

(An R sketch of both tests is given below.)
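
As a rough R sketch of these two tests, the data are transcribed from the table above;
treatment C is taken as the reference level, so that the dummies correspond to X2
(treatment A) and X3 (treatment B):

  y   <- c(56,41,40,28,55,25,46,71,48,63,52,62,
           50,45,58,46,58,34,65,55,57,59,64,61,
           62,36,69,47,73,64,60,62,71,62,70,71)
  age <- c(21,23,30,19,28,23,33,67,42,33,33,56,
           45,43,38,37,43,27,43,45,48,47,48,53,
           58,29,53,29,58,66,67,63,59,51,67,63)
  trt <- c("A","B","B","C","A","C","B","C","B","A","A","C",
           "C","B","A","C","B","C","A","B","B","C","A","A",
           "B","C","A","B","A","B","B","A","C","C","A","C")
  trt <- relevel(factor(trt), ref = "C")

  full <- lm(y ~ age * trt)      # main effects plus age-by-treatment interactions
  summary(full)                  # should match the estimated function above, up to rounding

  # a) H0: beta2 = beta3 = beta4 = beta5 = 0 (identical regression functions)
  anova(lm(y ~ age), full)

  # b) H0: beta4 = beta5 = 0 (no age-by-treatment interaction)
  anova(lm(y ~ age + trt), full)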

Piecewise Linear Regression Models

The basic idea behind the piecewise linear regression is that if the data follow different linear
trends over different regions of the data, then we should model the regression function in
“pieces.” The pieces can be connected or not connected. Here, we’ll fit a model in which
the pieces are connected. We also use a dummy variable and an interaction term to define
a piecewise linear regression model.

Example. The data set contains information on compressive strength (Y ) of n = 18 batches
of concrete against the proportion of water (X) mixed in with the cement:

A straight line estimated by the least squares may fit the data fairly well in some overall sense,
but we could do better if we use a piecewise linear function. We instead split our original
scatter plot into two pieces (where the water-cement ratio is 70%) and fit two separate, but
connected lines, one for each piece.

Yi = β0 + β1 Xi1 + β2 (Xi1 − 70)Xi2 + ǫi .

Alternatively, we can write the formulated piecewise model above as:

Yi = β0 + β1 Xi1 + β2 Xi2* + ǫi ,

where Yi is the compressive strength, in pounds per square inch, of concrete batch i,

Xi1 is the water-cement ratio, in %, of concrete batch i,

Xi2 is a dummy variable (0 if Xi1 ≤ 70, and 1 if Xi1 > 70) for concrete batch i, and

Xi2* denotes the interaction term (Xi1 − 70)Xi2 .

Incidentally, the X-value at which the two pieces of the model connect is called the “knot
value.” The estimated regression equation is ‘strength = 7.79 − 0.0663 ratio − 0.101 X2*’,
which yields two estimated regression lines, connected at X = 70, that fit the data quite well.
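
A minimal R sketch of such a connected piecewise fit; the concrete data are not reproduced
here, so the values below are simulated and only the structure of the model matches the
example:

  set.seed(2)
  ratio    <- runif(18, 40, 90)                   # simulated water-cement ratios (%)
  strength <- 7.8 - 0.066 * ratio -
              0.10 * pmax(ratio - 70, 0) + rnorm(18, sd = 0.1)

  x2     <- as.numeric(ratio > 70)                # dummy: 1 if ratio exceeds the knot at 70
  x2star <- (ratio - 70) * x2                     # interaction term (X_i1 - 70) X_i2

  fit <- lm(strength ~ ratio + x2star)            # two straight lines joined at ratio = 70
  summary(fit)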

[Figure: the fit with a single straight line versus the fit with a piecewise linear function.]

Polynomial Regression

In many cases the straight-line relationship between the dependent variable and an
independent variable that we have assumed so far may not be adequate. One way to account for such
a nonlinear relationship is through a polynomial or trigonometric regression model. Such
models are linear in the parameters and the least squares method can be used for estimation
of the parameters as long as the usual assumptions on the errors are appropriate. For
example, a model for a single predictor, X, is:

Y = β0 + β1 X + β2 X² + ... + βp Xᵖ + ǫ,

where p is the degree of the polynomial. For lower degrees, the relationship has a specific
name (i.e., p = 2 is called quadratic, p = 3 is called cubic, p = 4 is called quartic, and
so on). However, polynomial regression models may have other predictor variables in them
as well, which could lead to interaction terms and the model can grow depending on your
application.

The design matrix for the second-degree polynomial model is

X =
  1  X1  X1²
  1  X2  X2²
  ⋮   ⋮    ⋮
  1  Xn  Xn²

For the polynomial regression in one independent variable, an important aspect that
distinguishes it from other multiple regression models is that the mean of the dependent
variable is a function of a single independent variable. The fact that the independent vari-
ables in a simple polynomial model are functions of a single independent variable affects the
interpretation of the parameters. In the second-degree model, the parameter β1 is the slope
only at X = 0. The parameter β2 is half the rate of change in the slope of E(Y ).

The higher-degree polynomial models provide greatly increased flexibility in the response
surface. Although it is unlikely that any complex process will be truly polynomial in form,
the flexibility of the higher-degree polynomials allows any true model to be approximated to
any desired degree of precision.

The full second-degree polynomial model in two variables is

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi1² + β4 Xi1 Xi2 + β5 Xi2² + ǫi .

The degree (or order) of an individual term in a polynomial is defined as the sum of the
powers of the independent variables in the term. The degree of the entire polynomial is
defined as the degree of the highest-degree term.

The squared terms allow for a curved response in each variable and the product term
allows for the surface to be twisted. A quadratic response surface will have a maximum,
a minimum, or a saddle point, depending on the coefficients in the regression equation.
(Refer to Box and Draper (1987) for a further discussion of the analysis of the properties of
quadratic response surfaces.)

Due to its extreme flexibility, some caution is needed in the use of polynomial models; it is
easy to overfit a set of data with polynomial models. Extrapolation is particularly dangerous
when higher-degree polynomial models are being used, since minor extrapolations can have
serious errors. Nevertheless, polynomial response models have proven to be extremely useful
for summarizing relationships.

The polynomial model is built sequentially, starting with a first-degree polynomial
and adding progressively higher-order terms as needed. The lowest-degree polynomial that
accomplishes the degree of approximation needed or warranted by the data is adopted. It is
common practice to retain in the model all lower-degree terms, regardless of their significance,
that are contained in (or are subsets of) any significant term.

Example. Data are from a growth experiment with blue-green algae Spirulina platensis for
the treatment where CO2 is bubbled through the culture. The measure of algae density
is the dependent variable. Consider a cubic polynomial model with the data for the first
replicate. The (ordinary) least squares fit of the model is given by

Ŷi = 0.00948 + 0.53074 Xi + 0.00595 Xi² − 0.00119 Xi³,
     (0.16761)   (0.09343)    (0.01422)     (0.00062)

Day   rep1    rep2        Day   rep1    rep2
 1     .530    .184         8   4.059   3.892
 2    1.183    .664         9   4.349   4.367
 3    1.603   1.553        10   4.699   4.551
 4    1.994   1.910        11   4.983   4.656
 5    2.708   2.585        12   5.100   4.754
 6    3.006   3.009        13   5.288   4.842
 7    3.867   3.403        14   5.374   4.969

where the standard errors of the estimates are given in parentheses. Assuming that a cubic
model is adequate, we can test the hypotheses

a) that a quadratic polynomial model is adequate.

b) that a linear trend model is adequate. SSE(full)= 0.013658, SSE(Red)= 1.458.

c) With two replicates per each day, we can test the adequacy of a quadratic polynomial
model. We fit the quadratic model with whole data (including rep1 and rep2) to obtain

SSE(reduced) = .7984 and pure-error sum of squares = .6344.

Note that the natural polynomials with terms Xi , Xi², Xi³, etc. give (nearly) linearly
dependent columns of the X matrix, leading to multicollinearity problems. For the cubic
polynomial model (as in the above example), a set of orthogonal polynomials can be defined:

O0i = 1,
O1i = 2Xi − 15,
O2i = 0.5 Xi² − 7.5 Xi + 20,
O3i = (5/3) Xi³ − 37.5 Xi² + (698.5/3) Xi − 340,

where O1i , O2i , and O3i are linear combinations of the natural polynomials Xi , Xi², and Xi³.

For the data above, the fitted model is

Ŷi = 3.48164 + 0.19198 O1i − 0.04179 O2i − 0.00072 O3i ,
     (0.03123)   (0.00387)     (0.00433)     (0.00037)

The orthogonal polynomials can be obtained using the Gram–Schmidt orthogonalization
procedure or with the function ‘poly’ in R.
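
For instance, a small R sketch using the Day/rep1 values listed above; note that poly()
returns orthonormal columns, which span the same space as O1i–O3i but are scaled
differently:

  day  <- 1:14
  rep1 <- c(.530, 1.183, 1.603, 1.994, 2.708, 3.006, 3.867,
            4.059, 4.349, 4.699, 4.983, 5.100, 5.288, 5.374)

  fit.raw  <- lm(rep1 ~ day + I(day^2) + I(day^3))   # natural (raw) cubic polynomial
  fit.orth <- lm(rep1 ~ poly(day, 3))                # orthogonal polynomials

  all.equal(fitted(fit.raw), fitted(fit.orth))       # identical fitted values
  cor(cbind(day, day^2, day^3))                      # strong correlations among raw powers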

2.4 Data Transformations

There are many situations in which transformations of the dependent or independent vari-
ables are helpful in least squares regression. Transformations of the dependent variable are
often used as remedies for nonnormality and for heterogeneous variances of the errors.

Note that data transformation definitely requires a ‘trial and error’ approach. In building
the model, we try a transformation and then check to see if the transformation eliminated
the problems with the model. If it doesn’t help, we try another transformation until we have
an adequate model.

Previously, we learned tools for detecting problems with a linear regression model. Once
we’ve identified problems with the model, we have a number of options:

• If important predictor variables are omitted, see whether adding the omitted predictors
improves the model.

• If the mean of the response is not a linear function of the predictors, try a different
functional form, e.g., polynomial regression, or apply a logarithmic transformation to the
response variable; this allows for a nonlinear relationship between the response and the
predictors while remaining within the multiple linear regression framework.

• If there are unequal error variances, try transforming the response and/or predictor
variables or use “weighted least squares regression” (or generalized least squares).

• If an outlier exists, try using a “robust estimation procedure.”

• If error terms are not independent, try fitting a “time series model.”

Another reason for making transformations is to simplify the relationship between the depen-
dent variable and the independent variables. Curvilinear relationships between two variables
frequently can be simplified by a transformation on either one or both of the variables. Many
models nonlinear in the parameters can be linearized, reexpressed as a linear function of the
parameters, by appropriate transformations. For example, the relationship

Y = αX β

is linearized by taking the logarithm of both sides of the equality giving

ln(Y ) = ln(α) + β ln X or Y ∗ = α∗ + βX ∗ .

Logarithms are often used in data transformations because they are connected to common
power curve (as above) and exponential growth Y = α exp(βX) relationships.

It should be remembered that it may not be possible to find a set of transformations
that will satisfy all objectives. A transformation of the dependent variable will change the
variance of the Y -variable and of the errors, or a transformation made to stabilize the
variance may cause nonnormality. Fortunately, transformations for homogeneity of variance
and normality often tend to go hand-in-hand, so that both assumptions are more nearly
satisfied after an appropriate transformation (Bartlett, 1947). The logit, arcsin, and probit
transformations of the Y -variable that are used to stabilize variance and straighten
relationships also make the distribution more normal-like by stretching the tails of the
distribution (values near zero or one) toward a more bell-shaped distribution.

Transformations of the X-variable(s) (e.g., X⁻¹, X², ln(X)) are recommended in practice
when there are strong nonlinear trends in one or more residual plots. Box–Cox transforma-
tions are a family of power transformations on Y (refer to page 16), where λ is a parameter
to be determined using the data.
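
As an illustration (using the built-in cars data rather than any data set from these notes),
the Box–Cox parameter λ can be estimated in R with MASS::boxcox:

  library(MASS)

  fit <- lm(dist ~ speed, data = cars)           # illustrative model on a built-in data set
  bc  <- boxcox(fit, lambda = seq(-2, 2, 0.1))   # profile likelihood over the power lambda
  lambda.hat <- bc$x[which.max(bc$y)]            # lambda maximizing the likelihood

  # Refit with the transformed response (a log transform if lambda is essentially 0)
  if (abs(lambda.hat) < 0.05) {
    fit.t <- lm(log(dist) ~ speed, data = cars)
  } else {
    fit.t <- lm(I(dist^lambda.hat) ~ speed, data = cars)
  }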

It may be that no transformation adequately stabilized the variances, or a transformation
made to simplify a relationship left heterogeneous variances. The weighted least squares and
generalized least squares provide minimum variance linear unbiased (least squares) estimates
when the variance-covariance matrix of the errors is an arbitrary symmetric positive definite
matrix.

2.5 Weighted Least Squares

The least squares regression method discussed so far was based on the assumptions that the
errors are normally distributed, independent random variables with a common variance. Least
squares estimators based on these assumptions are referred to as ordinary least squares
estimates, and they have the desirable property of being the best (minimum variance) among
all possible linear unbiased estimators by the Gauss–Markov theorem. When the normality
assumption is satisfied, the least squares estimators are also maximum likelihood estimators.

If the variance of Y is not constant, there may be a transformation of Y that will stabilize
the variance. This is the most common approach, but under certain circumstances there
is another method that is very attractive. If the variance heterogeneity is known up to a
constant (and the observations Yi are uncorrelated), then we can use the method of weighted
least squares:

Var(ǫi ) = σi² = σ²/wi = Var(Yi )   and   Cov(ǫi , ǫj ) = 0 for i ≠ j.

Let W = diag(w1 , w2 , . . . , wn ).
What we do here is minimizing a sum of squares, (Y −Xβ)′ W (Y −Xβ), where we weight
each observation according to its variability. If an observation is highly variable it gets low
weight, and if an observation has small variance it gets greater weight.

The weighted least squares estimates of β are

b = (X ′ W X)−1 X ′ W Y

where the weighting matrix, W, is proportional to the inverse of the variance-covariance
matrix of Y. Also,

Var(b) = (X′WX)⁻¹ X′W Var(Y) WX (X′WX)⁻¹ = σ²(X′WX)⁻¹.

When you compare the weighted least squares (WLS) results with those from ordinary
least squares, it often happens that the estimated coefficients do not change very much.
However, weighted least squares is more efficient than ordinary least squares; i.e., the
parameters are estimated more precisely. Note, however, that if you use ordinary least
squares when heteroscedasticity is present (non-constant error variances), the estimated
variances of the regression coefficients that come out of a standard regression program will
not actually be the correct variances, so they may look smaller than those under WLS. Also,
if you compute confidence bands or prediction intervals, the ones computed using WLS will
be wider where Var(Yi ) is large and narrower where Var(Yi ) is small, as they should be. This
is probably the most important benefit of using weighted least squares: we want the
confidence and prediction intervals to be narrow when Var(Yi ) is small.

When you do the weighted least squares, it is important to use studentized residuals
rather than raw residuals for diagnostic plots. The raw residuals do not have homogeneous
variances and will not show the results of the weighting, but the studentized residuals do
have homogeneous variances.
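
A minimal R sketch of a weighted least squares fit with lm(), on simulated data in which
the error standard deviation grows with x, so that weights proportional to 1/x² are
appropriate (the setup is purely illustrative):

  set.seed(3)
  n <- 100
  x <- runif(n, 1, 10)
  y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)    # error standard deviation grows with x

  fit.ols <- lm(y ~ x)
  fit.wls <- lm(y ~ x, weights = 1 / x^2)    # weights proportional to 1 / Var(Y_i)
  summary(fit.wls)

  # Use studentized residuals for diagnostics, as discussed above
  plot(fitted(fit.wls), rstandard(fit.wls))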

The weighted least squares is a special case of generalized least squares (GLS) where
the error terms not only may have different variances but pairs of error terms may also be
correlated.

The generalized least squares (GLS) estimator

The generalized least squares estimate of β is

β̂ GLS = (X′V⁻¹X)⁻¹ X′V⁻¹Y,

where V is the variance-covariance matrix of Y . When V is a diagonal matrix, β̂ GLS = β̂ W LS .

Note that there exists an n × n nonsingular matrix K such that KK′ = V , where Var(ǫ) = V ;
premultiplying the model by K⁻¹ then gives uncorrelated, homoscedastic errors, to which
ordinary least squares can be applied.

In practice, V is not known and needs to be estimated by V̂ . Then the estimated
generalized least squares estimate of β is obtained by an iterative algorithm:

β̂ EGLS = (X′V̂⁻¹X)⁻¹ X′V̂⁻¹Y.
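
In R, this kind of estimated GLS fit is available, for example, through gls() in the nlme
package; a rough sketch with AR(1)-correlated errors on simulated data (illustrative only):

  library(nlme)

  set.seed(4)
  n <- 50
  x <- 1:n
  e <- as.numeric(arima.sim(list(ar = 0.6), n = n))   # AR(1) errors
  dat <- data.frame(x = x, y = 1 + 0.5 * x + e)

  # GLS with an AR(1) correlation structure; the correlation parameter
  # is estimated from the data, giving an estimated GLS fit
  fit.gls <- gls(y ~ x, data = dat, correlation = corAR1(form = ~ 1))
  summary(fit.gls)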

2.6 Model Building

In many practical applications, a researcher has a large set of candidate predictor variables
from which he/she tries to identify the most appropriate predictors to include in his regression
model. However, for example, if a researcher has (only) 10 candidate predictor variables,
then there are 210 = 1024 possible regression models from which to choose.

Some models have a natural hierarchy (e.g., polynomial models and models with interactions).
In this case, lower-order terms should not be removed from the model before the higher-order
terms in the same variable.

In general, we want to explain the data in the simplest way: too many variables can cause
collinearity, and unnecessary predictors add noise to the estimation of the other quantities
that we are interested in. However, an underspecified model is a model in which important
predictors are missing, and such a model yields biased regression coefficients and biased
predictions of the response.

Here we aim to learn some variable selection methods such as stepwise regression, best
subsets regression and LASSO regression so that the resulting model is simple and useful.

Stepwise Regression

We build our regression model from a set of candidate predictor variables by entering
and removing predictors in a stepwise manner until there is no justifiable reason to enter
or remove any more. We start with no predictors in our “stepwise model.” Then, at each
step along the way, we either enter or remove a predictor based on partial F-tests, that is,
the t-tests for the slope parameters. (A small R sketch follows the three procedures below.)

(a) Backward Elimination

1. Start with all the predictors in the model

2. Remove the predictor with highest p-value greater than αR

3. Refit the model and go to step 2

4. Stop when all p-values are less than αR

(b) Forward Selection

1. Start with no variables.

2. For each predictor not in the model, check the p-value it would have if it were added
to the model. Add the one with the lowest p-value, provided that p-value is less than αE .

3. Continue until no new predictors can be added.

Note that αR and αE will typically be greater than the usual 0.05 level so that it is
not too difficult to enter predictors into the model.

(c) Stepwise Regression: combination of backward elimination and forward selection.

1. Fit each of the one-predictor models. Among these, the predictor with the smallest
t-test p-value below αE is put in the stepwise model. If no predictor has a p-value
below αE , stop.

2. Fit each of the two-predictor models that include the predictor chosen in step 1. Again,
the predictor with the smallest t-test p-value below αE is added to the model. Then
step back and check whether entering the newly chosen variable affects the significance
of the predictor chosen in step 1: if its p-value is now greater than αR , remove that
variable.

3. Fit each of the three-predictor models that include the two predictors chosen in step 2,
and so on.

4. Continue the same procedure as above with more variables.

5. Stop when adding an additional predictor does not yield a p-value less than αE
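
R's built-in step() function implements a closely related automated search, but it adds and
drops terms using AIC rather than the p-value thresholds αE and αR described above; a
rough sketch on the built-in mtcars data (illustrative only):

  full <- lm(mpg ~ ., data = mtcars)    # all candidate predictors
  null <- lm(mpg ~ 1, data = mtcars)    # intercept-only model

  # Stepwise search in both directions, starting from the null model
  step(null, scope = formula(full), direction = "both")

Packages such as olsrr offer p-value-based forward, backward, and stepwise variants that
are closer to the algorithm described above.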

Warnings:

- The procedure yields a single final model, although there are often several equally good
models. The final model is not guaranteed to be optimal in any specified sense: it does not
mean that all the important predictor variables for predicting Y have been identified, nor
that all the unimportant predictor variables have been eliminated.

- Stepwise regression does not take into account a researcher’s knowledge about the
predictors. It may be necessary to force the procedure to include important predictors.

- One should not over-interpret the order in which predictors are entered into the model.

- It is possible that we may have committed a Type I or Type II error along the way.

Best subsets regression

We select the subset of predictors that does the best at meeting some well-defined ob-
jective criterion, such as having the largest R2 (adjusted R2 ) value or the smallest MSE.

1. Identify all of the possible regression models derived from all of the possible combina-
tions of the candidate predictors.

2. Determine the one-predictor models that do the ‘best’ at meeting some well-defined
criteria, the two-predictor models that do the ‘best’, the three-predictor models that
do the ‘best’, and so on.

3. Further evaluate and refine models by residual analyses, transforming variables, adding
interaction terms, etc. until the model meets necessary conditions allowing you to
answer your research question.

The adjusted R²-value is defined as:

Ra² = 1 − [(n − 1)/(n − p)] (SSE/SST) = 1 − [(n − 1)/(n − p)] (1 − R²) = 1 − [(n − 1)/SST] MSE.

Criterion-based procedures

1. information criterion statistics:

Akaike’s Information Criterion (AIC), the Bayesian Information Criterion (BIC) (also called
Schwarz’s Bayesian Criterion (SBC)), and Amemiya’s Prediction Criterion (APC):

AIC = n ln(SSE) − n ln n + 2p

BIC = n ln(SSE) − n ln n + p ln(n)


APC = [(n + p) / (n(n − p))] SSE

where n: sample size and p: number of regression coefficients in the model being
evaluated including the intercept.

Notice that the only difference between AIC and BIC is the multiplier of p. Each of
the information criteria is used in a similar way in comparing two models, the model
with the lower value is preferred.

The BIC places a higher penalty on the number of parameters in the model, so it will
tend to reward more parsimonious (smaller) models. This addresses one criticism of
AIC, namely that it tends to overfit models.

2. PRESS (prediction residual sum of squares):

   PRESS = Σᵢ₌₁ⁿ (e(i) )²,

where the e(i) ’s are the prediction residuals calculated by omitting the ith observation
from the regression fit (see the R sketch after this list). The model with the lowest
PRESS value is selected. This tends to pick larger models (which may be desirable if
prediction is the objective).

3. Mallows’ Cp -statistic

It estimates the size of the bias that is introduced into the predicted responses by
having an underspecified model.

The bias in predicted response:

Biasi = E(Ŷi ) − E(Yi )

and the total mean squared error of prediction, scaled by σ², is

(1/σ²) Σᵢ₌₁ⁿ E[(Ŷi − E(Yi ))²],

which can be estimated by the Cp statistic:

Cp = p + (MSEp − MSEall )(n − p)/MSEall = SSEp /MSEall + 2p − n,

where MSEp is the mean squared error from fitting the model containing the subset of p − 1
predictors (p parameters including the intercept), and MSEall is the mean squared error
obtained from fitting the model containing all of the candidate predictors.

Using Cp to identify ‘best’ models:

- Identify subsets of predictors for which the Cp value is near p (if possible). The full
model always yields Cp = p, so don’t select the full model based on Cp .

- If all models, except the full model, yield a large Cp not near p, it suggests some
important predictor(s) are missing from the analysis.

- If a number of models have Cp near p, choose the model with the smallest Cp value,
thereby ensuring that the combination of the bias and the variance is at a minimum.

- When more than one model has a small Cp value near p, in general, choose the simpler
model or the model that meets your research needs.
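
A hedged R sketch of these criteria on the built-in mtcars data: leaps::regsubsets performs
the best-subsets search and reports Cp, adjusted R², and BIC for the best model of each
size, while PRESS can be computed from a single fit using the identity e(i) = ei /(1 − hii ).
(R's AIC() and BIC() differ from the formulas above by an additive constant that is the
same for every model fit to the same data, so comparisons are unaffected.)

  library(leaps)

  best <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)   # exhaustive best-subsets search
  s <- summary(best)
  cbind(Cp = s$cp, adjR2 = s$adjr2, BIC = s$bic)           # criteria by model size

  fit <- lm(mpg ~ wt + hp, data = mtcars)                  # one candidate model
  AIC(fit); BIC(fit)
  sum((resid(fit) / (1 - hatvalues(fit)))^2)               # PRESS statistic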

Cross-validation

Cross-validation is a model validation technique in which the regression equation fitted to
the original dataset is used to make predictions for a new dataset. Then we can calculate
the prediction errors (differences between the actual response values and the predictions)
and summarize the predictive ability of the model by the mean squared prediction error
(MSPE). This gives an indication of how well the model will predict in the future.

The sample data is usually partitioned into a training (or model-building) set, which we
can use to develop the model, and a validation (or prediction) set, which is used to evaluate
the predictive ability of the model.

K-fold cross-validation partitions the sample dataset into K parts which are (roughly)
equal in size. For each part, we use the remaining K − 1 parts to estimate the model of
interest (i.e., the training sample) and test the predictability of the model with the held-out
part (i.e., the validation sample). The sum of squared prediction errors can then be
computed, and combining the K estimates of prediction error produces a K-fold cross-
validation estimate.

When K = 2, this is often preferable to residual diagnostic methods and does not take much
longer to compute.

When K = n, this is called leave-one-out cross-validation. That means that n
separate models are trained, each on all of the data except one point, and a prediction is then
made for that point. The resulting assessment of predictive ability is very thorough, but often
computationally expensive. Note that, for linear regression, the leave-one-out cross-validation
estimate of the prediction error is identical to the PRESS statistic.
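
A small R sketch of K-fold cross-validation for a linear regression, written out by hand so
the mechanics are visible (built-in mtcars data, K = 5, and an arbitrary illustrative model):

  set.seed(5)
  K <- 5
  folds <- sample(rep(1:K, length.out = nrow(mtcars)))   # random fold assignment

  cv.err <- numeric(K)
  for (k in 1:K) {
    train <- mtcars[folds != k, ]
    test  <- mtcars[folds == k, ]
    fit   <- lm(mpg ~ wt + hp, data = train)
    pred  <- predict(fit, newdata = test)
    cv.err[k] <- mean((test$mpg - pred)^2)               # MSPE on the held-out fold
  }
  mean(cv.err)    # K-fold cross-validation estimate of the prediction error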

2.7 Multicollinearity

Multicollinearity exists when two or more of the predictors in a regression model are mod-
erately or highly correlated.

Types of multicollinearity

- Structural multicollinearity: a mathematical artifact caused by creating new predictors
from other predictors, such as creating the predictor X² from the predictor X.

- Data-based multicollinearity: a result of a poorly designed experiment, reliance on
purely observational data, or the inability to manipulate the system on which the data are
collected.

When multicollinearity exists, one or more of the following problems can arise:

• the estimated regression coefficient of any one variable depends on other predictors
that are included in the model.

• the precision of the estimated regression coefficients decreases as more predictors are
added to the model.

• the marginal contribution of any one predictor variable in reducing the error sum of
squares depends on the other predictors that are included in the model.

• hypothesis tests for βk = 0 may yield different conclusions depending on which predic-
tors are in the model.

Example. The researchers were interested in determining if a relationship exists between
blood pressure and age, weight, body surface area, duration, pulse rate and/or stress level.
(refer to the data set on the web)

Y : blood pressure (BP, in mm Hg) - 20 individuals with high blood pressure

X1 : age (in years), X2 : weight (in kg), X3 : body surface area (BSA in square meter)

X4 : duration of hypertension (in years), X5 : basal pulse (in beats per minute), X6 : stress
index

Q. What is the effect on the regression analysis if the predictors are perfectly uncorre-
lated?

When there is no apparent relationship at all between the predictors X1 and X2 , so that
the correlation between them is zero (X1 and X2 are orthogonal to each other), the regression
parameter estimates (and their standard errors) remain the same regardless of the model
form used, and the sequential sums of squares are the same as the partial sums of squares.

Reducing Multicollinearity

- Remove one or more of the violating predictors from the regression model.

- Collect additional data under different experimental or observational conditions. (Data
need to be collected in such a way as to ensure that the correlation among the violating
predictors is actually reduced.)

- When multicollinearity is a mathematical artifact caused by creating new predictors
from other predictors (e.g., in polynomial regression), multicollinearity can be reduced by
‘centering the predictors’, i.e., subtracting the mean of the predictor values in the data set
from each predictor value (see the sketch below).
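
A tiny sketch of the effect of centering on a simulated predictor: the correlation between X
and X² is typically close to 1, but it drops sharply once X is centered.

  set.seed(6)
  x  <- runif(100, 10, 20)
  xc <- x - mean(x)          # centered predictor

  cor(x, x^2)                # structural multicollinearity: very close to 1
  cor(xc, xc^2)              # much smaller after centering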

2.8 Ridge Regression

In some cases, the least squares estimates cannot be computed or are subject to very high
variance:

1. X is not of full rank (i.e., its columns are not linearly independent).

2. The number of predictors is larger than the number of responses (p > n).

3. Multicollinearity: X is of full rank, but there exist strong correlations between
variables. ⇒ the β̂j ’s can explode, since (X′X)⁻¹ can grow very large in terms of matrix
norm (leading to a high variance).

4. The number of predictors p is very large (p ≈ n) – estimates suffer from high variance
when more terms are included in the model.

To control the variance of β̂j ’s, we can regularize the coefficients, i.e., control how large they
can grow. Ridge regression (or Tikhonov Regularization) estimators can be computed in all
1-4 cases stated above (Hoerl and Kennard, 1970).

Ridge regression estimator β̂λ minimizes the penalized residual sum of squares criterion:

Jλ (β) = ‖Y − Xβ‖² + λ‖β‖² = Σᵢ₌₁ⁿ (Yi − Xi′β)² + λ Σⱼ₌₁ᵖ βj²,

where λ > 0 is the ridge (shrinkage) parameter. Equivalently,

β̂λ = arg minβ ‖Y − Xβ‖²   subject to   ‖β‖² ≤ s,

which makes explicit the size constraint on the parameters. There is 1-to-1 correspondence
between λ and s. λ offers a tradeoff between the regularizing constraint and minimization
of the sum of squared residuals. The bigger the λ the greater is the amount of shrinkage of
the coefficients toward zero.

Note that Jλ (β) is strictly convex and hence has a unique solution β̂λ . Thus, setting the
derivatives to zero, ∇β Jλ (β) = 0 yields that β̂λ solves

(X ′ X + λI)β = X ′ Y

Then, the ridge regression (RR) estimator has a closed-form solution:

β̂λ = (X ′ X + λI)−1 X ′ Y

The additional “ridge” down the diagonal elements guarantees that X′X + λI is invertible.
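
A minimal R sketch of this closed-form estimator on simulated, standardized data (the
value of λ is arbitrary here; MASS::lm.ridge and glmnet with alpha = 0 provide fuller
implementations):

  set.seed(7)
  n <- 50; p <- 5
  X <- scale(matrix(rnorm(n * p), n, p))     # standardized predictors
  y <- drop(X %*% c(2, -1, 0.5, 0, 0) + rnorm(n))
  y <- y - mean(y)                           # centered response, so no intercept is needed

  lambda <- 1
  ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)   # (X'X + lambda I)^{-1} X'y
  ols   <- solve(t(X) %*% X, t(X) %*% y)
  cbind(ols, ridge)                          # ridge coefficients are shrunk toward zero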

When X′X = I (i.e., X is orthonormal) and n > p (β̂ denotes the OLS estimate):

β̂λ = β̂ / (1 + λ),

• This form illustrates the essential feature of ridge regression: shrinkage towards zero

• The ridge penalty introduces bias but reduces the variance of the estimate (bias-
variance tradeoff).

As λ → 0, β̂λ → β̂, whereas as λ → ∞, β̂λ → 0.

Theorem. If ǫi ’s are i.i.d. zero mean with variance Var(ǫi ) = σ 2 , then the bias of the ridge
regression estimator is
Bias(β̂λ ) = E[β̂λ ] − β = −λRλ β

and the variance-covariance matrix is

Var(β̂λ ) = E[(β̂λ − E[β̂λ ])(β̂λ − E[β̂λ ])′]

= σ² Rλ (X′X) Rλ ,   where Rλ = (X′X + λI)⁻¹.

Recall that mean squared error (MSE) = variance + (bias)², and in multiparameter problems:

MSE(β̂λ ) = E[(β̂λ − β)(β̂λ − β)′] = Var(β̂λ ) + Bias(β̂λ ) · [Bias(β̂λ )]′,

which also implies for the “total MSE” that

Σᵢ₌₁ᵖ MSE(β̂λ,i ) = Σᵢ₌₁ᵖ Var(β̂λ,i ) + Σᵢ₌₁ᵖ [Bias(β̂λ,i )]²,

where the variance term decreases and the squared-bias term increases as λ increases.

There always exists λ such that the total MSE of β̂λ is smaller than the total MSE of
the least squares estimate β̂.

Q. How to choose the shrinkage parameter λ?

• Hoerl and Kennard (1970) suggested using ridge traces: plot the components of β̂λ
against λ, or rather against the effective degrees of freedom DF(λ) = tr{X(X′X + λI)⁻¹X′}
due to the 1-to-1 relation, and choose a λ for which the coefficients are not rapidly changing
and have sensible signs.

• This is a heuristic approach (requiring extensive computations for different choices of
λ) and can be criticized from many perspectives.

• Nowadays, a more standard approach is to use cross-validation (CV) or information-
theoretic criteria such as the Bayesian Information Criterion (BIC).
[Figure: ridge regression coefficient paths — the coefficients of the predictors (ltg, bmi, ldl,
map, tch, hdl, glu, age, sex, tc) plotted against DF(λ).]

2.9 LASSO (Least Absolute Shrinkage and Selection Operator)

Another way to deal with variable selection is to use regularization (or penalization) - regu-
larize the regression coefficients. LASSO (Tibshirani,1996) has been a popular technique for
simultaneous linear regression estimation and variable selection. Specifically, we define β̂ to
minimize the penalized sums of squares

‖Y − Xβ‖² + λ‖β‖₁ ,

or, equivalently,

β̂ = arg minβ Σᵢ₌₁ⁿ (Yi − Xi′β)²   subject to   ‖β‖₁ ≤ s.

As before (in the ridge regression), the shrinkage (penalty) parameter λ offers a tradeoff
between the regularizing constraint and minimization of the sum of squared residuals: the
bigger the λ, the greater the amount of shrinkage. With the LASSO, some of the estimated
coefficients are shrunk all the way to zero, which is why λ also plays an important role in
automated variable selection.
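
In R the LASSO is commonly fit with the glmnet package (alpha = 1 gives the LASSO
penalty, alpha = 0 the ridge penalty); a rough sketch on simulated data, with λ chosen by
cross-validation:

  library(glmnet)

  set.seed(8)
  n <- 100; p <- 8
  X <- matrix(rnorm(n * p), n, p)
  beta <- c(1.5, 0, 0.8, 0, 0, -1, 0, 0)      # several coefficients are truly zero
  y <- drop(X %*% beta + rnorm(n))

  cvfit <- cv.glmnet(X, y, alpha = 1)         # 10-fold CV over a grid of lambda values
  coef(cvfit, s = "lambda.min")               # some coefficients are shrunk exactly to zero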

Example: Prostate cancer data (n = 97, p = 8) used in many text-books.

To study the relationship between Y = the level of prostate-specific antigen (lpsa) and a
number of clinical measures in men who were about to receive a radical prostatectomy:

X1 = log cancer volume (lcavol)

X2 = log prostate weight (lweight)

X3 = age

X4 = log of the amount of benign prostatic hyperplasia(lbph),

X5 = seminal vesicle invasion (svi, binary)

X6 = log of capsular penetration (lcp),

X7 = Gleason score (gleason, ordered categorical)

X8 = percent of Gleason scores 4 or 5 (pgg45)

Table 1: Coefficient estimates (blank entries: variable not selected by the LASSO)

              OLS      LASSO
intercept    0.669     0.355
lcavol       0.587     0.516
lweight      0.454     0.345
age         -0.020
lbph         0.107     0.051
svi          0.766     0.567
lcp         -0.105
gleason      0.045
pgg45        0.005     0.002

[Figure: LASSO coefficient paths for the prostate data — the coefficients (lcavol, lweight,
age, lbph, svi, lcp, gleason, pgg45) plotted against the normalized ‖β̂λ‖₁. The dashed
vertical line depicts the minimum BIC value.]

2.10 Robust Regression

Recall that the ordinary least squares estimates for linear regression are optimal when all
of the regression assumptions are valid. When the errors are nonnormal, or there are big
outliers, least squares regression can perform poorly. Robust regression is not overly affected
by violations of assumptions by the underlying data-generating process.

Robust loss functions

1. Least absolute deviations (LAD) with L1 -loss: ρ(e) = |e|

2. Huber’s loss function: convex and differentiable,

   ρk (e) = (1/2) e²          for |e| ≤ k,
            k|e| − (1/2) k²   for |e| > k,

where k is a user-defined tuning constant that affects the robustness and efficiency of the
method.

3. Tukey’s biweight loss function:

   ρk (e) = min{ 1, 1 − (1 − (e/k)²)³ }

4. Least trimmed squares (LTS):

   Σᵢ₌₁ʰ (r²)(i) ,

where (r²)(1) ≤ . . . ≤ (r²)(n) are the order statistics of the squared residuals and h < n.
The LTS estimator chooses the regression coefficients that minimize the sum of the h
smallest squared residuals. (A small R sketch of these robust fits follows below.)
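
A hedged R sketch of such robust fits, using the built-in stackloss data purely as an
illustration: MASS::rlm performs M-estimation with the Huber loss (the default) or Tukey's
biweight, MASS::lqs with method = "lts" gives least trimmed squares, and the quantreg
package provides the LAD (L1) fit.

  library(MASS)

  fit.huber    <- rlm(stack.loss ~ ., data = stackloss)                      # Huber loss
  fit.biweight <- rlm(stack.loss ~ ., data = stackloss, psi = psi.bisquare)  # Tukey's biweight
  fit.lts      <- lqs(stack.loss ~ ., data = stackloss, method = "lts")      # least trimmed squares

  # LAD (L1) regression, if the quantreg package is installed:
  # library(quantreg); fit.lad <- rq(stack.loss ~ ., data = stackloss)

  coef(fit.huber)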

[Figure: the L1 loss, Huber loss, and Tukey biweight loss functions.]
