Regression Analysis
Regression, one of the most widely used statistical techniques, estimates relationships among variables. Regression models provide a very flexible framework for describing and testing hypotheses about relationships between explanatory variables and a response variable.
The basis of regression analysis is the linear model, which can be characterized as follows. We have n sets of observations {X1i, X2i, ..., Xpi, Yi}, i = 1, ..., n, which represent a random sample from a larger population. It is assumed that these observations satisfy a linear relationship
Yi = β0 + β1 X1i + β2 X2i + · · · + βp Xpi + ǫi,
where the β coefficients are unknown parameters, and the ǫi are random error terms.
Note: By a linear model, it is meant that the model is linear in the parameters.
A linear model may not be a true representation of reality, but it often provides a useful approximation that works well for a wide range of circumstances.
George Box: “All models are wrong, but some are useful.”
e.g. Fahrenheit = (9/5) × Celsius + 32
Circumference = π × diameter
Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables.
Yi = β0 + β1 Xi + ǫi , ǫi ∼ uncorrelated (0, σ 2 )
The first step in any analysis is to look at the data; in the regression context, that means
looking at histograms and a scatter plot. Estimating the unknown parameters β0 and β1
corresponds to putting a straight line through the point cloud in the scatter plot.
The standard approach is the least squares regression, where the estimates are chosen
to minimize
Σ_{i=1}^{n} (Yi − β0 − β1 Xi)².
This is a standard calculus problem, and was solved for the first time either by Legendre in
1805, or by Gauss in 1794 (Legendre published first). It can be shown that the least squares
estimates satisfy
β̂1 = Sxy / Sxx and β̂0 = Ȳ − β̂1 X̄, where
Sxx = Σ_i (Xi − X̄)² = Σ Xi² − (Σ Xi)²/n,
Sxy = Σ_i (Xi − X̄)(Yi − Ȳ) = Σ Xi Yi − (Σ Xi)(Σ Yi)/n.
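As a quick illustration (added here; not part of the original notes), the following R sketch computes Sxx, Sxy and the least squares estimates by hand for a small simulated data set and checks them against R's built-in lm(). All variable names and numbers are arbitrary.

# Least squares estimates for simple linear regression, computed by hand
set.seed(1)
x <- runif(20, 0, 10)                # simulated predictor (arbitrary)
y <- 2 + 0.5 * x + rnorm(20)         # simulated response: true beta0 = 2, beta1 = 0.5
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
b1  <- Sxy / Sxx                     # slope estimate
b0  <- mean(y) - b1 * mean(x)        # intercept estimate
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))                      # should agree with the hand computation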
Terminology
• The predicted (or fitted) values: Ŷi = β̂0 + β̂1 Xi for i = 1, ..., n
• The squared prediction error for data point i: ei² = (Yi − Ŷi)²
• Because the formulas for β̂0 and β̂1 are derived using the least squares criterion, the resulting equation is often referred to as the “least squares regression line”, or simply the “least squares line”. It is also sometimes called the estimated regression equation.
We need a few assumptions in order to justify using the least squares regression:
(a) the expected value of the errors is zero, i.e. E(ǫi) = 0 for all i (equivalently, the mean response E(Yi) at each value of the predictor Xi is a linear function of Xi).
(b) the errors are uncorrelated, i.e. Cov(ǫi, ǫj) = 0 for all i ≠ j.
(c) the variance of the errors is constant (equal), i.e. Var(ǫi) = σ² for all i.
Gauss-Markov theorem: under the above conditions of the simple linear model (SLM) the
least squares estimators are unbiased and have minimum variance among all unbiased linear
estimators.
Four dose levels of ozone and the resulting mean seed yield of soybeans are given. The
dose of ozone is the average concentration (parts per million, ppm) during the growing
season. Yield is reported in grams per plant.
X = Ozone (ppm)    Y = Yield (gm/plant)
0.02               242
0.07               237
0.11               231
0.15               201

Yi      Ŷi      ei      ei²
242
237
231
201
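A short R sketch (illustrative, not from the notes) fits the least squares line to these four (X, Y) pairs and produces the fitted values and residuals needed to fill in the worksheet table above.

# Ozone/soybean example: fit the line and compute fitted values and residuals
ozone <- c(0.02, 0.07, 0.11, 0.15)
yield <- c(242, 237, 231, 201)
fit <- lm(yield ~ ozone)
coef(fit)                                   # estimates of beta0 and beta1
cbind(Y = yield, Yhat = fitted(fit), e = resid(fit), e2 = resid(fit)^2)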
1.1 INTERVAL ESTIMATES and HYPOTHESES TESTS
For interval estimates and hypothesis tests we add the normality assumption: ǫi ∼ Normal(0, σ²), i.e. each ǫi is normally distributed, independent of all other ǫj, with mean zero and the same variance σ².
Since the random errors are unknown, the residuals can be used to get an estimate of
the error variance:
σ̂² = (1/(n − p)) Σ_{i=1}^{n} ei² = (1/(n − p)) Σ_{i=1}^{n} (Yi − Ŷi)² = MSE
(c) Var(β̂1) = σ²/Sxx
(f) Both β̂0 and β̂1 are normally distributed. (MSE is independent of both β̂0 and β̂1.)
Under the normality assumption, in general, the 100(1 − α)% confidence interval for βk is β̂k ± t_{1−α/2, n−p} s(β̂k).
Reject H0: βk = 0 if
|t*| = |β̂k − 0| / s(β̂k) > t_{1−α/2, n−p}; otherwise do not reject H0.
A 100(1 − α)% confidence interval for σ² is given by
( (n − 2)MSE / χ²_{n−2, 1−α/2} ,  (n − 2)MSE / χ²_{n−2, α/2} ).
E(Y ) at X=X0 = β0 + β1 X0
Thus, a 100(1 − α)% confidence interval for the mean response of Y at a fixed value X0 is
given by
(β̂0 + β̂1 X0) ± t_{1−α/2, n−2} √( MSE [ 1/n + (X0 − X̄)²/Sxx ] ).
REMARK: In the simple linear regression case, the F* test using F* = Σ_{i=1}^{n} (Ŷi − Ȳ)² / MSE is the same as the t* test for H0: β1 = 0 vs. Ha: β1 ≠ 0. In general, the t* statistic is the square root of F* for one-degree-of-freedom (two-sided) hypothesis tests. So if a two-sided test is desired, either t* or F* gives identical results (the numerator of F* has one degree of freedom).
The rejection rule is: Reject H0 if F* > F_{1−α, df1, df2}; otherwise do not reject H0.
b) Test H0: β1 = 0 vs. Ha: β1 ≠ 0.
c) Find the 95% prediction interval for the soybean yield at X0 = .02.
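For parts b) and c), a hedged R sketch using the ozone and yield vectors from the earlier code: summary() reports the t* test of H0: β1 = 0, confint() gives confidence intervals for the coefficients, and predict() gives the 95% prediction interval at X0 = 0.02.

# Inference for the ozone/soybean fit ('fit' from the previous sketch)
summary(fit)                                    # t* statistic and p-value for H0: beta1 = 0
confint(fit, level = 0.95)                      # confidence intervals for beta0 and beta1
predict(fit, newdata = data.frame(ozone = 0.02),
        interval = "prediction", level = 0.95)  # 95% prediction interval at X0 = 0.02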
1.2 Analysis of Variance approach
SUMS OF SQUARES
The total variation of Y, corrected only for the mean (SST), is a measure of the variation of Y around the mean Ȳ. The total variation can be split up into two component sums of
squares, sum of squares due to regression (SSR) and sum of squares due to errors (SSE).
SSR is the variation of the fitted points around the mean and SSE is the residual variation
in Y that is not explained by the regression curve.
Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} (Ŷi − Ȳ)² + Σ_{i=1}^{n} (Yi − Ŷi)²
This kind of partitioning is called the analysis of variance because the total sum of
squares (SST) is the sum of squares that is used to compute the variance of Y if no model
is fit. It is this variance that is being “analyzed” or partitioned in the name of analysis of
variance (ANOVA).
Typically this breakdown of sums of squares is put into an analysis of variance table: Let
p denote the number of parameters to be estimated in a regression model.
This immediately implies that a good regression is one with a large R², where
R² = Σ_{i=1}^{n} (Ŷi − Ȳ)² / Σ_{i=1}^{n} (Yi − Ȳ)² = SSR/SST = 1 − SSE/SST.
The R2 value (called the coefficient of determination) measures the proportion of variability
in Y accounted for by the regression. Values closer to 1 indicate a strong regression, while
values closer to 0 indicate a weaker one. When it is small, it may be that there is a lot of
random inherent variation in the data.
Warnings
1. A large R2 value does not imply that the estimated regression line fits the data well.
Another function might better describe the trend in the data.
2. One data point (or a few data points) can greatly affect the R2 value.
4. Correlations that are based on rates or averages tend to overstate the strength of an
association.
5. A “statistically significant” R² value does not imply that the slope is meaningfully different from 0, and a large R² value does not necessarily imply useful predictions.
1.3 Matrix approach to Simple Linear Models
The rank of a matrix is the maximum number of linearly independent columns which
can be selected from the columns of the matrix. Note that if the rank of a matrix is 1, then there is one column such that all other columns are direct multiples of it.
For any matrix A, the rank of A is the same as the rank of A′ A. The row rank of any
matrix is always equal to the column rank.
RANDOM VECTORS
Let Y = (Y1 , Y2 , ..., Yn )′ be a random vector with the probability density function (pdf)
denoted by f (y) (describing how Y1 , Y2 , ..., Yn are jointly distributed).
Then, E(Y ) = (E(Y1 ), E(Y2 ), ..., E(Yn ))′ and the variance-covariance matrix of Y is
[ Var(Y1)       Cov(Y1, Y2)   · · ·   Cov(Y1, Yn)
  Cov(Y2, Y1)   Var(Y2)       · · ·   Cov(Y2, Yn)
  ⋮             ⋮             ⋱       ⋮
  Cov(Yn, Y1)   Cov(Yn, Y2)   · · ·   Var(Yn) ]
defined by E[(Y − E(Y ))(Y − E(Y ))′ ] = Var(Y ). Since Cov(Yi , Yj ) = Cov(Yj , Yi ), it
follows that the variance-covariance matrix is symmetric.
BASIC RESULTS
Suppose that A and B are matrices of constants (of conformable dimensions) and that c is a vector of constants. Let V be the variance-covariance matrix of Y.
i) E(AY) = A E(Y)
E(Y′AY) = µ′Aµ + tr(AV), where E(Y) = µ and tr(W) = Σ_{i=1}^{n} wii is the trace of the n × n matrix W.
E(Σ_{k=1}^{m} ak Uk) = Σ_{k=1}^{m} ak E(Uk)
Definition: The random vector Y = (Y1 , Y2 , ..., Yn )′ is said to have a multivariate normal distribution
with mean µ and variance-covariance matrix V if its (joint) probability density function is
given by
fY(y) = (2π)^{−n/2} |V|^{−1/2} exp{ −(1/2) (y − µ)′ V⁻¹ (y − µ) },
for all y ∈ Rⁿ. Shorthand notation for this statement is Y ∼ MVN(µ, V).
If Y ∼ MVN(µ, V), then for a constant matrix A and constant vector c, AY + c ∼ MVN(Aµ + c, AVA′).
For more complicated models it is useful to write the model in matrix notation. When we expand out the simple linear model Yi = β0 + β1 Xi + ǫi, i = 1, . . . , n:
Y1 = β0 + β1 X1 + ǫ1
Y2 = β0 + β1 X2 + ǫ2
⋮
Yn = β0 + β1 Xn + ǫn
In matrix form, stacking the n equations gives
Y = Xβ + ǫ,
where Y = (Y1, ..., Yn)′, X is the n × 2 design matrix whose ith row is (1, Xi), β = (β0, β1)′, and ǫ = (ǫ1, ..., ǫn)′.
The normal equations can be written in matrix notation as well. They are
X ′ Xβ = X ′ Y
or, written out for the simple linear model,
[ n       Σ Xi   ] [ β0 ]   =   [ Σ Yi    ]
[ Σ Xi    Σ Xi²  ] [ β1 ]       [ Σ Xi Yi ]
Thus, if the design matrix X has full column rank, then the normal equations can be
solved by pre-multiplying both sides by (X′ X)−1 .
β̂ = (X′ X)−1 X ′ Y
You should verify for yourselves that in the simple linear regression case
" #
Y − β̂1X
β̂ = (X′ X)−1 X ′ Y = Sxy
Sxx
Recall that the random vector ǫ is what makes the above a statistical model. One of the assumptions of the ordinary regression model is that the ǫi are uncorrelated with mean 0 and common variance σ².
The least squares estimates of the parameters are random vectors, since they are linear combinations of Y:
E(β̂) = (X′X)⁻¹X′E(Y) = (X′X)⁻¹X′Xβ = β
Var(β̂) = (X′X)⁻¹X′ (σ² In) X(X′X)⁻¹ = σ²(X′X)⁻¹, when Cov(ǫi, ǫj) = 0 for i ≠ j and Var(ǫi) = σ² for all i.
The residuals are deviations of the observed points from the fitted regression line. They
are important for assessing the fit of the model and for computing an estimate of the variance.
e = Y − Ŷ = Y − Xβ̂ = Y − HY = (I − H)Y, where H = X(X′X)⁻¹X′ is the hat matrix.
Properties of the residuals that are used in checking the model assumptions:
1. The sum of squares of the residuals is minimized by the criterion for least squares.
2. X ′ e = 0 (The residuals are orthogonal to the columns of X matrix). This means that
a) For a model with an intercept (a column of ones) the sum of residuals is zero because
1′e = Σ_{i=1}^{n} ei = 0
This means that if you tried to regress the residuals on X you would get a slope of zero
and the scatter plot of e vs X would be centered about zero.
3. The residuals are also orthogonal to Ŷ, since Ŷ′e = β̂′X′e = 0. Notice that the residuals are not orthogonal to the observed values of Y.
Note that the ANOVA sums of squares are all quadratic forms.
Simple example. DATA (X, Y): (4, 5), (7, 7), (5, 6), (4, 6). Assume that σ² = 0.25 is known. In matrix form,
(5, 7, 6, 6)′ = X (β0, β1)′ + ǫ, where X has rows (1, 4), (1, 7), (1, 5), (1, 4).
We find the estimated parameters using the formula β̂ = (X′ X)−1 X′ Y = (3.5, 0.5)′ (Verify!)
(3) Estimate the mean value of Y at X = 6 and give a 95% confidence interval.
We are estimating l′β = β0 + 6β1 with l = [1, 6]′, estimated by l′β̂. Its estimated variance is
V̂ar(l′β̂) = l′ V̂ar(β̂) l = [1, 6] [ 1.1042  −0.2083 ; −0.2083  0.0417 ] [1, 6]′ = 0.1042.
Note that the 97.5% quantile of the t distribution with 2 d.f. = 4.30.
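The small example above can be reproduced with a few lines of R matrix algebra (added for illustration); the numbers agree with those quoted in the notes.

# Matrix computations for the simple example with sigma^2 = 0.25 known
X <- cbind(1, c(4, 7, 5, 4))              # design matrix
Y <- c(5, 7, 6, 6)
XtX_inv <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% Y        # (3.5, 0.5)'
V <- 0.25 * XtX_inv                       # Var(beta_hat) with sigma^2 = 0.25
l <- c(1, 6)                              # estimating l'beta = beta0 + 6*beta1
sum(l * beta_hat)                         # point estimate, 6.5
t(l) %*% V %*% l                          # estimated variance, 0.1042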
1.4 DIAGNOSTICS and MODEL EVALUATION
When a simple linear model is selected for an application, one cannot be certain in advance
that the model is appropriate to the dataset in question. Note that all the tests, intervals,
predictions, etc., are based on believing that the assumptions of the regression hold. To assess
whether the assumptions underlying the model seem reasonable, we study the residuals with
graphical analyses. That is, we analyze the residuals to see if they support the assumptions
of linearity, independence, normality and equal variances.
Residuals vs. fitted values (or vs. predictor) plot:
- The residuals should roughly form a horizontal band around the 0 line, suggesting that the variances of the error terms are equal.
- If the predictor on the X axis is a new and different predictor, the residuals vs. predictor plot can help to determine whether the predictor should be added to the model (and hence a multiple regression model used instead).
Normal probability (Q–Q) plot of the residuals:
- constructed by plotting the n ordered residuals against the n ordered quantiles from the standard normal distribution.
- It assesses the apparent normality of the residuals and should look (roughly) like a straight line.
- Isolated points represent unusual observations, while a curved line indicates that the errors are probably not normally distributed, and tests and intervals might not be trustworthy.
Residuals vs. time (order) plot:
- To check whether your data have a time structure. If not, there should be no apparent pattern.
[Figure: sketches of residuals ei plotted against index i, one showing an acceptable (patternless) plot and one showing autocorrelation.]
OUTLIERS: An outlier is an observation that is unusually small or large; i.e. outliers are
the data points that do not fit well with the pattern of the rest of the data. In straight-line
regression, an outlier might be an observation that falls far off the apparent approximate
straight line trajectory followed by the remaining observations. Practitioners often “toss
out” such anomalous points, which may or may not be a good idea. If it is clear that an
outlier is the result of a mishap or a gross recording error, then this may be acceptable. On
the other hand, if no such basis may be identified, the outlier may, in fact, be a genuine
response; in this case, it contains information about the process under study, and may be
reflecting a legitimate phenomenon. Then, “throwing out” an outlier may lead to misleading
conclusions, because a legitimate feature is being ignored.
Possible ways of dealing with outliers:
a. Delete outliers and redo the analysis (new outliers may surface).
b. Sometimes the purpose of the experiment is just to identify the outliers. In this case,
there is no need to redo the analysis.
c. Check the experimental circumstances surrounding the data collection for the outlying
cases.
d. Report the analysis both with and without the outliers and let the reader decide.
LEVERAGES: To identify outliers, we should consider first looking at the residual plot of ei versus Ŷi. Recall the property of the residuals that Var(e) = σ²(I − H), so that Var(ei) = σ²(1 − hii), where (in simple linear regression)
hii = 1/n + (Xi − X̄)²/Sxx, which is called the leverage of the ith case.
So the residuals (unlike the errors) do not have constant variance and are slightly correlated!
Observations where Xi is far away from X will have large values of hii . However, not all
observations with large leverages are necessarily outliers.
The studentized residuals
ri = ei / √( MSE (1 − hii) )
have E(ri) = 0 and Var(ri) = 1, so that the studentized residuals have a constant variance regardless of the location of the X's. Values of |ri| larger than 3 or so should cause concern.
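In R, leverages and studentized residuals are available directly from a fitted lm object; a minimal sketch (assuming some fitted model called fit):

# Leverages and (internally) studentized residuals for a fitted lm object 'fit'
h <- hatvalues(fit)         # leverages h_ii
r <- rstandard(fit)         # studentized residuals e_i / sqrt(MSE * (1 - h_ii))
which(abs(r) > 3)           # flag observations with |r_i| larger than about 3
which(h > 2 * mean(h))      # a common rule of thumb for flagging large leverages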
TRANSFORMATIONS: Sometimes it may be that, although the data do exhibit constant variance on the original scale, they may be analyzed better on some transformed scale. It is also
important to remember that if a transformation is used, the resulting inferences apply to
this transformed scale (and no longer to the original scale). One should remember that
transforming the data may fix one problem, but it may create other violations of the model.
In general, transforming your data almost always involves lots of trial and error.
The Box–Cox family of power transformations, Y^(λ) = (Y^λ − 1)/λ for λ ≠ 0 and ln(Y) for λ = 0, was suggested by Box and Cox (1964). The log and square root transformations are special cases with λ = 0 and λ = 1/2, respectively. Approximate 100(1 − α) percent confidence intervals for λ are available.
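In R, the Box–Cox profile and an approximate confidence interval for λ can be obtained with boxcox() from the MASS package; a minimal sketch, assuming a fitted model fit with a strictly positive response:

# Box-Cox transformation: profile log-likelihood for lambda
library(MASS)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))  # plots the profile with an approximate 95% interval
bc$x[which.max(bc$y)]                             # value of lambda maximizing the profile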
Some guidelines: You may want to try transformations of the Y -variable when there is
evidence of non-normality and/or non-constant variance problems in one or more residual
plots. Try transformations of the X-variable(s) (e.g., X −1 , X 2 , ln (X)) when there are strong
nonlinear trends in one or more residual plots. If either of the above trials does not work,
consider transforming both the response Y and the predictor X (e.g. ln (Y ), ln (X)).
If the error variances are unequal, try ‘stabilizing the variance’ by transforming Y:
- When the response is a Poisson count, take the √Y transformation.
- For a binomial proportion, use the ‘arcsine transformation’ sin⁻¹(√y), or more commonly use logistic regression.
- When the response is not anything special but shows unequal variances, use ln(Y) or 1/Y.
2 Multiple Regression
While the straight-line model serves as an adequate description for many situations, more
often than not, researchers who are engaged in model building consider more than just one
predictor variable X. In fact, it is often the case that the researcher has a set of p − 1
candidate predictor variables, say, X1 , X2 , ..., Xp−1 , and desires to model Y as a function of
one or more of these p − 1 variables. To accommodate this situation, we must extend our
linear regression model to handle more than one predictor variable.
Yi = β0 + β1 Xi1 + β2 Xi2 + · · · + βp−1 Xi,p−1 + ǫi
for i = 1, 2, ..., n, where n > p and ǫi ∼ N(0, σ²). The values β0, β1, ..., βp−1 are regression coefficients as before, and we assume that X1, X2, ..., Xp−1 are all fixed. The random errors ǫi are still assumed to be independent and to have a normal distribution with mean zero and a common variance σ². Then, in matrix form,
Y = Xβ + ǫ,
where Y is the n × 1 response vector, X is the n × p design matrix, β = (β0, ..., βp−1)′, and ǫ is the n × 1 error vector.
In the multiple regression setting, because of the potentially large number of predictors, it
is more efficient to use matrices to define the regression model and the subsequent analyses.
Note that the matrix X is called the design matrix, since it contains all of the predictor
variable information.
The least squares method is used to estimate the regression parameters β0 , β1 , ..., βp−1
for which we can easily find closed-form solutions in terms of matrices and vectors.
β̂ = (X′X)⁻¹X′Y
To avoid the more technical details of working with non-full rank matrices we will assume
that X is of full rank, unless otherwise stated.
Example. The taste of matured cheese is related to the concentration of several chemicals
in the final product. In a study of cheddar cheese from the LaTrobe Valley of Victoria,
Australia, samples of cheese were analyzed for their chemical composition and were subjected
to taste tests. Overall taste scores were obtained by combining the scores from several tasters.
Data were collected on the following variables: Y = TASTE (the combined taste score), X1 = ACETIC (acetic acid concentration), X2 = H2S (hydrogen sulfide concentration), and X3 = LACTIC (lactic acid concentration). Variables ACETIC and H2S are both on the (natural) log scale. The variable LACTIC has not been transformed. The table below contains the concentrations of the various chemicals in n = 30 specimens of mature cheddar cheese and the observed taste score.
Suppose that the researchers postulate that each of the three chemical composition covariates X1, X2, and X3 is important in describing the taste. In this case, they might posit the full model
Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + ǫi
for i = 1, 2, ..., 30. Are there other predictor variables that influence taste not considered here?
Alternatively, what if not all of X1 , X2 , and X3 are needed in the model? For example, if
the acetic acid concentration (X1 ) is not helpful in describing taste, then we might consider
a smaller model which excludes it; i.e.,
Yi = β0 + β2 Xi2 + β3 Xi3 + ǫi
for i = 1, 2, ..., 30. The goal of any regression modeling problem should be to identify each
of the important predictors, and then find the simplest model that does the best job.
where
Y = (12.3, 20.9, 39.0, ..., 5.5)′,
X = [ 1  4.543  3.135  0.86
      1  5.159  5.043  1.53
      1  5.366  5.438  1.57
      ⋮
      1  6.176  4.787  1.25 ],
β = (β0, β1, β2, β3)′.
We compute
X′X = [ 30       164.941   178.254   43.260
        164.941  916.302   1001.806  240.879
        178.254  1001.806  1190.343  269.113
        43.260   240.879   269.113   65.052 ]
and X′Y = (736.000, 4194.442, 5130.932, 1162.065)′.
Thus, the least squares estimate of β for these data is given by β̂ = (X′X)⁻¹X′Y.
Using the first model, the ANOVA table for the cheese data is shown below. The F statistic in the table tests for a regression relation, i.e. H0: β1 = β2 = β3 = 0.
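A hedged R sketch for the cheese example, assuming the data have been read into a data frame called cheese with columns taste, Acetic, H2S and Lactic (the object and column names are assumptions, not taken from the notes):

# Full model for the cheese data: taste regressed on Acetic, H2S and Lactic
fit_full <- lm(taste ~ Acetic + H2S + Lactic, data = cheese)
summary(fit_full)     # least squares estimates, standard errors, t tests and R^2
anova(fit_full)       # ANOVA table with sequential (Type I) sums of squares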
2.1 Sampling Distributions and Inference for the parameters
Sampling Distribution of β̂
IMPLICATIONS:
CONFIDENCE INTERVALS:
The 100(1 − α)% confidence interval for βk is β̂k ± t_{1−α/2, n−p} s(β̂k), k = 0, . . . , p − 1, where s(β̂k) is the standard error of β̂k, i.e., √( (X′X)⁻¹_kk MSE ).
Joint confidence regions for several βk's are ellipsoidally shaped and cannot be easily visualized due to high dimensionality.
CONFIDENCE INTERVAL FOR THE MEAN RESPONSE: Given a new set of predictors X0, what is the estimated mean response? We call it Ŷ0, which is X0′β̂, and Var(X0′β̂) = X0′(X′X)⁻¹X0 σ². So a 100(1 − α)% confidence interval for the average of the responses with given X0 is
Ŷ0 ± t_{1−α/2, n−p} √( MSE · X0′(X′X)⁻¹X0 ).
HYPOTHESIS TESTS:
Reject H0: βk = βk,0 if
|t*| = |β̂k − βk,0| / √( (X′X)⁻¹_kk MSE ) > t_{1−α/2, n−p}.
For the overall F test of H0: β1 = · · · = βp−1 = 0, note that if H0 is rejected, we do not know which one or how many of the βk's are nonzero; we only know that at least one is.
SUMS OF SQUARES AS QUADRATIC FORMS
Recall that in Section 1.2 the corrected sums of squares were partitioned and expressed in terms of random vectors. To enhance the understanding of these sums of squares, we often express them in terms of quadratic forms.
SST = Y′(I − (1/n)11′)Y = Y′(I − (1/n)J)Y
SSR = Y′(H − (1/n)J)Y
SSE = Y′(I − H)Y
Recall that any n × n real symmetric matrix A determines a quadratic form such that
Y′AY = Σ_{i=1}^{n} Σ_{j=1}^{n} aij Yi Yj.
Def. The trace of an n × n square matrix A is tr(A) = Σ_{i=1}^{n} Aii.
We have seen that MSE is unbiased for σ², and this holds whether or not H0: β1 = . . . = βp−1 = 0 is true. On the other hand, E(MSR) = σ² + (Xβ)′(H − n⁻¹11′)(Xβ)/(p − 1).
NOTE. When H0 is true, both MSR and MSE are estimating the same quantity, and the F statistic should be close to one. When H0 is not true, (Xβ)′(H − n⁻¹11′)Xβ > 0, and hence MSR is estimating something larger than σ². In this case, we would expect F to be larger than one. This gives an intuitive explanation of why the F statistic tends to be large when H0 is not true. The name ‘Analysis of Variance’ is associated with this kind of study because we are conducting hypothesis tests by comparing different estimators for the variance. The extra term (Xβ)′(H − n⁻¹11′)Xβ is called a noncentrality parameter.
d) Find a 95 percent confidence interval for the mean taste rating when concentration of
acetic acid, hydrogen sulfide, lactic acid are 5,6,1, respectively.
e) Find a 95 percent prediction interval for a particular taste rating score when concen-
tration of acetic acid, hydrogen sulfide, lactic acid are 5,6,1, respectively.
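For d) and e), predict() returns both interval types; a sketch that reuses the hypothetical fit_full object and column names from the earlier cheese code:

# Confidence and prediction intervals at Acetic = 5, H2S = 6, Lactic = 1
new_obs <- data.frame(Acetic = 5, H2S = 6, Lactic = 1)
predict(fit_full, newdata = new_obs, interval = "confidence", level = 0.95)  # mean response
predict(fit_full, newdata = new_obs, interval = "prediction", level = 0.95)  # new observation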
Each βi coefficient represents the change in the mean response, E(Y ), per unit increase
in the associated predictor variable when all the other predictors are held constant.
In the case of two predictors, the estimated regression equation yields a plane (as opposed
to a line in the simple linear regression setting). For more than two predictors, the estimated
regression equation yields a hyperplane.
2.2 General linear tests - REDUCED vs. FULL MODEL
Often we want to test a hypothesis about several parameters simultaneously (but not all of
them as in the case of testing regression relation). For example, in the model
Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi1 Xi2 + β4 Xi2² + ǫi
we might want to test the hypothesis that all second order terms are zero; i.e. β3 = β4 = 0.
It is not a good idea to do a t* test for each parameter separately, because of the correlations among the regressor (independent) variables. These t* tests for each parameter test the following hypotheses:
If Xi1 Xi2 and Xi2² are highly correlated, then dropping Xi2² may not have a noticeable impact. Likewise, if Xi2² is kept in the model, dropping Xi1 Xi2 may not have a big effect. It is
entirely possible to do two individual t∗ tests and conclude that neither variable is required.
However, it can be that the pair would significantly improve the fit of the model, if you test
the two variables simultaneously.
A common way to test H0 : β3 = β4 = 0 vs. Ha : not both are zero, is to fit the “reduced”
model
REDUCED: Yi = β0 + β1 Xi1 + β2 Xi2 + ǫi
Then compare the residual sums of squares from reduced model with the residual sum of
squares from the full model. The test statistic is
F* = [ (SSE(reduced) − SSE(full)) / (df_red − df_full) ] / MSE(full)
The test criterion is: reject H0 if F* > F_{1−α}(df_red − df_full, df_full); otherwise do not reject H0.
This type of general linear test includes the F test for a regression relation and is a very
flexible method of testing hypotheses. It is also possible to test any linear hypotheses where
the reduced model is nested within the full model. For example, if
FULL: Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi1² + ǫi
and the hypothesis to be tested is H0 : β3 = 0 and β1 = β2 , then we can write the reduced
model as
REDUCED: Yi = β0 + β1 (Xi1 + Xi2 ) + ǫi
so that the design matrix has two columns: its ith row is (1, Xi1 + Xi2).
The parameters of this model are obtained by putting restrictions on the parameters of the
full model, so we say that this model is nested within the full model. The test statistic has
2 and n − 4 degrees of freedom.
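In R the reduced-vs-full comparison is carried out by fitting both models and passing them to anova(); an illustrative sketch of the example just described, with a hypothetical data frame dat containing y, x1 and x2:

# General linear test: H0: beta3 = 0 and beta1 = beta2 (reduced model nested in the full model)
full    <- lm(y ~ x1 + x2 + I(x1^2), data = dat)
reduced <- lm(y ~ I(x1 + x2), data = dat)
anova(reduced, full)    # F* with (2, n - 4) degrees of freedom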
The general linear test statistic can be computed another way which does not require
fitting the reduced model. Any linear hypothesis can be stated in the form H0 : Cβ = h.
For example, to test the hypothesis in the above model, write
C = [ 0 0 0 1 ; 0 1 −1 0 ]  and  h = (0, 0)′.
Then
H0: Cβ = [ 0 0 0 1 ; 0 1 −1 0 ] (β0, β1, β2, β3)′ = ( β3, β1 − β2 )′ = (0, 0)′.
The (extra) sum of squares due to fitting the full model over the reduced model is (Cβ̂ − h)′ [C(X′X)⁻¹C′]⁻¹ (Cβ̂ − h).
The degrees of freedom (df) for the numerator is the rank of C(X′X)⁻¹C′.
This way of writing the F* statistic gives a different perspective on the test. Notice that the estimated covariance matrix of Cβ̂ is MSE(full) · C(X′X)⁻¹C′.
Therefore, F ∗ is (C β̂−h)′ times the inverse of its estimated covariance matrix times (C β̂−h).
This is a similar idea to a t∗ statistic. We are essentially standardizing the difference C β̂ − h
based on its covariance matrix. If df = 1, so that C β̂ is a scalar, then F ∗ = (t∗ )2 .
The general linear test is very flexible and can be used to test all sorts of hypotheses.
Certain hypotheses are standard and are often included in an analysis of variance table.
That is, we can partition the total sums of squares into sums of squares that test different
hypotheses. The difference in SSE between a full model and a reduced model is called the
“extra sum of squares” due to fitting the additional parameters.
Let SSE(X1, X2, X3) be the error sum of squares for the full model, i.e. SSE(FULL), and let SSE(X1, X2) be the error sum of squares for the reduced model containing only X1 and X2. Define the extra sums of squares as follows.
If you want to test whether β3 is needed, given that β1 and β2 are already in the model, then the extra sum of squares is
R(X3 | X1, X2) = SSE(X1, X2) − SSE(X1, X2, X3).
The “R” stands for the reduction in the error sums of squares due to fitting the additional term. The F* test for β3 = 0 is
F* = R(X3 | X1, X2) / MSE(X1, X2, X3),
which is the same test as we had before but expressed in different notation.
The extra sums of squares can also be written as a difference of regression sums of squares. For example, R(X3 | X1, X2) = SSR(X1, X2, X3) − SSR(X1, X2).
Sequential (Type I) sums of squares: these are extra sums of squares for adding each term given that the preceding terms in the model statement are already in the model. For the model with 3 independent variables (in SAS: MODEL Y = X1 X2 X3;), the Type I sums of squares are
SOURCE   df      SS
X1       1       R(X1)
X2       1       R(X2 | X1)
X3       1       R(X3 | X1, X2)
error    n − 4   SSE(X1, X2, X3)
These sequential sums of squares add up to SSR(X1, X2, X3).
The sequential sums of squares depend on the order that you write the independent
variables in the model statement. For example, the model statement: MODEL Y= X3 X2 X1
would give
SOURCE   df      SS
X3       1       R(X3)
X2       1       R(X2 | X3)
X1       1       R(X1 | X2, X3)
error    n − 4   SSE(X1, X2, X3)
Partial (Type II) sums of squares: these are extra sums of squares for adding a term to the model given that all the other independent variables in the model statement are already in the model. The order of terms in the model makes no difference: for either of the model statements
MODEL Y = X1 X2 X3;   or   MODEL Y = X3 X2 X1;
the partial sums of squares are the same:
SOURCE   df      SS
X1       1       R(X1 | X2, X3)
X2       1       R(X2 | X1, X3)
X3       1       R(X3 | X1, X2)
error    n − 4   SSE(X1, X2, X3)
But these sums of squares do not add up to the total regression sum of squares. In other words, this is not a partition of the regression sum of squares. The only case in which the Type II sums of squares do add up to SSR(X1, X2, X3) is
when the three independent variables are orthogonal to each other. This does not happen
often in the regression setting, but designed experiments are usually set up to take advantage
of this.
Note that if X1 and X2 are uncorrelated, then SSR(X1 ) = R(X1 |X2 ) and SSR(X2 ) =
R(X2 |X1 ).
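In R, anova() on a fitted model gives the sequential (Type I) sums of squares in the order the terms were written, while drop1() gives the partial sums of squares R(Xj | all other terms); a sketch assuming a hypothetical data frame dat with response y and predictors x1, x2, x3:

# Sequential (Type I) vs. partial (Type II) sums of squares
fit123 <- lm(y ~ x1 + x2 + x3, data = dat)
fit321 <- lm(y ~ x3 + x2 + x1, data = dat)
anova(fit123)               # R(X1), R(X2|X1), R(X3|X1,X2)
anova(fit321)               # R(X3), R(X2|X3), R(X1|X2,X3): the order matters
drop1(fit123, test = "F")   # R(Xj | the other two) for each j: the order does not matter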
Example: DATA
Y X0 X1 X2
-2 1 -2 1
-3 1 1 0
0 1 2 0
2 1 1 3
SSE(X2 ) =6.5833
? = 3.6353
a) We want to know whether or not the variables X1 (ACETIC) and X2 (H2S) should be
added to the model. Does the smaller model do just as well at describing the data as the
full model?
Analysis of Variance: Reduced Model
Source DF Sum of Squares Mean Squares F Pr > F
b) The ‘lm’ in R produces the following sequential sums of squares for the cheese data:
Obtain the following sequential sums of squares to check whether or not each variable
should be added to the model (given the preceding terms): R(X1 ), R(X2 |X1 ), R(X3 |X1 , X2 )
State the appropriate test hypotheses for each F-value computed in the ANOVA table.
Compute R(X1 ), R(X2 |X1 ), R(X3 |X1 , X2 ). What is the major difference between results of
b) and c) due to different ordering?
After adjusting for the effects of acetic and lactic concentrations, do we have significant
evidence that the hydrogen sulfide concentration is important in describing taste?
When we want to determine whether a specified regression function adequately fits the
data, we can conduct a lack of fit test. However, it is important to note that the test requires
repeated observations (replications) for at least one of the values of the predictors (X). This
test is also based on decomposing sums of squares (due to Errors) and the test procedure
can be derived in the same way as testing the full vs. reduced model.
F* = [ SSLF / ((n − p) − (n − c)) ] / MSPE = [ SSLF / (c − p) ] / MSPE,
where p is the number of regression parameters and c is the number of distinct X values,
SSLF denotes the lack of fit sum of squares, and SSPE (thus, MSPE for mean squares)
stands for the sum of squares due to pure error.
2.3 Qualitative Independent Variables
Types of variables: qualitative (categorical) predictors take values in a set of categories rather than on a numerical scale,
e.g. gender (female, male), company status (private, public), treatment (yes, no), blood pressure rating (low, average, high).
Example: On average, do smoking mothers have babies with lower birth weight?
Then, a first order model with one binary and one quantitative predictor appears to be a
natural model to formulate for these data:
Yi = β0 + β1 Xi1 + β2 Xi2 + ǫi, where Xi2 = 1 if mother i smokes and 0 otherwise.
Q. Why not just fit two separate regression functions one for the smokers and one for the
non-smokers?
The combined regression model assumes that the slopes for the two groups are equal and that the variances of the error terms are equal. In that case, it is better to use as much data as possible to estimate the standard errors of the regression coefficients for testing and confidence intervals.
Pooling your data and fitting the combined regression function allows you to easily and
efficiently answer research questions concerning the binary predictor variable.
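A minimal R sketch of the combined model with one quantitative and one binary predictor, using simulated data in place of the birth-weight study (all names and numbers below are illustrative assumptions):

# Combined regression with a binary (dummy) predictor: two intercepts, common slope
set.seed(2)
gestation <- runif(60, 34, 42)                 # simulated quantitative predictor (weeks)
smoker    <- rep(c(0, 1), each = 30)           # binary indicator
weight    <- 3000 + 100 * (gestation - 38) - 250 * smoker + rnorm(60, sd = 150)
fit_comb  <- lm(weight ~ gestation + smoker)
summary(fit_comb)   # the 'smoker' coefficient estimates the shift in intercept between groups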
Example. If we are interested in quantifying the relationship between total population (of
a metropolitan area) and number of active physicians it may be important to take (4)
geographic regions into account. The indicator variables can be easily used to identify each
of 4 regions as follows:
X1 = 1 if region 1, 0 otherwise;   X2 = 1 if region 2, 0 otherwise;
X3 = 1 if region 3, 0 otherwise;   X4 = 1 if region 4, 0 otherwise.
Then, we can fit a model for the effect of total population on the number of physicians with a separate intercept for each region:
Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + β4 Xi4 + β5 (Pop)i + ǫi.
For n = 10 observations, the design matrix would look something like this (two observations from region 1, two from region 2, three from region 3, three from region 4; the last column holds the population values Xi5 = (Pop)i):
X = [ 1 1 0 0 0 X1,5
      1 1 0 0 0 X2,5      (region 1)
      1 0 1 0 0 X3,5
      1 0 1 0 0 X4,5      (region 2)
      1 0 0 1 0 X5,5
      1 0 0 1 0 X6,5
      1 0 0 1 0 X7,5      (region 3)
      1 0 0 0 1 X8,5
      1 0 0 0 1 X9,5
      1 0 0 0 1 X10,5 ]   (region 4)
and β = (β0, β1, β2, β3, β4, β5)′.
This is not of full rank, so in order to compute parameter estimates we need to either
put restrictions on the parameters or redefine some of the independent variables. When the
design matrix is not of full rank, we are trying to estimate too many parameters. One simple
way to deal with this problem is to drop one column. In the above, columns 2 through 5 of
X matrix add up to the first column. If we drop the last of these columns, X4 , the result is
the following reparameterized model:
Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + β4 (Pop)i + ǫi.
This model is essentially the same as before, but the parameters have different meanings. For each region there is a separate intercept and all regions have a common slope:
region 1: Yi = (β0 + β1) + β4 (Pop)i + ǫi
region 2: Yi = (β0 + β2) + β4 (Pop)i + ǫi
region 3: Yi = (β0 + β3) + β4 (Pop)i + ǫi
region 4: Yi = β0 + β4 (Pop)i + ǫi
In this parameterization β0 is not the average intercept for all regions. Instead it is the
intercept for region 4. The other parameters such as β1 can be interpreted as the difference
between the intercept for that region and the intercept for region 4.
To test whether all regions have the same intercept we simply test H0 : β1 = β2 = β3 = 0
using a general linear test. Reject H0 if
F* = [ R(X1, X2, X3 | Pop) / 3 ] / MSE(full) > F_{1−α}(3, n − 5).
The model with separate intercepts but common slopes may not be realistic. You may need
separate slopes as well. You can fit a model that has entirely distinct lines for each region
using indicator variables as well. One parameterization for this type of model just builds on
our separate intercepts model:
Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + β4 (Pop)i + β5 Xi1 (Pop)i + β6 Xi2 (Pop)i + β7 Xi3 (Pop)i + ǫi
In this model β0 is the intercept for region 4 and β4 is the slope for region 4. The terms
β5 ,β6 and β7 give the difference between each region’s slope and the slope for region 4.
side notes: Fitting the full model this way gives the same slopes and intercepts for the four
regions as if you had fit four separate regression lines. The MSE from the full model is the
pooled variance estimate from the individual regressions: MSE = (SSE1 + SSE2 + SSE3 + SSE4)/(n1 + n2 + n3 + n4 − 8). If
If
the variances are different for the different groups, then it is best to do separate regressions
so that you get separate variance estimates. Sometimes pointing out that the variances are
different is just as important as determining whether the slopes are different.
If the variances are similar for the different groups, or if a transformation will make the
variance fairly homogeneous, then it is more efficient to fit the single regression model than
to fit four separate regressions.
There are other ways of dealing with the non-full rank design matrix. In particular, if
we code the data with 1,0, or -1 for each region as follows,
Z1 = 1 if region 1, 0 if region 2 or 3, −1 if region 4;
Z2 = 1 if region 2, 0 if region 1 or 3, −1 if region 4;
Z3 = 1 if region 3, 0 if region 1 or 2, −1 if region 4,
then the resulting design matrix for this model is
X = [ 1  1  0  0  X1,5
      1  1  0  0  X2,5
      1  0  1  0  X3,5
      1  0  1  0  X4,5
      1  0  0  1  X5,5
      1  0  0  1  X6,5
      1  0  0  1  X7,5
      1 −1 −1 −1  X8,5
      1 −1 −1 −1  X9,5
      1 −1 −1 −1  X10,5 ]   and   β = (β0, β1, β2, β3, β4)′.
The fitted regression lines for the four regions are then:
region1: Yi = β0 + β1 + β4 (Pop)i + ǫi
region2: Yi = β0 + β2 + β4 (Pop)i + ǫi
region3: Yi = β0 + β3 + β4 (Pop)i + ǫi
region4: Yi = β0 − β1 − β2 − β3 + β4 (Pop)i + ǫi
Interaction Effects
A regression model contains interaction effects if the response function is not additive and
cannot be separated into distinct functions of each of the individual predictors. Two predic-
tors interact if the effect on the response variable of one predictor depends on the value of
the other.
Example. Some researchers were interested in comparing the effectiveness of three treatments
(A,B,C) for severe depression and collected a random sample of n = 36 severely depressed
individuals.
Y 56 41 40 28 55 25 46 71 48 63 52 62
50 45 58 46 58 34 65 55 57 59 64 61
62 36 69 47 73 64 60 62 71 62 70 71
AGE 21 23 30 19 28 23 33 67 42 33 33 56
45 43 38 37 43 27 43 45 48 47 48 53
58 29 53 29 58 66 67 63 59 51 67 63
TRT A B B C A C B C B A A C
C B A C B C A B B C A A
B C A B A B B A C C A C
As the data indicate that X1 and X2 interact and that X1 and X3 interact, we formulate the model as
Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + β4 Xi1 Xi2 + β5 Xi1 Xi3 + ǫi,
where Xi1 is the age of individual i and Xi2, Xi3 are indicator variables identifying two of the three treatments. When we plug the possible values for X2 and X3 into the estimated regression function, we obtain the three best fitting lines, one for each treatment (A, B and C), through the data.
a) For every age, is there a difference in the mean effectiveness for the three treatments? i.e.
test whether the regression functions are identical for each age, regardless of three treatments.
H0 : β2 = β3 = β4 = β5 = 0 vs. H1 : at least one of these slope parameters is not 0.
Piecewise Linear Regression
The basic idea behind the piecewise linear regression is that if the data follow different linear
trends over different regions of the data, then we should model the regression function in
“pieces.” The pieces can be connected or not connected. Here, we’ll fit a model in which
the pieces are connected. We also use a dummy variable and an interaction term to define
a piecewise linear regression model.
A straight line estimated by the least squares may fit the data fairly well in some overall sense,
but we could do better if we use a piecewise linear function. We instead split our original
scatter plot into two pieces (where the water-cement ratio is 70%) and fit two separate, but
connected lines, one for each piece.
Yi = β0 + β1 Xi1 + β2 Xi2* + ǫi,
where Yi is the compressive strength, in pounds per square inch, of concrete batch i, Xi1 is the water-cement ratio (%) of batch i, Xi2 is a dummy variable (0 if Xi1 ≤ 70 and 1 if Xi1 > 70), and Xi2* denotes the interaction term (Xi1 − 70)Xi2.
Incidentally, the X-value at which the two pieces of the model connect is called the “knot
value.” The regression equation is ‘strength = 7.79 − 0.0663 ratio − 0.101X2∗ .’
which yields two estimated regression lines, connected at X = 70 fitting the data quite well.
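A hedged R sketch of the connected piecewise fit with a knot at 70, using the dummy-plus-interaction coding described above; a data frame concrete with columns strength and ratio is assumed (it is not supplied in the notes):

# Piecewise linear regression with a knot at ratio = 70
x2     <- as.numeric(concrete$ratio > 70)    # dummy variable X2
x2star <- (concrete$ratio - 70) * x2         # interaction term X2* = (X1 - 70) * X2
fit_pw <- lm(strength ~ ratio + x2star, data = concrete)
coef(fit_pw)   # slope below the knot is b1; slope above the knot is b1 + b2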
Polynomial Regression
In many cases the usual linear relationship between the dependent variable and any
independent variable we have assumed may not be adequate. One way to account for such
a nonlinear relationship is through a polynomial or trigonometric regression model. Such
models are linear in the parameters and the least squares method can be used for estimation
of the parameters as long as the usual assumptions on the errors are appropriate. For
example, a model for a single predictor, X, is:
Y = β0 + β1 X + β2 X² + · · · + βp X^p + ǫ,
where p is the degree of the polynomial. For lower degrees, the relationship has a specific
name (i.e., p = 2 is called quadratic, p = 3 is called cubic, p = 4 is called quartic, and
so on). However, polynomial regression models may have other predictor variables in them
as well, which could lead to interaction terms and the model can grow depending on your
application.
The design matrix for the second-degree polynomial model is X = [1  Xi  Xi²], i.e. its ith row is (1, Xi, Xi²).
For the polynomial regression in one independent variable, an important aspect that
distinguishes it from other multiple regression models is that the mean of the dependent
variable is a function of a single independent variable. The fact that the independent vari-
ables in a simple polynomial model are functions of a single independent variable affects the
interpretation of the parameters. In the second-degree model, the parameter β1 is the slope
only at X = 0. The parameter β2 is half the rate of change in the slope of E(Y ).
The higher-degree polynomial models provide greatly increased flexibility in the response
surface. Although it is unlikely that any complex process will be truly polynomial in form,
the flexibility of the higher-degree polynomials allows any true model to be approximated to
any desired degree of precision.
A second-degree (quadratic) model in two predictor variables is
Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi1² + β4 Xi1 Xi2 + β5 Xi2² + ǫi.
The degree (or order) of an individual term in a polynomial is defined as the sum of the
powers of the independent variables in the term. The degree of the entire polynomial is
defined as the degree of the highest-degree term.
The squared terms allow for a curved response in each variable and the product term
allows for the surface to be twisted. A quadratic response surface will have a maximum,
a minimum, or a saddle point, depending on the coefficients in the regression equation.
(Refer to Box and Draper (1987) for a further discussion of the analysis of the properties of
quadratic response surfaces.)
Due to its extreme flexibility, some caution is needed in the use of polynomial models; it is
easy to overfit a set of data with polynomial models. Extrapolation is particularly dangerous
when higher-degree polynomial models are being used, since minor extrapolations can have
serious errors. Nevertheless, polynomial response models have proven to be extremely useful
for summarizing relationships.
The polynomial model is built sequentially, starting with a first-degree polynomial and adding progressively higher-order terms as needed. The lowest-degree polynomial that accomplishes the degree of approximation needed or warranted by the data is adopted. It is
common practice to retain in the model all lower-degree terms, regardless of their significance,
that are contained in (or are subsets of) any significant term.
Example. Data are from a growth experiment with blue-green algae Spirulina platensis for
the treatment where CO2 is bubbled through the culture. The measure of algae density
is the dependent variable. Consider a cubic polynomial model with the data for the first
replicate. The (ordinary) least squares fit of the model is given by
where the standard errors of the estimates are given in parentheses. Assuming that a cubic
model is adequate, we can test the hypotheses
c) With two replicates per each day, we can test the adequacy of a quadratic polynomial
model. We fit the quadratic model with whole data (including rep1 and rep2) to obtain
Note that the natural polynomials with terms Xi, Xi², Xi³, etc. give (nearly) linearly dependent columns of the X matrix, leading to multicollinearity problems. For the cubic polynomial model (as in the above example), a set of orthogonal polynomials can be defined:
O0i = 1,
The orthogonal polynomials can be obtained using the Gram Schmidt orthogonalization
procedure or with a function ‘poly’ in R.
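A brief R illustration (not from the notes) of raw versus orthogonal polynomial design columns; x is an arbitrary vector:

# Raw (natural) vs. orthogonal polynomial design columns
x <- 1:10
X_raw  <- cbind(x, x^2, x^3)         # nearly collinear columns
X_orth <- poly(x, degree = 3)        # orthogonal polynomial columns
cor(X_raw)                           # large off-diagonal correlations
round(crossprod(X_orth), 10)         # identity matrix: the columns are orthonormal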
2.4 Data Transformations
There are many situations in which transformations of the dependent or independent variables are helpful in least squares regression. Transformations of the dependent variable are often used as remedies for non-normality and for heterogeneous variances of the errors.
Note that data transformation definitely requires a ‘trial and error’ approach. In building
the model, we try a transformation and then check to see if the transformation eliminated
the problems with the model. If it doesn’t help, we try another transformation until we have
an adequate model.
Previously, we learned tools for detecting problems with a linear regression model. Once
we’ve identified problems with the model, we have a number of options:
• If important predictor variables are omitted, see whether adding the omitted predictors
improves the model.
• If the mean of the response is not a linear function of the predictors, try a different
function. e.g. polynomial regression or applying a logarithmic transformation to the
response variable allowing for a nonlinear relationship between the response and the
predictors while remaining within the multiple linear regression framework.
• If there are unequal error variances, try transforming the response and/or predictor
variables or use “weighted least squares regression” (or generalized least squares).
• If error terms are not independent, try fitting a “time series model.”
Another reason for making transformations is to simplify the relationship between the depen-
dent variable and the independent variables. Curvilinear relationships between two variables
frequently can be simplified by a transformation on either one or both of the variables. Many
models nonlinear in the parameters can be linearized, reexpressed as a linear function of the
parameters, by appropriate transformations. For example, the relationship
Y = α X^β
can be linearized by taking logarithms:
ln(Y) = ln(α) + β ln(X), or Y* = α* + βX*.
Logarithms are often used in data transformations because they are connected to common power-curve (as above) and exponential-growth (Y = α e^{βX}) relationships.
2.5 Weighted Least Squares
The least squares regression method discussed so far was based on the assumptions that the errors are normally distributed, independent random variables with a common variance. Least squares estimators based on these assumptions are referred to as ordinary least squares estimators, and by the Gauss-Markov theorem they have the desirable property of being the best (minimum variance) among all possible linear unbiased estimators. When the normality assumption is satisfied, the least squares estimators are also maximum likelihood estimators.
If the variance of Y is not constant, there may be a transformation of Y that will stabilize
the variance. This is the most common approach, but under certain circumstances there
is another method that is very attractive. If the variance heterogeneity is known up to a
constant (and the observations Yi are uncorrelated), then we can use the method of weighted
least squares:
Var(ǫi) = σi² = σ²/wi = Var(Yi) and Cov(ǫi, ǫj) = 0 for i ≠ j.
Let W = diag(w1, w2, . . . , wn).
What we do here is minimizing a sum of squares, (Y −Xβ)′ W (Y −Xβ), where we weight
each observation according to its variability. If an observation is highly variable it gets low
weight, and if an observation has small variance it gets greater weight.
β̂ = (X′WX)⁻¹X′WY
Var(β̂) = σ²(X′WX)⁻¹
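In R, weighted least squares is requested through the weights argument of lm(); a sketch with simulated heteroscedastic data in which Var(ǫi) is proportional to xi, so that wi = 1/xi (the simulation set-up is an assumption for illustration only):

# Weighted least squares with known weights
set.seed(3)
x <- runif(50, 1, 10)
y <- 1 + 2 * x + rnorm(50, sd = sqrt(x))   # error variance proportional to x
fit_ols <- lm(y ~ x)                       # ordinary least squares
fit_wls <- lm(y ~ x, weights = 1 / x)      # weights w_i = 1/x_i
summary(fit_wls)$coefficients              # typically close to OLS, with smaller standard errors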
When you compare the weighted least squares (WLS) results with those from ordinary
least squares it often happens that the estimated coefficients do not change very much.
However, the weighted least squares is more efficient than the ordinary least squares; i.e.
the parameters are estimated more precisely. Note, however, that if you use ordinary least
squares when heteroscedasticity is present (non-constant error variances), the estimated
variances of the regression coefficients that come out of a standard regression program will
not actually be the correct variances, so they may look smaller than those under WLS. Also,
if you compute confidence bands or prediction intervals, the ones computed using WLS will
get wider as X increases, as they should. This is probably the most important benefit of
using the weighted least squares. We want the confidence and prediction intervals to be
narrow when Var(Yi ) is small.
When you do the weighted least squares, it is important to use studentized residuals
rather than raw residuals for diagnostic plots. The raw residuals do not have homogeneous
variances and will not show the results of the weighting, but the studentized residuals do
have homogeneous variances.
The weighted least squares is a special case of generalized least squares (GLS) where
the error terms not only may have different variances but pairs of error terms may also be
correlated.
β̂_GLS = (X′V⁻¹X)⁻¹X′V⁻¹Y, where V = Var(ǫ); when V must be estimated from the data, the estimated (feasible) GLS estimator is
β̂_EGLS = (X′V̂⁻¹X)⁻¹X′V̂⁻¹Y.
2.6 Model Building
In many practical applications, a researcher has a large set of candidate predictor variables
from which he/she tries to identify the most appropriate predictors to include in his regression
model. For example, even if a researcher has (only) 10 candidate predictor variables, there are 2¹⁰ = 1024 possible regression models from which to choose.
In general, we want to explain the data in the simplest way: collinearity is caused by having too many variables, and unnecessary predictors add noise to the estimation of the other quantities that we are interested in. However, an underspecified model is a model in
which important predictors are missing and such a model yields biased regression coefficients
and biased predictions of the response.
Here we aim to learn some variable selection methods such as stepwise regression, best
subsets regression and LASSO regression so that the resulting model is simple and useful.
Stepwise Regression
We build our regression model from a set of candidate predictor variables by entering
and removing predictors in a stepwise manner into our model until there is no justifiable
reason to enter or remove any more. We start with no predictors in our “stepwise model.” Then, at each step along the way we either enter or remove a predictor based on the partial F-tests, that is, the t-tests for the slope parameters.
2. For all predictors not in the model, check the p-value each would have if it were added to the model. Choose the one with the lowest p-value, provided it is less than αE.
Note that αR and αE will typically be greater than the usual 0.05 level so that it is
not too difficult to enter predictors into the model.
1. Fit each of the one-predictor models. Of those predictors, the one with the smallest t-test p-value less than αE is put in the stepwise model. If no predictor has a p-value less than αE, stop.
2. Fit each of the two-predictor models that include the predictor chosen in 1. Again, of those predictors, the one with the smallest t-test p-value less than αE is put in the model. Step back and see whether entering the newly chosen variable into the model somehow affects the significance of the previously chosen predictor in 1. If the p-value for the previously chosen variable is greater than αR, remove that variable.
3. Fit each of the three-predictor models with the two predictors chosen in 2, and so on.
5. Stop when adding an additional predictor does not yield a p-value less than αE.
Warnings:
- The procedure yields a single final model, although there are often several equally good
models. The final model is not guaranteed to be optimal in any specified sense: it does not
mean that all the important predictor variables for predicting Y have been identified, nor
that all the unimportant predictor variables have been eliminated.
- Stepwise regression does not take into account a researcher’s knowledge about the
predictors. It may be necessary to force the procedure to include important predictors.
- One should not over-interpret the order in which predictors are entered into the model.
- It is possible that we may have committed a Type I or Type II error along the way.
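Stepwise selection is available in R through step(), which uses AIC rather than the p-value thresholds αE and αR described above; a minimal sketch assuming a hypothetical data frame dat with response y and candidate predictors x1, ..., x4:

# Stepwise selection (both directions) based on AIC
null_model <- lm(y ~ 1, data = dat)
full_model <- lm(y ~ x1 + x2 + x3 + x4, data = dat)
step(null_model, scope = formula(full_model), direction = "both")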
Best Subsets Regression
We select the subset of predictors that does the best at meeting some well-defined objective criterion, such as having the largest R² (adjusted R²) value or the smallest MSE.
1. Identify all of the possible regression models derived from all of the possible combina-
tions of the candidate predictors.
2. Determine the one-predictor models that do the ‘best’ at meeting some well-defined
criteria, the two-predictor models that do the ‘best’, the three-predictor models that
do the ‘best’, and so on.
3. Further evaluate and refine models by residual analyses, transforming variables, adding
interaction terms, etc. until the model meets necessary conditions allowing you to
answer your research question.
Criterion-based procedures
1. Akaike's Information Criterion (AIC), Bayesian Information Criterion (BIC) (also called Schwarz's Bayesian Criterion (SBC)), and Amemiya's Prediction Criterion (APC):
AIC = n ln(SSE) − n ln n + 2p,     BIC = n ln(SSE) − n ln n + p ln n,
where n: sample size and p: number of regression coefficients in the model being
evaluated including the intercept.
Notice that the only difference between AIC and BIC is the multiplier of p. Each of
the information criteria is used in a similar way in comparing two models, the model
with the lower value is preferred.
The BIC places a higher penalty on the number of parameters in the model so will
tend to reward more parsimonious (smaller) models. This stems from one criticism of
AIC in that it tends to overfit models.
2. PRESS (prediction sum of squares) statistic: PRESS = Σ_{i=1}^{n} e(i)², where the e(i)'s are the residuals calculated omitting the ith observation in the regression fit. The model with the lowest PRESS value is selected. This tends to pick larger models (which may be desirable if prediction is the objective).
3. Mallows' Cp-statistic: Cp = SSEp / MSE(full) − (n − 2p), where SSEp is the error sum of squares of the candidate model with p parameters. It estimates the size of the bias that is introduced into the predicted responses by having an underspecified model.
- Identify subsets of predictors for which the Cp value is near p (if possible). The full
model always yields Cp = p, so don’t select the full model based on Cp .
- If all models, except the full model, yield a large Cp not near p, it suggests some
important predictor(s) are missing from the analysis.
- If a number of models have Cp near p, choose the model with the smallest Cp value, thereby ensuring that the combination of the bias and the variance is at a minimum.
- When more than one model has a small value of Cp value near p, in general, choose
the simpler model or the model that meets your research needs.
Cross-validation
Cross-validation is a model validation technique in which the regression equation fitted to the original dataset is used to make predictions for a new dataset. Then, we can calculate the prediction errors (differences between the actual response values and the predictions) and summarize the predictive ability of the model by the mean squared prediction error (MSPE). This gives an indication of how well the model will predict in the future.
The sample data is usually partitioned into a training (or model-building) set, which we
can use to develop the model, and a validation (or prediction) set, which is used to evaluate
the predictive ability of the model.
The K-fold cross-validation partitions the sample dataset into K parts which are (roughly)
equal in size. For each part, we use the remaining K − 1 parts to estimate the model of
interest (i.e., the training sample) and test the predictability of the model with the remain-
ing part (i.e., the validation sample). Then, the sum of squared prediction errors can be
computed and combining K estimates of prediction error produces a K-fold cross-validation
estimate.
When K = 2, it is usually preferable to residual diagnostic methods and takes not much
longer to compute.
When K = n (leave-one-out cross-validation), n separate data sets are trained on all of the data except one point, and then a prediction is made for that one point. The evaluation of this method is very good, but it is often computationally expensive. Note that the leave-one-out cross-validation estimate of prediction error is identical to the PRESS statistic.
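A small hand-rolled K-fold cross-validation loop in R (an illustrative sketch; the data frame dat, its columns y, x1, x2 and the model formula are placeholders):

# K-fold cross-validation estimate of the mean squared prediction error (MSPE)
K <- 5
set.seed(4)
folds <- sample(rep(1:K, length.out = nrow(dat)))   # random fold assignment
sq_err <- numeric(nrow(dat))
for (k in 1:K) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  fit_k <- lm(y ~ x1 + x2, data = train)
  sq_err[folds == k] <- (test$y - predict(fit_k, newdata = test))^2
}
mean(sq_err)   # MSPE estimate; with K = n this reproduces PRESS / n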
2.7 Multicollinearity
Multicollinearity exists when two or more of the predictors in a regression model are mod-
erately or highly correlated.
Types of multicollinearity
Effects of multicollinearity: when two or more predictors are correlated,
• the estimated regression coefficient of any one variable depends on other predictors
that are included in the model.
• the precision of the estimated regression coefficients decreases as more predictors are
added to the model.
• the marginal contribution of any one predictor variable in reducing the error sum of
squares depends on the other predictors that are included in the model.
• hypothesis tests for βk = 0 may yield different conclusions depending on which predic-
tors are in the model.
Example. Researchers want to model the relationship between blood pressure and age, weight, body surface area, duration, pulse rate and/or stress level (refer to the data set on the web).
X1 : age (in years), X2 : weight (in kg), X3 : body surface area (BSA in square meter)
X4 : duration of hypertension (in years), X5 : basal pulse (in beats per minute), X6 : stress
index
Q. What is the effect on the regression analysis if the predictors are perfectly uncorre-
lated?
When there is no apparent relationship at all between the predictors X1 and X2 so that
the correlation between them is zero (X1 and X2 are orthogonal to each other), the regression
parameter estimates (and their standard errors) remain the same regardless of the used model
form and the sequential sum of squares are the same as the partial sum of squares.
Reducing Multicollinearity
- Remove one or more of the violating predictors from the regression model.
2.8 Ridge Regression
In some cases, the least squares estimates cannot be computed or are subject to very high variance:
1. X is not of full rank (i.e., its columns are not linearly independent).
2. The number of predictors is larger than the number of responses (p > n).
To control the variance of β̂j ’s, we can regularize the coefficients, i.e., control how large they
can grow. Ridge regression (or Tikhonov Regularization) estimators can be computed in all
1-4 cases stated above (Hoerl and Kennard, 1970).
The ridge regression estimator β̂λ minimizes the penalized residual sum of squares criterion
Jλ(β) = ‖Y − Xβ‖² + λ‖β‖²₂ = Σ_{i=1}^{n} (Yi − Xi′β)² + λ Σ_j βj²,
or, equivalently, minimizes Σ_{i=1}^{n} (Yi − Xi′β)² subject to Σ_j βj² ≤ s, which makes explicit the size constraint on the parameters.
between λ and s. λ offers a tradeoff between the regularizing constraint and minimization
of the sum of squared residuals. The bigger the λ the greater is the amount of shrinkage of
the coefficients toward zero.
Note that Jλ (β) is strictly convex and hence has a unique solution β̂λ . Thus, setting the
derivatives to zero, ∇β Jλ (β) = 0 yields that β̂λ solves
(X ′ X + λI)β = X ′ Y
β̂λ = (X ′ X + λI)−1 X ′ Y
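The ridge estimator can be computed directly from this closed form; a minimal R sketch using simulated, standardized predictors (standardizing X and centering y before ridge regression is an assumption of this sketch, since the notes show no code):

# Ridge regression from the closed form (X'X + lambda*I)^(-1) X'y
ridge_beta <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}
set.seed(5)
Sigma <- matrix(c(1, .9, .9, .9, 1, .9, .9, .9, 1), 3)      # strongly correlated predictors
X <- scale(matrix(rnorm(100 * 3), 100, 3) %*% chol(Sigma))
y <- as.vector(X %*% c(1, 0, -1) + rnorm(100))
round(cbind(ridge_beta(X, y, 0), ridge_beta(X, y, 10)), 3)  # lambda = 0 (OLS) vs. lambda = 10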
The additional “ridge” down the diagonal guarantees that X′X + λI is invertible. In the special case where the columns of X are orthonormal (X′X = I), the ridge estimator is simply a scaled-down version of the least squares estimator:
β̂λ = β̂ / (1 + λ).
• This form illustrates the essential feature of ridge regression: shrinkage towards zero
• The ridge penalty introduces bias but reduces the variance of the estimate (bias-variance tradeoff).
Theorem. If the ǫi's are i.i.d. with mean zero and variance Var(ǫi) = σ², then the bias of the ridge regression estimator is
Bias(β̂λ) = E[β̂λ] − β = −λRλβ, where Rλ = (X′X + λI)⁻¹.
Recall that mean squared error (MSE) = variance + (bias)², and in multiparameter problems
MSE(β̂λ) = E[(β̂λ − β)(β̂λ − β)′] = Var(β̂λ) + Bias(β̂λ)[Bias(β̂λ)]′.
There always exists λ such that the total MSE of β̂λ is smaller than the total MSE of
the least squares estimate β̂.
• Hoerl and Kennard (1970) suggested using ridge traces:
- Plot the components of β̂λ against λ, or rather against DF(λ), due to the 1-to-1 relation.
- Choose a λ for which the coefficients are not rapidly changing and have sensible signs.
[Figure: ridge trace — coefficient paths plotted against DF(λ) for the predictors age, sex, bmi, map, tc, ldl, hdl, tch, ltg, glu.]
2.9 LASSO (Least Absolute Shrinkage and Selection Operator)
Another way to deal with variable selection is to use regularization (or penalization) of the regression coefficients. LASSO (Tibshirani, 1996) has been a popular technique for simultaneous linear regression estimation and variable selection. Specifically, we define β̂ to minimize the penalized sum of squares
‖Y − Xβ‖² + λ‖β‖₁,
or, equivalently,
min_β Σ_{i=1}^{n} (Yi − Xi′β)²  subject to  ‖β‖₁ ≤ s.
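In R the LASSO path is commonly computed with the glmnet package; a hedged sketch reusing the X and y objects from the ridge sketch above (glmnet is not mentioned in the notes, so this is only one possible implementation):

# LASSO with glmnet: alpha = 1 gives the L1 (lasso) penalty
library(glmnet)
lasso_fit <- glmnet(X, y, alpha = 1)
plot(lasso_fit, xvar = "norm")          # coefficient paths against the L1 norm of beta
cv_fit <- cv.glmnet(X, y, alpha = 1)    # choose lambda by cross-validation
coef(cv_fit, s = "lambda.min")          # some coefficients are shrunk exactly to zero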
As before (in the ridge regression), the shrinkage (penalty) parameter λ offers a tradeoff
between the regularizing constraint and minimization of the sum of squared residuals. The
bigger the λ the greater is the amount of shrinkage. With LASSO, some of the estimated
coefficients are shrunk all the way to zero. Another important role of λ is for automated
variable selection.
To study the relationship between Y = the level of prostate-specific antigen (lpsa) and a number of clinical measures in men who were about to receive a radical prostatectomy:
X3 = age
[Figure: LASSO coefficient paths for the prostate data, plotted against the normalized ‖β̂λ‖₁; the variables svi, lcavol, lweight, lbph, gleason, pgg45, age and lcp enter the model at different points along the path.]
Note: The dashed vertical line depicts the minimum BIC value
2.10 Robust Regression
Recall that the ordinary least squares estimates for linear regression are optimal when all of the regression assumptions are valid. When the errors are non-normal, or there are big
outliers, least squares regression can perform poorly. Robust regression is not overly affected
by violations of assumptions by the underlying data-generating process.
where c is a user-defined tuning constant that affects robustness and efficiency of the
method.
where (r²)(i) are the order statistics of the squared residuals. The LTS (least trimmed squares) estimator chooses the regression coefficients that minimize the sum of the smallest h of the squared residuals.
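Robust fits are available in R via rlm() in MASS (M-estimation, Huber's ψ by default) and ltsReg() in robustbase (least trimmed squares); a hedged sketch assuming a hypothetical data frame dat with response y and predictors x1, x2:

# Robust regression: Huber M-estimation and least trimmed squares
library(MASS)
fit_m <- rlm(y ~ x1 + x2, data = dat)       # M-estimator with Huber psi (default tuning constant)
summary(fit_m)
library(robustbase)
fit_lts <- ltsReg(y ~ x1 + x2, data = dat)  # least trimmed squares (LTS)
coef(fit_lts)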