Chapter 9
Simple Linear Regression
E(Y|x) = β0 + β1 x

where E(), which is read "expected value of", indicates a population mean; Y|x, which is read "Y given x", indicates that we are looking at the possible values of Y when x is restricted to some single value; β0, read "beta zero", is the intercept parameter; and β1, read "beta one", is the slope parameter. A common term for any parameter or parameter estimate used in an equation for predicting Y from x is coefficient.
The error model that we use is that for each particular x, if we have or could
collect many subjects with that x value, their distribution around the population
mean is Gaussian with a spread, say σ², that is the same value for each value of x (and corresponding population mean of y). Of course, the value of σ² is
an unknown parameter, and we can make an estimate of it from the data. The
error model described so far includes not only the assumptions of “Normality” and
“equal variance”, but also the assumption of “fixed-x”. The “fixed-x” assumption
is that the explanatory variable is measured without error. Sometimes this is
possible, e.g., if it is a count, such as the number of legs on an insect, but usually
there is some error in the measurement of the explanatory variable. In practice,
we need to be sure that the size of the error in measuring x is small compared to
the variability of Y at any given x value. For more on this topic, see the section
on robustness, below.
In addition to the three error model assumptions just discussed, we also assume
“independent errors”. This assumption comes down to the idea that the error
(deviation of the true outcome value from the population mean of the outcome for a
given x value) for one observational unit (usually a subject) is not predictable from
knowledge of the error for another observational unit. For example, in predicting
time to complete a task from the dose of a drug suspected to affect that time,
knowing that the first subject took 3 seconds longer than the mean of all possible
subjects with the same dose should not tell us anything about how far the next
subject’s time should be above or below the mean for their dose. This assumption
can be trivially violated if we happen to have a set of identical twins in the study,
in which case it seems likely that if one twin has an outcome that is below the mean
for their assigned dose, then the other twin will also have an outcome that is below
the mean for their assigned dose (whether the doses are the same or different).
A more interesting cause of correlated errors is when subjects are trained in
groups, and the different trainers have important individual differences that affect
the trainees' performance. Then knowing that a particular subject does better than
average gives us reason to believe that most of the other subjects in the same group
will probably perform better than average because the trainer was probably better
than average.
Another important example of non-independent errors is serial correlation
in which the errors of adjacent observations are similar. This includes adjacency
in both time and space. For example, if we are studying the effects of fertilizer on
plant growth, then similar soil, water, and lighting conditions would tend to make
the errors of adjacent plants more similar. In many task-oriented experiments, if
we allow each subject to observe the previous subject perform the task which is
measured as the outcome, this is likely to induce serial correlation. And worst of
all, if you use the same subject for every observation, just changing the explanatory
variable each time, serial correlation is extremely likely. Breaking the assumption
of independent errors does not indicate that no analysis is possible, only that linear
regression is an inappropriate analysis. Other methods such as time series methods
or mixed models are appropriate when errors are correlated.
Before going into the details of linear regression, it is worth thinking about the
variable types for the explanatory and outcome variables and the relationship of
ANOVA to linear regression. For both ANOVA and linear regression we assume
a Normal distribution of the outcome for each value of the explanatory variable.
(It is equivalent to say that all of the errors are Normally distributed.) Implicitly this indicates that the outcome should be a continuous quantitative variable.
Practically speaking, real measurements are rounded and therefore some of their
continuous nature is not available to us. If we round too much, the variable is
essentially discrete and, with too much rounding, can no longer be approximated
by the smooth Gaussian curve. Fortunately regression and ANOVA are both quite
robust to deviations from the Normality assumption, and it is OK to use discrete
or continuous outcomes that have at least a moderate number of different values,
e.g., 10 or more. It can even be reasonable in some circumstances to use regression
or ANOVA when the outcome is ordinal with a fairly small number of levels.
The explanatory variable in ANOVA is categorical and nominal. Imagine we
are studying the effects of a drug on some outcome and we first do an experiment
comparing control (no drug) vs. drug (at a particular concentration). Regression
and ANOVA would give equivalent conclusions about the effect of drug on the
outcome, but regression seems inappropriate. Two related reasons are that with only two x values there is no way to check the linearity assumption, and that after a regression analysis it is natural to interpolate between the x (dose) values, which is inappropriate here.
Now consider another experiment with 0, 50 and 100 mg of drug. Now ANOVA
and regression give different answers because ANOVA makes no assumptions about
the relationships of the three population means, but regression assumes a linear
relationship. If the truth is linearity, the regression will have a bit more power than ANOVA.
[Figure: a plot with Y on the vertical axis (0 to 15) and x on the horizontal axis (0 to 10).]

[Figure 9.2: scatterplot of Final Weight (gm) on the vertical axis versus nitrogen added (mg) on the horizontal axis, 0 to 100.]
EDA, in the form of a scatterplot, is shown in figure 9.2.
We want to use EDA to check that the assumptions are reasonable before trying a regression analysis. We can see that the assumption of linearity seems
plausible because we can imagine a straight line from bottom left to top right
going through the center of the points. Also the assumption of equal spread is
plausible because for any narrow range of nitrogen values (horizontally), the spread
of weight values (vertically) is fairly similar. These assumptions should only be
doubted at this stage if they are drastically broken. The assumption of Normality
is not something that human beings can test by looking at a scatterplot. But if
we noticed, for instance, that there were only two possible outcomes in the whole
experiment, we could reject the idea that the distribution of weights is Normal at
each nitrogen level.
The assumption of fixed-x cannot be seen in the data. Usually we just think
about the way the explanatory variable is measured and judge whether or not it
is measured precisely (with small spread). Here, it is not too hard to measure the
amount of nitrogen fertilizer added to each pot, so we accept the assumption of fixed-x.
The basic regression analysis uses fairly simple formulas to get estimates of the
parameters β0, β1, and σ². These estimates can be derived from either of two
basic approaches which lead to identical results. We will not discuss the more
complicated maximum likelihood approach here. The least squares approach is
fairly straightforward. It says that we should choose as the best-fit line the line that minimizes the sum of the squared residuals, where the residuals are the vertical distances from individual points to the best-fit "regression" line.
The principle is shown in figure 9.3. The plot shows a simple example with
four data points. The diagonal line shown in black is close to, but not equal to the
“best-fit” line.
Any line can be characterized by its intercept and slope. The intercept is the
y value when x equals zero, which is 1.0 in the example. Be sure to look carefully
at the x-axis scale; if it does not start at zero, you might read off the intercept
incorrectly. The slope is the change in y for a one-unit change in x. Because the
line is straight, you can read this off anywhere. Also, an equivalent definition is the
change in y divided by the change in x for any segment of the line. In the figure,
a segment of the line is marked with a small right triangle. The vertical change is
2 units and the horizontal change is 1 unit, therefore the slope is 2/1=2. Using b0
for the intercept and b1 for the slope, the equation of the line is y = b0 + b1 x.
[Figure 9.3: four data points and a candidate line with intercept b0 = 1.0; a marked segment shows a vertical change of 21 − 19 = 2 over a horizontal change of 10 − 9 = 1, so the slope is 2/1 = 2; one residual is marked as 3.5 − 11 = −7.5. Y is on the vertical axis (0 to 25) and x is on the horizontal axis (0 to 12).]
By plugging different values for x into this equation we can find the corre-
sponding y values that are on the line drawn. For any given b0 and b1 we get a
potential best-fit line, and the vertical distances of the points from the line are
called the residuals. We can use the symbol ŷi , pronounced “y hat sub i”, where
“sub” means subscript, to indicate the fitted or predicted value of outcome y for
subject i. (Some people also use the symbol yi′, "y-prime sub i".) For subject i, who has explanatory variable xi, the prediction is ŷi = b0 + b1 xi and the residual is yi − ŷi. The least squares principle says that the best-fit line is the one with the smallest sum of squared residuals. It is interesting to note that the sum of the residuals
(not squared) is zero for the least-squares best-fit line.
In practice, we don’t really try every possible line. Instead we use calculus to
find the values of b0 and b1 that give the minimum sum of squared residuals. You
don’t need to memorize or use these equations, but here they are in case you are
interested.

\[
b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}
\qquad\qquad
b_0 = \bar{y} - b_1 \bar{x}
\]
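To make the arithmetic concrete, here is a short Python sketch (using numpy, with made-up x and y values standing in for real data; neither the data nor the code come from the original analysis) that applies these two formulas directly:

import numpy as np

# Made-up example data (not the corn data from this chapter).
x = np.array([0., 20., 40., 60., 80., 100.])
y = np.array([90., 210., 300., 420., 520., 610.])

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # slope estimate
b0 = ybar - b1 * xbar                                           # intercept estimate

residuals = y - (b0 + b1 * x)
print(b0, b1, residuals.sum())  # the residuals sum to (essentially) zero

The last printed value illustrates the point above that the residuals from the least-squares line sum to zero, apart from rounding error.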
Here are the derivations of the coefficient estimates. SSR indicates sum
of squared residuals, the quantity to minimize.
\begin{align}
SSR &= \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2 \tag{9.1}\\
    &= \sum_{i=1}^{n} \bigl(y_i^2 - 2 y_i(\beta_0 + \beta_1 x_i) + \beta_0^2 + 2\beta_0\beta_1 x_i + \beta_1^2 x_i^2\bigr) \tag{9.2}
\end{align}
\begin{align}
\frac{\partial SSR}{\partial \beta_0} &= \sum_{i=1}^{n} \bigl(-2 y_i + 2\beta_0 + 2\beta_1 x_i\bigr) \tag{9.3}\\
0 &= \sum_{i=1}^{n} \bigl(-y_i + \hat{\beta}_0 + \hat{\beta}_1 x_i\bigr) \tag{9.4}
\end{align}
A little algebra shows that this formula for β̂1 is equivalent to the one shown above because $c \sum_{i=1}^{n} (z_i - \bar{z}) = c \cdot 0 = 0$ for any constant c and variable z.
In multiple regression, the matrix formula for the coefficient estimates is (X′X)⁻¹X′y, where X is the matrix with all ones in the first column (for the intercept) and the values of the explanatory variables in the subsequent columns.
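A minimal sketch of the same calculation in matrix form, again with made-up data; the design matrix has a column of ones for the intercept and a column of x values, and solving (X′X)b = X′y gives the same b0 and b1 as the formulas above:

import numpy as np

x = np.array([0., 20., 40., 60., 80., 100.])   # made-up explanatory values
y = np.array([90., 210., 300., 420., 520., 610.])

X = np.column_stack([np.ones_like(x), x])      # column of ones, then x
b = np.linalg.solve(X.T @ X, X.T @ y)          # equivalent to (X'X)^{-1} X'y
print(b)                                       # b[0] = intercept, b[1] = slope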
Because the intercept and slope estimates are statistics, they have sampling
distributions, and these are determined by the true values of β0, β1, and σ², as
well as the positions of the x values and the number of subjects at each x value.
If the model assumptions are correct, the sampling distributions of the intercept
and slope estimates both have means equal to the true values, β0 and β1 , and
are Normally distributed with variances that can be calculated according to fairly
simple formulas which involve the x values and σ².
In practice, we have to estimate σ² with s². This has two consequences. First, we talk about the standard errors of the sampling distributions of each of the betas, i.e., their estimated standard deviations obtained by substituting s² for σ². Second, inference about the coefficients is based on the t-distribution rather than the Normal distribution.
The formulas for the standard errors come from the formula for the variance-covariance matrix of the joint sampling distributions of β̂0 and β̂1, which is σ²(X′X)⁻¹, where X is the matrix with all ones in the first
column (for the intercept) and the values of the explanatory variable in
the second column. This formula also works in multiple regression where
there is a column for each explanatory variable. The standard errors of the
coefficients are obtained by substituting s² for the unknown σ² and taking
the square roots of the diagonal elements.
For simple regression this reduces to
\[
SE(b_0) = s\sqrt{\frac{\sum x^2}{n\sum x^2 - \left(\sum x\right)^2}}
\qquad\text{and}\qquad
SE(b_1) = s\sqrt{\frac{n}{n\sum x^2 - \left(\sum x\right)^2}}.
\]
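These standard error formulas can be checked numerically; the sketch below (made-up data; s is taken as the square root of the sum of squared residuals divided by n − 2, the usual estimate of σ) is only an illustration of the algebra:

import numpy as np

x = np.array([0., 20., 40., 60., 80., 100.])
y = np.array([90., 210., 300., 420., 520., 610.])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # estimate of sigma (n - 2 df)

denom = n * np.sum(x ** 2) - np.sum(x) ** 2
SE_b0 = s * np.sqrt(np.sum(x ** 2) / denom)
SE_b1 = s * np.sqrt(n / denom)
print(SE_b0, SE_b1)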
The basic regression output is shown in table 9.1 in a form similar to that
produced by SPSS, but somewhat abbreviated. Specifically, "standardized coefficients" are not included.
In this table we see the number 94.821 to the right of the “(Constant)” label
and under the labels “Unstandardized Coefficients” and “B”. This is called the
intercept estimate, estimated intercept coefficient, or estimated constant, and can
                  Unstandardized Coefficients                        95% Confidence Interval for B
                  B          Std. Error     t         Sig.      Lower Bound     Upper Bound
(Constant)        94.821     18.116         4.682     .000      47.251          122.391
Nitrogen added    5.269      .299           17.610    .000      4.684           5.889
\[
t_j = \frac{b_j - \text{hypothesized value of } \beta_j}{SE(b_j)}.
\]
Then the computer uses the null sampling distributions of the t-statistics, i.e., the t-distribution with n − 2 df, to compute the 2-sided p-values as the areas under the null sampling distribution more extreme (farther from zero) than the observed t-statistics for this experiment. SPSS reports this as "Sig.", and as usual gives the misleading output ".000" when the p-value is really "< 0.0005".
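The t-statistics and two-sided p-values can be reproduced along these lines; this is a sketch using scipy's t-distribution with the hypothesized value set to zero, as in the table above (made-up data, not the SPSS output shown):

import numpy as np
from scipy import stats

x = np.array([0., 20., 40., 60., 80., 100.])
y = np.array([90., 210., 300., 420., 520., 610.])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))
denom = n * np.sum(x ** 2) - np.sum(x) ** 2
SE_b0 = s * np.sqrt(np.sum(x ** 2) / denom)
SE_b1 = s * np.sqrt(n / denom)

t0 = (b0 - 0) / SE_b0                       # t-statistic for H0: beta0 = 0
t1 = (b1 - 0) / SE_b1                       # t-statistic for H0: beta1 = 0
p0 = 2 * stats.t.sf(abs(t0), df=n - 2)      # two-sided p-value, n - 2 df
p1 = 2 * stats.t.sf(abs(t1), df=n - 2)
print(t0, p0, t1, p1)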
SPSS also gives Standardized Coefficients (not shown here). These are the
coefficient estimates obtained when both the explanatory and outcome variables
are converted to so-called Z-scores by subtracting their means then dividing by
their standard deviations. Under these conditions the intercept estimate is zero,
so it is not shown. The main use of standardized coefficients is to allow comparison of the importance of different explanatory variables in multiple regression by showing the comparative effects of changing the explanatory variables by one standard deviation instead of by one unit of measurement. I rarely use standardized coefficients.
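As a sketch of where standardized coefficients come from (made-up data; this is not SPSS's computation, just the z-score idea described above), both variables can be standardized before fitting; the intercept then comes out essentially zero and, in simple regression, the standardized slope equals the correlation of x and y:

import numpy as np

x = np.array([0., 20., 40., 60., 80., 100.])
y = np.array([90., 210., 300., 420., 520., 610.])

zx = (x - x.mean()) / x.std(ddof=1)   # z-scores of the explanatory variable
zy = (y - y.mean()) / y.std(ddof=1)   # z-scores of the outcome

b1_std = np.sum(zx * zy) / np.sum(zx ** 2)   # slope for the standardized variables
b0_std = zy.mean() - b1_std * zx.mean()      # essentially zero
print(b0_std, b1_std, np.corrcoef(x, y)[0, 1])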
The output above also shows the "95% Confidence Interval for B", which is generated in SPSS by clicking "Confidence Intervals" under the "Statistics" button. In the given example we can say "we are 95% confident that βN is between 4.68 and 5.89." More exactly, we know that using the method of construction of coefficient estimates and confidence intervals detailed above, and if the assumptions of
regression are met, then each time we perform an experiment in this setting we will
get a different confidence interval (center and width), and out of many confidence
intervals 95% of them will contain βN and 5% of them will not.
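The confidence interval column can be reconstructed as estimate ± t-multiplier × standard error, with the multiplier taken from the t-distribution with n − 2 df; a sketch for the slope (made-up data again):

import numpy as np
from scipy import stats

x = np.array([0., 20., 40., 60., 80., 100.])
y = np.array([90., 210., 300., 420., 520., 610.])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)
s = np.sqrt(np.sum(residuals ** 2) / (n - 2))
SE_b1 = s * np.sqrt(n / (n * np.sum(x ** 2) - np.sum(x) ** 2))

t_mult = stats.t.ppf(0.975, df=n - 2)            # multiplier for a 95% interval
print(b1 - t_mult * SE_b1, b1 + t_mult * SE_b1)  # lower and upper bounds for beta1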
To interpret the intercept estimate, first check the range of x values covered by the experimental data. If there is no
x data near zero, then the intercept is still needed for calculating ŷ and residual
values, but it should not be interpreted because it is an extrapolated value.
If there are x values near zero, then to interpret the intercept you must express
it in terms of the actual meanings of the outcome and explanatory variables. For
the example of this chapter, we would say that b0 (94.8) is the estimated corn plant
weight (in grams) when no nitrogen is added to the pots (which is the meaning of
x = 0). This point estimate is of limited value, because it does not express the
degree of uncertainty associated with it. So often it is better to use the CI for b0 .
In this case we say that we are 95% confident that the mean weight for corn plants
with no added nitrogen is between 47 and 122 gm, which is quite a wide range. (It
would be quite misleading to report the mean no-nitrogen plant weight as 94.821
gm because it gives a false impression of high precision.)
After interpreting the estimate of b0 and its CI, you should consider whether the null hypothesis, β0 = 0, makes scientific sense. For the corn example, the null
hypothesis is that the mean plant weight equals zero when no nitrogen is added.
Because it is unreasonable for plants to weigh nothing, we should stop here and not
interpret the p-value for the intercept. For another example, consider a regression
of weight gain in rats over a 6 week period as it relates to dose of an anabolic
steroid. Because we might be unsure whether the rats were initially at a stable
weight, it might make sense to test H0 : β0 = 0. If the null hypothesis is rejected
then we conclude that it is not true that the weight gain is zero when the dose is
zero (control group), so the initial weight was not a stable baseline weight.
Interpret the estimate, b0 , only if there are data near zero and setting
the explanatory variable to zero makes scientific sense. The meaning
of b0 is the estimate of the mean outcome when x = 0, and should
always be stated in terms of the actual variables of the study. The p-
value for the intercept should be interpreted (with respect to retaining
or rejecting H0 : β0 = 0) only if both the equality and the inequality of
the mean outcome to zero when the explanatory variable is zero are
scientifically plausible.
For interpretation of a slope coefficient, this section will assume that the setting
is a randomized experiment, and conclusions will be expressed in terms of causation.
A plot of all residuals on the y-axis vs. the predicted values on the x-axis, called
a residual vs. fit plot, is a good way to check the linearity and equal variance
assumptions. A quantile-normal plot of all of the residuals is a good way to check
the Normality assumption. As mentioned above, the fixed-x assumption cannot be
checked with residual analysis (or any other data analysis). Serial correlation can
be checked with special residual analyses, but is not visible on the two standard
residual plots. The other types of correlated errors are not detected by standard
residual analyses.
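As a sketch of how these two diagnostic plots can be drawn (matplotlib and scipy, applied here to simulated data rather than to the corn experiment):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=60)                 # simulated explanatory values
y = 95 + 5 * x + rng.normal(0, 40, size=60)      # simulated outcomes around a true line

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(fitted, residuals)                   # residual vs. fit plot
ax1.axhline(0, color="gray")
ax1.set_xlabel("Fitted value")
ax1.set_ylabel("Residual")
stats.probplot(residuals, dist="norm", plot=ax2) # quantile-normal plot of the residuals
plt.show()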
To analyze a residual vs. fit plot, such as any of the examples shown in figure
9.4, you should mentally divide it up into about 5 to 10 vertical stripes. Then each stripe represents all of the residuals for a number of subjects who have similar predicted values. For simple regression, when there is only a single explanatory variable, similar predicted values are equivalent to similar values of the explanatory variable. But be careful: if the slope is negative, low x values are on the right.
(Note that sometimes the x-axis is set to be the values of the explanatory variable,
in which case each stripe directly represents subjects with similar x values.)
To check the linearity assumption, consider that for each x value, if the mean of
Y falls on a straight line, then the residuals have a mean of zero. If we incorrectly fit
a straight line to a curve, then some or most of the predicted means are incorrect,
and this causes the residuals for at least specific ranges of x (or the predicted Y)
to be non-zero on average. Specifically if the data follow a simple curve, we will
tend to have either a pattern of high then low then high residuals or the reverse.
So the technique used to detect non-linearity in a residual vs. fit plot is to find the
[Four panels, A–D, each plotting Residual (vertical axis) against Fitted value (horizontal axis).]
Figure 9.4: Sample residual vs. fit plots for testing linearity.
(vertical) mean of the residuals for each vertical stripe, then actually or mentally
connect those means, either with straight line segments, or possibly with a smooth
curve. If the resultant connected segments or curve is close to a horizontal line
at 0 on the y-axis, then we have no reason to doubt the linearity assumption. If
there is a clear curve, most commonly a “smile” or “frown” shape, then we suspect
non-linearity.
Four examples are shown in figure 9.4. In each band the mean residual is marked, and line segments connect these. Plots A and B show no obvious pattern away from a horizontal line other than the small amount of expected "noise". Plots C and D show clear deviations from linearity, because the lines connecting the mean residuals of the vertical bands show a clear frown (C) and smile (D) pattern,
rather than a flat line. Untransformed linear regression is inappropriate for the
[Four panels, A–D, each plotting Residual (vertical axis) against Fitted value (horizontal axis).]
Figure 9.5: Sample residual vs. fit plots for testing equal variance.
data that produced plots C and D. With practice you will get better at reading
these plots.
To detect unequal spread, we use the vertical bands in a different way. Ideally
the vertical spread of residual values is equal in each vertical band. This takes
practice to judge in light of the expected variability of individual points, especially
when there are few points per band. The main idea is to realize that the minimum
and maximum residual in any set of data is not very robust, and tends to vary a
lot from sample to sample. We need to estimate a more robust measure of spread
such as the IQR. This can be done by eyeballing the middle 50% of the data.
Eyeballing the middle 60 or 80% of the data is also a reasonable way to test the
equal variance assumption.
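The band-by-band comparison of spread can also be made a little more concrete by computing a robust measure such as the IQR of the residuals within each band of fitted values; a possible sketch with simulated, equal-variance data (the five-band split is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=120)
y = 95 + 5 * x + rng.normal(0, 40, size=120)   # simulated data with equal error variance

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x
residuals = y - fitted

edges = np.quantile(fitted, np.linspace(0, 1, 6))   # five bands of fitted values
for lo, hi in zip(edges[:-1], edges[1:]):
    band = residuals[(fitted >= lo) & (fitted <= hi)]
    q1, q3 = np.percentile(band, [25, 75])
    print(f"fitted {lo:6.1f} to {hi:6.1f}: residual IQR = {q3 - q1:6.1f}")

Roughly similar IQRs across the bands are consistent with the equal variance assumption; a steady increase or a U-shaped pattern would correspond to plots C and D of figure 9.5.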
Figure 9.5 shows four residual vs. fit plots, each of which shows good linearity.
The red horizontal lines mark the central 60% of the residuals. Plots A and B show
no evidence of unequal variance; the red lines are a similar distance apart in each
band. In plot C you can see that the red lines increase in distance apart as you
move from left to right. This indicates unequal variance, with greater variance at
high predicted values (high x values if the slope is positive). Plot D shows a pattern
with unequal variance in which the smallest variance is in the middle of the range
of predicted values, with larger variance at both ends. Again, this takes practice,
but you should at least recognize obvious patterns like those shown in plots C and
D. And you should avoid over-reading the slight variations seen in plots A and B.
The residual vs. fit plot can be used to detect non-linearity and/or
unequal variance.
The check of normality can be done with a quantile normal plot as seen in
figure 9.6. Plot A shows no problem with Normality of the residuals because the
points show a random scatter around the reference line (see section 4.3.4). Plot B
is also consistent with Normality, perhaps showing slight skew to the left. Plot C
shows definite skew to the right, because at both ends we see that several points
are higher than expected. Plot D shows a severe low outlier as well as heavy tails
(positive kurtosis) because the low values are too low and the high values are too
high.
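To build intuition for these patterns, it can help to draw quantile-normal plots of samples from a Normal, a right-skewed, and a heavy-tailed distribution and compare them with panels A through D; a sketch (note that scipy's probplot puts the theoretical quantiles on the horizontal axis, whereas figure 9.6 has Quantiles of Standard Normal on the vertical axis, so the patterns appear flipped across the diagonal):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
samples = {
    "Normal": rng.normal(0, 1, 50),
    "Right-skewed": rng.exponential(1.0, 50),        # skew to the right
    "Heavy-tailed": rng.standard_t(df=2, size=50),   # positive kurtosis
}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, sample) in zip(axes, samples.items()):
    stats.probplot(sample, dist="norm", plot=ax)     # quantile-normal plot
    ax.set_title(name)
plt.show()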
[Figure 9.6: four quantile-normal plots of residuals, panels A–D, with Quantiles of Standard Normal on the vertical axis.]
If the fixed-x assumption is badly violated, the slope estimate is biased toward zero, and the null hypothesis of a zero slope will be retained far too often. Alternate techniques are required if the fixed-x
assumption is broken, including so-called Type 2 regression or “errors in variables
regression”.
The independent errors assumption is also critically important to regression.
A slight violation, such as a few twins in the study, doesn't matter, but other mild
to moderate violations destroy the validity of the p-value and confidence intervals.
In that case, use alternate techniques such as the paired t-test, repeated measures
analysis, mixed models, or time series analysis, all of which model correlated errors
rather than assume zero correlation.
the prediction line. Some programs report the mean squared error (MSE), which is the estimate of σ².
In simple regression, how close the simple correlation of x and y is to 1 or -1 is
a measure of the strength of the association. Because there are many correlations
in multiple regression (one for each x) we need a different way to measure the
overall strength of the regression. We use a quantity called the R² value or multiple correlation coefficient. Note that when applied to simple regression, R² is equal to the square of the simple correlation.
R² can be interpreted as the fraction (or percent if multiplied by 100) of the
total variation in the outcome that is “accounted for” by regressing the outcome
on the explanatory variables.
A little math helps here. The total variation, used in the case of R², is the sum of squared deviations of each y value from the mean of y (SStot). Note that this
quantity ignores x. Since the mean of y is the best guess of the outcome for any
subject if the values of the explanatory variables are unknown, we can think of
total variation as measuring how well we can predict y without knowing x.
If we perform regression and then focus on the residuals, we can square and then sum these residuals to get SSres. The better x helps to predict y, the smaller the residuals will be, and therefore the smaller SSres will be. We can interpret SSres as a measure of "unexplained variability".
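A sketch of these sums of squares and the resulting R² (made-up data again; as noted above, in simple regression R² equals the squared correlation of x and y):

import numpy as np

x = np.array([0., 20., 40., 60., 80., 100.])
y = np.array([90., 210., 300., 420., 520., 610.])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

SS_tot = np.sum((y - y.mean()) ** 2)      # total variation, ignoring x
SS_res = np.sum(residuals ** 2)           # unexplained variation
R2 = (SS_tot - SS_res) / SS_tot           # explained fraction of the total variation
print(R2, np.corrcoef(x, y)[0, 1] ** 2)   # these two numbers agree in simple regression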
If we compute total minus residual variability (SStot − SSres) we can call the result "explained variability". It represents the amount of variability in y that is explained by x.