Multiple linear regression, introduced in Sect. 11.6, addresses the case where the outcome
depends on more than one variable.
To begin with, we consider only two quantitative variables X and Y in which the
outcome Y depends on X and we explore the quantification of their relationship.
Consider the scatter plots from Fig. 4.2 on p. 80. Plotting X -values against Y -values
enables us to visualize the relationship between two variables. Figure 11.1a reviews
what a positive (linear) association between X and Y looks like: the higher the X
values, the higher the Y -values (and vice versa). The plot indicates that there may
be a linear relationship between X and Y . On the other hand, Fig. 11.1b displays a
scatter plot which shows no clear relationship between X and Y. The R² measure,
shown in the two figures and explained in more detail later, equals the squared
correlation coefficient of Bravais–Pearson.
It may be noted that if the regression parameters α and β are known, then the linear
model is completely known. An important objective in identifying the model is to
determine the values of the regression parameters using the available observations
for X and Y . It can be understood from Fig. 11.1a that an ideal situation would be
when all the data points lie exactly on the line or, in other words, the error ei is zero
for each observation. It seems meaningful to determine the values of the parameters
in such a way that the errors are minimized.
There are several methods to estimate the values of the regression parameters. In
the remainder of this chapter, we describe the methods of least squares and maximum
likelihood estimation.
[Figure: a regression line y = α + βx with intercept α and slope β; the vertical distances e1, e2, e3, e4 measure the deviations of the four data points (x1, y1), ..., (x4, y4) from the line.]
Analogously, e2, e3, and e4 denote the differences of the other three data points from the line. The error is zero if
the point lies exactly on the line. The problem we would like to solve in this example
is to minimize the sum of squares of e1, e2, e3, and e4, i.e.
\[
\min_{\alpha,\beta} \sum_{i=1}^{4} (y_i - \alpha - \beta x_i)^2. \tag{11.4}
\]
We want the line to fit the data well. This can generally be achieved by choosing
α and β such that the squared errors of all the n observations are minimized:
\[
\min_{\alpha,\beta} \sum_{i=1}^{n} e_i^2 = \min_{\alpha,\beta} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2. \tag{11.5}
\]
If we solve this optimization problem by the principle of maxima and minima, we
obtain estimates of α and β as
\[
\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}, \tag{11.6}
\]
see Appendix C.6 for a detailed derivation. Here, α̂ and β̂ represent the estimates of
the parameters α and β, respectively, and are called the least squares estimator of
α and β, respectively. This gives us the model y = α̂ + β̂x which is called the fitted
model or the fitted regression line. The literal meaning of “regression” is to move
back. Since we are acquiring the data and then moving back to find the parameters
of the model using the data, it is called a regression model. The fitted regression
line y = α̂ + β̂x describes the postulated relationship between Y and X . The sign
of β determines whether the relationship between X and Y is positive or negative.
If the sign of β is positive, it indicates that if X increases, then Y increases too. On
the other hand, if the sign of β is negative, it indicates that if X increases, then Y
decreases. For any given value of X , say xi , the predicted value ŷi is calculated by
ŷi = α̂ + β̂xi
and is called the ith fitted value.
If we compare the observed data point (xi , yi ) with the point suggested (predicted,
fitted) by the regression line, (xi , ŷi ), the difference between yi and ŷi is called the
residual and is given as
êi = yi − ŷi = yi − (α̂ + β̂xi ). (11.7)
This can be viewed as an estimator of the error ei . Note that it is not really an estimator
in the true statistical sense because ei is random and so it cannot be estimated.
However, since the ei are unknown and êi measures the same difference between the
estimated and true values of the y's (see, for example, Fig. 11.2), it can be treated as
estimating the error ei.
Example 11.2.1 A physiotherapist advises 12 of his patients, all of whom had the
same knee surgery done, to regularly perform a set of exercises. He asks them to
record how long they practise. He then summarizes the average time they practised
(X , time in minutes) and how long it takes them to regain their full range of motion
again (Y , time in days). The results are as follows:
i     1   2   3   4   5   6   7   8   9   10  11  12
xi    24  35  64  20  33  27  42  41  22  50  36  31
yi    90  65  30  60  60  80  45  45  80  35  50  45
Using (11.6), we can now easily find the least squares estimates α̂ and β̂ as
Sx y (xi − x̄)(yi − ȳ) −2125.65
β̂ = = = ≈ −1.22,
Sx x (xi − x̄)2 1748.94
α̂ = ȳ − β̂ x̄ = 57.08 − (−1.215) · 35.41 = 100.28.
The fitted regression line is therefore
y = 100.28 − 1.22 · x.
We can interpret the results as follows:
• For an increase of 1 min in exercising, the recovery time decreases by 1.22 days
because β̂ = −1.22. The negative sign of β̂ indicates that the relationship between
exercising time and recovery time is negative; i.e. as exercise time increases, the
recovery time decreases.
• When comparing two patients with a difference in exercising time of 10 min,
the linear model estimates a difference in recovery time of 12.2 days because
10 · 1.22 = 12.2.
[Fig. 11.3: Scatter plot of exercise time (in minutes) against recovery time (in days) for Example 11.2.1, with the fitted regression line.]
We can also obtain these results by using R. The command lm(Y~X) fits a linear
model and provides the estimates of α̂ and β̂.

lm(Y~X)

We can draw the regression line onto a scatter plot using the command abline,
see also Fig. 11.3.

plot(X,Y)
abline(a=100.28, b=-1.22)
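The estimates can also be reproduced directly from formula (11.6). The following is a minimal sketch, assuming the data of Example 11.2.1 are stored in the vectors X and Y (the object names Sxy, Sxx, beta.hat, and alpha.hat are chosen here purely for illustration):

# data from Example 11.2.1: exercise time (minutes) and recovery time (days)
X <- c(24, 35, 64, 20, 33, 27, 42, 41, 22, 50, 36, 31)
Y <- c(90, 65, 30, 60, 60, 80, 45, 45, 80, 35, 50, 45)
# least squares estimates according to (11.6)
Sxy <- sum((X - mean(X)) * (Y - mean(Y)))
Sxx <- sum((X - mean(X))^2)
beta.hat  <- Sxy / Sxx                       # approximately -1.22
alpha.hat <- mean(Y) - beta.hat * mean(X)    # approximately 100.1
# the same estimates via lm()
coef(lm(Y ~ X))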
There are a couple of interesting results related to the regression line and the least
square estimates.
(i) As a rule of thumb, one should interpret the regression line ŷi = α̂ + β̂xi only in
the interval [x(1) , x(n) ]. For example, if X denotes “Year”, with observed values
from 1999 to 2015, and Y denotes the “annual volume of sales of a particular
company”, then a prediction for the year 2030 may not be meaningful or valid
because a linear relationship discovered in the past may not continue to hold in
the future.
(ii) For the points P̂i = (xi , ŷi ), forming the regression line, we can write
ŷi = α̂ + β̂xi = ȳ + β̂(xi − x̄). (11.8)
(iii) It follows for xi = x̄ that ŷi = ȳ, i.e. the point (x̄, ȳ) always lies on the regres-
sion line. The linear regression line therefore always passes through (x̄, ȳ).
(iv) The sum of the residuals is zero. The ith residual is defined as
êi = yi − ŷi = yi − (α̂ + β̂xi ) = yi − [ ȳ + β̂(xi − x̄)].
The sum is therefore
\[
\sum_{i=1}^{n} \hat{e}_i = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \bar{y} - \hat{\beta}\sum_{i=1}^{n}(x_i - \bar{x}) = n\bar{y} - n\bar{y} - \hat{\beta}(n\bar{x} - n\bar{x}) = 0. \tag{11.9}
\]
(v) The arithmetic mean of ŷ is the same as the arithmetic mean of y:
\[
\bar{\hat{y}} = \frac{1}{n}\sum_{i=1}^{n} \hat{y}_i = \frac{1}{n}\big(n\bar{y} + \hat{\beta}(n\bar{x} - n\bar{x})\big) = \bar{y}. \tag{11.10}
\]
(vi) The least squares estimate β̂ has a direct relationship with the correlation coef-
ficient of Bravais–Pearson:
\[
\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} \cdot \sqrt{\frac{S_{yy}}{S_{xx}}} = r \sqrt{\frac{S_{yy}}{S_{xx}}}. \tag{11.11}
\]
The slope of the regression line is therefore proportional to the correlation coefficient r: a positive correlation coefficient implies a positive estimate of β and vice versa. However, a stronger correlation does not necessarily imply a steeper slope in the regression analysis because the slope also depends on √(S_yy/S_xx).
While one can easily fit a linear regression model, this does not mean that the model
is necessarily good. Consider again Fig. 11.1: In Fig. 11.1a, the regression line is
clearly capturing the linear trend seen in the scatter plot. The fit of the model to the
data seems good. Figure 11.1b shows however that the data points vary considerably
around the line. The quality of the model does not seem to be very good. If we used
the regression line to predict new observations, we would likely obtain poor results. It
is obvious from Fig. 11.1 that the model provides a good fit to the data when the
observations lie close to the line and follow a linear trend.
[Fig. 11.4: Two scatter plots with fitted regression lines; (a) R² = 0.97, (b) R² = 0.57.]
A look at the scatter plot provides a visual and qualitative approach to judging
the quality of the fitted model. Consider Fig. 11.4a, b which both show a linear trend
in the data, but the observations in Fig. 11.4a are closer to the regression line than
those in Fig. 11.4b. Therefore, the goodness of fit is worse in the latter figure and any
quantitative measure should capture this.
A quantitative measure for the goodness of fit can be derived by means of vari-
ance decomposition of the data. The total variation of y is partitioned into two
components—sum of squares due to the fitted model and sum of squares due to
random errors in the data:
\[
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \tag{11.12}
\]
The first term on the right-hand side is the sum of squares due to regression (SQ_Regression), the second is the sum of squares due to error (SQ_Error), and the left-hand side is the total sum of squares (SQ_Total). If the deviations of the observations from the fitted values are large, then the sum of squares due to error is large and the goodness of fit should be bad. If the sum of squares due to regression is zero, it is evident that the model fit is the worst possible. These thoughts are reflected in the criterion for the goodness of fit, also known as R²:
\[
R^2 = \frac{SQ_{\text{Regression}}}{SQ_{\text{Total}}} = 1 - \frac{SQ_{\text{Error}}}{SQ_{\text{Total}}}. \tag{11.13}
\]
It follows from the above definition that 0 ≤ R² ≤ 1. The closer R² is to 1, the better
the fit because SQ_Error will then be small. The closer R² is to 0, the worse the fit,
because SQ_Error will then be large. If R² takes any other value, say R² = 0.7, it
means that only 70 % of the variation in the data is explained by the fitted model, and
hence, in simple language, the model is 70 % good. An important point to remember
is that R² is defined only when there is an intercept term in the model (an assumption
we make throughout this chapter). So it is not used to measure the goodness of fit in
models without an intercept term.
Example 11.3.1 Consider again Fig. 11.1: In Fig. 11.1a, the line and data points fit
well together. As a consequence, R² is high, R² = 0.82. Figure 11.1b shows data
points with large deviations from the regression line; therefore, R² is small, here
R² = 0.002. Similarly, in Fig. 11.4a, an R² of 0.97 relates to an almost perfect
model fit, whereas in Fig. 11.4b, the model describes the data only moderately well
(R² = 0.57).
Example 11.3.2 Consider again Example 11.2.1 where we analysed the relationship
between exercise intensity and recovery time for patients convalescing from knee
surgery. To calculate R², we use the fitted values ŷi = 100.28 − 1.22 · xi together with the
sums of squares SQ_Error = Σi(yi − ŷi)² and SQ_Total = Σi(yi − ȳ)²; evaluating (11.13) yields R² ≈ 0.66.
We conclude that the regression model provides a reasonable but not perfect fit to
the data because 0.66 is not close to 0, but also not very close to 1. About 66 % of the
variability in the data can be explained by the fitted linear regression model. The rest
is random error: for example, individual variation in the recovery time of patients
due to genetic and environmental factors, surgeon performance, and others.
We can also obtain this result in R by looking at the summary of the linear model:
summary(lm(Y∼X))
Please note that we give a detailed explanation of the model summary in Sect. 11.7.
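Alternatively, R² can be computed directly from its definition (11.13). A minimal sketch, again assuming the vectors X and Y of Example 11.2.1 are available (the object name m is chosen for illustration):

m <- lm(Y ~ X)
SQ.Error <- sum(residuals(m)^2)        # sum of squares due to error
SQ.Total <- sum((Y - mean(Y))^2)       # total sum of squares
1 - SQ.Error / SQ.Total                # approximately 0.66
summary(m)$r.squared                   # the same value reported by summary()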
Example 11.3.3 Consider again Examples 11.2.1 and 11.3.2 where we analysed the
association of exercising time and time to regain full range of motion after knee
surgery. We calculated R² = 0.66. Since R² is the squared correlation coefficient of
Bravais–Pearson, we know that |r| = √R² = √0.66 ≈ 0.81; because the estimated slope β̂
is negative, the correlation coefficient is r ≈ −0.81.
Until now, we have assumed that the covariate X is continuous. It is however also
straightforward to fit a linear regression model when X is binary, i.e. if X has two
categories. In this case, the values of X in the first category are usually coded as
0 and the values of X in the second category are coded as 1. For example, if the
binary variable is “gender”, we replace the word “male” with the number 0 and
the word “female” with 1. We can then fit a linear regression model using these
numbers, but the interpretation differs from the interpretations in case of a continuous
variable. Recall the definition of the linear model, Y = α + β X + e with E(e) = 0;
if X = 0 (male) then E(Y |X = 0) = α and if X = 1 (female), then E(Y |X = 1) =
α + β · 1 = α + β. Thus, α is the average value of Y for males, i.e. E(Y |X = 0),
and β = E(Y |X = 1) − E(Y |X = 0). It follows that those subjects with X = 1 (e.g.
females) have on average Y -values which are β units higher than subjects with X = 0
(e.g. males).
Example 11.4.1 Recall Examples 11.2.1, 11.3.2, and 11.3.3 where we analysed the
association of exercising time and recovery time after knee surgery. We keep the
values of Y (recovery time, in days) and replace values of X (exercising time, in
minutes) with 0 for patients exercising for less than 40 min and with 1 for patients
exercising for 40 min or more. We have therefore a new variable X indicating whether
a patient is exercising a lot (X = 1) or not (X = 0). To estimate the linear regression
line ŷ = α̂ + β̂x, we first calculate x̄ = 4/12 and ȳ = 57.08. To obtain Sxy and Sxx,
we compute the sums Σi(xi − x̄)(yi − ȳ) and Σi(xi − x̄)² from the recoded data.
We can now calculate the least squares estimates of α and β using (11.6) as
\[
\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{-72.34}{2.64} \approx -27.4, \qquad
\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 57.08 - (-27.4) \cdot \frac{4}{12} = 66.2.
\]
The fitted regression line is:
y = 66.2 − 27.4 · x.
The interpretation is:
• The average recovery time for patients doing little exercise is 66.2 − 27.4 · 0 =
66.2 days.
• Patients exercising heavily (x = 1) have, on average, a 27.4 days shorter recovery
period (β = −27.4) than patients with a short exercise time (x = 0).
• The average recovery time for patients exercising heavily is 38.8 days (66.2 −
27.4 · 1 for x = 1).
• These interpretations are visualized in Fig. 11.5. Because "exercise time" (X)
is considered to be a nominal variable (two categories representing the intervals
[0; 40) and [40; ∞)), we can plot the regression line on a scatter plot as two parallel
lines; a corresponding R call is sketched below.
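A minimal R sketch of this analysis, assuming the vectors X and Y of Example 11.2.1 are available (the variable name Xbin is chosen here for illustration):

# recode exercising time as binary: 1 if at least 40 minutes, 0 otherwise
Xbin <- as.numeric(X >= 40)
lm(Y ~ Xbin)
# intercept 66.25 and slope -27.5; the values 66.2 and -27.4 reported in the
# text differ slightly because they are based on rounded intermediate sums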
[Fig. 11.5: Scatter plot of exercise time against recovery time (Examples 11.4.1 and 11.5.1), showing the fitted regression lines for the binary X and for the square root of X.]
Recall that a model is said to be linear when it is linear in its parameters. The definition
of a linear model is not at all related to the linearity in the covariate. For example,
Y = α + β²X + e
is not a linear model because the right-hand side of the equation is not a linear
function in β. However,
Y = α + βX² + e
is a linear model. This model can be fitted as for any other linear model: we obtain
α̂ and β̂ as usual and simply use the squared values of X instead of X. This can
be justified by considering X* = X²; the model is then written as Y = α + βX* + e,
which is again a linear model. Only the interpretation changes: for each
unit increase in the squared value of X, i.e. X*, Y increases by β units. Such an
interpretation is often not even needed. One could simply plot the regression line of
Y on X ∗ to visualize the functional form of the effect. This is also highlighted in the
following example.
Example 11.5.1 Recall Examples 11.2.1, 11.3.2, 11.3.3, and 11.4.1 where we
analysed the association of exercising time and recovery time after knee surgery. We
estimated β̂ as −1.22 by using X as it is, as a continuous variable, see also Fig. 11.3.
When using a binary X, based on a cut-off of 40 min, we obtained β̂ = −27.4, see
also Fig. 11.5. If we now use √X rather than X, we obtain β̂ = −15.1. This means for
an increase of 1 unit of the square root of exercising time, the recovery time decreases
by 15.1 days. Such an interpretation will be difficult to understand for many people.
It is better to plot the linear regression line y = 145.8 − 15.1 · √x, see Fig. 11.5. We
can see that the new nonlinear line (obtained from a linear model) fits the data nicely
and it may even be preferable to a straight regression line. Moreover, the value of α̂
substantially increased with this modelling approach, highlighting that no exercising
at all may severely delay recovery from the surgery (which is biologically more
meaningful). In R, we obtain these results by either creating a new variable √X or
by using the I() command which allows specifying transformations in regression
models.
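A minimal sketch of the two ways mentioned above (the variable name Xsqrt is chosen for illustration; X and Y are the vectors of Example 11.2.1):

# option 1: create the transformed covariate explicitly
Xsqrt <- sqrt(X)
lm(Y ~ Xsqrt)
# option 2: transform within the model formula using I()
lm(Y ~ I(sqrt(X)))      # intercept approximately 145.8, slope approximately -15.1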
Generally, the covariate X can be replaced by any transformation of it, such as
log(X), √X, sin(X), or X². The details are discussed in Sect. 11.6.3. It is
also possible to compare the quality and goodness of fit of models with different
transformations (see Sect. 11.8).
Example 11.6.1 For the pizza delivery data, a particular linear model with multiple
covariates could be specified as follows:
Delivery Time = β0 + β1 · Number of pizzas ordered + β2 · Weekend (0/1) + β3 · Operator (0/1) + e.
If we have a data set of n observations, then every set of observations satisfies model
(11.15) and we can write
\[
\begin{aligned}
y_1 &= \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_p x_{1p} + e_1\\
y_2 &= \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_p x_{2p} + e_2\\
&\;\;\vdots\\
y_n &= \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_p x_{np} + e_n
\end{aligned}
\tag{11.16}
\]
It is possible to write the n equations in (11.16) in matrix notation as
\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e} \tag{11.17}
\]
where
\[
\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\mathbf{e} = \begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}.
\]
The letters y, X, β, e are written in bold because they refer to vectors and matrices
rather than scalars. The capital letter X makes it clear that X is a matrix of order
n × (p + 1) representing the n observations on each of the covariates X1, X2, . . . , Xp
together with the column of 1's for the intercept. Similarly, y is an n × 1 vector of n
observations on Y, β is a (p + 1) × 1 vector of regression coefficients (including the
intercept β0), and e is an n × 1 vector of n errors. The
lower case letter x relates to a vector representing a variable which means we can
denote the multiple linear model from now on also as
y = β0 1 + β1 x1 + · · · + β p xp + e (11.18)
where 1 is the n × 1 vector of 1’s. We assume that E(e) = 0 and Cov(e) = σ 2 I (see
Sect. 11.9 for more details).
We would like to highlight that X is not the data matrix. The matrix X is called
the design matrix and contains both a column of 1’s denoting the presence of the
intercept term and all explanatory variables which are relevant to the multiple linear
model (including possible transformations and interactions, see Sects. 11.6.3 and
11.7.3). The errors e reflect the deviations of the observations from the regression
line and therefore the difference between the observed and fitted relationships. Such
deviations may occur for various reasons. For example, the measured values can be
affected by variables which are not included in the model, the variables may not be
accurately measured, or there is unmeasured genetic variability in the study population;
all of this is covered by the error e. The estimate of β is obtained by the
least squares principle, i.e. by minimizing Σ_{i=1}^{n} e_i² = e⊤e. The least squares estimate
of β is given by
\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}. \tag{11.19}
\]
The vector β̂ contains the estimates of (β0, β1, . . . , βp)⊤. We can interpret them as
earlier: β̂0 is the estimate of the intercept which we obtain because we have added the
column of 1's (and is identical to α in (11.1)). The estimates β̂1, β̂2, . . . , β̂p refer to
the regression coefficients associated with the variables x1, x2, . . . , xp, respectively.
The interpretation of β̂j is that it represents the partial change in y when the value
of xj changes by one unit, keeping all other covariates fixed.
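To illustrate (11.19), β̂ can also be computed by hand from the design matrix. A minimal sketch, assuming the vectors X and Y of Example 11.2.1 (a simple linear model, so the design matrix consists of a column of 1's and one covariate column; the object name Xmat is chosen for illustration):

# design matrix: column of 1's plus the covariate
Xmat <- cbind(1, X)
# least squares estimate (X'X)^(-1) X'y, cf. (11.19)
beta.hat <- solve(t(Xmat) %*% Xmat) %*% t(Xmat) %*% Y
beta.hat
# identical to the coefficients returned by lm()
coef(lm(Y ~ X))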
A possible interpretation of the intercept term is that when all covariates equal
zero then
E(y) = β0 + β1 · 0 + · · · + β p · 0 = β0 . (11.20)
There are many situations in real life for which there is no meaningful interpretation
of the intercept term because one or many covariates cannot take the value zero. For
instance, in the pizza data set, the bill can never be €0, and thus, there is no need
to describe the average delivery time for a given bill of €0. The intercept term here
serves the purpose of improving the model fit, but it will not be interpreted.
In some situations, it may happen that the average value of y is zero when all
covariates are zero. Then, the intercept term is zero as well and does not improve
the model. For example, suppose the salary of a person depends on two factors—
education level and type of qualification. If any person is completely illiterate, even
then we observe in practice that his salary is never zero. So in this case, there is a
benefit of the intercept term. In another example, consider that the velocity of a car
depends on two variables—acceleration and quantity of petrol. If these two variables
take values of zero, the velocity is zero on a plane surface. The intercept term will
therefore be zero as well and yields no model improvement.
Example 11.6.2 Consider the pizza data described in Appendix A.4. The data matrix
X is as follows:
X =
  Day        Date        Time   ...  Discount
  Thursday   1-May-14    35.1   ...  1
  Thursday   1-May-14    25.2   ...  0
  ...
  Saturday   31-May-14   35.7   ...  0
Suppose the manager has the hypothesis that the operator and the overall bill (as
a proxy for the amount ordered from the customer) influence the delivery time. We
can postulate a linear model to describe this relationship as
Delivery Time = β0 + β1 Bill + β2 Operator + e.
lm(time~bill+operator)
If we have more than one covariate in a linear regression model, we simply add
all of them separated by the + sign when specifying the model formula in the lm()
function. We obtain β̂0 = 23.1, β̂1 = 0.26, β̂2 = 0.16. The interpretation of these
parameters is as follows:
• For each extra €1 spent, the delivery time increases by 0.26 min. Or, for each
extra €10 spent, the delivery time increases on average by 2.6 min.
• The delivery time is on average 0.16 min longer for operator 1 compared with
operator 0.
• For operator 0 and a bill of €0, the expected delivery time is β̂0 = 23.1 min.
However, there is no bill of €0, and therefore, the intercept β0 cannot be interpreted
meaningfully here and is included only for improvement of the model fit.
We now consider the case when covariates are categorical rather than continuous;
examples are variables such as branch, operator, or driver in the pizza delivery data.
Consider a variable x which has more than two categories, say k > 2 categories. To
include such a variable in a regression model, we create k − 1 new variables xi , i =
1, 2, . . . , k − 1. Similar to how we treat binary variables, each of these variables is
a dummy variable, consisting of 1's for units which belong to the category of
interest and 0's otherwise:
\[
x_i = \begin{cases} 1 & \text{for category } i,\\ 0 & \text{otherwise}. \end{cases} \tag{11.21}
\]
The category for which we do not create a dummy variable is called the reference
category, and the interpretation of the parameters of the dummy variables is with
respect to this category. For example, for category i, the y-values are on average βi
higher than for the reference category. The concept of creating dummy variables and
interpreting them is explained in the following example.
Example 11.6.3 Consider again the pizza data set described in Appendix A.4. The
manager may hypothesize that the delivery times vary with respect to the branch.
There are k = 3 branches: East, West, Centre. Instead of using the variable x =
branch, we create (k − 1), i.e. (3 − 1) = 2 new variables denoting x1 = East and
x2 = West. We set x1 = 1 for those deliveries which come from the branch in the
East and set x1 = 0 for other deliveries. Similarly, we set x2 = 1 for those deliveries
which come from the West and x2 = 0 for other deliveries. The data then is as follows:
Delivery   y (Delivery Time)   x (Branch)   x1 (East)   x2 (West)
   1            35.1              East          1           0
   2            25.2              East          1           0
   3            45.6              West          0           1
   4            29.4              East          1           0
   5            30.0              West          0           1
   6            40.3              Centre        0           0
  ...            ...               ...         ...         ...
 1266           35.7              West          0           1
Deliveries which come from the East have x1 = 1 and x2 = 0, deliveries which
come from the West have x1 = 0 and x2 = 1, and deliveries from the Centre have
x1 = 0 and x2 = 0. The regression model of interest is thus
y = β0 + β1 East + β2 West + e.
We can calculate the least squares estimate β̂ = (β̂0, β̂1, β̂2)⊤ = (X⊤X)−1 X⊤y via R:
either (i) we create the dummy variables ourselves or (ii) we ask R to create them for
us. This requires that "branch" is a factor variable (which it is in our data, but if it
was not, then we would have to define it in the model formula via as.factor()).
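A sketch of the two options, assuming the variables time and branch of the pizza data are available in the workspace and that branch uses the labels "Centre", "East", and "West" as in the table above:

# option (i): create the dummy variables ourselves
east <- ifelse(branch == "East", 1, 0)
west <- ifelse(branch == "West", 1, 0)
lm(time ~ east + west)
# option (ii): let R create the dummy variables from the factor variable
lm(time ~ as.factor(branch))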
• The average delivery time for the branch in the Centre is 36.3 min, the delivery
time for the branch in the East is 36.3 − 5.2 = 31.1 min, and the predicted delivery
time for the West is 36.3 − 1.1 = 35.2 min.
• Therefore, deliveries arrive on average 5.2 min earlier in the East compared with
the centre (β̂1 = −5.2), and deliveries in the West arrive on average 1.1 min earlier
than in the centre (β̂2 = −1.1). In other words, it takes 5.2 min less to deliver a
pizza in the East than in the Centre. The deliveries in the West take on average
1.1 min less than in the Centre.
Consider now a covariate with k = 3 categories for which we create two new
dummy variables, x1 and x2 . The linear model is y = β0 + β1 x1 + β2 x2 + e.
Remark 11.6.1 There are other ways of recoding categorical variables, such as effect
coding. However, we do not describe them in this book.
11.6.3 Transformations
[Fig. 11.6: Scatter plot of exercise time against recovery time (Example 11.6.4), with the fitted linear, quadratic (second-order polynomial), and cubic (third-order polynomial) regression lines.]
Example 11.6.4 Consider again Examples 11.2.1, 11.3.2, 11.3.3, 11.4.1, and 11.5.1
where we analysed the association of intensity of exercising and recovery time after
knee surgery. The linear regression line was estimated as
Recovery Time = 100.28 − 1.22 Exercising Time.
One could question whether the association is indeed linear and fit the second- and
third-order polynomials:
Recovery Time = β0 + β1 Exercise + β2 Exercise²,
Recovery Time = β0 + β1 Exercise + β2 Exercise² + β3 Exercise³.
To obtain the estimates we can use R:
lm(Y~X+I(X^2))
lm(Y~X+I(X^2)+I(X^3))
The fitted curves are displayed in Fig. 11.6. The cubic fit in particular yields a regression line which suggests a heavily increased recovery time for exercising times
greater than 65 min. While it is possible that too much exercising causes damage to
the knee and delays recovery, the data does not seem to be clear enough to support
the shape suggested by the model. This example illustrates the trade-off between
fit (i.e. how good the regression line fits the data) and parsimony (i.e. how well a
model can be understood if it becomes more complex). Section 11.8 explores this
non-trivial issue in more detail.
So far we have introduced linear regression as a method which describes the rela-
tionship between dependent and independent variables through the best fit to the
data. However, as we have highlighted in earlier chapters, often we are interested
not only in describing the data but also in drawing conclusions from a sample about
the population of interest. For instance, in an earlier example, we have estimated the
association of branch and delivery time for the pizza delivery data. The linear model
was estimated as delivery time = 36.3 − 5.2 East − 1.1 West; this indicates that the
delivery time is on average 5.2 min shorter for the branch in the East of the town
compared with the central branch and 1.1 min shorter for the branch in the West com-
pared with the central branch. When considering all pizza deliveries (the population),
not only those collected by us over the period of a month (the sample), we might ask
ourselves whether 1.1 min is a real difference valid for the entire population or just
caused by random error involved in the sample we chose. In particular, we would
also like to know what the 95 % confidence interval for our parameter estimate is.
Does the interval cover “zero”? If yes, we might conclude that we cannot be very
sure that there is an association and do not reject the null hypothesis that βWest = 0.
In reference to Chaps. 9 and 10, we now apply the concepts of statistical inference
and testing in the context of regression analysis.
Point and interval estimation in the linear model. We now rewrite the model for
the purpose of introducing statistical inference to linear regression as
\[
\mathbf{y} = \beta_0 \mathbf{1} + \beta_1 \mathbf{x}_1 + \cdots + \beta_p \mathbf{x}_p + \mathbf{e} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e} \quad \text{with} \quad \mathbf{e} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}), \tag{11.24}
\]
where y is the n × 1 vector of outcomes and X is the n × ( p + 1) design matrix
(including a column of 1’s for the intercept). The identity matrix I consists of 1’s on
the diagonal and 0’s elsewhere, and the parameter vector is β = (β0 , β1 , . . . , β p ) .
We would like to estimate β to make conclusions about a relationship in the pop-
ulation. Similarly, e reflects the random errors in the population. Most importantly,
the linear model now contains assumptions about the errors. They are assumed to
be normally distributed, N(0, σ²I), which means that the expectation of the errors
is 0, E(ei) = 0, the variance is Var(ei) = σ² (and therefore the same for all ei), and
it follows from Var(e) = σ²I that Cov(ei, ei′) = 0 for all i ≠ i′. The assumption of a
normal distribution is required to construct confidence intervals and test statistics for
hypothesis tests. More details about these assumptions and the procedures to test their
validity on the basis of a given sample of data are explained in Sect. 11.9.
The least squares estimate of β is obtained by
\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}. \tag{11.25}
\]
It can be shown that β̂ follows a normal distribution with mean E(β̂) = β and covariance
matrix Var(β̂) = σ²(X⊤X)−1, i.e.
\[
\hat{\boldsymbol{\beta}} \sim N\big(\boldsymbol{\beta},\, \sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}\big). \tag{11.26}
\]
Note that β̂ is unbiased (since E(β̂) = β); more details about (11.26) can be found
in Appendix C.6. An unbiased estimator of σ 2 is
\[
\hat{\sigma}^2 = \frac{(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})^\top(\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}})}{n-(p+1)} = \frac{\hat{\mathbf{e}}^\top\hat{\mathbf{e}}}{n-(p+1)} = \frac{1}{n-(p+1)}\sum_{i=1}^{n}\hat{e}_i^2. \tag{11.27}
\]
The errors are estimated from the data as ê = y − Xβ̂ and are called residuals.
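A minimal sketch of (11.27) for the simple linear model of Example 11.2.1 (n = 12, p = 1; object names chosen for illustration):

m <- lm(Y ~ X)
n <- length(Y); p <- 1
sigma2.hat <- sum(residuals(m)^2) / (n - (p + 1))
sqrt(sigma2.hat)        # approximately 11.58, the residual standard error
summary(m)$sigma        # the same value as reported by summary()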
Before giving a detailed data example, we would like to outline how to draw
conclusions about the population of interest from the linear model. As we have seen,
both β and σ 2 are unknown in the model and are estimated from the data using (11.25)
and (11.27). These are our point estimates. We note that if β j = 0, then β j xj = 0,
and then, the model will not contain the term β j xj . This means that the covariate xj
does not contribute to explaining the variation in y. Testing the hypothesis β j = 0
is therefore equivalent to testing whether x j is associated with y or not in the sense
that it helps to explain the variations in y or not. To test whether the point estimate is
different from zero, or whether the deviations of estimates from zero can be explained
by random variation, we have the following options:
1. The 100(1 − α) % confidence interval for βj, [β̂j ± tn−p−1;1−α/2 · σ̂β̂j], can be evaluated;
if it does not cover zero, we reject the null hypothesis H0: βj = 0.
2. The test statistic T = (β̂j − 0)/σ̂β̂j follows a tn−p−1 distribution under H0. If |T| > tn−p−1;1−α/2, we can reject the
null hypothesis (at the α level of significance); otherwise, we accept it.
3. The decisions we get from the confidence interval and the T -statistic from points
1 and 2 are identical to those obtained by checking whether the p-value (see also
Sect. 10.2.6) is smaller than α, in which case we also reject the null hypothesis.
Example 11.7.1 Recall Examples 11.2.1, 11.3.2, 11.3.3, 11.4.1, 11.5.1, and 11.6.4
where we analysed the association of exercising time and time of recovery from knee
surgery. We can use R to obtain a full inductive summary of the linear model:
summary(lm(Y∼X))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 100.1244 10.3571 9.667 2.17e-06 ***
X -1.2153 0.2768 -4.391 0.00135 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
• Under “Estimate”, the parameter estimates are listed and we read that the linear
model is fitted as y (recovery time) = 100.1244 − 1.2153 · x (exercise time). The
subtle differences to Example 11.2.1 are due to rounding of numerical values.
• The variance is estimated as σ̂² = 11.58² (the "Residual standard error" reported is σ̂ = 11.58). This is
easily calculated manually from the residual sum of squares: Σ êi²/(n − p − 1) =
1/10 · {(90 − 70.84)² + · · · + (45 − 62.30)²} = 11.58².
• The standard errors σ̂β̂ are listed under "Std. Error". Given that n − p − 1 =
12 − 1 − 1 = 10, and therefore t10;0.975 ≈ 2.23, we can construct a 95 % confidence
interval for the slope of exercising time:
−1.22 ± 2.23 · 0.28 = [−1.84; −0.60].
The interval does not include 0, and we therefore conclude that there is an association
between exercising time and recovery time. The random error involved
in the data is not sufficient to explain the deviation of β̂ = −1.22 from 0 (given
α = 0.05).
• We therefore reject the null hypothesis that βj = 0. This can also be seen by comparing
the test statistic (listed under "t value" and obtained as (β̂j − 0)/σ̂β̂j = −1.22/0.277 ≈ −4.39)
with t10;0.975: |−4.39| > 2.23. Moreover, p = 0.00135 < α =
0.05. We can say that there is a significant association between exercising and
recovery time.
Sometimes, one is interested in whether a regression model is useful in the sense
that at least one of the βj's differs from zero, i.e. whether at least one covariate xj is
associated with y. The null hypothesis is
H0 : β1 = β2 = · · · = β p = 0
and can be tested by the overall F-test
\[
F = \frac{(\hat{\mathbf{y}} - \bar{y})^\top(\hat{\mathbf{y}} - \bar{y})/p}{(\mathbf{y} - \hat{\mathbf{y}})^\top(\mathbf{y} - \hat{\mathbf{y}})/(n - p - 1)} = \frac{n - p - 1}{p} \cdot \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i \hat{e}_i^2}.
\]
The null hypothesis is rejected if F > F1−α; p,n− p−1 . Note that the null hypothesis in
this case tests only the equality of slope parameters and does not include the intercept
term.
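The F-statistic can be reproduced from the fitted values. A minimal sketch for the simple linear model of Example 11.2.1 (p = 1; object names chosen for illustration):

m <- lm(Y ~ X)
y.hat <- fitted(m)
n <- length(Y); p <- 1
F.stat <- (sum((y.hat - mean(Y))^2) / p) / (sum(residuals(m)^2) / (n - p - 1))
F.stat
summary(m)$fstatistic   # F value with numerator and denominator degrees of freedom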
Example 11.7.2 In this chapter, we have already explored the associations between
branch, operator, and bill with delivery time for the pizza data (Appendix A.4). If we
fit a multiple linear model including all of the three variables, we obtain the following
results:
summary(lm(time∼branch+bill+operator))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.19138 0.78752 33.258 < 2e-16 ***
branchEast -3.03606 0.42330 -7.172 1.25e-12 ***
branchWest -0.50339 0.38907 -1.294 0.196
bill 0.21319 0.01535 13.885 < 2e-16 ***
operatorMelissa 0.15973 0.31784 0.503 0.615
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
By looking at the p-values in the last column, one can easily see (without calculat-
ing the confidence intervals or evaluating the t-statistic) that there is a significant asso-
ciation between the bill and delivery time; also, it is evident that the average delivery
time in the branch in the East is significantly different (≈ 3 min less) from the central
branch (which is the reference category here). However, the estimated difference in
delivery times for both the branches in the West and the operator was not found to be
significantly different from zero. We conclude that some variables in the model are
associated with delivery time, while for others, we could not show the existence of
such an association. The last line of the output confirms this: the overall F-test has a
test statistic of 97.87, which is larger than F1−α;p,n−p−1 = F0.95;4,1261 = 2.37; therefore,
the p-value is smaller than 0.05 (it is reported as < 2.2 × 10⁻¹⁶) and the null hypothesis is rejected. The
test suggests that there is at least one variable which is associated with delivery time.
1. The least squares estimator of β is unbiased, E(β̂) = β (see Appendix C.6 for
the proof).
2. The estimator σ̂ 2 as introduced in equation (11.27) is also unbiased, i.e. E(σ̂ 2 ) =
σ 2.
3. The least squares estimator of β is consistent, i.e. β̂ converges to β as n approaches
infinity.
4. The least squares estimator of β is asymptotically normally distributed.
5. The least squares estimator of β has the smallest variance among all linear and
unbiased estimators (best linear unbiased estimator) of β.
Theorem 11.7.1 For the linear model (11.24), the least squares estimator and the
maximum likelihood estimator for β are identical. However, the maximum likelihood
estimator of σ² is σ̂²_ML = (1/n) ê⊤ê, which is a biased estimator of σ², but it is
asymptotically unbiased.
How to obtain the maximum likelihood estimator for the linear model is presented in
Appendix C.6. The important message of Theorem 11.7.1 is that no matter whether
we apply the least squares principle or the maximum likelihood principle, we always
obtain β̂ = (X⊤X)−1 X⊤y; this does not apply when estimating the variance, but it is
an obvious choice to go for the unbiased estimator (11.27) in any given analysis.
We see that the average delivery time of the branches in the East and the Centre
(reference category) is different and that the average delivery time of the branches in
the West and the Centre is different (because p < 0.05). This is useful information,
but it does not answer the question if branch as a whole influences the delivery time.
It seems that this is the case, but the hypothesis we may have in mind may be
H0 : μEast = μWest = μCentre
which corresponds to
H0 : βEast = βWest = βCentre
in the context of the regression model. These are two identical hypotheses because
in the regression set-up, we are essentially comparing three conditional means
E(Y |X = x1 ) = E(Y |X = x2 ) = E(Y |X = x3 ). The ANOVA table summarizes the
corresponding F-Test which tests this hypothesis:
m1 <- lm(time~branch)
anova(m1)
Response: time
Df Sum Sq Mean Sq F value Pr(>F)
branch 2 6334 3166.8 86.05 < 2.2e-16 ***
Residuals 1263 46481 36.8
We see that the null hypothesis of 3 equal means is rejected because p is close to
zero.
What does this table mean more generally? If we deal with linear regression with
one (possibly categorical) covariate, the table has the following structure:

Source of variation   df          Sum of squares    Mean squares                   F-statistic
Variable              p           SQ_Regression     MSR = SQ_Regression/p          MSR/MSE
Residuals             n − p − 1   SQ_Error          MSE = SQ_Error/(n − p − 1)
Total                 n − 1       SQ_Total
The table summarizes the sum of squares regression and residuals (see Sect. 11.3 for
the definition), standardizes them by using the appropriate degrees of freedom (df),
and uses the corresponding ratio as the F-statistic. Note that in the above example,
this corresponds to the overall F-test introduced earlier. The overall F-test tests the
hypothesis that any β j is different from zero which is identical to the hypothesis
above. Thus, if we fit a linear regression model with one variable, the ANOVA table
will yield the same conclusions as the overall F-test which we obtain through the
main summary. However, if we consider a multiple linear regression model, the
ANOVA table may give us more information.
Example 11.7.4 Suppose we are not only interested in the branch, but also in how
the pizza delivery times are affected by operator and driver. We may for example
hypothesize that the delivery time is identical for all drivers given they deliver for
the same branch and speak to the same operator. In a standard regression output, we
would get 4 coefficients for 5 drivers which would help us to compare the average
delivery time of each driver to the reference driver; it would however not tell us if
on an average, they all have the same delivery time. The overall F-test would not
help us either because it would test if any β j is different from zero which includes
coefficients from branch and operator, not only driver. Using the anova command
yields the results of the F-test of interest:
m2 <- lm(time~branch+operator+driver)
anova(m2)
Response: time
Df Sum Sq Mean Sq F value Pr(>F)
branch 2 6334 3166.8 88.6374 < 2.2e-16 ***
operator 1 16 16.3 0.4566 0.4994
driver 4 1519 379.7 10.6282 1.798e-08 ***
Residuals 1258 44946 35.7
We see that the null hypothesis of equal delivery times of the drivers is rejected.
We can also test other hypotheses with this output: for instance, the null hypothesis
of equal delivery times for each operator is not rejected because p ≈ 0.5.
11.7.3 Interactions
It may be possible that the joint effect of some covariates affects the response.
For example, drug concentrations may have a different effect on men, women, and
children; a fertilizer could work differently in different geographical areas; or a new
education programme may show benefit only with certain teachers. It is fairly simple
to target such questions in linear regression analysis by using interactions. Interac-
tions are measured as the product of two (or more) variables. If either one or both
variables are categorical, then one simply builds products for each dummy variable,
thus creating (k − 1) × (l − 1) new variables when dealing with two categorical
variables (with k and l categories respectively). These product terms are called inter-
actions and estimate how an association of one variable differs with respect to the
values of the other variable. Now, we give examples for continuous–categorical,
categorical–categorical, and continuous–continuous interactions for two variables
x1 and x2 .
Categorical–Continuous Interaction. Suppose one variable x1 is categorical with
k categories, and the other variable x2 is continuous. Then, k − 1 new variables
have to be created, each consisting of the product of the continuous variable and a
dummy variable, x2 × x1i , i ∈ 1, . . . , k − 1. These variables are added to the regres-
sion model in addition to the main effects relating to x1 and x2 as follows:
\[
y = \beta_0 + \beta_1 x_{11} + \cdots + \beta_{k-1} x_{1,k-1} + \beta_k x_2 + \beta_{k+1} x_{11} x_2 + \cdots + \beta_p x_{1,k-1} x_2 + e.
\]
It follows that for the reference category of x1, the effect of x2 is just βk (because
each x2x1i is zero since each x1i is zero). However, the effect of x2 for category j
is βk + βj, where βj refers to the coefficient of the interaction term x1j x2. Therefore, the association between x2 and the
outcome y differs by βj between category j and the reference category. Testing
H0: βj = 0 thus helps to identify whether there is an interaction effect with respect
to category j or not.
Example 11.7.5 Consider again the pizza data described in Appendix A.4. We may
be interested in whether the association of delivery time and temperature varies
with respect to branch. In addition to time and branch, we therefore need additional
interaction variables. Since there are 3 branches, we need 3 − 1 = 2 interaction
variables which are essentially the product of (i) time and branch “East” and (ii)
time and branch "West". This can be achieved in R by using either the "*" operator
(which will create both the main and interaction effects) or the ":" operator (which
only creates the interaction term).
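The R call that produces the output below is not shown in this excerpt; presumably it has the following form (assuming the variables temperature, time, and branch of the pizza data are available in the workspace):

lm(temperature ~ time * branch)   # main effects of time and branch plus their interactions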
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 70.718327 1.850918 38.207 < 2e-16 ***
time -0.288011 0.050342 -5.721 1.32e-08 ***
branchEast 10.941411 2.320682 4.715 2.69e-06 ***
branchWest 1.102597 2.566087 0.430 0.66750
time:branchEast -0.195885 0.066897 -2.928 0.00347 **
time:branchWest 0.004352 0.070844 0.061 0.95103
The main effects of the model tell us that the temperature is almost 11 degrees
higher for the eastern branch compared to the central branch (reference) and about
1 degree higher for the western branch. However, only the former difference is
significantly different from 0 (since the p-value is smaller than α = 0.05). Moreover,
the longer the delivery time, the lower the temperature (0.29 degrees for each minute).
The parameter estimates related to the interaction imply that this association differs
by branch: the estimate is indeed β̂time = −0.29 for the reference branch in the
Centre, but the estimate for the branch in the East is −0.29 − 0.196 = −0.486 and the
estimate for the branch in the West is −0.29 + 0.004 = −0.286. However, the latter
difference in the association of time and temperature is not significantly different
from zero. We therefore conclude that the delivery time and pizza temperature are
negatively associated and this is more strongly pronounced in the eastern branch
compared to the other two branches. It is also possible to visualize this by means of
a separate regression line for each branch, see Fig. 11.7.
[Fig. 11.7: Temperature plotted against delivery time (in minutes) for Example 11.7.5, with separate fitted regression lines for the branches.]
One can see that the pizzas delivered by the branch in the East are overall hotter,
but longer delivery times level this benefit off. One might speculate that the delivery
boxes from the eastern branch are not properly closed and therefore, despite the
overall good performance, the temperature falls more rapidly over time for this
branch.
Example 11.7.6 Consider again the pizza data. If we have the hypothesis that the
delivery time depends on the operator (who receives the phone calls), but the effect is
different for different branches, then a regression model with branch (3 categories, 2
dummy variables), operator (2 categories, one dummy variable), and their interaction
(2 new variables) can be used.
lm(time~operator*branch)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.4203 0.4159 87.567 <2e-16 ***
operatorMelissa -0.2178 0.5917 -0.368 0.7129
branchEast -5.6685 0.5910 -9.591 <2e-16 ***
branchWest -1.3599 0.5861 -2.320 0.0205 *
operatorMelissa:branchEast 0.8599 0.8425 1.021 0.3076
operatorMelissa:branchWest 0.4842 0.8300 0.583 0.5598
• If we are interested in the operator, we see that the delivery time is on aver-
age 0.21 min shorter for operator “Melissa”. When this operator deals with a
branch other than the reference (Centre), the estimate changes to −0.21 + 0.86 =
0.64 min longer delivery in the case of branch “East” and −0.21 + 0.48 =
0.26 min for branch “West”.
• If we are interested in the branches, we observe that the delivery time is shortest
for the eastern branch which has on average a 5.66 min shorter delivery time
than the central branch. However, this is the estimate for the reference operator
only; if operator “Melissa” is in charge, then the difference in delivery times
for the two branches is only −5.66 + 0.86 = −4.8 min. The same applies when
comparing the western branch with the central branch: depending on the operator,
the difference between these two branches is estimated as either −1.36 or −1.36 +
0.48 = −0.88 min, respectively.
• The interaction terms are not significantly different from zero. We therefore con-
clude that the hypothesis of different delivery times for the two operators, possibly
varying by branch, could not be confirmed.
Example 11.7.7 If we again consider the pizza data, with pizza temperature as an
outcome, we may wish to test for an interaction of bill and time.
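Again, the model call is not shown in this excerpt; presumably it has the following form (with temperature, bill, and time from the pizza data):

lm(temperature ~ bill * time)   # main effects of bill and time plus their interaction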
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.555943 2.747414 33.688 < 2e-16 ***
bill -0.454381 0.068322 -6.651 4.34e-11 ***
time -0.679537 0.086081 -7.894 6.31e-15 ***
bill:time 0.008687 0.002023 4.294 1.89e-05 ***
The R output above reveals that there is a significant interaction effect. The inter-
pretation is more difficult here. It is clear that a longer delivery time and a larger bill
decrease the pizza’s temperature. However, for a large product of bill and time (i.e.,
when both are large), these associations are less pronounced because the negative
coefficients become more and more outweighed by the positive interaction term. On
the contrary, for a small bill and short delivery time, the temperature can be expected
to be quite high.
Combining Estimates. As we have seen in the examples in this section, it can
make sense to combine regression estimates when interpreting interaction effects.
While it is simple to obtain the point estimates, it is certainly also important to report
their confidence intervals, for example β̂i + β̂j ± z1−α/2 · σ̂(β̂i+β̂j),
where σ̂(β̂i+β̂j) is obtained from the estimated covariance matrix Ĉov(β̂) via
\[
\hat{\sigma}_{(\hat{\beta}_i + \hat{\beta}_j)} = \sqrt{\widehat{\operatorname{Var}}(\hat{\beta}_i) + \widehat{\operatorname{Var}}(\hat{\beta}_j) + 2\,\widehat{\operatorname{Cov}}(\hat{\beta}_i, \hat{\beta}_j)}\,.
\]
Example 11.7.8 Recall Example 11.7.5 where we estimated the association between
pizza temperature, delivery time, and branch. There was a significant interaction
effect for time and branch. Using R, we obtain the covariance matrix as follows:
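One way to obtain this matrix is the vcov() function, sketched here under the assumption that the interaction model of Example 11.7.5 has been stored in an object, say m.int (name chosen for illustration):

m.int <- lm(temperature ~ time * branch)
vcov(m.int)   # estimated covariance matrix of the coefficient estimates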
The point estimate for the association between time and temperature in the eastern
branch is −0.29 − 0.19 ≈ −0.48 (see Example 11.7.5). The standard error is calculated as
√(0.00253 + 0.00447 − 2 · 0.00253) = 0.044. The confidence interval is therefore
−0.48 ± 1.96 · 0.044 = [−0.56; −0.39].
Remark 11.7.1 If there is more than one interaction term, then it is generally possible
to test whether there is an overall interaction effect, such as β1 = β2 = β3 = β4 = 0
in the case of four interaction variables. These tests belong to the general class of
“linear hypotheses”, and they are not explained in this book. It is also possible to
create interactions between more than two variables. However, the interpretation
then becomes difficult.
There are many situations where different multiple linear models can be fitted to a
given data set, but it is unclear which is the best one. This relates to the question of
which variables should be included in a model and which ones can be removed.
For example, when we modelled the association of recovery time and exercising
time by using different transformations, it remained unclear which of the proposed
transformations was the best. In the pizza delivery example, we have about ten
covariates: Does it make sense to include all of them in the model or do we understand
the associations in the data better by restricting ourselves to the few most important
variables? Is it necessary to include interactions or not?
There are various model selection criteria to compare the quality of different
fitted models. They can be useful in many situations, but there are also a couple of
limitations which make model selection a difficult and challenging task.
Example 11.8.1 In Fig. 11.6, the association of exercising time and recovery time
after knee surgery is modelled linearly, quadratically, and cubically; in Fig. 11.5,
this association is modelled by means of a square-root transformation. The model
summary in R returns both R 2 (under “Multiple R-squared”) and the adjusted R 2
(under “adjusted R-squared”). The results are as follows:
Model          R²       Adjusted R²
Linear         0.6584   0.6243
Quadratic      0.6787   0.6074
Cubic          0.7151   0.6083
Square root    0.6694   0.6363
It can be seen that R² is larger for the models with more variables; i.e. the cubic
model (which includes three variables) has the largest R². The adjusted R², which
takes the different model sizes into account, favours the model with the square-root
transformation. Among the four models considered, this model therefore provides the
best fit to the data according to the adjusted R².
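Both measures can be read off the R model summaries. A sketch for two of the four models (the other two follow analogously), assuming the vectors X and Y of Example 11.2.1:

m.lin <- lm(Y ~ X)
m.sqrt <- lm(Y ~ I(sqrt(X)))
summary(m.lin)$r.squared;  summary(m.lin)$adj.r.squared
summary(m.sqrt)$r.squared; summary(m.sqrt)$adj.r.squared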
Akaike’s Information Criterion (AIC) is another criterion for model selection. The
AIC is based on likelihood methodology and has its origins in information theory.
We do not give a detailed motivation of this criterion here. It can be written for the
linear model as
\[
\mathrm{AIC} = n \log\left(\frac{SQ_{\text{Error}}}{n}\right) + 2(p + 1). \tag{11.31}
\]
The smaller the AIC, the better the model. The AIC takes not only the fit to the data
via SQ_Error into account but also the parsimony of the model via the term 2(p + 1).
It is in this sense a more mature criterion than R²_adj, which considers only the fit
to the data via SQ_Error. Akaike's Information Criterion can be calculated in R via
the extractAIC() command. There are also other commands, but the results differ
slightly because the different formulae use different constant terms. However, no
matter what formula is used, the results regarding the model choice are always the
same.
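A sketch of how the four models of Example 11.8.1 could be compared via extractAIC(), again assuming the vectors X and Y of Example 11.2.1:

extractAIC(lm(Y ~ X))                     # linear
extractAIC(lm(Y ~ X + I(X^2)))            # quadratic
extractAIC(lm(Y ~ X + I(X^2) + I(X^3)))   # cubic
extractAIC(lm(Y ~ I(sqrt(X))))            # square root
# each call returns the equivalent degrees of freedom and the AIC;
# the model with the smallest AIC is preferred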
Backward selection. Two models, which differ only by one variable x j , can be
compared by simply looking at the test result for β j = 0: if the null hypothesis is
rejected, the variable is kept in the model; otherwise, the other model is chosen. If
there are more than two models, then it is better to consider a systematic approach to
comparing them. For example, suppose we have 10 potentially relevant variables and
we are not sure which of them to include in the final model. There are 210 = 1024
possible different combinations of variables and in turn so many choices of models!
The inclusion or deletion of variables can be done systematically with various proce-
dures, for example with backward selection (also known as backward elimination)
as follows:
1. Start with the full model which contains all variables of interest, Υ = {x1 , x2 ,
. . . , xp }.
2. Remove the variable xi ∈ Υ whose removal optimizes a criterion among all possible
choices, i.e. leads to the smallest AIC or the highest R²_adj (equivalently, remove the
variable with the smallest test statistic or the largest p-value).
3. Repeat step 2 until a stop criterion is fulfilled, i.e. until no further improvement regarding
AIC, R²_adj, or the p-value can be achieved.
There are several other approaches. Instead of moving in the backward direction,
we can also move in the forward direction. Forward selection starts with a model
with no covariates and adds variables as long as the model continues to improve with
respect to a particular criterion. The stepwise selection approach combines forward
selection and backward selection: either one does backward selection and checks in
between whether adding variables yields improvement, or one does forward selection
and continuously checks whether removing the already added variables improves the
model.
Example 11.8.3 Consider the pizza data: if delivery time is the outcome, and branch,
bill, operator, driver, temperature, and number of ordered pizzas are potential covari-
ates, we may decide to include only the relevant variables in a final model. The
stepAIC function of the MASS package (loaded via library(MASS)) allows us to
implement backward selection in R.
library(MASS)
ms <- lm(time~branch+bill+operator+driver+temperature+pizzas)
stepAIC(ms, direction='back')
At the first step, the R output states that the AIC of the full model is 4277.56.
Then, the AIC values are displayed when variables are removed: 4275.9 if operator
is removed, 4279.2 if driver is removed, and so on. One can see that the AIC is
minimized if operator is excluded.
Start: AIC=4277.56
time ~ branch+bill+operator+driver+temperature+pizzas
Now R fits the model without operator. The AIC is 4275.88. Excluding further
variables does not improve the model with respect to the AIC.
Step: AIC=4275.88
time ~ branch + bill + driver + temperature + pizzas
We therefore conclude that the final “best” model includes all variables considered,
except operator. Using stepwise selection (with option both instead of back) yields
the same results. We could now interpret the summary of the chosen model.
• If categorical variables are included in the model, then all respective dummy vari-
ables need to be considered as a whole: all dummy variables of this variable should
either be removed or kept in the model. For example, a variable with 3 categories
is represented by two dummy variables in a regression model. In the process of
model selection, both of these variables should either be kept in the model or
removed, but not just one of them.
• A similar consideration refers to interactions. If an interaction is added to the
model, then the main effects should be kept in the model as well. This enables
straightforward interpretations as outlined in Sect. 11.7.3.
• The same applies to polynomials. If a cubic transformation of xi is added, the squared
transformation as well as the linear effect should be kept in the model.
• When applying backward, forward, or stepwise regression techniques, the results
may vary considerably depending on the method chosen! This may give different
models for the same data set. Depending on the strategy used to add and remove
variables, the choice of the criterion (e.g. AIC versus p-values), and the detailed
set-up (e.g. if the strategy is to add/remove a variable if the p-value is smaller than
α, then the choice of α is important), one may obtain different results.
• All the above considerations show that model selection is a non-trivial task which
should always be complemented by substantial knowledge of the subject matter.
If there are not too many variables to choose from, simply reporting the full model
can be meaningful too.
The assumptions made for the linear model mostly relate to the error terms: e ∼
N(0, σ²I). This implies a couple of things: (i) the difference between y and Xβ
(which is e) needs to follow a normal distribution, (ii) the variance of each error ei
is constant, σ², and therefore does not depend on i, and (iii) there is no correlation
between the error terms, Cov(ei, ei′) = 0 for all i ≠ i′. We also assume that Xβ is
adequate in the sense that we included all relevant variables and interactions in the
model, and that the estimator β̂ is stable and not unduly affected by influential observations
or highly correlated variables. Now, we demonstrate how to check assumptions (i)
and (ii).
[Figure: two QQ-plots of standardized residuals plotted against theoretical quantiles, used to check the normality assumption.]
Small deviations from the normality assumption are not a major problem because the least squares estimator of β remains unbiased. However, confidence intervals and hypothesis tests rely on this assumption, and particularly for small sample sizes and/or stronger violations of the normality assumption, these intervals and test conclusions may no longer be valid. In some cases, these problems can be fixed by transforming y.
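As a hedged sketch (using the variable names of the pizza example; whether such a transformation is appropriate depends on the data at hand), one could, for instance, refit the model with a log-transformed outcome and inspect the diagnostics again:
# Sketch: refit the model with a log-transformed outcome and re-check
# the QQ-plot of the standardized residuals
fm.log <- lm(log(time) ~ branch + bill + driver + temperature + pizzas)
plot(fm.log, which=2)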
Heteroscedasticity. If errors have a constant variance, we say that the errors are
homoscedastic and we call this phenomenon homoscedasticity. When the variance
of errors is not constant, we say that the errors are heteroscedastic. This phenom-
enon is also known as heteroscedasticity. If the variance σ 2 depends on i, then
the variability of the ei will be different for different groups of observations. For
example, the daily expenditure on food (y) may vary more among persons with a
high income (x1 ), so fitting a linear model yields stronger variability of the ei among
higher income groups. Plotting the fitted values ŷi (or alternatively xi ) against the
standardized residuals (or a transformation thereof) can help to detect whether or not
problems exist. If there is no pattern in the plot (random plot), then there is likely no
violation of the assumption (see Fig. 11.10a). However, if there is a systematic trend,
i.e. higher/lower variability for higher/lower ŷi , then this indicates heteroscedasticity
(Fig. 11.10b, trumpet plot). The consequences are again that confidence intervals and
tests may no longer be correct; again, in some situations, a transformation of y can
help.
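For a fitted linear model object (here called fm, as in the example that follows), such a diagnostic plot can be produced directly; a minimal sketch:
# Sketch: plot fitted values against standardized residuals; a systematic
# pattern (e.g. a trumpet shape) suggests heteroscedasticity
plot(fitted(fm), rstandard(fm),
     xlab='Fitted values', ylab='Standardized residuals')
abline(h=0, lty=2)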
Example 11.9.1 Recall Example 11.8.3 where we identified a good model to describe
the delivery time for the pizza data. Branch, bill, temperature, number of pizzas
ordered, and driver were found to be associated with delivery time. To explore
whether the normality assumption is fulfilled for this model, we can create a his-
togram of the standardized residuals (using the R function rstandard()). A QQ-plot is among the model diagnostic plots produced by the plot() function.
fm <- lm(time ~ branch + bill + driver + temperature + pizzas)
hist(rstandard(fm))
plot(fm, which=2)
[Figure: checking the normality assumption for the fitted model lm(time ~ branch + bill + driver + temperature + pizzas): (a) via the histogram of the standardized residuals, (b) via a QQ-plot of the standardized residuals]
The QQ-plot suggests that the normality assumption is justified because the observed quantiles lie approximately on the bisecting line. It seems that the lowest residuals deviate somewhat from the expected normal distribution, but not extremely so. The normality assumption does not seem to be violated.
Plotting the fitted values ŷi against the square root of the absolute values of the standardized residuals (i.e. √|êi∗|, used by R for stability reasons) yields a plot with no visible structure (see Fig. 11.10a). There is no indication of heteroscedasticity.
[Fig. 11.10: fitted values plotted against the square root of the absolute standardized residuals: (a) no visible pattern, (b) systematic (trumpet-shaped) pattern indicating heteroscedasticity]
plot(fm, which=3)
There is a difference between association and causation: association says that higher
values of X relate to higher values of Y (or vice versa), whereas causation says that
because values of X are higher, values of Y are higher too. Regression analysis, as
introduced in this chapter, establishes associations between the Xi's and Y, not causation,
unless strong assumptions are made.
For example, recall the pizza delivery example from Appendix A.4. We have seen
in several examples and exercises that the delivery time varies by driver. In fact, in a
multiple linear regression model where Y is the delivery time, and the covariates are
branch, bill, operator, driver, and temperature, we get significant differences of the
mean delivery times of some drivers (e.g. Domenico and Bruno). Does this mean that
because Domenico is the driver, the pizzas are delivered faster? Not necessarily. If
Domenico drives the fastest scooter, then this may be the cause of his speedy deliv-
eries. However, we have not measured the variable “type of scooter” and thus cannot
take it into account. The result we obtain from the regression model is still valuable because it provides the manager with hypotheses and predictions about his delivery times, but the interpretation that the driver's abilities cause shorter delivery times may not necessarily be appropriate.
This example highlights that one of the most important assumptions to interpret
regression parameters in a causal way is to have measured (and used) all variables
Xi which affect both the outcome Y and the variable of interest A (e.g. the variable “driver” above). Another assumption is that the relationship between all the Xi's and Y is modelled correctly, for example using non-linear terms where appropriate. Moreover, we need
some a priori assumptions about how the variables relate to each other, and some
technical assumptions need to be met as well. The interested reader is referred to
Hernan and Robins (2017) for more details.
11.11 Key Points and Further Issues
11.12 Exercises
Exercise 11.1 The body mass index (BMI) and the systolic blood pressure of 6
people were measured to study a cardiovascular disease. The data are as follows:
(a) The research hypothesis is that a high BMI relates to a high blood pressure.
Estimate the linear model where blood pressure is the outcome and BMI is the
covariate. Interpret the coefficients.
(b) Calculate R 2 to judge the goodness of fit of the model.
Exercise 11.2 A psychologist speculates that spending a lot of time on the internet
has a negative effect on children’s sleep. Consider the following data on hours of deep
sleep (Y ) and hours spent on the internet (X ) where xi and yi are the observations
on internet time and deep sleep time of the ith (i = 1, 2, . . . , 9) child respectively:
Child i 1 2 3 4 5 6 7 8 9
Internet time xi (in h) 0.3 2.2 0.5 0.7 1.0 1.8 3.0 0.2 2.3
Sleep time yi (in h) 5.8 4.4 6.5 5.8 5.6 5.0 4.8 6.0 6.1
(a) Estimate the linear regression model for the given data and interpret the coeffi-
cients.
(b) Calculate R 2 to judge the goodness of fit of the model.
(c) Reproduce the results of a) and b) in R and plot the regression line.
(d) Now assume that we only distinguish between spending more than 1 hour on the
internet (X = 1) and spending less than (or equal to) one hour on the internet
(X = 0). Estimate the linear model again and compare the results. How can β̂
now be interpreted? Describe how β̂ changes if those who spend more than one
hour on the internet are coded as 0 and the others as 1.
Exercise 11.3 Consider the following data on weight and height of 17 female stu-
dents:
Student i 1 2 3 4 5 6 7 8 9
Weight yi 68 58 53 60 59 60 55 62 58
Height xi 174 164 164 165 170 168 167 166 160
Student i 10 11 12 13 14 15 16 17
Weight yi 53 53 50 64 77 60 63 69
Height xi 160 163 157 168 179 170 168 170
(a) Calculate the correlation coefficient of Bravais–Pearson (use ∑ xi yi = 170,821, x̄ = 166.65, ȳ = 60.12, ∑ yi² = 62,184, ∑ xi² = 472,569, where all sums run over i = 1, . . . , n = 17).
What does this imply with respect to a linear regression of height on weight?
(b) Now estimate and interpret the linear regression model where “weight” is the
outcome.
(c) Predict the weight of a student with a height of 175 cm.
(d) Now produce a scatter plot of the data (manually or by using R) and interpret it.
(e) Add the following two points to the scatter plot (x18 , y18 ) = (175, 55) and
(x19 , y19 ) = (150, 75). Speculate how the linear regression estimate will change
after adding these points.
(f) Re-estimate the model using all 19 observations, ∑ xi yi = 191,696 and ∑ xi² = 525,694.
(g) Given the results of the two regression models: What are the general implications
with respect to the least squares estimator of β?
Exercise 11.4 To study the association of the monthly average temperature (in ◦ C,
X ) and hotel occupation (in %, Y ), we consider data from three cities: Polenca
(Mallorca, Spain) as a summer holiday destination, Davos (Switzerland) as a winter
skiing destination, and Basel (Switzerland) as a business destination.
(a) Interpret the following regression model output where the outcome is “hotel
occupation” and “temperature” is the covariate.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.33459 7.81792 6.438 2.34e-07 ***
X 0.07717 0.51966 0.149 0.883
(b) Interpret the following output where “city” is treated as a covariate and “hotel
occupation” is the outcome.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.3333 7.9457 6.083 7.56e-07 ***
cityDavos 7.9167 11.2369 0.705 0.486
cityPolenca 0.9167 11.2369 0.082 0.935
(c) Interpret the following output and compare it with the output from b):
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
city 2 450.1 225.03 0.297 0.745
Residuals 33 25001.2 757.61
(d) In the following multiple linear regression model, both “city” and “tempera-
ture” are treated as covariates. How can the coefficients be interpreted?
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.1731 10.9949 4.018 0.000333 ***
X 0.3467 0.6258 0.554 0.583453
cityDavos 9.7946 11.8520 0.826 0.414692
cityPolenca -1.1924 11.9780 -0.100 0.921326
(e) Now consider the regression model for hotel occupation and temperature fitted
separately for each city: How can the results be interpreted and what are the
implications with respect to the models estimated in (a)–(d)? How can the models
be improved?
Davos:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 73.9397 4.9462 14.949 3.61e-08 ***
X -2.6870 0.4806 -5.591 0.000231 ***
Polenca:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -22.6469 16.7849 -1.349 0.20701
X 3.9759 0.8831 4.502 0.00114 **
Basel:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.574 13.245 2.459 0.0337 *
X 1.313 0.910 1.443 0.1796
(f) Describe what the design matrix will look like if city, temperature, and the
interaction between them are included in a regression model.
(g) If the model described in (f) is fitted the output is as follows:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 32.5741 10.0657 3.236 0.002950 **
X 1.3133 0.6916 1.899 0.067230 .
cityDavos 41.3656 12.4993 3.309 0.002439 **
cityPolenca -55.2210 21.0616 -2.622 0.013603 *
X:cityDavos -4.0003 0.9984 -4.007 0.000375 ***
X:cityPolenca 2.6626 1.1941 2.230 0.033388 *
Interpret the results.
Exercise 11.5 The theatre data (see Appendix A.4) describes the monthly expen-
diture on theatre visits of 699 residents of a Swiss city. It captures not only the
expenditure on theatre visits (in SFR) but also age, gender, yearly income (in 1000
SFR), and expenditure on cultural activities in general as well as expenditure on
theatre visits in the preceding year.
(a) The summary of the multiple linear model where expenditure on theatre visits
is the outcome is as follows:
[Figs. 11.12a, b: diagnostic plots of the standardized residuals and of the absolute standardized residuals for the fitted model]
(e) Judge the quality of the model from d) by means of Figs. 11.12a and 11.12b.
What do they look like compared with those from b)?
Exercise 11.6 Consider the pizza delivery data described in Appendix A.4.
(a) Read the data into R. Fit a multiple linear regression model with delivery time
as the outcome and temperature, branch, day, operator, driver, bill, number of
ordered pizzas, and discount customer as covariates. Give a summary of the
coefficients.
(b) Use R to calculate the 95 % confidence intervals of all coefficients. Hint: the
standard errors of the coefficients can be accessed either via the covariance
matrix or the model summary.
(c) Reproduce the least squares estimate of σ 2 . Hint: use residuals() to obtain
the residuals.
(d) Now use R to estimate both R² and R²adj. Compare the results with the model output from a).
(e) Use backward selection by means of the stepAIC function from the library MASS
to find the best model according to AIC.
(f) Obtain R²adj from the model identified in e) and compare it to the full model from a).
(g) Identify whether the model assumptions are satisfied or not.
(h) Can all the variables in the model from (e) be interpreted as causing the delivery time to be shorter or longer?
(i) Test whether it is useful to add a quadratic polynomial of temperature to the
model.
(j) Use the model identified in (e) to predict the delivery time of the last captured
delivery (i.e. number 1266). Use the predict() command to ease the calculation
of the prediction.