HW4 Solutions: Problem 6.2
Problem 6.2.
library(faraway)
data(teengamb)
(a). We use a residuals vs. fitted plot to examine the constant variance assumption:
mod = lm(gamble ~ ., data = teengamb)
mod.log = lm(log(1 + gamble) ~ ., data = teengamb)
par(mfrow = c(1, 2))
plot(fitted(mod), residuals(mod), xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0)
plot(fitted(mod.log), residuals(mod.log), xlab = "Fitted Values", ylab = "Residuals",
     main = "log-transformed model")
abline(h = 0)
[Figure: residuals vs. fitted values for the original model (left) and the log-transformed model (right).]
The first plot suggests that the variance increases with the fitted values. A log-transform of the response would be natural, but because some of the responses are zero, we instead use y → log(1 + y), which fixes the problem.
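As a rough numerical check (a sketch, not required by the problem), we can regress the absolute residuals on the fitted values; a clearly positive slope supports the impression of increasing variance:
# Slope > 0 with a small p-value suggests variance growing with the mean.
summary(lm(abs(residuals(mod)) ~ fitted(mod)))$coefficients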
(b). To check normality, we use a qqplot of the residuals.
par(mfrow = c(1,3))
res = residuals(mod)
qqnorm(res)
qqline(res)
res.log = residuals(mod.log)
qqnorm(res.log, ylab = "sample quantiles for res.log")
qqline(res.log)
normal.sample = rnorm(length(res), mean = 0, sd = sd(res))
qqnorm(normal.sample, main = "normal sample quantiles")
qqline(normal.sample)
[Figure: normal Q-Q plots for the residuals of the original model (left), the residuals of the log-transformed model (middle), and a simulated normal sample (right).]
From the left plot, the residuals look slightly right-skewed, with a long tail possibly due to outliers. The middle plot shows the residuals of the log-transformed model. The last plot shows a sample of normal random variables with mean zero and standard deviation equal to that of the residuals. We use this plot as a reference for how much variation to expect in a Q-Q plot, and indeed, comparing the middle and last plots, the assumption of normality of the residuals for the log-transformed model seems reasonable.
As the last plot shows, we expect some deviation from the straight Q-Q line even for normal samples. A more principled way of deciding whether these deviations are due to chance (and thus to be expected) or to a systematic difference in distributions is a hypothesis test. Here we use the Shapiro-Wilk test of normality:
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.86839, p-value = 8.16e-05
shapiro.test(res.log)
##
## Shapiro-Wilk normality test
##
## data: res.log
## W = 0.97609, p-value = 0.4418
As can be seen, normality is rejected for the original residuals (at α = 0.05), but we fail to reject normality
for the residuals of the log-transformed model.
(c). We check for observations with high leverage:
hatval = hatvalues(mod)
hatval[which(hatval>10/47)]
## 31 33 35 42
## 0.2395031 0.2213439 0.3118029 0.3016088
These are the only observations with leverage greater than 2(p + 1)/n = 10/47.
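Leverages can also be inspected visually; a quick sketch using faraway's halfnorm:
# Half-normal plot of the leverages; the largest four points are labeled.
halfnorm(hatval, nlab = 4, ylab = "leverages")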
(d). We look at externally studentized residuals and adjust the level of the test for multiplicity: by Bonferroni's correction, the new level is α_multiple = α / (# of tests), because we are conducting multiple tests at the same time. (We also divide α by two because the test is two-sided.)
stud.r = rstudent(mod)
stud.r[which(abs(stud.r) > abs(qt(0.05/(47*2),47-1-4-1)))]
## 24
## 6.016116
Observation #24 has a large studentized residual, and so there’s evidence that it is an outlier.
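Equivalently (a sketch), we can compare Bonferroni-adjusted two-sided p-values with α = 0.05; here the two-sidedness is already built into the p-value:
# Two-sided p-values from the t distribution with 47 - 1 - 4 - 1 = 41 df.
p.values = 2 * (1 - pt(abs(stud.r), df = 47 - 1 - 4 - 1))
which(p.values < 0.05/47)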
(e). We look at Cook's distances to detect influential points:
cook = cooks.distance(mod)
cook[which(cook > 4/(47-4-1))]
## 24 39
## 0.5565011 0.1124498
Observations #24 and #39 have large Cook's distances and deserve closer inspection, because a large Cook's distance typically signals high influence on the fit. We compare the fit of the model before and after removing these observations:
newmod = lm(gamble ~ ., data = teengamb, subset = (cook < cook[39]))
summary(mod)
##
## Call:
## lm(formula = gamble ~ ., data = teengamb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.082 -11.320 -1.451 9.452 94.252
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.55565 17.19680 1.312 0.1968
## sex -22.11833 8.21111 -2.694 0.0101 *
## status 0.05223 0.28111 0.186 0.8535
## income 4.96198 1.02539 4.839 1.79e-05 ***
## verbal -2.95949 2.17215 -1.362 0.1803
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.69 on 42 degrees of freedom
## Multiple R-squared: 0.5267, Adjusted R-squared: 0.4816
## F-statistic: 11.69 on 4 and 42 DF, p-value: 1.815e-06
summary(newmod)
##
## Call:
## lm(formula = gamble ~ ., data = teengamb, subset = (cook < cook[39]))
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.505 -7.782 -0.760 8.742 45.033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6772 11.8650 0.563 0.57673
## sex -16.9242 5.6323 -3.005 0.00457 **
## status 0.2314 0.1922 1.204 0.23568
## income 4.8395 0.7216 6.707 4.81e-08 ***
## verbal -2.2001 1.4870 -1.480 0.14681
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.36 on 40 degrees of freedom
## Multiple R-squared: 0.6438, Adjusted R-squared: 0.6082
## F-statistic: 18.07 on 4 and 40 DF, p-value: 1.501e-08
We see a relative change of about 24% in the coefficient for sex and about 343% in the coefficient for status, among other changes.
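These relative changes can be computed directly:
# Percentage change in each coefficient after dropping the two cases.
round(100 * (coef(newmod) - coef(mod)) / coef(mod), 1)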
(f). We look at some partial residual plots.
par(mfrow= c(1,3))
termplot(mod, partial.resid = TRUE, terms = c(2,3,4))
[Figure: partial residual plots for status, income, and verbal.]
The first plot shows a near-zero slope for status, which is also confirmed by the large p-value in the summary of the linear model. No severe non-linearity is visible in the plots.
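For reference, the partial residuals that termplot displays can be built by hand; a sketch for income (termplot centers the term, so this version differs only by a constant shift):
# Partial residuals: ordinary residuals plus income's fitted contribution.
pr.income = residuals(mod) + coef(mod)["income"] * teengamb$income
plot(teengamb$income, pr.income, xlab = "income", ylab = "partial residual")
abline(a = 0, b = coef(mod)["income"])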
Problem 7.8.
library(faraway)
data(fat)
(a).
fatmod = lm(brozek ~ . -siri -density -free -adipos, data = fat)
X = model.matrix(fatmod)[,-1]
e = eigen(t(X) %*% X)
sqrt(e$val[1]/e$val[length(e$val)])
## [1] 555.6707
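We can also look at the entire spectrum of condition indices; several large values would point to more than one near-linear dependency among the predictors (a sketch, reusing e from above):
# Condition indices: sqrt of the largest eigenvalue over each eigenvalue.
round(sqrt(e$val[1]/e$val), 1)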
We have a large (> 30) condition number, suggesting collinearity among the predictor columns. The VIFs are also listed below:
round(vif(X), 2)
The VIFs for weight, hip, and abdom are large (> 10). In the case of weight, for instance, we interpret √33.51 ≈ 5.79 as how much larger SD[β̂_weight] is compared with the situation where no collinearity exists. (If weight were not collinear with the other predictors, its standard deviation would be 5.79 times smaller.)
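As a sanity check on the definition (a sketch): the VIF for weight is 1/(1 − R²), where R² comes from regressing weight on the other predictors.
# Should reproduce vif(X)["weight"], about 33.5.
r2.weight = summary(lm(weight ~ . -brozek -siri -density -free -adipos,
                       data = fat))$r.squared
1/(1 - r2.weight)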
(b).
newmod = lm(brozek ~ . -siri -density -free -adipos, data = fat, subset = c(-39, -42))
X.new = model.matrix(newmod)[,-1]
e1 <- eigen(t(X.new)%*%X.new)
sqrt(e1$val[1]/e1$val[length(e1$val)])
## [1] 554.7978
The condition number is essentially unchanged (554.80 vs. 555.67). The VIFs after removing the two cases:
round(vif(X.new), 2)
## [1] 22.6725
For the remaining parts, we fit a smaller model with just age, weight, and height as predictors, and compute a prediction interval at the median values of the predictors:
fatmod2 = lm(brozek ~ age + weight + height, data = fat)
X2 = model.matrix(fatmod2)[,-1]
vif(X2)
x.median = lapply(fat[c("age", "weight", "height")], median)
x.median
## $age
## [1] 43
##
## $weight
## [1] 176.5
##
## $height
## [1] 70
predict(fatmod2, x.median, interval = "prediction")
[Figure: scatter plots of fat$height against fat$weight, with all points (left) and with the two extreme points removed (right).]
We see a scatter plot of height against weight on the left. To improve the resolution, we remove the two extreme points (on the right and at the bottom of the scatter plot) to get the second plot. This plot makes it clear that no data points combine a weight close to 130 with a height over 70. Thus the predictor values in the last part are quite unusual and lead to extrapolation. The predictor values in part (e) follow the linear relationship between height and weight, but the ones in part (f) do not.
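To quantify the extrapolation, we can compare the leverage of the part (f) point with the largest leverage among the observed data (a sketch; only weight = 130 is stated above, while age = 40 and height = 73 are hypothetical illustrative values):
# Leverage h0 = x0' (X'X)^{-1} x0 for the new point under fatmod2.
X2a = cbind(1, X2)       # design matrix with intercept
x0 = c(1, 40, 130, 73)   # hypothetical point: intercept, age, weight, height
h0 = t(x0) %*% solve(t(X2a) %*% X2a) %*% x0
c(new = h0, max.observed = max(hatvalues(fatmod2)))
If the new point's leverage exceeds every observed leverage, it lies outside the cloud of predictor values, confirming the extrapolation.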