330 Lecture7 2014

STATS 330: Lecture 7
Prediction
5.08.2014
Office hours
- Lecturers

  Name             Office     Email (@auckland.ac.nz)   Day   Time
  Steffen Klaere   303.219    s.klaere                  Thu   10:00-12:00
  Alan Lee         303S.265   aj.lee                    Tue   10:30-12:00
                                                        Thu   10:30-12:00

- Tutors (Room 303.326)

  Name             Email (@aucklanduni.ac.nz)   Day   Time
  Savannah Post    spos008                      Mon   10:00-12:00
                                                Thu   14:30-15:30
  Leshun Xu        lxu472                       Tue   13:00-14:00
                                                Wed   13:00-14:00
                                                Thu   13:00-14:00
  Hongbin Guo      hguo033                      Tue   11:00-12:00
                                                Wed   14:00-16:00
                                                Thu   10:00-11:00
                                                Fri   11:00-12:00
R-hint(s) of the day
Seeing variable names of R objects
> names(cherry.df)
[1] "diameter" "height"   "volume"
Identifying interesting chunks of data
> which(cherry.df$height>85)
[1] 18 31
> cherry.df[cherry.df$height>85,]
   diameter height volume
18     13.3     86   27.4
31     20.6     87   77.0
Aims of today's lecture
- Describe how to use the regression model to predict future
  values of the response, given the values of the covariates.
- Describe how to estimate the mean response for particular
  values of the covariates.
Typical questions
- Given the height and diameter, can we predict the volume of a
  cherry tree? E.g., if a particular tree has diameter 11 inches
  and height 85 feet, can we predict its volume?
- What is the average volume of all cherry trees having a given
  height and diameter? E.g., if we consider every cherry tree
  with diameter 11 inches and height 85 feet, can we estimate
  their mean volume?
$$Y = \mu + \varepsilon$$
[Figure: plot with axes $Y$, $X_1$ and $X_2$]
Prediction: how to do it
- Suppose we want to predict the response of an individual
  whose covariates are $x_1, \dots, x_k$.
- E.g., in the cherry tree example, we have $k = 2$, and we want
  to predict the volume for height $x_1 = 85$ and
  diameter $x_2 = 11$.
- Since the response we want to predict is $Y = \mu + \varepsilon$, we obtain
  a prediction of $Y$ by making separate predictions of $\mu$ and $\varepsilon$ and
  adding the results together.
Prediction: how to do it
- Since $\mu$ is the value of the true regression plane at $x_1, \dots, x_k$,
  we predict $\mu$ by the value of the fitted plane at $x_1, \dots, x_k$.
  This is
  $$b_0 + b_1 x_1 + \dots + b_k x_k$$
- We have no information about $\varepsilon$ other than the fact that it is
  sampled from a normal distribution with zero mean. Thus, we
  predict $\varepsilon = 0$.
- Therefore, to predict the response $y$ for the covariates
  $x_1, \dots, x_k$, we use the value of the fitted plane at
  $x_1, \dots, x_k$ as the predictor.
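The rule above can be sketched in R. This is a minimal sketch, not the lecture's own code: it assumes the built-in `trees` data set (31 black cherry trees) can stand in for `cherry.df`, with its columns renamed to match. It evaluates the fitted plane by hand and compares the result with `predict()`.

```r
# Assumption: the built-in `trees` data stand in for cherry.df (columns renamed).
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)

# Fitted coefficients b0, b1 (diameter), b2 (height)
b <- coef(cherry.lm)

# Value of the fitted plane at diameter 11 inches, height 85 feet
pred.by.hand <- unname(b["(Intercept)"] + b["diameter"] * 11 + b["height"] * 85)

# The same prediction via predict()
pred.by.R <- unname(predict(cherry.lm, data.frame(diameter = 11, height = 85)))

print(c(by.hand = pred.by.hand, by.R = pred.by.R))
```

Both numbers agree because `predict()` with new data simply evaluates the fitted plane; the error term is predicted to be zero.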
The inner product form of a predictor
- Vector of regression coefficients: $b = (b_0, b_1, \dots, b_k)$;
- Vector of predictor variables: $x = (1, x_1, \dots, x_k)$;
- Inner product: $x^T b = b_0 + b_1 x_1 + \dots + b_k x_k$.
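As an illustration, the inner product form might be computed in R as follows. The sketch assumes the built-in `trees` data set stands in for `cherry.df`; note that the entries of `x` must follow the order of the coefficients in the fitted model (intercept, diameter, height).

```r
# Assumption: the built-in `trees` data stand in for cherry.df (columns renamed).
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)

b <- coef(cherry.lm)   # vector of regression coefficients (b0, b1, b2)
x <- c(1, 11, 85)      # (1, diameter, height), same order as the coefficients

# Inner product x^T b; drop() turns the 1 x 1 matrix into a plain number
inner <- drop(t(x) %*% b)
print(inner)
```

`sum(x * b)` gives the same number; the matrix form is used here because it extends directly to the variance formula.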
Prediction error
- Prediction error is measured by
  $$s_P^2 = E(\text{observation} - \text{predictor})^2 = \mathrm{Var}(\text{predictor}) + \sigma^2.$$
- $\mathrm{Var}(\text{predictor})$ comes from predicting the mean, $\sigma^2$ comes
  from predicting the error.
- $\mathrm{Var}(\text{predictor})$ depends on the covariates and on $\sigma^2$.
- The standard error of prediction is $s_P$.
- The prediction interval is approximately
  $$\text{predictor} \pm 2 \times \text{standard error},$$
  or more precisely...
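Prediction interval
- A prediction interval is an interval which contains the actual
  value of the response with a given probability, most commonly
  0.95.
- The $1 - \alpha$ prediction interval based on the inner product $x^T b$ is
  $$x^T b \pm s_P \, t_{n-k-1}(1 - \alpha/2),$$
  where $t_{n-k-1}(1 - \alpha/2)$ is the $1 - \alpha/2$ quantile of the
  t distribution with $n - k - 1$ degrees of freedom.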
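The t quantile in this interval can be obtained with `qt()`. The short sketch below assumes the cherry tree setting with n = 31 trees (taken here from the built-in `trees` data, an assumed stand-in for `cherry.df`) and k = 2 covariates, and shows why "predictor plus or minus 2 standard errors" is a reasonable approximation.

```r
alpha <- 0.05          # for a 95% interval
n <- nrow(trees)       # assumption: `trees` stands in for cherry.df
k <- 2                 # two covariates: diameter and height

# Exact multiplier t_{n-k-1}(1 - alpha/2)
t.mult <- qt(1 - alpha / 2, df = n - k - 1)
print(t.mult)

# Difference from the rule-of-thumb multiplier 2
print(t.mult - 2)
```

The multiplier approaches `qnorm(0.975)`, about 1.96, as the degrees of freedom grow, so the rule of thumb is slightly too narrow for small samples and slightly too wide for large ones.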
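Variance of the predictor
Arrange the covariate data (used for fitting the model) into a matrix $X$
such that
$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}.$$
Then the variance of the predictor at the point $x = (1, x_1, \dots, x_k)$ is
$$\mathrm{Var}(\text{predictor}) = \sigma^2 \, x^T (X^T X)^{-1} x.$$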
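A minimal sketch of this formula in R, assuming the built-in `trees` data set stands in for `cherry.df`. Since the error variance is unknown in practice, the sketch plugs in its estimate (the squared residual standard error); the square root of the result should match the `se.fit` value reported by `predict()`.

```r
# Assumption: the built-in `trees` data stand in for cherry.df (columns renamed).
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)

X <- model.matrix(cherry.lm)   # n x (k+1) matrix with a leading column of 1s
x <- c(1, 11, 85)              # new point: (1, diameter, height)

# Estimate of sigma^2 from the fit
sigma2.hat <- summary(cherry.lm)$sigma^2

# Var(predictor) = sigma^2 * x^T (X^T X)^{-1} x, with sigma^2 estimated
var.pred <- sigma2.hat * drop(t(x) %*% solve(t(X) %*% X) %*% x)
se.fit.by.hand <- sqrt(var.pred)

# Compare with the value computed by predict()
se.fit.by.R <- unname(predict(cherry.lm, data.frame(diameter = 11, height = 85),
                              se.fit = TRUE)$se.fit)
print(c(by.hand = se.fit.by.hand, by.R = se.fit.by.R))
```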
Doing it in R
- Suppose we want to predict the volume of 2 cherry trees. The
  first has diameter 11 inches and height 85 feet, the second
  has diameter 12 inches and height 90 feet.
- The first step is to generate a data frame containing the new
  data. Names must be the same as in the original data.
- Then we need to combine the results of the fit (regression
  coefficients, estimate of error variance) with the new data
  (the predictor variables) to calculate the predictor.
Doing it in R
# Calculate the fit from the original data
> cherry.lm <- lm(volume~diameter+height,data=cherry.df)
# Make a new data frame
> new.df <- data.frame(diameter=c(11,12),height=c(85,90))
# Do the prediction
> predict(cherry.lm,new.df)
# Output
       1        2
22.63846 29.04288
Doing it in R
> predict(cherry.lm,new.df,se.fit=TRUE,interval="prediction")
$fit
       fit      lwr      upr
1 22.63846 13.94717 31.32976
2 29.04288 19.97235 38.11340
$se.fit
       1        2
1.712901 2.130571
$df
[1] 28
$residual.scale
[1] 3.881832
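Hand calculation
$$\text{predictor} = 22.63846$$
$$\mathrm{SE}(\text{predictor}) = \sqrt{\texttt{se.fit}^2 + \texttt{residual.scale}^2} = \sqrt{1.712901^2 + 3.881832^2} = 4.242953$$
$$t_{n-k-1}(0.975) = t_{28}(0.975) = 2.048407$$
$$\text{Prediction interval} = \text{predictor} \pm \mathrm{SE}(\text{predictor}) \times t_{n-k-1}(0.975) = 22.63846 \pm 4.242953 \times 2.048407 = [13.94717, 31.32975]$$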
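The hand calculation can be checked in R without retyping rounded numbers. The sketch assumes the built-in `trees` data set stands in for `cherry.df`; it rebuilds the prediction interval for the first tree from the components returned by `predict()` and compares it with the interval `predict()` reports.

```r
# Assumption: the built-in `trees` data stand in for cherry.df (columns renamed).
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)
new.df <- data.frame(diameter = 11, height = 85)

# Components needed for the hand calculation
p <- predict(cherry.lm, new.df, se.fit = TRUE)
se.pred <- unname(sqrt(p$se.fit^2 + p$residual.scale^2))  # SE(predictor)
t.mult  <- qt(0.975, df = p$df)                           # t_{n-k-1}(0.975)

# predictor +/- SE(predictor) * t quantile
pi.by.hand <- unname(p$fit) + c(-1, 1) * se.pred * t.mult

# Interval as computed by predict()
pi.by.R <- predict(cherry.lm, new.df, interval = "prediction")

print(pi.by.hand)
print(pi.by.R)
```

Working from unrounded values avoids small last-digit discrepancies that can arise when the rounded printed numbers are multiplied by hand.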
Estimating the mean response
- Suppose we want to estimate the mean response of all
  individuals whose covariates are $x_1, \dots, x_k$.
- Since the mean we want to estimate is the height of the true
  regression plane at $x_1, \dots, x_k$, we use as our estimate the
  height of the fitted plane at $x_1, \dots, x_k$. This is
  $$b_0 + b_1 x_1 + \dots + b_k x_k$$
Standard error of the estimate
- The standard error of the estimate is the square root of the
  variance of the predictor.
- Note that this is less than the standard error of the prediction!
  $$\mathrm{SE}(\text{predictor}) = \sqrt{\mathrm{Var}(\text{predictor}) + \sigma^2},$$
  $$\mathrm{SE}(\text{estimate}) = \sqrt{\mathrm{Var}(\text{predictor})}.$$
- Confidence interval:
  $$\text{predictor} \pm \mathrm{SE}(\text{estimate}) \times t_{n-k-1}(1 - \alpha/2)$$
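To see how much the extra error variance term matters, the two standard errors can be compared side by side. This is a sketch under the assumption that the built-in `trees` data set stands in for `cherry.df`.

```r
# Assumption: the built-in `trees` data stand in for cherry.df (columns renamed).
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)

p <- predict(cherry.lm, data.frame(diameter = 11, height = 85), se.fit = TRUE)

# SE(estimate): uncertainty about the mean only
se.estimate <- unname(p$se.fit)

# SE(predictor): adds the (estimated) error variance
se.predictor <- unname(sqrt(p$se.fit^2 + p$residual.scale^2))

# Half-widths of the 95% confidence and prediction intervals
t.mult <- qt(0.975, df = p$df)
half.widths <- c(confidence = se.estimate, prediction = se.predictor) * t.mult
print(half.widths)
```

Collecting more data shrinks `se.estimate` towards zero, but `se.predictor` can never fall below the residual standard deviation.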

Doing it in R
- Suppose we want to estimate the mean volume of all cherry
  trees having diameter 11 inches and height 85 feet, or
  diameter 12 inches and height 90 feet.
- As before, the first step is to make a data frame containing the
  new data. Names must be the same as in the original data.
Doing it in R
> predict(cherry.lm,new.df,se.fit=TRUE,interval="confidence")
$fit
       fit      lwr      upr
1 22.63846 19.12974 26.14718
2 29.04288 24.67860 33.40716
$se.fit
       1        2
1.712901 2.130571
$df
[1] 28
$residual.scale
[1] 3.881832
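The confidence interval can be rebuilt from `se.fit` alone, since no error variance term is added. The sketch below does this for the second tree (diameter 12 inches, height 90 feet), again assuming the built-in `trees` data set stands in for `cherry.df`.

```r
# Assumption: the built-in `trees` data stand in for cherry.df (columns renamed).
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)
new.df <- data.frame(diameter = 12, height = 90)

p <- predict(cherry.lm, new.df, se.fit = TRUE)

# predictor +/- SE(estimate) * t_{n-k-1}(0.975)
ci.by.hand <- unname(p$fit) + c(-1, 1) * unname(p$se.fit) * qt(0.975, df = p$df)

# Interval as computed by predict()
ci.by.R <- predict(cherry.lm, new.df, interval = "confidence")

print(ci.by.hand)
print(ci.by.R)
```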
Example: Hydrocarbon data
When petrol is pumped into a tank, hydrocarbon vapours are
forced into the atmosphere. To reduce this significant source of air
pollution, devices are installed to capture the vapour. A laboratory
experiment was conducted in which the amount of vapour given off
was measured under carefully controlled conditions. In addition to
the response, there were four variables which were thought relevant
for prediction:
  t.temp: initial tank temperature (degrees F)
  p.temp: temperature of dispensed petrol (degrees F)
  t.vp:   initial vapour pressure in tank (psi)
  p.vp:   vapour pressure of dispensed petrol (psi)
  hc:     emitted dispensed hydrocarbons (g) (response)
Pairs plot
[Figure: pairs plot of hc, t.temp, p.temp, t.vp and p.vp, shown over three slides. The last version adds the pairwise correlations in the lower panels:]

          hc     t.temp   p.temp   t.vp
t.temp    0.81
p.temp    0.88   0.81
t.vp      0.85   0.94     0.77
p.vp      0.91   0.93     0.83     0.98
Preliminary conclusions
- All variables seem related to the response
- p.vp and t.vp seem highly correlated
- Quite strong correlations between some of the other variables
- No obvious outliers
Fitting the full model
> vapour.reg <- lm(hc~t.temp+p.temp+t.vp+p.vp,data=vapour.df)
> summary(vapour.reg)
Call:
lm(formula = hc ~ t.temp + p.temp + t.vp + p.vp, data = vapour.df)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.16609    1.02198   0.163  0.87117
t.temp      -0.07764    0.04801  -1.617  0.10850
p.temp       0.18317    0.04063   4.508 1.53e-05 ***
t.vp        -4.45230    1.56614  -2.843  0.00526 **
p.vp        10.27271    1.60882   6.385 3.37e-09 ***
Residual standard error: 2.723 on 120 degrees of freedom
Multiple R-squared: 0.8959, Adjusted R-squared: 0.8925
F-statistic: 258.2 on 4 and 120 DF, p-value: < 2.2e-16
Conclusions
- Large $R^2$
- The coefficients of all covariates are significant except for t.temp
- Model seems satisfactory
- Move on to prediction
- Let us predict hydrocarbon emissions when
  t.temp = 28, p.temp = 30, t.vp = 3, p.vp = 3.5
Prediction
> vapour.pred <- data.frame(t.temp=28,p.temp=30,t.vp=3,p.vp=3.5)
> predict(vapour.reg,vapour.pred,interval="p")
       fit      lwr      upr
1 26.08471 20.23045 31.93898
With 95% probability (the default level), hydrocarbon emissions are
predicted to be between 20 and 32 grams.
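Two details of the `predict()` call are worth illustrating: `interval="p"` is a partial match for `"prediction"`, and the coverage probability is set by the `level` argument (default 0.95). Because `vapour.df` is not reproduced here, the sketch below uses the cherry tree model instead, assuming the built-in `trees` data set stands in for `cherry.df`.

```r
# Assumption: the built-in `trees` data stand in for cherry.df (columns renamed).
cherry.df <- data.frame(diameter = trees$Girth,
                        height   = trees$Height,
                        volume   = trees$Volume)
cherry.lm <- lm(volume ~ diameter + height, data = cherry.df)
new.df <- data.frame(diameter = 11, height = 85)

# "p" is matched to "prediction"; level = 0.95 is the default
pi.short <- predict(cherry.lm, new.df, interval = "p")
pi.long  <- predict(cherry.lm, new.df, interval = "prediction", level = 0.95)

# A 99% prediction interval is wider than the 95% one
pi.99 <- predict(cherry.lm, new.df, interval = "prediction", level = 0.99)

print(pi.short)
print(pi.99)
```

Spelling out `interval = "prediction"` in scripts is arguably the more readable choice, since `"p"` gives no hint to a reader unfamiliar with the options.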
http://xkcd.com/1131/
