Regression Modeling Strategies
Frank E Harrell Jr
Department of Biostatistics
Vanderbilt University School of Medicine
Nashville TN 37232 USA
f.harrell@vanderbilt.edu
biostat.mc.vanderbilt.edu/rms
Google group regmod
General questions: stats.stackexchange.com, tag regression-strategies
Supplemental material: biostat.mc.vanderbilt.edu/ClinStat:
Biostatistics for Biomedical Research
Chicago
2016-07-31
Contents

1 Introduction
2 General Aspects of Fitting Regression Models: Model Formulations; Nominal Predictors; Interactions; Avoiding Categorization; Splines for Estimating the Shape of the Regression Function and Determining Predictor Transformations; Nonparametric Regression; Regression Assumptions
3 Missing Data: Prelude to Modeling; Multiple Imputation; Diagnostics
4 Multivariable Modeling Strategies: Overfitting; Variable Selection; Shrinkage; Collinearity; Data Reduction (Redundancy Analysis, Variable Clustering)
5 Interpreting Effects; The Bootstrap; Model Validation; Data-Splitting
6 R Software
7 Modeling Longitudinal Responses: Notation; Generalized Least Squares; Case Study
8 Case Study in Binary Logistic Regression (Titanic survival)
A symbol in the right margin indicates a hyperlink to a YouTube video related to the subject. A symbol in the right margin is a hyperlink to an audio file elaborating on the notes. Red letters and numbers in the right margin are cues referred to within the audio recordings.
Course Philosophy
- Satisfaction of model assumptions improves precision and increases statistical power
- It is more productive to make a model fit step by step (e.g., transformation estimation) than to postulate a simple model and find out what went wrong
- Graphical methods should be married to formal inference
- Overfitting occurs frequently, so data reduction and model validation are important
- Software without multiple facilities for assessing and fixing model fit may only seem to be user-friendly
- Carefully fitting an improper model is better than badly fitting (and overfitting) a well-chosen one
- Methods which work for all types of regression models are the most valuable
- In most research projects the cost of data collection far outweighs the cost of data analysis, so it is important to use the most efficient and accurate modeling techniques, to avoid categorizing continuous variables, and to not remove data from the estimation sample just to be able to validate the model
- The bootstrap is a breakthrough for statistical modeling and model validation
- Using the data to guide the data analysis is almost as dangerous as not doing so
- A good overall strategy is to decide how many degrees of freedom (i.e., number of regression parameters) can be spent, where they should be spent, and to spend them with no regrets

See the excellent text Clinical Prediction Models by Steyerberg [151].
Chapter 1
Introduction
1.1
- log-rank test generalizes to the Cox model
- Models not only allow for multiplicity adjustment but for shrinkage of estimates: statisticians are comfortable with P-value adjustment but fail to recognize that the difference between the most different treatments is badly biased
- Statistical estimation is usually model-based, e.g., the relative effect of increasing cholesterol from 200 to 250 mg/dl on the hazard of death, holding other risk factors constant
- Adjustment depends on how the other risk factors relate to the hazard
- Usually interested in adjusted (partial) effects, not unadjusted (marginal or crude) effects
1.2
1.3
An answer by a dichotomizer:
If you are among other cars, are you going faster than 73?
If you are exposed, are you going faster than 67?
Better:
1.4
Iezzoni [87] lists these dimensions to capture, for patient outcome studies:
1. age
2. sex
3. acute clinical stability
4. principal diagnosis
5. severity of principal diagnosis
6. extent and severity of comorbidities
7. physical functional status
8. psychological, cognitive, and psychosocial functioning
9. cultural, ethnic, and socioeconomic attributes and behaviors
10. health status and quality of life
11. patient attitudes and preferences for outcomes
General aspects to capture in the predictors:
1. baseline measurement of response variable
2. current status
3. trajectory as of time zero, or past levels of a key variable
4. variables explaining much of the variation in the response
5. more subtle predictors whose distributions strongly differ between levels of the key variable of interest in an observational
study
1.5
1.6
Chapter 2
- Hypothesis testing
- Deep understanding of uncertainties associated with all model components
  - Simplest example: confidence interval for the slope of a predictor
  - Confidence intervals for predicted values; simultaneous confidence intervals for a series of predicted values, e.g. a confidence band for Y over a series of values of X
- Alternative: Stratification
  - Cross-classify subjects on the basis of the Xs, estimate a property of Y for each stratum
  - Only handles a small number of Xs
  - Does not handle continuous X
- Alternative: Single Trees (recursive partitioning/CART)
  - Interpretable because they are over-simplified and usually wrong
  - Cannot separate effects
2.1
Parameters (regression coefficients): β0 (intercept), β1, . . . , βp; vector of predictors: X.
2.2 Model Formulations

General regression model:

  C(Y|X) = g(X).

General linear regression model:

  C(Y|X) = g(Xβ).

Examples:

  C(Y|X) = E(Y|X) = Xβ,  Y|X ~ N(Xβ, σ²)
  C(Y|X) = Prob{Y = 1|X} = (1 + exp(-Xβ))⁻¹

Linearize: h(C(Y|X)) = Xβ, where h(u) = g⁻¹(u). Example:

  C(Y|X) = Prob{Y = 1|X} = (1 + exp(-Xβ))⁻¹
  h(u) = logit(u) = log(u / (1 - u))

Write h(C(Y|X)) ≡ C′(Y|X) (the link). General linear regression model:

  C′(Y|X) = Xβ.
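As a quick check of the link algebra, a minimal sketch using R's built-in logit and inverse-logit functions (qlogis and plogis):

u <- 0.8
qlogis(u)                               # logit: log(0.8/0.2) = 1.386
plogis(qlogis(u))                       # inverse logit recovers 0.8
all.equal(plogis(2), 1/(1 + exp(-2)))   # the (1 + exp(-x))^-1 form above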
2.3 Nominal Predictors

A nominal (polytomous) factor with k levels is coded as k - 1 dummy variables. E.g., T = J, K, L, M:

  C(Y|T = J) = β0
  C(Y|T = K) = β0 + β1
  C(Y|T = L) = β0 + β2
  C(Y|T = M) = β0 + β3,

where

  X1 = 1 if T = K, 0 otherwise
  X2 = 1 if T = L, 0 otherwise
  X3 = 1 if T = M, 0 otherwise.

The test for any differences in the property C(Y) between treatments is H0: β1 = β2 = β3 = 0.
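A minimal sketch of how R expands such a factor into the k - 1 dummies (hypothetical data; the reference level J is absorbed into the intercept):

trt <- factor(c('J', 'K', 'L', 'M'))
model.matrix(~ trt)   # columns: (Intercept), trtK, trtL, trtM = X1, X2, X3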
2.3.2 Interactions
2.3.3

Meaning of the parameters in the model C(Y|X) = β0 + β1 age + β2 [sex = f] + β3 age [sex = f]:

  β0 = C(Y|age = 0, sex = m)
  β1 = C(Y|age = x + 1, sex = m) - C(Y|age = x, sex = m)
  β2 = C(Y|age = 0, sex = f) - C(Y|age = 0, sex = m)
  β3 = [C(Y|age = x + 1, sex = f) - C(Y|age = x, sex = f)]
       - [C(Y|age = x + 1, sex = m) - C(Y|age = x, sex = m)]

One can also think of the last part of the model as being β3 X3, where X3 = age · I[sex = f].
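A minimal sketch with simulated data; the lm coefficients line up with the four parameter meanings above:

set.seed(1)
age <- runif(100, 20, 80)
sex <- factor(sample(c('m', 'f'), 100, TRUE), levels=c('m', 'f'))
y   <- .1*age + 2*(sex == 'f') + .05*age*(sex == 'f') + rnorm(100)
f   <- lm(y ~ age * sex)
coef(f)   # (Intercept): C(Y|age=0, m);  age: slope for males;
          # sexf: f-m difference at age 0;  age:sexf: difference in slopes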
In the model

  y ~ age + sex + weight + waist + tricep
2.4
Avoiding Categorization
Natura non facit saltus
(Nature does not make jumps)
Relationships are seldom linear, except when predicting one variable from itself measured earlier. Categorizing continuous predictors into intervals is a disaster; see references [138, 2, 82, 100, 4, 11, 53, 135, 156, 24, 114, 142, 5, 84, 118, 170, 55, 60] and Biostatistics for Biomedical Research.
If a cutpoint is chosen that minimizes the P-value and the resulting P-value is 0.05, the true type I error can easily be above 0.5 [84].
2.4.2
NHANES III; Diabetes Care 32:1327-34; 2009 adapted from Diabetes Care 20:1183-1197; 1997.
A linear spline with knots at a, b, c:

  f(X) = β0 + β1X                                          X ≤ a
       = β0 + β1X + β2(X - a)                              a < X ≤ b
       = β0 + β1X + β2(X - a) + β3(X - b)                  b < X ≤ c
       = β0 + β1X + β2(X - a) + β3(X - b) + β4(X - c)      c < X.

Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5 (f(X) plotted against X).
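A minimal sketch of constructing this basis by hand with pmax; rms supplies the same thing via lsp(x, c(a, b, c)):

x <- seq(0, 6, length=61)
a <- 1; b <- 3; c <- 5
X <- cbind(x, pmax(x - a, 0), pmax(x - b, 0), pmax(x - c, 0))
# A fit such as lm(y ~ X) is then piecewise linear with knots at
# a, b, c and continuous at the knots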
2.4.4
  X1 = X,  X2 = X²,  X3 = X³,
  X4 = (X - a)³₊,  X5 = (X - b)³₊,  X6 = (X - c)³₊.

k knots require k + 3 coefficients excluding the intercept. The X² and X³ terms must be included to allow nonlinearity when X < a.
2.4.5
Stone and Koo [155]: cubic splines can be poorly behaved in the tails, so constrain the function to be linear in the tails. The k + 3 coefficients reduce to k - 1 parameters [44].

To force linearity when X < a, the X² and X³ terms must be omitted. To force linearity when X > the last knot, the last two βs are redundant, i.e., are just combinations of the other βs.

The restricted cubic spline function with knots t1, . . . , tk is given by [44]

  f(X) = β0 + β1X1 + β2X2 + . . . + βk-1Xk-1,

where X1 = X and for j = 1, . . . , k - 2,

  Xj+1 = (X - tj)³₊ - (X - tk-1)³₊ (tk - tj)/(tk - tk-1)
         + (X - tk)³₊ (tk-1 - tj)/(tk - tk-1).

Xj is linear in X for X ≥ tk. For numerical behavior and to put all basis functions for X on the same scale, the R Hmisc and rms package functions by default divide the terms above by τ = (tk - t1)².
require(Hmisc)
x  <- rcspline.eval(seq(0, 1, .01),
                    knots=seq(.05, .95, length=5), inclx=TRUE)
xm <- x
xm[xm > .0106] <- NA
Figure 2.2: Restricted cubic spline component variables for k = 5 and knots at X = .05, .275, .5, .725, and .95. Nonlinear basis functions are scaled by τ. The left panel is a y-magnification of the right panel. Fitted functions such as those in Figure 2.3 will be linear combinations of these basis functions as long as knots are at the same locations used here.
Figure 2.3: Some typical restricted cubic spline functions for k = 3, 4, 5, 6 (panels for 3, 4, 5, and 6 knots). The y-axis is Xβ. Arrows indicate knots. These curves were derived by randomly choosing values of β subject to standard deviations of fitted functions being normalized.
2.4.6
Default quantiles of X used for knot placement:

  k    Quantiles
  3    .10   .5     .90
  4    .05   .35    .65    .95
  5    .05   .275   .5     .725   .95
  6    .05   .23    .41    .59    .77    .95
  7    .025  .1833  .3417  .5     .6583  .8167  .975
For n < 100, replace the outer quantiles with the 5th smallest and 5th largest X [155].

Choice of k:
- Flexibility of fit vs. n and variance
- Usually k = 3, 4, 5; often k = 4
- Large n (e.g. n ≥ 100): k = 5
- Small n (< 30, say): k = 3
- Can use Akaike's information criterion (AIC) [6, 163] to choose k: choose k to maximize the model likelihood ratio χ² - 2k (a small sketch follows this list)
- See [64] for a comparison of restricted cubic splines, fractional polynomials, and penalized splines
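A minimal sketch of the AIC approach on simulated data (assumes rms's AIC method for fit objects):

require(rms)
set.seed(2)
x <- runif(200)
y <- sin(2*pi*x) + rnorm(200, sd=.5)
for(k in 3:6) {
  f <- ols(y ~ rcs(x, k))
  cat('k =', k, ' AIC =', AIC(f), '\n')   # pick k with the smallest AIC
}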
2.4.7
Nonparametric Regression
A moving average estimates E(Y|X = x) from the Y values in a bin around x:
- overlap of bins is OK
- problem in estimating E(Y) at outer X-values
- estimates very sensitive to bin width

Moving linear regression is far superior to a moving average (a moving flat line).
Cleveland's [33] moving linear regression smoother loess (locally weighted least squares) is the most popular smoother. To estimate the central tendency of Y at X = x:
- take all the data having X values within a suitable interval about x (default is 2/3 of the data)
- fit weighted least squares linear regression within this neighborhood
- points near x are given the most weight^e
- points near the extremes of the interval receive almost no weight
- loess works much better at the extremes of X than a moving average
- provides an estimate at each observed X; other estimates are obtained by linear interpolation
- an outlier rejection algorithm is built in
- loess works for binary Y: just turn off outlier detection

Another popular smoother is Friedman's "super smoother". For loess or supsmu the amount of smoothing can be controlled by the analyst.

^e "Weight" here means something different than a regression coefficient. It means how much a point is emphasized in developing the regression coefficients.
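A minimal sketch for a binary Y; family='gaussian' suppresses loess's robustness (outlier-rejection) iterations, as recommended above:

set.seed(3)
x  <- runif(300)
y  <- rbinom(300, 1, plogis(4*x - 2))
f  <- loess(y ~ x, family='gaussian')   # no outlier rejection for binary Y
xs <- seq(0, 1, length=100)
plot(xs, predict(f, data.frame(x=xs)), type='l',
     xlab='X', ylab='Estimated Prob[Y=1]')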
Smoothing splines place knots at all the observed data points but penalize coefficient estimates towards smoothness, at the cost of burdensome computations.
2.5
2.6
2.7
Regression Assumptions
The general linear regression model is C(Y|X) = Xβ.

Figure 2.4: Regression assumptions for one binary (X1) and one continuous (X2) predictor
Disadvantages:
Complexity
Generally difficult to allow for interactions when assessing
patterns of effects
Confidence limits, formal inference can be problematic for methods 1-4.
- Tentative transformation g(X2): check adequacy by expanding g(X2) in a spline function and testing linearity
- Can find transformations by plotting estimated g(X2) vs. f(X2) for a variety of g
- Multiple continuous predictors: expand each using a spline
- Example: assess linearity of X2, X3 by expanding each with a spline:

  C(Y|X) = β0 + β1X1 + β2X2 + β3X2′ + β4X2″ + β5X3 + β6X3′ + β7X3″
Figure 2.5: Logistic regression estimate of the probability of a hemorrhagic stroke for patients in the GUSTO-I trial given t-PA, using a tensor spline of two restricted cubic splines (axes SYSBP and DIABP; 4 knots, penalty = 21) and penalization (shrinkage). Dark (cold color) regions are low risk, and bright (hot) regions are higher risk.
Figure 2.5 is particularly interesting because the literature had suggested (based on approximately 24 strokes) that pulse pressure was the main cause of hemorrhagic stroke, whereas this flexible modeling approach (based on approximately 230 strokes) suggests that mean arterial blood pressure (roughly a 45° line) is what is most important over a broad range of blood pressures. At the far right one can see that pulse pressure (the axis perpendicular to the 45° line) may have an impact, although a non-monotonic one.
Other issues:
Distributional Assumptions
Chapter 3
Missing Data
3.1
^a Although missing at random (MAR) is a non-testable assumption, it has been pointed out in the literature that we can get very close to MAR if we include enough variables in the imputation models [72].
3.2
Prelude to Modeling
3.3
^b Twist et al. [159] found instability in using multiple imputation of longitudinal data, and advantages of using instead full likelihood models.
^c White and Royston [175] provide a method for multiply imputing missing covariate values using censored survival time data.
^d Y is so valuable that if one is only missing a Y value, imputation is not worthwhile, and imputation of Y is not advised if MCAR or MAR.
3.4
Imputation of Y in that case does not improve the analysis and assumes the imputation model is correct.
Results of 1000 simulations with β1 = 1.0, MAR, and two types of imputation:

  Imputation Method   β̂1    β̂2
  Indicator           0.55  0.51
  Overall mean        0.55

In the incomplete observations the constant X1 is uncorrelated with X2.
3.5
Tricube weight function (used in loess); a small sketch follows this list:

  wᵢ = (1 - min(dᵢ/s, 1)³)³,   dᵢ = |ŷᵢ - ŷ|,

where s = 0.2 · mean|ŷᵢ - ŷ| is a good default scale factor; scale so that Σ wᵢ = 1.
- Recursive partitioning with surrogate splits handles the case where a predictor of a variable needing imputation is missing itself
- [176] discusses an alternative method based on choosing a donor observation at random from the q closest matches (q = 3, for example)
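A minimal sketch of the weight calculation; ystar stands in for the predicted value of the observation being imputed (hypothetical numbers):

ystar <- 10
yhat  <- c(8, 9, 10, 11, 14)   # predicted values for candidate donors
d <- abs(yhat - ystar)
s <- 0.2 * mean(d)             # default scale factor
w <- (1 - pmin(d/s, 1)^3)^3
w / sum(w)                     # scaled so the weights sum to 1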
3.6
3.7
Multiple Imputation
- Single imputation could use a random draw from the conditional distribution for an individual: X̂j = Zθ̂ + ε̂, Z = [X₋j, Y] plus auxiliary variables, with ε̂ drawn from n(0, σ̂) or a random draw from the calculated residuals (e.g., via the bootstrap, or the approximate Bayesian bootstrap [139, 72]: sample with replacement from a sample with replacement of the residuals)
- Multiple imputations (M) with random draws: draw a sample of M residuals for each missing value to be imputed; average the M β̂
- In general this can provide the least biased estimates of β
- BUT full multiple imputation needs to account for uncertainty in the imputation models by refitting these models for each of the M draws; transcan does not do that, aregImpute does
- Note that multiple imputation can and should use the response variable for imputing predictors [117]
- The aregImpute algorithm [117]:
  - Takes all aspects of uncertainty into account using the bootstrap; different bootstrap resamples are used for each imputation, by fitting a flexible additive model on a sample with replacement from the original data. This model is used to predict all of the original missing and non-missing values for the target variable for the current imputation
  - Uses flexible parametric additive regression models to impute; there is an option to allow target variables to be optimally transformed, even non-monotonically (but this can overfit)
  - By default uses predictive mean matching for imputation
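A minimal sketch of the aregImpute workflow on simulated data; fit.mult.impute refits the model for each completed dataset and adjusts the variance-covariance matrix for imputation:

require(rms)   # loads Hmisc, which provides aregImpute
set.seed(4)
n  <- 200
x1 <- rnorm(n); x2 <- x1 + rnorm(n)
y  <- x1 + x2 + rnorm(n)
x2[sample(n, 40)] <- NA                  # make 20% of x2 missing
d  <- data.frame(x1, x2, y)
a  <- aregImpute(~ y + x1 + x2, data=d, n.impute=5)
f  <- fit.mult.impute(y ~ x1 + x2, ols, a, data=d)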
See Barzi and Woodward [10] for a nice review of multiple imputation with detailed comparison of results (point estimates and confidence limits for the effect of the sometimes-missing predictor) for various imputation methods. Barnes et al. [9] have a good overview of imputation methods and a comparison of bias and confidence interval coverage for the methods when applied to longitudinal data with a small number of subjects. Horton and Kleinman [85] have a good review of several software packages for dealing with missing data, and a comparison of them with aregImpute. Harel and Zhou [72] provide a nice overview of multiple imputation and discuss some of the available software. White and Carlin [174] studied bias of multiple imputation vs. complete-case analysis. White et al. [176] provide much practical guidance.
Caution: Methods can generate imputations having very reasonable distributions but still not having the property that final response model regression
coefficients have nominal confidence interval coverage. It is worth checking
that imputations generate the correct collinearities among covariates.
3.9
Diagnostics
3.10
  Method                           Deletion      Single  Multiple
  Allows non-random missing                      x       x
  Reduces sample size              x
  Apparent S.E. of β̂ too low                    x
  Increases real S.E. of β̂        x
  β̂ biased                        if not MCAR   x
4. Same as previous but the predictors can somewhat be predicted from non-missing predictors: nc = 750
5. The outcome variable was not assessed on a random 1/5 of subjects: nc = 800
Chapter 4
4.1
Overfitting
Examples of under- and overfitting, beginning with overfitting a flat relationship:
require(rms)
set.seed(1)
x <- runif(1000)
y <- runif(1000, -0.5, 0.5)
dd <- datadist(x, y); options(datadist='dd')
par(mfrow=c(2,2), mar=c(2, 2, 3, 0.5))
pp <- function(actual) {
  yhat  <- predict(f, data.frame(x=xs))
  yreal <- actual(xs)
  plot(0, 0, xlim=c(0,1),
       ylim=range(c(quantile(y, c(0.1, 0.9)), yhat, yreal)),
       type='n', axes=FALSE)
  axis(1, labels=FALSE); axis(2, labels=FALSE)
  lines(xs, yreal)
  lines(xs, yhat, col='blue')
}
f  <- ols(y ~ rcs(x, 5))
xs <- seq(0, 1, length=150)
pp(function(x) 0 * x)
title('Mild Error:\nOverfitting a Flat Relationship', cex=0.5)

y <- x + runif(1000, -0.5, 0.5)
f <- ols(y ~ rcs(x, 5))
pp(function(x) x)
title('Mild Error:\nOverfitting a Linear Relationship', cex=0.5)

y <- x^4 + runif(1000, -1, 1)
f <- ols(y ~ x)
pp(function(x) x^4)
title('Serious Error:\nUnderfitting a Steep Relationship', cex=0.5)

y <- -(x - 0.5)^2 + runif(1000, -0.2, 0.2)
f <- ols(y ~ x)
pp(function(x) -(x - 0.5)^2)
title('Tragic Error:\nMonotonic Fit to\nNon-Monotonic Relationship', cex=0.5)
4.1.2
A global test statistic does not inform the analyst of which groups are different from one another.
4.2
4.3
Variable Selection
3. The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model.
4. The population multiple coefficient of determination
could be faithfully estimated by adopting a statistic
that is adjusted by the total number of candidate
predictor variables rather than the number of variables in the final model.
If the global test with p d.f. is insignificant, stop.
Simulation experiment: true σ² = 6.25, 8 candidate variables, 4 of them related to Y in the population. Select the best model using AIC.
require(MASS)
sim <- function(n, sigma=2.5, pr=FALSE, prcor=FALSE) {
  x1 <- rnorm(n)
  x2 <- x1 + 0.5 * rnorm(n)
  x3 <- rnorm(n)
  x4 <- x3 + 1.5 * rnorm(n)
  x5 <- x1 + rnorm(n) / 1.3
  x6 <- x2 + rnorm(n) / 1.3
  x7 <- x3 + x4 + rnorm(n)
  x8 <- x7 + 0.5 * rnorm(n)
  if(prcor) return(round(cor(cbind(x1,x2,x3,x4,x5,x6,x7,x8)), 2))
  lp <- x1 + x2 + .5*x3 + .4*x7
  y  <- lp + sigma * rnorm(n)
  f  <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8)
  g  <- stepAIC(f, trace=0)
  p  <- g$rank - 1
  xs <- if(p == 0) 'none' else
          gsub('[\\+ x]', '', as.character(formula(g))[3])
  if(pr) print(formula(g), showEnv=FALSE)
  ssesw <- sum(resid(g)^2)
  s2s   <- ssesw / g$df.residual
  # Set SSEsw / (n - gdf - 1) = true sigma^2 to solve for the
  # generalized (effective) d.f. of the stepwise procedure
  gdf <- n - 1 - ssesw / (sigma^2)
  # Compute root mean squared error against the true linear predictor;
  # the next lines are an assumed reconstruction of code lost in
  # extraction, matching the statistic names printed below
  list(stats = c(n=n, vratio=s2s/sigma^2, gdf=gdf, apparentdf=p,
                 rmse.full = sqrt(mean((fitted(f) - lp)^2)),
                 rmse.step = sqrt(mean((fitted(g) - lp)^2))),
       xselected = xs)
}
rsim <- function(B, n) {
  xs <- character(B)
  r  <- matrix(NA, nrow=B, ncol=6)
  for(i in 1:B) {
    w     <- sim(n)
    r[i,] <- w$stats
    xs[i] <- w$xselected
  }
  colnames(r) <- names(w$stats)
  s <- apply(r, 2, median)
  p <- r[, 'apparentdf']
  s['apparentdf'] <- mean(p)
  print(round(s, 2))
  print(table(p))
  cat('Prob[correct model]=', round(sum(xs == '1237')/B, 2), '\n')
}
The actual model was not selected even once at the smaller sample sizes. Results (medians over 100 simulations; apparentdf is averaged):

First run:
  vratio 0.70, gdf 8.09, apparentdf 3.65, rmse.full 1.62, rmse.step 1.56
  p:          1   2   3   4   5   6   7
  frequency:  4  16  32  22  14   9   3
  Prob[correct model] = 0

Second run:
  vratio 0.87, gdf 7.34, apparentdf 3.06, rmse.full 1.21, rmse.step 1.15
  p:          1   2   3   4   5   6   7
  frequency:  2  34  33  21   8   1   1
  Prob[correct model] = 0

rsim(100, 150)
  n 150.00, vratio 0.97, gdf 9.08, apparentdf 3.81, rmse.full 0.59, rmse.step 0.62
  p:          2   3   4   5   6
  frequency: 10  24  44  19   3
  Prob[correct model] = 0.13

rsim(100, 300)
  n 300.00, vratio 0.98, gdf 9.26, apparentdf 4.21, rmse.full 0.43, rmse.step 0.41
  p:          3   4   5   6
  frequency: 12  60  23   5
  Prob[correct model] = 0.38

rsim(100, 2000)
  n 2000.00, vratio 1.00, gdf 6.30, apparentdf 4.58, rmse.full 0.17, rmse.step 0.15
  p:          4   5   6   7
  frequency: 54  35  10   1
  Prob[correct model] = 0.52
4.4
Rule of thumb: limit the number of candidate parameters to m/15, where m is the limiting (effective) sample size [75, 76, 148, 125, 124, 168, 132].
^a If one considers the power of a two-sample binomial test compared with a Wilcoxon test if the response could be made continuous and the proportional odds assumption holds, the effective sample size for a binary response is 3n1n2/n ≈ 3 min(n1, n2) if n1/n is near 0 or 1 [177, Eq. 10, 15]. Here n1 and n2 are the marginal frequencies of the two response levels [124].
^b Based on the power of a proportional odds model two-sample test when the marginal cell sizes for the response are n1, . . . , nk, compared with all cell sizes equal to unity (response is continuous) [177, Eq. 3]. If all cell sizes are equal, the relative efficiency of having k response categories compared to a continuous response is 1 - 1/k² [177, Eq. 14], e.g., a 5-level response is almost as efficient as a continuous one if proportional odds holds across category cutoffs.
^c This is approximate, as the effective sample size may sometimes be boosted somewhat by censored observations, especially for non-proportional-hazards methods such as Wilcoxon-type tests [14].
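A minimal sketch of footnote a's effective sample size for a binary response:

n1 <- 100; n2 <- 900; n <- n1 + n2
3 * n1 * n2 / n    # 270; close to 3*min(n1, n2) = 300 since n1/n is small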
4.5
Shrinkage
Slope of calibration plot; regression to the mean
Figure 4.1: Sorted means from 20 samples of size 50 from a uniform [0, 1] distribution. The reference line at 0.5 depicts the true population value of all of the means.
Heuristic shrinkage estimate for OLS^d:

  γ̂ = (n - p - 1)/(n - 1) × R²_adj / R²,

where

  R²_adj = 1 - (1 - R²)(n - 1)/(n - p - 1).
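A minimal sketch computing the heuristic shrinkage factor for an OLS fit on simulated data:

set.seed(5)
n <- 100; p <- 10
X <- matrix(rnorm(n*p), n, p)
y <- X[,1] + rnorm(n)
f     <- lm(y ~ X)
r2    <- summary(f)$r.squared
r2adj <- 1 - (1 - r2) * (n - 1)/(n - p - 1)
(n - p - 1)/(n - 1) * r2adj/r2   # estimated shrinkage; < 1 means overfitting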
4.6 Collinearity
4.7 Data Reduction
Redundancy Analysis
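A minimal sketch of one redundancy-analysis tool, the Hmisc redun function, which flags predictors that can be predicted (here with R² ≥ 0.8) from the remaining predictors; hypothetical data:

require(Hmisc)
set.seed(6)
n  <- 200
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd=.2)   # x2 nearly redundant with x1
x3 <- rnorm(n); x4 <- rnorm(n)
redun(~ x1 + x2 + x3 + x4, r2=0.8)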
Variable Clustering
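A minimal sketch of Hmisc's varclus with a Spearman ρ² similarity matrix (hypothetical data):

require(Hmisc)
set.seed(7)
n  <- 200
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd=.5)
x3 <- rnorm(n); x4 <- x3 + rnorm(n, sd=.5)
v <- varclus(~ x1 + x2 + x3 + x4, sim='spearman')
plot(v)   # dendrogram; height is Spearman rho^2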
Figure 4.2: Transformations of blood.pressure and heart.rate fitted using transcan. Tick marks indicate the two imputed values for blood pressure.
Figure 4.3: The lower left plot contains the raw data (Spearman ρ = 0.02); the lower right is a scatterplot of the corresponding transformed values (ρ = 0.13). Data courtesy of the SUPPORT study [93].
Construct X2 = the number of positives. Test whether the original variables add to X1 or X2.
4.7.6
Data reduction goals and methods:

- Variable clustering: subject matter knowledge; group predictors to maximize the proportion of variance explained by PC1 of each group; hierarchical clustering using a matrix of similarity measures between predictors
- Transform predictors: allows predictors to be optimally combined; canonical variates on the total set of predictors; use in a customized model for imputing missing values on each predictor
- Score a group of predictors: PC1; simple point scores
- Multiple dimensional scoring of all predictors: principal components 1, 2, . . . , k, k < p, computed from all transformed predictors
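A minimal sketch of scoring a predictor group by its first principal component (simulated data):

set.seed(8)
g <- matrix(rnorm(100*3), 100, 3)
g[,2] <- g[,1] + rnorm(100, sd=.5)      # correlated group members
pc1 <- prcomp(g, scale.=TRUE)$x[, 1]    # one d.f. summarizing the group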
4.8
4.10
Level playing field (independent datasets, same no. candidate d.f., careful bootstrapping)
Criteria:
1. calibration
2. discrimination
3. face validity
4. measurement errors in required predictors
5. use of continuous predictors (which are usually better defined than categorical ones)
6. omission of insignificant variables that nonetheless make
sense as risk factors
7. simplicity (though this is less important with the availability of computers)
8. lack of fit for specific types of subjects
Goal is to rank-order: ignore calibration
Otherwise, dismiss a model having poor calibration
Good calibration: then compare discrimination (e.g., R² [120], model χ², Somers' Dxy, Spearman's ρ, area under the ROC curve)
4.11
Global Strategies
Use a method known not to work well (e.g., stepwise variable selection without penalization; recursive partitioning),
document how poorly the model performs (e.g. using the
bootstrap), and use the model anyway
Develop a black box model that performs poorly and is difficult to interpret (e.g., does not incorporate penalization)
4.12
4-45
4-46
17. Shrink parameter estimates if there is overfitting but no further data reduction is desired (unless shrinkage is built into the estimation)
18. When missing values were imputed, adjust the final variance-covariance matrix for imputation. Do this as early as possible because it will affect other findings.
19. When all steps of the modeling strategy can be automated, consider using Faraway's method [54] to penalize for the randomness inherent in the multiple steps.
20. Develop simplifications to the final model as needed.

4.12.2

1. Less need for parsimony; even less need to remove insignificant variables from the model (otherwise CLs too narrow)
2. Careful consideration of interactions; inclusion forces estimates to be conditional and raises variances
3. If the variable of interest is mostly the one that is missing, multiple imputation is less valuable
4. Complexity of the main variable specified by prior beliefs; compromise between variance and bias
5. Don't penalize terms for the variable of interest
6. Model validation less necessary
4.12.3
Chapter 5
Interpreting Effects
For interacting factors: subtract estimates and anti-log, e.g., to get inter-quartile-range odds or hazard ratios; base confidence limits on the s.e. of the difference. See Figure 10.10.
gIndex
5.2
The Bootstrap
If we knew the population model, we could use simulation or analytic derivations to study the behavior of a statistical estimator.

Suppose Y has cumulative distribution function F(y) = Prob{Y ≤ y}, and we have a sample of size n from F(y): Y1, Y2, . . . , Yn.

Steps:
1. Repeatedly simulate samples of size n from F
2. Compute the statistic of interest
3. Study its behavior over B repetitions

Example: 1000 samples, 1000 sample medians, compute their sample variance.

F unknown: estimate it by the empirical distribution function

  Fn(y) = (1/n) Σ_{i=1}^{n} [Yi ≤ y].

Example: a sample of size n = 30 from a normal distribution with mean 100 and SD 10:
set.seed(6)
x  <- rnorm(30, 100, 20)
xs <- seq(50, 150, length=150)
Figure 5.1: Empirical and population cumulative distribution function
5-8
Efrons bootstrap general-purpose technique for estimating properties of estimators without assuming or knowing
distribution of data F
Take B samples of size n with replacement, choose B so
that summary measure of individual statistics summary if
B=
Bootstrap based on distribution of observed differences between a resampled parameter estimate and the original estimate telling us about the distribution of unobservable differences between the original estimate and the unknown parameter
Example: Data (1, 5, 6, 7, 8, 9), obtain 0.80 confidence interval
for population median, and estimate of population expected
value of sample median (only to estimate the bias in the original
estimate of the median).
options(digits=3)
y <- c(2,5,6,7,8,9,10,11,12,13,14,19,20,21)
y <- c(1,5,6,7,8,9)
set.seed(17)
n   <- length(y)
n2  <- n/2
n21 <- n2 + 1
B   <- 400
M   <- double(B)
plot(0, 0, xlim=c(0, B), ylim=c(3, 9),
     xlab="Bootstrap Samples Used",
     ylab="Mean and 0.1, 0.9 Quantiles", type="n")
for(i in 1:B) {
  s <- sample(1:n, n, replace=TRUE)
  x <- sort(y[s])
  m <- .5*(x[n2] + x[n21])
  M[i] <- m
  if(i <= 20) {
    w <- as.character(x)
    cat(w, "&&", sprintf("%.1f", m),
        if(i < 20) "\\\\\n" else "\\\\ \\hline\n")
  }
  # plot running mean and 0.1, 0.9 quantiles (assumed reconstruction
  # of plotting lines truncated in extraction)
  points(i, mean(M[1:i]), pch='.')
  if(i >= 10) points(c(i, i), quantile(M[1:i], c(.1, .9)), pch='.')
}
Frequency distribution of the 400 bootstrap medians:

  Median     3  3.5  4  4.5   5  5.5   6  6.5   7  7.5   8  8.5  9
  Frequency 10   7   8   2   23  43   75  59   66  47   42  11   1

hist(M, nclass=length(unique(M)), xlab="", main="")
First 20 samples:

  Bootstrap Sample   Sample Median
  1 6 6 7 8 9        6.5
  1 5 5 5 6 8        5.0
  5 7 8 9 9 9        8.5
  7 7 7 8 8 9        7.5
  1 5 7 7 9 9        7.0
  1 5 6 6 7 8        6.0
  7 8 8 8 8 8        8.0
  5 5 5 7 9 9        6.0
  1 5 5 7 7 9        6.0
  1 5 5 7 7 8        6.0
  1 1 5 5 7 7        5.0
  1 1 5 5 7 8        5.0
  1 5 5 7 7 8        6.0
  1 5 6 7 8 8        6.5
  1 5 6 7 9 9        6.5
  6 6 7 7 8 9        7.0
  1 5 7 8 8 9        7.5
  6 6 8 9 9 9        8.5
  1 1 5 5 6 9        5.0
  1 6 8 9 9 9        8.5
5.3
Model Validation
5.3.1
Introduction
in many cases it is better to combine data and include country or calendar time as a predictor.
1. Attempting several validations (internal or external) and reporting only the one that worked
2. Reporting apparent performance on the training dataset (no
validation)
3. Reporting predictive accuracy on an undersized independent
test sample
4. Internal validation using data splitting where at least one of
the training and test samples is not huge and the investigator
is not aware of the arbitrariness of variable selection done
on a single sample
5. Strong internal validation using 100 repeats of 10-fold crossvalidation or several hundred bootstrap resamples, repeating
all analysis steps involving Y afresh at each re-sample and
the arbitrariness of selected important variables is reported
(if variable selection is used)
6. External validation on a large test sample, done by the original research team
7. Re-analysis by an independent research team using strong
internal validation of the original dataset
8. External validation using new test data, done by an independent research team
9. External validation using new test data generated using different instruments/technology, done by an independent research team
Some points to consider:
But when the model was fitted using machine learning, feature screening, variable selection, or model selection, the
model developed using training data is usually only an example of a model, and the test sample validation could be
called an example validation
When resampling is used to repeat all modeling steps for
each resample, rigorous internal validation tests the process
used to develop the model and happens to also provide a
high-precision estimate of the likely future performance of
the final model developed using that process, properly penalizing for model uncertainty.
Resampling also reveals the volatility of the model selection
process
See BBR 10.11
5.3.2

Example: apparent R²_adj = 0.79 in the training data vs. about 0.7 achieved in new data.

5.3.3 Data-Splitting
5.3.4

Example of cross-validation:
1. Split the data at random into 10 tenths
2. Leave out 1/10 of the data at a time
3. Develop the model on the remaining 9/10, including any variable selection, pre-testing, etc.
4. Freeze the coefficients; evaluate on the 1/10 held out
5. Average R² over the 10 repetitions

Drawbacks (a minimal sketch using rms follows this list):
1. Choice of number of groups and repetitions
2. Doesn't show the full variability of variable selection
3. Does not validate the full model
4. Lower precision than the bootstrap
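A minimal sketch: rms's validate function can carry out this procedure directly (simulated data; B is the number of folds for method='crossvalidation'):

require(rms)
set.seed(9)
x <- rnorm(100); y <- x + rnorm(100)
f <- ols(y ~ x, x=TRUE, y=TRUE)   # keep design matrix and response
validate(f, method='crossvalidation', B=10)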
Randomization method:
1. Randomly permute Y
2. Optimism = performance of the fitted model compared to what is expected by chance
5.3.5
Suppose the final model has apparent R² (R²_app) = 0.4. How inflated is R²_app? The bias-corrected estimate is

  corrected R² = R²_app - optimism.

Example: see p. ??. This estimates the unconditional (not conditional on X) distribution of R², etc. [54, p. 217].
See [56] for warnings about the bootstrap, and [49] for variations
                   Apparent Rank Correlation   Over-      Bias-Corrected
                   of Predicted vs. Observed   Optimism   Correlation
  Full Model       0.50                        0.06       0.44
  Stepwise Model   0.47                        0.05       0.42
5.4
# Bootstrap confidence limits for the importance ranks of predictors
require(rms)
n <- 300
set.seed(1)
d <- data.frame(x1=runif(n), x2=runif(n), x3=runif(n), x4=runif(n),
                x5=runif(n), x6=runif(n), x7=runif(n), x8=runif(n),
                x9=runif(n), x10=runif(n), x11=runif(n), x12=runif(n))
d$y <- with(d, 1*x1 + 2*x2 + 3*x3 + 4*x4 + 5*x5 + 6*x6 + 7*x7 +
               8*x8 + 9*x9 + 10*x10 + 11*x11 + 12*x12 + 9*rnorm(n))
f <- ols(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12,
         data=d)
B <- 1000
ranks <- matrix(NA, nrow=B, ncol=12)
rankvars <- function(fit)
  rank(plot(anova(fit), sort='none', pl=FALSE))
Rank <- rankvars(f)
for(i in 1:B) {
  j <- sample(1:n, n, TRUE)
  bootfit   <- update(f, data=d, subset=j)
  ranks[i,] <- rankvars(bootfit)
}
lim <- t(apply(ranks, 2, quantile, probs=c(.025, .975)))
predictor <- factor(names(Rank), names(Rank))
w <- data.frame(predictor, Rank, lower=lim[,1], upper=lim[,2])
require(ggplot2)
ggplot(w, aes(x=predictor, y=Rank)) + geom_point() + coord_flip() +
  geom_errorbar(aes(ymin=lower, ymax=upper), width=0)  # assumed completion
                                                       # of truncated line
Figure 5.3: Bootstrap percentile 0.95 confidence limits for ranks of predictors (x1-x12) in an OLS model. Ranking is on the basis of partial χ² minus d.f. Point estimates are the original ranks.
5.5
5.6
Chapter 6
R Software
R allows interaction spline functions, a wide variety of predictor parameterizations, a wide variety of models, a unifying model formula language, and model validation by resampling.

R is comprehensive:
- Easy to write R functions for new models, so a wide variety of modern regression models is implemented (trees, nonparametric, ACE, AVAS, survival models for multiple events)
- Designs can be generated for any model; all handle categorical variables, interactions, nonlinear expansions
- Single R objects (e.g., the fit object) can be self-documenting: automatic hypothesis tests, predictions for new data
- Superior graphics
- Classes and generic functions
6.1

# Dummy variables are generated automatically if race is an
# R factor (classification) variable
y ~ sex + poly(age, 2)        # poly generates orthogonal polynomials

race.sex <- interaction(race, sex)
y ~ age + race.sex            # for when you want dummy variables for
                              # all combinations of the factors
6.2 User-Contributed Functions

- R is a high-level object-oriented language
- R runs on UNIX, Linux, Mac, and Windows
- Multitude of user-contributed functions freely available
- International community of users

Some R functions (see Venables and Ripley):
- Hierarchical clustering: hclust
- Principal components: princomp, prcomp
- Canonical correlation: cancor
- Nonparametric transform-both-sides additive models: ace, avas
6.3

  Function  Purpose                                            Related R Functions
  ols       Ordinary least squares linear model                lm
  lrm       Binary and ordinal logistic regression model;      glm
            has options for penalized MLE
  orm       Ordinal semi-parametric regression model for       polr, lrm
            continuous Y and several link functions
  psm       Accelerated failure time parametric survival       survreg
            models
  cph       Cox proportional hazards regression                coxph
  bj        Buckley-James censored least squares model         survreg, lm
  Glm       rms version of glm                                 glm
  Gls       rms version of gls                                 gls (nlme package)
  Rq        rms version of rq                                  rq (quantreg package)
Table 6.2: rms Transformation Functions

  Function  Purpose                                 Related R Functions
  asis      No transformation                       I
  rcs       Restricted cubic spline                 ns
  pol       Polynomial                              poly
  lsp       Linear spline
  catg      Categorical predictor                   factor
  scored    Ordinal predictor                       ordered
  matrx     Keep a group of variables as a matrix   matrix
  strat     Stratification factor (for cph)         strata

Methods for rms fit objects:

  Function         Purpose                                            Related Functions
  print            Print parameters and statistics of fit
  coef             Fitted regression coefficients
  formula          Formula used in the fit
  specs            Detailed specifications of fit
  vcov             Fetch covariance matrix
  logLik           Fetch maximized log-likelihood
  AIC              Fetch AIC with option to put on chi-square basis
  lrtest           Likelihood ratio test for two nested models
  univarLR         Compute all univariable LR χ²
  robcov           Robust covariance matrix estimates
  bootcov          Bootstrap covariance matrix estimates and
                   bootstrap distributions of estimates
  pentrace         Find optimum penalty factors by tracing
                   effective AIC for a grid of penalties
  effective.df     Print effective d.f. for each type of variable
                   in model, for penalized fit or pentrace result
  summary          Summary of effects of predictors
  plot.summary     Plot continuously shaded confidence bars for
                   results of summary
  anova            Wald tests of most meaningful hypotheses
  plot.anova       Graphical depiction of anova
  contrast         General contrasts, C.L., tests
  gendata          Easily generate predictor combinations
  predict          Obtain predicted values or design matrix
  Predict          Obtain predicted values and confidence limits,
                   easily varying a subset of predictors with
                   others set at default values
  plot.Predict     Plot the result of Predict using lattice
  ggplot.Predict   Plot the result of Predict using ggplot2
  fastbw           Fast backward step-down variable selection        step
  residuals        (or resid) Residuals, influence stats from fit
  sensuc           Sensitivity analysis for unmeasured confounder
  which.influence  Which observations are overly influential         residuals
  latex            LaTeX representation of fitted model
  Function   Purpose                                             Related Functions
  Function   R function analytic representation of Xβ            latex
             from a fitted regression model
  Hazard     R function analytic representation of a fitted
             hazard function (for psm)
  Survival   R function analytic representation of fitted
             survival function (for psm, cph)
  Quantile   R function analytic representation of fitted
             function for quantiles of survival time
             (for psm, cph)
  Mean       R function analytic representation of fitted
             function for mean survival time or for
             ordinal logistic
  nomogram   Draws a nomogram for the fitted model               latex, plot
  survest    Estimate survival probabilities (psm, cph)          survfit
  survplot   Plot survival curves (psm, cph)                     plot.survfit
  validate   Validate indexes of model fit using resampling
  val.prob   External validation of a probability model          lrm
  val.surv   External validation of a survival model             calibrate
  calibrate  Estimate calibration curve using resampling         val.prob
  vif        Variance inflation factors for fitted model
  naresid    Bring elements corresponding to missing data
             back into predictions and residuals
  naprint    Print summary of missing values
  impute     Impute missing values                               aregImpute
Example:
- treat: categorical variable with levels "a", "b", "c"
- num.diseases: ordinal variable, 0-4
- age: continuous; restricted cubic spline
- cholesterol: continuous (3 missings; use median); log(cholesterol + 10)
# Could have used: ddist <- datadist(data.frame.name)
ddist <- datadist(treat, num.diseases, age, cholesterol)
options(datadist='ddist')

# Fit reconstructed from the variable list above; the treatment by
# cholesterol interaction is assumed from the summaries below
fit <- lrm(y ~ treat + scored(num.diseases) + rcs(age) +
             log(cholesterol + 10) + treat:log(cholesterol + 10))
fit <- robcov(fit)               # robust covariance matrix estimates
specs(fit)
anova(fit)
anova(fit, treat, cholesterol)   # test these by themselves
plot(anova(fit))
summary(fit)
plot(summary(fit))
summary(fit, treat='b', age=60)  # estimate and test treatment (b-a) effect
                                 # averaged over cholesterols
summary(fit, age=c(50,70))       # increase age from 50 to 70
summary(fit, age=c(50,60,70))    # increase age from 50 to 70, adjust to
                                 # 60 when estimating effects of other factors
# If datadist had not been defined, would have to define ranges for all var.

p <- Predict(fit, age=seq(20,80,length=100), treat, conf.int=FALSE)
plot(p)      # plot relationship between age and log odds,
             # separate curve for each treat, no C.I.
ggplot(p)    # same but 2 panels

# 3-dimensional perspective plot for age, cholesterol, and log odds,
# using default ranges for both variables, e.g.:
bplot(Predict(fit, age, cholesterol), lfun=wireframe)

# Again, if no datadist were defined, would have to tell plot all limits
logit <- predict(fit, expand.grid(treat='b', num.dis=1:3, age=c(20,40,60),
                                  cholesterol=seq(100,300,length=10)))
# Could also obtain list of predictor settings interactively
ag <- 10:80
logit <- predict(fit, expand.grid(treat='a', num.dis=0, age=ag,
                 cholesterol=median(cholesterol)), type='terms')[, 'age']
# Note: if age interacted with anything, all other terms would be
# ignored here; Predict gives the total effect of age
plot(ag^.5, logit)    # try square root vs. the fitted spline transform.
plot(ag^1.5, logit)   # try 1.5 power

latex(fit)            # invokes latex.lrm, creates fit.tex

# Draw nomogram for the fitted model
plot(nomogram(fit))

# Compose R function to evaluate linear predictors analytically
g <- Function(fit)
g(treat='b', cholesterol=260, age=50)
# letting num.diseases default to reference value

# For automatic ranges later, add age.tertile to datadist input
6.4 Other Functions

- supsmu: Friedman's "super smoother"
- lowess: Cleveland's scatterplot smoother
- glm: generalized linear models (see Glm)
- gam: generalized additive models
- rpart: like original CART with surrogate splits for missings, censored data extension (Atkinson & Therneau)
- validate.rpart: in rms; validates recursive partitioning with respect to certain accuracy indexes
- loess: multi-dimensional scatterplot smoother

f <- loess(y ~ age * pressure)
plot(f)                               # cross-sectional plots
ages      <- seq(20, 70, length=40)
pressures <- seq(80, 200, length=40)
pred <- predict(f, expand.grid(age=ages, pressure=pressures))
persp(ages, pressures, pred)          # 3-d plot
Chapter 7
7.1 Notation
- N subjects
- Subject i (i = 1, 2, . . . , N) has nᵢ responses measured at times tᵢ₁, tᵢ₂, . . . , tᵢ,nᵢ
- Response at time t for subject i: Yᵢₜ
- Subject i has baseline covariates Xᵢ
- Generally the response measured at time tᵢ₁ = 0 is a covariate in Xᵢ instead of being the first measured response Yᵢ₀
- Time trend in response is modeled with k parameters so that the time main effect has k d.f.
- Let the basis functions modeling the time effect be g₁(t), g₂(t), . . . , g_k(t)
7.2
- k dummy variables for k + 1 unique times (assumes no functional form for time but may spend many d.f.)
- k = 1 for a linear time trend, g₁(t) = t
- k-order polynomial in t
- (k + 1)-knot restricted cubic spline (one linear term, k - 1 nonlinear terms)

A sketch of these parameterizations as model formulas follows.
7.2.2
the estimated treatment effect. Additionally, any other prognostic factors correlated with the outcome variable will
also be correlated with the baseline value of that outcome, and this has two important consequences. First, this
greatly reduces the need to enter a large number of prognostic factors as covariates in the linear models. Their
effect is already mediated through the baseline value of the outcome variable. Secondly, any imbalances across the
treatment arms in important prognostic factors will induce an imbalance across the treatment arms in the baseline
value of the outcome. Including the baseline value thereby reduces the need to enter these variables as covariates
in the linear models.
The final reason that baseline cannot be modeled as the response at time zero is that many studies have inclusion/exclusion criteria that include cutoffs on the baseline variable. In other words, the baseline measurement comes from a truncated distribution. In general it is not appropriate to model the baseline with the same distributional shape as the follow-up measurements. Thus the approaches recommended by Liang and Zeger [107] and Liu et al. [110] are problematic^a.

^a In addition to this, one of the papers' conclusions, that analysis of covariance is not appropriate if the population means of the baseline variable are not identical in the treatment groups, is not correct [144]. See [92] for a rebuke of [110].
7.3
We will assume that Yᵢₜ|Xᵢ has a multivariate normal distribution with mean given above and with variance-covariance matrix Vᵢ, an nᵢ × nᵢ matrix that is a function of tᵢ₁, . . . , tᵢ,nᵢ.
Potential assumptions and drawbacks considered when comparing repeated measures ANOVA, GEE, mixed effects models, GLS, LOCF, and summary statistics^c (the original table's check marks were lost, so the criteria are listed here)^ab:

- Assumes normality
- Assumes independence of measurements within subject^d
- Assumes a correlation structure^f
- Requires the same measurement times for all subjects
- Does not allow smooth modeling of time to save d.f.
- Does not allow adjustment for baseline covariates
- Does not easily extend to non-continuous Y
- Loses information by not using intermediate measurements^h
- Does not allow widely varying numbers of observations per subject
- Does not allow for subjects to have distinct trajectories^k
- Assumes subject-specific effects are Gaussian
- Badly biased if non-random dropouts
- Biased in general
- Harder to get tests & CLs^l
- Requires large numbers of subjects/clusters
- SEs are wrong^m
- Assumptions are not verifiable in small samples^n
- Does not extend to complex settings such as time-dependent covariates and dynamic^o models

^a Thanks to Charles Berry, Brian Cade, Peter Flom, Bert Gunter, and Leena Choi for valuable input.
^b GEE: generalized estimating equations; GLS: generalized least squares; LOCF: last observation carried forward.
^c E.g., compute within-subject slope, mean, or area under the curve over time. Assumes that the summary measure is an adequate summary of the time profile and assesses the relevant treatment effect.
^d Unless one uses the Huynh-Feldt or Greenhouse-Geisser correction.
^e For full efficiency, if using the working independence model.
^f Or requires the user to specify one.
^g For full efficiency of regression coefficient estimates.
^h Unless the last observation is missing.
^i The cluster sandwich variance estimator used to estimate SEs in GEE does not perform well in this situation, and neither does the working independence model because it does not weight subjects properly.
^j Unless one knows how to properly do a weighted analysis.
^k Or uses population averages.
^l Unlike GLS, does not use standard maximum likelihood methods yielding simple likelihood ratio χ² statistics. Requires high-dimensional integration to marginalize random effects, using complex approximations, and if using SAS, unintuitive d.f. for the various tests.
^m Because there is no correct formula for SE of effects; ordinary SEs are not penalized for imputation and are too small.
^n If correction not applied.
^o E.g., a model with a predictor that is a lagged value of the response variable.
Gardiner et al. [59] compared several longitudinal data models, especially with regard to assumptions and how regression coefficients are estimated. Peters et al. [130] have an empirical study confirming that the "use all available data" approach of likelihood-based longitudinal models makes imputation of follow-up measurements unnecessary.
7.4
7.5
Correlation structures available in the nlme package: corCompSymm, corCAR1, corExp, corGaus, corLin, corRatio, corSpher. Each defines the correlation between two measurements as a function of the distance d between their measurement times t1 and t2 (equal to 1 if t1 = t2) [147].
7.6
7.7
R Software
7.8
Case Study
Consider the dataset in Table 6.9 of Davis [42, pp. 161-163] from
a multicenter, randomized controlled trial of botulinum toxin
type B (BotB) in patients with cervical dystonia from nine U.S.
sites.
Randomized to placebo (N = 36), 5000 units of BotB (N =
36), 10,000 units of BotB (N = 37)
Response variable: total score on Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS), measuring severity, pain, and disability of cervical dystonia (high scores mean
more impairment)
TWSTRS measured at baseline (week 0) and weeks 2, 4, 8,
12, 16 after treatment began
Dataset cdystonia from web site
7.8.1
# Construct unique subject ID
# Tabulate patterns of subjects' time points
Frequencies of observed measurement-week patterns:

  Weeks observed     Frequency
  0                  1
  0 2 4              1
  0 2 4 8            1
  0 2 4 8 12         1
  0 2 4 8 12 16      94
  0 2 4 8 16         1
  0 2 4 12 16        3
  0 2 8 12 16        2
  0 4 8 12 16        4
  0 4 8 16           1

# Plot raw data, superposing subjects
Figure 7.1: Time profiles (TWSTRS-total score vs. week) for individual subjects, stratified by study site and dose (Placebo, 5000U, 10000U)
# Show quartiles
require(data.table)
cdystonia <- data.table(cdystonia)
cdys <- cdystonia[, j=as.list(quantile(twstrs, (1:3)/4)),
                  by=list(treat, week)]
cdys <- upData(cdys, rename=c('25%'='Q1', '50%'='Q2', '75%'='Q3'),
               print=FALSE)
ggplot(cdys, aes(x=week, y=Q2)) + xl + yl + ylim(0, 70) +
  geom_line() + facet_wrap(~ treat, nrow=2) +
  geom_ribbon(aes(ymin=Q1, ymax=Q3), alpha=.2)  # assumed completion; Fig. 7.2
Figure 7.2: Quartiles of TWSTRS-total score stratified by dose
# Show means with bootstrap nonparametric CLs (Fig. 7.3)
Figure 7.3: Mean responses and nonparametric bootstrap 0.95 confidence limits for population means, stratified by dose
7.8.2
z[[1]]; z[[2]]; z[[3]]; z[[4]]; z[[5]]; z[[6]]

  Model  df  AIC       BIC       logLik
  1      20  3553.906  3638.357  -1756.953
  2      20  3553.906  3638.357  -1756.953
  3      20  3587.974  3672.426  -1773.987
  4      20  3575.079  3659.531  -1767.540
  5      20  3621.081  3705.532  -1790.540
  6      20  3570.958  3655.409  -1765.479

The first two models are tied for the best. For the remainder of the analysis use corCAR1, using Gls.
a <- Gls(twstrs ~ treat * rcs(week, 3) + rcs(twstrs0, 3) +
           rcs(age, 4) * sex, data=both,
         correlation=corCAR1(form = ~ week | uid))
print(a, latex=TRUE)
                          Coef      S.E.     t      Pr(>|t|)
  Intercept               -0.3093  11.8804  -0.03   0.9792
  treat=5000U              0.4344   2.5962   0.17   0.8672
  treat=Placebo            7.1433   2.6133   2.73   0.0065
  week                     0.2879   0.2973   0.97   0.3334
  week'                    0.7313   0.3078   2.38   0.0179
  twstrs0                  0.8071   0.1449   5.57   <0.0001
  twstrs0'                 0.2129   0.1795   1.19   0.2360
  age                     -0.1178   0.2346  -0.50   0.6158
  age'                     0.6968   0.6484   1.07   0.2830
  age''                   -3.4018   2.5599  -1.33   0.1845
  sex=M                   24.2802  18.6208   1.30   0.1929
  treat=5000U * week       0.0745   0.4221   0.18   0.8599
  treat=Placebo * week    -0.1256   0.4243  -0.30   0.7674
  treat=5000U * week'     -0.4389   0.4363  -1.01   0.3149
  treat=Placebo * week'   -0.6459   0.4381  -1.47   0.1411
  age * sex=M             -0.5846   0.4447  -1.31   0.1892
  age' * sex=M             1.4652   1.2388   1.18   0.2375
  age'' * sex=M           -4.0338   4.8123  -0.84   0.4023
The estimated correlation between measurements taken one week apart on the same subject is 0.8672. The estimated correlation for measurements 10 weeks apart is 0.8672^10 = 0.24.
v <- Variogram(a, form = ~ week | uid)
plot(v)   # Figure 7.4
Figure 7.4: Variogram (semivariogram vs. distance), with the assumed correlation pattern superimposed
Figure 7.5: Three residual plots to check for absence of trends in central tendency and in variability. The upper right panel shows the baseline score on the x-axis. The bottom left panel shows the mean ± 2 SD. The bottom right panel is the Q-Q plot for checking normality of residuals from the GLS fit.
Figure 7.6: Results of anova.rms (Wald χ² minus d.f. for treat, week, twstrs0, age, sex, and their interactions) from the generalized least squares fit with continuous-time AR1 correlation structure
Figure 7.7: Estimated effects of time (week), baseline TWSTRS, age, and sex
                        Low  High  Δ    Effect     S.E.     Lower 0.95  Upper 0.95
  week                  4    12    8     6.69100   1.10570   4.5238      8.8582
  twstrs0               39   53    14   13.55100   0.88618  11.8140     15.2880
  age                   46   65    19    2.50270   2.05140  -1.5179      6.5234
  treat 5000U:10000U    1    2          0.59167   1.99830  -3.3249      4.5083
  treat Placebo:10000U  1    3          5.49300   2.00430   1.5647      9.4212
  sex M:F               1    2         -1.08500   1.77860  -4.5711      2.4011
# To get results for a different reference group, use e.g.
#   summary(a, week=4, treat='Placebo')

# Compare low dose with placebo, separately at each time
# Compare high dose with placebo
Figure 7.8: Contrasts and 0.95 confidence limits from the GLS fit, by week
Figure 7.9: Nomogram from the GLS fit. The second axis is the baseline score.
Chapter 8
Data source: The Titanic Passenger List edited by Michael A. Findlay, originally published in Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, and expanded with the help of the Internet community. The original html files were obtained from Philip Hind (1999) (http://atschool.eduweb.co.uk/phind). The dataset was compiled and interpreted by Thomas Cason. It is available in R and spreadsheet formats from biostat.mc.vanderbilt.edu/DataSets under the name titanic3.
8.1
Descriptive Statistics
require(rms)
getHdata(titanic3)   # get dataset from web site
# List of names of variables to analyze (assumed selection: the six
# variables summarized below)
v  <- c('pclass', 'survived', 'age', 'sex', 'sibsp', 'parch')
t3 <- titanic3[, v]
describe(t3)
t3: 1309 Observations, 6 Variables

  pclass    n 1309, missing 0, unique 3
  survived  n 1309, missing 0, unique 2, Info 0.71, Sum 500, Mean 0.382
  age       n 1046, missing 263, unique 98, Info 1, Mean 29.88
            Quantiles: .05 5, .10 14, .25 21, .50 28, .75 39, .90 50, .95 57
  sex       n 1309, missing 0, unique 2
  sibsp     missing 0, unique 7, Info 0.67, Mean 0.4989
            Value       0    1   2   3   4  5  8
            Frequency  891  319  42  20  22  6  9
            %           68   24   3   2   2  0  1
  parch     missing 0, unique 8, Info 0.55, Mean 0.385
            Value        0    1    2   3  4  5  6  9
            Frequency  1002  170  113   8  6  6  2  2
            %            77   13    9   1  0  0  0  0
dd <- datadist(t3)
# describe distributions of variables to rms
options(datadist='dd')
s <- summary(survived ~ age + sex + pclass +
             cut2(sibsp, 0:3) + cut2(parch, 0:3), data=t3)
plot(s, main='', subtitles=FALSE)   # Figure 8.1
Show 4-way relationships after collapsing levels. Suppress estimates based on < 25 passengers.
Figure 8.1: Proportion surviving by levels of age (quartile groups plus missing), sex, passenger class, sibsp (0, 1, 2, [3,8]), and parch (0, 1, 2, [3,9]); overall N = 1309
tn <- transform(t3,
  agec  = ifelse(age < 21, 'child', 'adult'),
  sibsp = ifelse(sibsp == 0, 'no sib/sp', 'sib/sp'),
  parch = ifelse(parch == 0, 'no par/child', 'par/child'))
g <- function(y) if(length(y) < 25) NA else mean(y)
s <- with(tn, summarize(survived,
                        llist(agec, sex, pclass, sibsp, parch), g))
# llist, summarize are in the Hmisc package
# Figure 8.2:
ggplot(subset(s, agec != 'NA'),
       aes(x=survived, y=pclass, shape=sex)) +
  geom_point() + facet_grid(agec ~ sibsp * parch) +
  xlab('Proportion Surviving') + ylab('Passenger Class') +
  scale_x_continuous(breaks=c(0, .5, 1))
Figure 8.2: Multi-way summary of Titanic survival
8.2
Figure 8.3: Nonparametric regression (loess) estimates of the relationship between age and the probability of surviving the Titanic, with tick marks depicting the age distribution. The top left panel shows unstratified estimates of the probability of survival. Other panels show nonparametric estimates by various stratifications (sex; passenger class; sex and class).
Figure 8.4: Relationship between age and survival stratified by the number of siblings or spouses on board (left panel: 0, 1, [2,8]) or by the number of parents or children of the passenger on board (right panel: 0, 1, [2,9]).
8.3 Binary Logistic Model with Casewise Deletion of Missing Values

Wald statistics for the saturated model:

                                                   χ²     d.f.   P
sex (Factor+Higher Order Factors)                187.15    15   <0.0001
  All Interactions                                59.74    14   <0.0001
pclass (Factor+Higher Order Factors)             100.10    20   <0.0001
  All Interactions                                46.51    18    0.0003
age (Factor+Higher Order Factors)                 56.20    32    0.0052
  All Interactions                                34.57    28    0.1826
  Nonlinear (Factor+Higher Order Factors)         28.66    24    0.2331
sibsp (Factor+Higher Order Factors)               19.67     5    0.0014
  All Interactions                                12.13     4    0.0164
parch (Factor+Higher Order Factors)                3.51     5    0.6217
  All Interactions                                 3.51     4    0.4761
sex × pclass (Factor+Higher Order Factors)        42.43    10   <0.0001
sex × age (Factor+Higher Order Factors)           15.89    12    0.1962
  Nonlinear                                       14.47     9    0.1066
  Nonlinear Interaction: f(A,B) vs. AB             4.17     3    0.2441
pclass × age (Factor+Higher Order Factors)        13.47    16    0.6385
  Nonlinear                                       12.92    12    0.3749
  Nonlinear Interaction: f(A,B) vs. AB             6.88     6    0.3324
age × sibsp (Factor+Higher Order Factors)         12.13     4    0.0164
  Nonlinear                                        1.76     3    0.6235
  Nonlinear Interaction: f(A,B) vs. AB             1.76     3    0.6235
age × parch (Factor+Higher Order Factors)          3.51     4    0.4761
  Nonlinear                                        1.80     3    0.6147
  Nonlinear Interaction: f(A,B) vs. AB             1.80     3    0.6147
sex × pclass × age (Factor+Higher Order Factors)   8.34     8    0.4006
  Nonlinear                                        7.74     6    0.2581
TOTAL NONLINEAR                                   28.66    24    0.2331
TOTAL INTERACTION                                 75.61    30   <0.0001
TOTAL NONLINEAR + INTERACTION                     79.49    33   <0.0001
TOTAL                                            241.93    39   <0.0001
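The model-fitting calls were lost in extraction. The sketch below is assumed, not verbatim from the notes; the two formulas are chosen so that their degrees of freedom match the 39-d.f. saturated Wald table above and the 26-d.f. simplified model printed next (f1 and f are the assumed object names).

# Assumed fitting calls, consistent with the two Wald ANOVA tables:
f1 <- lrm(survived ~ sex*pclass*rcs(age,5) +
          rcs(age,5)*(sibsp + parch), data=t3)    # saturated: 39 d.f.
f  <- lrm(survived ~ (sex + pclass + rcs(age,5))^2 +
          rcs(age,5)*sibsp, data=t3,
          x=TRUE, y=TRUE)                          # simplified: 26 d.f.
print(f)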
Logistic Regression Model

Frequencies of Missing Values Due to Each Variable
   age    sex pclass  sibsp
   263      0      0      0

Obs 1046 (survived=0: 619, survived=1: 427), max |∂log L/∂β| 6×10⁻¹⁰
Model Likelihood Ratio Test: LR χ² 553.87, d.f. 26, Pr(>χ²) < 0.0001
Discrimination Indexes: R² 0.555, g 2.427, gr 11.325, gp 0.365, Brier 0.130
Rank Discrim. Indexes: C 0.878, Dxy 0.756, γ 0.758, τa 0.366

                        Coef     S.E.    Wald Z  Pr(>|Z|)
Intercept               3.3075   1.8427   1.79   0.0727
sex=male               -1.1478   1.0878  -1.06   0.2914
pclass=2nd              6.7309   3.9617   1.70   0.0893
pclass=3rd             -1.6437   1.8299  -0.90   0.3691
age                     0.0886   0.1346   0.66   0.5102
age'                   -0.7410   0.6513  -1.14   0.2552
age''                   4.9264   4.0047   1.23   0.2186
age'''                 -6.6129   5.4100  -1.22   0.2216
sibsp                  -1.0446   0.3441  -3.04   0.0024
sex=male * pclass=2nd  -0.7682   0.7083  -1.08   0.2781
sex=male * pclass=3rd   2.1520   0.6214   3.46   0.0005
sex=male * age         -0.2191   0.0722  -3.04   0.0024
sex=male * age'         1.0842   0.3886   2.79   0.0053
sex=male * age''       -6.5578   2.6511  -2.47   0.0134
sex=male * age'''       8.3716   3.8532   2.17   0.0298
pclass=2nd * age       -0.5446   0.2653  -2.05   0.0401
pclass=3rd * age       -0.1634   0.1308  -1.25   0.2118
pclass=2nd * age'       1.9156   1.0189   1.88   0.0601
pclass=3rd * age'       0.8205   0.6091   1.35   0.1780
pclass=2nd * age''     -8.9545   5.5027  -1.63   0.1037
pclass=3rd * age''     -5.4276   3.6475  -1.49   0.1367
pclass=2nd * age'''     9.3926   6.9559   1.35   0.1769
pclass=3rd * age'''     7.5403   4.8519   1.55   0.1202
age * sibsp             0.0357   0.0340   1.05   0.2933
age' * sibsp           -0.0467   0.2213  -0.21   0.8330
age'' * sibsp           0.5574   1.6680   0.33   0.7382
age''' * sibsp         -1.1937   2.5711  -0.46   0.6425
Wald statistics for the simplified model f:

                                             χ²     d.f.   P
sex (Factor+Higher Order Factors)          199.42    7    <0.0001
  All Interactions                          56.14    6    <0.0001
pclass (Factor+Higher Order Factors)       108.73   12    <0.0001
  All Interactions                          42.83   10    <0.0001
age (Factor+Higher Order Factors)           47.04   20     0.0006
  All Interactions                          24.51   16     0.0789
  Nonlinear (Factor+Higher Order Factors)   22.72   15     0.0902
sibsp (Factor+Higher Order Factors)         19.95    5     0.0013
  All Interactions                          10.99    4     0.0267
sex × pclass (Factor+Higher Order Factors)  35.40    2    <0.0001
sex × age (Factor+Higher Order Factors)     10.08    4     0.0391
  Nonlinear                                  8.17    3     0.0426
  Nonlinear Interaction: f(A,B) vs. AB       8.17    3     0.0426
pclass × age (Factor+Higher Order Factors)   6.86    8     0.5516
  Nonlinear                                  6.11    6     0.4113
  Nonlinear Interaction: f(A,B) vs. AB       6.11    6     0.4113
age × sibsp (Factor+Higher Order Factors)   10.99    4     0.0267
  Nonlinear                                  1.81    3     0.6134
  Nonlinear Interaction: f(A,B) vs. AB       1.81    3     0.6134
TOTAL NONLINEAR                             22.72   15     0.0902
TOTAL INTERACTION                           67.58   18    <0.0001
TOTAL NONLINEAR + INTERACTION               70.68   21    <0.0001
TOTAL                                      253.18   26    <0.0001
# x=TRUE, y=TRUE in the lrm call adds the raw data to the fit object
# so that the bootstrap can re-sample
set.seed(131)     # so can replicate re-samples
latex(validate(f, B=200), digits=2, size='Ssize')
Figure 8.5: Effects of predictors on probability of survival of Titanic passengers, estimated for zero siblings or spouses
Figure 8.6: Effect of number of siblings and spouses on the log odds of surviving, for third class males
Index      Original  Training  Test   Optimism  Corrected   n
           Sample    Sample    Sample           Index
Dxy        0.76      0.77      0.74   0.03      0.72        200
R²         0.55      0.58      0.53   0.05      0.50        200
Intercept  0.00      0.00      0.08   0.08      0.08        200
Slope      1.00      1.00      0.87   0.13      0.87        200
Emax       0.00      0.00      0.05   0.05      0.05        200
D          0.53      0.56      0.50   0.06      0.46        200
U          0.00      0.00      0.01   0.01      0.01        200
Q          0.53      0.56      0.49   0.07      0.46        200
B          0.13      0.13      0.13   0.01      0.14        200
g          2.43      2.75      2.37   0.37      2.05        200
gp         0.37      0.37      0.35   0.02      0.35        200
cal <- calibrate(f, B=200)
plot(cal, subtitles=FALSE)      # Figure 8.7
[Figure 8.7 statistics: n=1046, mean absolute error 0.009, mean squared error 0.00012, 0.9 quantile of absolute error 0.017; curves: Apparent, Bias-corrected, Ideal; x-axis Predicted Pr{survived=1}, y-axis Actual Probability]
Figure 8.7: Bootstrap overfitting-corrected loess nonparametric calibration curve for casewise deletion model
8.4 Examining Missing Data Patterns
[Figure 8.8: upper-left panel shows the fraction of NAs (0.0 to 0.5 scale) for survived, pclass, sex, parch, fare, ticket, sibsp, age, cabin, boat, home.dest, body, and embarked; the right-panel recursive partitioning tree for is.na(age) splits on pclass (1st/2nd vs. 3rd) and then on parch ≥ 0.5, with terminal fractions of NAs 0.09167, 0.1806, and 0.3249]
Figure 8.8: Patterns of missing data. Upper left panel shows the fraction of observations missing on each predictor. Lower panel
depicts a hierarchical cluster analysis of missingness combinations. The similarity measure shown on the Y -axis is the fraction of
observations for which both variables are missing. Right panel shows the result of recursive partitioning for predicting is.na(age).
The rpart function found only strong patterns according to passenger class.
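The code that produced Figure 8.8 was lost in extraction. The sketch below is a plausible reconstruction, assumed rather than verbatim; the object names (na.patterns, who.na) and the minbucket value are assumptions:

# Assumed code behind Figure 8.8
require(rpart)
na.patterns <- naclus(titanic3)               # clustering of missingness
who.na <- rpart(is.na(age) ~ sex + pclass + survived + sibsp + parch,
                data=titanic3, minbucket=15)  # who tends to have age missing?
naplot(na.patterns, 'na per var')             # fraction of NAs per variable
plot(who.na); text(who.na)                    # tree predicting is.na(age)
plot(na.patterns)                             # dendrogram of missingness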
[Figure 8.9: proportion of is.na(age) (0.0 to 1.0 scale) by sex (female N=466, male N=843), pclass (1st 323, 2nd 277, 3rd 709), Survived (No 809, Yes 500), sibsp (0: 891, 1: 319, 2: 42, 3: 20, 4: 22, 5: 6, 8: 9), and parch (0: 1002, 1: 170, 2: 113, 3: 8, 4: 6, 5: 6, 6: 2, 9: 2); Overall N = 1309]
8-15
Intercept
sex=male
pclass=2nd
pclass=3rd
survived
sibsp
parch
sex=male * pclass=2nd
sex=male * pclass=3rd
Coef
-2.2030
0.6440
-1.0079
1.6124
-0.1806
0.0435
-0.3526
0.1347
-0.8563
2
5.61
5.58
68.43
5.58
0.98
0.35
7.92
5.58
82.90
d.f.
3
2
4
2
1
1
1
2
8
P
0.1324
0.0614
< 0.0001
0.0614
0.3232
0.5548
0.0049
0.0614
< 0.0001
S.E.
Wald Z Pr(> |Z|)
0.3641
-6.05 < 0.0001
0.3953
1.63
0.1033
0.6658
-1.51
0.1300
0.3596
4.48 < 0.0001
0.1828
-0.99
0.3232
0.0737
0.59
0.5548
0.1253
-2.81
0.0049
0.7545
0.18
0.8583
0.4214
-2.03
0.0422
8.5 Single Conditional Mean Imputation
Single conditional mean imputation of age was done with transcan.

Adjusted R² with which transcan predicts each variable:
  age 0.260   sex 0.073   pclass 0.239   sibsp 0.245   parch 0.288
(unadjusted: sex 0.076, pclass 0.242, sibsp 0.249, parch 0.291)

Imputed age values (the 263 missings): unique 24, Info 0.91, Mean 28.53
  .05 17.34   .10 21.77   .25 26.17   .50 28.10   .95 42.77

# Look at mean imputed values by sex and pclass, and compare with the
# observed means. age.i is age, filled in with conditional mean estimates.

# Fit of the same model using age.i (formula assumed to parallel f):
f.si <- lrm(survived ~ (sex + pclass + rcs(age.i,5))^2 +
            rcs(age.i,5)*sibsp, data=t3)

Rank Discrim. Indexes: C 0.861, Dxy 0.723, γ 0.728, τa 0.341
p1 <- Predict(f,    age,   pclass, sex, sibsp=0, fun=plogis)
p2 <- Predict(f.si, age.i, pclass, sex, sibsp=0, fun=plogis)
p  <- rbind('Casewise Deletion'=p1, 'Single Imputation'=p2,
            rename=c(age.i='age'))     # creates .set. variable
ggplot(p, groups='sex', ylab='Probability of Surviving')   # Figure 8.10
latex(anova(f.si), file='', label='titanic-anova.si')       # Table 8.4
Figure 8.10: Predicted probability of survival for males from fit using casewise deletion (bottom) and single conditional mean
imputation (top). sibsp is set to zero for these predicted values.
Table 8.4: Wald statistics for the single-imputation model f.si:

                                               χ²     d.f.   P
sex (Factor+Higher Order Factors)            245.39    7    <0.0001
  All Interactions                            52.85    6    <0.0001
pclass (Factor+Higher Order Factors)         112.07   12    <0.0001
  All Interactions                            36.79   10     0.0001
age.i (Factor+Higher Order Factors)           49.32   20     0.0003
  All Interactions                            25.62   16     0.0595
  Nonlinear (Factor+Higher Order Factors)     19.71   15     0.1835
sibsp (Factor+Higher Order Factors)           22.02    5     0.0005
  All Interactions                            12.28    4     0.0154
sex × pclass (Factor+Higher Order Factors)    30.29    2    <0.0001
sex × age.i (Factor+Higher Order Factors)      8.91    4     0.0633
  Nonlinear                                    5.62    3     0.1319
  Nonlinear Interaction: f(A,B) vs. AB         5.62    3     0.1319
pclass × age.i (Factor+Higher Order Factors)   6.05    8     0.6421
  Nonlinear                                    5.44    6     0.4888
  Nonlinear Interaction: f(A,B) vs. AB         5.44    6     0.4888
age.i × sibsp (Factor+Higher Order Factors)   12.28    4     0.0154
  Nonlinear                                    2.05    3     0.5614
  Nonlinear Interaction: f(A,B) vs. AB         2.05    3     0.5614
TOTAL NONLINEAR                               19.71   15     0.1835
TOTAL INTERACTION                             67.00   18    <0.0001
TOTAL NONLINEAR + INTERACTION                 69.53   21    <0.0001
TOTAL                                        305.74   26    <0.0001
8.6 Multiple Imputation

The following uses aregImpute with predictive mean matching. By default, aregImpute does not transform age when it is being predicted from the other variables. Four knots are used to transform age when it is used to impute other variables (not needed here as no other missings were present). Since the fraction of observations with missing age is 263/1309 = 0.2, we use 20 imputations.
set.seed(17)    # so can reproduce random aspects
mi <- aregImpute(~ age + sex + pclass +
                 sibsp + parch + survived,
                 data=t3, n.impute=20, nk=4, pr=FALSE)
mi
Multiple Imputation using Bootstrap and PMM

n: 1309   p: 6   Imputations: 20   nk: 4

Number of NAs:
   age    sex pclass  sibsp  parch survived
   263      0      0      0      0        0

          type  d.f.
age          s     1
sex          c     1
pclass       c     2
sibsp        s     2
parch        s     2
survived     l     1
# Print the first 10 imputations for the first 10 passengers
# having missing age (indexing call assumed):
mi$imputed$age[1:10, 1:10]
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
16    40   49   24   29 60.0   58   64   36   50    61
38    33   45   40   49 80.0    2   38   38   36    53
41    29   24   19   31 40.0   60   64   42   30    65
47    40   42   29   48 36.0   46   64   30   38    42
60    52   40   22   31 38.0   22   19   24   40    33
70    16   14   23   23 18.0   24   19   27   59    23
71    30   62   57   30 42.0   31   64   40   40    63
75    43   23   36   61 45.5   58   64   27   24    50
81    44   57   47   31 45.0   30   64   62   39    67
107   52   18   24   62 32.5   38   64   47   19    23
plot(mi)
Ecdf(t3$age, add=TRUE, col='gray', lwd=2,
     subtitles=FALSE)     # Fig. 8.11
Figure 8.11: Distributions of imputed and actual ages for the Titanic dataset. Imputed values are in black and actual ages in gray.
Fit logistic models for the 20 completed datasets and print the ratio of imputation-corrected variances to average ordinary variances:

f.mi <- fit.mult.impute(
          survived ~ (sex + pclass + rcs(age,5))^2 +
                     rcs(age,5)*sibsp,
          lrm, mi, data=t3, pr=FALSE)
latex(anova(f.mi), file='', label='titanic-anova.mi',
      size='small')      # Table 8.5
Table 8.5: Wald statistics for the multiple-imputation model f.mi:

                                             χ²     d.f.   P
sex (Factor+Higher Order Factors)          240.42    7    <0.0001
  All Interactions                          54.56    6    <0.0001
pclass (Factor+Higher Order Factors)       114.21   12    <0.0001
  All Interactions                          36.43   10     0.0001
age (Factor+Higher Order Factors)           50.37   20     0.0002
  All Interactions                          25.88   16     0.0557
  Nonlinear (Factor+Higher Order Factors)   24.21   15     0.0616
sibsp (Factor+Higher Order Factors)         24.22    5     0.0002
  All Interactions                          12.86    4     0.0120
sex × pclass (Factor+Higher Order Factors)  30.99    2    <0.0001
sex × age (Factor+Higher Order Factors)     11.38    4     0.0226
  Nonlinear                                  8.15    3     0.0430
  Nonlinear Interaction: f(A,B) vs. AB       8.15    3     0.0430
pclass × age (Factor+Higher Order Factors)   5.30    8     0.7246
  Nonlinear                                  4.63    6     0.5918
  Nonlinear Interaction: f(A,B) vs. AB       4.63    6     0.5918
age × sibsp (Factor+Higher Order Factors)   12.86    4     0.0120
  Nonlinear                                  1.84    3     0.6058
  Nonlinear Interaction: f(A,B) vs. AB       1.84    3     0.6058
TOTAL NONLINEAR                             24.21   15     0.0616
TOTAL INTERACTION                           67.12   18    <0.0001
TOTAL NONLINEAR + INTERACTION               70.99   21    <0.0001
TOTAL                                      298.78   26    <0.0001
Figure 8.12: Predicted probability of survival for males from fit using single conditional mean imputation again (top) and multiple
random draw imputation (bottom). Both sets of predictions are for sibsp=0.
8.7 Summarizing the Fitted Model
# Get predicted values for certain types of passengers;
# override default ranges for variables (exact call assumed
# from the Figure 8.13 axis labels)
s <- summary(f.mi, age=c(1, 30), sibsp=0:1)
plot(s, log=TRUE)       # Figure 8.13
[Figure 8.13: odds ratios on a 0.10 to 5.00 scale for age 30:1, sibsp 1:0, sex female:male, pclass 1st:3rd, and pclass 2nd:3rd]
We can also obtain predicted values for user-specified combinations of age, sex, and pclass. The recoverable portion of the 18-row prediction data frame (rows 1 to 12 covered ages 2, 21, and 50 under the remaining sex and pclass settings) and the fitted values for the third-class rows:

   row  age  sex     pclass  sibsp
   13     2  female  3rd     0
   14    21  female  3rd     0
   15    50  female  3rd     0
   16     2  male    3rd     0
   17    21  male    3rd     0
   18    50  male    3rd     0

Predicted survival probabilities for these six rows: 0.85, 0.57, 0.37, 0.91, 0.13, 0.06.
# Note: if sibsp is not given to pred.logit, it defaults to 0
# Normally just type the function name to see its body
# Run the newly created function
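The code that created and ran the function was lost. A hedged sketch follows; Function() is the rms generator referred to by the comments above, and the argument values are assumed to match the predictions just shown:

pred.logit <- Function(f.mi)    # R function computing the linear predictor
plogis(pred.logit(age = c(2, 21, 50), sex = 'male', pclass = '3rd'))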
R Software Used

Purpose                  Functions
Miscellaneous            summary, plsmo, naclus, llist, latex,
                         summarize, Dotplot, describe
Imputation               transcan, impute, fit.mult.impute, aregImpute
Modeling                 datadist, lrm, rcs
Model presentation       plot, summary, nomogram, Function
Model validation         validate, calibrate
Recursive partitioning   rpart (written by Atkinson and Therneau)
Chapter 9
Regression Models for Continuous Y and Case Study in Ordinal Regression

9.1
w : 4629 Observations, 18 Variables

seqn : Respondent sequence number. n 4629, missing 0, unique 4629, Info 1, Mean 56902
    .05 52136  .10 52633  .25 54284  .50 56930  .75 59495  .90 61079  .95 61641
sex : n 4629, missing 0, unique 2
age : missing 0, unique 703, Info 1, Mean 48.57
    .05 23.33  .10 26.08  .25 33.92  .50 46.83  .75 61.83  .90 74.83  .95 80.00
re : Race/Ethnicity. n 4629, missing 0, unique 5
income : missing 240, unique 14  (name assumed; matches the 14-level NHANES income coding)
wt : Weight [kg]. missing 0, unique 890, Info 1, Mean 80.49
    .05 52.44  .10 57.18  .25 66.10  .50 77.70  .75 91.40  .90 106.52  .95 118.00
ht : missing 0, unique 512, Info 1, Mean 167.5
    .05 151.1  .10 154.4  .25 160.1  .50 167.2  .75 175.0  .90 181.0  .95 184.8
bmi : missing 0, unique 1994, Info 1, Mean 28.59
    .05 20.02  .10 21.35  .25 24.12  .50 27.60  .75 31.88  .90 36.75  .95 40.68
leg : missing 155, unique 216, Info 1, Mean 38.39
    .05 32.0  .10 33.5  .25 36.0  .50 38.4  .75 41.0  .90 43.3  .95 44.6
    lowest: 20.4 24.9 25.0 25.1 26.4;  highest: 49.0 49.5 49.8 50.0 50.3
arml : missing 127, unique 156, Info 1, Mean 37.01
    .05 32.6  .10 33.5  .25 35.0  .50 37.0  .75 39.0  .90 40.6  .95 41.7
    lowest: 24.8 27.0 27.5 29.2 29.5;  highest: 45.2 45.5 45.6 46.0 47.0
armc : missing 130, unique 290, Info 1, Mean 32.87
    .05 25.4  .10 26.9  .25 29.5  .50 32.5  .75 35.8  .90 39.1  .95 41.4
    lowest: 17.9 19.0 19.3 19.5 19.9;  highest: 54.2 54.9 55.3 56.0 61.0
waist : missing 164, unique 716, Info 1, Mean 97.62
    .05 74.8  .10 78.6  .25 86.9  .50 96.3  .75 107.0  .90 117.8  .95 125.0
tri : missing 334, unique 342, Info 1, Mean 18.94
    .05 7.2  .10 8.8  .25 12.0  .50 18.0  .75 25.2  .90 31.0  .95 33.8
    lowest: 2.6 3.1 3.2 3.3 ...
sub : missing 655, unique 329, Info 1, Mean 20.8
    .05 8.60  .10 10.30  .25 14.40  .50 20.30  .75 26.58  .90 32.00  .95 35.00
    lowest: 3.8 4.2 4.6 4.8 ...
gh : Glycohemoglobin [%]. n 4629, missing 0, unique 63, Info 0.99, Mean 5.533
    .05 4.8  .10 5.0  .25 5.2  .50 5.5  .75 5.8  .90 6.0  .95 6.3
    lowest: 4.0 4.1 4.2 4.3 ...
albumin : Albumin [g/dL]. n 4576, missing 53, unique 26, Info 0.99, Mean 4.261
    .05 3.7  .10 3.9  .25 4.1  .50 4.3  .75 4.5  .90 4.7  .95 4.8
    lowest: 2.6 2.7 3.0 3.1 3.2;  highest: 4.9 5.0 5.1 5.2 5.3
bun : missing 53, unique 50, Info 0.99, Mean 13.03
    .05 7  .10 8  .25 10  .50 12  .75 15  .90 19  .95 22
    lowest: 1 3 ... 5;  highest: 49 53 55 56 63
SCr : missing 53, unique 167, Info 1, Mean 0.8887  (name assumed: serum creatinine)
    .05 0.58  .10 0.62  .25 0.72  .50 0.84  .75 0.99  .90 1.14  .95 1.25
    lowest: 0.34 0.38 ...;  highest: 5.98 6.34
9.2
# Use binary logistic fits of Prob(Y < cutoff) on the OLS-predicted
# mean (pgh, computed upstream) to get the slope for each link
r <- NULL
for (link in c('logit', 'probit', 'cloglog'))
  for (k in c(5, 5.5, 6)) {
    co <- coef(glm(gh < k ~ pgh, data=w, family=binomial(link)))
    r <- rbind(r, data.frame(link=link, cutoff=k,
                             slope=round(co[2], 2)))
  }
print(r, row.names=FALSE)
   link  cutoff  slope
  logit     5.0  -3.39
  logit     5.5  -4.33
  logit     6.0  -5.62
 probit     5.0  -1.69
 probit     5.5  -2.61
 probit     6.0  -3.07
cloglog     5.0  -3.18
cloglog     5.5  -2.97
cloglog     6.0  -2.51
[Figure 9.1 panels: qnorm(F), log(F/(1 − F)), and log(−log(F)) plotted against Glycohemoglobin, % (5.0 to 7.5)]
Figure 9.1: Examination of normality and constant variance assumption, and assumptions for various ordinal models
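The code behind Figure 9.1 was not recovered. The following is a hypothetical sketch of one way to compute such inverse-link transforms of the within-stratum empirical CDF; ageg, the break points, and the clamping are all assumptions, not the original code:

# Hypothetical sketch: inverse-link transforms of the ECDF of gh
# within age strata; roughly parallel curves support that ordinal family
w$ageg <- cut2(w$age, g=4)            # assumed stratification variable
brk <- seq(5, 7.5, by=0.25)
z <- do.call(rbind, lapply(levels(w$ageg), function(s) {
  F <- ecdf(w$gh[w$ageg == s])(brk)
  F <- pmin(pmax(F, 0.001), 0.999)    # keep transforms finite
  data.frame(ageg=s, gh=brk,
             probit  = qnorm(F),
             logit   = qlogis(F),
             loglog  = log(-log(1 - F)))
}))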
9-9
CHAPTER 9. REGRESSION MODELS FOR CONTINUOUS Y AND CASE STUDY IN ORDINAL REGRESSION 9-10
CHAPTER 9. REGRESSION MODELS FOR CONTINUOUS Y AND CASE STUDY IN ORDINAL REGRESSION 9-11
9.3 Quantile Regression

We have ruled out OLS and the semiparametric proportional odds model for this response.

Quantile regression [96, 95] is a different approach to modeling Y.
No distributional assumptions are made other than continuity of Y.
All the usual right-hand-side assumptions apply.
When there is a single predictor that is categorical, quantile regression coincides with ordinary sample quantiles stratified by that predictor.
It is transformation invariant: pre-transforming Y is not important.

Let ρτ(y) = y(τ − [y < 0]). The τth sample quantile is the minimizer q of Σⁿᵢ₌₁ ρτ(yᵢ − q). For a conditional τth quantile of Y|X, the corresponding quantile regression estimator β̂τ minimizes Σⁿᵢ₌₁ ρτ(Yᵢ − Xᵢβ).

Quantile regression is not as efficient at estimating quantiles as ordinary least squares is at estimating the mean, if the latter's assumptions hold. Koenker's quantreg package in R [97] implements quantile regression.
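A minimal sketch of fitting quantile regression through rms's Rq front-end to quantreg (the formula mirrors the models used later in this chapter; the object names are assumptions):

require(rms)
fq50 <- Rq(gh ~ rcs(age, 6), tau = 0.50, data = w)   # conditional median
fq90 <- Rq(gh ~ rcs(age, 6), tau = 0.90, data = w)   # conditional 0.9 quantile
fq50

Rq returns an rms fit object, so Predict, summary, and plotting work the same way as for orm and ols fits.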
9.4
Prob[Y ≥ y | X] = Prob[(Y − Xβ)/σ ≥ (y − Xβ)/σ]
                = 1 − Φ((y − Xβ)/σ) = Φ((−y + Xβ)/σ)
Table 9.1: Distribution families used in ordinal cumulative probability models. Φ denotes the Gaussian cumulative distribution function. For the Connection column, P1 = Prob[Y ≥ y|X1], P2 = Prob[Y ≥ y|X2], Δ = (X2 − X1)β. The connection specifies the only distributional assumption if the model is fitted semiparametrically, i.e., contains an intercept for every unique Y value less one. For parametric models, P1 must be specified absolutely instead of just requiring a relationship between P1 and P2. For example, the traditional Gaussian parametric model specifies that Prob[Y ≥ y|X] = 1 − Φ((y − Xβ)/σ) = Φ((−y + Xβ)/σ).

Distribution     CDF                Link Name               Inverse (Link Function)   Connection
Logistic         [1 + exp(−y)]⁻¹    logit                   log(y/(1 − y))            P2/(1 − P2) = [P1/(1 − P1)] exp(Δ)
Gaussian         Φ(y)               probit                  Φ⁻¹(y)                    P2 = Φ(Φ⁻¹(P1) + Δ)
Gumbel maximum   exp(−exp(−y))      log−log                 −log(−log(y))             P2 = P1^exp(−Δ)
  value
Gumbel minimum   1 − exp(−exp(y))   complementary log−log   log(−log(1 − y))          1 − P2 = (1 − P1)^exp(Δ)
  value
Cauchy           tan⁻¹(y)/π + 1/2   cauchit                 tan[π(y − 1/2)]
It is not sensible to estimate quantiles of Y when there are heavy ties in Y in the area containing the quantile.
Left panel: Φ⁻¹(F(y|X)) = (y − Xβ)/σ. Right panel: Φ⁻¹(F(y|X)) or logit(F(y|X)) as parallel curves in y for different X.

Figure 9.2: Assumptions of the linear model (left panel) and semiparametric ordinal probit or logit (proportional odds) models (right panel). Ordinal models do not assume any shape for the distribution of Y for a given X; they only assume parallelism.
9.5
# ag (age grid) and the caliper-matched sample estimates means, q2, q3,
# p90 were computed in code lost upstream
fams <- c('logistic', 'probit', 'loglog', 'cloglog')
fe   <- function(pred, target) mean(abs(pred$yhat - target))
mod  <- gh ~ rcs(age, 6)
P    <- NULL
Er   <- list()
for (est in c('q2', 'q3', 'p90', 'mean')) {
  meth <- if (est == 'mean') 'ols' else 'QR'
  p  <- list()
  er <- rep(NA, 5)
  names(er) <- c(fams, meth)
  for (family in fams) {
    h   <- orm(mod, family=family, data=w)
    fun <- if (est == 'mean') Mean(h)
    else {
      qu <- Quantile(h)
      switch(est, q2  = function(x) qu(.5,  x),
                  q3  = function(x) qu(.75, x),
                  p90 = function(x) qu(.9,  x))
    }
    p[[family]] <- z <- Predict(h, age=ag, fun=fun, conf.int=FALSE)
    er[family]  <- fe(z, switch(est, mean=means, q2=q2, q3=q3, p90=p90))
  }
  h <- switch(est,
              mean = ols(mod, data=w),
              q2   = Rq(mod, data=w),
              q3   = Rq(mod, tau=0.75, data=w),
              p90  = Rq(mod, tau=0.90, data=w))
  p[[meth]] <- z <- Predict(h, age=ag, conf.int=FALSE)
  er[meth]  <- fe(z, switch(est, mean=means, q2=q2, q3=q3, p90=p90))
  Er[[est]] <- er
  pr <- do.call(rbind, p)
  pr$est <- est
  P <- rbind.data.frame(P, pr)
}
xyplot(yhat ~ age | est, groups=.set., data=P, type='l',   # Figure 9.3
       auto.key=list(x=.75, y=.2, points=FALSE, lines=TRUE),
       panel=function(..., subscripts) {
         panel.xyplot(..., subscripts=subscripts)
         est <- P$est[subscripts[1]]
         lpoints(ag, switch(est, mean=means, q2=q2, q3=q3, p90=p90),
                 col=gray(.7))
         er <- format(round(Er[[est]], 3), nsmall=3)
         ltext(26, 6.15, paste(names(er), collapse='\n'), cex=.7, adj=0)
         ltext(40, 6.15, paste(er, collapse='\n'), cex=.7, adj=1) })
[Figure 9.3 panel annotations: mean absolute prediction errors
  q2:   logistic 0.048  probit 0.052  loglog 0.058  cloglog 0.072  QR 0.024
  q3:   logistic 0.050  probit 0.045  loglog 0.037  cloglog 0.077  QR 0.027
  mean: logistic 0.021  probit 0.025  loglog 0.026  cloglog 0.033  ols 0.013
  p90:  logistic 0.053  probit 0.047  loglog 0.041  cloglog 0.101  QR 0.030]
Figure 9.3: Three estimated quantiles and estimated mean using 6 methods, compared against caliper-matched sample quantiles/means (circles). Numbers are mean absolute differences between predicted and sample quantities using overlapping intervals
of age and caliper matching. QR:quantile regression.
[Figure 9.4 axes: Prob(Y ≥ y) (0.0 to 1.0) vs. gh (4.0 to 7.5); key gives the quantile-group intervals [4.88,5.29), [5.29,5.44), [5.44,5.56), [5.56,5.66), [5.66,5.76), [5.76,6.48]]
Figure 9.4: Observed (dashed lines, open circles) and predicted (solid lines, closed circles) exceedance probability distributions from
a model using 6-tiles of OLS-predicted HbA1c . Key shows quantile group intervals of predicted mean HbA1c .
Agreement between predicted and observed exceedance probability distributions is excellent in Figure 9.4.
To return to the initial look at a linear model with assumed
Gaussian residuals, fit a probit ordinal model and compare the
estimated intercepts to the linear relationship with gh that is
assumed by the normal distribution.
f <- orm(gh ~ rcs(age,6), family='probit', data=w)
g <- ols(gh ~ rcs(age,6), data=w)
s <- g$stats['Sigma']
yu <- f$yunique[-1]
r  <- quantile(w$gh, c(.005, .995))
alphas <- coef(f)[1:num.intercepts(f)]
plot(-yu/s, alphas, type='l', xlim=rev(-r/s),   # Fig. 9.5
     xlab=expression(-y/hat(sigma)), ylab=expression(alpha[y]))
[Figure 9.5 axes: αy vs. −y/σ̂]
Figure 9.5: Estimated intercepts from probit model
9.5.2
Examination of BMI
Using the log-log model, we first check the adequacy of BMI
as a summary of height and weight for estimating median gh.
Adjust for age (without assuming linearity) in every case
Look at ratio of coefficients of log height and log weight
Use AIC to judge whether BMI is an adequate summary of
height and weight
f <- orm(gh ~ rcs(age,5) + log(ht) + log(wt),
         family='loglog', data=w)
print(f, latex=TRUE)
         Coef     S.E.    Wald Z  Pr(>|Z|)
age       0.0398  0.0055    7.29  <0.0001
age'     -0.0158  0.0275   -0.57   0.5657
age''    -0.0072  0.0866   -0.08   0.9333
age'''    0.0309  0.1135    0.27   0.7853
ht       -3.0680  0.2789  -11.00  <0.0001
wt        1.2748  0.0704   18.10  <0.0001

Rank Discrim. Index: ρ 0.486

The ratio of the log height and log weight coefficients is −3.07/1.27 ≈ −2.4, reasonably close to the −2 implied by BMI (log BMI = log wt − 2 log ht).
aic <- NULL
for (mod in list(gh ~ rcs(age,5) + rcs(log(bmi),5),
                 gh ~ rcs(age,5) + rcs(log(ht),5) + rcs(log(wt),5),
                 gh ~ rcs(age,5) + rcs(log(ht),4) * rcs(log(wt),4)))
  aic <- c(aic, AIC(orm(mod, family='loglog', data=w)))  # loop body assumed
         Coef     S.E.    Wald Z  Pr(>|Z|)
age      -0.0392  0.0054   -7.24  <0.0001
age'      0.0148  0.0274    0.54   0.5888
age''     0.0093  0.0862    0.11   0.9144
age'''   -0.0321  0.1131   -0.28   0.7767
ht        3.0477  0.2779   10.97  <0.0001
wt       -1.2653  0.0701  -18.04  <0.0001
Back up and look at all body size measures, and examine their redundancies.

v <- varclus(~ wt + ht + bmi + leg + arml + armc + waist +
             tri + sub + age + sex + re, data=w)
plot(v)     # Figure 9.6
# Omit wt so it won't be removed before bmi
redun(~ ht + bmi + leg + arml + armc + waist + tri + sub,
      data=w, r2=.75)
Redundancy Analysis

redun(formula = ~ ht + bmi + leg + arml + armc + waist + tri +
      sub, data = w, r2 = 0.75)

n: 3853   p: 8   nk: 3
Number of NAs: 776

Frequencies of Missing Values Due to Each Variable
   ht  bmi  leg arml armc waist  tri  sub
    0    0  155  127  130   164  334  655

Type: ordinary

R2 with which each variable can be predicted from all other variables:
   ht   bmi   leg  arml  armc waist   tri   sub
0.829 0.924 0.682 0.748 0.843 0.864 0.531 0.594

Redundant variables: bmi ht
Predicted from variables: leg arml armc waist tri sub

  Variable Deleted    R2   R2 after later deletions
1 bmi               0.924  0.909
2 ht                0.792
Six size measures adequately capture the entire set. Height and BMI are removed. An advantage of removing height is that it is age-dependent due to vertebral compression in the elderly:

f <- orm(ht ~ rcs(age,4) * sex, data=w)   # Prop. odds model
qu <- Quantile(f); med <- function(x) qu(.5, x)
ggplot(Predict(f, age, sex, fun=med, conf.int=FALSE),
       ylab='Predicted Median Height, cm')
[Figure 9.6: variable clustering (Spearman ρ² similarity, 0.0 to 1.0 scale) of tri, sub, leg, ht, arml, bmi, waist, wt, armc, age, sex=female, and the re dummy variables]
Figure 9.7: Estimated median height as a smooth function of age, allowing age to interact with sex, from a proportional odds model
Competition between collinear size measures hurts interpretation of partial tests of association in a saturated additive model.
Figure 9.8: Estimated median upper leg length as a smooth function of age, allowing age to interact with sex, from a proportional
odds model
[Dot chart: Spearman ρ², Response: gh, strongest first (adjusted ρ² on a 0.00 to 0.20 axis)
   Variable    N     df
   age        4629   2
   waist      4465   2
   leg        4474   2
   sub        3974   2
   armc       4499   2
   wt         4629   2
   re         4629   4
   tri        4295   2
   arml       4502   2
   sex        4629   1]
The bootstrap will be used to penalize predictive ability for variable selection. First the full model is fit using casewise deletion, then we do a composite test to assess whether any of the frequently missing predictors is important.

f <- orm(gh ~ rcs(age,5) + sex + re + rcs(wt,3) + rcs(leg,3) + arml +
         rcs(armc,3) + rcs(waist,4) + tri + rcs(sub,3),
         family='loglog', data=w, x=TRUE, y=TRUE)
print(f, latex=TRUE, coefs=FALSE)
Frequencies of Missing Values Due to Each Variable
   gh  age  sex   re   wt  leg arml armc waist  tri  sub
    0    0    0    0    0  155  127  130   164  334  655

Obs 3853   Unique Y 60   Median Y 5.5   max |∂log L/∂β| 3×10⁻⁵
Model Likelihood Ratio Test: LR χ² 1180.13, d.f. 22, Pr(>χ²) < 0.0001
Score χ² 1298.88, Pr(>χ²) < 0.0001
Discrimination Indexes: R² 0.265, g 0.732, gr 2.080, |Pr(Y ≥ Y0.5) − 1/2| 0.172
Rank Discrim. Index: ρ 0.520
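The composite-test call itself was lost; the "## Composite test" comment and the table below are consistent with an rms anova call restricted to the frequently missing predictors, presumably of this form:

## Composite test (assumed call):
latex(anova(f, leg, arml, armc, waist, tri, sub), file='')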
                   χ²    d.f.   P
leg               8.30    2    0.0158
  Nonlinear       3.32    1    0.0685
arml              0.16    1    0.6924
armc              6.66    2    0.0358
  Nonlinear       3.29    1    0.0695
waist            29.40    3   <0.0001
  Nonlinear       4.29    2    0.1171
tri              16.62    1   <0.0001
sub              40.75    2   <0.0001
  Nonlinear       4.50    1    0.0340
TOTAL NONLINEAR  14.95    5    0.0106
TOTAL           128.29   11   <0.0001
M    <- Mean(f)
qu   <- Quantile(f)
med  <- function(x) qu(.5, x)
p90f <- function(x) qu(.9, x)
fq   <- Rq(formula(f), data=w)
fq90 <- Rq(formula(f), data=w, tau=.9)
pmean  <- Predict(f, age, fun=M,    conf.int=FALSE)
pmed   <- Predict(f, age, fun=med,  conf.int=FALSE)
p90    <- Predict(f, age, fun=p90f, conf.int=FALSE)
pmedqr <- Predict(fq,   age, conf.int=FALSE)
p90qr  <- Predict(fq90, age, conf.int=FALSE)
z <- rbind('orm mean'=pmean, 'orm median'=pmed, 'orm P90'=p90,
           'QR median'=pmedqr, 'QR P90'=p90qr)
ggplot(z, groups='.set.',
       adj.subtitle=FALSE, legend.label=FALSE)
print(fastbw(f, rule='p'), estimates=FALSE)

 Deleted  Chi-Sq  d.f.  P       Residual  d.f.  P       AIC
 arml      0.16   1     0.6924   0.16     1     0.6924  -1.84
 sex       0.45   1     0.5019   0.61     2     0.7381  -3.39
 wt        5.72   2     0.0572   6.33     4     0.1759  -1.67
 armc      3.32   2     0.1897   9.65     6     0.1400  -2.35
Figure 9.10: Estimated mean and 0.5 and 0.9 quantiles from the log-log ordinal model using casewise deletion, along with predictions
of 0.5 and 0.9 quantiles from quantile regression (QR). Age is varied and other predictors are held constant to medians/modes.
Factors in Final Model: age re leg waist tri sub

set.seed(13)    # so can reproduce results
v <- validate(f, B=100, bw=TRUE, estimates=FALSE, rule='p')
Backwards step-down deletions in the original sample:

 Deleted  Chi-Sq  d.f.  P       Residual  d.f.  P       AIC
 arml      0.16   1     0.6924   0.16     1     0.6924  -1.84
 sex       0.45   1     0.5019   0.61     2     0.7381  -3.39
 wt        5.72   2     0.0572   6.33     4     0.1759  -1.67
 armc      3.32   2     0.1897   9.65     6     0.1400  -2.35
Show the number of variables selected in the first 30 bootstrap re-samples (candidate factors: sex, re, wt, leg, arml, armc, waist, tri, sub).
Index                 Original  Training  Test    Optimism  Corrected  n
                      Sample    Sample    Sample            Index
ρ                     0.5225    0.5290    0.5208  0.0083    0.5142     100
R²                    0.2712    0.2788    0.2692  0.0095    0.2617     100
Slope                 1.0000    1.0000    0.9761  0.0239    0.9761     100
g                     1.2276    1.2505    1.2207  0.0298    1.1978     100
|Pr(Y ≥ Y0.5) − 1/2|  0.2007    0.2050    0.1987  0.0064    0.1943     100
Frequencies of Numbers of Factors Retained
   6   7   8   9  10
  19  29  46   4   1
an <- anova(g)
latex(an)

Reduced model g after multiple imputation:

                                        Coef     S.E.    Wald Z  Pr(>|Z|)
age                                     0.0404   0.0055    7.29  <0.0001
age'                                   -0.0228   0.0279   -0.82   0.4137
age''                                   0.0126   0.0876    0.14   0.8857
age'''                                  0.0424   0.1148    0.37   0.7116
re=Other Hispanic                      -0.0766   0.0597   -1.28   0.1992
re=Non-Hispanic White                  -0.4121   0.0449   -9.17  <0.0001
re=Non-Hispanic Black                   0.0645   0.0566    1.14   0.2543
re=Other Race Including Multi-Racial   -0.0555   0.0750   -0.74   0.4593
leg                                    -0.0339   0.0091   -3.73   0.0002
leg'                                    0.0153   0.0105    1.46   0.1434
waist                                   0.0073   0.0050    1.47   0.1428
waist'                                  0.0304   0.0158    1.93   0.0536
waist''                                -0.0910   0.0508   -1.79   0.0732
tri                                    -0.0163   0.0026   -6.28  <0.0001
sub                                    -0.0027   0.0097   -0.28   0.7817
sub'                                    0.0674   0.0289    2.33   0.0198
sub''                                  -0.1895   0.0922   -2.06   0.0398
                   χ²     d.f.   P
age               692.50   4    <0.0001
  Nonlinear        28.47   3    <0.0001
re                168.91   4    <0.0001
leg                24.37   2    <0.0001
  Nonlinear         2.14   1     0.1434
waist             128.31   3    <0.0001
  Nonlinear         4.05   2     0.1318
tri                39.44   1    <0.0001
sub                39.30   3    <0.0001
  Nonlinear         6.63   2     0.0363
TOTAL NONLINEAR    46.80   8    <0.0001
TOTAL            1464.24  17    <0.0001
# Add new lines to the plot showing the combined effect of the size variables
[Figure 9.11: dot chart of χ² − df for leg, sub, tri, waist, re, size (combined), and age, on a 100 to 700 scale]
Figure 9.11: ANOVA for reduced model after multiple imputation, with addition of a combined effect for four size variables
ggplot(Predict(g), abbrev=TRUE, ylab=NULL)          # Figure 9.12 (call assumed)
M <- Mean(g)
ggplot(Predict(g, fun=M), abbrev=TRUE, ylab=NULL)   # Figure 9.13

Compare the estimated age partial effects and confidence intervals with those from a model using casewise deletion, and with bootstrap nonparametric confidence intervals (also with casewise deletion).
Figure 9.12: Partial effects (log hazard or log-log cumulative probability scale) of all predictors in reduced model, after multiple
imputation
Figure 9.13: Partial effects (mean scale) of all predictors in reduced model, after multiple imputation
Figure 9.14: Partial effect for age from multiple imputation (center red line) and casewise deletion (center blue line) with symmetric
Wald 0.95 confidence bands using casewise deletion (gray shaded area), basic bootstrap confidence bands using casewise deletion
(blue lines), percentile bootstrap confidence bands using casewise deletion (dashed blue lines), and symmetric Wald confidence bands
accounting for multiple imputation (red lines).
M   <- Mean(g)
qu  <- Quantile(g)
med <- function(lp) qu(.5, lp)
q90 <- function(lp) qu(.9, lp)
lp  <- predict(g)
lpr <- quantile(predict(g), c(.002, .998), na.rm=TRUE)
lps <- seq(lpr[1], lpr[2], length=200)
pmn <- M(lps)
pme <- med(lps)
[Figure 9.15: estimated mean, median, and 0.9 quantile of gh as functions of the linear predictor (scale 5.0 to 8.0)]
g <- Newlevels(g, list(re=abbreviate(levels(w$re))))
exprob <- ExProb(g)
nom <- nomogram(g,
         fun=list(Mean=M,
                  'Median Glycohemoglobin'=med,
                  '0.9 Quantile'=q90,
                  'Prob(HbA1c >= 6.5)'=function(x) exprob(x, y=6.5),
                  'Prob(HbA1c >= 7.0)'=function(x) exprob(x, y=7),
                  'Prob(HbA1c >= 7.5)'=function(x) exprob(x, y=7.5)),
         fun.at=list(seq(5, 8, by=.5),
                     c(5, 5.25, 5.5, 5.75, 6, 6.25),
                     c(5.5, 6, 6.5, 7, 8, 10, 12, 14),
                     c(.01, .05, .1, .2, .3, .4),
                     c(.01, .05, .1, .2, .3, .4),
                     c(.01, .05, .1, .2, .3, .4)))
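The nomogram object is then drawn; the exact plotting arguments were lost, but a minimal call (assumed) is:

plot(nom)     # Figure 9.16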
[Figure 9.16: nomogram with axes for Points, Age, Race/Ethnicity, Waist Circumference, Triceps Skinfold, Subscapular Skinfold, Total Points, Linear Predictor, Mean, Median Glycohemoglobin, 0.9 Quantile, and the three exceedance probabilities]
Figure 9.16: Nomogram for predicting median, mean, and 0.9 quantile of glycohemoglobin, along with the estimated probability
that HbA1c 6.5, 7, or 7.5, all from the log-log ordinal model
Chapter 10

10.1 Descriptive Statistics
Create a variable acute to flag categories of interest; print univariable descriptive statistics.
require(rms)
getHdata(support)       # Get data frame from web site
acute <- support$dzclass %in% c('ARF/MOSF', 'Coma')
latex(describe(support[acute, ]), file='')
support[acute, ] : 537 Observations, 35 Variables (condensed)

age : Age. n 537, missing 0, unique 529, Info 1, Mean 60.7
    .05 28.49  .10 35.22  .25 47.93  .50 63.67  .75 74.49  .90 81.54  .95 85.56
death : missing 0, unique 2, Info 0.67, Sum 356, Mean 0.6629
hospdead : missing 0, unique 2, Info 0.7, Sum 201, Mean 0.3743
sex : n 537, missing 0, unique 2
slos : missing 0, unique 85, Info 1, Mean 23.44
    .05 4.0  .10 5.0  .25 9.0  .50 15.0  .75 27.0  .90 47.4  .95 68.2
d.time : missing 0, unique 340, Info 1, Mean 446.1
    .05 4  .10 6  .25 16  .50 182  .75 724  .90 1421  .95 1742
dzgroup : n 537, missing 0, unique 3
    ARF/MOSF w/Sepsis (391, 73%), Coma (60, 11%), MOSF w/Malig (86, 16%)
dzclass : n 537, missing 0, unique 2
num.co : missing 0, unique 7, Info 0.93, Mean 1.525
    value       0    1    2   3   4   5  6
    Frequency 111  196  133  51  31  10  5
    %          21   36   25   9   6   2  1
edu : missing 126, unique 22, Info 0.96, Mean 12.03
    .05 7  .10 8  .25 10  .50 12  .75 14  .90 16  .95 17
income : n 335, missing 202, unique 4
    under $11k (158, 47%), $11-$25k (79, 24%), $25-$50k (63, 19%), >$50k (35, 10%)
scoma : missing 0, unique 11, Info 0.82, Mean 19.24
    value       0   9  26  37  41  44  55  61  89  94  100
    Frequency 301  50  44  19  17  43  11   6   8   6   32
charges : missing 20, unique 516, Info 1, Mean 86652
    .05 11075  .25 27389  .50 51079  .75 100904  .95 283411
totcst : missing 66, unique 471, Info 1, Mean 46360
    .05 6359  .25 15412  .50 29308  .75 57028  .95 141569
totmcst : missing 206, unique 328, Info 1, Mean 39022
    .05 6131  .25 14415  .50 26323  .75 54102  .95 111920
avtisst : missing 1, unique 205, Info 1, Mean 29.83
    .05 12.46  .25 19.62  .50 28.00  .75 39.00  .90 47.17  .95 50.37
race : n 535, missing 2, unique 5
meanbp : missing 0, unique 109, Info 1, Mean 83.28
    .05 41.8  .10 49.0  .25 59.0  .50 73.0  .75 111.0  .90 124.4  .95 135.0
wblc : missing 5, unique 241, Info 1, Mean 14.1
    .05 0.8999  .25 7.9749  .50 12.3984  .75 18.1992  .95 30.1873
hrt : missing 0, unique 111, Info 1, Mean 105
    .05 51  .10 60  .25 75  .50 111  .75 126  .90 140  .95 155
resp : missing 0, unique 45, Info 1, Mean 23.72
    .05 8  .10 10  .25 12  .50 24  .75 32  .90 39  .95 40
temp : missing 0, unique 61, Info 1, Mean 37.52
    .05 35.50  .25 36.40  .50 37.80  .75 38.50  .95 39.50
pafi : missing 37, unique 357, Info 1, Mean 227.2
    .05 86.99  .25 137.88  .50 202.56  .75 290.00  .95 433.31
alb : missing 191, unique 34, Info 1, Mean 2.668
    .05 1.700  .25 2.225  .50 2.600  .75 3.100  .95 3.800
bili : missing 151, unique 88, Info 1, Mean 2.678
    .05 0.3000  .25 0.6000  .50 0.8999  .75 2.0000  .95 13.1743
crea : missing 0, unique 84, Info 1, Mean 2.232
    .05 0.6000  .25 0.8999  .50 1.3999  .75 2.5996  .95 7.3197
sod : missing 0, unique 38, Info 1, Mean 138.1
    .05 129  .25 134  .50 137  .75 142  .95 150
ph : missing 37, unique 49, Info 1, Mean 7.416
    .05 7.270  .25 7.380  .50 7.420  .75 7.470  .95 7.529
glucose : missing 240, unique 179, Info 1, Mean 167.7
    .05 76.0  .25 106.0  .50 141.0  .75 200.0  .95 347.2
bun : missing 233, unique 100, Info 1, Mean 38.91
    .05 8.00  .25 16.75  .50 30.00  .75 56.00  .95 100.70
urine : missing 234, unique 262, Info 1, Mean 2095
    .10 364.0  .25 1156.5  .50 1870.0  .75 2795.0  .95 4817.5
adlp : missing 433, unique 8, Info 0.87, Mean 1.577
    value      0   1  2  3  4  5  6  7
    Frequency 51  19  7  6  4  7  8  2
adls : missing 145, unique 8, Info 0.89, Mean 1.86
    value       0   1   2   3   4   5   6   7
    Frequency 185  68  22  18  17  20  39  23
sfdm2 : n 468, missing 69, unique 5
    no(M2 and SIP pres) (134, 29%), adl>=4 (>=5 if sur) (78, 17%),
    SIP>=30 (30, 6%), Coma or Intub (5, 1%), <2 mo. follow-up (221, 47%)
adlsc : missing 0, unique 144, Info 0.96, Mean 2.119
    .05 0.000  .25 0.000  .50 1.839  .75 3.375  .95 6.000
# Show patterns of missing data    (Figure 10.1)
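The call that produced Figure 10.1 was lost upstream; a plausible reconstruction (assumed) using the Hmisc naclus function described in the caption is:

plot(naclus(support[acute, ]))    # cluster variables by joint missingness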
Figure 10.1: Cluster analysis showing which predictors tend to be missing on the same patients
Figure 10.2: Hierarchical clustering of potential predictors using Hoeffding D as a similarity measure. Categorical predictors are
automatically expanded into dummy variables.
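The clustering call for Figure 10.2 was also lost; a sketch consistent with the caption (the variable list is assumed from the dendrogram labels) is:

vc <- varclus(~ age + sex + dzgroup + num.co + edu + income + scoma +
              race + meanbp + wblc + hrt + resp + temp + pafi + alb +
              bili + crea + sod + ph + glucose + bun + urine + adlsc,
              similarity='hoeffding', data=support[acute, ])
plot(vc)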
10.2 Checking Adequacy of Log-Normal Accelerated Failure Time Model
dd <- datadist(support[acute, ])   # describe distributions of variables to rms
options(datadist=dd)
# Generate right-censored survival time variable
# (the Surv construction below is assumed; extraction dropped it)
years <- support$d.time[acute]/365.25
units(years) <- 'Year'
S <- Surv(years, support$death[acute])
[Figure 10.3: nonparametric survival estimates by disease group: dzgroup=ARF/MOSF w/Sepsis, dzgroup=Coma, dzgroup=MOSF w/Malig]
Figure 10.4: Kaplan-Meier estimates of distributions of normalized, right-censored residuals from the fitted log-normal survival model.
Residuals are stratified by important variables in the model (by quartiles of continuous variables), plus a random variable to depict
the natural variability (in the lower right plot). Theoretical standard Gaussian distributions of residuals are shown with a thick solid
line. The upper left plot is with respect to disease group.
The fit for dzgroup is not great but overall fit is good.

Remove from consideration predictors that are missing in > 0.2 of the patients. Many of these were only collected for the second phase of SUPPORT.

Of the variables to be included in the model, find which ones have enough potential predictive power to justify allowing for nonlinear relationships or multiple categories, which spend more d.f. For each variable compute Spearman ρ² based on multiple d.f.
[Figure 10.5 dot chart, strongest first (adjusted ρ² on a 0.00 to 0.12 axis):
 scoma (N 537, df 2), meanbp (537, 2), dzgroup (537, 2), crea (537, 2),
 pafi (500, 2), ph (500, 2), sod (537, 2), hrt (537, 2), adlsc (537, 2),
 temp (537, 2), wblc (532, 2), num.co (537, 2), age (537, 2),
 resp (537, 2), race (535, 4)]
Figure 10.5: Generalized Spearman 2 rank correlation between predictors and truncated survival time
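The correlation computation itself was lost. A sketch of the assumed form, using Hmisc's spearman2 with quadratic rank effects; d.timet is a hypothetical truncated-survival-time variable implied by the caption, not an object defined in the recovered code:

s <- spearman2(d.timet ~ age + num.co + scoma + meanbp + hrt + resp +
               temp + crea + sod + adlsc + wblc + pafi + ph +
               dzgroup + race, p=2, data=support[acute, ])
plot(s)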
# Compute number of missing values per variable
sapply(llist(age, num.co, scoma, meanbp, hrt, resp, temp, crea, sod,
             adlsc, wblc, pafi, ph), function(x) sum(is.na(x)))
   age num.co  scoma meanbp    hrt   resp   temp   crea    sod  adlsc
     0      0      0      0      0      0      0      0      0      0
  wblc   pafi     ph
     5     37     37
Figure 10.6: Somers' Dxy rank correlation between predictors and original survival time. For dzgroup or race, the correlation coefficient is the maximum correlation from using a dummy variable to represent the most frequent or one to represent the second most frequent category.
Do a formal redundancy analysis using more than pairwise associations, and allow for non-monotonic transformations in predicting each predictor from all other predictors. This analysis requires missing values to be imputed so as to not greatly reduce the sample size.

redun(~ crea + age + sex + dzgroup + num.co + scoma + adlsc + race2 +
      meanbp + hrt + resp + temp + sod + wblc.i + pafi.i + ph.i, nk=4)
Redundancy Analysis

redun(formula = ~ crea + age + sex + dzgroup + num.co + scoma +
      adlsc + race2 + meanbp + hrt + resp + temp + sod + wblc.i +
      pafi.i + ph.i, nk = 4)

n: 537   p: 16   nk: 4
Number of NAs: 0

Type: ordinary

R2 with which each variable can be predicted from all other variables:

  crea    age    sex dzgroup num.co  scoma  adlsc  race2
 0.133  0.246  0.132   0.451  0.147  0.418  0.153  0.151
meanbp    hrt   resp    temp    sod wblc.i pafi.i   ph.i
 0.178  0.258  0.131   0.197  0.135  0.093  0.143  0.171

No redundant variables
Figure 10.7: Partial χ² statistics for association of each predictor with response from saturated main effects model, penalized for d.f.
Fit a log-normal survival model with numbers of parameters for nonlinear effects (tests of linearity) determined from Figure 10.7. For the most promising predictors, five knots can be allocated, as there are fewer singularity problems once less promising predictors are simplified.

f <- psm(S ~ rcs(age,5) + sex + dzgroup + num.co +
         scoma + pol(adlsc,2) + race2 + rcs(meanbp,5) +
         rcs(hrt,3) + rcs(resp,3) + temp +
         rcs(crea,4) + sod + rcs(wblc.i,3) + rcs(pafi.i,4),
         dist='lognormal')     # 'gaussian' for S+
print(f, latex=TRUE)
Obs 537   Events 356   σ̂ 2.230782
Model Likelihood Ratio Test: LR χ² 236.83, d.f. 30, Pr(>χ²) < 0.0001
Discrimination Indexes: R² 0.594, Dxy 0.485, g 1.959

                       Coef      S.E.     Wald Z  Pr(>|Z|)
(Intercept)            -5.6883    3.7851   -1.50   0.1329
age                    -0.0148    0.0309   -0.48   0.6322
age'                   -0.0412    0.1078   -0.38   0.7024
age''                   0.1670    0.5594    0.30   0.7653
age'''                 -0.2099    1.3707   -0.15   0.8783
sex=male               -0.0737    0.2181   -0.34   0.7354
dzgroup=Coma           -2.0676    0.4062   -5.09  <0.0001
dzgroup=MOSF w/Malig   -1.4664    0.3112   -4.71  <0.0001
num.co                 -0.1917    0.0858   -2.23   0.0255
scoma                  -0.0142    0.0044   -3.25   0.0011
adlsc                  -0.3735    0.1520   -2.46   0.0140
adlsc²                  0.0442    0.0243    1.82   0.0691
race2=other             0.2979    0.2658    1.12   0.2624
meanbp                  0.0702    0.0210    3.34   0.0008
meanbp'                -0.3080    0.2261   -1.36   0.1732
meanbp''                0.8438    0.8556    0.99   0.3241
meanbp'''              -0.5715    0.7707   -0.74   0.4584
hrt                    -0.0171    0.0069   -2.46   0.0140
hrt'                    0.0064    0.0063    1.02   0.3090
resp                    0.0454    0.0230    1.97   0.0483
resp'                  -0.0851    0.0291   -2.93   0.0034
temp                    0.0523    0.0834    0.63   0.5308
crea                   -0.4585    0.6727   -0.68   0.4955
crea'                 -11.5176   19.0027   -0.61   0.5444
crea''                 21.9840   31.0113    0.71   0.4784
sod                     0.0044    0.0157    0.28   0.7792
wblc.i                  0.0746    0.0331    2.25   0.0242
wblc.i'                -0.0880    0.0377   -2.34   0.0195
pafi.i                  0.0169    0.0055    3.07   0.0021
pafi.i'                -0.0569    0.0239   -2.38   0.0173
pafi.i''                0.1088    0.0482    2.26   0.0239
Log(scale)              0.8024    0.0401   19.99  <0.0001

a <- anova(f)
10.3
[Figure 10.8 panels, each with its partial Wald χ²: adlsc χ²(2)=8.3, age χ²(4)=16, crea χ²(3)=33.6, hrt χ²(2)=11.8, meanbp χ²(4)=27.6, num.co χ²(1)=5, pafi.i χ²(3)=15.4, resp χ²(2)=11.1, scoma χ²(1)=10.6, sod χ²(1)=0.1, temp χ²(1)=0.4, wblc.i χ²(2)=5.5, dzgroup χ²(2)=45.7, race2 χ²(1)=1.3, sex χ²(1)=0.1; y-axes show log(T)]
Figure 10.8: Effect of each predictor on log survival time. Predicted values have been centered so that predictions at predictor
reference values are zero. Pointwise 0.95 confidence bands are also shown. As all Y -axes have the same scale, it is easy to see which
predictors are strongest.
                  χ²    d.f.   P
age              15.99   4    0.0030
  Nonlinear       0.23   3    0.9722
sex               0.11   1    0.7354
dzgroup          45.69   2   <0.0001
num.co            4.99   1    0.0255
scoma            10.58   1    0.0011
adlsc             8.28   2    0.0159
  Nonlinear       3.31   1    0.0691
race2             1.26   1    0.2624
meanbp           27.62   4   <0.0001
  Nonlinear      10.51   3    0.0147
hrt              11.83   2    0.0027
  Nonlinear       1.04   1    0.3090
resp             11.10   2    0.0039
  Nonlinear       8.56   1    0.0034
temp              0.39   1    0.5308
crea             33.63   3   <0.0001
  Nonlinear      21.27   2   <0.0001
sod               0.08   1    0.7792
wblc.i            5.47   2    0.0649
  Nonlinear       5.46   1    0.0195
pafi.i           15.37   3    0.0015
  Nonlinear       6.97   2    0.0307
TOTAL NONLINEAR  60.48  14   <0.0001
TOTAL           261.47  30   <0.0001

[Figure 10.9: dot chart of partial χ² − df, weakest to strongest: sod, sex, temp, race2, wblc.i, num.co, adlsc, resp, scoma, hrt, age, pafi.i, meanbp, crea, dzgroup; axis 10 to 40]
[Figure 10.10 contrasts (survival-time ratio scale 0.10 to 3.50): age 74.5:47.9, num.co 2:1, scoma 37:0, adlsc 3.38:0, meanbp 111:59, hrt 126:75, resp 32:12, temp 38.5:36.4, crea 2.6:0.9, sod 142:134, wblc.i 18.2:8.1, pafi.i 323:142, sex female:male, dzgroup Coma:ARF/MOSF w/Sepsis, dzgroup MOSF w/Malig:ARF/MOSF w/Sepsis, race2 other:white]
Figure 10.10: Estimated survival time ratios for default settings of predictors. For example, when age changes from its lower quartile
to the upper quartile (47.9y to 74.5y), median survival time decreases by more than half. Different shaded areas of bars indicate
different confidence levels (0.9, 0.95, 0.99).
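Figure 10.10 presumably comes from plotting the summary of the fit; the exact call was lost, but a standard rms idiom of this form would produce it (the log argument is an assumption):

plot(summary(f), log=TRUE)   # survival-time ratios on a log scale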
10.4
# First add the raw data to the model fit so the bootstrap can re-sample
# (exact calls assumed; the table below used B=120 re-samples)
f <- update(f, x=TRUE, y=TRUE)
validate(f, B=120, dxy=TRUE)
Index      Original  Training  Test   Optimism  Corrected  n
           Sample    Sample    Sample           Index
Dxy        0.49      0.51      0.46   0.05      0.43       120
R²         0.59      0.66      0.54   0.12      0.47       120
Intercept  0.00      0.00      0.06   0.06      0.06       120
Slope      1.00      1.00      0.90   0.10      0.90       120
D          0.48      0.55      0.42   0.13      0.35       120
U          0.00      0.00      0.01   0.01      0.01       120
Q          0.48      0.55      0.43   0.12      0.36       120
g          1.96      2.06      1.86   0.19      1.76       120
[Figure 10.11: bootstrap calibration plot for the survival model; both axes 0.0 to 0.8]
10.5
Z <- predict(f)        # linear predictor from the full model (call assumed)
a <- ols(Z ~ rcs(age,5) + sex + dzgroup + num.co + scoma + pol(adlsc,2) +
         race2 + rcs(meanbp,5) + rcs(hrt,3) + rcs(resp,3) + temp +
         rcs(crea,4) + sod + rcs(wblc.i,3) + rcs(pafi.i,4),
         sigma=1)      # right-hand side assumed to match the full model
# sigma=1 is used to prevent sigma-hat from being zero when R2=1.0,
# since we start out with all component variables
fastbw(a, aics=10000)  # fast backward stepdown from approximating model (call assumed)

 Deleted  Chi-Sq  d.f.  P       Residual  d.f.  P       AIC      R2
 sod        0.43  1     0.512      0.43   1     0.5117    -1.57  1.000
 sex        0.57  1     0.451      1.00   2     0.6073    -3.00  0.999
 temp       2.20  1     0.138      3.20   3     0.3621    -2.80  0.998
 race2      6.81  1     0.009     10.01   4     0.0402     2.01  0.994
 wblc.i    29.52  2     0.000     39.53   6     0.0000    27.53  0.976
 num.co    30.84  1     0.000     70.36   7     0.0000    56.36  0.957
 resp      54.18  2     0.000    124.55   9     0.0000   106.55  0.924
 adlsc     52.46  2     0.000    177.00  11     0.0000   155.00  0.892
 pafi.i    66.78  3     0.000    243.79  14     0.0000   215.79  0.851
 scoma     78.07  1     0.000    321.86  15     0.0000   291.86  0.803
 hrt       83.17  2     0.000    405.02  17     0.0000   371.02  0.752
 age       68.08  4     0.000    473.10  21     0.0000   431.10  0.710
 crea     314.47  3     0.000    787.57  24     0.0000   739.57  0.517
 meanbp   403.04  4     0.000   1190.61  28     0.0000  1134.61  0.270
 dzgroup  441.28  2     0.000   1631.89  30     0.0000  1571.89  0.000
f.approx <- ols(Z ~ dzgroup + rcs(meanbp,5) + rcs(crea,4) + rcs(age,5) +
                rcs(hrt,3) + scoma + rcs(pafi.i,4) + pol(adlsc,2) +
                rcs(resp,3), x=TRUE)
f.approx$stats

        n  Model L.R.    d.f.     R2  Sigma      g
  537.000    1688.225  23.000  0.957  0.370  1.915
# Compute v, the variance matrix of the approximate model's coefficients,
# from var(full model)
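The lines computing v were lost. The sketch below implements the standard variance-transfer computation for a model approximation (v = W V W', with W the least-squares map from the full-model coefficients to the approximate ones); the object names Xf, Xa, W are assumptions:

V  <- vcov(f, regcoef.only=TRUE)        # var(full model), scale excluded
Xf <- cbind(Intercept=1, f$x)           # full-model design (f has x=TRUE)
Xa <- cbind(Intercept=1, f.approx$x)    # approximate-model design
W  <- solve(t(Xa) %*% Xa, t(Xa) %*% Xf) # gamma-hat = W %*% beta-hat(full)
v  <- W %*% V %*% t(W)                  # variance matrix of approximation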
Compare variance estimates (diagonals of v) with variance estimates from a reduced model that is fitted against the actual outcomes.

f.sub <- psm(S ~ dzgroup + rcs(meanbp,5) + rcs(crea,4) + rcs(age,5) +
             rcs(hrt,3) + scoma + rcs(pafi.i,4) + pol(adlsc,2) +
             rcs(resp,3), dist='lognormal')   # 'gaussian' for S+
r <- diag(v)/diag(vcov(f.sub, regcoef.only=TRUE))
r[c(which.min(r), which.max(r))]

  hrt   age
0.976 0.982

f.approx$var <- v
latex(anova(f.approx, test='Chisq', ss=FALSE), file='',
      label='suport.anovaa')
# Typeset mathematical form of approximate model
                  χ²    d.f.   P
dzgroup          55.94   2    <0.0001
meanbp           29.87   4    <0.0001
  Nonlinear       9.84   3     0.0200
crea             39.04   3    <0.0001
  Nonlinear      24.37   2    <0.0001
age              18.12   4     0.0012
  Nonlinear       0.34   3     0.9517
hrt               9.87   2     0.0072
  Nonlinear       0.40   1     0.5289
scoma             9.85   1     0.0017
pafi.i           14.01   3     0.0029
  Nonlinear       6.66   2     0.0357
adlsc             9.71   2     0.0078
  Nonlinear       2.87   1     0.0904
resp              9.65   2     0.0080
  Nonlinear       7.13   1     0.0076
TOTAL NONLINEAR  58.08  13    <0.0001
TOTAL           252.32  23    <0.0001
E(Z) = Xβ, where

Xβ̂ = 2.51
  − 1.94[Coma] − 1.75[MOSF w/Malig]
  + 0.068 meanbp − 3.08×10⁻⁵(meanbp − 41.8)³₊ + 7.9×10⁻⁵(meanbp − 61)³₊
      − 4.91×10⁻⁵(meanbp − 73)³₊ + 2.61×10⁻⁶(meanbp − 109)³₊ − 1.7×10⁻⁶(meanbp − 135)³₊
  − 0.553 crea − 0.229(crea − 0.6)³₊ + 0.45(crea − 1.1)³₊ − 0.233(crea − 1.94)³₊
      + 0.0131(crea − 7.32)³₊
  − 0.0165 age − 1.13×10⁻⁵(age − 28.5)³₊ + 4.05×10⁻⁵(age − 49.5)³₊
      − 2.15×10⁻⁵(age − 63.7)³₊ − 2.68×10⁻⁵(age − 72.7)³₊ + 1.9×10⁻⁵(age − 85.6)³₊
  − 0.0136 hrt + 6.09×10⁻⁷(hrt − 60)³₊ − 1.68×10⁻⁶(hrt − 111)³₊ + 1.07×10⁻⁶(hrt − 140)³₊
  − 0.0135 scoma
  + 0.0161 pafi.i − 4.77×10⁻⁷(pafi.i − 88)³₊ + 9.11×10⁻⁷(pafi.i − 167)³₊
      − 5.02×10⁻⁷(pafi.i − 276)³₊ + 6.76×10⁻⁸(pafi.i − 426)³₊
  − 0.369 adlsc + 0.0409 adlsc²
  + 0.0394 resp − 9.11×10⁻⁵(resp − 10)³₊ + 0.000176(resp − 24)³₊ − 8.5×10⁻⁵(resp − 39)³₊

and [c] = 1 if subject is in group c, 0 otherwise; (x)₊ = x if x > 0, 0 otherwise.
# Derive R functions, based on the approximate model, that express the
# mean and quantiles of survival time for specific linear predictors
expected.surv <- Mean(f)
quantile.surv <- Quantile(f)
latex(expected.surv, file='', type='Sinput')

expected.surv <- function(lp=NULL, parms=0.802352037606488)
{
  names(parms) <- NULL
  exp(lp + exp(2 * parms)/2)
}

latex(quantile.surv, file='', type='Sinput')

quantile.surv <- function(q=0.5, lp=NULL, parms=0.802352037606488)
{
  names(parms) <- NULL
  f <- function(lp, q, parms) lp + exp(parms) * qnorm(q)
  names(q) <- format(q)
  drop(exp(outer(lp, q, FUN=f, parms=parms)))
}
median.surv <- function(x) quantile.surv(lp=x)

# Improve variable labels for the nomogram
R Software Used (rms and Hmisc packages)

Purpose              Functions
Miscellaneous        varclus, llist, spearman2, describe, impute, latex
Modeling             datadist, psm, rcs, ols, fastbw
Model presentation   survplot, Newlabels, Function, Mean, Quantile, nomogram
Model validation     validate, calibrate
Figure 10.12: Nomogram for predicting median and mean survival time, based on approximation of full model
Annotated Bibliography
[1] Paul D. Allison. Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences,
07-136. Thousand Oaks CA: Sage, 2001 (cit. on pp. 3-1, 3-5).
[2] D. G. Altman. "Categorising continuous covariates (letter to the editor)". In: Brit J Cancer 64 (1991), p. 975 (cit. on p. 2-13).
[3] D. G. Altman and P. K. Andersen. "Bootstrap investigation of the stability of a Cox regression model". In: Stat Med 8 (1989), pp. 771–783 (cit. on p. 4-15).
[4] D. G. Altman et al. "Dangers of using optimal cutpoints in the evaluation of prognostic factors". In: J Nat Cancer Inst 86 (1994), pp. 829–835 (cit. on pp. 2-13, 2-15).
[5] Douglas G. Altman. "Suboptimal analysis using optimal cutpoints". In: Brit J Cancer 78 (1998), pp. 556–557 (cit. on p. 2-13).
[6] A. C. Atkinson. "A note on the generalized information criterion for choice of a model". In: Biometrika 67 (1980), pp. 413–418 (cit. on pp. 2-26, 4-14).
[7] Peter C. Austin. "Bootstrap model selection had similar performance for selecting authentic and noise variables compared to backward variable elimination: a simulation study". In: J Clin Epi 61 (2008), pp. 1009–1017 (cit. on p. 4-15).
In general, a bootstrap model selection method had comparable performance to conventional backward variable elimination for identifying the true regression model. In most settings, both methods performed poorly at correctly identifying the correct regression model.
[8] Peter C. Austin, Jack V. Tu, and Douglas S. Lee. "Logistic regression had superior performance compared with regression trees for predicting in-hospital mortality in patients hospitalized with heart failure". In: J Clin Epi 63 (2010), pp. 1145–1155 (cit. on p. 2-32).
ROC areas for logistic models varied from 0.747 to 0.775, whereas they varied from 0.620 to 0.651 for recursive partitioning; repeated data simulation showed large variation in tree structure.
[9] Sunni A. Barnes, Stacy R. Lindborg, and John W. Seaman. "Multiple imputation techniques in small sample clinical trials". In: Stat Med 25 (2006), pp. 233–245 (cit. on p. 3-15).
Bad performance of LOCF including high bias and poor confidence interval coverage; simulation setup; longitudinal data; serial data; RCT; dropout; assumed missing at random (MAR); approximate Bayesian bootstrap; Bayesian least squares; missing data; nice background summary; new completion-score method based on fitting a Poisson model for the number of completed clinic visits and using donors and approximate Bayesian bootstrap.
[10] Federica Barzi and Mark Woodward. "Imputations of missing values in practice: Results from imputations of serum cholesterol in 28 cohort studies". In: Am J Epi 160 (2004), pp. 34–45 (cit. on pp. 3-7, 3-14).
Excellent review article for multiple imputation; list of variables to include in imputation model; "Imputation models should ideally include all covariates that are related to the missing data mechanism, have distributions that differ between the respondents and nonrespondents, are associated with cholesterol, and will be included in the analyses of the final complete data sets"; detailed comparison of results (cholesterol effect and confidence limits) for various imputation methods.
[11] Heiko Belcher. "The concept of residual confounding in regression models and some applications". In: Stat Med 11 (1992), pp. 1747–1758 (cit. on p. 2-13).
[12] D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of
Collinearity. New York: Wiley, 1980 (cit. on p. 4-37).
[13] David A. Belsley. Conditioning Diagnostics: Collinearity and Weak Data in Regression. New York: Wiley, 1991
(cit. on p. 4-22).
[14] Jacqueline K. Benedetti et al. "Effective sample size for tests of censored survival data". In: Biometrika 69 (1982), pp. 343–349 (cit. on p. 4-17).
[15] Kiros Berhane, Michael Hauptmann, and Bryan Langholz. "Using tensor product splines in modeling exposure–time–response relationships: Application to the Colorado Plateau Uranium Miners cohort". In: Stat Med 27 (2008), pp. 5484–5496 (cit. on p. 2-44).
Discusses taking the product of all univariate spline basis functions.
[16] Maria Blettner and Willi Sauerbrei. "Influence of model-building strategies on the results of a case-control study". In: Stat Med 12 (1993), pp. 1325–1338 (cit. on p. 5-22).
[17] James G. Booth and Somnath Sarkar. "Monte Carlo approximation of bootstrap variances". In: Am Statistician 52 (1998), pp. 354–357 (cit. on p. 5-10).
Number of resamples required to estimate variances, quantiles; 800 resamples may be required to guarantee with 0.95 confidence that the relative error of a variance estimate is 0.1; Efron's original suggestions for as low as 25 resamples were based on comparing stability of bootstrap estimates to sampling error, but small relative effects can significantly change P-values; number of bootstrap resamples.
[18] Robert Bordley. "Statistical decisionmaking without math". In: Chance 20.3 (2007), pp. 39–44 (cit. on p. 1-7).
[19] L. Breiman and J. H. Friedman. "Estimating optimal transformations for multiple regression and correlation (with discussion)". In: J Am Stat Assoc 80 (1985), pp. 580–619 (cit. on p. 4-30).
[20] Leo Breiman. "The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error". In: J Am Stat Assoc 87 (1992), pp. 738–754 (cit. on pp. 4-14, 4-15, 5-16).
[21] Leo Breiman et al. Classification and Regression Trees. Pacific Grove, CA: Wadsworth and Brooks/Cole, 1984
(cit. on p. 2-31).
[22] William M. Briggs and Russell Zaretzki. The skill plot: A graphical technique for evaluating continuous diagnostic
tests (with discussion). In: Biometrics 64 (2008), pp. 250–261 (cit. on p. 1-7).
statistics such as the AUC are not especially relevant to someone who must make a decision about a particular x_c. ... ROC curves
lack or obscure several quantities that are necessary for evaluating the operational effectiveness of diagnostic tests. ... ROC curves were
first used to check how radio receivers (like radar receivers) operated over a range of frequencies. ... This is not how most ROC curves
are used now, particularly in medicine. The receiver of a diagnostic measurement ... wants to make a decision based on some x_c, and is
not especially interested in how well he would have done had he used some different cutoff.; in the discussion David Hand states when
integrating to yield the overall AUC measure, it is necessary to decide what weight to give each value in the integration. The AUC
implicitly does this using a weighting derived empirically from the data. This is nonsensical. The relative importance of misclassifying
a case as a noncase, compared to the reverse, cannot come from the data itself. It must come externally, from considerations of the
severity one attaches to the different kinds of misclassifications.; see Lin, Kvam, and Lu, Stat in Med 28:798–813, 2009
.
[23] David Brownstone. Regression strategies. In: Proceedings of the 20th Symposium on the Interface between Computer Science and Statistics. Washington, DC: American Statistical Association, 1988, pp. 74–79 (cit. on p. 5-22).
[24] Petra Buettner, Claus Garbe, and Irene Guggenmoos-Holzmann. Problems in defining cutoff points of continuous
prognostic factors: Example of tumor thickness in primary cutaneous melanoma. In: J Clin Epi 50 (1997),
pp. 1201–1210 (cit. on p. 2-13).
choice of cut point depends on marginal distribution of predictor
.
[25] Stef van Buuren. Flexible Imputation of Missing Data. Boca Raton, FL: Chapman & Hall/CRC, 2012 (cit. on pp. 3-1,
3-14, 3-19).
[26] Centers for Disease Control and Prevention (CDC), National Center for Health Statistics (NCHS). National Health and
Nutrition Examination Survey. Hyattsville, MD, 2010. url: http://www.cdc.gov/nchs/nhanes/nhanes2009-2010/nhanes09_10.htm (cit. on p. 9-4).
[27] John M. Chambers and Trevor J. Hastie, eds. Statistical Models in S. Pacific Grove, CA: Wadsworth and Brooks/Cole, 1992 (cit. on p. 2-45).
[28] C. Chatfield. Avoiding statistical pitfalls (with discussion). In: Statistical Sci 6 (1991), pp. 240–268 (cit. on
p. 4-38).
[29] C. Chatfield. Model uncertainty, data mining and statistical inference (with discussion). In: J Roy Stat Soc A 158
(1995), pp. 419–466 (cit. on pp. 4-10, 5-22).
bias by selecting model because it fits the data well; bias in standard errors; P. 420: ... need for a better balance in the literature and in
statistical teaching between techniques and problem solving strategies. P. 421: It is well known to be logically unsound and practically
misleading (Zhang, 1992) to make inferences as if a model is known to be true when it has, in fact, been selected from the same data
to be used for estimation purposes. However, although statisticians may admit this privately (Breiman (1992) calls it a quiet scandal),
they (we) continue to ignore the difficulties because it is not clear what else could or should be done. P. 421: Estimation errors for
regression coefficients are usually smaller than errors from failing to take into account model specification. P. 422: Statisticians must
stop pretending that model uncertainty does not exist and begin to find ways of coping with it. P. 426: It is indeed strange that we
often admit model uncertainty by searching for a best model but then ignore this uncertainty by making inferences and predictions as
if certain that the best fitting model is actually true. P. 427: The analyst needs to assess the model selection process and not just the
best fitting model. P. 432: The use of subset selection methods is well known to introduce alarming biases. P. 433: ... the AIC can
be highly biased in data-driven model selection situations. P. 434: Prediction intervals will generally be too narrow. In the discussion,
Jamal R. M. Ameen states that a model should be (a) satisfactory in performance relative to the stated objective, (b) logically sound,
(c) representative, (d) questionable and subject to on-line interrogation, (e) able to accommodate external or expert information and
(f) able to convey information.
.
[30] Samprit Chatterjee and Ali S. Hadi. Regression Analysis by Example. Fifth. New York: Wiley, 2012. isbn: 0470905840
(cit. on p. 4-21).
[31] Marie Chavent et al. ClustOfVar: An R package for the clustering of variables. In: J Stat Software 50.13 (Sept.
2012), pp. 1–16 (cit. on p. 4-26).
[32] A. Ciampi et al. Stratification by stepwise regression, correspondence analysis and recursive partition. In: Comp
Stat Data Analysis 1986 (1986), pp. 185–204 (cit. on p. 4-26).
[33] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. In: J Am Stat Assoc 74 (1979),
pp. 829–836 (cit. on p. 2-28).
[34] E. Francis Cook and Lee Goldman. Asymmetric stratification: An outline for an efficient method for controlling
confounding in cohort studies. In: Am J Epi 127 (1988), pp. 626–639 (cit. on p. 2-32).
[35] J. B. Copas. Cross-validation shrinkage of regression predictors. In: J Roy Stat Soc B 49 (1987), pp. 175–183
(cit. on p. 5-20).
[36] J. B. Copas. Regression, prediction and shrinkage (with discussion). In: J Roy Stat Soc B 45 (1983), pp. 311–354
(cit. on pp. 4-19, 4-20).
[37] David R. Cox. Regression models and life-tables (with discussion). In: J Roy Stat Soc B 34 (1972), pp. 187–220
(cit. on p. 2-48).
[38] Sybil L. Crawford, Sharon L. Tennstedt, and John B. McKinlay. A comparison of analytic methods for non-random
missingness of outcome data. In: J Clin Epi 48 (1995), pp. 209–219 (cit. on pp. 3-4, 4-44).
[39] N. J. Crichton and J. P. Hinde. Correspondence analysis as a screening method for indicants for clinical diagnosis.
In: Stat Med 8 (1989), pp. 1351–1362 (cit. on p. 4-26).
[40] Ralph B. D'Agostino et al. Development of health risk appraisal functions in the presence of multiple indicators:
The Framingham Study nursing home institutionalization model. In: Stat Med 14 (1995), pp. 1757–1770 (cit. on
pp. 4-22, 4-25).
[41] C. E. Davis et al. An example of dependencies among variables in a conditional logistic regression. In: Modern
Statistical Methods in Chronic Disease Epi. Ed. by S. H. Moolgavkar and R. L. Prentice. New York: Wiley, 1986,
pp. 140–147 (cit. on p. 4-22).
[42] Charles S. Davis. Statistical Methods for the Analysis of Repeated Measurements. New York: Springer, 2002 (cit. on
p. 7-16).
[43] S. Derksen and H. J. Keselman. Backward, forward and stepwise automated subset selection algorithms: Frequency
of obtaining authentic and noise variables. In: British J Math Stat Psych 45 (1992), pp. 265–282 (cit. on p. 4-10).
[44] T. F. Devlin and B. J. Weeks. Spline functions for logistic regression modeling. In: Proceedings of the Eleventh
Annual SAS Users Group International Conference. Cary, NC: SAS Institute, Inc., 1986, pp. 646–651 (cit. on
p. 2-21).
[45] Peter J. Diggle et al. Analysis of Longitudinal Data. Second. Oxford, UK: Oxford University Press, 2002 (cit. on
p. 7-10).
[46] Donders et al. Review: A gentle introduction to imputation of missing values. In: J Clin Epi 59 (2006),
pp. 1087–1091 (cit. on pp. 3-1, 3-5).
simple demonstration of failure of the add new category method (indicator variable)
.
[47] William D. Dupont. Statistical Modeling for Biomedical Researchers. Second. Cambridge, UK: Cambridge University
Press, 2008 (cit. on p. 11-13).
[48] S. Durrleman and R. Simon. Flexible regression models with cubic splines. In: Stat Med 8 (1989), pp. 551–561
(cit. on p. 2-25).
[49] B. Efron. Estimating the error rate of a prediction rule: Improvement on cross-validation. In: J Am Stat Assoc 78
(1983), pp. 316–331 (cit. on pp. 5-17, 5-20).
suggested that at least 200 model fits are needed to get an adequate average, i.e., 20 repeats of 10-fold cross-validation
.
[50] Bradley Efron and Robert Tibshirani. An Introduction to the Bootstrap. New York: Chapman and Hall, 1993 (cit. on
p. 5-20).
[51] Bradley Efron and Robert Tibshirani. Improvements on cross-validation: The .632+ bootstrap method. In: J Am
Stat Assoc 92 (1997), pp. 548–560 (cit. on p. 5-20).
[52] Juanjuan Fan and Richard A. Levine. To amnio or not to amnio: That is the decision for Bayes. In: Chance 20.3
(2007), pp. 26–32 (cit. on p. 1-7).
[53] David Faraggi and Richard Simon. A simulation study of cross-validation for selecting an optimal cutpoint in
univariate survival analysis. In: Stat Med 15 (1996), pp. 2203–2213 (cit. on p. 2-13).
bias in point estimate of effect from selecting cutpoints based on P-value; loss of information from dichotomizing continuous predictors
.
[54] J. J. Faraway. The cost of data analysis. In: J Comp Graph Stat 1 (1992), pp. 213–229 (cit. on pp. 4-46, 5-19,
5-22).
[55] Valerii Fedorov, Frank Mannino, and Rongmei Zhang. Consequences of dichotomization. In: Pharm Stat 8 (2009),
pp. 50–61. doi: 10.1002/pst.331. url: http://dx.doi.org/10.1002/pst.331 (cit. on pp. 1-6, 2-13).
optimal cutpoint depends on unknown parameters; should only entertain dichotomization when estimating a value of the cumulative
distribution and when the assumed model is very different from the true model; nice graphics
.
[56] D. Freedman, W. Navidi, and S. Peters. On the Impact of Variable Selection in Fitting Regression Equations.
In: Lecture Notes in Economics and Mathematical Systems. New York: Springer-Verlag, 1988, pp. 1–16 (cit. on
p. 5-20).
[57] J. H. Friedman. A variable span smoother. Technical Report 5. Laboratory for Computational Statistics, Department
of Statistics, Stanford University, 1984 (cit. on p. 4-30).
[58] Mitchell H. Gail and Ruth M. Pfeiffer. On criteria for evaluating models of absolute risk. In: Biostatistics 6.2
(2005), pp. 227–239 (cit. on p. 1-7).
[59] Joseph C. Gardiner, Zhehui Luo, and Lee A. Roman. Fixed effects, random effects and GEE: What are the
differences? In: Stat Med 28 (2009), pp. 221–239 (cit. on p. 7-9).
nice comparison of models; econometrics; different use of the term fixed effects model
.
[60] A. Giannoni et al. Do optimal prognostic thresholds in continuous physiological variables really exist? Analysis of
origin of apparent thresholds, with systematic review for peak oxygen consumption, ejection fraction and BNP. In:
PLoS ONE 9.1 (2014). doi: 10.1371/journal.pone.0081699. url: http://dx.doi.org/10.1371/journal.pone.0081699 (cit. on pp. 2-13, 2-15).
[61] John H. Giudice, John R. Fieberg, and Mark S. Lenarz. Spending degrees of freedom in a poor economy: A case
study of building a sightability model for moose in northeastern Minnesota. In: J Wildlife Manage (2011). doi:
10.1002/jwmg.213. url: http://dx.doi.org/10.1002/jwmg.213 (cit. on p. 4-1).
[62] Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. In: J Am Stat
Assoc 102 (2007), pp. 359–378 (cit. on p. 1-7).
wonderful review article, except missing references from the Scandinavian and German medical decision making literature
.
[63] Harvey Goldstein. Restricted unbiased iterative generalized least-squares estimation. In: Biometrika 76.3 (1989),
pp. 622–623 (cit. on pp. 7-6, 7-10).
derivation of REML
.
[64] Usha S. Govindarajulu et al. Comparing smoothing techniques in Cox models for exposure-response relationships.
In: Stat Med 26 (2007), pp. 3735–3752 (cit. on p. 2-26).
the authors wrote a SAS macro for restricted cubic splines even though such a macro has existed since 1984; would have gotten more useful
results had simulation been used, so that the true regression shape would be known; measure of agreement of two estimated curves by computing
the area between them, standardized by the average of the areas under the two; penalized spline and rcs were closer to each other than to
fractional polynomials
.
[65] P. M. Grambsch and P. C. O'Brien. The effects of transformations and preliminary tests for non-linearity in
regression. In: Stat Med 10 (1991), pp. 697–709 (cit. on pp. 2-36, 4-10).
[66] Robert J. Gray. Flexible methods for analyzing survival data using splines, with applications to breast cancer
prognosis. In: J Am Stat Assoc 87 (1992), pp. 942–951 (cit. on pp. 2-44, 4-19).
[67] Robert J. Gray. Spline-based tests in survival analysis. In: Biometrics 50 (1994), pp. 640–652 (cit. on p. 2-44).
[68] Michael J. Greenacre. Correspondence analysis of multivariate categorical data by weighted least-squares. In:
Biometrika 75 (1988), pp. 457–467 (cit. on p. 4-26).
[69] Sander Greenland. When should epidemiologic regressions use random coefficients? In: Biometrics 56 (2000),
pp. 915–921. doi: 10.1111/j.0006-341X.2000.00915.x. url: http://dx.doi.org/10.1111/j.0006-341X.2000.00915.x (cit. on pp. 4-10, 4-41).
use of statistics in epidemiology is largely primitive; stepwise variable selection on confounders leaves important confounders uncontrolled; composition matrix; example with far too many significant predictors, with many regression coefficients absurdly inflated when
overfit; lack of evidence for dietary effects mediated through constituents; shrinkage instead of variable selection; larger effect on confidence interval width than on point estimates with variable selection; uncertainty about variance of random effects is just uncertainty
about prior opinion; estimation of variance is pointless; instead the analysis should be repeated using different values; if one feels compelled to estimate τ², I would recommend giving it a proper prior concentrated around contextually reasonable values; claim about
ordinary MLE being unbiased is misleading because it assumes the model is correct and is the only model entertained; shrinkage towards
compositional model; models need to be complex to capture uncertainty about the relations ... an honest uncertainty assessment requires
parameters for all effects that we know may be present. This advice is implicit in an antiparsimony principle often attributed to L. J.
Savage: All models should be as big as an elephant (see Draper, 1995). See also gus06per.
.
[70] Jian Guo et al. Principal component analysis with sparse fused loadings. In: J Comp Graph Stat 19.4 (2011),
pp. 930–946 (cit. on p. 4-26).
incorporates blocking structure in the variables; selects different variables for different components; encourages loadings of highly correlated variables to have same magnitude, which aids in interpretation
.
[71] D. Hand and M. Crowder. Practical Longitudinal Data Analysis. London: Chapman & Hall, 1996.
[72] Ofer Harel and Xiao-Hua Zhou. Multiple imputation: Review of theory, implementation and software. In: Stat
Med 26 (2007), pp. 3057–3077 (cit. on pp. 3-1, 3-7, 3-12, 3-15).
failed to review aregImpute; excellent overview; ugly S code; nice description of different statistical tests, including combining likelihood
ratio tests (which appears to be complex, requiring an out-of-sample log likelihood computation); congeniality of imputation and analysis
models; Bayesian approximation or approximate Bayesian bootstrap overview; Although missing at random (MAR) is a non-testable
assumption, it has been pointed out in the literature that we can get very close to MAR if we include enough variables in the imputation
models ... it would be preferred if the missing data modelling was done by the data constructors and not by the users ... MI yields valid
inferences not only in congenial settings, but also in certain uncongenial ones as well, where the imputer's model (1) is more general
(i.e. makes fewer assumptions) than the complete-data estimation method, or when the imputer's model makes additional assumptions
that are well-founded.
.
[73] F. E. Harrell. The LOGIST Procedure. In: SUGI Supplemental Library User's Guide. Version 5. Cary, NC: SAS
Institute, Inc., 1986, pp. 269–293 (cit. on p. 4-13).
[74] F. E. Harrell, K. L. Lee, and B. G. Pollock. Regression models in clinical studies: Determining relationships between
predictors and response. In: J Nat Cancer Inst 80 (1988), pp. 1198–1202 (cit. on p. 2-29).
[75] F. E. Harrell et al. Regression modeling strategies for improved prognostic prediction. In: Stat Med 3 (1984),
pp. 143–152 (cit. on p. 4-16).
[76] F. E. Harrell et al. Regression models for prognostic prediction: Advantages, problems, and suggested solutions.
In: Ca Trt Rep 69 (1985), pp. 1071–1077 (cit. on p. 4-16).
[77] Frank E. Harrell, Kerry L. Lee, and Daniel B. Mark. Multivariable prognostic models: Issues in developing models,
evaluating assumptions and adequacy, and measuring and reducing errors. In: Stat Med 15 (1996), pp. 361–387
(cit. on p. 4-1).
[78] Frank E. Harrell et al. Development of a clinical prediction model for an ordinal outcome: The World Health
Organization ARI Multicentre Study of clinical signs and etiologic agents of pneumonia, sepsis, and meningitis in
young infants. In: Stat Med 17 (1998), pp. 909–944. url: http://onlinelibrary.wiley.com/doi/10.1002/
(SICI)1097-0258(19980430)17:8%3C909::AID-SIM753%3E3.0.CO;2-O/abstract (cit. on pp. 4-19, 4-44).
[79] Trevor J. Hastie and Robert J. Tibshirani. Generalized Additive Models. ISBN 9780412343902. Boca Raton, FL:
Chapman & Hall/CRC, 1990 (cit. on p. 2-35).
[80] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Second. ISBN-10:
0387848576; ISBN-13: 978-0387848570. New York: Springer, 2008 (cit. on p. 2-34).
[81] Yulei He and Alan M. Zaslavsky. Diagnosing imputation models by applying target analyses to posterior replicates
of completed data. In: Stat Med 31.1 (2012), pp. 1–18. doi: 10.1002/sim.4413. url: http://dx.doi.org/10.1002/sim.4413 (cit. on p. 3-17).
[82] S. G. Hilsenbeck and G. M. Clark. Practical p-value adjustment for optimally selected cutpoints. In: Stat Med 15
(1996), pp. 103–112 (cit. on p. 2-13).
[83] W. Hoeffding. A non-parametric test of independence. In: Ann Math Stat 19 (1948), pp. 546–557 (cit. on p. 4-26).
[84] Norbert Hollander, Willi Sauerbrei, and Martin Schumacher. Confidence intervals for the effect of a prognostic
factor after selection of an optimal cutpoint. In: Stat Med 23 (2004), pp. 1701–1713. doi: 10.1002/sim.1611.
url: http://dx.doi.org/10.1002/sim.1611 (cit. on pp. 2-13, 2-15).
true type I error can be much greater than nominal level; one example where nominal is 0.05 and true is 0.5; minimum P-value
method; CART; recursive partitioning; bootstrap method for correcting confidence interval; based on heuristic shrinkage coefficient; It
should be noted, however, that the optimal cutpoint approach has disadvantages. One of these is that in almost every study where
this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible.
Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the
literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the
optimal cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival
in node-negative breast cancer patients, 12 studies were included with 12 different cutpoints ... Interestingly, neither cathepsin-D nor
the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society
of Clinical Oncology.; dichotomization; categorizing continuous variables; refs alt94dan, sch94out, alt98sub
.
[85] Nicholas J. Horton and Ken P. Kleinman. Much ado about nothing: A comparison of missing data methods and
software to fit incomplete data regression models. In: Am Statistician 61.1 (2007), pp. 79–90 (cit. on p. 3-15).
[86] C. M. Hurvich and C. L. Tsai. The impact of model selection on inference in linear regression. In: Am Statistician
44 (1990), pp. 214–217 (cit. on p. 4-15).
[87] Lisa I. Iezzoni. Dimensions of Risk. In: Risk Adjustment for Measuring Health Outcomes. Ed. by Lisa I. Iezzoni.
Ann Arbor, MI: Foundation of the American College of Healthcare Executives, 1994. Chap. 2, pp. 29–118 (cit. on
p. 1-12).
dimensions of risk factors to include in models
.
[88] K. J. Janssen et al. Missing covariate data in medical research: To impute is better than to ignore. In: J Clin Epi
63 (2010), pp. 721–727 (cit. on p. 3-20).
[89] Michael P. Jones. Indicator and stratification methods for missing explanatory variables in multiple linear regression.
In: J Am Stat Assoc 91 (1996), pp. 222–230 (cit. on p. 3-5).
[90] J. D. Kalbfleisch and R. L. Prentice. Marginal likelihood based on Cox's regression and life model. In: Biometrika
60 (1973), pp. 267–278 (cit. on p. 9-25).
[91] Juha Karvanen and Frank E. Harrell. Visualizing covariates in proportional hazards model. In: Stat Med 28 (2009),
pp. 1957–1966 (cit. on p. 5-2).
[92] Michael G. Kenward, Ian R. White, and James R. Carpenter. Should baseline be a covariate or dependent variable
in analyses of change from baseline in clinical trials? (letter to the editor). In: Stat Med 29 (2010), pp. 1455–1456
(cit. on p. 7-5).
sharp rebuke of liu09sho
.
[93] W. A. Knaus et al. The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized
adults. In: Ann Int Med 122 (1995), pp. 191–203. doi: 10.7326/0003-4819-122-3-199502010-00007. url:
http://dx.doi.org/10.7326/0003-4819-122-3-199502010-00007 (cit. on pp. 4-31, 10-1).
[94] Mirjam J. Knol et al. Unpredictable bias when using the missing indicator method or complete case analysis for
missing confounder values: an empirical example. In: J Clin Epi 63 (2010), pp. 728–736 (cit. on p. 3-5).
[95] R. Koenker and G. Bassett. Regression quantiles. In: Econometrica 46 (1978), pp. 33–50 (cit. on p. 9-11).
[96] Roger Koenker. Quantile Regression. ISBN-10: 0-521-60827-9; ISBN-13: 978-0-521-60827-5. New York: Cambridge
University Press, 2005 (cit. on p. 9-11).
ANNOTATED BIBLIOGRAPHY
11-7
[97] Roger Koenker. quantreg: Quantile Regression. R package version 4.38. 2009. url: http://CRAN.R-project.org/package=quantreg (cit. on p. 9-11).
[98] Charles Kooperberg, Charles J. Stone, and Young K. Truong. Hazard regression. In: J Am Stat Assoc 90 (1995),
pp. 78–94 (cit. on p. 10-18).
[99] Warren F. Kuhfeld. The PRINQUAL Procedure. In: SAS/STAT 9.2 User's Guide. Second. Cary, NC: SAS Publishing, 2009. url: http://support.sas.com/documentation/onlinedoc/stat (cit. on p. 4-27).
[100] B. Lausen and M. Schumacher. Evaluating the effect of optimized cutoff values in the assessment of prognostic
factors. In: Comp Stat Data Analysis 21.3 (1996), pp. 307–326. doi: 10.1016/0167-9473(95)00016-X. url:
http://dx.doi.org/10.1016/0167-9473(95)00016-X (cit. on p. 2-13).
[101] J. F. Lawless and K. Singhal. Efficient screening of nonnormal regression models. In: Biometrics 34 (1978),
pp. 318–327 (cit. on p. 4-14).
[102] S. le Cessie and J. C. van Houwelingen. Ridge estimators in logistic regression. In: Appl Stat 41 (1992),
pp. 191–201 (cit. on p. 4-19).
[103] A. Leclerc et al. Correspondence analysis and logistic modelling: Complementary use in the analysis of a health
survey among nurses. In: Stat Med 7 (1988), pp. 983–995 (cit. on p. 4-26).
[104] Katherine J. Lee and John B. Carlin. Recovery of information from multiple imputation: a simulation study.
In: Emerging Themes in Epi 9.1 (June 2012), pp. 3+. issn: 1742-7622. doi: 10.1186/1742-7622-9-3. url:
http://dx.doi.org/10.1186/1742-7622-9-3 (cit. on p. 3-4).
Not sure that the authors satisfactorily dealt with nonlinear predictor effects; in the absence of strong auxiliary information, there is little
to gain from multiple imputation with missing data in the exposure-of-interest. In fact, the authors went further to say that multiple
imputation can introduce bias not present in a complete case analysis if a poorly fitting imputation model is used [from Yong Hao Pua]
.
[105] Seokho Lee, Jianhua Z. Huang, and Jianhua Hu. Sparse logistic principal components analysis for binary data. In:
Ann Appl Stat 4.3 (2010), pp. 1579–1601 (cit. on p. 2-34).
[106] Chenlei Leng and Hansheng Wang. On general adaptive sparse principal component analysis. In: J Comp Graph
Stat 18.1 (2009), pp. 201–215 (cit. on p. 2-34).
[107] Kung-Yee Liang and Scott L. Zeger. Longitudinal data analysis of continuous and discrete responses for pre-post
designs. In: Sankhya 62 (2000), pp. 134–148 (cit. on p. 7-5).
makes an error in assuming the baseline variable will have the same univariate distribution as the response except for a shift; baseline
may have, for example, a truncated distribution based on a trial's inclusion criteria; if correlation between baseline and response is zero,
ANCOVA will be twice as efficient as simple analysis of change scores; if correlation is one they may be equally efficient
.
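The efficiency comparison in the annotation is easy to check by simulation; a minimal R sketch of the zero-correlation case (all numbers illustrative):

## When baseline b is uncorrelated with response y, the variance of the
## change-score treatment estimate is about twice that of the ANCOVA
## estimate, i.e., ANCOVA is about twice as efficient
set.seed(1)
est <- replicate(2000, {
  n  <- 100
  tx <- rep(0:1, each = n / 2)          # two equal arms
  b  <- rnorm(n)                        # baseline, independent of response
  y  <- 0.5 * tx + rnorm(n)             # response
  c(ancova = unname(coef(lm(y ~ tx + b))["tx"]),
    change = unname(coef(lm(I(y - b) ~ tx))["tx"]))
})
apply(est, 1, var)                      # change-score variance is about 2x ANCOVA's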
[108] James K. Lindsey. Models for Repeated Measurements. Clarendon Press, 1997.
[109] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. Second. New York: Wiley, 2002
(cit. on pp. 3-4, 3-9, 3-17).
[110] Guanghan F. Liu et al. Should baseline be a covariate or dependent variable in analyses of change from baseline
in clinical trials? In: Stat Med 28 (2009), pp. 2509–2530 (cit. on p. 7-5).
seems to miss several important points, such as the fact that the baseline variable is often part of the inclusion/exclusion criteria and
so has a truncated distribution that is different from that of the follow-up measurements; sharp rebuke in ken10sho
.
[111] Richard Lockhart et al. A significance test for the lasso. Tech. rep. arXiv, 2013. arXiv: 1301.7161. url:
http://arxiv.org/abs/1301.7161 (cit. on p. 4-10).
[112] Xiaohui Luo, Leonard A. Stefanski, and Dennis D. Boos. Tuning variable selection procedures by adding noise. In:
Technometrics 48 (2006), pp. 165–175 (cit. on p. 1-14).
adding a known amount of noise to the response and studying the estimate of σ² to tune the stopping rule to avoid overfitting or
underfitting; simulation setup
.
[113] Nathan Mantel. Why stepdown procedures in variable selection. In: Technometrics 12 (1970), pp. 621–625 (cit. on
p. 4-14).
[114] S. E. Maxwell and H. D. Delaney. Bivariate median splits and spurious statistical significance. In: Psych Bull 113
(1993), pp. 181–190. doi: 10.1037//0033-2909.113.1.181. url: http://dx.doi.org/10.1037//0033-2909.113.1.181 (cit. on p. 2-13).
[115] George P. McCabe. Principal variables. In: Technometrics 26 (1984), pp. 137–144 (cit. on p. 4-25).
[116] George Michailidis and Jan de Leeuw. The Gifi system of descriptive multivariate analysis. In: Statistical Sci 13
(1998), pp. 307–336 (cit. on p. 4-26).
[117] Karel G. M. Moons et al. Using the outcome for imputation of missing predictor values was preferred. In: J Clin
Epi 59 (2006), pp. 1092–1101. doi: 10.1016/j.jclinepi.2006.01.009. url: http://dx.doi.org/10.1016/j.jclinepi.2006.01.009 (cit. on p. 3-13).
use of outcome variable; excellent graphical summaries of simulations
.
[118] Barry K. Moser and Laura P. Coombs. Odds ratios for a continuous outcome variable without dichotomizing. In:
Stat Med 23 (2004), pp. 1843–1860 (cit. on p. 2-13).
large loss of efficiency and power; embeds in a logistic distribution, similar to proportional odds model; categorization; dichotomization of
a continuous response in order to obtain odds ratios often results in an inflation of the needed sample size by a factor greater than 1.5
.
[119] Raymond H. Myers. Classical and Modern Regression with Applications. Boston: PWS-Kent, 1990 (cit. on p. 4-21).
[120] N. J. D. Nagelkerke. A note on a general definition of the coefficient of determination. In: Biometrika 78 (1991),
pp. 691–692 (cit. on p. 4-39).
[121] Todd G. Nick and J. Michael Hardin. Regression modeling strategies: An illustrative case study from medical
rehabilitation outcomes research. In: Am J Occ Ther 53 (1999), pp. 459–470 (cit. on p. 4-1).
[122] David J. Nott and Chenlei Leng. Bayesian projection approaches to variable selection in generalized linear models.
In: Computational Statistics & Data Analysis 54.12 (Dec. 2010), pp. 3227–3241. issn: 0167-9473. doi:
10.1016/j.csda.2010.01.036. url: http://dx.doi.org/10.1016/j.csda.2010.01.036 (cit. on p. 2-34).
[123] Debashis Paul et al. Preconditioning for feature selection and regression in high-dimensional problems. In: Ann
Stat 36.4 (2008), pp. 1595–1619. doi: 10.1214/009053607000000578. url: http://dx.doi.org/10.1214/009053607000000578 (cit. on p. 2-34).
develop a consistent Ŷ using a latent variable structure, using for example supervised principal components, then run stepwise regression
or lasso predicting Ŷ (lasso worked better); can run into problems when a predictor has importance in an adjusted sense but has no
marginal correlation with Y; model approximation; model simplification
.
[124] Peter Peduzzi et al. A simulation study of the number of events per variable in logistic regression analysis. In: J
Clin Epi 49 (1996), pp. 1373–1379 (cit. on pp. 4-16, 4-17).
[125] Peter Peduzzi et al. Importance of events per independent variable in proportional hazards regression analysis. II.
Accuracy and precision of regression estimates. In: J Clin Epi 48 (1995), pp. 1503–1510 (cit. on p. 4-16).
[126] N. Peek et al. External validation of prognostic models for critically ill patients required substantial sample sizes.
In: J Clin Epi 60 (2007), pp. 491–501 (cit. on p. 4-40).
large sample sizes are needed to obtain reliable external validations; inadequate power of the DeLong, DeLong, and Clarke-Pearson test for
differences in correlated ROC areas (p. 498); problem with tests of calibration accuracy having too much power for large sample sizes
.
[127] Michael J. Pencina, Ralph B. D'Agostino, and Olga V. Demler. Novel metrics for evaluating improvement in discrimination: net reclassification and integrated discrimination improvement for normal variables and nested models.
In: Stat Med 31.2 (2012), pp. 101–113. doi: 10.1002/sim.4348. url: http://dx.doi.org/10.1002/sim.4348
(cit. on p. 4-40).
[128] Michael J. Pencina, Ralph B. D'Agostino, and Ewout W. Steyerberg. Extensions of net reclassification improvement
calculations to measure usefulness of new biomarkers. In: Stat Med 30 (2011), pp. 11–21. doi: 10.1002/sim.4085.
url: http://dx.doi.org/10.1002/sim.4085 (cit. on p. 4-40).
lack of need for NRI to be category-based; arbitrariness of categories; category-less or continuous NRI is the most objective and versatile
measure of improvement in risk prediction; authors misunderstood the inadequacy of three categories if categories are used; comparison
of NRI to change in C index; example of continuous plot of risk for old model vs. risk for new model
.
[129] Michael J. Pencina et al. Evaluating the added predictive ability of a new marker: From area under the ROC curve
to reclassification and beyond. In: Stat Med 27 (2008), pp. 157–172 (cit. on p. 4-40).
small differences in ROC area can still be very meaningful; example of insignificant test for difference in ROC areas with very significant
results from new method; Yates' discrimination slope; reclassification table; limiting version of this based on whether, and the amount by
which, probabilities rise for events and lower for non-events when comparing new model to old; comparing two models; see letters to the
editor by Van Calster and Van Huffel, Stat in Med 29:318–319, 2010, and by Cook and Paynter, Stat in Med 31:93–97, 2012
.
[130] Sanne A. Peters et al. Multiple imputation of missing repeated outcome measurements did not add to linear
mixed-effects models. In: J Clin Epi 65.6 (2012), pp. 686–695. doi: 10.1016/j.jclinepi.2011.11.012. url:
http://dx.doi.org/10.1016/j.jclinepi.2011.11.012 (cit. on p. 7-9).
[131] Jose C. Pinheiro and Douglas M. Bates. Mixed-Effects Models in S and S-PLUS. New York: Springer, 2000 (cit. on
pp. 7-10, 7-12).
[132] Tjeerd van der Ploeg, Peter C. Austin, and Ewout W. Steyerberg. Modern modelling techniques are data hungry: a
simulation study for predicting dichotomous endpoints. In: BMC medical research methodology 14.1 (Dec. 2014),
pp. 137+. issn: 1471-2288. doi: 10.1186/1471-2288-14-137. url: http://dx.doi.org/10.1186/1471-2288-14-137 (cit. on pp. 2-4, 4-16).
Would be better to use proper accuracy scores in the assessment. Too much emphasis on optimism as opposed to final discrimination
measure. But much good practical information. Recursive partitioning fared poorly.
.
[133] Richard F. Potthoff and S. N. Roy. A generalized multivariate analysis of variance model useful especially for
growth curve problems. In: Biometrika 51 (1964), pp. 313–326 (cit. on p. 7-6).
included an AR(1) example
.
[134] Peter Radchenko and Gareth M. James. Variable inclusion and shrinkage algorithms. In: J Am Stat Assoc 103.483
(2008), pp. 1304–1315 (cit. on p. 2-33).
solves problem caused by lasso using the same penalty parameter for variable selection and shrinkage, which causes lasso to have to
keep too many variables in the model to avoid overshrinking the remaining predictors; does not handle scaling issue well
.
[135] D. R. Ragland. Dichotomizing continuous outcome variables: Dependence of the magnitude of association and
statistical power on the cutpoint. In: Epi 3 (1992). See letters to editor May 1993 P. 274-, Vol 4 No. 3, pp. 434–440.
doi: 10.1097/00001648-199209000-00009. url: http://dx.doi.org/10.1097/00001648-199209000-00009
(cit. on p. 2-13).
[136] Brendan M. Reilly and Arthur T. Evans. Translating clinical research into clinical practice: Impact of using prediction
rules to make decisions. In: Ann Int Med 144 (2006), pp. 201–209 (cit. on p. 1-10).
impact analysis; example of decision aid being ignored or overruled, making MD decisions worse; assumed utilities are constant across
subjects by concluding that directives have more impact than predictions; Goldman-Cook clinical prediction rule in AMI
.
[137] Ellen B. Roecker. Prediction error and its estimation for subset-selected models. In: Technometrics 33 (1991),
pp. 459–468 (cit. on pp. 4-14, 5-16).
[138] Patrick Royston, Douglas G. Altman, and Willi Sauerbrei. Dichotomizing continuous predictors in multiple regression: a bad idea. In: Stat Med 25 (2006), pp. 127–141. doi: 10.1002/sim.2331. url: http://dx.doi.org/10.1002/sim.2331 (cit. on p. 2-13).
destruction of statistical inference when cutpoints are chosen using the response variable; varying effect estimates when changing cutpoints; difficult to interpret effects when dichotomizing; nice plot showing effect of categorization; PBC data
.
[139] D. Rubin and N. Schenker. Multiple imputation in health-care data bases: An overview and some applications. In:
Stat Med 10 (1991), pp. 585–598 (cit. on p. 3-12).
[140] Warren Sarle. The VARCLUS Procedure. In: SAS/STAT User's Guide. Fourth. Vol. 2. Cary, NC: SAS Institute,
Inc., 1990. Chap. 43, pp. 1641–1659. url: http://support.sas.com/documentation/onlinedoc/stat (cit. on
pp. 4-22, 4-25).
[141] Willi Sauerbrei and Martin Schumacher. A bootstrap resampling procedure for model building: Application to the
Cox regression model. In: Stat Med 11 (1992), pp. 2093–2109 (cit. on pp. 4-15, 5-17).
[142] G. Schulgen et al. Outcome-oriented cutpoints in quantitative exposure. In: Am J Epi 120 (1994), pp. 172–184
(cit. on pp. 2-13, 2-15).
[143] E. Selvin et al. Glycated hemoglobin, diabetes, and cardiovascular risk in nondiabetic adults. In: NEJM 362.9
(Mar. 2010), pp. 800–811. doi: 10.1056/NEJMoa0908359. url: http://dx.doi.org/10.1056/NEJMoa0908359
(cit. on p. 2-24).
[144] Stephen Senn. Change from baseline and analysis of covariance revisited. In: Stat Med 25 (2006), pp. 4334–4344
(cit. on p. 7-5).
shows that it is not true that in a 2-arm study ANCOVA requires the population means at baseline to be identical; refutes
some claims of lia00lon; problems with counterfactuals; temporal additivity (amounts to supposing that despite the fact that groups
are different at baseline they would show the same evolution over time); causal additivity; it is difficult to design trials for which simple
analysis of change scores is unbiased, ANCOVA is biased, and a causal interpretation can be given; temporally and logically, a baseline
cannot be a response to treatment, so baseline and response cannot be modeled in an integrated framework as Laird and Ware's
model has been used; one should focus clearly on outcomes as being the only values that can be influenced by treatment and examine
critically any schemes that assume that these are linked in some rigid and deterministic view to baseline values. An alternative tradition
sees a baseline as being merely one of a number of measurements capable of improving predictions of outcomes and models it in this
way.; You cannot establish necessary conditions for an estimator to be valid by nominating a model and seeing what the model implies
unless the model is universally agreed to be impeccable. On the contrary it is appropriate to start with the estimator and see what
assumptions are implied by valid conclusions.; this is in distinction to lia00lon
.
[145] Jun Shao. Linear model selection by cross-validation. In: J Am Stat Assoc 88 (1993), pp. 486–494 (cit. on p. 5-17).
[146] Noah Simon et al. A sparse-group lasso. In: J Comp Graph Stat 22.2 (2013), pp. 231–245. doi:
10.1080/10618600.2012.681250. eprint: http://www.tandfonline.com/doi/pdf/10.1080/10618600.2012.681250.
url: http://www.tandfonline.com/doi/abs/10.1080/10618600.2012.681250 (cit. on p. 2-34).
sparse effects both on a group and within-group level; can also be considered a special case of group lasso allowing overlap between
groups
.
[147] Sean L. Simpson et al. A linear exponent AR(1) family of correlation structures. In: Stat Med 29 (2010),
pp. 1825–1838 (cit. on p. 7-13).
[148] L. R. Smith, F. E. Harrell, and L. H. Muhlbaier. Problems and potentials in modeling survival. In: Medical
Effectiveness Research Data Methods (Summary Report), AHCPR Pub. No. 92-0056. Ed. by Mary L. Grady and
Harvey A. Schwartz. Rockville, MD: US Dept. of Health and Human Services, Agency for Health Care Policy and
Research, 1992, pp. 151–159. url: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/FrankHarrell/smi92pro.pdf (cit. on p. 4-16).
[149] Ian Spence and Robert F. Garrison. A remarkable scatterplot. In: Am Statistician 47 (1993), pp. 12–19 (cit. on
p. 4-38).
[150] D. J. Spiegelhalter. Probabilistic prediction in patient management and clinical trials. In: Stat Med 5 (1986),
pp. 421–433 (cit. on pp. 4-19, 4-45, 5-20, 5-21).
z-test for calibration inaccuracy (implemented in Stata and in the R Hmisc package's val.prob function)
.
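A brief illustration of that test on simulated data, assuming the val.prob interface of the current R rms/Hmisc packages (the element names in the returned vector are as assumed below):

## Spiegelhalter z-test for calibration of predicted probabilities
require(rms)                     # provides val.prob
set.seed(1)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(x))   # binary outcome
p <- plogis(x)                   # predicted probabilities (here the true ones)
v <- val.prob(p, y)              # calibration and discrimination summary
v[c("S:z", "S:p")]               # assumed names: Spiegelhalter z and its P-value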
[151] Ewout W. Steyerberg. Clinical Prediction Models. New York: Springer, 2009 (cit. on pp. x, 11-13).
[152] Ewout W. Steyerberg et al. Prognostic modeling with logistic regression analysis: In search of a sensible strategy
in small data sets. In: Med Decis Mak 21 (2001), pp. 45–56 (cit. on p. 4-1).
[153] Ewout W. Steyerberg et al. Prognostic modelling with logistic regression analysis: A comparison of selection and
estimation methods in small data sets. In: Stat Med 19 (2000), pp. 1059–1079 (cit. on p. 2-33).
[154] C. J. Stone. Comment: Generalized additive models. In: Statistical Sci 1 (1986), pp. 312–314 (cit. on p. 2-25).
[155] C. J. Stone and C. Y. Koo. Additive splines in statistics. In: Proceedings of the Statistical Computing Section
ASA. Washington, DC, 1985, pp. 45–48 (cit. on pp. 2-21, 2-26).
[156] Samy Suissa and Lucie Blais. Binary regression with continuous outcomes. In: Stat Med 14 (1995), pp. 247–255.
doi: 10.1002/sim.4780140303. url: http://dx.doi.org/10.1002/sim.4780140303 (cit. on p. 2-13).
[157] Guo-Wen Sun, Thomas L. Shook, and Gregory L. Kay. Inappropriate use of bivariable analysis to screen risk factors
for use in multivariable analysis. In: J Clin Epi 49 (1996), pp. 907–916 (cit. on p. 4-17).
[158] Robert Tibshirani. Regression shrinkage and selection via the lasso. In: J Roy Stat Soc B 58 (1996), pp. 267–288
(cit. on p. 2-33).
[159] Jos Twisk et al. Multiple imputation of missing values was not necessary before performing a longitudinal mixed-model analysis. In: J Clin Epi 66.9 (2013), pp. 1022–1028. doi: 10.1016/j.jclinepi.2013.03.017. url:
http://dx.doi.org/10.1016/j.jclinepi.2013.03.017 (cit. on p. 3-3).
[160] Werner Vach and Maria Blettner. Missing Data in Epidemiologic Studies. In: Ency of Biostatistics. New York:
Wiley, 1998, pp. 2641–2654 (cit. on p. 3-5).
[161] S. van Buuren et al. Fully conditional specification in multivariate imputation. In: J Stat Computation Sim 76.12
(2006), pp. 1049–1064 (cit. on pp. 3-15, 3-16).
justification for chained equations alternative to full multivariate modeling
.
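A minimal sketch of this style of imputation using aregImpute from Hmisc, the approach used elsewhere in these notes (data and variable names are invented for illustration):

## Chained-equations-style multiple imputation, then a pooled fit
require(rms)                                     # also loads Hmisc
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- d$x1 + rnorm(100)
d$x1[sample(100, 20)] <- NA                      # inject missingness in x1
a <- aregImpute(~ y + x1 + x2, data = d, n.impute = 5)
f <- fit.mult.impute(y ~ x1 + x2, ols, a, data = d)   # Rubin's-rule pooling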
[162] Geert J. M. G. van der Heijden et al. Imputation of missing values is superior to complete case analysis and
the missing-indicator method in multivariable diagnostic research: A clinical example. In: J Clin Epi 59 (2006),
pp. 1102–1109. doi: 10.1016/j.jclinepi.2006.01.015. url: http://dx.doi.org/10.1016/j.jclinepi.2006.01.015 (cit. on p. 3-5).
Invalidity of adding a new category or an indicator variable for missing values even with MCAR
.
[163] J. C. van Houwelingen and S. le Cessie. Predictive value of statistical models. In: Stat Med 9 (1990),
pp. 1303–1325 (cit. on pp. 2-26, 4-19, 5-17, 5-20, 5-22).
[164] William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Fourth. New York: Springer-Verlag,
2003. isbn: 0387954570 (cit. on p. 9-2).
[165] Geert Verbeke and Geert Molenberghs. Linear Mixed Models for Longitudinal Data. New York: Springer, 2000.
[166] Pierre J. Verweij and Hans C. van Houwelingen. Penalized likelihood in Cox regression. In: Stat Med 13 (1994),
pp. 2427–2436 (cit. on p. 4-19).
[167] Andrew J. Vickers. Decision analysis for the evaluation of diagnostic tests, prediction models, and molecular
markers. In: Am Statistician 62.4 (2008), pp. 314–320 (cit. on p. 1-7).
limitations of accuracy metrics; incorporating clinical consequences; nice example of calculation of expected outcome; drawbacks of conventional decision analysis, especially because of the difficulty of eliciting the expected harm of a missed diagnosis; use of a threshold
on the probability of disease for taking some action; decision curve; has other good references to decision analysis
.
[168] Eric Vittinghoff and Charles E. McCulloch. Relaxing the rule of ten events per variable in logistic and Cox regression. In: Am J Epi 165 (2006), pp. 710–718 (cit. on p. 4-16).
the authors may not have been quite stringent enough in their assessment of adequacy of predictions; letter to the editor submitted
.
[169] Paul T. von Hippel. Regression with missing Ys: An improved strategy for analyzing multiple imputed data. In:
Soc Meth 37.1 (2007), pp. 83–117 (cit. on p. 3-4).
[170] Howard Wainer. Finding what is not there through the unfortunate binning of results: The Mendel effect. In:
Chance 19.1 (2006), pp. 49–56 (cit. on pp. 2-13, 2-16).
can find bins that yield either positive or negative association; especially pertinent when effects are small; With four parameters, I can
fit an elephant; with five, I can make it wiggle its trunk. - John von Neumann
.
[171] Hansheng Wang and Chenlei Leng. Unified LASSO estimation by least squares approximation. In: J Am Stat
Assoc 102 (2007), pp. 1039–1048. doi: 10.1198/016214507000000509. url: http://dx.doi.org/10.1198/016214507000000509 (cit. on p. 2-33).
[172] S. Wang et al. Hierarchically penalized Cox regression with grouped variables. In: Biometrika 96.2 (2009),
pp. 307–322 (cit. on p. 2-34).
[173] Yohanan Wax. Collinearity diagnosis for a relative risk regression analysis: An application to assessment of diet–cancer relationship in epidemiological studies. In: Stat Med 11 (1992), pp. 1273–1287 (cit. on p. 4-22).
[174] Ian R. White and John B. Carlin. Bias and efficiency of multiple imputation compared with complete-case analysis
for missing covariate values. In: Stat Med 29 (2010), pp. 2920–2931 (cit. on p. 3-15).
[175] Ian R. White and Patrick Royston. Imputing missing covariate values for the Cox model. In: Stat Med 28 (2009),
pp. 1982–1998 (cit. on p. 3-3).
approach to using event time and censoring indicator as predictors in the imputation model for missing baseline covariates; recommended
an approximation using the event indicator and the cumulative hazard transformation of time, without their interaction
.
[176] Ian R. White, Patrick Royston, and Angela M. Wood. Multiple imputation using chained equations: Issues and
guidance for practice. In: Stat Med 30.4 (2011), pp. 377–399 (cit. on pp. 3-1, 3-10, 3-15, 3-19).
practical guidance for the use of multiple imputation using chained equations; MICE; imputation models for different types of target
variables; PMM choosing at random from among a few closest matches; choosing number of multiple imputations by a reproducibility
argument, suggesting 100f imputations when f is the fraction of cases that are incomplete
.
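A quick R illustration of that rule of thumb (the dataset is an arbitrary built-in example):

## f = fraction of incomplete cases; suggested number of imputations = 100f
f <- mean(!complete.cases(airquality))   # fraction of rows with any NA
ceiling(100 * f)                         # suggested number of imputations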
[177] John Whitehead. Sample size calculations for ordered categorical data. In: Stat Med 12 (1993). See letter to
editor SM 15:1065-6 for binary case; see errata in SM 13:871 1994; see kol95com, jul96sam, pp. 2257–2271 (cit. on
p. 4-17).
[178] Ryan E. Wiegand. Performance of using multiple stepwise algorithms for variable selection. In: Stat Med 29
(2010), pp. 1647–1659 (cit. on p. 4-14).
fruitless to try different stepwise methods and look for agreement; the methods will agree on the wrong model
.
[179] Daniela M. Witten and Robert Tibshirani. Testing significance of features by lassoed principal components. In:
Ann Appl Stat 2.3 (2008), pp. 986–1012 (cit. on p. 2-34).
reduction in false discovery rates over using a vector of t-statistics; borrowing strength across genes; one would not expect a single gene
to be associated with the outcome, since, in practice, many genes work together to effect a particular phenotype. LPC effectively downweights individual genes that are associated with the outcome but that do not share an expression pattern with a larger group of genes,
and instead favors large groups of genes that appear to be differentially expressed.; regress principal components on outcome; sparse
principal components
.
[180] S. N. Wood. Generalized Additive Models: An Introduction with R. ISBN 9781584884743. Boca Raton, FL: Chapman & Hall/CRC, 2006 (cit. on p. 2-35).
[181] C. F. J. Wu. Jackknife, bootstrap and other resampling methods in regression analysis. In: Ann Stat 14.4 (1986),
pp. 1261–1350 (cit. on p. 5-17).
[182] Shifeng Xiong. Some notes on the nonnegative garrote. In: Technometrics 52.3 (2010), pp. 349–361 (cit. on
p. 2-34).
... to select tuning parameters, it may be unnecessary to optimize a model selection criterion repeatedly; natural selection of penalty
function
.
[183] Jianming Ye. On measuring and correcting the effects of data mining and model selection. In: J Am Stat Assoc
93 (1998), pp. 120–131 (cit. on p. 1-14).
[184] F. W. Young, Y. Takane, and J. de Leeuw. The principal components of mixed measurement level multivariate
data: An alternating least squares method with optimal scaling features. In: Psychometrika 43 (1978), pp. 279–281
(cit. on p. 4-26).
[185] Recai M. Yucel and Alan M. Zaslavsky. Using calibration to improve rounding in imputation. In: Am Statistician
62.2 (2008), pp. 125–129 (cit. on p. 3-17).
using rounding to impute binary variables using techniques for continuous data; uses the method to solve for the cutpoint for a continuous
estimate to be converted into a binary value; method should be useful in more general situations; the idea is to duplicate the entire dataset, and in the second half of the new dataset set all non-missing values of the target variable to missing; multiply impute these now-missing
values and compare them to the actual values
.
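A hedged R sketch of that calibration idea (the data, the simple lm imputation model, and the quantile rule are invented stand-ins for the authors' full multiple-imputation procedure):

## Calibrated rounding: duplicate the data, blank the binary target in the
## copy, impute it on a continuous scale, then choose the rounding cutpoint
## so that the imputed prevalence matches the observed prevalence
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))             # fully observed binary target
d <- data.frame(x, y)
d2 <- d
d2$y <- NA                               # duplicated half with target missing
fit  <- lm(y ~ x, data = d)              # toy continuous imputation model
pred <- predict(fit, newdata = d2) +
        rnorm(n, sd = summary(fit)$sigma)    # impute with residual noise
cutpt <- quantile(pred, 1 - mean(d$y))   # calibrate cutpoint to prevalence
y.imp <- as.integer(pred > cutpt)
c(observed = mean(d$y), imputed = mean(y.imp))   # prevalences should agree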
[186] Hao H. Zhang and Wenbin Lu. Adaptive lasso for Cox's proportional hazards model. In: Biometrika 94 (2007),
pp. 691–703 (cit. on p. 2-33).
penalty function has ratios against original MLE; scale-free lasso
.
[187] Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. In: J Comp Graph Stat 15
(2006), pp. 265–286 (cit. on p. 2-34).
principal components analysis that shrinks some loadings to zero
.
[188] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. In: J Roy Stat Soc B 67.2
(2005), pp. 301–320 (cit. on p. 2-33).
R packages written by FE Harrell are freely available from CRAN, and are managed at https://github.com/harrelfe.