Course Regression Model Strategies PDF

Regression Modeling Strategies
Frank E Harrell Jr
Department of Biostatistics
Vanderbilt University School of Medicine
Nashville TN 37232 USA
f.harrell@vanderbilt.edu
biostat.mc.vanderbilt.edu/rms
Google group regmod
General questions: stats.stackexchange.com, tag regression-strategies
Supplemental material: biostat.mc.vanderbilt.edu/ClinStat:
Biostatistics for Biomedical Research
Joint Statistical Meetings
Copyright 1995-2016 FE Harrell
Chicago
All Rights Reserved
2016-07-31
Contents
1 Introduction
1-1
1.1
Hypothesis Testing, Estimation, and Prediction . . . . . . . . . . . . . 1-1
1.2
Examples of Uses of Predictive Multivariable Modeling . . . . . . . . . 1-3
1.3
Misunderstandings about Prediction vs. Classification . . . . . . . . . . 1-5
1.4
Planning for Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 1-10
1.5
Choice of the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-13
1.6
Model uncertainty / Data-driven Model Specification . . . . . . . . . . 1-14
2 General Aspects of Fitting Regression Models
2-1
2.1
Notation for Multivariable Regression Models . . . . . . . . . . . . . . 2-5
2.2
Model Formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6
2.3
Interpreting Model Parameters . . . . . . . . . . . . . . . . . . . . . . 2-7
2.4
2.3.1
Nominal Predictors . . . . . . . . . . . . . . . . . . . . . . . . 2-7
2.3.2
Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-8
2.3.3
Example: Inference for a Simple Model . . . . . . . . . . . . . 2-9
2.3.4
Review of Composite (Chunk) Tests . . . . . . . . . . . . . . . 2-11
Relaxing Linearity Assumption for Continuous Predictors . . . . . . . . 2-13

ii
CONTENTS
2.5
iii
2.4.1
Avoiding Categorization . . . . . . . . . . . . . . . . . . . . . 2-13
2.4.2
Simple Nonlinear Terms . . . . . . . . . . . . . . . . . . . . . 2-17
2.4.3
Splines for Estimating Shape of Regression Function and Determining Predictor Transformations . . . . . . . . . . . . . . . . 2-18
2.4.4
Cubic Spline Functions . . . . . . . . . . . . . . . . . . . . . . 2-20
2.4.5
Restricted Cubic Splines . . . . . . . . . . . . . . . . . . . . . 2-21
2.4.6
Choosing Number and Position of Knots . . . . . . . . . . . . 2-25
2.4.7
Nonparametric Regression . . . . . . . . . . . . . . . . . . . . 2-27
2.4.8
Advantages of Regression Splines over Other Methods . . . . . 2-29
Recursive Partitioning: Tree-Based Models . . . . . . . . . . . . . . . 2-31

2.5.1
New Directions in Predictive Modeling . . . . . . . . . . . . . 2-32
2.6
Multiple Degree of Freedom Tests of Association . . . . . . . . . . . . 2-36
2.7
Assessment of Model Fit . . . . . . . . . . . . . . . . . . . . . . . . . 2-38

2.7.1
Regression Assumptions . . . . . . . . . . . . . . . . . . . . . 2-38
2.7.2
Modeling and Testing Complex Interactions . . . . . . . . . . . 2-42
2.7.3
Fitting Ordinal Predictors . . . . . . . . . . . . . . . . . . . . 2-47
2.7.4
Distributional Assumptions . . . . . . . . . . . . . . . . . . . 2-48
3 Missing Data
3-1
3.1
Types of Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
3.2
Prelude to Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.3
Missing Values for Different Types of Response Variables . . . . . . . . 3-3
3.4
Problems With Simple Alternatives to Imputation
. . . . . . . . . . . 3-4
CONTENTS
iv
3.5
Strategies for Developing an Imputation Model . . . . . . . . . . . . . 3-7
3.6
Single Conditional Mean Imputation . . . . . . . . . . . . . . . . . . . 3-11
3.7
Predictive Mean Matching . . . . . . . . . . . . . . . . . . . . . . . . 3-12
3.8
Multiple Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
3.9
Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-17
3.10 Summary and Rough Guidelines . . . . . . . . . . . . . . . . . . . . . 3-19

3.10.1 Effective Sample Size . . . . . . . . . . . . . . . . . . . . . . 3-20
4 Multivariable Modeling Strategies
4.1
4-1
Prespecification of Predictor Complexity Without Later Simplification . 4-3

4.1.1
Learning From a Saturated Model . . . . . . . . . . . . . . . . 4-5
4.1.2
Using Marginal Generalized Rank Correlations . . . . . . . . . 4-7
4.2
Checking Assumptions of Multiple Predictors Simultaneously . . . . . . 4-9
4.3
Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10
4.4
Overfitting and Limits on Number of Predictors
4.5
Shrinkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-18
4.6
Collinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-21
4.7
Data Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-23
. . . . . . . . . . . . 4-16
4.7.1
Redundancy Analysis . . . . . . . . . . . . . . . . . . . . . . . 4-24
4.7.2
Variable Clustering . . . . . . . . . . . . . . . . . . . . . . . . 4-25
4.7.3
Transformation and Scaling Variables Without Using Y
4.7.4
Simultaneous Transformation and Imputation . . . . . . . . . . 4-28
. . . . 4-26
CONTENTS
4.7.5
Simple Scoring of Variable Clusters . . . . . . . . . . . . . . . 4-32
4.7.6
Simplifying Cluster Scores . . . . . . . . . . . . . . . . . . . . 4-33
4.7.7
How Much Data Reduction Is Necessary? . . . . . . . . . . . . 4-33
4.8
Other Approaches to Predictive Modeling . . . . . . . . . . . . . . . . 4-36
4.9
Overly Influential Observations . . . . . . . . . . . . . . . . . . . . . 4-36
4.10 Comparing Two Models . . . . . . . . . . . . . . . . . . . . . . . . . 4-39

4.11 Improving the Practice of Multivariable Prediction . . . . . . . . . . . 4-41
4.12 Summary: Possible Modeling Strategies . . . . . . . . . . . . . . . . . 4-44
4.12.1 Developing Predictive Models . . . . . . . . . . . . . . . . . . 4-44
4.12.2 Developing Models for Effect Estimation . . . . . . . . . . . . 4-46
4.12.3 Developing Models for Hypothesis Testing . . . . . . . . . . . 4-47
5 Describing, Resampling, Validating, and Simplifying the Model
5.1
5-1
Describing the Fitted Model . . . . . . . . . . . . . . . . . . . . . . . 5-1

5.1.1
Interpreting Effects . . . . . . . . . . . . . . . . . . . . . . . . 5-1
5.1.2
Indexes of Model Performance . . . . . . . . . . . . . . . . . . 5-2
5.2
The Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-6
5.3
Model Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-11

5.3.1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-11
5.3.2
Which Quantities Should Be Used in Validation? . . . . . . . . 5-14
5.3.3
Data-Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . 5-15
5.3.4
Improvements on Data-Splitting: Resampling . . . . . . . . . . 5-17
CONTENTS
vi
5.3.5
Validation Using the Bootstrap . . . . . . . . . . . . . . . . . 5-18
5.4
Bootstrapping Ranks of Predictors . . . . . . . . . . . . . . . . . . . 5-23
5.5
Simplifying the Final Model by Approximating It . . . . . . . . . . . . 5-25
5.6
5.5.1
Difficulties Using Full Models . . . . . . . . . . . . . . . . . . 5-25
5.5.2
Approximating the Full Model . . . . . . . . . . . . . . . . . . 5-25
How Do We Break Bad Habits? . . . . . . . . . . . . . . . . . . . . . 5-27
6 R Software
6-1
6.1
The R Modeling Language . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.2
User-Contributed Functions . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.3
The rms Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
6.4
Other Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
7 Modeling Longitudinal Responses using Generalized Least Squares
7-1
7.1
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
7.2
Model Specification for Effects on E(Y ) . . . . . . . . . . . . . . . . 7-3

7.2.1
Common Basis Functions . . . . . . . . . . . . . . . . . . . . 7-3
7.2.2
Model for Mean Profile . . . . . . . . . . . . . . . . . . . . . 7-3
7.2.3
Model Specification for Treatment Comparisons . . . . . . . . 7-4
7.3
Modeling Within-Subject Dependence . . . . . . . . . . . . . . . . . . 7-6
7.4
Parameter Estimation Procedure . . . . . . . . . . . . . . . . . . . . 7-10
7.5
Common Correlation Structures . . . . . . . . . . . . . . . . . . . . . 7-12
7.6
Checking Model Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
CONTENTS
vii
7.7
R Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-15
7.8
Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16

7.8.1
Graphical Exploration of Data . . . . . . . . . . . . . . . . . . 7-16
7.8.2
Using Generalized Least Squares . . . . . . . . . . . . . . . . . 7-20
8 Logistic Model Case Study: Survival of Titanic Passengers
8-1
8.1
Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
8.2
Exploring Trends with Nonparametric Regression . . . . . . . . . . . . 8-5
8.3
Binary Logistic Model with Casewise Deletion of Missing Values . . . . 8-8
8.4
Examining Missing Data Patterns . . . . . . . . . . . . . . . . . . . . 8-13
8.5
Single Conditional Mean Imputation . . . . . . . . . . . . . . . . . . . 8-16
8.6
Multiple Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-20
8.7
Summarizing the Fitted Model . . . . . . . . . . . . . . . . . . . . . . 8-24
9 Regression Models for Continuous Y and Case Study in Ordinal Regression

9-1
9.1
Dataset and Descriptive Statistics . . . . . . . . . . . . . . . . . . . . 9-3
9.2
The Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-7

9.2.1
Checking Assumptions of OLS and Other Models . . . . . . . . 9-7
9.3
Quantile Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
9.4
Ordinal Regression Models for Continuous Y . . . . . . . . . . . . . . 9-13
9.5
Ordinal Regression Applied to HbA1c . . . . . . . . . . . . . . . . . . 9-19

9.5.1
Checking Fit for Various Models Using Age . . . . . . . . . . . 9-19
CONTENTS
viii
9.5.2
Examination of BMI . . . . . . . . . . . . . . . . . . . . . . . 9-24
10 Case Study in Parametric Survival Modeling and Model Approximation10-1

10.1 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2
10.2 Checking Adequacy of Log-Normal Accelerated Failure Time Model . . 10-7
10.3 Summarizing the Fitted Model . . . . . . . . . . . . . . . . . . . . . . 10-14
10.4 Internal Validation of the Fitted Model Using the Bootstrap . . . . . . 10-18
10.5 Approximating the Full Model . . . . . . . . . . . . . . . . . . . . . . 10-20
Bibliography
11-1
in the right margin indicates a hyperlink to a YouTube video related to the subject.
in the right margin is a hyperlink to an audio file elaborating on the notes. Red
letters and numbers in the right margin are cues referred to within the audio recordings.
CONTENTS
ix
Course Philosophy
Satisfaction of model assumptions improves precision and
increases statistical power
It is more productive to make a model fit step by step (e.g.,
transformation estimation) than to postulate a simple model
and find out what went wrong
Graphical methods should be married to formal inference
Overfitting occurs frequently, so data reduction and model
validation are important
Software without multiple facilities for assessing and fixing
model fit may only seem to be user-friendly
Carefully fitting an improper model is better than badly fitting (and overfitting) a well-chosen one
Methods which work for all types of regression models are
the most valuable.
In most research projects the cost of data collection far outweighs the cost of data analysis, so it is important to use the
most efficient and accurate modeling techniques, to avoid
categorizing continuous variables, and to not remove data
from the estimation sample just to be able to validate the
CONTENTS
model.
The bootstrap is a breakthrough for statistical modeling and
model validation.
Using the data to guide the data analysis is almost as dangerous as not doing so.
A good overall strategy is to decide how many degrees
of freedom (i.e., number of regression parameters) can be
spent, where they should be spent, to spend them with no
regrets.
See the excellent text Clinical Prediction Models by Steyerberg [151].
Chapter 1
Introduction
1.1
Hypothesis Testing, Estimation, and Prediction

Even when only testing H0 a model based approach has advantages:
Permutation and rank tests not as useful for estimation
Cannot readily be extended to cluster sampling or repeated
measurements
Models generalize tests
2-sample t-test, ANOVA
multiple linear regression
Wilcoxon, Kruskal-Wallis, Spearman
proportional odds ordinal logistic model
1-1
CHAPTER 1. INTRODUCTION
1-2
log-rank Cox
Models not only allow for multiplicity adjustment but for
shrinkage of estimates
Statisticians comfortable with P -value adjustment but
fail to recognize that the difference between the most
different treatments is badly biased
Statistical estimation is usually model-based
Relative effect of increasing cholesterol from 200 to 250
mg/dl on hazard of death, holding other risk factors constant
Adjustment depends on how other risk factors relate to hazard
Usually interested in adjusted (partial) effects, not unadjusted (marginal or crude) effects
1-3
1.2
Examples of Uses of Predictive Multivariable

Modeling
C
Financial performance, consumer purchasing, loan pay-back

Ecology
Product life
Employment discrimination
Medicine, epidemiology, health services research
Probability of diagnosis, time course of a disease
Comparing non-randomized treatments
Getting the correct estimate of relative effects in randomized
studies requires covariable adjustment if model is nonlinear
Crude odds ratios biased towards 1.0 if sample heterogeneous
Estimating absolute treatment effect (e.g., risk difference)
Use e.g. difference in two predicted probabilities
Cost-effectiveness ratios
1-4
incremental cost / incremental ABSOLUTE benefit

most studies use avg. cost difference / avg. benefit, which
may apply to no one
1-5
1.3
Misunderstandings about Prediction vs. Classification

D
Many analysts desire to develop classifiers instead of predictions

Suppose that
1. response variable is binary
2. the two levels represent a sharp dichotomy with no gray
zone (e.g., complete success vs. total failure with no possibility of a partial success)
3. one is forced to assign (classify) future observations to
only these two choices
4. the cost of misclassification is the same for every future
observation, and the ratio of the cost of a false positive
to the cost of a false negative equals the (often hidden)
ratio implied by the analysts classification rule
Then classification is still suboptimal for driving the development of a predictive instrument as well as for hypothesis
testing and estimation
Classification and its associated classification accuracy measure
the proportion classified correctlyare very sensitive to the
relative frequencies of the outcome variable. If a classifier is
applied to another dataset with a different outcome preva-
1-6
lence, the classifier may no longer apply.

Far better is to use the full information in the data to develope a probability model, then develop classification rules
on the basis of estimated probabilities
power, precision, decision making
Classification is more problematic if response variable is ordinal or continuous or the groups are not truly distinct (e.g.,
disease or no disease when severity of disease is on a continuum); dichotomizing it up front for the analysis is not
appropriate
minimum loss of information (when dichotomization is at
the median) is large
may require the sample size to increase manyfold to
compensate for loss of information [55]
Two-group classification represents artificial forced choice
best option may be no choice, get more data
Unlike prediction (e.g., of absolute risk), classification implicitly uses utility (loss; cost of false positive or false negative) functions
Hidden problems:
Utility function depends on variables not collected (sub-
1-7
jects preferences) that are available only at the decision

point
Assumes every subject has the same utility function
Assumes this function coincides with the analysts
Formal decision analysis uses
optimum predictions using all available data
subject-specific utilities, which are often based on variables not predictive of the outcome
ROC analysis is misleading except for the special case of
mass one-time group decision making with unknowable utilitiesa
See [167, 22, 58, 18, 52, 62].
Accuracy score used to drive model building should be a continuous score that utilizes all of the information in the data.
In summary:
a To make an optimal decision you need to know all relevant data about an individual (used to estimate the probability of an outcome), and
the utility (cost, loss function) of making each decision. Sensitivity and specificity do not provide this information. For example, if one estimated
that the probability of a disease given age, sex, and symptoms is 0.1 and the costof a false positive equaled the cost of a false negative, one
would act as if the person does not have the disease. Given other utilities, one would make different decisions. If the utilities are unknown, one
gives the best estimate of the probability of the outcome to the decision maker and let her incorporate her own unspoken utilities in making an
optimum decision for her.
Besides the fact that cutoffs do not apply to individuals, only to groups, individual decision making does not utilize sensitivity and specificity.
For an individual we can compute Prob(Y = 1|X = x); we dont care about Prob(Y = 1|X > c), and an individual having X = x would be
quite puzzled if she were given Prob(X > c|future unknown Y) when she already knows X = x so X is no longer a random variable.
Even when group decision making is needed, sensitivity and specificity can be bypassed. For mass marketing, for example, one can rank order
individuals by the estimated probability of buying the product, to create a lift curve. This is then used to target the k most likely buyers where
k is chosen to meet total program cost constraints.
1-8
Classification is a forced choice a decision.

Decisions require knowledge of the cost or utility of making
an incorrect decision.
Predictions are made without knowledge of utilities.
A prediction can lead to better decisions than classification.
For example suppose that one has an estimate of the risk
of an event, P . One might make a decision if P < 0.10
or P > 0.90 in some situations, even without knowledge of
utilities. If on the other hand P = 0.6 or the confidence
interval for P is wide, one might
make no decision and instead opt to collect more data
make a tentative decision that is revisited later
make a decision using other considerations such as the
infusion of new resources that allow targeting a larger
number of potential customers in a marketing campaign
F
The Dichotomizing Motorist
The speed limit is 60.

I am going faster than the speed limit.
Will I be caught?
1-9
An answer by a dichotomizer:
Are you going faster than 70?

An answer from a better dichotomizer:
If you are among other cars, are you going faster than 73?
If you are exposed are your going faster than 67?
Better:
How fast are you going and are you exposed?

Analogy to most medical diagnosis research in which +/- diagnosis is a false dichotomy of an underlying disease severity:
J
The speed limit is moderately high.

I am going fairly fast.
Will I be caught?
1-10
1.4
Planning for Modeling

K
Chance that predictive model will be used [136]

Response definition, follow-up
Variable definitions
Observer variability
Missing data
Preference for continuous variables
Subjects
Sites
What can keep a sample of data from being appropriate for
modeling:
1. Most important predictor or response variables not collected
2. Subjects in the dataset are ill-defined or not representative
of the population to which inferences are needed
3. Data collection sites do not represent the population of sites
4. Key variables missing in large numbers of subjects
1-11
5. Data not missing at random

6. No operational definitions for key variables and/or measurement errors severe
7. No observer variability studies done
What else can go wrong in modeling?
1. The process generating the data is not stable.
2. The model is misspecified with regard to nonlinearities or
interactions, or there are predictors missing.
3. The model is misspecified in terms of the transformation of
the response variable or the models distributional assumptions.
4. The model contains discontinuities (e.g., by categorizing
continuous predictors or fitting regression shapes with sudden changes) that can be gamed by users.
5. Correlations among subjects are not specified, or the correlation structure is misspecified, resulting in inefficient parameter estimates and overconfident inference.
6. The model is overfitted, resulting in predictions that are too
extreme or positive associations that are false.
7. The user of the model relies on predictions obtained by extrapolating to combinations of predictor values well outside
the range of the dataset used to develop the model.
8. Accurate and discriminating predictions can lead to behavior
changes that make future predictions inaccurate.
1-12
Iezzoni [87] lists these dimensions to capture, for patient outcome studies:
1. age
2. sex
3. acute clinical stability
4. principal diagnosis
5. severity of principal diagnosis
6. extent and severity of comorbidities
7. physical functional status
8. psychological, cognitive, and psychosocial functioning
9. cultural, ethnic, and socioeconomic attributes and behaviors
10. health status and quality of life
11. patient attitudes and preferences for outcomes
General aspects to capture in the predictors:
1. baseline measurement of response variable
2. current status
3. trajectory as of time zero, or past levels of a key variable
4. variables explaining much of the variation in the response
5. more subtle predictors whose distributions strongly differ between levels of the key variable of interest in an observational
study
1-13
1.5
Choice of the Model

P
In biostatistics and epidemiology and most other areas we

usually choose model empirically
Model must use data efficiently
Should model overall structure (e.g., acute vs. chronic)
Robust models are better
Should have correct mathematical structure (e.g., constraints
on probabilities)
1-14
1.6
Model uncertainty / Data-driven Model Specification

Q
Standard errors, C.L., P -values, R wrong if computed as

if the model pre-specified
Stepwise variable selection is widely used and abused
Bootstrap can be used to repeat all analysis steps to properly
penalize variances, etc.
Ye [183]: generalized degrees of freedom (GDF) for any
data mining or model selection procedure based on least
squares
Example: 20 candidate predictors, n = 22, forward stepwise, best 5-variable model: GDF=14.1
Example: CART, 10 candidate predictors, n = 100, 19
nodes: GDF=76
See [112] for an approach involving adding noise to Y to
improve variable selection
Chapter 2
General Aspects of Fitting Regression

Models
Regression modeling meets many analytic needs:
Prediction, capitalizing on efficient estimation methods such
as maximum likelihood and the predominant additivity in a
variety of problems
E.g.: effects of age, smoking, and air quality add to predict lung capacity
When effects are predominantly additive, or when there
arent too many interactions and one knows the likely
interacting variables in advance, regression can beat machine learning techniques that assume interaction effects
are likely to be as strong as main effects
Separate effects of variables (especially exposure and treatment)
2-1
CHAPTER 2. GENERAL ASPECTS OF FITTING REGRESSION MODELS
2-2
Hypothesis testing
Deep understanding of uncertainties associated with all model
components
Simplest example: confidence interval for the slope of a
predictor
Confidence intervals for predicted values; simultaneous
confidence intervals for a series of predicted values
E.g.: confidence band for Y over a series of values of
X
Alternative: Stratification
Cross-classify subjects on the basis of the Xs, estimate a
property of Y for each stratum
Only handles a small number of Xs
Does not handle continuous X
Alternative: Single Trees (recursive partitioning/CART)
Interpretable because they are over-simplified and usually
wrong
Cannot separate effects
2-3
Finds spurious interactions

Require huge sample size
Do not handle continuous X effectively; results in very heterogeneous nodes because of incomplete conditioning
Tree structure is unstable so insights are fragile
Alternative: Machine Learning
E.g. random forests, bagging, boosting, support vector machines, neural networks
Allows for high-order interactions and does not require prespecification of interaction terms
Almost automatic; can save analyst time and do the analysis
in one step (long computing time)
Uninterpretable black box
Effects of individual predictors are not separable
Interaction effects (e.g., differential treatment effect = precision medicine = personalized medicine) not available
Because of not using prior information about dominance of
additivity, can require 200 events per candidate predictor
2-4
when Y is binary [132]

Logistic regression may require 20 events per candidate
predictor
Can create a demand for big data where additive statistical models can work on moderate-size data
2-5
2.1
Notation for Multivariable Regression Models

A
Weighted sum of a set of independent or predictor variables

Interpret parameters and state assumptions by linearizing
model with respect to regression coefficients
Analysis of variance setups, interaction effects, nonlinear effects
Examining the 2 regression assumptions
Y
X
0
1, . . . , p
X
response (dependent) variable

X1, X2, . . . , Xp list of predictors
0, 1, . . . , p regression coefficients
intercept parameter(optional)
weights or regression coefficients
0 + 1X1 + . . . + pXp, X0 = 1
Model: connection between X and Y

C(Y |X) : property of distribution of Y given X, e.g.
C(Y |X) = E(Y |X) or Prob{Y = 1|X}.
2-6
2.2
Model Formulations
General regression model
C(Y |X) = g(X).
General linear regression model
C(Y |X) = g(X).
Examples
C(Y |X) =
E(Y |X) =
X,
Y |X
N (X, 2)
C(Y |X) = Prob{Y = 1|X} = (1 + exp(X))1
Linearize: h(C(Y |X)) = X, h(u) = g 1(u)
Example:
C(Y |X) = Prob{Y = 1|X} = (1 + exp(X))1
u
)
h(u) = logit(u) = log(
1u
h(C(Y |X)) = C 0(Y |X) (link)
General linear regression model:
C 0(Y |X) = X.
2-7
2.3
Interpreting Model Parameters

Suppose that Xj is linear and doesnt interact with other Xsa.
E
C 0(Y |X) = X = 0 + 1X1 + . . . + pXp

j = C 0(Y |X1, X2, . . . , Xj + 1, . . . , Xp)
C 0(Y |X1, X2, . . . , Xj , . . . , Xp)
Drop 0 from C 0 and assume C(Y |X) is property of Y that is
linearly related to weighted sum of Xs.
2.3.1
Nominal Predictors
Nominal (polytomous) factor with k levels : k 1 dummy
variables. E.g. T = J, K, L, M :
C(Y |T = J)
C(Y |T = K)
C(Y |T = L)
C(Y |T = M )
=
=
=
=
0
0 + 1
0 + 2
0 + 3.
C(Y |T ) = X = 0 + 1X1 + 2X2 + 3X3,

where
X1 = 1 if T = K, 0 otherwise
a Note that it is not necessary to hold constant all other variables to be able to interpret the effect of one predictor. It is sufficient to hold
constant the weighted sum of all the variables other than Xj . And in many cases it is not physically possible to hold other variables constant
while varying one, e.g., when a model contains X and X 2 (David Hoaglin, personal communication).
2-8
X2 = 1 if T = L, 0 otherwise
X3 = 1 if T = M, 0 otherwise.
The test for any differences in the property C(Y ) between treatments is H0 : 1 = 2 = 3 = 0.
2.3.2
Interactions
G
X1 and X2, effect of X1 on Y depends on level of X2. One

way to describe interaction is to add X3 = X1X2 to model:
C(Y |X) = 0 + 1X1 + 2X2 + 3X1X2.
C(Y |X1 + 1, X2)
=
+
C(Y |X1, X2)

0 + 1(X1 + 1) + 2X2
3(X1 + 1)X2
[0 + 1X1 + 2X2 + 3X1X2]
1 + 3X2.
One-unit increase in X2 on C(Y |X) : 2 + 3X1.

Worse interactions:
If X1 is binary, the interaction may take the form of a difference
in shape (and/or distribution) of X2 vs. C(Y ) depending on
whether X1 = 0 or X1 = 1 (e.g. logarithm vs. square root).
2-9
2.3.3
Example: Inference for a Simple Model

Postulate the model C(Y |age, sex) = 0 + 1age + 2(sex =
f ) + 3age(sex = f ) where sex = f is a dummy indicator
variable for sex=female, i.e., the reference cell is sex=maleb.
Model assumes
1. age is linearly related to C(Y ) for males,

2. age is linearly related to C(Y ) for females, and
3. interaction between age and sex is simple
4. whatever distribution, variance, and independence assumptions are appropriate for the model being considered.
Interpretations of parameters:
Parameter
0
1
2
3
Meaning
C(Y |age = 0, sex = m)
C(Y |age = x + 1, sex = m) C(Y |age = x, sex = m)
C(Y |age = 0, sex = f ) C(Y |age = 0, sex = m)
C(Y |age = x + 1, sex = f ) C(Y |age = x, sex = f )
[C(Y |age = x + 1, sex = m) C(Y |age = x, sex = m)]
3 is the difference in slopes (female male).

When a high-order effect such as an interaction effect is in
the model, be sure to interpret low-order effects by finding out
what makes the interaction effect ignorable. In our example,
the interaction effect is zero when age=0 or sex is male.
b You
can also think of the last part of the model as being 3 X3 , where X3 = age I[sex = f ].
2-10
Hypotheses that are usually inappropriate:
1. H0 : 1 = 0: This tests whether age is associated with Y

for males
2. H0 : 2 = 0: This tests whether sex is associated with Y
for zero year olds
More useful hypotheses follow. For any hypothesis need to
Write what is being tested
Translate to parameters tested
List the alternative hypothesis
Not forget what the test is powered to detect
Test against nonzero slope has maximum power when
linearity holds
If true relationship is monotonic, test for non-flatness will
have some but not optimal power
Test against a quadratic (parabolic) shape will have some
power to detect a logarithmic shape but not against a sine
wave over many cycles
Useful to write e.g. Ha : age is associated with C(Y ),
powered to detect a linear relationship
2-11
Most Useful Tests for Linear age sex Model

Null or Alternative Hypothesis
Mathematical
Statement
Effect of age is independent of sex or
H0 : 3 = 0
Effect of sex is independent of age or
age and sex are additive
age effects are parallel
age interacts with sex
Ha : 3 6= 0
age modifies effect of sex
sex modifies effect of age
sex and age are non-additive (synergistic)
age is not associated with Y
H0 : 1 = 3 = 0
Ha : 1 6= 0 or 3 6= 0
age is associated with Y
age is associated with Y for either
females or males
sex is not associated with Y
H0 : 2 = 3 = 0
sex is associated with Y
Ha : 2 6= 0 or 3 6= 0
sex is associated with Y for some
value of age
Neither age nor sex is associated with Y H0 : 1 = 2 = 3 = 0
Either age or sex is associated with Y
Ha : 1 6= 0 or 2 6= 0 or 3 6= 0
Note: The last test is called the global test of no association.

If an interaction effect present, there is both an age and a sex
effect. There can also be age or sex effects when the lines are
parallel. The global test of association (test of total association) has 3 d.f. instead of 2 (age + sex) because it allows for
unequal slopes.
2.3.4
Review of Composite (Chunk) Tests

M
In the model
y age + sex + weight + waist + tricep
2-12
we may want to jointly test the association between all body

measurements and response, holding age and sex constant.
This 3 d.f. test may be obtained two ways:
Remove the 3 variables and compute the change in SSR
or SSE
Test H0 : 3 = 4 = 5 = 0 using matrix algebra (e.g.,
anova(fit, weight, waist, tricep) if fit is a fit object
created by the R rms package)
2-13
2.4
Relaxing Linearity Assumption for Continuous Predictors

2.4.1
Avoiding Categorization
Natura non facit saltus
(Nature does not make jumps)
Gottfried Wilhelm Leibniz
Relationships seldom linear except when predicting one variable from itself measured earlier
Categorizing continuous predictors into intervals is a disaster; see references
[138, 2, 82, 100, 4, 11, 53, 135, 156, 24]
[114, 142, 5, 84, 118, 170, 55, 60] and Biostatistics for Biomedical
Research, Chapter 18.

Some problems caused by this approach:
1. Estimated values have reduced precision, and associated
tests have reduced power
2. Categorization assumes relationship between predictor and
response is flat within intervals; far less reasonable than
a linearity assumption in most cases
3. To make a continuous predictor be more accurately modeled when categorization is used, multiple intervals are
required
2-14
4. Because of sample size limitations in the very low and

very high range of the variable, the outer intervals (e.g.,
outer quintiles) will be wide, resulting in significant heterogeneity of subjects within those intervals, and residual
confounding
5. Categorization assumes that there is a discontinuity in
response as interval boundaries are crossed. Other than
the effect of time (e.g., an instant stock price drop after
bad news), there are very few examples in which such
discontinuities have been shown to exist.
6. Categorization only seems to yield interpretable estimates.
E.g. odds ratio for stroke for persons with a systolic blood
pressure > 160 mmHg compared to persons with a blood
pressure 160 mmHg interpretation of OR depends
on distribution of blood pressures in the sample (the proportion of subjects > 170, > 180, etc.). If blood pressure
is modeled as a continuous variable (e.g., using a regression spline, quadratic, or linear effect) one can estimate
the ratio of odds for exact settings of the predictor, e.g.,
the odds ratio for 200 mmHg compared to 120 mmHg.
7. Categorization does not condition on full information.
When, for example, the risk of stroke is being assessed for
a new subject with a known blood pressure (say 162 mmHg),
the subject does not report to her physician my blood
pressure exceeds 160 but rather reports 162 mmHg. The
risk for this subject will be much lower than that of a subject with a blood pressure of 200 mmHg.
2-15
8. If cutpoints are determined in a way that is not blinded

to the response variable, calculation of P -values and confidence intervals requires special simulation techniques;
ordinary inferential methods are completely invalid. E.g.:
cutpoints chosen by trial and error utilizing Y , even informally P -values too small and CLs not accuratec.
9. Categorization not blinded to Y biased effect estimates [4, 142]
10. Optimal cutpoints do not replicate over studies. Hollander et al. [84] state that . . . the optimal cutpoint approach has disadvantages.
One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely
difficult or even impossible. Altman et al. point out this problem for studies
of the prognostic relevance of the S-phase fraction in breast cancer published
in the literature. They identified 19 different cutpoints used in the literature;
some of them were solely used because they emerged as the optimal cutpoint
in a specific data set. In a meta-analysis on the relationship between cathepsinD content and disease-free survival in node-negative breast cancer patients,
12 studies were in included with 12 different cutpoints . . . Interestingly, neither
cathepsin-D nor the S-phase fraction are recommended to be used as prognostic
markers in breast cancer in the recent update of the American Society of Clinical Oncology. Giannoni et al. [60] demonstrated that many claimed optimal
cutpoints are just the observed median values in the sample, which happens
to optimize statistical power for detecting a separation in outcomes.
11. Disagreements in cutpoints (which are bound to happen

whenever one searches for things that do not exist) cause
c If
a cutpoint is chosen that minimizes the P -value and the resulting P -value is 0.05, the true type I error can easily be above 0.5 [84].
2-16
severe interpretation problems. One study may provide an

odds ratio for comparing body mass index (BMI) > 30
with BMI 30, another for comparing BMI > 28 with
BMI 28. Neither of these has a good definition and
the two estimates are not comparable.
12. Cutpoints are arbitrary and manipulatable; cutpoints can
be found that can result in both positive and negative
associations [170].
13. If a confounder is adjusted for by categorization, there
will be residual confounding that can be explained away
by inclusion of the continuous form of the predictor in the
model in addition to the categories.
To summarize: The use of a (single) cutpoint c makes many
assumptions, including:
1. Relationship between X and Y is discontinuous at X = c
and only X = c
2. c is correctly found as the cutpoint
3. X vs. Y is flat to the left of c
4. X vs. Y is flat to the right of c
5. The choice of c does not depend on the values of other
predictors
Interactive demonstration of power loss of categorization vs.
straight line and quadratic fits in OLS, with varying degree of
nonlinearity and noise added to X (must run in RStudio)
require ( Hmisc )
getRs ( catgNoise.r )
2-17
Exampled of misleading results from creating intervals (here,

deciles) of a continuous predictor. Final interval is extremely
heterogeneous and is greatly influenced by very large glycohemoglobin values, creating the false impression of an inflection
point at 5.9.
2.4.2
Simple Nonlinear Terms

C(Y |X1) = 0 + 1X1 + 2X12.
P
H0 : model is linear in X1 vs. Ha : model is quadratic in

X1 H0 : 2 = 0.
Test of linearity may be powerful if true model is not extremely non-parabolic
d From
NHANES III; Diabetes Care 32:1327-34; 2009 adapted from Diabetes Care 20:1183-1197; 1997.
2-18
Predictions not accurate in general as many phenomena are

non-quadratic
Can get more flexible fits by adding powers higher than 2
But polynomials do not adequately fit logarithmic functions
or threshold effects, and have unwanted peaks and valleys.
2.4.3
Splines for Estimating Shape of Regression Function

and Determining Predictor Transformations
Q
Draftsmans spline : flexible strip of metal or rubber used to

trace curves.
Spline Function : piecewise polynomial
Linear Spline Function : piecewise linear function
Bilinear regression: model is 0 +1X if X a, 2 +3X
if X > a.
Problem with this notation: two lines not constrained to
join
To force simple continuity: 0 +1X+2(Xa)I[X >
a] = 0 + 1X1 + 2X2, where X2 = (X1 a) I[X1 >
a].
Slope is 1, X a, 1 + 2, X > a.
2-19
2 is the slope increment as you pass a

More generally: X-axis divided into intervals with endpoints
a, b, c (knots).
f (X) = 0 + 1X + 2(X a)+ + 3(X b)+
+ 4(X c)+,
where
(u)+ = u, u > 0,
0, u 0.
R
f (X)
= 0 + 1X,
= 0 + 1X + 2(X a)
= 0 + 1X + 2(X a) + 3(X b)
= 0 + 1X + 2(X a)
+3(X b) + 4(X c)
Xa
a<Xb
b<Xc
c < X.
C(Y |X) = f (X) = X,

where X = 0 + 1X1 + 2X2 + 3X3 + 4X4, and
X1 = X X2 = (X a)+
X3 = (X b)+ X4 = (X c)+.
Overall linearity in X can be tested by testing H0 : 2 = 3 =
4 = 0.
2-20
f(X)
X
Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5.
2.4.4
Cubic Spline Functions

Cubic splines are smooth at knots (function, first and second
derivatives agree) cant see joins.
f (X) = 0 + 1X + 2X 2 + 3X 3
+ 4(X a)3+ + 5(X b)3+ + 6(X c)3+
= X
X1 = X
X2 = X 2
X3 = X 3 X4 = (X a)3+
X5 = (X b)3+ X6 = (X c)3+.
k knots k + 3 coefficients excluding intercept.
X 2 and X 3 terms must be included to allow nonlinearity when
2-21
X < a.
2.4.5
Restricted Cubic Splines

T
Stone and Koo [155]: cubic splines poorly behaved in tails. Constrain function to be linear in tails.
k + 3 k 1 parameters [44].
To force linearity when X < a: X 2 and X 3 terms must be
omitted
To force linearity when X > last knot: last two s are redundant, i.e., are just combinations of the other s.
The restricted spline function with k knots t1, . . . , tk is given
by [44]
f (X) = 0 + 1X1 + 2X2 + . . . + k1Xk1,
where X1 = X and for j = 1, . . . , k 2,
Xj+1 = (X tj )3+ (X tk1)3+(tk tj )/(tk tk1)
+ (X tk )3+(tk1 tj )/(tk tk1).
Xj is linear in X for X tk .
For numerical behavior and to put all basis functions for X on
the same scale, R Hmisc and rms package functions by default
divide the terms above by = (tk t1)2.
require ( Hmisc )
x rcspline.eval ( seq (0 ,1 , .01 ) ,
knots = seq ( .05 , .95 , length =5) , inclx = T )
xm x
xm [ xm > .0106 ] NA
2-22
matplot ( x [ ,1] , xm , type = " l " , ylim = c (0 , .01 ) ,

xlab = expression ( X ) , ylab = , lty =1)
matplot ( x [ ,1] , x , type = " l " ,
xlab = expression ( X ) , ylab = , lty =1)
0.010
1.0
0.008
0.8
0.006
0.6
0.004
0.4
0.002
0.2
0.000
0.0
0.0
0.4
0.8
0.0
0.4
0.8
Figure 2.2: Restricted cubic spline component variables for k = 5 and knots at X = .05, .275, .5, .725, and .95. Nonlinear basis
functions are scaled by . The left panel is a ymagnification of the right panel. Fitted functions such as those in Figure 2.3 will be
linear combinations of these basis functions as long as knots are at the same locations used here.
x seq (0 , 1 , length =300)

for ( nk in 3:6) {
set.seed ( nk )
knots seq ( .05 , .95 , length = nk )
xx rcspline.eval (x , knots = knots , inclx = T )
for ( i in 1 : ( nk - 1) )
xx [ , i ] ( xx [ , i ] - min ( xx [ , i ]) ) /
( max ( xx [ , i ]) - min ( xx [ , i ]) )
for ( i in 1 : 20) {
beta 2 * runif ( nk-1 ) - 1
xbeta xx % * % beta + 2 * runif (1) - 1
xbeta ( xbeta - min ( xbeta ) ) /
( max ( xbeta ) - min ( xbeta ) )
if ( i == 1) {
plot (x , xbeta , type = " l " , lty =1 ,
xlab = expression ( X ) , ylab = , bty = " l " )
title ( sub = paste ( nk , " knots " ) , adj =0 , cex = .75 )
for ( j in 1 : nk )
arrows ( knots [ j ] , .04 , knots [ j ] , -.03 ,
angle =20 , length = .07 , lwd =1 .5 )
}
else lines (x , xbeta , col = i )
}
}
1.0
1.0
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
2-23
0.2
0.4
X
3 knots
1.0
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
0.0
0.2
X
5 knots
0.8
1.0
0.6
0.8
1.0
4 knots
1.0
0.0
0.6
0.4
X
6 knots
Figure 2.3: Some typical restricted cubic spline functions for k = 3, 4, 5, 6. The yaxis is X. Arrows indicate knots. These curves
were derived by randomly choosing values of subject to standard deviations of fitted functions being normalized.
2-24
Interactive demonstration of linear and cubic spline fitting, plus

ordinary 4th order polynomial. This can be run with RStudio
or in an ordinary R session.
require ( Hmisc )
getRs ( demoSpline.r )
getRs ( demoSpline.r , put = source )
#
#
if
if
using
not
RStudio
Once 0, . . . , k1 are estimated, the restricted cubic spline can

be restated in the form
f (X) = 0 + 1X + 2(X t1)3+ + 3(X t2)3+
+ . . . + k+1(X tk )3+
by dividing 2, . . . , k1 by and computing
k = [2 (t1 tk ) + 3 (t2 tk ) + . . .
+k1 (tk2 tk )]/(tk tk1 )
k+1 = [2 (t1 tk1 ) + 3 (t2 tk1 ) + . . .
+k1 (tk2 tk1 )]/(tk1 tk ).
A test of linearity in X can be obtained by testing

H0 : 2 = 3 = . . . = k1 = 0.
Example: [143]
2-25
2.4.6
Choosing Number and Position of Knots

W
Knots are specified in advance in regression splines

Locations not important in most situations [154, 48]
Place knots where data exist fixed quantiles of predictors
marginal distribution
Fit depends more on choice of k
k
3
4
5
.05
6 .05 .23
7 .025 .1833
Quantiles
.10
.5 .90
.05
.35 .65
.275 .5 .725
.41
.59 .77
.3417 .5 .6583
2-26
.95
.95
.95
.8167 .975
n < 100 replace outer quantiles with 5th smallest and 5th
largest X [155].
Choice of k:
Flexibility of fit vs. n and variance
Usually k = 3, 4, 5. Often k = 4
Large n (e.g. n 100) k = 5
Small n (< 30, say) k = 3
Can use Akaikes information criterion (AIC) [6, 163] to choose
k
This chooses k to maximize model likelihood ratio 2 2k.
See [64] for a comparison of restricted cubic splines, fractional
polynomials, and penalized splines.
2-27
2.4.7
Nonparametric Regression
Y
Estimate tendency (mean or median) of Y as a function of

X
Few assumptions
Especially handy when there is a single X
Plotted trend line may be the final result of the analysis
Simplest smoother: moving average
X: 1 2 3
5
8
Y : 2.1 3.8 5.7 11.1 17.2
|X = 2) = 2.1 + 3.8 + 5.7
E(Y
3
2+3+5
3.8 + 5.7 + 11.1
E(Y |X =
) =
3
3
overlap OK
problem in estimating E(Y ) at outer X-values
estimates very sensitive to bin width
Moving linear regression far superior to moving avg. (moving
flat line)
2-28
Clevelands [33] moving linear regression smoother loess (locally weighted least squares) is the most popular smoother.
To estimate central tendency of Y at X = x:
take all the data having X values within a suitable interval
about x (default is 32 of the data)
fit weighted least squares linear regression within this
neighborhood
points near x given the most weighte
points near extremes of interval receive almost no weight
loess works much better at extremes of X than moving
avg.
provides an estimate at each observed X; other estimates
obtained by linear interpolation
outlier rejection algorithm built-in
loess works for binary Y just turn off outlier detection
Other popular smoother: Friedmans super smoother
For loess or supsmu amount of smoothing can be controlled
by analyst
e Weight here means something different than regression coefficient. It means how much a point is emphasized in developing the regression
coefficients.
2-29
Another alternative: smoothing splinesf

Smoothers are very useful for estimating trends in residual
plots
2.4.8
Advantages of Regression Splines over Other Methods

Regression splines have several advantages [74]:
Parametric splines can be fitted using any existing regression
program
Regression coefficients estimated using standard techniques
(ML or least squares), formal tests of no overall association,
linearity, and additivity, confidence limits for the estimated
regression function are derived by standard theory.
The fitted function directly estimates transformation predictor should receive to yield linearity in C(Y |X).
Even when a simple transformation is obvious, spline function can be used to represent the predictor in the final model
(and the d.f. will be correct). Nonparametric methods do
not yield a prediction equation.
Extension to non-additive models.
Multi-dimensional nonparametric estimators often require
f These
place knots at all the observed data points but penalize coefficient estimates towards smoothness.
burdensome computations.
2-30
2-31
2.5
Recursive Partitioning: Tree-Based Models

Breiman, Friedman, Olshen, and Stone [21]: CART (Classification and Regression Trees) essentially model-free
Method:
Find predictor so that best possible binary split has maximum value of some statistic for comparing 2 groups
Within previously formed subsets, find best predictor and
split maximizing criterion in the subset
Proceed in like fashion until < k obs. remain to split
Summarize Y for the terminal node (e.g., mean, modal category)
Prune tree backward until it cross-validates as well as its
apparent accuracy, or use shrinkage
2-32
Advantages/disadvantages of recursive partitioning:
Does not require functional form for predictors

Does not assume additivity can identify complex interactions
Can deal with missing data flexibly
Interactions detected are frequently spurious
Does not use continuous predictors effectively
Penalty for overfitting in 3 directions
Often tree doesnt cross-validate optimally unless pruned
back very conservatively
Very useful in messy situations or those in which overfitting
is not as problematic (confounder adjustment using propensity scores [34]; missing value imputation)
See [8].
2.5.1
New Directions in Predictive Modeling

The approaches recommended in this course are
2-33
fitting fully pre-specified models without deletion ofinsignificant predictors

using data reduction methods (masked to Y ) to reduce the
dimensionality of the predictors and then fitting the number
of parameters the datas information content can support
use shrinkage (penalized estimation) to fit a large model
without worrying about the sample size.
The data reduction approach can yield very interpretable, stable models, but there are many decisions to be made when
using a two-stage (reduction/model fitting) approach, Newer
approaches are evolving, including the following. These new
approach handle continuous predictors well, unlike recursive
partitioning.
lasso (shrinkage using L1 norm favoring zero regression coefficients) [158, 153]
elastic net (combination of L1 and L2 norms that handles
the p > n case better than the lasso) [188]
adaptive lasso [186, 171]
more flexible lasso to differentially penalize for variable selection and for regression coefficient estimation [134]
group lasso to force selection of all or none of a group of
2-34
related variables (e.g., dummy variables representing a polytomous predictor)

group lasso-like procedures that also allow for variables within
a group to be removed [172]
sparse-group lasso using L1 and L2 norms to achieve spareness on groups and within groups of variables [146]
adaptive group lasso (Wang & Leng)
Breimans nonnegative garrote [182]
preconditioning, i.e., model simplification after developing
a black box predictive model [123, 122]
sparse principal components analysis to achieve parsimony
in data reduction [179, 187, 106, 105]
bagging, boosting, and random forests [80]
One problem prevents most of these methods from being ready
for everyday use:they require scaling predictors before fitting
the model. When a predictor is represented by nonlinear basis
functions, the scaling recommendations in the literature are not
sensible. There are also computational issues and difficulties
obtaining hypothesis tests and confidence intervals.
When data reduction is not required, generalized additive mod-
els [79, 180] should also be considered.
2-35
2-36
2.6
Multiple Degree of Freedom Tests of Association

H
C(Y |X) = 0 + 1X1 + 2X2 + 3X22,

H0 : 2 = 3 = 0 with 2 d.f. to assess association between X2
and outcome.
In the 5-knot restricted cubic spline model
C(Y |X) = 0 + 1X + 2X 0 + 3X 00 + 4X 000,
H0 : 1 = . . . = 4 = 0
Test of association: 4 d.f.

Insignificant dangerous to interpret plot
What to do if 4 d.f. test insignificant, 3 d.f. test for linearity
insig., 1 d.f. test sig. after delete nonlinear terms?
Grambsch and OBrien [65] elegantly described the hazards of
pretesting
Studied quadratic regression
Showed 2 d.f. test of association is nearly optimal even when
regression is linear if nonlinearity entertained
2-37
Considered ordinary regression model

E(Y |X) = 0 + 1X + 2X 2
Two ways to test association between X and Y
Fit quadratic model and test for linearity (H0 : 2 = 0)
F -test for linearity significant at = 0.05 level report as
the final test of association the 2 d.f. F test of H0 : 1 =
2 = 0
If the test of linearity insignificant, refit without the quadratic
term and final test of association is 1 d.f. test, H0 : 1 =
0|2 = 0
Showed that type I error >
Fairly accurate P -value obtained by instead testing against
F with 2 d.f. even at second stage
Cause: are retaining the most significant part of F
BUT if test against 2 d.f. can only lose power when compared with original F for testing both s
SSR from quadratic model > SSR from linear model
2-38
2.7
Assessment of Model Fit

2.7.1
Regression Assumptions
The general linear regression model is
C(Y |X) = X = 0 + 1X1 + 2X2 + . . . + k Xk .

Verify linearity and additivity. Special case:
C(Y |X) = 0 + 1X1 + 2X2,
where X1 is binary and X2 is continuous.
C(Y)
X1 = 1
X1 = 0
X2
Figure 2.4: Regression assumptions for one binary and one continuous predictor
Methods for checking fit:

1. Fit simple linear additive model and check examine residual
plots for patterns
2-39
For OLS: box plots of e stratified by X1, scatterplots

of e vs. X2 and Y , with trend curves (want flat central
tendency, constant variability)
For normality, qqnorm plots of overall and stratified residuals
Advantage: Simplicity
Disadvantages:
Can only compute standard residuals for uncensored continuous response
Subjective judgment of non-randomness
Hard to handle interaction
Hard to see patterns with large n (trend lines help)
Seeing patterns does not lead to corrective action
2. Scatterplot of Y vs. X2 using different symbols according
to values of X1
Advantages: Simplicity, can see interaction
Disadvantages:
Scatterplots cannot be drawn for binary, categorical, or
censored Y
Patterns difficult to see if relationships are weak or n large
3. Stratify the sample by X1 and quantile groups (e.g. deciles)
2-40
of X2; estimate C(Y |X1, X2) for each stratum

Advantages: Simplicity, can see interactions, handles censored Y (if you are careful)
Disadvantages:
Requires large n
Does not use continuous var. effectively (no interpolation)
Subgroup estimates have low precision
Dependent on binning method
4. Separately for levels of X1 fit a nonparametric smoother
relating X2 to Y
Advantages: All regression aspects of the model can be
summarized efficiently with minimal assumptions
Disadvantages:
Does not apply to censored Y
Hard to deal with multiple predictors
5. Fit flexible nonlinear parametric model
Advantages:
One framework for examining the model assumptions, fitting the model, drawing formal inference
d.f. defined and all aspects of statistical inference work
as advertised
2-41
Disadvantages:
Complexity
Generally difficult to allow for interactions when assessing
patterns of effects
Confidence limits, formal inference can be problematic for methods 1-4.
Restricted cubic spline works well for method 5.

|X) = 0 + 1X1 + 2X2 + 3X 0 + 4X 00
C(Y
2
2
= 0 + 1X1 + f(X2),
where
f(X2) = 2X2 + 3X20 + 4X200,
f(X2) spline-estimated transformation of X2.
Plot f(X2) vs. X2
n large can fit separate functions by X1
Test of linearity: H0 : 3 = 4 = 0
Few good reasons to do the test other than to demonstrate
that linearity is not a good default assumption
Nonlinear use transformation suggested by spline fit or
keep spline terms
2-42
Tentative transformation g(X2) check adequacy by expanding g(X2) in spline function and testing linearity
Can find transformations by plotting g(X2) vs. f(X2) for
variety of g
Multiple continuous predictors expand each using spline
Example: assess linearity of X2, X3
C(Y |X) = 0 + 1X1 + 2X2 + 3X20 + 4X200

+ 5X3 + 6X30 + 7X300,
Overall test of linearity H0 : 3 = 4 = 6 = 7 = 0, with 4
d.f.
2.7.2
Modeling and Testing Complex Interactions

X1 binary or linear, X2 continuous:
C(Y |X) = 0 + 1X1 + 2X2 + 3X20 + 4X200

+5X1X2 + 6X1X20 + 7X1X200
Simultaneous test of linearity and additivity: H0 : 3 = . . . =
7 = 0.
2 continuous variables: could transform separately and form
simple product
2-43
But transformations depend on whether interaction terms

adjusted for, so it is usually not possible to estimate transformations and interaction effects other than simultaneously
Compromise: Fit interactions of the form X1f (X2) and
X2g(X1):
T
C(Y |X) =
+
+
+
0 + 1X1 + 2X10 + 3X100

4X2 + 5X20 + 6X200
7X1X2 + 8X1X20 + 9X1X200
10X2X10 + 11X2X100
U
Test of additivity is H0 : 7 = 8 = . . . = 11 = 0 with 5

d.f.
Test of lack of fit for the simple product interaction with X2
is H0 : 8 = 9 = 0
Test of lack of fit for the simple product interaction with X1
is H0 : 10 = 11 = 0
General spline surface:
Cover X1 X2 plane with grid and fit patch-wise cubic
polynomial in two variables
Restrict to be of form aX1 + bX2 + cX1X2 in corners
2-44
Uses all (k 1)2 cross-products of restricted cubic spline

terms
See Gray [66, 67, Section 3.2] for penalized splines allowing
control of effective degrees of freedom. See Berhane et
al. [15] for a good discussion of tensor splines.
50
100
SYSBP
150
200
Tensor Spline Surface, Penalized MLE
20
40
Adjusted to: tx=tpa age=61.52
60
80
DIABP
100
120
140
4 knots, penalty=21
Figure 2.5: Logistic regression estimate of probability of a hemorrhagic stroke for patients in the GUSTO-I trial given t-PA, using
a tensor spline of two restricted cubic splines and penalization (shrinkage). Dark (cold color) regions are low risk, and bright (hot)
regions are higher risk.
Figure 2.5 is particularly interesting because the literature had suggested (based on
approximately 24 strokes) that pulse pressure was the main cause of hemorrhagic stroke
whereas this flexible modeling approach (based on approximately 230 strokes) suggests
that mean arterial blood pressure (roughly a 45 line) is what is most important over a
2-45
broad range of blood pressures. At the far right one can see that pulse pressure (axis
perpendicular to 45 line) may have an impact although a non-monotonic one.
Other issues:
Y non-censored (especially continuous) multi-dimensional

scatterplot smoother [27]
Interactions of order > 2: more trouble
2-way interactions among p predictors: pooled tests
p tests each with p 1 d.f.
Some types of interactions to pre-specify in clinical studies:
Treatment severity of disease being treated
Age risk factors
Age type of disease
Measurement state of a subject during measurement
Race disease
Calendar time treatment
Quality quantity of a symptom
2-46
Measurement amount of deterioration of the measurement

The section between the two horizontal blue lines was inserted after the audio narration was recorded.
The last example is worth expanding as an example in model

formulation. Consider the following study.
A sample of patients seen over several years have a blood
sample taken at time of hospitalization
Blood samples are frozen
Long after the last patient was sampled, the blood samples
are thawed all in the same week and a blood analysis is done
It is known that the quality of the blood analysis deteriorates roughly logarithmically by the age of the sample; blood
measurements made on old samples are assumed to be less
predictive of outcome
This is reflected in an interaction between a function of sample age and the blood measurement Bg
Patients were followed for an event, and the outcome variable of interest is the time from hospitalization to that event
To not assume a perfect logarithmic relationship for sample
g For continuous Y one might need to model the residual variance of Y as increasing with sample age, in addition to modeling the mean
function.
2-47
age on the effect of the blood measurement, a restricted

cubic spline model with 3 default knots will be fitted for log
sample age
Sample age is assumed to not modify the effects of nonblood predictors patient age and sex
Model may be specified the following way using the R rms
package to fit a Cox proportional hazards model
Test for nonlinearity of sampleAge tests the adequacy of assuming a plain logarithmic trend in sample age
f cph ( Surv ( etime , event ) rcs ( log ( sampleAge ) , 3) * rcs (B , 4) +
rcs ( age , 5) * sex , data = mydata )
The B sampleAge interaction effects have 6 d.f. and tests

whether the sample deterioration affects the effect of B. By
not assuming that B has the same effect for old samples as
for young samples, the investigator will be able to estimate
the effect of B on outcome when the blood analysis is ideal by
inserting sampleAge = 1 day when requesting predicted values
as a function of B.
2.7.3
Fitting Ordinal Predictors

Y
Small no. categories (3-4) polytomous factor, dummy

variables
2-48
Design matrix for easy test of adequacy of initial codes k

original codes + k 2 dummies
More categories score using data-driven trend. Later
tests use k 1 d.f. instead of 1 d.f.
E.g., compute logit(mortality) vs. category
2.7.4
Distributional Assumptions
Z
Some models (e.g., logistic): all assumptions in C(Y |X) =

X (implicitly assuming no omitted variables!)
Linear regression: Y X + , n(0, 2)
Examine distribution of residuals
Some models (Weibull, Cox [37]):
C(Y |X) = C(Y = y|X) = d(y) + X
C = log hazard
Check form of d(y)
Show d(y) does not interact with X
Chapter 3
Missing Data
3.1
Types of Missing Data

A
Missing completely at random (MCAR)

Missing at random (MAR)a
Informative missing
(non-ignorable non-response)
See [46, 72, 1, 176, 25] for an introduction to missing data and
imputation concepts.
aAlthough missing at random (MAR) is a non-testable assumption, it has been pointed out in the literature that we can get very close to
MAR if we include enough variables in the imputation models [72].
3-1
CHAPTER 3. MISSING DATA
3-2
3.2
Prelude to Modeling
B
Quantify extent of missing data

Characterize types of subjects with missing data
Find sets of variables missing on same subjects
3-3
3.3
Missing Values for Different Types of Response Variables

C
Serial data with subjects dropping out (not covered in this

courseb
Y =time to event, follow-up curtailed: covered under survival analysisc
Often discard observations with completely missing Y but
sometimes wastefuld
Characterize missings in Y before dropping obs.
b Twist
et al. [159] found instability in using multiple imputation of longitudinal data, and advantages of using instead full likelihood models.
and Royston [175] provide a method for multiply imputing missing covariate values using censored survival time data.
d Y is so valuable that if one is only missing a Y value, imputation is not worthwhile, and imputation of Y is not advised if MCAR or MAR.
c White
3-4
3.4
Problems With Simple Alternatives to Imputation

Deletion of records
Badly biases parameter estimates when the probability of a
case being incomplete is related to Y and not just X [109].
Deletion because of a subset of X being missing always
results in inefficient estimates
Deletion of records with missing Y can result in biases [38]
but is the preferred approach under MCARe
However von Hippel [169] found advantages to a use all
variables to impute all variables then drop observations with
missing Y approach
Lee and Carlin [104] suggest that observations missing on
both Y and on a predictor of major interest are not helpful
Only discard obs. when
MCAR can be justified
Rarely missing predictor of overriding importance that
cant be imputed from other data
e Multiple
imputation of Y in that case does not improve the analysis and assumes the imputation model is correct.
3-5
Fraction of obs. with missings small and n is large

No advantage of deletion except savings of analyst time
Making up missing data better than throwing away real data
See [94]
Adding extra categories of categorical predictors
Including missing data but adding a category missing causes
serious biases [1, 89, 160]
Problem acute when values missing because subject too sick
Difficult to interpret
Fails even under MCAR [89, 1, 46, 162, 94]
May be OK if values are missing because of not applicablef
Likewise, serious problems are caused by setting missing continuous predictors to a constant (e.g., zero) and adding an indicator variable to try to estimate the effect of missing values.
Two examples from Donder et al.[46] using binary logistic regression, N = 500.
f E.g. you have a measure of marital happiness, dichotomized as high or low, but your sample contains some unmarried people. OK to have
a 3-category variable with values high, low, and unmarriedPaul Allison, IMPUTE list, 4Jul09.
3-6
Results of 1000 Simulations With 1 = 1.0 with MAR and Two Types of
Imputation
Imputation 1 S.E. Coverage of

Method
0.90 C.I.
Single
0.989 0.09
0.64
Multiple
0.989 0.14
0.90
Now consider a simulation with 1 = 1, 2 = 0, X2 correlated
with X1(r = 0.75) but redundant in predicting Y , use missingness indicator when X1 is MCAR in 0.4 of 500 subjects. This
is also compared with grand mean fill-in imputation.
Results of 1000 Simulations Adding a Third Predictor Indicating Missing
for X1
Imputation
1 2
Method
Indicator
0.55 0.51
Overall mean 0.55
In the incomplete observations the constant X1 is uncorrelated
with X2.
3-7
3.5
Strategies for Developing an Imputation Model

The goal of imputation is to preserve the information
and meaning of the non-missing data.
Exactly how are missing values estimated?
Could ignore all other information random or grand mean

fill-in
Can use external info not used in response model (e.g., zip
code for income)
Need to utilize reason for non-response if possible
Use statistical model with sometimes-missing X as response
variable
Model to estimate the missing values should include all variables that are either
1. related to the missing data mechanism;
2. have distributions that differ between subjects that have
the target variable missing and those that have it measured;
3. associated with the sometimes-missing variable when it is
not missing; or
4. included in the final response model [10, 72]
3-8
Ignoring imputation results in biased V ()

transcan function in Hmisc library: optimal transformations of all variables to make residuals more stable and to
allow non-monotonic transformations
aregImpute function in Hmisc: good approximation to full
Bayesian multiple imputation procedure using the bootstrap
transcan and aregImpute use the following for fitting imputation models:
1. initialize NAs to median (mode for categoricals)

2. expand all categorical predictors using dummy variables
3. expand all continuous predictors using restricted cubic
splines
4. optionally optimally transform the variable being predicted
by expanding it with restricted cubic splines and using the
first canonical variate (multivariate regression) as the optimum transformation (maximizing R2)
5. one-dimensional scoring of categorical variables being predicted using canonical variates on dummy variables representing the categories (Fishers optimum scoring algorithm); when imputing categories, solve for which category yields a score that is closest to the predicted score
aregImpute and transcan work withfit.mult.impute to make
final analysis of response variable relatively easy
3-9
Predictive mean matching [109]: replace missing value with

observed value of subject having closest predicted value to
the predicted value of the subject with the NA. Key considerations are how to
1. model the target when it is not NA
2. match donors on predicted values
3. avoid overuse of good donors to disallow excessive ties
in imputed data
4. account for all uncertainties
Predictive model for each target uses any outcomes, all predictors in the final model other than the target, plus auxiliary
variables not in the outcome model
No distributional assumptions
Predicted values need only be monotonically related to real
predictive values
PMM can result in some donor observations being used
repeatedly
Causes lumpy distribution of imputed values
Address by sampling from multinomial distribution, probabilities = scaled distance of all predicted values to predicted value (y ) of observation needing imputing
Tukeys tricube function is a good weighting function
3-10
(used in loess):
wi = (1 min(di/s, 1)3)3,
di = |yi y |
s = 0.2 mean|yi y | is a good default scale factor
scale so that P wi = 1
Recursive partitioning with surrogate splits handles case
where a predictor of a variable needing imputation is missing
itself
[176] discusses an alternative method based on choosing a
donor observation at random from the q closest matches
(q = 3, for example)
3-11
3.6
Single Conditional Mean Imputation

M
Can fill-in using unconditional mean or median if number of

missings low and X is unrelated to other Xs
Otherwise, first approximation to good imputation uses other
Xs to predict a missing X
This is a single best guess conditional mean
Z = X plus possibly auxiliary variables that pre j = Z ,
X
j
cede Xj in the causal chain that are not intended to be in
the outcome model.
Cannot include Y in Z without adding random errors to imputed values as done with multiple imputation (would steal
info from Y )
Recursive partitioning can sometimes be helpful for nonparametrically estimating conditional means
3-12
3.7
Predictive Mean Matching

3.8
Multiple Imputation
N
Single imputation could use a random draw from the conditional distribution for an individual
j = Z + , Z = [X j,
Y ] plus auxiliary variables
X
= n(0, ) or a random draw from the calculated residuals
bootstrap
approximate Bayesian bootstrap [139, 72]: sample with replacement from sample with replacement of residuals
Multiple imputations (M ) with random draws
Draw sample of M residuals for each missing value to be
imputed
Average M
In general can provide least biased estimates of
Simple formula for imputation-corrected var()

Function of average apparent variances and betweenimputation variances of
3-13
BUT full multiple imputation needs to account for uncertainty in the imputation models by refitting these models
for each of the M draws
transcan does not do that; aregImpute does
Note that multiple imputation can and should use the response variable for imputing predictors [117]
aregImpute algorithm [117]
Takes all aspects of uncertainty into account using the
bootstrap
Different bootstrap resamples used for each imputation
by fitting a flexible additive model on a sample with replacement from the original data
This model is used to predict all of the original missing
and non-missing values for the target variable for the current imputation
Uses flexible parametric additive regression models to impute
There is an option to allow target variables to be optimally transformed, even non-monotonically (but this can
overfit)
By default uses predictive mean matching for imputation;
3-14
no residuals required (can also do more parametric regression imputation)

By default uses weighted PMM; many other matching
options
Uses by default van Buurens Type 1 matching [25, Section 3.4.2] to capture the right amount of uncertainty by
computing predicted values for missing values using a regression fit on the bootstrap sample, and finding donor
observations by matching those predictions to predictions
from potential donors using the regression fit from the
original sample of complete observations
When a predictor of the target variable is missing, it is
first imputed from its last imputation when it was a target
variable
First 3 iterations of process are ignored (burn-in)
Compares favorably to R MICE approach
Example:
a aregImpute ( age + sex + bp + death + heart.attack.before.death ,
data = mydata , n.impute =5)
f fit.mult.impute ( death rcs ( age ,3) + sex +
rcs ( bp ,5) , lrm , a , data = mydata )
See Barzi and Woodward [10] for a nice review of multiple imputation with
detailed comparison of results (point estimates and confidence limits for
the effect of the sometimes-missing predictor) for various imputation meth-
3-15
ods. Barnes et al. [9] have a good overview of imputation methods and a
comparison of bias and confidence interval coverage for the methods when
applied to longitudinal data with a small number of subjects. Horton and
Kleinman [85] have a good review of several software packages for dealing
with missing data, and a comparison of them with aregImpute. Harel and
Zhou [72] provide a nice overview of multiple imputation and discuss some of
the available software. White and Carlin [174] studied bias of multiple imputation vs. complete-case analysis. White et al. [176] provide much practical
guidance.
Caution: Methods can generate imputations having very reasonable distributions but still not having the property that final response model regression
coefficients have nominal confidence interval coverage. It is worth checking
that imputations generate the correct collinearities among covariates.
With MICE and aregImpute we are using the chained equation

approach [176]
Chained equations handles a wide variety of target variables

to be imputed and allows for multiple variables to be missing
on the same subject
Iterative process cycles through all target variables to impute
all missing values [161]
Does not attempt to use the full Bayesian multivariate model
for all target variables, making it more flexible and easy to
use
Possible to create improper imputations, e.g., imputing con-
3-16
flicting values for different target variables

However, simulation studies [161] demonstrate very good
performance of imputation based on chained equations
3-17
3.9
Diagnostics
R
MCAR can be partially assessed by comparing distribution

of non-missing Y for those subjects with complete X vs.
those subjects having incomplete X [109]
Yucel and Zaslavsky [185] (see also [81])
Interested in reasonableness of imputed values for a sometimesmissing predictor Xj
Duplicate entire dataset
In the duplicated observations set all non-missing values of
Xj to missing; let w denote this set of observations set to
missing
Develop imputed values for the missing values of Xj
In the observations in w compare the distribution of imputed
Xj to the original values of Xj
3-18
duplicate
originally missing
impute
all missing
3-19
3.10
Summary and Rough Guidelines

S
Table 3.1: Summary of Methods for Dealing with Missing Values
Method
Deletion
Single Multiple
Allows non-random missing
x
x
Reduces sample size
x
x
Apparent S.E. of too low
Increases real S.E. of
x
biased
if not MCAR
x
The following contains crude guidelines. Simulation studies are

needed to refine the recommendations. Here f refers to the
proportion of observations having any variables missing.
f < 0.03: It doesnt matter very much how you impute missings or whether you adjust variance of regression coefficient
estimates for having imputed data in this case. For continuous variables imputing missings with the median non-missing
value is adequate; for categorical predictors the most frequent category can be used. Complete case analysis is also
an option here. Multiple imputation may be needed to check
that the simple approach worked.
f 0.03: Use multiple imputation with number of imputationsg equal to max(5, 100f ). Fewer imputations may be
possible with very large sample sizes. Type 1 predictive
mean matching is usually preferred, with weighted selection of donors. Account for imputation in estimating the
g White et al. [176] recommend choosing M so that the key inferential statistics are very reproducible should the imputation analysis be
repeated. They suggest the use of 100f imputations. See also [25, Section 2.7].
3-20
covariance matrix for final parameter estimates. Use the t

distribution instead of the Gaussian distribution for tests and
confidence intervals, if possible, using the estimated d.f. for
the parameter estimates.
Multiple predictors frequently missing: More imputations

may be required. Perform a sensitivity to order analysis
by creating multiple imputations using different orderings of
sometimes missing variables. It may be beneficial to initially sort variables so that the one with the most NAs will
be imputed first.
Reason for missings more important than number of missing
values.
Extreme amount of missing data does not prevent one from
using multiple imputation, because alternatives are worse [88].
3.10.1
Effective Sample Size

It is useful to look look at examples of effective sample sizes
in the presence of missing data. If a sample of 1000 subjects
contains various amounts and patterns of missings what size
nc of a complete sample would have equivalent information for
the intended purpose of the analysis?
1. A new marker was collected on a random sample of 200 of
the subjects and one wants to estimate the added predictive
3-21
value due to the marker: nc = 200

2. Height is missing on 100 subjects but we want to study
association between BMI and outcome. Weight, sex, and
waist circumference are available on all subjects: nc = 980
1
3. Each of 10 predictors is randomly missing on 10
of subjects,
and the predictors are uncorrelated with each other and are
each weakly related to the outcome: nc = 500
4. Same as previous but the predictors can somewhat be predicted from non-missing predictors: nc = 750
5. The outcome variable was not assessed on a random
subjects: nc = 800
1
5
of
6. The outcome represents sensitive information, is missing on

1
2 of subjects, and we dont know what made subjects respond to the question: nc = 0 (serious selection bias)
7. One of the baseline variables was collected prospectively 12
of the time and for the other subjects it was retrospectively
estimated only for subjects ultimately suffering a stroke and
we dont know which subjects had a stroke: nc = 0 (study
not worth doing)
8. The outcome variable was assessed by emailing the 1000
subjects, for which 800 responded, and we dont know what
made subjects respond: nc = 0 (model will possibly be very
biasedat least the intercept)
Chapter 4
Multivariable Modeling Strategies

A
Spending d.f.: examining or fitting parameters in models,

or examining tables or graphs that utilize Y to tell you how
to model variables
If wish to preserve statistical properties, cant retrieve d.f.
once they are spent (see Grambsch & OBrien)
If a scatterplot suggests linearity and you fit a linear model,
how many d.f. did you actually spend (i.e., the d.f. that when
put into a formula results in accurate confidence limits or
P -values)?
Decide number of d.f. that can be spent
Decide where to spend them
Spend them
General references: [121, 152, 77, 61]
4-1
CHAPTER 4. MULTIVARIABLE MODELING STRATEGIES
4-2
There are many choices to be made when deciding upon a

global modeling strategy, including choice between
parametric and nonparametric procedures
parsimony and complexity
parsimony and good discrimination ability
interpretable models and black boxes.
4-3
4.1
Prespecification of Predictor Complexity Without Later Simplification

D
Rarely expect linearity

Cant always use graphs or other devices to choose transformation
If select from among many transformations, results biased
Need to allow flexible nonlinearity to potentially strong predictors not known to predict linearly
Once decide a predictor is in can choose no. of parameters
to devote to it using a general association index with Y
Need a measure of potential predictive punch
Measure needs to mask analyst to true form of regression
to preserve statistical properties
Motivating examples:
#
Overfitting
flat
relationship
require ( rms )
set.seed (1)
x runif (1000)
y runif (1000 , -0.5 , 0 .5 )
dd datadist (x , y ) ; options ( datadist = dd )
par ( mfrow = c (2 ,2) , mar = c (2 , 2 , 3 , 0 .5 ) )
pp function ( actual ) {
yhat predict (f , data.frame ( x = xs ) )
yreal actual ( xs )
plot (0 , 0 , xlim = c (0 ,1) ,
ylim = range ( c ( quantile (y , c (0 .1 , 0 .9 ) ) , yhat ,
yreal ) ) ,
type = n , axes = FALSE )
axis (1 , labels = FALSE ) ; axis (2 , labels = FALSE )
lines ( xs , yreal )
lines ( xs , yhat , col = blue )
}
f ols ( y rcs (x , 5) )
xs seq (0 , 1 , length =150)
pp ( function ( x ) 0 * x )
title ( Mild Error :\ nOverfitting a Flat Relationship ,
cex =0 .5 )
y x + runif (1000 , -0.5 , 0 .5 )
f ols ( y rcs (x , 5) )
pp ( function ( x ) x )
title ( Mild Error :\ nOverfitting a Linear Relationship ,
cex =0 .5 )
y x 4 + runif (1000 , -1 , 1)
f ols ( y x )
pp ( function ( x ) x 4)
title ( Serious Error :\ nUnderfitting a Steep Relationship ,
cex =0 .5 )
y - ( x - 0 .5 ) 2 + runif (1000 , -0.2 , 0 .2 )
f ols ( y x )
pp ( function ( x ) - ( x - 0 .5 ) 2)
title ( Tragic Error :\ nMonotonic Fit to \ nNon-Monotonic Relationship ,
cex =0 .5 )
4-4
Mild Error:
Overfitting a Linear Relationship
Mild Error:
Overfitting a Flat Relationship
Tragic0Error:
Monotonic Fit to
NonMonotonic Relationship
0
Serious Error:
Underfitting a Steep Relationship
4-5
Table 4.1: Examples of Reducing the Number of Parameters
Categorical predictor with k levels Collapse less frequent

categories into other
Continuous predictor represented Reduce k to a number as
as k-knot r.c. spline
low as 3, or 0 (linear)
4.1.1
Learning From a Saturated Model

When the effective sample size available is sufficiently large
so that a saturated main effects model may be fitted, a good
4-6
approach to gauging predictive potential is the following.

Let all continuous predictors be represented as restricted
cubic splines with k knots, where k is the maximum number
of knots the analyst entertains for the current problem.
Let all categorical predictors retain their original categories
except for pooling of very low prevalence categories (e.g.,
ones containing < 6 observations).
Fit this general main effects model.
Compute the partial 2 statistic for testing the association
of each predictor with the response, adjusted for all other
predictors. In the case of ordinary regression convert partial
F statistics to 2 statistics or partial R2 values.
Make corrections for chance associations tolevel the playing
field for predictors having greatly varying d.f., e.g., subtract
the d.f. from the partial 2 (the expected value of 2p is p
under H0).
Make certain that tests of nonlinearity are not revealed as
this would bias the analyst.
Sort the partial association statistics in descending order.
Commands in the rms package can be used to plot only what
is needed. Here is an example for a logistic model.
4-7
f lrm ( y sex + race + rcs ( age ,5) + rcs ( weight ,5) +

rcs ( height ,5) + rcs ( blood.pressure ,5) )
plot ( anova ( f ) )
4.1.2
Using Marginal Generalized Rank Correlations

When collinearities or confounding are not problematic, a quicker
approach based on pairwise measures of association can be useful. This approach will not have numerical problems (e.g., singular covariance matrix) and is based on:
G
2 d.f. generalization of Spearman R2 based on rank(X)
and rank(X)2 vs. rank(Y )
2 can detect U-shaped relationships
For categorical X, 2 is R2 from dummy variables regressed
against rank(Y ); this is tightly related to the Wilcoxon
MannWhitneyKruskalWallis rank test for group differencesa
Sort variables by descending order of 2
Specify number of knots for continuous X, combine infrequent categories of categorical X based on 2
Allocating d.f. based on partial tests of association or sorting
a This
test statistic does not inform the analyst of which groups are different from one another.
4-8
2 is a fair procedure because

We already decided to keep variable in model no matter
what 2 or 2 values are seen
2 and 2 do not reveal degree of nonlinearity; high value
may be due solely to strong linear effect
low 2 or 2 for a categorical variable might lead to collapsing the most disparate categories
Initial simulations show the procedure to be conservative. Note
that one can move from simpler to more complex models but
not the other way round
4-9
4.2
Checking Assumptions of Multiple Predictors

Simultaneously
I
Sometimes failure to adjust for other variables gives wrong

transformation of an X, or wrong significance of interactions
Sometimes unwieldy to deal simultaneously with all predictors at each stage assess regression assumptions separately for each predictor
4-10
4.3
Variable Selection
J
Series of potential predictors with no prior knowledge

exploration shrinkage (overfitting)
significant ) 6= [29]
Summary of problem: E(|
standard errors, P -values too small
Biased R2, ,
F and 2 statistics do not have the claimed distributionb [65]
Will result in residual confounding if use variable selection
to find confounders [69]
Derksen and Keselman [43] found that in stepwise analyses
the final model represented noise 0.20-0.74 of time, final
model usually contained < 12 actual number of authentic
predictors. Also:
1. The degree of correlation between the predictor variables affected the frequency with which authentic
predictor variables found their way into the final
model.
2. The number of candidate predictor variables affected
the number of noise variables that gained entry to
the model.
b Lockhart et al. [111] provide an example with n = 100 and 10 orthogonal predictors where all true s are zero. The test statistic for the
first variable to enter has type I error of 0.39 when the nominal is set to 0.05.
4-11
3. The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model.
4. The population multiple coefficient of determination
could be faithfully estimated by adopting a statistic
that is adjusted by the total number of candidate
predictor variables rather than the number of variables in the final model.
Global test with p d.f. insignificant stop
Simulation experiment, true 2 = 6.25, 8 candidate variables,
4 of them related to Y in the population. Select best model
using AIC.
require ( MASS )
sim function (n , sigma =2 .5 , pr = FALSE , prcor = FALSE ) {
x1 rnorm ( n )
x2 x1 + 0 .5 * rnorm ( n )
x3 rnorm ( n )
x4 x3 + 1 .5 * rnorm ( n )
x5 x1 + rnorm ( n ) / 1 .3
x6 x2 + rnorm ( n ) / 1 .3
x7 x3 + x4 + rnorm ( n )
x8 x7 + 0 .5 * rnorm ( n )
if ( prcor ) return ( round ( cor ( cbind ( x1 , x2 , x3 , x4 , x5 , x6 , x7 , x8 ) ) ,2) )
lp x1 + x2 + .5 * x3 + .4 * x7
y lp + sigma * rnorm ( n )
f lm ( y x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 )
g stepAIC (f , trace =0)
p g $ rank - 1
xs if ( p == 0) none else
gsub ( [ \\+ x ] , , as.character ( formula ( g ) ) [3])
if ( pr ) print ( formula ( g ) , showEnv = FALSE )
ssesw sum ( resid ( g ) 2)
s2s ssesw / g $ df.residual
# Set SSEsw / ( n - gdf - 1) = true sigma 2
gdf n - 1 - ssesw / ( sigma 2)
#
Compute
root
mean
squared
error
against
true
linear
predictor
rmse.full sqrt ( mean (( fitted ( f ) - lp ) 2) )

rmse.step sqrt ( mean (( fitted ( g ) - lp ) 2) )
list ( stats = c ( n =n , vratio = s2s / ( sigma 2) ,
gdf = gdf , apparentdf =p , rmse.full = rmse.full , rmse.step = rmse.step ) ,
4-12
xselected = xs )
}
rsim function (B , n ) {
xs character ( B )
r matrix ( NA , nrow =B , ncol =6)
for ( i in 1: B ) {
w
sim ( n )
r [i ,] w $ stats
xs [ i ] w $ xselected
}
colnames ( r ) names ( w $ stats )
s apply (r , 2 , median )
p r [ , apparentdf ]
s [ apparentdf ] mean ( p )
print ( round (s , 2) )
print ( table ( p ) )
cat ( Prob [ correct model ]= , round ( sum ( xs == 1237 ) /B , 2) , \ n )
}
Show the correlation matrix being assumed for the Xs:

sim (50000 , prcor = TRUE )
x1
x2
x3
x4
x5
x6
x7
x8
x1 1.00 0.89 -0.01 0.00 0.80 0.74 0.00 0.00
x2 0.89 1.00 -0.01 0.00 0.71 0.83 0.00 0.00
x3 -0.01 -0.01 1.00 0.56 -0.01 -0.01 0.74 0.73
x4 0.00 0.00 0.56 1.00 0.00 0.00 0.88 0.86
x5 0.80 0.71 -0.01 0.00 1.00 0.59 0.00 0.00
x6 0.74 0.83 -0.01 0.00 0.59 1.00 0.00 0.00
x7 0.00 0.00 0.74 0.88 0.00 0.00 1.00 0.98
x8 0.00 0.00 0.73 0.86 0.00 0.00 0.98 1.00
Simulate to find the distribution of the number of variables

selected, the proportion of simulations in which the true model
(X1, X2, X3, X7) was found, the median value of 2/ 2, the
median effective d.f., and the mean number of apparent d.f.,
for varying sample sizes.
set.seed (11)
rsim (100 , 20)
n
20.00
actual
vratio
0.70
p
1 2 3 4 5 6 7
4 16 32 22 14 9 3
Prob [ correct model ]= 0
model
not
selected
gdf apparentdf
8.09
3.65
once
rmse . full
1.62
rmse . step
1.56
4-13
rsim (100 , 40)

n
40.00
vratio
0.87
gdf apparentdf
7.34
3.06
rmse . full
1.21
rmse . step
1.15
gdf apparentdf
9.08
3.81
rmse . full
0.59
rmse . step
0.62
gdf apparentdf
9.26
4.21
rmse . full
0.43
rmse . step
0.41
gdf apparentdf
6.30
4.58
rmse . full
0.17
rmse . step
0.15
p
1 2 3 4 5 6 7
2 34 33 21 8 1 1
Prob [ correct model ]= 0
rsim (100 , 150)
n
150.00
vratio
0.97
p
2 3 4 5 6
10 24 44 19 3
Prob [ correct model ]= 0.13
rsim (100 , 300)
n
300.00
vratio
0.98
p
3 4 5 6
12 60 23 5
rsim (100 , 2000)
n
2000.00
vratio
1.00
p
4 5 6 7
54 35 10 1
As n the mean number of variables selected increased. The

proportion of simulations in which the correct model was found
increased from 0 to 0.52. 2 is underestimated in non-large
samples by a factor of 0.70, resulting in the d.f. needed to debias 2 being 8.1 when the apparent d.f. was only 3.65 on the
average, when n = 20. Variable selection did increase closeness
to the true X for some sample sizes.
Variable selection methods [73]:
N
4-14
Forward selection, backward elimination

Stopping rule: residual 2 with d.f. = no. candidates remaining at current step
Test for significance or use Akaikes information criterion
(AIC [6]), here 2 2 d.f.
Better to use subject matter knowledge!
No currently available stopping rule was developed for stepwise, only for comparing a limited number of pre-specified
models [20, Section 1.3]
Roecker [137] studied forward selection (FS), all possible subsets selection (APS), full fits
APS more likely to select smaller, less accurate models than
FS
Neither as accurate as full model fit unless >
variables redundant or unnecessary
1
2
candidate
Step-down is usually better than forward [113] and can be

used efficiently with maximum likelihood estimation [101]
Fruitless to try different stepwise methods to look for agreement [178]
4-15
Bootstrap can help decide between full and reduced model

Full model fits gives meaningful confidence intervals with
standard formulas, C.I. after stepwise does not [3, 86, 20]
Data reduction (grouping variables) can help
Using the bootstrap to select important variables for inclusion in the final model [141] is problematic [7]
It is not logical that a population regression coefficient would
be exactly zero just because its estimate was insignificant
4-16
4.4
Overfitting and Limits on Number of Predictors

Q
Concerned with avoiding overfitting

Assume typical problem in medicine, epidemiology, and the
social sciences in which the signal:noise ratio is small (higher
ratios allow for more aggressive modeling)
p should be <
m
15 [75, 76, 148, 125, 124, 168, 132]
p = number of parameters in full model or number of candidate parameters in a stepwise analysis

Derived from simulations to find minimum sample size so
that apparent discrimination = validated discrimination
Applies to typical signal:noise ratios found outside of tightly
controlled experiments
If true R2 is high, many parameters can be estimated from
smaller samples
Ignores sample size needed just to estimate the intercept or,
in semiparametric models, the underlying distribution functionc
c The
sample size needed for these is model-dependent
4-17
To just estimate in a linear model with a multiplicative

margin of error of 1.2 with 0.95 confidence requires n = 70
R
Table 4.2: Limiting Sample Sizes for Various Response Variables
Type of Response Variable

Continuous
Binary
Ordinal (k categories)
Failure (survival) time
Limiting Sample Size m

n (total sample size)
min(n1 , n2 ) a
P
n n12 ki=1 n3i b
number of failures c
a If one considers the power of a two-sample binomial test compared with a Wilcoxon test if the response could be made continuous and the
proportional odds assumption holds, the effective sample size for a binary response is 3n1 n2 /n 3 min(n1 , n2 ) if nn1 is near 0 or 1 [177, Eq. 10,
15]. Here n1 and n2 are the marginal frequencies of the two response levels [124].
b Based on the power of a proportional odds model two-sample test when the marginal cell sizes for the response are n , . . . , n , compared
1
k
with all cell sizes equal to unity (response is continuous) [177, Eq, 3]. If all cell sizes are equal, the relative efficiency of having k response categories
compared to a continuous response is 1 k12 [177, Eq. 14], e.g., a 5-level response is almost as efficient as a continuous one if proportional odds
holds across category cutoffs.
c This is approximate, as the effective sample size may sometimes be boosted somewhat by censored observations, especially for nonproportional hazards methods such as Wilcoxon-type tests [14].
Narrowly distributed predictor even higher n

p includes all variables screened for association with response, including interactions
Univariable screening (graphs, crosstabs, etc.) in no way
reduces multiple comparison problems of model building [157]
4-18
4.5
Shrinkage
Slope of calibration plot; regression to the mean
Statistical estimation procedure pre-shrunk models

Arent regression coefficients OK because theyre unbiased?
Problem is in how we use coefficient estimates
Consider 20 samples of size n = 50 from U (0, 1)
Compute group means and plot in ascending order
Equivalent to fitting an intercept and 19 dummies using least
squares
Result generalizes to general problems in plotting Y vs. X
set.seed (123)
n 50
y runif (20 * n )
group rep (1:20 , each = n )
ybar tapply (y , group , mean )
ybar sort ( ybar )
plot (1:20 , ybar , type = n , axes = FALSE , ylim = c ( .3 , .7 ) ,
xlab = Group , ylab = Group Mean )
lines (1:20 , ybar )
points (1:20 , ybar , pch =20 , cex = .5 )
axis (2)
axis (1 , at =1:20 , labels = FALSE )
for ( j in 1:20) axis (1 , at =j , labels = names ( ybar ) [ j ])
abline ( h = .5 , col = gray ( .85 ) )
Prevent shrinkage by using pre-shrinkage
4-19
0.7
Group Mean
0.6
0.5
16 6 17 2 10 14 20 9
7 11 18 5
0.4
0.3
1 15 13 19 12
Group
Figure 4.1: Sorted means from 20 samples of size 50 from a uniform [0, 1] distribution. The reference line at 0.5 depicts the true
population value of all of the means.
Spiegelhalter [150]: var. selection arbitrary, better prediction

usually results from fitting all candidate variables and using
shrinkage
Shrinkage closer to that expected from full model fit than
based on number of significant variables [36]
Ridge regression [102, 163]
Penalized MLE [166, 66, 78]
Heuristic shrinkage parameter of van Houwelingen and le
Cessie [163, Eq. 77]
model 2 p
=
,
model 2
2
2
OLSd: = np1
n1 Radj /R
n1
2
Radj
= 1 (1 R2) np1
p close to no. candidate variables

Copas [36, Eq. 8.5] adds 2 to numerator
d An
excellent discussion about such indexes may be found here.
4-20
4-21
4.6
Collinearity
W
When at least 1 predictor can be predicted well from others

Can be a blessing (data reduction, transformations)
power
s.e. of ,
This is appropriate asking too much of the data [30,
Chap. 9]
Variables compete in variable selection, chosen one arbitrary

Does not affect joint influence of a set of highly correlated
variables (use multiple d.f. tests)
Does not at all affect predictions on model construction sample
Does not affect predictions on new data [119, pp. 379-381] if
1. Extreme extrapolation not attempted
2. New data have same type of collinearities as original data
Example: LDL and total cholesterol problem only if more
inconsistent in new data
Example: age and age2 no problem
4-22
One way to quantify for each predictor: variance inflation

factors (VIF)
General approach (maximum likelihood) transform information matrix to correlation form, VIF=diagonal of inverse [41, 173]
See Belsley [13, pp. 28-30] for problems with VIF
Easy approach: SAS VARCLUS procedure [140], S varclus
function, other clustering techniques: group highly correlated variables
Can score each group (e.g., first principal component, P C1 [40]);

summary scores not collinear
4-23
4.7
Data Reduction
Z
Unless n >> p, model unlikely to validate

Data reduction: p
Use the literature to eliminate unimportant variables.
Eliminate variables whose distributions are too narrow.
Eliminate candidate predictors that are missing in a large
number of subjects, especially if those same predictors are
likely to be missing for future applications of the model.
Use a statistical data reduction method such as incomplete
principal components regression, nonlinear generalizations
of principal components such as principal surfaces, sliced inverse regression, variable clustering, or ordinary cluster analysis on a measure of similarity between variables.
Data reduction is completely masked to Y , which is
precisely why it does not distort estimates, standard errors,
P -values, or confidence limits
Data reduction = unsupervised learning
Example: dataset with 40 events and 60 candidate predictors
4-24
Use variable clustering to group variables by correlation

structure
Use clinical knowledge to refine the clusters
Keep age and severity of disease as separate predictors
because of their strength
For others create clusters: socioeconomic, risk factors/history, and physiologic function
Summarize each cluster with its first principal component
P C1, i.e., the linear combination of characteristics that
maximizes variance of the score across subjects subject
to an overall constraint on the coefficients
Fit outcome model with 5 predictors
4.7.1
Redundancy Analysis
A
Remove variables that have poor distributions

E.g., categorical variables with fewer than 2 categories
having at least 20 observations
Use flexible additive parametric additive models to determine
how well each variable can be predicted from the remaining
variables
4-25
Variables dropped in stepwise fashion, removing the most

predictable variable at each step
Remaining variables used to predict
Process continues until no variable still in the list of predictors can be predicted with an R2 or adjusted R2 greater
than a specified threshold or until dropping the variable with
the highest R2 (adjusted or ordinary) would cause a variable
that was dropped earlier to no longer be predicted at the
threshold from the now smaller list of predictors
R/S function redun in Hmisc package
Related to principal variables [115] but faster
4.7.2
Variable Clustering
B
Goal: Separate variables into groups

variables within group correlated with each other
variables not correlated with non-group members
Score each dimension, stop trying to separate effects of factors measuring same phenomenon
Variable clustering [140, 40] (oblique-rotation PC analysis)
4-26
separate variables so that first PC is representative of

group
Can also do hierarchical cluster analysis on similarity matrix
based on squared Spearman or Pearson correlations, or more
generally, Hoeffdings D [83].
See [70] for a method related to variable clustering and
sparse principal components.
[31] implement many more variable clustering methods
Example: Figure 9.6
4.7.3
Transformation and Scaling Variables Without Using

Y
C
Reduce p by estimating transformations using associations

with other predictors
Purely categorical predictors correspondence analysis [103,
39, 32, 68, 116]
Mixture of qualitative and continuous variables: qualitative

principal components
Maximum total variance (MTV) of Young, Takane, de Leeuw [184,
116]
4-27
1. Compute P C1 of variables using correlation matrix

2. Use regression (with splines, dummies, etc.) to predict
P C1 from each X expand each Xj and regress it
separately on P C1 to get working transformations
3. Recompute P C1 on transformed Xs
4. Repeat 3-4 times until variation explained by P C1 plateaus
and transformations stabilize
Maximum generalized variance (MGV) method of Sarle [99,
pp. 1267-1268]
1. Predict each variable from (current transformations of)

all other variables
2. For each variable, expand it into linear and nonlinear
terms or dummies, compute first canonical variate
3. For example, if there are only two variables X1 and X2
represented as quadratic polynomials, solve for a, b, c, d
such that aX1+bX12 has maximum correlation with cX2+
dX22.
4. Goal is to transform each var. so that it is most similar
to predictions from other transformed variables
5. Does not rely on PCs or variable clustering
MTV (PC-based instead of canonical var.) and MGV implemented in SAS PROC PRINQUAL [99]
1. Allows flexible transformations including monotonic splines
2. Does not allow restricted cubic splines, so may be unstable unless monotonicity assumed
4-28
3. Allows simultaneous imputation but often yields wild estimates

4.7.4
Simultaneous Transformation and Imputation

S transcan Function for Data Reduction & Imputation
Initialize missings to medians (or most frequent category)
Initialize transformations to original variables
Take each variable in turn as Y
Exclude obs. missing on Y
Expand Y (spline or dummy variables)
Score (transform Y ) using first canonical variate
Missing Y predict canonical variate from Xs
The imputed values can optionally be shrunk to avoid overfitting for small n or large p
Constrain imputed values to be in range of non-imputed
ones
Imputations on original scale
4-29
1. Continuous back-solve with linear interpolation

2. Categorical classification tree (most freq. cat.) or
match to category whose canonical score is closest to
one predicted
Multiple imputation bootstrap or approx. Bayesian boot.
1. Sample residuals multiple times (default M = 5)
2. Are on optimally transformed scale
3. Back-transform
4. fit.mult.impute works with aregImpute and transcan output to easily get imputation-corrected variances and avg.
Option to insert constants as imputed values (ignored during

transformation estimation); helpful when a lab value may be
missing because the patient returned to normal
Imputations and transformed values may be easily obtained
for new data
An S function Function will create a series of S functions
that transform each predictor
Example: n = 415 acutely ill patients
1. Relate heart rate to mean arterial blood pressure
2. Two blood pressures missing
3. Heart rate not monotonically related to blood pressure
4-30
4. See Figures 4.2 and 4.3

require ( Hmisc )
getHdata ( support )
# Get data frame from web site
heart.rate
support $ hrt
blood.pressure support $ meanbp
blood.pressure [400:401]
Mean Arterial Blood Pressure Day 3
[1] 151 136
blood.pressure [400:401] NA # C r e a t e t w o m i s s i n g s
d data.frame ( heart.rate , blood.pressure )
par ( pch =46)
# F i g u r e 4.2
w transcan ( heart.rate + blood.pressure , transformed = TRUE ,
imputed = TRUE , show.na = TRUE , data = d )
Convergence criterion :2.901 0.035
0.007
Convergence in 4 iterations
R2 achieved in predicting each variable :
heart . rate blood . pressure
0.259
0.259
Adjusted R2 :
heart . rate blood . pressure
0.254
0.253
w $ imputed $ blood.pressure
400
401
132.4057 109.7741
t w $ transformed
spe round ( c ( spearman ( heart.rate , blood.pressure ) ,
spearman ( t [ , heart.rate ] ,
t [ , blood.pressure ]) ) , 2)
plot ( heart.rate , blood.pressure )
# F i g u r e 4.3
plot ( t [ , heart.rate ] , t [ , blood.pressure ] ,
xlab = Transformed hr , ylab = Transformed bp )
ACE (Alternating Conditional Expectation) of Breiman and

Friedman [19]
1. Uses nonparametric super smoother [57]
2. Allows monotonicity constraints, categorical vars.
4-31
Transformed blood.pressure
Transformed heart.rate
0
6
50
100
150
200
250
300
50
heart.rate
100
150
blood.pressure
Figure 4.2: Transformations fitted using transcan. Tick marks indicate the two imputed values for blood pressure.
100
50
Transformed bp
blood.pressure
150
50
100
150
200
heart.rate
250
300
Transformed hr
Figure 4.3: The lower left plot contains raw data (Spearman = 0.02); the lower right is a scatterplot of the corresponding
transformed values ( = 0.13). Data courtesy of the SUPPORT study [93].
4-32
3. Does not handle missing data

F
These methods find marginal transformations

Check adequacy of transformations using Y
1. Graphical
2. Nonparametric smoothers (X vs. Y )
3. Expand original variable using spline, test additional predictive information over original transformation
4.7.5
Simple Scoring of Variable Clusters

G
Try to score groups of transformed variables with P C1

Reduces d.f. by pre-transforming var. and by combining
multiple var.
Later may want to break group apart, but delete all variables
in groups whose summary scores do not add significant information
Sometimes simplify cluster score by finding a subset of its
constituent variables which predict it with high R2.
Series of dichotomous variables:
Construct X1 = 0-1 according to whether any variables pos-
4-33
itive
Construct X2 = number of positives
Test whether original variables add to X1 or X2
4.7.6
Simplifying Cluster Scores

4.7.7
How Much Data Reduction Is Necessary?

Using Expected Shrinkage to Guide Data Reduction
Fit full model with all candidates, p d.f., LR likelihood ratio
2
Compute
If < 0.9, consider shrunken estimator from whole model, or
data reduction (again not using Y )
q regression d.f. for reduced model
Assume best case: discarded dimensions had no association
with Y
Expected loss in LR is p q
4-34
New shrinkage [LR (p q) q]/[LR (p q)]

Solve for q q (LR p)/9
Under these assumptions, no hope unless original LR > p+9
No 2 lost by dimension reduction q LR/10
Example:
J
Binary logistic model, 45 events on 150 subjects

10:1 rule analyze 4.5 d.f. total
Analyst wishes to include age, sex, 10 others
Not known if age linear or if age and sex additive
4 knots 3+1+1 d.f. for age and sex if restrict interaction
to be linear
Full model with 15 d.f. has LR=50
Expected shrinkage factor (50 15)/50 = 0.7
LR> 15 + 9 = 24 reduction may help
Reduction to q = (50 15)/9 4 d.f. necessary
Have to assume age linear, reduce other 10 to 1 d.f.
4-35
Separate hypothesis tests intended use full model, adjust

for multiple comparisons
Summary of Some Data Reduction Methods
Goals
Reasons
Methods
Variable clustering
Subject
knowledge
Group predictors so that

each group represents
a single dimension that
can be summarized with
a single score
d.f. arising from

multiple predictors
Make P C1 more reasonable summary
matter
Group predictors to
maximize proportion
of variance explained
by P C1 of each group
Hierarchical clustering using a matrix
of similarity measures
between predictors
d.f. due to nonlinear and dummy variable components
Transform predictors
Allows predictors to
be optimally combined
Maximum total variance on a group of related predictors
Make P C1 more reasonable summary
Canonical variates on
the total set of predictors
Use in customized
model for imputing
missing values on
each predictor
Score a group of predictors
Multiple
dimensional
scoring of all predictors
d.f. for group to unity
P C1
Simple point scores
d.f. for all predictors

combined
Principal components
1, 2, . . . , k, k
<
p
computed from all
transformed predictors
4-36
4.8
Other Approaches to Predictive Modeling

4.9
Overly Influential Observations

K
Every observation should influence fit

Major results should not rest on 1 or 2 obs.
Overly infl. obs. variance of predictions
Also affects variable selection
Reasons for influence:
Too few observations for complexity of model (see Sections
4.7, 4.3)
Data transcription or entry errors
Extreme values of a predictor
1. Sometimes subject so atypical should remove from dataset
2. Sometimes truncate measurements where data density
ends
3. Example: n = 4000, 2000 deaths, white blood count
range 500-100,000, .05,.95 quantiles=2755, 26700
4-37
4. Linear spline function fit

5. Sensitive to WBC> 60000 (n = 16)
6. Predictions stable if truncate WBC to 40000 (n = 46
above 40000)
Disagreements between predictors and response. Ignore unless extreme values or another explanation
Example: n = 8000, one extreme predictor value not on
straight line relationship with other (X, Y ) 2 = 36 for
H0 : linearity
Statistical Measures:
Leverage: capacity to be influential (not necessarily infl.)
Diagonals of hat matrix H = X(X 0X)1X 0 measures
how an obs. predicts its own response [12]
hii > 2(p + 1)/n may signal a high leverage point [12]
DFBETAS: change in upon deletion of each obs, scaled
by s.e.
DFFIT: change in X upon deletion of each obs
DFFITS: DFFIT standardized by s.e. of
Some
classify obs as overly influential when |DFFITS| >
r
2 (p + 1)/(n p 1) [12]
4-38
Others examine entire distribution for outliers

No substitute for careful examination of data [28, 149]
Maximum likelihood estimation requires 1-step approximations
4-39
4.10
Comparing Two Models

N
Level playing field (independent datasets, same no. candidate d.f., careful bootstrapping)
Criteria:
1. calibration
2. discrimination
3. face validity
4. measurement errors in required predictors
5. use of continuous predictors (which are usually better defined than categorical ones)
6. omission of insignificant variables that nonetheless make
sense as risk factors
7. simplicity (though this is less important with the availability of computers)
8. lack of fit for specific types of subjects
Goal is to rank-order: ignore calibration
Otherwise, dismiss a model having poor calibration
Good calibration compare discrimination (e.g., R2 [120],
model 2, Somers Dxy , Spearmans , area under ROC
curve)
4-40
Worthwhile to compare models on a measure not used to

optimize either model, e.g., mean absolute error, median
absolute error if using OLS
Rank measures may not give enough credit to extreme predictions model 2, R2, examine extremes of distribution
of Y
Examine differences in predicted values from the two models
See [126, 129, 128, 127] for discussions and examples of low
power for testing differences in ROC areas, and for other
approaches.
4-41
4.11
Improving the Practice of Multivariable Prediction

See also Section 5.6.
Greenland [69] discusses many important points:
Stepwise variable selection on confounders leaves important
confounders uncontrolled
Shrinkage is far superior to variable selection
Variable selection does more damage to confidence interval
widths than to point estimates
Claims about unbiasedness of ordinary MLEs are misleading
because they assume the model is correct and is the only
model entertained
models need to be complex to capture uncertainty about
the relations . . . an honest uncertainty assessment requires
parameters for all effects that we know may be present.
This advice is implicit in an antiparsimony principle often
attributed to L. J. Savage All models should be as big as
an elephant (see Draper, 1995)
Greenlands example of inadequate adjustment for confounders
4-42
as a result of using a bad modeling strategy:
Case-control study of diet, food constituents, breast cancer

140 cases, 222 controls
35 food constituent intakes and 5 confounders
Food intakes are correlated
Traditional stepwise analysis not adjusting simultaneously
for all foods consumed 11 foods had P < 0.05
Full model with all 35 foods competing 2 had P < 0.05
Rigorous simultaneous analysis (hierarchical random slopes
model) penalizing estimates for the number of associations
examined no foods associated with breast cancer
S
Global Strategies
Use a method known not to work well (e.g., stepwise variable selection without penalization; recursive partitioning),
document how poorly the model performs (e.g. using the
bootstrap), and use the model anyway
Develop a black box model that performs poorly and is difficult to interpret (e.g., does not incorporate penalization)
4-43
Develop a black box model that performs well and is difficult

to interpret
Develop interpretable approximations to the black box
Develop an interpretable model (e.g. give priority to additive
effects) that performs well and is likely to perform equally
well on future data from the same stream
T
Preferred Strategy in a Nutshell
Decide how many d.f. can be spent

Decide where to spend them
Spend them
Dont reconsider, especially if inference needed
4-44
4.12
Summary: Possible Modeling Strategies

4.12.1
Developing Predictive Models

U
1. Assemble accurate, pertinent data and lots of it, with wide

distributions for X.
2. Formulate good hypotheses specify relevant candidate
predictors and possible interactions. Dont use Y to decide
which Xs to include.
3. Characterize subjects with missing Y . Delete such subjects
in rare circumstances [38]. For certain models it is effective
to multiply impute Y .
4. Characterize and impute missing X. In most cases use multiple imputation based on X and Y
5. For each predictor specify complexity or degree of nonlinearity that should be allowed (more for important predictors or
for large n) (Section 4.1)
6. Do data reduction if needed (pre-transformations, combinations), or use penalized estimation [78]
7. Use the entire sample in model development
8. Can do highly structured testing to simplify initial model
(a) Test entire group of predictors with a single P -value
4-45
(b) Make each continuous predictor have same number of

knots, and select the number that optimizes AIC
(c) Test the combined effects of all nonlinear terms with a
single P -value
9. Make tests of linearity of effects in the model only to demonstrate to others that such effects are often statistically significant. Dont remove individual insignificant effects from
the model.
10. Check additivity assumptions by testing pre-specified interaction terms. Use a global test and either keep all or delete
all interactions.
11. Check to see if there are overly-influential observations.
12. Check distributional assumptions and choose a different model
if needed.
13. Do limited backwards step-down variable selection if parsimony is more important that accuracy [150]. But confidence
limits, etc., must account for variable selection (e.g., bootstrap).
14. This is the final model.
15. Interpret the model graphically and by computing predicted
values and appropriate test statistics. Compute pooled tests
of association for collinear predictors.
16. Validate this model for calibration and discrimination ability,
preferably using bootstrapping.
4-46
17. Shrink parameter estimates if there is overfitting but no further data reduction is desired (unless shrinkage built-in to
estimation)
18. When missing values were imputed, adjust final variancecovariance matrix for imputation. Do this as early as possible because it will affect other findings.
19. When all steps of the modeling strategy can be automated,
consider using Faraways method [54] to penalize for the randomness inherent in the multiple steps.
20. Develop simplifications to the final model as needed.
4.12.2
Developing Models for Effect Estimation

V
1. Less need for parsimony; even less need to remove insignificant variables from model (otherwise CLs too narrow)
2. Careful consideration of interactions; inclusion forces estimates to be conditional and raises variances
3. If variable of interest is mostly the one that is missing, multiple imputation less valuable
4. Complexity of main variable specified by prior beliefs, compromise between variance and bias
5. Dont penalize terms for variable of interest
6. Model validation less necessary
4-47
4.12.3
Developing Models for Hypothesis Testing

W
1. Virtually same as previous strategy

2. Interactions require tests of effect by varying values of another variable, ormain effect + interactionjoint tests (e.g.,
is treatment effective for either sex, allowing effects to be
different)
3. Validation may help quantify overadjustment
Chapter 5
Describing, Resampling, Validating, and

Simplifying the Model
5.1
Describing the Fitted Model

5.1.1
Interpreting Effects
A
Regression coefficients if 1 d.f. per factor, no interaction

Not standardized regression coefficients
Many programs print meaningless estimates such as effect
of increasing age2 by one unit, holding age constant
Need to account for nonlinearity, interaction, and use meaningful ranges
For monotonic relationships, estimate X at quartiles of
continuous variables, separately for various levels of inter5-1
CHAPTER 5. DESCRIBING, RESAMPLING, VALIDATING, AND SIMPLIFYING THE MODEL
5-2
acting factors
Subtract estimates, anti-log, e.g., to get inter-quartile-range
odds or hazards ratios. Base C.L. on s.e. of difference. See
Figure 10.10.
Partial effect plot: Plot effect of each predictor on X or

some transformation. See Figure 10.8. See also [91].
Nomogram. See Figure 10.12.
Use regression tree to approximate the full model
5.1.2
Indexes of Model Performance

Error Measures
C
Central tendency of prediction errors

Mean absolute prediction error: mean |Y Y |
Mean squared prediction error
Binary Y : Brier score (quadratic proper scoring rule)
Logarithmic proper scoring rule (avg. log-likelihood)
Discrimination measures
5-3
Pure discrimination: rank correlation of (Y , Y )

Spearman , Kendall , Somers Dxy
Y binary Dxy = 2 (C 12 )
C = concordance probability = area under receiver operating characteristic curve Wilcoxon-Mann-Whitney
statistic
Mostly discrimination: R2
2
Radj
overfitting corrected if model pre-specified
Brier score can be decomposed into discrimination and

calibration components
Discrimination measures based on variation in Y
regression sum of squares
gindex
Calibration measures
calibrationinthelarge: average Y vs. average Y

high-resolution calibration curve (calibrationinthesmall).
See Figure 8.7.
calibration slope and intercept
maximum absolute calibration error
5-4
mean absolute calibration error

0.9 quantile of calibration error
F
gIndex
Based on Ginis mean difference

mean over all possible i 6= j of |Zi Zj |
interpretable, robust, highly efficient measure of variation
g = Ginis mean difference of Xi = Y
Example: Y = systolic blood pressure; g = 11mmHg is
typical difference in Y
Independent of censoring etc.
For models in which anti-log of difference in Y represent
meaningful ratios (odds ratios, hazard ratios, ratio of medians):
gr = exp(g)
For models in which Y can be turned into a probability
estimate (e.g., logistic regression):
gp = Ginis mean difference of P
These gindexes represent e.g. typical odds ratios, typical
risk differences
Can define partial g
5-5
5-6
5.2
The Bootstrap
H
If know population model, use simulation or analytic derivations to study behavior of statistical estimator
Suppose Y has a cumulative dist. fctn. F (y) = Prob{Y
y}
We have sample of size n from F (y),
Y1, Y2, . . . , Yn
Steps:
1. Repeatedly simulate sample of size n from F
2. Compute statistic of interest
3. Study behavior over B repetitions
Example: 1000 samples, 1000 sample medians, compute
their sample variance
F unknown estimate by empirical dist. fctn.
n
1 X
Fn(y) =
[Yi y].
n i=1
Example: sample of size n = 30 from a normal distribution
with mean 100 and SD 10
set.seed (6)
x rnorm (30 , 100 , 20)
xs seq (50 , 150 , length =150)
5-7
cdf pnorm ( xs , 100 , 20)

plot ( xs , cdf , type = l , ylim = c (0 ,1) ,
xlab = expression ( x ) ,
ylab = expression ( paste ( " Prob [ " , X x , " ] " ) ) )
lines ( ecdf ( x ) , cex = .5 )
1.0
0.8
Prob[X x]
0.6
0.4
0.2
0.0
60
80
100
120
140
x
Figure 5.1: Empirical and population cumulative distribution function
Fn corresponds to density function placing probability n1 at

each observed data point ( nk if point duplicated k times)
Pretend that F Fn
Sampling from Fn sampling with replacement from observed data Y1, . . . , Yn
Large n selects 1 e1 0.632 of original data points
in each bootstrap sample at least once
Some observations not selected, others selected more than
once
5-8
Efrons bootstrap general-purpose technique for estimating properties of estimators without assuming or knowing
distribution of data F
Take B samples of size n with replacement, choose B so
that summary measure of individual statistics summary if
B=
Bootstrap based on distribution of observed differences between a resampled parameter estimate and the original estimate telling us about the distribution of unobservable differences between the original estimate and the unknown parameter
Example: Data (1, 5, 6, 7, 8, 9), obtain 0.80 confidence interval
for population median, and estimate of population expected
value of sample median (only to estimate the bias in the original
estimate of the median).
options ( digits =3)
y c (2 ,5 ,6 ,7 ,8 ,9 ,10 ,11 ,12 ,13 ,14 ,19 ,20 ,21)
y c (1 ,5 ,6 ,7 ,8 ,9)
set.seed (17)
n length ( y )
n2 n / 2
n21 n2 +1
B 400
M double ( B )
plot (0 , 0 , xlim = c (0 , B ) , ylim = c (3 ,9) ,
xlab = " Bootstrap Samples Used " ,
ylab = " Mean and 0 .1 , 0 .9 Quantiles " , type = " n " )
for ( i in 1: B ) {
s sample (1: n , n , replace = T )
x sort ( y [ s ])
m .5 * ( x [ n2 ]+ x [ n21 ])
M[i] m
if ( i 20) {
w as.character ( x )
cat (w , " & & " , sprintf ( % .1f ,m ) ,
if ( i < 20) " \\\\\ n " else " \\\\ \\ hline \ n " ,
5-9
file = / doc / rms / validate / tab.tex , append = i > 1)

}
points (i , mean ( M [1: i ]) , pch =46)
if ( i 10)
{
q quantile ( M [1: i ] , c ( .1 , .9 ) )
points (i , q [1] , pch =46 , col = blue )
points (i , q [2] , pch =46 , col = blue )
}
}
table ( M )
M
1
6
3 3.5
10
7
4 4.5
8
2
5 5.5
23 43
6 6.5
75 59
7 7.5
66 47
8 8.5
42 11
9
1
hist (M , nclass = length ( unique ( M ) ) , xlab = " " , main = " " )
60
Frequency
Mean and 0.1, 0.9 Quantiles
40
20
4
0
0
100
200
300
400
Bootstrap Samples Used

Figure 5.2: Estimating properties of sample median using the bootstrap
First 20 samples:
Bootstrap Sample
166789
155568
578999
777889
157799
156678
788888
555799
155779
155778
115577
115578
155778
156788
156799
667789
157889
668999
115569
168999
5-10
Sample Median
6.5
5.0
8.5
7.5
7.0
6.0
8.0
6.0
6.0
6.0
5.0
5.0
6.0
6.5
6.5
7.0
7.5
8.5
5.0
8.5
N
Histogram tells us whether we can assume normality for the

bootstrap medians or need to use quantiles of medians to
construct C.L.
Need high B for quantiles, low for variance (but see [17])
5-11
5.3
Model Validation
5.3.1
Introduction
O
External validation (best: another country at another time);

also validates sampling, measurementsa
Internal
apparent (evaluate fit on same data used to create fit)
data splitting
cross-validation
bootstrap: get overfitting-corrected accuracy index
Best way to make model fit data well is to discard much of
the data
Predictions on another dataset will be inaccurate
Need unbiased assessment of predictive accuracy
Working definition of external validation: Validation of
a prediction tool on a sample that was not available at publia But
in many cases it is better to combine data and include country or calendar time as a predictor.
5-12
cation time. Alternate: Validation of a prediction tool by an

independent research team.
One suggested hierarchy of the quality of various validation
methods is as follows, ordered from worst to best.
1. Attempting several validations (internal or external) and reporting only the one that worked
2. Reporting apparent performance on the training dataset (no
validation)
3. Reporting predictive accuracy on an undersized independent
test sample
4. Internal validation using data splitting where at least one of
the training and test samples is not huge and the investigator
is not aware of the arbitrariness of variable selection done
on a single sample
5. Strong internal validation using 100 repeats of 10-fold crossvalidation or several hundred bootstrap resamples, repeating
all analysis steps involving Y afresh at each re-sample and
the arbitrariness of selected important variables is reported
(if variable selection is used)
6. External validation on a large test sample, done by the original research team
7. Re-analysis by an independent research team using strong
internal validation of the original dataset
5-13
8. External validation using new test data, done by an independent research team
9. External validation using new test data generated using different instruments/technology, done by an independent research team
Some points to consider:
Unless both sample sizes are huge, external validation can

be low precision
External validation can be costly and slow and may result in
disappointment that would have been revealed earlier with
rigorous internal validation
External validation is sometimes gamed; researchers disappointed in the validation sometimes ask for a do over; resampling validation is harder to game as long as all analytical
steps using Y are repeated each time.
Instead of external validation to determine model applicability at a different time or place, and being disappointed if
the model does not work in that setting, consider building a
unified model containing time and place as predictors
When the model was fully pre-specified, external validation
tests the model
5-14
But when the model was fitted using machine learning, feature screening, variable selection, or model selection, the
model developed using training data is usually only an example of a model, and the test sample validation could be
called an example validation
When resampling is used to repeat all modeling steps for
each resample, rigorous internal validation tests the process
used to develop the model and happens to also provide a
high-precision estimate of the likely future performance of
the final model developed using that process, properly penalizing for model uncertainty.
Resampling also reveals the volatility of the model selection
process
See BBR 10.11
5.3.2
Which Quantities Should Be Used in Validation?

U
OLS: R is one good measure for quantifying drop-off in

predictive ability
Example: n = 10, p = 9, apparent R2 = 1 but R2 will be
close to zero on new subjects
Example: n = 20, p = 10, apparent R2 = .9, R2 on new
5-15
2
= 0.79
data 0.7, Radj
Adjusted R2 solves much of the bias problem assuming p in

its formula is the largest number of parameters ever examined against Y
Few other adjusted indexes exist
Also need to validate models with phantom d.f.
Cross-validation or bootstrap can provide unbiased estimate
of any index; bootstrap has higher precision
Two main types of quantities to validate

1. Calibration or reliability: ability to make unbiased estimates of response (Y vs. Y )
2. Discrimination: ability to separate responses
OLS: R2; gindex; binary logistic model: ROC area,
equivalent to rank correlation between predicted probability of event and 0/1 event
Unbiased validation nearly always necessary, to detect overfitting
5.3.3
Data-Splitting
W
Split data into training and test sets
5-16
Interesting to compare index of accuracy in training and test

Freeze parameters from training
Make sure you allow R2 = 1 SSE/SST for test sample
to be < 0
Dont compute ordinary R2 on X vs. Y ; this allows for
linear recalibration aX + b vs. Y
Test sample must be large enough to obtain very accurate
assessment of accuracy
Training sample is whats left

Example: overall sample n = 300, training sample n = 200,
predict on test sample (n = 100),
develop model, freeze ,
P
2
(Yi Xi )
2
R = 1 P(Y Y )2 .
i
Disadvantages of data splitting:

1. Costly in n [137, 20]
2. Requires decision to split at beginning of analysis
3. Requires larger sample held out than cross-validation
4. Results vary if split again
5. Does not validate the final model (from recombined data)
6. Not helpful in getting CL corrected for var. selection
5-17
5.3.4
Improvements on Data-Splitting: Resampling

Z
No sacrifice in sample size

Work when modeling process automated
Bootstrap excellent for studying arbitrariness of variable selection [141]. See P. ??.
Cross-validation solves many problems of data splitting [163,
145, 181, 49]
Example of -validation:
1. Split data at random into 10 tenths
1
of data at a time
2. Leave out 10
9
3. Develop model on 10
, including any variable selection,
pre-testing, etc.
1
4. Freeze coefficients, evaluate on 10
5. Average R2 over 10 reps
Drawbacks:
1. Choice of number of groups and repetitions
2. Doesnt show full variability of var. selection
3. Does not validate full model
4. Lower precision than bootstrap
5-18
5. Need to do 50 repeats of 10-fold cross-validation to ensure adequate precision

Randomization method
1. Randomly permute Y
2. Optimism = performance of fitted model compared to
what expect by chance
5.3.5
Validation Using the Bootstrap

C
Estimate optimism of final whole sample fit without holding

out data
From original X and Y select sample of size n with replacement
Derive model from bootstrap sample
Apply to original sample
Simple bootstrap uses average of indexes computed on original sample
Estimated optimism = difference in indexes
Repeat about B = 100 times, get average expected optimism
5-19
Subtract average optimism from apparent index in final model

Example: n = 1000, have developed a final model that is
hopefully ready to publish. Call estimates from this final
model .
2
final model has apparent R2 (Rapp
) =0.4
2
how inflated is Rapp
?
get resamples of size 1000 with replacement from original

1000
2
for each resample compute Rboot
= apparent R2 in bootstrap sample
freeze these coefficients (call them boot), apply to original

2
(whole) sample (Xorig , Yorig ) to get Rorig
= R2(Xorig boot, Yorig )
2
2
optimism = Rboot
Rorig
average over B = 100 optimisms to get optimism

2
Roverf
itting
corrected
2
= Rapp
optimism
Example: See P. ??
Is estimating unconditional (not conditional on X) distribution of R2, etc. [54, p. 217]
5-20
Conditional estimates would require assuming the model one

is trying to validate
Efrons .632 method may perform better (reduce bias further) for small n [49], [50, p. 253], [51]
Bootstrap useful for assessing calibration in addition to discrimination:
F
Fit C(Y |X) = X on bootstrap sample

Re-fit C(Y |X) = 0 + 1X on same data
0 = 0, 1 = 1
Test data (original dataset): re-estimate 0, 1
1 < 1 if overfit, 0 > 0 to compensate
1 quantifies overfitting and useful for improving calibration [150]
Use Efrons method to estimate optimism in (0, 1), estimate
(0, 1) by subtracting optimism from (0, 1)
See also Copas [35] and van Houwelingen and le Cessie [163,
p. 1318]
See [56] for warnings about the bootstrap, and [49] for variations
5-21
on the bootstrap to reduce bias.

Use bootstrap to choose between full and reduced models:
G
Bootstrap estimate of accuracy for full model

Repeat, using chosen stopping rule for each re-sample
Full fit usually outperforms reduced model [150]
Stepwise modeling often reduces optimism but this is not
offset by loss of information from deleting marginal var.
Method
Apparent Rank
OverBias-Corrected
Correlation of Optimism
Correlation
Predicted vs.
Observed
Full Model
0.50
0.06
0.44
Stepwise Model
0.47
0.05
0.42
In this example, stepwise modeling lost a possible 0.500.47 =

0.03 predictive discrimination. The full model fit will especially
be an improvement when
1. The stepwise selection deleted several variables which were

almost significant.
2. These marginal variables have some real predictive value,
even if its slight.
3. There is no small set of extremely dominant variables that
would be easily found by stepwise selection.
Other issues:
5-22
See [163] for many interesting ideas

Faraway [54] shows how bootstrap is used to penalize for
choosing transformations for Y , outlier and influence checking, variable selection, etc. simultaneously
Brownstone [23, p. 74] feels thattheoretical statisticians have
been unable to analyze the sampling properties of [usual
multi-step modeling strategies] under realistic conditions
and concludes that the modeling strategy must be completely specified and then bootstrapped to get consistent
estimates of variances and other sampling properties
See Blettner and Sauerbrei [16] and Chatfield [29] for more
interesting examples of problems resulting from data-driven
analyses.
5-23
5.4
Bootstrapping Ranks of Predictors

J
Order of importance of predictors not pre-specified

Researcher interested in determining winners and losers
Bootstrap useful in documenting the difficulty of this task
Get confidence limits of the rank of each predictor in the
scale of partial 2 - d.f.
Example using OLS
#
#
#
#
Use the plot method for anova , with pl = FALSE to s u p p r e s s actual

p l o t t i n g of c h i - s q u a r e - d.f. for each b o o t s t r a p r e p e t i t i o n .
Rank the n e g a t i v e of the a d j u s t e d c h i - s q u a r e s so that a rank of
1 is a s s i g n e d to the h i g h e s t .
It is i m p o r t a n t to tell
#
#
p l o t . a n o v a . r m s not to sort the results , or every b o o t s t r a p

r e p l i c a t i o n w o u l d h a v e r a n k s of 1 ,2 ,3 , ... for the s t a t s .
require ( rms )
n 300
set.seed (1)
d data.frame ( x1 = runif ( n ) , x2 = runif ( n ) , x3 = runif ( n ) , x4 = runif ( n ) ,
x5 = runif ( n ) , x6 = runif ( n ) , x7 = runif ( n ) , x8 = runif ( n ) ,
x9 = runif ( n ) , x10 = runif ( n ) , x11 = runif ( n ) , x12 = runif ( n ) )
d $ y with (d , 1 * x1 + 2 * x2 + 3 * x3 + 4 * x4 + 5 * x5 + 6 * x6 + 7 * x7 +
8 * x8 + 9 * x9 + 10 * x10 + 11 * x11 + 12 * x12 + 9 * rnorm ( n ) )
f ols ( y x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10 + x11 + x12 , data = d )
B 1000
ranks matrix ( NA , nrow =B , ncol =12)
rankvars function ( fit )
rank ( plot ( anova ( fit ) , sort = none , pl = FALSE ) )
Rank rankvars ( f )
for ( i in 1: B ) {
j sample (1: n , n , TRUE )
bootfit update (f , data =d , subset = j )
ranks [i ,] rankvars ( bootfit )
}
lim t ( apply ( ranks , 2 , quantile , probs = c ( .025 , .975 ) ) )
predictor factor ( names ( Rank ) , names ( Rank ) )
w data.frame ( predictor , Rank , lower = lim [ ,1] , upper = lim [ ,2])
require ( ggplot2 )
ggplot (w , aes ( x = predictor , y = Rank ) ) + geom_point () + coord_flip () +
5-24
scale_y_continuous ( breaks =1:12) +

geom_errorbar ( aes ( ymin = lim [ ,1] , ymax = lim [ ,2]) , width =0)
x12
x11
x10
predictor
x9
x8
x7
x6
x5
x4
x3
x2
x1
10
11
12
Rank
Figure 5.3: Bootstrap percentile 0.95 confidence limits for ranks of predictors in an OLS model. Ranking is on the basis of partial
2 minus d.f. Point estimates are original ranks
5-25
5.5
Simplifying the Final Model by Approximating It

5.5.1
Difficulties Using Full Models

K
Predictions are conditional on all variables, standard errors

when predict for a low-frequency category
Collinearity
Can average predictions over categories to marginalize,
s.e.
5.5.2
Approximating the Full Model

L
Full model is gold standard

Approximate it to any desired degree of accuracy
If approx. with a tree, best c-v tree will have 1 obs./node
Can use least squares to approx. model by predicting Y =
X
When original model also fit using least squares, coef. of
5-26
approx. model against Y coef. of subset of variables fitted

against Y (as in stepwise)
Model approximation still has some advantages
1. Uses unbiased estimate of from full fit
2. Stopping rule less arbitrary
3. Inheritance of shrinkage
If estimates from full model are and approx. model is based
on a subset T of predictors X, coef. of approx. model are
where
W ,
W = (T 0T )1T 0X
Variance matrix of reduced coef.: W V W 0
5-27
5.6
How Do We Break Bad Habits?

N
Insist on validation of predictive models and discoveries

Show collaborators that split-sample validation is not appropriate unless the number of subjects is huge
Split more than once and see volatile results
Calculate a confidence interval for the predictive accuracy
in the test dataset and show that it is very wide
Run simulation study with no real associations and show
that associations are easy to find
Analyze the collaborators data after randomly permuting
the Y vector and show some positive findings
Show that alternative explanations are easy to posit
Importance of a risk factor may disappear if 5 unimportant risk factors are added back to the model
Omitted main effects can explain apparent interactions
Uniqueness analysis: attempt to predict the predicted values from a model derived by data torture from all of the
features not used in the model
Chapter 6
R Software
R allows interaction spline functions, wide variety of predictor parameterizations, wide variety of models, unifying model
formula language, model validation by resampling.
R is comprehensive:
Easy to write R functions for new models wide variety
of modern regression models implemented (trees, nonparametric, ACE, AVAS, survival models for multiple events)
Designs can be generated for any model all handle class
var, interactions, nonlinear expansions
Single R objects (e.g., fit object) can be self-documenting
automatic hypothesis tests, predictions for new data
Superior graphics
Classes and generic functions
6-1
CHAPTER 6. R SOFTWARE
6-2
6.1
The R Modeling Language

R statistical modeling language:
response terms
y age + sex
y age + sex + age : sex
y age * sex
#
#
#
#
age + sex main effects

add second-order interaction
second-order interaction +
all main effects
y ( age + sex + pressure ) 2

#
age + sex + pressure + age : sex + age : pressure...
y ( age + sex + pressure ) 2 - sex : pressure

#
#
#
all main e f f e c t s and all 2 nd order

interactions except sex : pressure
age + race + sex + age : sex + race : sex
age * sex ) # n o i n t e r a c t . w i t h r a c e , s e x
y ( age + race ) * sex

y treatment * ( age * race +
sqrt ( y ) sex * sqrt ( age ) + race
#
#
functions ,
race is an
with dummy v a r i a b l e s g e n e r a t e d if
R factor ( classification ) variable
poly ( age ,2)
# poly generates orthogonal
y sex +
polynomials
race.sex interaction ( race , sex )
y age + race.sex
# for when you want dummy variables for
#
all
combinations
of
the
factors
The formula for a regression model is given to a modeling function, e.g.

lrm ( y rcs (x ,4) )
is read use a logistic regression model to model y as a function

of x, representing x by a restricted cubic spline with 4 default
knotsa.
update
f
f2
f3
f4
f5
a lrm
function: re-fit model with changes in terms or data:
lrm ( y rcs (x ,4) + x2 + x3 )

update (f , subset = sex == " male " )
update (f , ..-x2 )
# remove x2 from model
update (f , .. + rcs ( x5 ,5) ) # a d d r c s ( x 5 , 5 ) t o m o d e l
update (f , y2 . )
# same terms , new response var.
and rcs are in the rms package.
6-3
6.2
User-Contributed Functions
R is high-level object-oriented language.
R (UNIX, Linux, Mac, Windows)
Multitude of user-contributed functions freely available
International community of users
Some R functions:
See Venables and Ripley
Hierarchical clustering: hclust
Principal components: princomp, prcomp
Canonical correlation: cancor
Nonparametric transform-both-sides additive models:
ace, avas
Parametric transform-both-sides additive models:

areg, areg.boot (Hmisc package in R))
Rank correlation methods:
rcorr, hoeffd, spearman2 (Hmisc)
Variable clustering: varclus (Hmisc)

Single imputation: transcan (Hmisc)
Multiple imputation: aregImpute (Hmisc)
Restricted cubic splines:
rcspline.eval (Hmisc)
Re-state restricted spline in simpler form:
rcspline.restate (Hmisc)
6-4
6-5
6.3
The rms Package

datadist function to compute predictor distribution summaries
y sex + lsp ( age , c (20 ,30 ,40 ,50 ,60) ) +
sex % ia % lsp ( age , c (20 ,30 ,40 ,50 ,60) )
E.g. restrict age cholesterol interaction to be of form AF (B)+

BG(A):
y lsp ( age ,30) + rcs ( cholesterol ,4) +
lsp ( age ,30) % ia % rcs ( cholesterol ,4)
Special fitting functions by Harrell to simplify procedures described in these notes:

Table 6.1: rms Fitting Functions
Function
ols
lrm
orm
psm
cph
bj
Glm
Gls
Rq
Purpose
Ordinary least squares linear model
Binary and ordinal logistic regression model
Has options for penalized MLE
Ordinal semi-parametric regression model for
continuous Y and several link functions
Accelerated failure time parametric survival
models
Cox proportional hazards regression
Buckley-James censored least squares model
rms version of glm
rms version of gls
rms version of rq
Related R
Functions
lm
glm
polr,lrm
survreg
coxph
survreg,lm
glm
gls (nlme package)
rq (quantreg package)
6-6
Table 6.2: rms Transformation Functions
Function
Purpose
asis
rcs
pol
lsp
catg
scored
matrx
strat
No post-transformation (seldom used explicitly)

Restricted cubic splines
Polynomial using standard notation
Linear spline
Categorical predictor (seldom)
Ordinal categorical variables
Keep variables as group for anova and fastbw
Non-modeled stratification factors
(used for cph only)
Function
Purpose
Print parameters and statistics of fit
Fitted regression coefficients
Formula used in the fit
Detailed specifications of fit
Fetch covariance matrix
Fetch maximized log-likelihood
Fetch AIC with option to put on chi-square basis
Likelihood ratio test for two nested models
Compute all univariable LR 2
Robust covariance matrix estimates
Bootstrap covariance matrix estimates
and bootstrap distributions of estimates
Find optimum penalty factors by tracing
effective AIC for a grid of penalties
Print effective d.f. for each type of variable
in model, for penalized fit or pentrace result
Summary of effects of predictors
Plot continuously shaded confidence bars
for results of summary
Wald tests of most meaningful hypotheses
Graphical depiction of anova
General contrasts, C.L., tests
Easily generate predictor combinations
Obtain predicted values or design matrix
Obtain predicted values and confidence limits easily
varying a subset of predictors and others set at
default values
Plot the result of Predict using lattice
Plot the result of Predict using ggplot2
Fast backward step-down variable selection
(or resid) Residuals, influence stats from fit
Sensitivity analysis for unmeasured confounder
Which observations are overly influential
LATEX representation of fitted model
print
coef
formula
specs
vcov
logLik
AIC
lrtest
univarLR
robcov
bootcov
pentrace
effective.df
summary
plot.summary
anova
plot.anova
contrast
gendata
predict
Predict
plot.Predict
ggplot.Predict
fastbw
residuals
sensuc
which.influence
latex
Related R
Functions
I
ns
poly
factor
ordered
matrix
strata
Related Functions
step
residuals
Function
Function
Function
Hazard
Survival
Quantile
Mean
nomogram
survest
survplot
validate
val.prob
val.surv
calibrate
vif
naresid
naprint
impute
Purpose
R function analytic representation of X
from a fitted regression model
R function analytic representation of a fitted
hazard function (for psm)
R function analytic representation of fitted
survival function (for psm, cph)
function for quantiles of survival time
(for psm, cph)
function for mean survival time or for ordinal logistic
Draws a nomogram for the fitted model
Estimate survival probabilities (psm, cph)
Plot survival curves (psm, cph)
Validate indexes of model fit using resampling
External validation of a probability model
External validation of a survival model
Estimate calibration curve using resampling
Variance inflation factors for fitted model
Bring elements corresponding to missing data
back into predictions and residuals
Print summary of missing values
Impute missing values
6-7
Related Functions
latex
latex, plot
survfit
plot.survfit
lrm
calibrate
val.prob
aregImpute
Example:
treat: categorical variable with levels "a","b","c"
num.diseases: ordinal variable, 0-4
age: continuous
Restricted cubic spline
cholesterol: continuous
(3 missings; use median)
log(cholesterol+10)
Allow treat cholesterol interaction

Program to fit logistic model, test all effects in design, es-
6-8
timate effects (e.g. inter-quartile range odds ratios), plot

estimated transformations
require ( rms )
# make new functions available
ddist datadist ( cholesterol , treat , num.diseases , age )
#
Could
have
used
ddist
datadist ( data.frame.name )
options ( datadist = " ddist " )

# d e f i n e s data dist. to rms
cholesterol impute ( cholesterol )
fit lrm ( y treat + scored ( num.diseases ) + rcs ( age ) +
log ( cholesterol +10) + treat : log ( cholesterol +10) )
describe ( y treat + scored ( num.diseases ) + rcs ( age ) )
#
#
or use d e s c r i b e ( f o r m u l a ( fit ) ) for

d e s c r i b e f u n c t i o n ( in Hmisc ) gets
fit
robcov ( fit )
specs ( fit )
anova ( fit )
anova ( fit , treat , cholesterol )
plot ( anova ( fit ) )
summary ( fit )
plot ( summary ( fit ) )
summary ( fit , treat = " b " , age =60)
summary ( fit , age = c (50 ,70) )
summary ( fit , age = c (50 ,60 ,70) )
If
had
not
Estimate
defined
and
test
datadist ,
treatment
all v a r i a b l e s used in fit

simple s t a t i s t i c s on v a r i a b l e s
#
#
#
#
Would make all statistics that follow

use a robust covariance matrix
would need x =T , y = T in lrm ()
Describe the design characteristics
Test
#
#
#
#
#
#
Summarize anova graphically

Estimate effects using default ranges
G r a p h i c a l d i s p l a y of e f f e c t s with C.I.
Specify reference cell and adjustment val
E s t i m a t e effect of i n c r e a s i n g age from
50 to 70
these
by
themselves
# I n c r e a s e age f r o m 50 to 70 , a d j u s t to
# 60 when e s t i m a t i n g e f f e c t s of other
# factors
would have to define ranges for all var.
( b-a )
effect
averaged
over
cholesterols
contrast ( fit , list ( treat = b , cholesterol = c (150 ,200 ,250) ) ,

list ( treat = a , cholesterol = c (150 ,200 ,250) ) ,
type = average )
#
#
#
See the help file for contrast.rms for several examples

how to obtain joint tests of m u l t i p l e c o n t r a s t s and how
double differences ( interaction contrasts )
of
to
get
p Predict ( fit , age = seq (20 ,80 , length =100) , treat , conf.int = FALSE )
plot ( p )
# Plot relationship between age and log
#
or
ggplot (p)
#
#
#
odds , s e p a r a t e curve
no C.I.
Same but 2 panels
for
#
#
#
3 -dimensional perspective
cholesterol , and log odds
ranges for both variables
each
treat ,
plot (p , age | treat )

ggplot (p , groups = FALSE )
bplot ( Predict ( fit , age , cholesterol , np =50) )
p l o t for age ,
using default
plot ( Predict ( fit , num.diseases , fun = function ( x ) 1 / (1+ exp ( -x ) ) , conf.int = .9 ) ,

ylab = " Prob " )
# Plot e s t i m a t e d p r o b a b i l i t i e s i n s t e a d of
#
Again ,
if
no
datadist
were
# log odds
defined , would
( or use
have to
ggplot () )
tell plot
all
limits
logit predict ( fit , expand.grid ( treat = " b " , num.dis =1:3 , age = c (20 ,40 ,60) ,
cholesterol = seq (100 ,300 , length =10) ) )
#
Could
also
obtain
list
of
predictor
settings
interactively }
6-9
logit predict ( fit , gendata ( fit , nobs =12) )

#
#
#
#
Since age doesn t i n t e r a c t with anything , we

i n t e r a c t i v e l y try v a r i o u s t r a n s f o r m a t i o n s of
f u n c t i o n of age as the gold s t a n d a r d . We are
transformation.
can quickly and

age , t a k i n g the s p l i n e
seeking a linearizing
ag 10:80
logit predict ( fit , expand.grid ( treat = " a " , num.dis =0 , age = ag ,
cholesterol = median ( cholesterol ) ) , type = " terms " ) [ , " age " ]
#
Note :
#
#
#
#
#
#
if
" main effect " ignoring interaction terms

Could also use
l o g i t P r e d i c t ( f , a g e = ag , . . . ) $ y h a t ,
which allows e v a l u a t i o n of the shape for any level of i n t e r a c t i n g
factors.
When age does not interact with anything , the result from
p r e d i c t ( f , ... , t y p e =" t e r m s ") w o u l d e q u a l the r e s u l t f r o m
age
Predict
#
#
#
Could also specify

l o g i t p r e d i c t ( fit , g e n d a t a ( fit , a g e = ag , c h o l e s t e r o l = . . . ) )
U n - m e n t i o n e d v a r i a b l e s set to r e f e r e n c e values
if
interacted
all
other
with
terms
anything ,
were
plot ( ag .5 , logit )
plot ( ag 1 .5 , logit )
Draw
nomogram
for
the
try
try
square root
1 .5 power
# invokes
fit
model
would
be
the
age
ignored
#
#
latex ( fit )
this
vs.
latex.lrm ,
spline
creates
transform.
fit.tex
plot ( nomogram ( fit ) )

#
Compose
function
to
evaluate
linear
predictors
analytically
g Function ( fit )
g ( treat = b , cholesterol =260 , age =50)
#
Letting
num.diseases
default
to
reference
value
To examine interactions in a simpler way, you may want to

group age into tertiles:
age.tertile cut2 ( age , g =3)
#
For
automatic
ranges
later ,
add
age.tertile
fit lrm ( y age.tertile * rcs ( cholesterol ) )
to
datadist
input
6-10
6.4
Other Functions
supsmu: Friedmans super smoother
lowess: Clevelands scatterplot smoother
glm: generalized linear models (see Glm)
gam: Generalized additive models
rpart: Like original CART with surrogate splits for missings,
censored data extension (Atkinson & Therneau)
validate.rpart: in rms; validates recursive partitioning with
respect to certain accuracy indexes
loess: multi-dimensional scatterplot smoother
f loess ( y age * pressure )
plot ( f )
# cross-sectional plots
ages seq (20 ,70 , length =40)
pressures seq (80 ,200 , length =40)
pred predict (f , expand.grid ( age = ages , pressure = pressures ) )
persp ( ages , pressures , pred )
# 3 -d plot
Chapter 7
Modeling Longitudinal Responses using

Generalized Least Squares
7.1
Notation
A
N subjects
Subject i (i = 1, 2, . . . , N ) has ni responses measured at
times ti1, ti2, . . . , tini
Response at time t for subject i: Yit
Subject i has baseline covariates Xi
Generally the response measured at time ti1 = 0 is a covariate in Xi instead of being the first measured response
Yi0
Time trend in response is modeled with k parameters so
that the time main effect has k d.f.
7-1
CHAPTER 7. MODELING LONGITUDINAL RESPONSES USING GENERALIZED LEAST SQUARES
7-2
Let the basis functions modeling the time effect be g1(t), g2(t), . . . , g
7-3
7.2
Model Specification for Effects on E(Y )

7.2.1
Common Basis Functions

B
k dummy variables for k + 1 unique times (assumes no functional form for time but may spend many d.f.)
k = 1 for linear time trend, g1(t) = t
korder polynomial in t
k + 1knot restricted cubic spline (one linear term, k 1
nonlinear terms)
7.2.2
Model for Mean Profile

C
A model for mean time-response profile without interactions

between time and any X:
E[Yit|Xi] = Xi + 1g1(t) + 2g2(t) + . . . + k gk (t)
Model with interactions between time and some Xs: add
product terms for desired interaction effects
Example: To allow the mean time trend for subjects in group
1 (reference group) to be arbitrarily different from time trend
7-4
for subjects in group 2, have a dummy variable for group 2,

a time main effect curve with k d.f. and all k products of
these time components with the dummy variable for group
2
7.2.3
Model Specification for Treatment Comparisons

D
In studies comparing two or more treatments, a response is

often measured at baseline (pre-randomization)
Analyst has the option to use this measurement as Yi0 or as
part of Xi
Jim Rochon (Rho, Inc., Chapel Hill NC) has the following
comments about this:
For RCTs, I draw a sharp line at the point when the intervention begins. The LHS [left hand side of the model
equation] is reserved for something that is a response to treatment. Anything before this point can potentially be
included as a covariate in the regression model. This includes thebaselinevalue of the outcome variable. Indeed,
the best predictor of the outcome at the end of the study is typically where the patient began at the beginning. It
drinks up a lot of variability in the outcome; and, the effect of other covariates is typically mediated through this
variable.
I treat anything after the intervention begins as an outcome. In the western scientific method, an effect must
follow the cause even if by a split second.
Note that an RCT is different than a cohort study. In a cohort study, Time 0 is not terribly meaningful. If we
want to model, say, the trend over time, it would be legitimate, in my view, to include the baseline value on the
LHS of that regression model.
Now, even if the intervention, e.g., surgery, has an immediate effect, I would include still reserve the LHS for
anything that might legitimately be considered as the response to the intervention. So, if we cleared a blocked
artery and then measured the MABP, then that would still be included on the LHS.
Now, it could well be that most of the therapeutic effect occurred by the time that the first repeated measure was
taken, and then levels off. Then, a plot of the means would essentially be two parallel lines and the treatment
effect is the distance between the lines, i.e., the difference in the intercepts.
If the linear trend from baseline to Time 1 continues beyond Time 1, then the lines will have a common intercept
but the slopes will diverge. Then, the treatment effect will the difference in slopes.
One point to remember is that the estimated intercept is the value at time 0 that we predict from the set of
repeated measures post randomization. In the first case above, the model will predict different intercepts even
though randomization would suggest that they would start from the same place. This is because we were asleep
at the switch and didnt record the action from baseline to time 1. In the second case, the model will predict the
same intercept values because the linear trend from baseline to time 1 was continued thereafter.
More importantly, there are considerable benefits to including it as a covariate on the RHS. The baseline value
tends to be the best predictor of the outcome post-randomization, and this maneuver increases the precision of
7-5
the estimated treatment effect. Additionally, any other prognostic factors correlated with the outcome variable will
also be correlated with the baseline value of that outcome, and this has two important consequences. First, this
greatly reduces the need to enter a large number of prognostic factors as covariates in the linear models. Their
effect is already mediated through the baseline value of the outcome variable. Secondly, any imbalances across the
treatment arms in important prognostic factors will induce an imbalance across the treatment arms in the baseline
value of the outcome. Including the baseline value thereby reduces the need to enter these variables as covariates
in the linear models.
Stephen Senn [144] states that temporally and logically, a

baseline cannot be a response to treatment, so baseline and
response cannot be modeled in an integrated framework.
. . . one should focus clearly on outcomes as being the only values that can be influenced by treatment and examine
critically any schemes that assume that these are linked in some rigid and deterministic view to baseline values.
An alternative tradition sees a baseline as being merely one of a number of measurements capable of improving
predictions of outcomes and models it in this way.
The final reason that baseline cannot be modeled as the response at time zero is that many studies have inclusion/exclusion criteria that include cutoffs on the baseline variable.
In other words, the baseline measurement comes from a truncated distribution. In general it is not appropriate to model
the baseline with the same distributional shape as the followup measurements. Thus the approach recommended by Liang
and Zeger [107] and Liu et al. [110] are problematica.
a In addition to this, one of the papers conclusions that analysis of covariance is not appropriate if the population means of the baseline
variable are not identical in the treatment groups is not correct [144]. See [92] for a rebuke of [110].
7-6
7.3
Modeling Within-Subject Dependence

F
Random effects and mixed effects models have become very

popular
Disadvantages:
Induced correlation structure for Y may be unrealistic
Numerically demanding
Require complex approximations for distributions of test
statistics
Extended linear model (with no random effects) is a logical
extension of the univariate model (e.g., few statisticians use
subject random effects for univariate Y )
This was known as growth curve models and generalized
least squares [133, 63] and was developed long before mixed
effect models became popular
Pinheiro and Bates (Section 5.1.2) state that in some applications, one may wish to avoid incorporating random effects
in the model to account for dependence among observations,
choosing to use the within-group component i to directly
model variance-covariance structure of the response.
7-7
We will assume that Yit|Xi has a multivariate normal distribution with mean given above and with variance-covariance
matrix Vi, an ni ni matrix that is a function of ti1, . . . , tini
We further assume that the diagonals of Vi are all equal

Procedure can be generalized to allow for heteroscedasticity
over time or with respect to X (e.g., males may be allowed
to have a different variance than females)
This extended linear model has the following assumptions:
all the assumptions of OLS at a single time point including correct modeling of predictor effects and univariate
normality of responses conditional on X
the distribution of two responses at two different times for
the same subject, conditional on X, is bivariate normal
with a specified correlation coefficient
the joint distribution of all ni responses for the ith subject
is multivariate normal with the given correlation pattern
(which implies the previous two distributional assumptions)
responses from any times for any two different subjects
are uncorrelated
What Methods To Use for Repeated Measurements / Serial Data?
Assumes normality
Assumes independence of
measurements within subject
Assumes a correlation structuref
Requires same measurement
times for all subjects
Does not allow smooth modeling
of time to save d.f.
Does not allow adjustment for
baseline covariates
Does not easily extend to
non-continuous Y
Loses information by not using
intermediate measurements
Does not allow widely varying #
of observations per subject
Does not allow for subjects
to have distinct trajectoriesk
Assumes subject-specific effects
are Gaussian
Badly biased if non-random
dropouts
Biased in general
Harder to get tests & CLs
Requires large # subjects/clusters
SEs are wrong
Assumptions are not verifiable
in small samples
Does not extend to complex
settings such as time-dependent
covariates and dynamico models
a Thanks
Repeated
Measures
ANOVA
GEE
Mixed
Effects
Model
GLS LOCF
7-8
ab
Summary
Statisticc
N/A
to Charles Berry, Brian Cade, Peter Flom, Bert Gunter, and Leena Choi for valuable input.
generalized estimating equations; GLS: generalized least squares; LOCF: last observation carried forward.
c E.g., compute within-subject slope, mean, or area under the curve over time. Assumes that the summary measure is an adequate summary
of the time profile and assesses the relevant treatment effect.
d Unless one uses the Huynh-Feldt or Greenhouse-Geisser correction
e For full efficiency, if using the working independence model
f Or requires the user to specify one
g For full efficiency of regression coefficient estimates
h Unless the last observation is missing
i The cluster sandwich variance estimator used to estimate SEs in GEE does not perform well in this situation, and neither does the working
independence model because it does not weight subjects properly.
j Unless one knows how to properly do a weighted analysis
k Or uses population averages
l Unlike GLS, does not use standard maximum likelihood methods yielding simple likelihood ratio 2 statistics. Requires high-dimensional
integration to marginalize random effects, using complex approximations, and if using SAS, unintuitive d.f. for the various tests.
m Because there is no correct formula for SE of effects; ordinary SEs are not penalized for imputation and are too small
n If correction not applied
o E.g., a model with a predictor that is a lagged value of the response variable
b GEE:
7-9
Gardiner et al. [59] compared several longitudinal data models, especially with regard to assumptions and how regression
coefficients are estimated. Peters et al. [130] have an empirical study confirming that the use all available data approach
of likelihoodbased longitudinal models makes imputation of
follow-up measurements unnecessary.
7-10
7.4
Parameter Estimation Procedure

J
Generalized least squares

Like weighted least squares but uses a covariance matrix
that is not diagonal
Each subject can have her own shape of Vi due to each
subject being measured at a different set of times
Maximum likelihood
Newton-Raphson or other trial-and-error methods used for
estimating parameters
For small number of subjects, advantages in using REML
(restricted maximum likelihood) instead of ordinary MLE [45,
Section 5.3], [131, Chapter 5], [63] (esp. to get more unbiased
estimate of the covariance matrix)
When imbalances are not severe, OLS fitted ignoring subject
identifiers may be efficient
But OLS standard errors will be too small as they dont
take intra-cluster correlation into account
May be rectified by substituting covariance matrix estimated from Huber-White cluster sandwich estimator or
7-11
from cluster bootstrap

When imbalances are severe and intra-subject correlations
are strong, OLS is not expected to be efficient because it
gives equal weight to each observation
a subject contributing two distant observations receives
1
5 the weight of a subject having 10 tightly-spaced observations
7-12
7.5
Common Correlation Structures

M
Usually restrict ourselves to isotropic correlation structures

correlation between responses within subject at two times
depends only on a measure of distance between the two
times, not the individual times
We simplify further and assume depends on |t1 t2|
Can speak interchangeably of correlations of residuals within
subjects or correlations between responses measured at different times on the same subject, conditional on covariates
X
Assume that the correlation coefficient for Yit1 vs. Yit2 conditional on baseline covariates Xi for subject i is h(|t1 t2|, ),
where is a vector (usually a scalar) set of fundamental correlation parameters
Some commonly used structures when times are continuous
and are not equally spaced [131, Section 5.3.3] (nlme correlation function names are at the right if the structure is
implemented in nlme):
Compound symmetry : h = if t1 6= t2 , 1 if t1 = t2
(Essentially what two-way ANOVA assumes)
Autoregressive-moving average lag 1 : h = |t1 t2 | = s
where s = |t1 t2 |
Exponential : h = exp(s/)
Gaussian : h = exp[(s/)2 ]
Linear : h = (1 s/)[s < ]
nlme corCompSymm
corCAR1
corExp
corGaus
corLin
Rational quadratic : h = 1 (s/)2 /[1 + (s/)2 ]

Spherical : h = [1 1.5(s/) + 0.5(s/)3 ][s < ]
Linear exponent AR(1) : h =
sdmin
dmin + d
max dmin
7-13
corRatio
corSpher
, 1 if t1 = t2 [147]
The structures 37 use as a scaling parameter, not as

something restricted to be in [0, 1]
7-14
7.6
Checking Model Fit

O
Constant variance assumption: usual residual plots

Normality assumption: usual qq residual plots
Correlation pattern: Variogram
Estimate correlations of all possible pairs of residuals at
different time points
Pool all estimates at same absolute difference in time s
) vs. s on the
Variogram is a plot with y = 1 h(s,
x-axis
Superimpose the theoretical variogram assumed by the
model
7-15
7.7
R Software
P
Nonlinear mixed effects model package of Pinheiro & Bates

For linear models, fitting functions are
lme for mixed effects models
gls for generalized least squares without random effects
For this version the rms package has Gls so that many features of rms can be used:
anova :
all partial Wald tests, test of linearity, pooled tests

summary : effect estimates (differences in Y ) and confidence
limits, can be plotted
plot : continuous effect plots
nomogram : nomogram
Function : generate R function code for fitted model
latex : LATEX representation of fitted model
In addition, Gls has a bootstrap option (hence you do not
use rmss bootcov for Gls fits).
To get regular gls functions named anova (for likelihood ratio tests, AIC, etc.) or summary use anova.gls or summary.gls
nlme package has many graphics and fit-checking functions
Several functions will be demonstrated in the case study
7-16
7.8
Case Study
Consider the dataset in Table 6.9 of Davis [42, pp. 161-163] from
a multicenter, randomized controlled trial of botulinum toxin
type B (BotB) in patients with cervical dystonia from nine U.S.
sites.
Randomized to placebo (N = 36), 5000 units of BotB (N =
36), 10,000 units of BotB (N = 37)
Response variable: total score on Toronto Western Spasmodic Torticollis Rating Scale (TWSTRS), measuring severity, pain, and disability of cervical dystonia (high scores mean
more impairment)
TWSTRS measured at baseline (week 0) and weeks 2, 4, 8,
12, 16 after treatment began
Dataset cdystonia from web site
7.8.1
Graphical Exploration of Data

require ( rms )
getHdata ( cdystonia )
attach ( cdystonia )
#
Construct
unique
subject
ID
uid with ( cdystonia , factor ( paste ( site , id ) ) )

#
Tabulate
patterns
of
subjects
time
points
7-17
table ( tapply ( week , uid ,

function ( w ) paste ( sort ( unique ( w ) ) , collapse = ) ) )
0
1
0 2 4 8 12 16
94
0 2 4
1
0 2 4 8 16
1
superposing
Plot
raw
data ,
0 2 4 12 16
3
0 2 8 12 16
2
0 2 4 8
1
0 4 8 12 16
4
0 2 4 8 12
1
0 4 8 16
1
subjects
xl xlab ( Week ) ; yl ylab ( TWSTRS-total score )

ggplot ( cdystonia , aes ( x = week , y = twstrs , color = factor ( id ) ) ) +
geom_line () + xl + yl + facet_grid ( treat site ) +
guides ( color = FALSE ) # F i g . 7.1
60
10000U
40
60
5000U
TWSTRStotal score
20
40
20
60
Placebo
40
20
0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15 0 5 10 15
Week
Figure 7.1: Time profiles for individual subjects, stratified by study site and dose
#
Show
quartiles
require ( data.table )
cdystonia data.table ( cdystonia )
cdys cdystonia [ , j = as.list ( quantile ( twstrs , (1 : 3) / 4) ) ,
by = list ( treat , week ) ]
cdys upData ( cdys , rename = c ( 25% = Q1 , 50% = Q2 , 75% = Q3 ) , print = FALSE )
ggplot ( cdys , aes ( x = week , y = Q2 ) ) + xl + yl + ylim (0 , 70) +
geom_line () + facet_wrap ( treat , nrow =2) +
geom_ribbon ( aes ( ymin = Q1 , ymax = Q3 ) , alpha =0 .2 )
10000U
Fig.
7-18
7.2
5000U
60
TWSTRStotal score
40
20
0
Placebo
60
40
20
0
0
10
15
Week
Figure 7.2: Quartiles of TWSTRS stratified by dose
#
Show
means
with
bootstrap
nonparametric
CLs
cdys
cdystonia [ , j = as.list ( smean.cl.boot ( twstrs ) ) ,

by = list ( treat , week ) ]
ggplot ( cdys , aes ( x = week , y = Mean ) ) + xl + yl + ylim (0 , 70) +
geom_line () + facet_wrap ( treat , nrow =2) +
geom_ribbon ( aes ( x = week , ymin = Lower , ymax = Upper ) , alpha =0 .2 )
Fig.
7.3
10000U
5000U
60
TWSTRStotal score
40
20
0
Placebo
60
40
20
0
0
10
15
Week
Figure 7.3: Mean responses and nonparametric bootstrap 0.95 confidence limits for population means, stratified by dose
7-19
7-20
Model with Yi0 as Baseline Covariate

baseline subset ( data.frame ( cdystonia , uid ) , week == 0 ,
-week )
baseline upData ( baseline , rename = c ( twstrs = twstrs0 ) ,
print = FALSE )
followup subset ( data.frame ( cdystonia , uid ) , week > 0 ,
c ( uid , week , twstrs ) )
rm ( uid )
both
merge ( baseline , followup , by = uid )
dd
datadist ( both )
options ( datadist = dd )
7.8.2
Using Generalized Least Squares

We stay with baseline adjustment and use a variety of correlation structures, with constant variance. Time is modeled as a
restricted cubic spline with 3 knots, because there are only 3
unique interior values of week.
require ( nlme )
cp list ( corCAR1 , corExp , corCompSymm , corLin , corGaus , corSpher )
z vector ( list , length ( cp ) )
for ( k in 1: length ( cp ) ) {
z [[ k ]] gls ( twstrs treat * rcs ( week , 3) +
rcs ( twstrs0 , 3) + rcs ( age , 4) * sex , data = both ,
correlation = cp [[ k ]]( form = week | uid ) )
}
anova ( z [[1]] , z [[2]] , z [[3]] , z [[4]] , z [[5]] , z [[6]])
z [[1]]
z [[2]]
z [[3]]
z [[4]]
z [[5]]
z [[6]]
Model
1
2
3
4
5
6
df
20
20
20
20
20
20
AIC
3553.906
3553.906
3587.974
3575.079
3621.081
3570.958
BIC
3638.357
3638.357
3672.426
3659.531
3705.532
3655.409
logLik
-1756.953
-1756.953
-1773.987
-1767.540
-1790.540
-1765.479
AIC computed above is set up so that smaller values are best.

From this the continuous-time AR1 and exponential structures
7-21
are tied for the best. For the remainder of the analysis use
corCAR1, using Gls.
a Gls ( twstrs treat * rcs ( week , 3) + rcs ( twstrs0 , 3) +
rcs ( age , 4) * sex , data = both ,
correlation = corCAR1 ( form =week | uid ) )
print (a , latex = TRUE )
Generalized Least Squares Fit by REML

Gls(model = twstrs ~ treat * rcs(week, 3) + rcs(twstrs0, 3) +
rcs(age, 4) * sex, data = both, correlation = corCAR1(form = ~week |
uid))
Obs
522 Log-restricted-likelihood -1756.95
Clusters 108 Model d.f.
17
g
11.334
8.5917
d.f.
504
Intercept
treat=5000U
treat=Placebo
week
week
twstrs0
twstrs0
age
age
age
sex=M
treat=5000U * week
treat=Placebo * week
treat=5000U * week
treat=Placebo * week
age * sex=M
age * sex=M
age * sex=M
Coef
-0.3093
0.4344
7.1433
0.2879
0.7313
0.8071
0.2129
-0.1178
0.6968
-3.4018
24.2802
0.0745
-0.1256
-0.4389
-0.6459
-0.5846
1.4652
-4.0338
S.E.
11.8804
2.5962
2.6133
0.2973
0.3078
0.1449
0.1795
0.2346
0.6484
2.5599
18.6208
0.4221
0.4243
0.4363
0.4381
0.4447
1.2388
4.8123
t
Pr(> |t|)
-0.03
0.9792
0.17
0.8672
2.73
0.0065
0.97
0.3334
2.38
0.0179
5.57 < 0.0001
1.19
0.2360
-0.50
0.6158
1.07
0.2830
-1.33
0.1845
1.30
0.1929
0.18
0.8599
-0.30
0.7674
-1.01
0.3149
-1.47
0.1411
-1.31
0.1892
1.18
0.2375
-0.84
0.4023
Correlation Structure: Continuous AR(1)

Formula: ~week | uid
Parameter estimate(s):
Phi
0.8666689
= 0.8672, the estimate of the correlation between two mea-
7-22
surements taken one week apart on the same subject. The estimated correlation for measurements 10 weeks apart is 0.867210 =
0.24.
v Variogram (a , form = week | uid )
plot ( v ) # F i g u r e 7.4
Semivariogram
0.6
10
0.4
0.2
12
14
Distance
Figure 7.4: Variogram, with assumed correlation pattern superimposed
Check constant variance and normality assumptions:

both $ resid r resid ( a ) ; both $ fitted fitted ( a )
yl ylab ( Residuals )
p1 ggplot ( both , aes ( x = fitted , y = resid ) ) + geom_point () +
facet_grid ( treat ) + yl
p2 ggplot ( both , aes ( x = twstrs0 , y = resid ) ) + geom_point () + yl
p3 ggplot ( both , aes ( x = week , y = resid ) ) + yl + ylim ( -20 ,20) +
stat_summary ( fun.data = " mean_sdl " , geom = smooth )
p4 ggplot ( both , aes ( sample = resid ) ) + stat_qq () +
geom_abline ( intercept = mean ( r ) , slope = sd ( r ) ) + yl
gridExtra :: grid.arrange ( p1 , p2 , p3 , p4 , ncol =2)
# F i g u r e 7.5
Now get hypothesis tests, estimates, and graphically interpret

the model.
latex ( anova ( a ) , file = , label = longit-anova )
plot ( anova ( a ) )
Figure
Table
7.1
7.6
ylm ylim (25 , 60)

p1 ggplot ( Predict (a , week , treat , conf.int = FALSE ) ,
adj.subtitle = FALSE , legend.position = top ) + ylm
Residuals
20
20
5000U
20
Placebo
Residuals
10000U
20
40
40
20 30 40 50 60 70 20 30 40 50 60 70 20 30 40 50 60 70
30
fitted
40
50
60
twstrs0
20
20
10
Residuals
Residuals
7-23
20
10
40
20
4
week
12
16
theoretical
Figure 7.5: Three residual plots to check for absence of trends in central tendency and in variability. Upper right panel shows the
baseline score on the x-axis. Bottom left panel shows the mean 2SD. Bottom right panel is the QQ plot for checking normality
of residuals from the GLS fit.
Table 7.1: Wald Statistics for twstrs
treat (Factor+Higher Order Factors)

All Interactions
week (Factor+Higher Order Factors)
All Interactions
Nonlinear (Factor+Higher Order Factors)
twstrs0
Nonlinear
age (Factor+Higher Order Factors)
All Interactions
sex (Factor+Higher Order Factors)
All Interactions
treat week (Factor+Higher Order Factors)
Nonlinear
Nonlinear Interaction : f(A,B) vs. AB
age sex (Factor+Higher Order Factors)
Nonlinear
TOTAL NONLINEAR
TOTAL INTERACTION
TOTAL NONLINEAR + INTERACTION
TOTAL
sex
age * sex
age
treat * week
2
22.11
14.94
77.27
14.94
6.61
233.83
1.41
9.68
4.86
7.59
5.67
4.86
14.94
2.27
2.27
4.86
3.76
3.76
15.03
19.75
28.54
322.98
d.f.
6
4
6
4
3
2
1
6
3
4
4
3
4
2
2
3
2
2
8
7
11
17
P
0.0012
0.0048
< 0.0001
0.0048
0.0852
< 0.0001
0.2354
0.1388
0.1826
0.1077
0.2252
0.1826
0.0048
0.3208
0.3208
0.1826
0.1526
0.1526
0.0586
0.0061
0.0027
< 0.0001
treat
week
twstrs0
50
100
150
200
2 df
Figure 7.6: Results of anova.rms from generalized least squares fit with continuous time AR1 correlation structure
7-24
7-25
p2 ggplot ( Predict (a , twstrs0 ) , adj.subtitle = FALSE ) + ylm

p3 ggplot ( Predict (a , age , sex ) , adj.subtitle = FALSE ,
legend.position = top ) + ylm
gridExtra :: grid.arrange ( p1 , p2 , p3 , ncol =2)
# F i g u r e 7.7
Treatment
10000U
5000U
Placebo
60
^
X
48
^
X
44
40
40
20
36
4
12
30
16
40
Sex
50
60
TWSTRStotal score
Week
50
^
X
45
40
35
30
25
40
50
60
70
80
Age, years
Figure 7.7: Estimated effects of time, baseline TWSTRS, age, and sex
latex ( summary ( a ) , file = , table.env = FALSE ) # S h o w s f o r w e e k 8
Low High
Effect
S.E.
Lower 0.95 Upper 0.95
week
4
12 8 6.69100 1.10570
4.5238
8.8582
twstrs0
39
53 14 13.55100 0.88618
11.8140
15.2880
age
46
65 19 2.50270 2.05140
-1.5179
6.5234
treat 5000U:10000U
1
2
0.59167 1.99830
-3.3249
4.5083
treat Placebo:10000U
1
3
5.49300 2.00430
1.5647
9.4212
sex M:F
1
2
-1.08500 1.77860
-4.5711
2.4011
#
To
for
get
results
for
week
treatment ,
use
e.g.
summary (a ,
Compare
low
dose
with
for
placebo ,
different
w e e k =4 ,
separately
reference
group
treat = Placebo )
at
each
time
7-26
k1 contrast (a , list ( week = c (2 ,4 ,8 ,12 ,16) , treat = 5000 U ) ,

list ( week = c (2 ,4 ,8 ,12 ,16) , treat = Placebo ) )
options ( width =80)
print ( k1 , digits =3)
1
2
3
4*
5*
week twstrs0 age sex Contrast S . E . Lower Upper

Z Pr ( >| z |)
2
46 56
F
-6.31 2.10 -10.43 -2.186 -3.00
0.0027
4
46 56
F
-5.91 1.82 -9.47 -2.349 -3.25
0.0011
8
46 56
F
-4.90 2.01 -8.85 -0.953 -2.43
0.0150
12
46 56
F
-3.07 1.75 -6.49 0.361 -1.75
0.0795
16
46 56
F
-1.02 2.10 -5.14 3.092 -0.49
0.6260
Redundant contrasts are denoted by *

Confidence intervals are 0.95 individual intervals
#
Compare
high
dose
with
placebo
k2 contrast (a , list ( week = c (2 ,4 ,8 ,12 ,16) , treat = 10000 U ) ,

list ( week = c (2 ,4 ,8 ,12 ,16) , treat = Placebo ) )
print ( k2 , digits =3)
1
2
3
4*
5*
week twstrs0 age sex Contrast S . E . Lower Upper

Z Pr ( >| z |)
2
46 56
F
-6.89 2.07 -10.96 -2.83 -3.32
0.0009
4
46 56
F
-6.64 1.79 -10.15 -3.13 -3.70
0.0002
8
46 56
F
-5.49 2.00 -9.42 -1.56 -2.74
0.0061
12
46 56
F
-1.76 1.74 -5.17 1.65 -1.01
0.3109
16
46 56
F
2.62 2.09 -1.47 6.71 1.25
0.2099
Redundant contrasts are denoted by *

Confidence intervals are 0.95 individual intervals
k1 as.data.frame ( k1 [ c ( week , Contrast , Lower , Upper ) ])
p1 ggplot ( k1 , aes ( x = week , y = Contrast ) ) + geom_point () +
geom_line () + ylab ( Low Dose - Placebo ) +
geom_errorbar ( aes ( ymin = Lower , ymax = Upper ) , width =0)
k2 as.data.frame ( k2 [ c ( week , Contrast , Lower , Upper ) ])
p2 ggplot ( k2 , aes ( x = week , y = Contrast ) ) + geom_point () +
geom_line () + ylab ( High Dose - Placebo ) +
geom_errorbar ( aes ( ymin = Lower , ymax = Upper ) , width =0)
gridExtra :: grid.arrange ( p1 , p2 , ncol =2)
# F i g u r e 7.8
Although multiple d.f. tests such as total treatment effects or

treatment time interaction tests are comprehensive, their increased degrees of freedom can dilute power. In a treatment
comparison, treatment contrasts at the last time point (single
d.f. tests) are often of major interest. Such contrasts are informed by all the measurements made by all subjects (up until
dropout times) when a smooth time trend is assumed.
7-27
High Dose Placebo
Low Dose Placebo
12
16
week
week
Figure 7.8: Contrasts and 0.95 confidence limits from GLS fit
n nomogram (a , age = c ( seq (20 , 80 , by =10) , 85) )

plot (n , cex.axis = .55 , cex.var = .8 , lmgp = .25 ) # F i g u r e 7.9
12
16
10
20
30
40
50
60
70
7-28
80
90
100
Points
TWSTRStotal score
20
25
30
50
35
40
45
50
55
60
65
70
60
age (sex=F)
85
80
40 70 20
60
70
age (sex=M)
50
40
85
30
20
5000U
treat (week=2)
10000U
Placebo
5000U
treat (week=4)
10000U
Placebo
5000U
treat (week=8)
10000U
Placebo
10000U
treat (week=12)
5000U
Placebo
treat (week=16)
5000U
10000U
Total Points
0
10
20
30
40
50
60
70
80
90
100
110
120
130
50
55
60
65
70
Linear Predictor
15
20
25
30
35
40
45
Figure 7.9: Nomogram from GLS fit. Second axis is the baseline score.
140
Chapter 8
Logistic Model Case Study: Survival of

Titanic Passengers
Data source: The Titanic Passenger List edited by Michael A. Findlay, originally
published in Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens
Ltd, and expanded with the help of the Internet community. The original html files were
obtained from Philip Hind (1999) (http://atschool.eduweb.co.uk/phind). The dataset
was compiled and interpreted by Thomas Cason. It is available in Rand spreadsheet
formats from biostat.mc.vanderbilt.edu/DataSets under the name titanic3.
8-1
CHAPTER 8. LOGISTIC MODEL CASE STUDY: SURVIVAL OF TITANIC PASSENGERS
8-2
8.1
Descriptive Statistics
require ( rms )
getHdata ( titanic3 )
#
List
of
names
of
# get dataset from

v a r i a b l e s to a n a l y z e
web
site
v c ( pclass , survived , age , sex , sibsp , parch )

t3 titanic3 [ , v ]
units ( t3 $ age ) years
latex ( describe ( t3 ) , file = )
t3
1309 Observations
6 Variables
pclass
n
1309
missing
0
unique
3
1st (323, 25%), 2nd (277, 21%), 3rd (709, 54%)
survived : Survived
n
1309
missing
0
unique
2
Info
0.71
Sum
500
Mean
0.382
unique
98
Info
1
Mean
29.88
.05
5
age : Age [years]

n
1046
missing
263
.10
14
.25
21
.50
28
.75
39
.90
50
.95
57
lowest : 0.1667 0.3333 0.4167 0.6667 0.7500

highest: 70.5000 71.0000 74.0000 76.0000 80.0000
sex
n
1309
missing
0
unique
2
female (466, 36%), male (843, 64%)
sibsp : Number of Siblings/Spouses Aboard

n
1309
missing
0
unique
7
Info
0.67
Mean
0.4989
0
1 2 3 4 5 8
Frequency 891 319 42 20 22 6 9
%
68 24 3 2 2 0 1
parch : Number of Parents/Children Aboard

n
1309
missing
0
unique
8
Info
0.55
Mean
0.385
0
1
2 3 4 5 6 9
Frequency 1002 170 113 8 6 6 2 2
%
77 13
9 1 0 0 0 0
dd datadist ( t3 )
#
describe
distributions
of
variables
to
rms
s summary ( survived age + sex + pclass +
cut2 ( sibsp ,0:3) + cut2 ( parch ,0:3) , data = t3 )
plot (s , main = , subtitles = FALSE )
# F i g u r e 8.1
Show 4-way relationships after collapsing levels. Suppress estimates based on < 25 passengers.
8-3
Age [years]
[ 0.167,22.0)
[22.000,28.5)
[28.500,40.0)
[40.000,80.0]
Missing
290
246
265
245
263
sex
female
male
pclass
1st
2nd
3rd
0
1
2
[3,8]
891
319
42
57
Number of Parents/Children Aboard
0
1
2
[3,9]
0.2
0.3
0.4
1309
0.5
0.6
Survived
Figure 8.1: Univariable summaries of Titanic survival
tn transform ( t3 ,
agec = ifelse ( age < 21 , child , adult ) ,
sibsp = ifelse ( sibsp == 0 , no sib / sp , sib / sp ) ,
parch = ifelse ( parch == 0 , no par / child , par / child ) )
g function ( y ) if ( length ( y ) < 25) NA else mean ( y )
s with ( tn , summarize ( survived ,
llist ( agec , sex , pclass , sibsp , parch ) , g ) )
llist ,
Figure
summarize
in
Hmisc
1002
170
113
24
Overall
#
#
323
277
709
Number of Siblings/Spouses Aboard
466
843
package
8.2 :
ggplot ( subset (s , agec ! = NA ) ,
aes ( x = survived , y = pclass , shape = sex ) ) +
geom_point () + facet_grid ( agec sibsp * parch ) +
xlab ( Proportion Surviving ) + ylab ( Passenger Class ) +
scale_x_continuous ( breaks = c (0 , .5 , 1) )
0.7
no sib/sp
sib/sp
sib/sp
no par/child
par/child
no par/child
par/child
2nd
adult
1st
sex
3rd
2nd
1st
0.0
0.5
1.0
0.0
0.5
1.0
0.0
0.5
1.0
0.0
0.5
Proportion Surviving
Figure 8.2: Multi-way summary of Titanic survival
female
male
child
Passenger Class
3rd
no sib/sp
8-4
1.0
8-5
8.2
Exploring Trends with Nonparametric Regression

# F i g u r e 8.3
b scal e_size_d iscrete ( range = c ( .1 , .85 ) )
yl ylab ( NULL )
p1 ggplot ( t3 , aes ( x = age , y = survived ) ) +
histSpikeg ( survived age , lowess = TRUE , data = t3 ) +
ylim (0 ,1) + yl
p2 ggplot ( t3 , aes ( x = age , y = survived , color = sex ) ) +
histSpikeg ( survived age + sex , lowess = TRUE ,
data = t3 ) + ylim (0 ,1) + yl
p3 ggplot ( t3 , aes ( x = age , y = survived , size = pclass ) ) +
histSpikeg ( survived age + pclass , lowess = TRUE ,
data = t3 ) + b + ylim (0 ,1) + yl
p4 ggplot ( t3 , aes ( x = age , y = survived , color = sex ,
size = pclass ) ) +
histSpikeg ( survived age + sex + pclass ,
lowess = TRUE , data = t3 ) +
b + ylim (0 ,1) + yl
gridExtra :: grid.arrange ( p1 , p2 , p3 , p4 , ncol =2)
# combine 4
# F i g u r e 8.4
top theme ( legend.position = top )
p1 ggplot ( t3 , aes ( x = age , y = survived , color = cut2 ( sibsp ,
0:2) ) ) + stat_plsmo () + b + ylim (0 ,1) + yl + top +
sc al e_c ol or_ dis cr ete ( name = siblings / spouses )
p2 ggplot ( t3 , aes ( x = age , y = survived , color = cut2 ( parch ,
0:2) ) ) + stat_plsmo () + b + ylim (0 ,1) + yl + top +
sc al e_c ol or_ dis cr ete ( name = parents / children )
gridExtra :: grid.arrange ( p1 , p2 , ncol =2)
1.00
1.00
0.75
0.75
8-6
sex
0.50
female
0.50
male
0.25
0.25
0.00
0.00
0
20
40
60
80
20
age
40
60
80
age
1.00
1.00
0.75
0.75
sex
female
pclass
male
1st
0.50
2nd
0.50
pclass
1st
3rd
0.25
2nd
0.25
3rd
0.00
0.00
0
20
40
age
60
80
20
40
60
80
age
Figure 8.3: Nonparametric regression (loess) estimates of the relationship between age and the probability of surviving the Titanic,
with tick marks depicting the age distribution. The top left panel shows unstratified estimates of the probability of survival. Other
panels show nonparametric estimates by various stratifications.
siblings/spouses
[2,8]
parents/children
1.00
1.00
0.75
0.75
0.50
0.50
0.25
0.25
0.00
8-7
[2,9]
0.00
0
20
40
age
60
80
20
40
60
80
age
Figure 8.4: Relationship between age and survival stratified by the number of siblings or spouses on board (left panel) or by the
number of parents or children of the passenger on board (right panel).
8-8
Table 8.1: Wald Statistics for survived

All Interactions
pclass (Factor+Higher Order Factors)
All Interactions
All Interactions
sibsp (Factor+Higher Order Factors)
All Interactions
parch (Factor+Higher Order Factors)
All Interactions
sex pclass (Factor+Higher Order Factors)
sex age (Factor+Higher Order Factors)
pclass age (Factor+Higher Order Factors)
age sibsp (Factor+Higher Order Factors)
Nonlinear
age parch (Factor+Higher Order Factors)
Nonlinear
sex pclass age (Factor+Higher Order Factors)
Nonlinear
TOTAL NONLINEAR
TOTAL INTERACTION
TOTAL
2
187.15
59.74
100.10
46.51
56.20
34.57
28.66
19.67
12.13
3.51
3.51
42.43
15.89
14.47
4.17
13.47
12.92
6.88
12.13
1.76
1.76
3.51
1.80
1.80
8.34
7.74
28.66
75.61
79.49
241.93
d.f.
15
14
20
18
32
28
24
5
4
5
4
10
12
9
3
16
12
6
4
3
3
4
3
3
8
6
24
30
33
39
P
< 0.0001
< 0.0001
< 0.0001
0.0003
0.0052
0.1826
0.2331
0.0014
0.0164
0.6217
0.4761
< 0.0001
0.1962
0.1066
0.2441
0.6385
0.3749
0.3324
0.0164
0.6235
0.6235
0.4761
0.6147
0.6147
0.4006
0.2581
0.2331
< 0.0001
< 0.0001
< 0.0001
8.3
Binary Logistic Model with Casewise Deletion of Missing Values

First fit a model that is saturated with respect to age, sex,
pclass. Insufficient variation in sibsp, parch to fit complex
interactions or nonlinearities.
f1 lrm ( survived sex * pclass * rcs ( age ,5) +
rcs ( age ,5) * ( sibsp + parch ) , data = t3 )
# T a b l e 8.1
latex ( anova ( f1 ) , file = , label = titanic-anova3 , size = small )
3-way interactions, parch clearly insignificant, so drop

f lrm ( survived ( sex + pclass + rcs ( age ,5) ) 2 +
rcs ( age ,5) * sibsp , data = t3 )
print (f , latex = TRUE )
Logistic Regression Model

lrm(formula = survived ~ (sex + pclass + rcs(age, 5))^2 + rcs(age,
5) * sibsp, data = t3)
Frequencies of Missing Values Due to Each Variable
survived
0
sex
0
pclass
0
Obs
1046
0
619
1
427
L
6
max | log
|
610
age
263
sibsp
0
Model Likelihood
Ratio Test
LR 2
553.87
d.f.
26
2
Pr(> ) < 0.0001
Intercept
sex=male
pclass=2nd
pclass=3rd
age
age
age
age
sibsp
sex=male * pclass=2nd
sex=male * pclass=3rd
sex=male * age
sex=male * age
sex=male * age
sex=male * age
pclass=2nd * age
pclass=3rd * age
pclass=2nd * age
pclass=3rd * age
pclass=2nd * age
pclass=3rd * age
pclass=2nd * age
pclass=3rd * age
age * sibsp
age * sibsp
age * sibsp
age * sibsp
Coef
3.3075
-1.1478
6.7309
-1.6437
0.0886
-0.7410
4.9264
-6.6129
-1.0446
-0.7682
2.1520
-0.2191
1.0842
-6.5578
8.3716
-0.5446
-0.1634
1.9156
0.8205
-8.9545
-5.4276
9.3926
7.5403
0.0357
-0.0467
0.5574
-1.1937
Discrimination
Indexes
2
R
0.555
g
2.427
gr
11.325
gp
0.365
Brier
0.130
Rank Discrim.
Indexes
C
0.878
Dxy
0.756
0.758
a
0.366
S.E.
Wald Z Pr(> |Z|)
1.8427
1.79
0.0727
1.0878
-1.06
0.2914
3.9617
1.70
0.0893
1.8299
-0.90
0.3691
0.1346
0.66
0.5102
0.6513
-1.14
0.2552
4.0047
1.23
0.2186
5.4100
-1.22
0.2216
0.3441
-3.04
0.0024
0.7083
-1.08
0.2781
0.6214
3.46
0.0005
0.0722
-3.04
0.0024
0.3886
2.79
0.0053
2.6511
-2.47
0.0134
3.8532
2.17
0.0298
0.2653
-2.05
0.0401
0.1308
-1.25
0.2118
1.0189
1.88
0.0601
0.6091
1.35
0.1780
5.5027
-1.63
0.1037
3.6475
-1.49
0.1367
6.9559
1.35
0.1769
4.8519
1.55
0.1202
0.0340
1.05
0.2933
0.2213
-0.21
0.8330
1.6680
0.33
0.7382
2.5711
-0.46
0.6425
8-9
8-10
2
199.42
56.14
108.73
42.83
47.04
24.51
22.72
19.95
10.99
35.40
10.08
8.17
8.17
6.86
6.11
6.11
10.99
1.81
1.81
22.72
67.58
70.68
253.18

All Interactions
All Interactions
All Interactions
All Interactions
Nonlinear
Nonlinear
Nonlinear
TOTAL NONLINEAR
TOTAL INTERACTION
TOTAL
d.f.
7
6
12
10
20
16
15
5
4
2
4
3
3
8
6
6
4
3
3
15
18
21
26
P
< 0.0001
< 0.0001
< 0.0001
< 0.0001
0.0006
0.0789
0.0902
0.0013
0.0267
< 0.0001
0.0391
0.0426
0.0426
0.5516
0.4113
0.4113
0.0267
0.6134
0.6134
0.0902
< 0.0001
< 0.0001
< 0.0001
latex ( anova ( f ) , file = , label = titanic-anova2 , size = small ) # 8.2
Show the many effects of predictors.
p Predict (f , age , sex , pclass , sibsp =0 , fun = plogis )

ggplot ( p )
# F i g . 8.5
ggplot ( Predict (f , sibsp , age = c (10 ,15 ,20 ,50) , conf.int = FALSE ) )
# # F i g u r e 8.6
Note that children having many siblings apparently had lower

survival. Married adults had slightly higher survival than unmarried ones.
Validate the model using the bootstrap to check overfitting.
Ignoring two very insignificant pooled tests.
f update (f , x = TRUE , y = TRUE )
#
x = TRUE ,
y= TRUE
adds
raw
data
to
fit
object
so
can
bootstrap
set.seed (131)
# so can r e p l i c a t e r e - s a m p l e s
latex ( validate (f , B =200) , digits =2 , size = Ssize )
1st
2nd
8-11
3rd
1.00
0.75
sex
female
0.50
male
0.25
0.00
0
20
40
60
20
40
60
20
40
60
Age, years
Figure 8.5: Effects of predictors on probability of survival of Titanic passengers, estimated for zero siblings or spouses
log odds
Age, years
10
15
4
20
50
6
0
Adjusted to:sex=male pclass=3rd
Figure 8.6: Effect of number of siblings and spouses on the log odds of surviving, for third class males
Index
Original Training
Test
Optimism
Sample Sample Sample
Dxy
0.76
0.77
0.74
0.03
2
R
0.55
0.58
0.53
0.05
Intercept
0.00
0.00 0.08
0.08
Slope
1.00
1.00
0.87
0.13
Emax
0.00
0.00
0.05
0.05
D
0.53
0.56
0.50
0.06
U
0.00
0.00
0.01
0.01
Q
0.53
0.56
0.49
0.07
B
0.13
0.13
0.13
0.01
g
2.43
2.75
2.37
0.37
gp
0.37
0.37
0.35
0.02
cal calibrate (f , B =200)
plot ( cal , subtitles = FALSE )
Figure
8-12
Corrected
Index
0.72
0.50
0.08
0.87
0.05
0.46
0.01
0.46
0.14
2.05
0.35
n
200
200
200
200
200
200
200
200
200
200
200
8.7
n =1046
Mean absolute error =0.009
Mean squared error =0.00012
0.9 Quantile of absolute error =0.017
Actual Probability
1.0
0.8
0.6
0.4
Apparent
Biascorrected
Ideal
0.2
0.0
0.0
0.2
0.4
0.6
0.8
1.0
Predicted Pr{survived=1}
Figure 8.7: Bootstrap overfitting-corrected loess nonparametric calibration curve for casewise deletion model
But moderate problem with missing data
8-13
8.4
Examining Missing Data Patterns

na.patterns naclus ( titanic3 )
require ( rpart )
# Recursive partitioning
package
who.na rpart ( is.na ( age ) sex + pclass + survived +

sibsp + parch , data = titanic3 , minbucket =15)
naplot ( na.patterns , na per var )
plot ( who.na , margin = .1 ) ; text ( who.na ) # F i g u r e 8.8
plot ( na.patterns )
Fraction of NAs in each Variable

pclass
survived
name
sex
sibsp
parch
ticket
cabin
boat
fare
embarked
age
home.dest
body
pclass=ab
|
parch>=0.5
0.09167
0.0
0.2
0.4
0.6
Fraction
of NAs
0.1806
0.3249
0.8
survived
name
pclass
sex
parch
fare
ticket
sibsp
age
0.3
cabin
0.2
boat
home.dest
0.4
body
Fraction Missing
0.1
embarked
0.0
Figure 8.8: Patterns of missing data. Upper left panel shows the fraction of observations missing on each predictor. Lower panel
depicts a hierarchical cluster analysis of missingness combinations. The similarity measure shown on the Y -axis is the fraction of
observations for which both variables are missing. Right panel shows the result of recursive partitioning for predicting is.na(age).
The rpart function found only strong patterns according to passenger class.
plot ( summary ( is.na ( age ) sex + pclass + survived +

sibsp + parch , data = t3 ) ) # F i g u r e 8.9
m lrm ( is.na ( age ) sex * pclass + survived + sibsp + parch ,
data = t3 )
print (m , latex = TRUE , needspace = 3 .5in )

mean
sex
female
male
pclass
1st
2nd
3rd
N
466
843
323
277
709
Survived
809
500
3
4
5
8
891
319
42
20
22
6
9
No
Yes
Number of Parents/Children Aboard
0
1
2
3
4
5
6
9
Overall
0.0
0.2
1002
170
113
8
6
6
2
2
1309
0.4
0.6
0.8
1.0
is.na(age)
Figure 8.9: Univariable descriptions of proportion of passengers with missing age

lrm(formula = is.na(age) ~ sex * pclass + survived + sibsp +
parch, data = t3)
Model Likelihood Discrimination Rank Discrim.
Ratio Test
Indexes
Indexes
Obs
1309 LR 2
114.99 R2
0.133 C
0.703
FALSE
1046 d.f.
8 g
1.015 Dxy
0.406
TRUE
263 Pr(> 2 ) < 0.0001 gr
2.759
0.452
log L
6
max | | 510
gp
0.126 a
0.131
Brier
0.148
8-14
8-15
Table 8.3: Wald Statistics for is.na(age)

All Interactions
All Interactions
survived
sibsp
parch
TOTAL
Intercept
sex=male
pclass=2nd
pclass=3rd
survived
sibsp
parch
sex=male * pclass=2nd
sex=male * pclass=3rd
Coef
-2.2030
0.6440
-1.0079
1.6124
-0.1806
0.0435
-0.3526
0.1347
-0.8563
2
5.61
5.58
68.43
5.58
0.98
0.35
7.92
5.58
82.90
d.f.
3
2
4
2
1
1
1
2
8
P
0.1324
0.0614
< 0.0001
0.0614
0.3232
0.5548
0.0049
0.0614
< 0.0001
S.E.
Wald Z Pr(> |Z|)
0.3641
-6.05 < 0.0001
0.3953
1.63
0.1033
0.6658
-1.51
0.1300
0.3596
4.48 < 0.0001
0.1828
-0.99
0.3232
0.0737
0.59
0.5548
0.1253
-2.81
0.0049
0.7545
0.18
0.8583
0.4214
-2.03
0.0422
latex ( anova ( m ) , file = , label = titanic-anova.na ) # T a b l e 8.3
pclass
and parch are the important predictors of missing age.
8-16
8.5
Single Conditional Mean Imputation

D
First try: conditional mean imputation

Default spline transformation for age caused distribution of imputed values to be much different from non-imputed ones; constrain to linear
xtrans transcan ( I ( age ) + sex + pclass + sibsp + parch ,
imputed = TRUE , pl = FALSE , pr = FALSE , data = t3 )
summary ( xtrans )
transcan ( x = I ( age ) + sex + pclass + sibsp + parch , imputed = TRUE ,
pr = FALSE , pl = FALSE , data = t3 )
Iterations : 5
R2 achieved in predicting each variable :
age
0.264
sex pclass
0.076 0.242
sibsp
0.249
parch
0.291
sibsp
0.245
parch
0.288
Adjusted R2 :
age
0.260
sex pclass
0.073 0.239
Coefficients of canonical variates for predicting each ( row ) variable

age
sex
pclass sibsp parch
age
0.92 6.05 -2.02 -2.65
sex
0.03
-0.56 -0.01 -0.75
pclass 0.08 -0.26
0.03 0.28
sibsp -0.02 0.00 0.03
0.86
parch -0.03 -0.30 0.23
0.75
Summary of imputed values
age
n missing
263
0
.75
.90
28.10
42.77
unique
24
.95
42.77
Info
0.91
Mean
28.53
.05
17.34
lowest : 9.829 11.757 13.224 15.152 17.283

highest : 33.246 34.738 38.638 40.840 42.768
Starting estimates for imputed values :
.10
21.77
.25
26.17
.50
28.10
age
28
#
#
sex pclass
2
3
Look at mean i m p u t e d
a g e . i is age , f i l l e d
sibsp
0
8-17
parch
0
v a l u e s by sex , p c l a s s and
in with c o n d i t i o n a l mean
observed means
estimates
age.i with ( t3 , impute ( xtrans , age , data = t3 ) )

i is.imputed ( age.i )
with ( t3 , tapply ( age.i [ i ] , list ( sex [ i ] , pclass [ i ]) , mean ) )
1 st
2 nd
3 rd
female 39.08396 31.31831 23.10548
male
42.76765 33.24650 26.87451
with ( t3 , tapply ( age , list ( sex , pclass ) , mean , na.rm = TRUE ) )
1 st
2 nd
3 rd
female 37.03759 27.49919 22.18531
male
41.02925 30.81540 25.96227
dd
datadist ( dd , age.i )
f.si lrm ( survived ( sex + pclass + rcs ( age.i ,5) ) 2 +
rcs ( age.i ,5) * sibsp , data = t3 )
print ( f.si , coefs = FALSE , latex = TRUE )

lrm(formula = survived ~ (sex + pclass + rcs(age.i, 5))^2 +
5) * sibsp, data = t3)
Model Likelihood Discrimination
Ratio Test
Indexes
Obs
1309 LR 2
640.85 R2
0.526
0
809 d.f.
26 g
2.223
1
500 Pr(> 2 ) < 0.0001 gr
9.233
log L
4
max | | 410
gp
0.346
Brier
0.133
rcs(age.i,
Rank Discrim.
Indexes
C
0.861
Dxy
0.723
0.728
a
0.341
p1 Predict (f ,
age ,
pclass , sex , sibsp =0 , fun = plogis )
p2 Predict ( f.si , age.i , pclass , sex , sibsp =0 , fun = plogis )
p rbind ( Casewise Deletion = p1 , Single Imputation = p2 ,
rename = c ( age.i = age ) )
# creates .set. variable
ggplot (p , groups = sex , ylab = Probability of Surviving )
# F i g u r e 8.10
latex ( anova ( f.si ) , file = , label = titanic-anova.si )
Table
8.4
Casewise Deletion
8-18
Single Imputation
1.00
0.75
1st
0.50
0.25
1.00
0.75
sex
2nd
Probability of Surviving
0.00
0.50
female
male
0.25
0.00
1.00
0.75
3rd
0.50
0.25
0.00
0
20
40
60
20
40
60
Age, years
Figure 8.10: Predicted probability of survival for males from fit using casewise deletion (bottom) and single conditional mean
imputation (top). sibsp is set to zero for these predicted values.

All Interactions
All Interactions
age.i (Factor+Higher Order Factors)
All Interactions
All Interactions
sex age.i (Factor+Higher Order Factors)
Nonlinear
pclass age.i (Factor+Higher Order Factors)
Nonlinear
age.i sibsp (Factor+Higher Order Factors)
Nonlinear
TOTAL NONLINEAR
TOTAL INTERACTION
TOTAL
2
245.39
52.85
112.07
36.79
49.32
25.62
19.71
22.02
12.28
30.29
8.91
5.62
5.62
6.05
5.44
5.44
12.28
2.05
2.05
19.71
67.00
69.53
305.74
d.f.
7
6
12
10
20
16
15
5
4
2
4
3
3
8
6
6
4
3
3
15
18
21
26
P
< 0.0001
< 0.0001
< 0.0001
0.0001
0.0003
0.0595
0.1835
0.0005
0.0154
< 0.0001
0.0633
0.1319
0.1319
0.6421
0.4888
0.4888
0.0154
0.5614
0.5614
0.1835
< 0.0001
< 0.0001
< 0.0001
8-19
8-20
8.6
Multiple Imputation
The following uses aregImpute with predictive mean matching.
By default, aregImpute does not transform age when it is being predicted from the other variables. Four knots are used to
transform age when used to impute other variables (not needed
here as no other missings were present). Since the fraction of
263
= 0.2 we use 20 imputaobservations with missing age is 1309
tions.
set.seed (17)
# so can r e p r o d u c e random a s p e c t s
mi aregImpute ( age + sex + pclass +
sibsp + parch + survived ,
data = t3 , n.impute =20 , nk =4 , pr = FALSE )
mi
Multiple Imputation using Bootstrap and PMM

aregImpute ( formula = age + sex + pclass + sibsp + parch + survived ,
data = t3 , n . impute = 20 , nk = 4 , pr = FALSE )
n : 1309
p: 6
Number of NAs :
age
sex
263
0
age
sex
pclass
sibsp
parch
survived
Imputations : 20
pclass
0
sibsp
0
nk : 4
parch survived
0
0
type d . f .
s
1
c
1
c
2
s
2
s
2
l
1
Transformation of Target Variables Forced to be Linear

R - squares for Predicting Non - Missing Values for Each Variable
Using Last Imputations of Predictors
age
0.295
#
Print
the
first
10
imputations
for
the
first
10
passengers
having
missing
8-21
age
mi $ imputed $ age [1:10 , 1:10]
16
38
41
47
60
70
71
75
81
107
[ ,1] [ ,2] [ ,3] [ ,4] [ ,5] [ ,6] [ ,7] [ ,8] [ ,9] [ ,10]
40
49
24
29 60.0
58
64
36
50
61
33
45
40
49 80.0
2
38
38
36
53
29
24
19
31 40.0
60
64
42
30
65
40
42
29
48 36.0
46
64
30
38
42
52
40
22
31 38.0
22
19
24
40
33
16
14
23
23 18.0
24
19
27
59
23
30
62
57
30 42.0
31
64
40
40
63
43
23
36
61 45.5
58
64
27
24
50
44
57
47
31 45.0
30
64
62
39
67
52
18
24
62 32.5
38
64
47
19
23
Show the distribution of imputed (black) and actual ages (gray).

E
plot ( mi )
Ecdf ( t3 $ age , add = TRUE , col = gray , lwd =2 ,
subtitles = FALSE ) # F i g . 8.11
Proportion <= x
1.0
0.8
0.6
0.4
0.2
0.0
0
20
40
60
80
Imputed age
Figure 8.11: Distributions of imputed and actual ages for the Titanic dataset. Imputed values are in black and actual ages in gray.
Fit logistic models for 20 completed datasets and print the ratio
of imputation-corrected variances to average ordinary variances
f.mi fit.mult.impute (
survived ( sex + pclass + rcs ( age ,5) ) 2 +
rcs ( age ,5) * sibsp ,
lrm , mi , data = t3 , pr = FALSE )
latex ( anova ( f.mi ) , file = , label = titanic-anova.mi ,
size = small )
# T a b l e 8.5
The Wald 2 for age is reduced by accounting for imputation
8-22

All Interactions
All Interactions
All Interactions
All Interactions
Nonlinear
Nonlinear
Nonlinear
TOTAL NONLINEAR
TOTAL INTERACTION
TOTAL
2
240.42
54.56
114.21
36.43
50.37
25.88
24.21
24.22
12.86
30.99
11.38
8.15
8.15
5.30
4.63
4.63
12.86
1.84
1.84
24.21
67.12
70.99
298.78
d.f.
7
6
12
10
20
16
15
5
4
2
4
3
3
8
6
6
4
3
3
15
18
21
26
P
< 0.0001
< 0.0001
< 0.0001
0.0001
0.0002
0.0557
0.0616
0.0002
0.0120
< 0.0001
0.0226
0.0430
0.0430
0.7246
0.5918
0.5918
0.0120
0.6058
0.6058
0.0616
< 0.0001
< 0.0001
< 0.0001
but is increased by using patterns of association with survival

status to impute missing age.
Show estimated effects of age by classes.
p1 Predict ( f.si , age.i , pclass , sex , sibsp =0 , fun = plogis )
p2 Predict ( f.mi , age ,
pclass , sex , sibsp =0 , fun = plogis )
p rbind ( Single Imputation = p1 , Multiple Imputation = p2 ,
rename = c ( age.i = age ) )
ggplot (p , groups = sex , ylab = Probability of Surviving )
# F i g u r e 8.12
Single Imputation
8-23
Multiple Imputation
1.00
0.75
1st
0.50
0.25
1.00
0.75
sex
2nd
Probability of Surviving
0.00
0.50
female
male
0.25
0.00
1.00
0.75
3rd
0.50
0.25
0.00
0
20
40
60
20
40
60
Age, years
Figure 8.12: Predicted probability of survival for males from fit using single conditional mean imputation again (top) and multiple
random draw imputation (bottom). Both sets of predictions are for sibsp=0.
8-24
8.7
Summarizing the Fitted Model

Show odds ratios for changes in predictor values
#
Get
predicted
values
for
certain
types
of
passengers
s summary ( f.mi , age = c (1 ,30) , sibsp =0:1)

#
override
default
ranges
for
plot (s , log = TRUE , main = )
variables
# Figure
8.13
0.10
0.50
2.00
5.00
age 30:1
sibsp 1:0
sex female:male
pclass 1st:3rd
pclass 2nd:3rd
Adjusted to:sex=male pclass=3rd age=28 sibsp=0

Figure 8.13: Odds ratios for some predictor settings
phat predict ( f.mi ,

combos
expand.grid ( age = c (2 ,21 ,50) , sex = levels ( t3 $ sex ) ,
pclass = levels ( t3 $ pclass ) ,
sibsp =0) , type = fitted )
#
#
Can
also
use
P r e d i c t ( f.mi , age = c (2 ,21 ,50) , sex ,

s i b s p =0 , fun = p l o g i s )$ y h a t

data.frame ( combos , phat )
1
2
3
4
5
6
7
8
9
10
11
12
age
2
21
50
2
21
50
2
21
50
2
21
50
sex pclass sibsp phat

female
1 st
0 0.97
female
1 st
0 0.98
female
1 st
0 0.97
male
1 st
0 0.88
male
1 st
0 0.48
male
1 st
0 0.27
female
2 nd
0 1.00
female
2 nd
0 0.90
female
2 nd
0 0.82
male
2 nd
0 1.00
male
2 nd
0 0.08
male
2 nd
0 0.04
pclass ,
13
14
15
16
17
18
2 female
21 female
50 female
2
male
21
male
50
male
3 rd
3 rd
3 rd
3 rd
3 rd
3 rd
0
0
0
0
0
0
8-25
0.85
0.57
0.37
0.91
0.13
0.06
We can also get predicted values by creating an S function that

will evaluate the model on demand.
pred.logit Function ( f.mi )
#
#
Note : if
normally
don t d e f i n e s i b s p to p r e d . l o g i t , d e f a u l t s to
just type the f u n c t i o n name to see its body
latex ( pred.logit , file = , type = Sinput , size = small ,

width.cutoff =49)
p r e d . l o g i t f u n c t i o n ( s e x = male , p c l a s s = 3 r d , age = 2 8 ,
s i b s p = 0)
{
3 . 2 4 2 7 6 7 1 0 . 9 5 4 3 1 8 0 9 ( s e x == male ) + 5 . 4 0 8 6 5 0 5
( p c l a s s == 2 nd ) 1 . 3 3 7 8 6 2 3 ( p c l a s s ==
3 r d ) + 0 . 0 9 1 1 6 2 6 4 9 age 0 . 0 0 0 3 1 2 0 4 3 2 7
pmax ( age 6 , 0 ) 3 + 0 . 0 0 2 1 7 5 0 4 1 3 pmax ( age
2 1 , 0 ) 3 0 . 0 0 2 7 6 2 7 0 3 2 pmax ( age 2 7 , 0 ) 3 +
0 . 0 0 0 9 8 0 5 1 3 7 pmax ( age 3 6 , 0 ) 3 8 .0808484e05
pmax ( age 55 . 8 , 0 ) 3 1 . 1 5 6 7 9 7 6 s i b s p +
( s e x == male ) ( 0.46061284 ( p c l a s s ==
2 nd ) + 2 . 0 4 0 6 5 2 3 ( p c l a s s == 3 r d ) ) +
( s e x == male ) ( 0.22469066 age + 0 . 0 0 0 4 3 7 0 8 2 9 6
pmax ( age 6 , 0 ) 3 0 . 0 0 2 6 5 0 5 1 3 6 pmax ( age
2 1 , 0 ) 3 + 0 . 0 0 3 1 2 0 1 4 0 4 pmax ( age 2 7 ,
0 ) 3 0 . 0 0 0 9 7 9 2 3 7 4 9 pmax ( age 3 6 ,
0 ) 3 + 7 .2527708e05 pmax ( age 55 . 8 ,
0 ) 3 ) + ( p c l a s s == 2 nd ) ( 0.46144083
age + 0 . 0 0 0 7 0 1 9 4 8 4 9 pmax ( age 6 , 0 ) 3
0 . 0 0 3 4 7 2 6 6 6 2 pmax ( age 2 1 , 0 ) 3 + 0 . 0 0 3 5 2 5 5 3 8 7
pmax ( age 2 7 , 0 ) 3 0 . 0 0 0 7 9 0 0 8 9 1 pmax ( age
3 6 , 0 ) 3 + 3 .5268151e05 pmax ( age 55 . 8 ,
0 ) 3 ) + ( p c l a s s == 3 r d ) ( 0.17513289
age + 0 . 0 0 0 3 5 2 8 3 3 5 8 pmax ( age 6 , 0 ) 3
0 . 0 0 2 3 0 4 9 3 7 2 pmax ( age 2 1 , 0 ) 3 + 0 . 0 0 2 8 9 7 8 9 6 2
pmax ( age 2 7 , 0 ) 3 0 . 0 0 1 0 5 1 4 5 pmax ( age
3 6 , 0 ) 3 + 0 . 0 0 0 1 0 5 6 5 7 3 5 pmax ( age 55 . 8 ,
0 ) 3 ) + s i b s p ( 0 . 0 4 0 8 3 0 7 7 3 age 1 .5627772e05
pmax ( age 6 , 0 ) 3 + 0 . 0 0 0 1 2 7 9 0 2 5 6 pmax ( age
2 1 , 0 ) 3 0 . 0 0 0 2 5 0 3 9 3 8 5 pmax ( age 2 7 ,
0 ) 3 + 0 . 0 0 0 1 7 8 7 1 7 0 1 pmax ( age 3 6 , 0 ) 3
4 .0597949e05 pmax ( age 55 . 8 , 0 ) 3 )
}
#
Run
the
newly
created
function
plogis ( pred.logit ( age = c (2 ,21 ,50) , sex = male , pclass = 3 rd ) )

[1] 0.914817 0.132640 0.056248
8-26
A nomogram could be used to obtain predicted values manually,

but this is not feasible when so many interaction terms are
present.
Package
Hmisc
Hmisc
rms
rparta
a Written
R Software Used
Functions
Purpose
Miscellaneous functions summary,plsmo,naclus,llist,latex
Imputation
Modeling
Model presentation
Model validation
Recursive partitioning
by Atkinson & Therneau
summarize,Dotplot,describe
transcan,impute,fit.mult.impute,aregImpute
datadist,lrm,rcs
plot,summary,nomogram,Function
validate,calibrate
rpart
Chapter 9
Regression Models for Continuous Y and

Case Study in Ordinal Regression
This chapter concerns univariate continuous Y . There are
many multivariable models for predicting such response variables.
linear models with assumed normal residuals, fitted with ordinary least squares
generalized linear models and other parametric models based
on special distributions such as the gamma
generalized additive models (GAMs)
generalization of GAMs to also nonparametrically transform
Y
quantile regression (see Section 9.3)
other robust regression models that, like quantile regres9-1
CHAPTER 9. REGRESSION MODELS FOR CONTINUOUS Y AND CASE STUDY IN ORDINAL REGRESSION
9-2
sion, use an objective different from minimizing the sum of

squared errors [164]
semiparametric models based on the ranks of Y , such as the
Cox proportional hazards model and the proportional odds
ordinal logistic model
cumulative probability models (often called cumulative link
models) which are semiparametric models from a wider class
of families than the logistic
Semiparametric models that treat Y as ordinal but not intervalscaled have many advantages including robustness and freedom
of distributional assumptions for Y conditional on any given set
of predictors.
Advantages are demonstrated in a case study of a cumulative
probability ordinal model. Some of the results are compared to
quantile regression and OLS. Many of the methods used in the
case study also apply to ordinary linear models.
9-3
9.1
Dataset and Descriptive Statistics

Diabetes Mellitus (DM) type II (adult onset diabetes) is
strongly associated with obesity
Primary laboratory test for diabetes: gylcosylated hemoglobin
(HbA1c), also called glycated hemoglobin, glycohemoglobin,
or
hemoglobin A1c.
HbA1c reflects average blood glucose for the preceding 60
to 90 days
HbA1c > 7.0 usually taken as a positive diagnosis of diabetes
Goal of analysis:
better understand effects of body size measurements on
risk of DM
enhance screening for DM
Best way to develop a model for DM screening is not to
fit binary logistic model with HbA1c > 7 as the response
variable
All cutpoints are arbitrary; no justification for any putative cut
9-4
HbA1c 2=6.9, 7.1=10

lower power, wider confidence
Larger standard errors of ,
bands
Better: predict continuous HbA1c using continuous response model, then convert to probability HbA1c exceeds
any cutoff or estimate 0.9 quantile of HbA1c
Data: U.S. National Health and Nutrition Examination Survey (NHANES) from National Center for Health Statistics/CDC: http://www.cdc.gov/nchs/nhanes.htm[26]
age 80 coded as 80 by CDC
Subset with age 21, neither diagnosed nor treated for DM
require ( rms )
getHdata ( nhgh )
w subset ( nhgh , age 21 & dx ==0 & tx ==0 , select = -c ( dx , tx ) )
latex ( describe ( w ) , file = )
w
4629 Observations
18 Variables
seqn : Respondent sequence number
n
4629
missing
0
unique
4629
Info
1
Mean
56902
.05
52136
.10
52633
.25
54284
.50
56930
.75
59495
.90
61079
.95
61641
lowest : 51624 51629 51630 51645 51647

highest: 62152 62153 62155 62157 62158
sex
n
4629
missing
0
unique
2
male (2259, 49%), female (2370, 51%)
age : Age [years]

n
4629
missing
0
unique
703
Info
1
Mean
48.57
lowest : 21.00 21.08 21.17 21.25 21.33

highest: 79.67 79.75 79.83 79.92 80.00
.05
23.33
.10
26.08
.25
33.92
.50
46.83
.75
61.83
.90
74.83
.95
80.00
re : Race/Ethnicity
n
4629
missing
0
unique
5
Mexican American (832, 18%), Other Hispanic (474, 10%)

Non-Hispanic White (2318, 50%), Non-Hispanic Black (756, 16%)
Other Race Including Multi-Racial (249, 5%)
income : Family Income

n
4389
missing
240
unique
14
[0,5000) (162, 4%), [5000,10000) (216, 5%), [10000,15000) (371, 8%)

[15000,20000) (300, 7%), [20000,25000) (374, 9%)
[25000,35000) (535, 12%), [35000,45000) (421, 10%)
[45000,55000) (346, 8%), [55000,65000) (257, 6%), [65000,75000) (188, 4%)
> 20000 (149, 3%), < 20000 (52, 1%), [75000,100000) (399, 9%)
>= 100000 (619, 14%)
wt : Weight [kg]
n
4629
missing
0
unique
890
Info
1
Mean
80.49
.05
52.44
.10
57.18
.25
66.10
.50
77.70
.75
91.40
.90
106.52
.95
118.00
.05
151.1
.10
154.4
.25
160.1
.50
167.2
.75
175.0
.90
181.0
.95
184.8
.05
20.02
.10
21.35
.25
24.12
.50
27.60
.75
31.88
.90
36.75
.95
40.68
lowest : 33.2 36.1 37.9 38.5 38.7

highest: 184.3 186.9 195.3 196.6 203.0
ht : Standing Height [cm]

n
4629
missing
0
unique
512
Info
1
Mean
167.5
lowest : 123.3 135.4 137.5 139.4 139.8

highest: 199.2 199.3 199.6 201.7 202.7
bmi : Body Mass Index [kg/m2 ]

n
4629
missing
0
unique
1994
Info
1
Mean
28.59
lowest : 13.18 14.59 15.02 15.40 15.49

highest: 61.20 62.81 65.62 71.30 84.87
leg : Upper Leg Length [cm]

n
4474
missing
155
unique
216
Info
1
Mean
38.39
.05
32.0
.10
33.5
.25
36.0
.50
38.4
.75
41.0
.90
43.3
.95
44.6
.90
40.6
.95
41.7
.90
39.1
.95
41.4
lowest : 20.4 24.9 25.0 25.1 26.4, highest: 49.0 49.5 49.8 50.0 50.3
arml : Upper Arm Length [cm]

n
4502
missing
127
unique
156
Info
1
Mean
37.01
.05
32.6
.10
33.5
.25
35.0
.50
37.0
.75
39.0
lowest : 24.8 27.0 27.5 29.2 29.5, highest: 45.2 45.5 45.6 46.0 47.0
armc : Arm Circumference [cm]

n
4499
missing
130
unique
290
Info
1
Mean
32.87
.05
25.4
.10
26.9
.25
29.5
.50
32.5
.75
35.8
lowest : 17.9 19.0 19.3 19.5 19.9, highest: 54.2 54.9 55.3 56.0 61.0
waist : Waist Circumference [cm]

n
4465
missing
164
unique
716
Info
1
Mean
97.62
.05
74.8
.10
78.6
.25
86.9
.50
96.3
.75
107.0
.90
117.8
.95
125.0
lowest : 59.7 60.0 61.5 62.0 62.4

highest: 160.0 160.6 162.2 162.7 168.7
tri : Triceps Skinfold [mm]

n
4295
missing
334
lowest :
2.6
unique
342
3.1
3.2
Info
1
3.3
Mean
18.94
.05
7.2
.10
8.8
.25
12.0
.50
18.0
.75
25.2
.90
31.0
.95
33.8
3.4, highest: 39.6 39.8 40.0 40.2 40.6
sub : Subscapular Skinfold [mm]

n
3974
missing
655
lowest :
3.8
unique
329
4.2
4.6
Info
1
4.8
Mean
20.8
.05
8.60
.10
10.30
.25
14.40
.50
20.30
.75
26.58
4.9, highest: 40.0 40.1 40.2 40.3 40.4
gh : Glycohemoglobin [%]
n
4629
missing
0
lowest :
4.0
unique
63
4.1
4.2
Info
0.99
4.3
Mean
5.533
.05
4.8
.10
5.0
.25
5.2
.50
5.5
.75
5.8
4.4, highest: 11.9 12.0 12.1 12.3 14.5
.90
6.0
.95
6.3
.90
32.00
.95
35.00
9-5
albumin : Albumin [g/dL]
n
4576
missing
53
unique
26
Info
0.99
Mean
4.261
.05
3.7
.10
3.9
.25
4.1
.50
4.3
.75
4.5
.90
4.7
.95
4.8
.75
15
.90
19
.95
22
.75
0.99
.90
1.14
lowest : 2.6 2.7 3.0 3.1 3.2, highest: 4.9 5.0 5.1 5.2 5.3
bun : Blood urea nitrogen [mg/dL]

n
4576
lowest :
missing
53
1
unique
50
3
Info
0.99
Mean
13.03
.05
7
.10
8
.25
10
.50
12
5, highest: 49 53 55 56 63
SCr : Creatinine [mg/dL]

n
4576
missing
53
lowest :
highest:
0.34
5.98
unique
167
0.38
6.34
Info
1
Mean
0.8887
.05
0.58
.10
0.62
.25
0.72
.50
0.84
0.39 0.40 0.41

9.13 10.98 15.66
dd datadist ( w ) ; options ( datadist = dd )
.95
1.25
9-6
9-7
9.2
The Linear Model

The most popular multivariable model for analyzing a univariate
continuous Y is the the linear model
E(Y |X) = X,
where is estimated using ordinary least squares, that is, by
2.
solving for to minimize P(Yi X )
To compute P -values and confidence limits using parametric
methods (and for least squares estimates to coincide with
maximum likelihood estimates) we would have to assume
that Y |X is normal with mean X and constant variance
2 a
9.2.1
Checking Assumptions of OLS and Other Models

First see if gh would make a Gaussian residuals model fit
Use ordinary regression on 4 key variables to collapse into
one variable (predicted mean from OLS model)
Stratify predicted mean into 6 quantile groups
Apply the normal inverse ECDF of gh to these strata and
a The latter assumption may be dispensed with if we use a robust HuberWhite or bootstrap covariance matrix estimate. Normality may
sometimes be dispensed with by using bootstrap confidence intervals, but this would not fix inefficiency problems with OLS when residuals are
non-normal.
9-8
check for normality and constant 2

ECDF is for Prob[Y y|X] but for ordinal modeling we
want to state models in terms of Prob[Y y|X] so take 1
- ECDF before inverse transforming
f ols ( gh rcs ( age ,5) + sex + re + rcs ( bmi , 3) , data = w )
pgh fitted ( f )
p function ( fun , row , col ) {
f substitute ( fun ) ; g function ( F ) eval ( f )
z Ecdf ( gh , groups = cut2 ( pgh , g =6) ,
fun = function ( F ) g (1 - F ) ,
ylab = as.expression ( f ) , xlim = c (4 .5 , 7 .75 ) , data =w ,
label.curve = FALSE )
print (z , split = c ( col , row , 2 , 2) , more = row < 2 | col < 2)
}
p ( log ( F / (1 -F ) ) ,
1 , 1)
p ( qnorm ( F ) ,
1 , 2)
p ( -log ( -log ( F ) ) , 2 , 1)
p ( log ( -log (1 -F ) ) , 2 , 2)
#
#
#
Get
Use
get
slopes of pgh for some c u t o f f s of

glm c o m p l e m e n t a r y l o g - l o g link on
l o g - l o g link on Prob ( Y cutoff )
Y
Prob (Y
<
cutoff )
to
r NULL
for ( link in c ( logit , probit , cloglog ) )
for ( k in c (5 , 5 .5 , 6) ) {
co coef ( glm ( gh < k pgh , data =w , family = binomial ( link ) ) )
r rbind (r , data.frame ( link = link , cutoff =k ,
slope = round ( co [2] ,2) ) )
}
print (r , row.names = FALSE )
link cutoff slope
logit
5.0 -3.39
logit
5.5 -4.33
logit
6.0 -5.62
probit
5.0 -1.69
probit
5.5 -2.61
probit
6.0 -3.07
cloglog
5.0 -3.18
cloglog
5.5 -2.97
cloglog
6.0 -2.51
Upper right curves are not linear, implying that a normal

conditional distribution cannot work for ghb
b They
are not parallel either.
qnorm(F)
log(F (1 F))
2
0
2
5.0 5.5 6.0 6.5 7.0 7.5
5.0 5.5 6.0 6.5 7.0 7.5
Glycohemoglobin, %
Glycohemoglobin, %
log( log(1 F))
log( log(F))
2
6
4
2
0
0
2
4
6
2
5.0 5.5 6.0 6.5 7.0 7.5
5.0 5.5 6.0 6.5 7.0 7.5
Glycohemoglobin, %
Glycohemoglobin, %
Figure 9.1: Examination of normality and constant variance assumption, and assumptions for various ordinal models
9-9
CHAPTER 9. REGRESSION MODELS FOR CONTINUOUS Y AND CASE STUDY IN ORDINAL REGRESSION 9-10
There is non-parallelism for the logit model

Other graphs will be used to guide selection of an ordinal
model below
9.3
Quantile Regression
Ruled out OLS and semiparametric proportional odds model
Quantile regression [96, 95] is a different approach to modeling Y
No distributional assumptions other than continuity of Y
All the usual right hand side assumptions
When there is a single predictor that is categorical, quantile
regression coincides with ordinary sample quantiles stratified
by that predictor
Is transformation invariant - pre-transforming Y not important
Let (y) = y( [y < 0]). The th sample quantile is the
minimizer q of Pni1 (yi q). For a conditional th quantile of Y |X the corresponding quantile regression estimator
minimizes Pni=1 (Yi X).
Quantile regression is not as efficient at estimating quantiles as
is ordinary least squares at estimating the mean, if the latters
assumptions hold.
Koenkers quantreg package in R [97] implements quantile re-
gression, and the rms packages Rq function provides a front-end

that gives rise to various graphics and inference tools.
If we model the median gh as a function of covariates, only
the X structure need be correct. Other quantiles (e.g., 90th
percentile) can be directly modeled but standard errors will be
much larger as it is more difficult to precisely estimate outer
quantiles.
9.4
Ordinal Regression Models for Continuous Y

Advantages of semiparametric models (e.g., quantile regression and cumulative probability ordinal models
For ordinal cumulative probability models, there is no distributional assumption for Y given a setting of X
Assume only a connection between distributions of Y for
different X
Applying an increasing 11 transformation to Y results in
no change to regression coefficient estimatesc
Regression coefficient estimates are completely robust to extreme Y valuesd
Estimates of quantiles of Y are
exactly transformation-preserving, e.g., estimate of median
of log Y is exactly the log of the estimate of median Y
For a general continuous distribution function F (y), an ordinal regression model based on cumulative probabilities may
be stated as followse. Let the ordered unique values of Y
be denoted by y1, y2, . . . , yk and let the intercepts associated
c For symmetric distributions applying a decreasing transformation will negate the coefficients. For asymmetric distributions (e.g., Gumbel),
reversing the order of Y will do more than change signs.
d Only an estimate of mean Y from these s
is non-robust.
e It is more traditional to state the model in terms of Prob[Y y|X] but we use Prob[Y y|X] so that higher predicted values are
associated with higher Y .
with y1, . . . , yk be 1, 2, . . . , k , where 1 = because

Prob[Y y1] = 1. Let y = i, i : yi = y. Then
Prob[Y yi|X] = F (i + X) = F (yi + X)
For the OLS fully parametric case, the model may be restated
y X
Y X
]
Prob[Y y|X] = Prob[
y X
y X
= 1 (
) = (
+
)
so that to within an additive constantf y = y

(intercepts
are linear in y whereas they are arbitrarily descending in the
ordinal model), and is absorbed in to put the OLS model
into the new notation.
The general ordinal regression model assumes that for fixed
X1, X2,
F 1(Prob[Y y|X2]) F 1(Prob[Y y|X1])
= (X2 X1)
independent of the s (parallelism assumption). If F = [1 +
exp(y)]1, this is the proportional odds assumption.
Common choices of F , implemented in the rms orm function,
are shown in Table 9.1.
The Gumbel maximum value distribution is also called the extreme value type I distribution. This distribution (log log
f
y
are unchanged if a constant is added to all y.
Table 9.1: Distribution families used in ordinal cumulative probability models. denotes the Gaussian cumulative distribution
function. For the Connection column, P1 = Prob[Y y|X1 ], P2 = Prob[Y y|X2 ], = (X2 X1 ). The connection specifies
the only distributional assumption if the model is fitted semiparametrically, i.e, contains an intercept for every unique Y value less
one. For parametric models, P1 must be specified absolutely instead of just requiring a relationship between P1 and P2 . For example,
the traditional Gaussian parametric model specifies that Prob[Y y|X] = 1 ( yX
) = ( y+X
).
Distribution
Logistic
Gaussian
Gumbel maximum
value
Gumbel minimum
value
Cauchy
Link Name
[1 + exp(y)]1
(y)
exp( exp(y))
Inverse
(Link Function)
y
log( 1y
)
1 (y)
log( log(y))
logit
probit
log log
P1
= 1P
exp()
1
P2 = (1 (P1 ) + )
exp()
P2 = P1
1 exp( exp(y))
log( log(1 y))
complementary
log log
cauchit
1 P2 = (1 P1 )exp()
tan1 (y) +
1
2
tan[(y 21 )]
Connection
P2
1P2
link) also represents a continuous time proportional hazards

model. The hazard ratio when X changes from X1 to X2 is
exp((X2 X1)).
The mean of Y |X is easily estimated by computing
k
X
i=1
= yi|X]
yiProb[Y
and the q th quantile of Y |X is y such that

y .g
F 1(1 q) X =
The orm function in the rms package takes advantage of the
information matrix being of a sparse tri-band diagonal form
for the intercept parameters. This makes the computations
efficient even for hundreds of intercepts (i.e., unique values of
Y ). orm is made to handle continuous Y .
Ordinal regression has nice properties in addition to those listed
above, allowing for
g The intercepts have to be shifted to the left one position in solving this equation because the quantile is such that Prob[Y y] = q
whereas the model is stated in terms of Prob[Y y].
estimation of quantiles as efficiently as quantile regression

if the parallel slopes assumptions hold
efficient estimation of mean Y
direct estimation of Prob[Y y|X]
arbitrary clumping of values of Y , while still estimating
and mean Y efficientlyh
solutions for using ordinary Newton-Raphson or other popular optimization techniques
being based on a standard likelihood function, penalized estimation can be straightforward
Wald, score, and likelihood ratio 2 tests that are more
powerful than tests from quantile regression
To summarize how assumptions of parametric models compare
to assumptions of semiparametric models, consider the ordinary
linear model or its special case the equal variance two-sample
t-test, vs. the probit or logit (proportional odds) ordinal model
or their special cases the Van der Waerden (normal-scores) twosample test or the Wilcoxon test. All the assumptions of the
linear model other than independence of residuals are captured
h But
it is not sensible to estimate quantiles of Y when there are heavy ties in Y in the area containing the quantile.
in the following (written in traditional Y y form):

y X
)
F (y|X) = Prob[Y y|X] = (
y X
1
(F (y|X)) =
On the other hand, ordinal models assume the following:
1(F(y|X))
(F(y|X))
logit(F(y|X))
Figure 9.2: Assumptions of the linear model (left panel) and semiparametric ordinal probit or logit (proportional odds) models (right
panel). Ordinal models do not assume any shape for the distribution of Y for a given X; they only assume parallelism.
Prob[Y y|X] = F (g(y) X),

where g is unknown and may be discontinuous.
From this point we revert back to Y y notation so that Y
increases as X increases.
Global Modeling Implications
Ordinal regression invariant to choice of transformation of

Y
Y needs to be ordinal
Difference in two ordinal variables is not necessarily ordinal
Never analyze differences in regression
Y =final value, adjust for baseline values as covariates
9.5
Ordinal Regression Applied to HbA1c

In Figure 9.1, logit inverse curves are not parallel so proportional odds assumption does not hold
log-log link yields highest degree of parallelism and most
constant regression coeffients across cutoffs of gh so use this
link in an ordinal regression model (linearity of the curves is
not required)
9.5.1
Checking Fit for Various Models Using Age

Another way to examine model fit is to flexibly fit the single
most important predictor (age) using a variety of methods, and
comparing predictions to sample quantiles and means based on
overlapping subsets on age, each subset being subjects having age < 5 years away from the point being predicted by the
models. Here we predict the 0.5, 0.75, and 0.9 quantiles and
the mean. For quantiles we can compare to quantile regression(discussed below) and for means we compare to OLS.
ag 25:75
lag length ( ag )
q2 q3 p90 means numeric ( lag )
for ( i in 1: lag ) {
s which ( abs ( w $ age - ag [ i ]) < 5)
y w $ gh [ s ]
a quantile (y , probs = c ( .5 , .75 , .9 ) )
q2 [ i ]
a [1]
q3 [ i ]
a [2]
p90 [ i ]
a [3]
means [ i ] mean ( y )
}
fams c ( logistic , probit , loglog , cloglog )
fe
function ( pred , target ) mean ( abs ( pred $ yhat - target ) )
mod gh rcs ( age ,6)
P
Er list ()
for ( est in c ( q2 , q3 , p90 , mean ) ) {
meth if ( est == mean ) ols else QR
p list ()
er rep ( NA , 5)
names ( er ) c ( fams , meth )
for ( family in fams ) {
h orm ( mod , family = family , data = w )
fun if ( est == mean ) Mean ( h )
else {
qu Quantile ( h )
switch ( est , q2 = function ( x ) qu ( .5 , x ) ,
q3 = function ( x ) qu ( .75 , x ) ,
p90 = function ( x ) qu ( .9 , x ) )
}
p [[ family ]] z Predict (h , age = ag , fun = fun , conf.int = FALSE )
er [ family ] fe (z , switch ( est , mean = means , q2 = q2 , q3 = q3 , p90 = p90 ) )
}
h switch ( est ,
mean = ols ( mod , data = w ) ,
q2 = Rq ( mod , data = w ) ,
q3 = Rq ( mod , tau =0 .75 , data = w ) ,
p90 = Rq ( mod , tau =0 .90 , data = w ) )
p [[ meth ]] z Predict (h , age = ag , conf.int = FALSE )
er [ meth ] fe (z , switch ( est , mean = means , q2 = q2 , q3 = q3 , p90 = p90 ) )
Er [[ est ]] er
pr do.call ( rbind , p )
pr $ est est
P rbind.data.frame (P , pr )
}
xyplot ( yhat age | est , groups = .set. , data =P , type = l , # F i g u r e 9.3
auto.key = list ( x = .75 , y = .2 , points = FALSE , lines = TRUE ) ,
panel = function ( ... , subscripts ) {
panel.xyplot ( ... , subscripts = subscripts )
est P $ est [ subscripts [1]]
lpoints ( ag , switch ( est , mean = means , q2 = q2 , q3 = q3 , p90 = p90 ) ,
col = gray ( .7 ) )
er format ( round ( Er [[ est ]] ,3) , nsmall =3)
ltext (26 , 6 .15 , paste ( names ( er ) , collapse = \ n ) ,
cex = .7 , adj =0)
ltext (40 , 6 .15 , paste ( er , collapse = \ n ) ,
cex = .7 , adj =1) })
It can be seen in Figure 9.3 that models dedicated to a specific

task (quantile reqression for quantiles and OLS for means) were
best for those tasks. Although the log-log ordinal cumulative
30
40
q2
50
60
70
q3
logistic 0.048
logistic 0.050
probit
probit
0.052
0.045
loglog 0.058
loglog 0.037
cloglog 0.072
cloglog 0.077
QR
QR
0.024
6.2
0.027
5.8
5.6
6.0
5.4
5.2
yhat
mean
6.2
6.0
5.8
5.6
5.4
5.2
p90
logistic 0.021
logistic 0.053
probit
probit
0.025
loglog 0.026
loglog 0.041
cloglog 0.033
cloglog 0.101
ols
QR
0.013
30
40
50
60
0.047

0.030
logistic
probit
loglog
cloglog
QR
ols
70
age
Figure 9.3: Three estimated quantiles and estimated mean using 6 methods, compared against caliper-matched sample quantiles/means (circles). Numbers are mean absolute differences between predicted and sample quantities using overlapping intervals
of age and caliper matching. QR:quantile regression.
probability model did not estimate the median as accurately as

some other methods, it does well for the 0.75 and 0.9 quantiles
and is the best compromise overall because of its ability to
also directly predict the mean as well as quantitles such as
Prob[HbA1c > 7|X].
For here on we focus on the log-log ordinal model.
Going back to the bottom left of figure 9.1, lets look at quantile
groups of predicted HbA1c by OLS and plot predicted distributions of actual HbA1c against empirical distributions.
w $ pghg cut2 ( pgh , g =6)
f orm ( gh pghg , family = loglog , data = w )
lp predict (f , newdata = data.frame ( pghg = levels ( w $ pghg ) ) )
ep ExProb ( f ) # E x c e e d a n c e p r o b . f u n c t n . g e n e r a t o r i n r m s
z ep ( lp )
j order ( w $ pghg ) # p u t s i n o r d e r o f l p ( l e v e l s o f p g h g )
plot (z , xlim = c (4 , 7 .5 ) , data = w [j , c ( pghg , gh ) ]) # F i g . 9.4
1.0
Prob(Y y)
0.8
0.6
0.4
[4.88,5.29)
[5.29,5.44)
[5.44,5.56)
[5.56,5.66)
[5.66,5.76)
[5.76,6.48]
0.2
0.0
4.0
4.5
5.0
5.5
6.0
6.5
7.0
7.5
gh
Figure 9.4: Observed (dashed lines, open circles) and predicted (solid lines, closed circles) exceedance probability distributions from
a model using 6-tiles of OLS-predicted HbA1c . Key shows quantile group intervals of predicted mean HbA1c .
Agreement between predicted and observed exceedance probability distributions is excellent in Figure 9.4.
To return to the initial look at a linear model with assumed
Gaussian residuals, fit a probit ordinal model and compare the
estimated intercepts to the linear relationship with gh that is
assumed by the normal distribution.
f orm ( gh rcs ( age ,6) , family = probit , data = w )
g ols ( gh rcs ( age ,6) , data = w )
s g $ stats [ Sigma ]
yu f $ yunique [ -1 ]
r quantile ( w $ gh , c ( .005 , .995 ) )
alphas coef ( f ) [1: num.intercepts ( f ) ]
plot ( -yu / s , alphas , type = l , xlim = rev ( - r / s ) , # F i g . 9.5
xlab = expression ( -y / hat ( sigma ) ) , ylab = expression ( alpha [ y ]) )
2
1
0
1
2
3
4
5
14
12
10
^
y
Figure 9.5: Estimated intercepts from probit model
Figure 9.5 depicts a significant departure from that implied by

Gaussian residuals.
9.5.2
Examination of BMI
Using the log-log model, we first check the adequacy of BMI
as a summary of height and weight for estimating median gh.
Adjust for age (without assuming linearity) in every case
Look at ratio of coefficients of log height and log weight
Use AIC to judge whether BMI is an adequate summary of
height and weight
f orm ( gh rcs ( age ,5) + log ( ht ) + log ( wt ) ,
family = loglog , data = w )
-log-log Ordinal Regression Model

orm(formula = gh ~ rcs(age, 5) + log(ht) + log(wt), data = w,
family = loglog)
Model Likelihood
Discrimination
Ratio Test
Indexes
Obs
4629 LR 2
1126.94 R2
0.217
Unique Y
63 d.f.
6 g
0.627
Y0.5
5.5 Pr(> 2 ) < 0.0001 gr
1.872
log L
1
6
2
max | | 110
Score 1262.81 |Pr(Y Y0.5 ) 2 | 0.153
Pr(> 2 ) < 0.0001
age
age
age
age
ht
wt
Coef
0.0398
-0.0158
-0.0072
0.0309
-3.0680
1.2748
Rank Discrim.
Indexes
0.486
S.E.
Wald Z Pr(> |Z|)
0.0055
7.29 < 0.0001
0.0275
-0.57
0.5657
0.0866
-0.08
0.9333
0.1135
0.27
0.7853
0.2789
-11.00 < 0.0001
0.0704
18.10 < 0.0001
aic NULL
for ( mod in list ( gh rcs ( age ,5) + rcs ( log ( bmi ) ,5) ,
gh rcs ( age ,5) + rcs ( log ( ht ) ,5) + rcs ( log ( wt ) ,5) ,
gh rcs ( age ,5) + rcs ( log ( ht ) ,4) * rcs ( log ( wt ) ,4) ) )
aic c ( aic , AIC ( orm ( mod , family = loglog , data = w ) ) )

print ( aic )
[1] 25910.77 25910.17 25906.03
The ratio of the coefficient of log height to the coefficient of log

weight is -2.4, which is between what BMI uses and the more
dimensionally reasonable weight / height3. By AIC, a spline interaction surface between height and weight does slightly better
than BMI in predicting HbA1c, but a nonlinear function of BMI
is barely worse. It will require other body size measures to
displace BMI as a predictor.
As an aside, compare this model fit to that from the Cox proportional hazards model. The Cox model uses a conditioning
argument to obtain a partial likelihood free of the intercepts
(and requires a second step to estimate these log discrete hazard components) whereas we are using a full marginal likelihood
of the ranks of Y [90].
print ( cph ( Surv ( gh ) rcs ( age ,5) + log ( ht ) + log ( wt ) , data = w ) ,
latex = TRUE )
Cox Proportional Hazards Model

cph(formula = Surv(gh) ~ rcs(age, 5) + log(ht) + log(wt), data = w)
Model Tests
Discrimination
Indexes
2
2
Obs
4629 LR
1120.20 R
0.215
Events 4629 d.f.
6 Dxy
0.359
2
Center 8.3792 Pr(> ) 0.0000 g
0.622
Score 2 1258.07 gr
1.863
2
Pr(> ) 0.0000
age
age
age
age
ht
wt
Coef
-0.0392
0.0148
0.0093
-0.0321
3.0477
-1.2653
S.E.
Wald Z Pr(> |Z|)
0.0054
-7.24 < 0.0001
0.0274
0.54
0.5888
0.0862
0.11
0.9144
0.1131
-0.28
0.7767
0.2779
10.97 < 0.0001
0.0701
-18.04 < 0.0001
Back up and look at all body size measures, and examine their
redundancies.
v varclus ( wt + ht + bmi + leg + arml + armc + waist +
tri + sub + age + sex + re , data = w )
plot ( v )
# F i g u r e 9.6
#
Omit
wt
so
it
won t
be
removed
before
bmi
redun ( ht + bmi + leg + arml + armc + waist + tri + sub ,

data =w , r2 = .75 )
Redundancy Analysis
redun ( formula = ht + bmi + leg + arml + armc + waist + tri +
sub , data = w , r2 = 0.75)
n : 3853
p: 8
nk : 3
Number of NAs :
776
ht
bmi
leg arml armc waist
tri
sub
0
0
155
127
130
164
334
655
Transformation of target variables forced to be linear

R2 cutoff : 0.75
Type : ordinary
R2 with which each variable can be predicted from all other variables :
ht
bmi
leg arml armc waist
tri
sub
0.829 0.924 0.682 0.748 0.843 0.864 0.531 0.594
Rendundant variables :
bmi ht
Predicted from variables :
leg arml armc waist tri sub
Variable Deleted
R2 R2 after later deletions
1
bmi 0.924
0.909
2
ht 0.792
Six size measures adequately capture the entire set. Height and
BMI are removed. An advantage of removing height is that it
is age-dependent due to vertebral compression in the elderly:
f orm ( ht rcs ( age ,4) * sex , data = w ) # P r o p . o d d s m o d e l
qu Quantile ( f ) ; med function ( x ) qu ( .5 , x )
ggplot ( Predict (f , age , sex , fun = med , conf.int = FALSE ) ,
ylab = Predicted Median Height , cm )
tri
sub
sexfemale
leg
0.4
ht
arml
0.6
0.8
1.0
bmi
waist
wt
armc
Spearman 2
0.2
age
reOther Race Including MultiRacial
reOther Hispanic
reNonHispanic White
reNonHispanic Black
0.0
Figure 9.6: Variable clustering for all potential predictors
Predicted Median Height, cm
180
175
sex
170
male
female
165
160
155
20
40
60
80
Age, years
Figure 9.7: Estimated median height as a smooth function of age, allowing age to interact with sex, from a proportional odds model
But also see a change in leg length:

f orm ( leg rcs ( age ,4) * sex , data = w )
qu Quantile ( f ) ; med function ( x ) qu ( .5 , x )
ggplot ( Predict (f , age , sex , fun = med , conf.int = FALSE ) ,
ylab = Predicted Median Upper Leg Length , cm )
Next allocate d.f. according to generalized Spearman 2i.

s spearman2 ( gh age + sex + re + wt + leg + arml + armc +
waist + tri + sub , data =w , p =2)
plot ( s )
i Competition
between collinear size measures hurts interpretation of partial tests of association in a saturated additive model.
Predicted Median Upper Leg Length, cm
42
40
sex
male
38
female
36
34
20
40
60
80
Age, years
Figure 9.8: Estimated median upper leg length as a smooth function of age, allowing age to interact with sex, from a proportional
odds model
Spearman 2 Response : gh
age
waist
leg
sub
armc
wt
re
tri
arml
sex
0.00
0.05
0.10
Adjusted
0.15
N df
4629 2
4465 2
4474 2
3974 2
4499 2
4629 2
4629 4
4295 2
4502 2
4629 1
0.20
Figure 9.9: Generalized squared rank correlations
Parameters will be allocated in descending order of 2. But

note that subscapular skinfold has a large number of NAs and
other predictors also have NAs. Suboptimal casewise deletion
will be used until the final model is fitted.
Because there are many competing body measures, we use
backwards stepdown to arrive at a set of predictors. The boot-
strap will be used to penalize predictive ability for variable selection. First the full model is fit using casewise deletion, then
we do a composite test to assess whether any of the frequently
missing predictors is important.
f orm ( gh rcs ( age ,5) + sex + re + rcs ( wt ,3) + rcs ( leg ,3) + arml +
rcs ( armc ,3) + rcs ( waist ,4) + tri + rcs ( sub ,3) ,
family = loglog , data =w , x = TRUE , y = TRUE )
print (f , latex = TRUE , coefs = FALSE )

orm(formula = gh ~ rcs(age, 5) + sex + re + rcs(wt, 3) + rcs(leg,
3) + arml + rcs(armc, 3) + rcs(waist, 4) + tri + rcs(sub,
3), data = w, x = TRUE, y = TRUE, family = loglog)
N
sub
tri
waist
leg
armc
arml
gh
age
sex
re
wt
t
t
t
t
t
0
Obs
3853
Unique Y
60
Y0.5
5.5
L
max | log
| 3105
##
Composite
t
t
100
t
t
200
t 655
300
Model Likelihood
Ratio Test
LR 2
1180.13
d.f.
22
Pr(> 2 ) < 0.0001
Score 2 1298.88
Pr(> 2 ) < 0.0001
334
164
155
130
127
0
0
0
0
0
400
500
600
700
Discrimination
Indexes
R2
g
gr
|Pr(Y Y0.5 ) 12 |
test :
lan function ( a ) latex (a , table.env = FALSE , file = )

lan ( anova (f , leg , arml , armc , waist , tri , sub ) )
0.265
0.732
2.080
0.172
Rank Discrim.
Indexes
0.520
2
d.f.
P
leg
8.30
2
0.0158
Nonlinear
3.32
1
0.0685
arml
0.16
1
0.6924
armc
6.66
2
0.0358
Nonlinear
3.29
1
0.0695
waist
29.40
3 < 0.0001
Nonlinear
4.29
2
0.1171
tri
16.62
1 < 0.0001
sub
40.75
2 < 0.0001
Nonlinear
4.50
1
0.0340
TOTAL NONLINEAR 14.95
5
0.0106
TOTAL
128.29 11 < 0.0001
The model yields Spearman = 0.52, the rank correlation

between predicted and observed HbA1c.
Show predicted mean and median HbA1c as a function of age,
adjusting other variables to median/mode. Compare the estimate of the median with that from quantile regression (discussed below).
M
qu
med
p90
fq
fq90
Mean ( f )
Quantile ( f )
function ( x ) qu ( .5 , x )
function ( x ) qu ( .9 , x )
Rq ( formula ( f ) , data = w )
Rq ( formula ( f ) , data =w , tau = .9 )
pmean Predict (f ,
age , fun =M ,
conf.int = FALSE )
pmed
Predict (f ,
age , fun = med , conf.int = FALSE )
p90
Predict (f ,
age , fun = p90 , conf.int = FALSE )
pmedqr Predict ( fq ,
age , conf.int = FALSE )
p90qr Predict ( fq90 , age , conf.int = FALSE )
z rbind ( orm mean = pmean , orm median = pmed , orm P90 = p90 ,
QR median = pmedqr , QR P90 = p90qr )
ggplot (z , groups = .set. ,
adj.subtitle = FALSE , legend.label = FALSE )
print ( fastbw (f , rule = p ) , estimates = FALSE )
Deleted
arml
sex
wt
armc
Chi - Sq
0.16
0.45
5.72
3.32
d.f.
1
1
2
2
P
0.6924
0.5019
0.0572
0.1897
Residual
0.16
0.61
6.33
9.65
d.f.
1
2
4
6
P
0.6924
0.7381
0.1759
0.1400
AIC
-1.84
-3.39
-1.67
-2.35
6.25
6.00
orm mean
5.75
orm median
orm P90
5.50
QR median
QR P90
5.25
5.00
20
40
60
80
Age, years
Figure 9.10: Estimated mean and 0.5 and 0.9 quantiles from the log-log ordinal model using casewise deletion, along with predictions
of 0.5 and 0.9 quantiles from quantile regression (QR). Age is varied and other predictors are held constant to medians/modes.
Factors in Final Model

[1] age
re
leg
waist tri
sub
set.seed (13) # s o c a n r e p r o d u c e r e s u l t s
v validate (f , B =100 , bw = TRUE , estimates = FALSE , rule = p )
Backwards Step - down - Original Model

Deleted
arml
sex
wt
armc
Chi - Sq
0.16
0.45
5.72
3.32
d.f.
1
1
2
2
P
0.6924
0.5019
0.0572
0.1897
Residual
0.16
0.61
6.33
9.65
d.f.
1
2
4
6
P
0.6924
0.7381
0.1759
0.1400
AIC
-1.84
-3.39
-1.67
-2.35

[1] age
#
Show
re
number
leg
of
waist tri
variables
sub
selected
in
latex (v , B =30 , file = , size = small )
first
30
boots
Index
Original
Sample
0.5225
0.2712
1.0000
1.2276
0.2007
R2
Slope
g
|Pr(Y Y0.5 ) 12 |
Training
Sample
0.5290
0.2788
1.0000
1.2505
0.2050
Test
Sample
0.5208
0.2692
0.9761
1.2207
0.1987
Optimism
0.0083
0.0095
0.0239
0.0298
0.0064
Corrected
Index
0.5142
0.2617
0.9761
1.1978
0.1943
n
100
100
100
100
100
Factors Retained in Backwards Elimination

First 30 Resamples
age
sex
re
wt
leg
arml
armc
waist
tri
sub
Frequencies of Numbers of Factors Retained

5
1
6
19
7
29
8
46
9
4
10
1
Next fit the reduced model. Use multiple imputation to impute

missing predictors.
a aregImpute ( gh + wt + ht + bmi + leg + arml + armc + waist +
tri + sub + age + re , data =w , n.impute =5 , pr = FALSE )
g fit.mult.impute ( gh rcs ( age ,5) + re + rcs ( leg ,3) +
rcs ( waist ,4) + tri + rcs ( sub ,4) ,
orm , a , family = loglog , data =w , pr = FALSE )
print (g , latex = TRUE , needspace = 1 .5in )

fit.mult.impute(formula = gh ~ rcs(age, 5) + re + rcs(leg, 3) +
rcs(waist, 4) + tri + rcs(sub, 4), fitter = orm, xtrans = a,
data = w, pr = FALSE, family = loglog)
Model Likelihood
Discrimination
Rank Discrim.
Ratio Test
Indexes
Indexes
2
2
Obs
4629 LR
1448.42 R
0.269
0.513
Unique Y
63 d.f.
17 g
0.743
Y0.5
5.5 Pr(> 2 ) < 0.0001 gr
2.102
log L
1
2
5
max | | 110
Score 1569.21 |Pr(Y Y0.5 ) 2 | 0.173
Pr(> 2 ) < 0.0001
age
age
age
age
re=Other Hispanic
re=Non-Hispanic White
re=Non-Hispanic Black
re=Other Race Including Multi-Racial
leg
leg
waist
waist
waist
tri
sub
sub
sub
an anova ( g )
lan ( an )
Coef
0.0404
-0.0228
0.0126
0.0424
-0.0766
-0.4121
0.0645
-0.0555
-0.0339
0.0153
0.0073
0.0304
-0.0910
-0.0163
-0.0027
0.0674
-0.1895
S.E.
Wald Z Pr(> |Z|)
0.0055
7.29 < 0.0001
0.0279
-0.82
0.4137
0.0876
0.14
0.8857
0.1148
0.37
0.7116
0.0597
-1.28
0.1992
0.0449
-9.17 < 0.0001
0.0566
1.14
0.2543
0.0750
-0.74
0.4593
0.0091
-3.73
0.0002
0.0105
1.46
0.1434
0.0050
1.47
0.1428
0.0158
1.93
0.0536
0.0508
-1.79
0.0732
0.0026
-6.28 < 0.0001
0.0097
-0.28
0.7817
0.0289
2.33
0.0198
0.0922
-2.06
0.0398
2
d.f.
P
age
692.50
4 < 0.0001
Nonlinear
28.47
3 < 0.0001
re
168.91
4 < 0.0001
leg
24.37
2 < 0.0001
Nonlinear
2.14
1
0.1434
waist
128.31
3 < 0.0001
Nonlinear
4.05
2
0.1318
tri
39.44
1 < 0.0001
sub
39.30
3 < 0.0001
Nonlinear
6.63
2
0.0363
TOTAL NONLINEAR
46.80
8 < 0.0001
TOTAL
1464.24 17 < 0.0001
b
#
anova (g , leg , waist , tri , sub )

Add
new
lines
to
the
plot
with
combined
effect
of
size
var.
s rbind ( an , size = b [ TOTAL , ])

class ( s ) anova.rms
plot ( s )
leg
sub
tri
waist
re
size
age
100
200
300
400
500
600
700
2 df
Figure 9.11: ANOVA for reduced model after multiple imputation, with addition of a combined effect for four size variables
ggplot ( Predict ( g ) , abbrev = TRUE , ylab = NULL )
Figure
M Mean ( g )
ggplot ( Predict (g , fun = M ) , abbrev = TRUE , ylab = NULL )
9.12
Figure
9.13
Compare the estimated age partial effects and confidence intervals with those from a model using casewise deletion, and
with bootstrap nonparametric confidence intervals (also with
casewise deletion).
1.5
1.5
1.0
1.0
0.5
0.5
0.0
0.0
0.5
0.5
1.0
Race Ethnicity
ORIMR
1.0
20
40
60
80
35
40
1.5
1.5
1.0
1.0
1.0
0.5
0.5
0.5
0.0
0.0
0.0
0.5
0.5
0.5
1.0
1.0
1.0
20
30
40
NnHsW
OthrHs
10
20
1.00.5 0.0 0.5 1.0 1.5
1.5
10
45
Upper Leg Length, cm
Subscapular Skinfold, mm
NnHsB
MxcnAm
30
Age, years
30
Triceps Skinfold, mm
40
80
100
120
140
Waist Circumference, cm
Figure 9.12: Partial effects (log hazard or log-log cumulative probability scale) of all predictors in reduced model, after multiple
imputation
6.25
6.25
6.00
6.00
5.75
5.75
5.50
5.50
5.25
5.25
5.00
5.00
20
40
60
80
Race Ethnicity
ORIMR
35
40
6.25
6.00
6.00
6.00
5.75
5.75
5.75
5.50
5.50
5.50
5.25
5.25
5.25
5.00
20
NnHsW
30
40
Subscapular Skinfold, mm
OthrHs
Upper Leg Length, cm

6.25
5.00
45
6.25
10
NnHsB
MxcnAm
30
Age, years
5.00 5.25 5.50 5.75 6.00 6.25
5.00
10
20
30
Triceps Skinfold, mm
40
80
100
120
140
Waist Circumference, cm
Figure 9.13: Partial effects (mean scale) of all predictors in reduced model, after multiple imputation
gc orm ( gh rcs ( age ,5) + re + rcs ( leg ,3) +

rcs ( waist ,4) + tri + rcs ( sub ,4) ,
family = loglog , data =w , x = TRUE , y = TRUE )
gb bootcov ( gc , B =300)
bootclb Predict ( gb , age , boot.type = basic )
bootclp Predict ( gb , age , boot.type = percentile )
multimp Predict (g , age )
plot ( Predict ( gc , age ) , addpanel = function ( ... ) {
with ( bootclb , { llines ( age , lower , col = blue )
llines ( age , upper , col = blue ) })
with ( bootclp , { llines ( age , lower , col = blue , lty =2)
llines ( age , upper , col = blue , lty =2) })
with ( multimp , { llines ( age , lower , col = red )
llines ( age , upper , col = red )
llines ( age , yhat , col = red ) } ) } ,
col.fill = gray ( .9 ) , adj.subtitle = FALSE )
# F i g u r e 9.14
log hazard
1.0
0.5
0.0
0.5
20
30
40
50
60
70
80
Age, years
Figure 9.14: Partial effect for age from multiple imputation (center red line) and casewise deletion (center blue line) with symmetric
Wald 0.95 confidence bands using casewise deletion (gray shaded area), basic bootstrap confidence bands using casewise deletion
(blue lines), percentile bootstrap confidence bands using casewise deletion (dashed blue lines), and symmetric Wald confidence bands
accounting for multiple imputation (red lines).
M Mean ( g )
qu Quantile ( g )
med function ( lp ) qu ( .5 , lp )
q90 function ( lp ) qu ( .9 , lp )
lp predict ( g )
lpr quantile ( predict ( g ) , c ( .002 , .998 ) , na.rm = TRUE )
lps seq ( lpr [1] , lpr [2] , length =200)
pmn M ( lps )
pme med ( lps )
p90 q90 ( lps )

plot ( pmn , pme ,
# F i g u r e 9.15
xlab = expression ( paste ( Predicted Mean , HbA [ " 1 c " ]) ) ,
ylab = Median and 0 .9 Quantile , type = l ,
xlim = c (4 .75 , 8 .0 ) , ylim = c (4 .75 , 8 .0 ) , bty = n )
box ( col = gray ( .8 ) )
lines ( pmn , p90 , col = blue )
abline ( a =0 , b =1 , col = gray ( .8 ) )
text (6 .5 , 5 .5 , Median )
text (5 .5 , 6 .3 , 0 .9 , col = blue )
nint 350
scat1d ( M ( lp ) ,
nint = nint )
scat1d ( med ( lp ) , side =2 , nint = nint )
scat1d ( q90 ( lp ) , side =4 , col = blue , nint = nint )
Median and 0.9 Quantile
8.0
7.5
7.0
6.5
0.9
6.0
Median
5.5
5.0
5.0
5.5
6.0
6.5
7.0
7.5
8.0
Predicted Mean HbA1c

Figure 9.15: Predicted mean HbA1c vs. predicted median and 0.9 quantile along with their marginal distributions
g
Newlevels (g , list ( re = abbreviate ( levels ( w $ re ) ) ) )
exprob ExProb ( g )
nom
nomogram (g , fun = list ( Mean =M ,
Median Glycohemoglobin = med ,
0 .9 Quantile
= q90 ,
Prob ( HbA1c 6 .5 ) =
function ( x ) exprob (x , y =6 .5 ) ,
Prob ( HbA1c 7 .0 ) =
function ( x ) exprob (x , y =7) ,
Prob ( HbA1c 7 .5 ) =
function ( x ) exprob (x , y =7 .5 ) ) ,
fun.at = list ( seq (5 , 8 , by = .5 ) ,
c (5 ,5 .25 ,5 .5 ,5 .75 ,6 ,6 .25 ) ,
c (5 .5 ,6 ,6 .5 ,7 ,8 ,10 ,12 ,14) ,
c ( .01 , .05 , .1 , .2 , .3 , .4 ) ,
c ( .01 , .05 , .1 , .2 , .3 , .4 ) ,
c ( .01 , .05 , .1 , .2 , .3 , .4 ) ) )
plot ( nom , lmgp = .28 )
9.16
Figure
10
20
30
40
50
60
70
80
90
100
Points
Age
20
25
30
OthH
35
40
45
50
55
60
65 70 75
80
Race/Ethnicity
NHW
ORIM
Upper Leg Length

55
45
35
30
25
20
Waist Circumference
50
70
90
100
110
120
130
140
150
160
170
Triceps Skinfold
45
35
25
15 20
25
30
15
5
35 40 45
Subscapular Skinfold
10
Total Points
0
20
40
60
80
100
120
140
160
180
200
220
240
260
280
Linear Predictor
1.5
0.5
0.5
1.5
2.5
Mean
5
5.5
6.5
7 7.5 8
Median Glycohemoglobin
5
5.25
5.5
5.75
6.25
0.9 Quantile
5.5
6.5 7 8 10 12
14
Prob(HbA1c >= 6.5)

0.01
0.05 0.1
0.2
0.3 0.4
Prob(HbA1c >= 7.0)

0.01
0.05 0.1
0.2
0.3 0.4
Prob(HbA1c >= 7.5)

0.01
0.05 0.1
0.2
0.3
Figure 9.16: Nomogram for predicting median, mean, and 0.9 quantile of glycohemoglobin, along with the estimated probability
that HbA1c 6.5, 7, or 7.5, all from the log-log ordinal model
Chapter 10
Case Study in Parametric Survival Modeling

and Model Approximation
Data source: Random sample of 1000 patients from Phases I
& II of SUPPORT (Study to Understand Prognoses Preferences
Outcomes and Risks of Treatment, funded by the Robert Wood
Johnson Foundation). See [93]. The dataset is available from
http://biostat.mc.vanderbilt.edu/DataSets.
Analyze acute disease subset of SUPPORT (acute respiratory failure, multiple organ system failure, coma) the
shape of the survival curves is different between acute and
chronic disease categories
Patients had to survive until day 3 of the study to qualify
Baseline physiologic variables measured during day 3
10-1
CHAPTER 10. PARAMETRIC SURVIVAL MODELING AND MODEL APPROXIMATION
10-2
10.1
Descriptive Statistics
Create a variable acute to flag categories of interest; print univariable descriptive statistics.
require ( rms )
getHdata ( support )
# Get data frame from web site
acute support $ dzclass % in % c ( ARF / MOSF , Coma )
latex ( describe ( support [ acute ,]) , file = )
support[acute, ]
35 Variables
537 Observations
age : Age
n
537
missing
0
unique
529
Info
1
Mean
60.7
.05
28.49
.10
35.22
.25
47.93
.50
63.67
.75
74.49
.90
81.54
lowest : 18.04 18.41 19.76 20.30 20.31

highest: 91.62 91.82 91.93 92.74 95.51
death : Death at any time up to NDI date:31DEC94

n
537
missing
0
unique
2
missing
0
unique
2
Info
0.67
Sum
356
Mean
0.6629
sex
n
537
female (251, 47%), male (286, 53%)
hospdead : Death in Hospital

n
537
missing
0
unique
2
Info
0.7
Sum
201
Mean
0.3743
slos : Days from Study Entry to Discharge

n
537
missing
0
lowest :
unique
85
4
Info
1
6
Mean
23.44
.05
4.0
.10
5.0
.25
9.0
.50
15.0
.75
27.0
.90
47.4
.95
68.2
7, highest: 145 164 202 236 241
d.time : Days of Follow-Up

n
537
missing
0
lowest :
unique
340
4
Info
1
6
Mean
446.1
.05
4
.10
6
.25
16
.50
182
.75
724
.90
1421
7, highest: 1977 1979 1982 2011 2022
dzgroup
n
537
missing
0
unique
3
ARF/MOSF w/Sepsis (391, 73%), Coma (60, 11%), MOSF w/Malig (86, 16%)
dzclass
n
537
missing
0
unique
2
ARF/MOSF (477, 89%), Coma (60, 11%)
.95
1742
.95
85.56

num.co : number of comorbidities
n
537
missing
0
unique
7
Info
0.93
Mean
1.525
0
1
2 3 4 5 6
Frequency 111 196 133 51 31 10 5
%
21 36 25 9 6 2 1
edu : Years of Education

n
411
missing
126
lowest :
unique
22
Info
0.96
Mean
12.03
.05
7
.10
8
.25
10
.50
12
.75
14
.90
16
.95
17
.75
37
.90
55
.95
100
4, highest: 17 18 19 20 22
income
n
335
missing
202
unique
4
under $11k (158, 47%), $11-$25k (79, 24%), $25-$50k (63, 19%)
>$50k (35, 10%)
scoma : SUPPORT Coma Score based on Glasgow D3

n
537
missing
0
unique
11
Info
0.82
Mean
19.24
.05
0
.10
0
.25
0
.50
0
0 9 26 37 41 44 55 61 89 94 100
Frequency 301 50 44 19 17 43 11 6 8 6 32
%
56 9 8 4 3 8 2 1 1 1
6
charges : Hospital Charges

n
517
missing
20
unique
516
Info
1
Mean
86652
.05
11075
.10
15180
.25
27389
.50
51079
.75
100904
.90
205562
.95
283411
lowest :
3448
4432
4574
5555
5849
highest: 504660 538323 543761 706577 740010
totcst : Total RCC cost

n
471
missing
66
unique
471
Info
1
Mean
46360
.05
6359
.10
8449
.25
15412
.50
29308
.75
57028
.90
108927
.10
8283
.25
14415
.50
26323
.75
54102
.90
87495
.10
14.50
.25
19.62
.50
28.00
.75
39.00
.95
141569
lowest :
0
2071
2522
3191
3325
highest: 269057 269131 338955 357919 390460
totmcst : Total micro-cost

n
331
missing
206
unique
328
Info
1
Mean
39022
.05
6131
.95
111920
lowest :
0
1562
2478
2626
3421
highest: 144234 154709 198047 234876 271467
avtisst : Average TISS, Days 3-25

n
536
missing
1
unique
205
Info
1
Mean
29.83
.05
12.46
.90
47.17
.95
50.37
lowest : 4.000 5.667 8.000 9.000 9.500

highest: 58.500 59.000 60.000 61.000 64.000
race
n
535
missing
2
Frequency
%
unique
5
white black asian other hispanic

417
84
4
8
22
78
16
1
1
4
meanbp : Mean Arterial Blood Pressure Day 3

n
537
missing
0
lowest :
unique
109
20
27
Info
1
30
Mean
83.28
.05
41.8
.10
49.0
.25
59.0
.50
73.0
.75
111.0
.90
124.4
.95
135.0
32, highest: 155 158 161 162 180
wblc : White Blood Cell Count Day 3

n
532
missing
5
lowest :
highest:
unique
241
0.05000
51.39844
Info
1
0.06999
58.19531
Mean
14.1
0.09999
61.19531
.05
0.8999
.10
4.5000
.25
7.9749
.50
12.3984
.75
18.1992
.90
25.1891
0.14999
0.19998
79.39062 100.00000
hrt : Heart Rate Day 3

n
537
missing
0
lowest :
unique
111
11
30
36
Info
1
Mean
105
.05
51
.10
60
.25
75
.50
111
40, highest: 189 193 199 232 300
.75
126
.90
140
.95
155
.95
30.1873
10-3

resp : Respiration Rate Day 3
n
537
missing
0
lowest :
unique
45
Info
1
Mean
23.72
.05
8
.10
10
.25
12
.50
24
.75
32
.90
39
.95
40
8, highest: 48 49 52 60 64
temp : Temperature (celcius) Day 3

n
537
missing
0
unique
61
Info
1
Mean
37.52
.05
35.50
.10
35.80
.05
86.99
.10
105.08
.25
36.40
.50
37.80
.75
38.50
.90
39.09
.95
39.50
lowest : 32.50 34.00 34.09 34.90 35.00

highest: 40.20 40.59 40.90 41.00 41.20
pafi : PaO2/(.01*FiO2) Day 3

n
500
missing
37
unique
357
Info
1
Mean
227.2
.25
137.88
.50
202.56
.75
290.00
.90
390.49
.95
433.31
lowest : 45.00 48.00 53.33 54.00 55.00

highest: 574.00 595.12 640.00 680.00 869.38
alb : Serum Albumin Day 3

n
346
missing
191
unique
34
Info
1
Mean
2.668
.05
1.700
.10
1.900
.25
2.225
.50
2.600
.75
3.100
.90
3.400
.95
3.800
lowest : 1.100 1.200 1.300 1.400 1.500

highest: 4.100 4.199 4.500 4.699 4.800
bili : Bilirubin Day 3

n
386
missing
151
unique
88
Info
1
Mean
2.678
.05
0.3000
.10
0.4000
.25
0.6000
.50
0.8999
.75
2.0000
.90
6.5996
.95
13.1743
lowest : 0.09999 0.19998 0.29999 0.39996 0.50000

highest: 22.59766 30.00000 31.50000 35.00000 39.29688
crea : Serum creatinine Day 3

n
537
missing
0
lowest :
0.3
unique
84
0.4
Info
1
0.5
0.6
Mean
2.232
.05
0.6000
.10
0.7000
.25
0.8999
.50
1.3999
.75
2.5996
.90
5.2395
.95
7.3197
0.7, highest: 10.4 10.6 11.2 11.6 11.8
sod : Serum sodium Day 3

n
537
missing
0
unique
38
Info
1
Mean
138.1
.05
129
.10
131
.25
134
.50
137
.75
142
.90
147
.95
150
lowest : 118 120 121 126 127, highest: 156 157 158 168 175
ph : Serum pH (arterial) Day 3

n
500
missing
37
unique
49
Info
1
Mean
7.416
.05
7.270
.10
7.319
.25
7.380
.50
7.420
.75
7.470
.90
7.510
.95
7.529
lowest : 6.960 6.989 7.069 7.119 7.130

highest: 7.560 7.569 7.590 7.600 7.659
glucose : Glucose Day 3

n
297
missing
240
lowest :
30
unique
179
42
52
Info
1
55
Mean
167.7
.05
76.0
.10
89.0
.25
106.0
.50
141.0
.75
200.0
.90
292.4
.95
347.2
68, highest: 446 468 492 576 598
bun : BUN Day 3

n
304
missing
233
lowest :
unique
100
3
Info
1
5
Mean
38.91
.05
8.00
.10
11.00
.25
16.75
.50
30.00
.75
56.00
.90
79.70
.95
100.70
6, highest: 123 124 125 128 146
urine : Urine Output Day 3

n
303
missing
234
lowest :
unique
262
5
Info
1
15
Mean
2095
missing
433
unique
8
Info
0.87
0 1 2 3 4 5 6 7
Frequency 51 19 7 6 4 7 8 2
%
49 18 7 6 4 7 8 2
.10
364.0
.25
1156.5
.50
1870.0
20, highest: 6865 6920 7360 7560 7750
adlp : ADL Patient Day 3

n
104
.05
20.3
Mean
1.577
.75
2795.0
.90
4008.6
.95
4817.5
10-4

adls : ADL Surrogate Day 3
n
392
missing
145
unique
8
Info
0.89
Mean
1.86
0 1 2 3 4 5 6 7
Frequency 185 68 22 18 17 20 39 23
%
47 17 6 5 4 5 10 6
sfdm2
n
468
missing
69
unique
5
no(M2 and SIP pres) (134, 29%), adl>=4 (>=5 if sur) (78, 17%)
SIP>=30 (30, 6%), Coma or Intub (5, 1%), <2 mo. follow-up (221, 47%)
adlsc : Imputed ADL Calibrated to Surrogate

n
537
missing
0
unique
144
Info
0.96
Mean
2.119
.05
0.000
.10
0.000
.25
0.000
.50
1.839
.75
3.375
.90
6.000
.95
6.000
lowest : 0.0000 0.4948 0.4948 1.0000 1.1667

highest: 5.7832 6.0000 6.3398 6.4658 7.0000
patterns
of
missing
data
#
Figure
10.1
0.0
num.co
dzclass
dzgroup
d.time
slos
hospdead
sex
age
death
adls
edu
income
0.2
alb
bili
Fraction Missing
0.1
0.3
0.5
urine
glucose
bun
adlp
0.4
totmcst
Show
adlsc
sod
crea
temp
resp
hrt
meanbp
race
avtisst
wblc
charges
totcst
scoma
pafi
ph
sfdm2
plot ( naclus ( support [ acute ,]) )
Figure 10.1: Cluster analysis showing which predictors tend to be missing on the same patients
10-5
10-6
Show associations between predictors using a general non-monotonic

measure of dependence (Hoeffding D).
ac support [ acute ,]
ac $ dzgroup ac $ dzgroup [ drop = TRUE ]
# Remove unused levels
attach ( ac )
vc varclus ( age + sex + dzgroup + num.co + edu + income + scoma + race +
meanbp + wblc + hrt + resp + temp + pafi + alb + bili + crea + sod +
ph + glucose + bun + urine + adlsc , sim = hoeffding )
plot ( vc )
# F i g u r e 10.2
0.10
0.15
0.20
0.25
0.30
sexmale
sod
raceasian
raceother
racehispanic
raceblack
dzgroupComa
scoma
dzgroupMOSF w/Malig
wblc
30 * Hoeffding D
0.05
edu
income$25$50k
income$11$25k
income>$50k
adlsc
num.co
glucose
alb
ph
pafi
meanbp
urine
resp
hrt
temp
age
bili
0.00
crea
bun
0.35
Figure 10.2: Hierarchical clustering of potential predictors using Hoeffding D as a similarity measure. Categorical predictors are
automatically expanded into dummy variables.
10-7
10.2
Checking Adequacy of Log-Normal Accelerated Failure Time Model

dd datadist ( ac )
#
describe
distributions
of
variables
to
rms
#
Generate
right-censored
survival
time
variable
years d.time / 365 .25

units ( years ) Year
S Surv ( years , death )
#
#
Show normal inverse Kaplan-Meier

s t r a t i f i e d by d z g r o u p
estimates
survplot ( npsurv ( S dzgroup ) , conf = none ,

fun = qnorm , logt = TRUE )
# F i g u r e 10.3
dzgroup=MOSF w/Malig
dzgroup=ARF/MOSF w/Sepsis
1
dzgroup=Coma
2
3
log Followup Time in Years

Figure 10.3: 1 (SKM (t)) stratified by dzgroup. Linearity and semi-parallelism indicate a reasonable fit to the log-normal accelerated
failure time model with respect to one predictor.
More stringent assessment of log-normal assumptions: check

distribution of residuals from an adjusted model:
f psm ( S dzgroup + rcs ( age ,5) + rcs ( meanbp ,5) ,
dist = lognormal , y = TRUE ) # d i s t = g a u s s i a n f o r S +
r resid ( f )
survplot (r , dzgroup , label.curve = FALSE )
survplot (r , age ,
label.curve = FALSE )
survplot (r , meanbp , label.curve = FALSE )
random.number runif ( length ( age ) )
survplot (r , random.number , label.curve = FALSE )
Figure
10-8
10.4
Age
0.8
0.8
Survival Probability
1.0
1.0
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
3.0
2.5
2.0
1.5
1.0
0.5
0.0
Residual
0.5
1.0
1.5
2.0
3.0
2.5
2.0
1.5
1.0
0.5
0.0
0.5
1.0
1.5
2.0
3.0
2.5
2.0
1.5
1.0
0.5
0.0
0.5
1.0
1.5
2.0
Residual
Mean Arterial Blood Pressure Day 3
0.8
0.8
1.0
1.0
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
3.0
2.5
2.0
1.5
1.0
0.5
0.0
Residual
0.5
1.0
1.5
2.0
Residual
Figure 10.4: Kaplan-Meier estimates of distributions of normalized, right-censored residuals from the fitted log-normal survival model.
Residuals are stratified by important variables in the model (by quartiles of continuous variables), plus a random variable to depict
the natural variability (in the lower right plot). Theoretical standard Gaussian distributions of residuals are shown with a thick solid
line. The upper left plot is with respect to disease group.
The fit for dzgroup is not great but overall fit is good.
Remove from consideration predictors that are missing in > 0.2
of the patients. Many of these were only collected for the
second phase of SUPPORT.
Of those variables to be included in the model, find which ones
have enough potential predictive power to justify allowing for
nonlinear relationships or multiple categories, which spend more
d.f. For each variable compute Spearman 2 based on multiple
10-9
linear regression of rank(x), rank(x)2 and the survival time,

truncating survival time at the shortest follow-up for survivors
(356 days). This rids the data of censoring but creates many
ties at 356 days.
shortest.follow.up min ( d.time [ death ==0] , na.rm = TRUE )
d.timet pmin ( d.time , shortest.follow.up )
w spearman2 ( d.timet age + num.co + scoma + meanbp +
hrt + resp + temp + crea + sod + adlsc +
wblc + pafi + ph + dzgroup + race , p =2)
plot (w , main = )
# F i g u r e 10.5
scoma
meanbp
dzgroup
crea
pafi
ph
sod
hrt
adlsc
temp
wblc
num.co
age
resp
race
N df
537 2
537 2
537 2
537 2
500 2
500 2
537 2
537 2
537 2
537 2
532 2
537 2
537 2
537 2
535 4
0.00
0.02
0.04
0.06
0.08
0.10
0.12
Adjusted 2
Figure 10.5: Generalized Spearman 2 rank correlation between predictors and truncated survival time
A better approach is to use the complete information in the

failure and censoring times by computing Somers Dxy rank
correlation allowing for censoring.
w rcorrcens ( S age + num.co + scoma + meanbp + hrt + resp +
temp + crea + sod + adlsc + wblc + pafi + ph +
dzgroup + race )
plot (w , main = )
# F i g u r e 10.6
#
Compute
number
of
missing
values
per
variable
sapply ( llist ( age , num.co , scoma , meanbp , hrt , resp , temp , crea , sod , adlsc ,
wblc , pafi , ph ) , function ( x ) sum ( is.na ( x ) ) )
age num . co
0
0
wblc
pafi
5
37
scoma meanbp
0
0
ph
37
hrt
0
resp
0
temp
0
crea
0
sod
0
adlsc
0
meanbp
crea
dzgroup
scoma
pafi
ph
adlsc
age
num.co
hrt
resp
race
sod
wblc
temp
0.00
0.05
0.10
0.15
10-10
N
537
537
537
537
500
500
537
537
537
537
537
535
537
532
537
0.20
|Dxy|
Figure 10.6: Somers Dxy rank correlation between predictors and original survival time. For dzgroup or race, the correlation
coefficient is the maximum correlation from using a dummy variable to represent the most frequent or one to represent the second
most frequent category.,scap=Somers Dxy rank correlation between predictors and original survival time
Can
#
#
Can also use the Hmisc naclus and

Impute missing values with normal
also
do
naplot ( naclus ( support [ acute ,]) )

naplot functions
or modal values
to
do
this
wblc.i impute ( wblc , 9)

pafi.i impute ( pafi , 333 .3 )
ph.i
impute ( ph ,
7 .4 )
race2 race
levels ( race2 ) list ( white = white , other = levels ( race ) [ -1 ])
race2 [ is.na ( race2 ) ] white
dd datadist ( dd , wblc.i , pafi.i , ph.i , race2 )
Do a formal redundancy analysis using more than pairwise associations, and allow for non-monotonic transformations in predicting each predictor from all other predictors. This analysis
requires missing values to be imputed so as to not greatly reduce the sample size.
redun ( crea + age + sex + dzgroup + num.co + scoma + adlsc + race2 +
meanbp + hrt + resp + temp + sod + wblc.i + pafi.i + ph.i , nk =4)
Redundancy Analysis
redun ( formula = crea + age + sex + dzgroup + num . co + scoma +
adlsc + race2 + meanbp + hrt + resp + temp + sod + wblc . i +
pafi . i + ph .i , nk = 4)
n : 537
p : 16
Number of NAs :
10-11
nk : 4
0
Transformation of target variables forced to be linear

R2 cutoff : 0.9
Type : ordinary
R2 with which each variable can be predicted from all other variables :
crea
0.133
hrt
0.258
age
0.246
resp
0.131
sex dzgroup
0.132
0.451
temp
sod
0.197
0.135
num . co
0.147
wblc . i
0.093
scoma
0.418
pafi . i
0.143
adlsc
0.153
ph . i
0.171
race2
0.151
meanbp
0.178
No redundant variables
Better approach to gauging predictive potential and allocating

d.f.:
Allow all continuous variables to have a the maximum number of knots entertained, in a log-normal survival model
Must use imputation to avoid losing data
Fit a saturated main effects model
Makes full use of censored data
Had to limit to 4 knots, force scoma to be linear, and omit
ph.i to avoid singularity
k 4
f psm ( S rcs ( age , k ) + sex + dzgroup + pol ( num.co ,2) + scoma +
pol ( adlsc ,2) + race + rcs ( meanbp , k ) + rcs ( hrt , k ) + rcs ( resp , k ) +
rcs ( temp , k ) + rcs ( crea ,3) + rcs ( sod , k ) + rcs ( wblc.i , k ) +
rcs ( pafi.i , k ) , dist = lognormal )
plot ( anova ( f ) )
# F i g u r e 10.7
Figure 10.7 properly blinds the analyst to the form of effects
sex
temp
race
sod
num.co
hrt
wblc.i
adlsc
resp
scoma
pafi.i
age
meanbp
crea
dzgroup
10-12
10
20
30
df
2
Figure 10.7: Partial 2 statistics for association of each predictor with response from saturated main effects model, penalized for
d.f.
(tests of linearity).
Fit a log-normal survival model with number of parameters corresponding to nonlinear effects determined from Figure 10.7. For the most promising predictors, five knots can
be allocated, as there are fewer singularity problems once
less promising predictors are simplified.
f psm ( S rcs ( age ,5) + sex + dzgroup + num.co +
scoma + pol ( adlsc ,2) + race2 + rcs ( meanbp ,5) +
rcs ( hrt ,3) + rcs ( resp ,3) + temp +
rcs ( crea ,4) + sod + rcs ( wblc.i ,3) + rcs ( pafi.i ,4) ,
dist = lognormal ) # g a u s s i a n f o r S +
Parametric Survival Model: Log Normal Distribution

psm(formula = S ~ rcs(age, 5) + sex + dzgroup + num.co + scoma +
pol(adlsc, 2) + race2 + rcs(meanbp, 5) + rcs(hrt, 3) + rcs(resp,
3) + temp + rcs(crea, 4) + sod + rcs(wblc.i, 3) + rcs(pafi.i,
4), dist = "lognormal")
Obs
537
Events 356
2.230782
Model Likelihood
Ratio Test
LR 2
236.83
d.f.
30
2
Pr(> ) < 0.0001
(Intercept)
age
age
age
age
sex=male
dzgroup=Coma
dzgroup=MOSF w/Malig
num.co
scoma
adlsc
adlsc2
race2=other
meanbp
meanbp
meanbp
meanbp
hrt
hrt
resp
resp
temp
crea
crea
crea
sod
wblc.i
wblc.i
pafi.i
pafi.i
pafi.i
Log(scale)
a anova ( f )
Coef
-5.6883
-0.0148
-0.0412
0.1670
-0.2099
-0.0737
-2.0676
-1.4664
-0.1917
-0.0142
-0.3735
0.0442
0.2979
0.0702
-0.3080
0.8438
-0.5715
-0.0171
0.0064
0.0454
-0.0851
0.0523
-0.4585
-11.5176
21.9840
0.0044
0.0746
-0.0880
0.0169
-0.0569
0.1088
0.8024
Discrimination
Indexes
2
R
0.594
Dxy
0.485
g
0.033
gr
1.959
S.E.
Wald Z Pr(> |Z|)
3.7851
-1.50
0.1329
0.0309
-0.48
0.6322
0.1078
-0.38
0.7024
0.5594
0.30
0.7653
1.3707
-0.15
0.8783
0.2181
-0.34
0.7354
0.4062
-5.09 < 0.0001
0.3112
-4.71 < 0.0001
0.0858
-2.23
0.0255
0.0044
-3.25
0.0011
0.1520
-2.46
0.0140
0.0243
1.82
0.0691
0.2658
1.12
0.2624
0.0210
3.34
0.0008
0.2261
-1.36
0.1732
0.8556
0.99
0.3241
0.7707
-0.74
0.4584
0.0069
-2.46
0.0140
0.0063
1.02
0.3090
0.0230
1.97
0.0483
0.0291
-2.93
0.0034
0.0834
0.63
0.5308
0.6727
-0.68
0.4955
19.0027
-0.61
0.5444
31.0113
0.71
0.4784
0.0157
0.28
0.7792
0.0331
2.25
0.0242
0.0377
-2.34
0.0195
0.0055
3.07
0.0021
0.0239
-2.38
0.0173
0.0482
2.26
0.0239
0.0401
19.99 < 0.0001
10-13
10-14
10.3
Summarizing the Fitted Model

Plot the shape of the effect of each predictor on log survival
time.
All effects centered: can be placed on common scale
Wald 2 statistics, penalized for d.f., plotted in descending
order
ggplot ( Predict (f , ref.zero = TRUE ) , vnames = names ,
sepdiscrete = vertical , anova = a ) # F i g u r e 10.8
latex (a , file = , label = support-anovat )
plot ( a )
Figure
Table
10.9

plot ( summary ( f ) , log = TRUE , main = )
Figure
10.10
10.1
adlsc
age
10-15
crea
hrt
22 = 8.3
1
0
1
23 = 33.6
24 = 16
22 = 11.8
3
0
20
40
60
80
meanbp
num.co
24 = 27.6
21 = 5
2.5
5.0
7.5
50
100
pafi.i
150
resp
log(T)
23 = 15.4
22 = 11.1
0
1
2
3
30
60
90
120
150 0
scoma
sod
21 = 10.6
21 = 0.1
100 200 300 400 500
10
20
temp
30
40
wblc.i
2
1
21 = 0.4
22 = 5.5
0
1
2
3
0
25
50
75
100
130 135 140 145 150 15535
dzgroup
MOSF w/Malig
other
ARF/MOSF w/Sepsis
38
39
40 0
10
22 = 45.7 white
1
20
30
40
sex
male
37
race2
Coma
36
21 =
1.3
female
2
21=
0.1
log(T)
Figure 10.8: Effect of each predictor on log survival time. Predicted values have been centered so that predictions at predictor
reference values are zero. Pointwise 0.95 confidence bands are also shown. As all Y -axes have the same scale, it is easy to see which
predictors are strongest.
Table 10.1: Wald Statistics for S
2
15.99
0.23
0.11
45.69
4.99
10.58
8.28
3.31
1.26
27.62
10.51
11.83
1.04
11.10
8.56
0.39
33.63
21.27
0.08
5.47
5.46
15.37
6.97
60.48
261.47
age
Nonlinear
sex
dzgroup
num.co
scoma
adlsc
Nonlinear
race2
meanbp
Nonlinear
hrt
Nonlinear
resp
Nonlinear
temp
crea
Nonlinear
sod
wblc.i
Nonlinear
pafi.i
Nonlinear
TOTAL NONLINEAR
TOTAL
sod
sex
temp
race2
wblc.i
num.co
adlsc
resp
scoma
hrt
age
pafi.i
meanbp
crea
dzgroup
d.f.
4
3
1
2
1
1
2
1
1
4
3
2
1
2
1
1
3
2
1
2
1
3
2
14
30
P
0.0030
0.9722
0.7354
< 0.0001
0.0255
0.0011
0.0159
0.0691
0.2624
< 0.0001
0.0147
0.0027
0.3090
0.0039
0.0034
0.5308
< 0.0001
< 0.0001
0.7792
0.0649
0.0195
0.0015
0.0307
< 0.0001
< 0.0001
10
20
30
40
df
2
Figure 10.9: Contribution of variables in predicting survival time in log-normal model
10-16
0.10
0.50
10-17
1.00
2.00 3.50
age 74.5:47.9
num.co 2:1
scoma 37:0
adlsc 3.38:0
meanbp 111:59
hrt 126:75
resp 32:12
temp 38.5:36.4
crea 2.6:0.9
sod 142:134
wblc.i 18.2:8.1
pafi.i 323:142
sex female:male
dzgroup Coma:ARF/MOSF w/Sepsis
dzgroup MOSF w/Malig:ARF/MOSF w/Sepsis
race2 other:white
Figure 10.10: Estimated survival time ratios for default settings of predictors. For example, when age changes from its lower quartile
to the upper quartile (47.9y to 74.5y), median survival time decreases by more than half. Different shaded areas of bars indicate
different confidence levels (0.9, 0.95, 0.99).
10-18
10.4
Internal Validation of the Fitted Model Using

the Bootstrap
Validate indexes describing the fitted model.
#
First
add
data
from
the
data
to
model
fit
so
bootstrap
can
re-sample
g update (f , x = TRUE , y = TRUE )

set.seed (717)
latex ( validate (g , B =120 , dxy = TRUE ) , digits =2 , size = Ssize )
Index
Original Training
Test
Optimism
Sample Sample Sample
Dxy
0.49
0.51
0.46
0.05
R2
0.59
0.66
0.54
0.12
Intercept
0.00
0.00 0.06
0.06
Slope
1.00
1.00
0.90
0.10
D
0.48
0.55
0.42
0.13
U
0.00
0.00 0.01
0.01
Q
0.48
0.55
0.43
0.12
g
1.96
2.06
1.86
0.19
Corrected
Index
0.43
0.47
0.06
0.90
0.35
0.01
0.36
1.76
n
120
120
120
120
120
120
120
120
From Dxy and R2 there is a moderate amount of overfitting.

Slope shrinkage factor (0.90) is not troublesome
Almost unbiased estimate of future predictive discrimination
on similar patients is the corrected Dxy of 0.43.
Validate predicted 1-year survival probabilities. Use a smooth
approach that does not require binning [98] and use less precise
Kaplan-Meier estimates obtained by stratifying patients by the
predicted probability, with at least 60 patients per group.
set.seed (717)
cal calibrate (g , u =1 , B =120)
plot ( cal , subtitles = FALSE )
10-19
Fraction Surviving 1 Years
cal calibrate (g , cmethod = KM , u =1 , m =60 , B =120 , pr = FALSE )

plot ( cal , add = TRUE )
# F i g u r e 10.11
0.8
0.6
0.4
0.2
0.0
0.0
0.2
0.4
0.6
0.8
Predicted 1 Year Survival

Figure 10.11: Bootstrap validation of calibration curve. Dots represent apparent calibration accuracy; are bootstrap estimates
corrected for overfitting, based on binning predicted survival probabilities and and computing Kaplan-Meier estimates. Black curve
is the estimated observed relationship using hare and the blue curve is the overfitting-corrected hare estimate. The gray-scale line
depicts the ideal relationship.
10-20
10.5
Approximating the Full Model

The fitted log-normal model is perhaps too complex for routine
use and for routine data collection. Let us develop a simplified
model that can predict the predicted values of the full model
with high accuracy (R2 = 0.96). The simplification is done
using a fast backward stepdown against the full model predicted
values.
Z predict ( f )
# X* beta hat
a ols ( Z rcs ( age ,5) + sex + dzgroup + num.co +
scoma + pol ( adlsc ,2) + race2 +
rcs ( meanbp ,5) + rcs ( hrt ,3) + rcs ( resp ,3) +
temp + rcs ( crea ,4) + sod + rcs ( wblc.i ,3) +
rcs ( pafi.i ,4) , sigma =1)
#
sigma =1
#
#
R2 =1 .0 since we start
component variables
is
used
to
prevent
fastbw (a , aics =10000)
Deleted
sod
sex
temp
race2
wblc . i
num . co
resp
adlsc
pafi . i
scoma
hrt
age
crea
meanbp
dzgroup
Chi - Sq
0.43
0.57
2.20
6.81
29.52
30.84
54.18
52.46
66.78
78.07
83.17
68.08
314.47
403.04
441.28
d.f.
1
1
1
1
2
1
2
2
3
1
2
4
3
4
2
sigma
out
by
fast
hat
from
approximating
backward
P
Residual d . f .
0.512
0.43
1
0.451
1.00
2
0.138
3.20
3
0.009
10.01
4
0.000
39.53
6
0.000
70.36
7
0.000 124.55
9
0.000 177.00 11
0.000 243.79 14
0.000 321.86 15
0.000 405.02 17
0.000 473.10 21
0.000 787.57 24
0.000 1190.61 28
0.000 1631.89 30

None
zero
with
when
all
stepdown
P
AIC
R2
0.5117
-1.57 1.000
0.6073
-3.00 0.999
0.3621
-2.80 0.998
0.0402
2.01 0.994
0.0000
27.53 0.976
0.0000
56.36 0.957
0.0000 106.55 0.924
0.0000 155.00 0.892
0.0000 215.79 0.851
0.0000 291.86 0.803
0.0000 371.02 0.752
0.0000 431.10 0.710
0.0000 739.57 0.517
0.0000 1134.61 0.270
0.0000 1571.89 0.000
Approximate Estimates after Deleting Factors

Coef
S . E . Wald Z P
[1 ,] -0.5928 0.04315 -13.74 0
being
10-21
f.approx ols ( Z dzgroup + rcs ( meanbp ,5) + rcs ( crea ,4) + rcs ( age ,5) +
rcs ( hrt ,3) + scoma + rcs ( pafi.i ,4) + pol ( adlsc ,2) +
rcs ( resp ,3) , x = TRUE )
f.approx $ stats
n Model L . R .
537.000
1688.225
Sigma
0.370
d.f.
23.000
R2
0.957
g
1.915
Estimate variancecovariance matrix of the coefficients of

reduced model
This covariance matrix does not include the scale parameter
V
X
x
w
v
vcov (f , regcoef.only = TRUE )

cbind ( Intercept =1 , g $ x )
cbind ( Intercept =1 , f.approx $ x )
solve ( t ( x ) % * % x , t ( x ) ) % * % X
w %*% V %*% t(w)
var ( full
#
#
#
full model design

approx. model design
contrast matrix
model )
Compare variance estimates (diagonals of v) with variance estimates from a reduced model that is fitted against the actual
outcomes.
f.sub psm ( S dzgroup + rcs ( meanbp ,5) + rcs ( crea ,4) + rcs ( age ,5) +
rcs ( hrt ,3) + scoma + rcs ( pafi.i ,4) + pol ( adlsc ,2) +
rcs ( resp ,3) , dist = lognormal ) # g a u s s i a n f o r S +
r diag ( v ) / diag ( vcov ( f.sub , regcoef.only = TRUE ) )
r [ c ( which.min ( r ) , which.max ( r ) ) ]
hrt
age
0.976 0.982
f.approx $ var v
latex ( anova ( f.approx , test = Chisq , ss = FALSE ) , file = ,
label = suport.anovaa )
Equation for simplified model:

#
Typeset
mathematical
form
latex ( f.approx , file = )
of
approximate
model
10-22
Table 10.2: Wald Statistics for Z
dzgroup
meanbp
Nonlinear
crea
Nonlinear
age
Nonlinear
hrt
Nonlinear
scoma
pafi.i
Nonlinear
adlsc
Nonlinear
resp
Nonlinear
TOTAL NONLINEAR
TOTAL
2
55.94
29.87
9.84
39.04
24.37
18.12
0.34
9.87
0.40
9.85
14.01
6.66
9.71
2.87
9.65
7.13
58.08
252.32
d.f.
2
4
3
3
2
4
3
2
1
1
3
2
2
1
2
1
13
23
P
< 0.0001
< 0.0001
0.0200
< 0.0001
< 0.0001
0.0012
0.9517
0.0072
0.5289
0.0017
0.0029
0.0357
0.0078
0.0904
0.0080
0.0076
< 0.0001
< 0.0001
E(Z) = X, where
X =
2.51
1.94[Coma] 1.75[MOSF w/Malig]
+0.068meanbp 3.08105 (meanbp 41.8)3+ + 7.9105 (meanbp 61)3+
4.91105 (meanbp 73)3+ + 2.61106 (meanbp 109)3+ 1.7106 (meanbp 135)3+
0.553crea 0.229(crea 0.6)3+ + 0.45(crea 1.1)3+ 0.233(crea 1.94)3+
+0.0131(crea 7.32)3+
0.0165age 1.13105 (age 28.5)3+ + 4.05105 (age 49.5)3+
2.15105 (age 63.7)3+ 2.68105 (age 72.7)3+ + 1.9105 (age 85.6)3+
0.0136hrt + 6.09107 (hrt 60)3+ 1.68106 (hrt 111)3+ + 1.07106 (hrt 140)3+
0.0135 scoma
+0.0161pafi.i 4.77107 (pafi.i 88)3+ + 9.11107 (pafi.i 167)3+
5.02107 (pafi.i 276)3+ + 6.76108 (pafi.i 426)3+ 0.369 adlsc + 0.0409 adlsc2
+0.0394resp 9.11105 (resp 10)3+ + 0.000176(resp 24)3+ 8.5105 (resp 39)3+
and [c] = 1 if subject is in group c, 0 otherwise; (x)+ = x if x > 0, 0 otherwise.
Nomogram for predicting median and mean survival time, based
on approximate model:
#
#
#
Derive S functions that express mean

of s u r v i v a l time for s p e c i f i c linear
analytically
and quantiles
predictors
expected.surv Mean ( f )
quantile.surv Quantile ( f )
latex ( expected.surv , file = , type = Sinput )
expected.surv function ( lp = NULL , parms = 0 .802352037606488 )
{
names ( parms ) NULL
exp ( lp + exp (2 * parms ) / 2)
}
latex ( quantile.surv , file = , type = Sinput )
quantile.surv function ( q = 0 .5 , lp = NULL , parms = 0 .802352037606488 )
{
names ( parms ) NULL
f function ( lp , q , parms ) lp + exp ( parms ) * qnorm ( q )
names ( q ) format ( q )
drop ( exp ( outer ( lp , q , FUN = f , parms = parms ) ) )
}
median.surv
#
Improve
function ( x ) quantile.surv ( lp = x )
variable
labels
for
the
nomogram
f.approx Newlabels ( f.approx , c ( Disease Group , Mean Arterial BP ,

Creatinine , Age , Heart Rate , SUPPORT Coma Score ,
PaO2 / ( .01 * FiO2 ) , ADL , Resp. Rate ) )
nom
nomogram ( f.approx ,
pafi.i = c (0 , 50 , 100 , 200 , 300 , 500 , 600 , 700 , 800 , 900) ,
fun = list ( Median Survival Time = median.surv ,
Mean Survival Time = expected.surv ) ,
fun.at = c ( .1 , .25 , .5 ,1 ,2 ,5 ,10 ,20 ,40) )
plot ( nom , cex.var =1 , cex.axis = .75 , lmgp = .25 )
# F i g u r e 10.12
S Packages and Functions Used

Packages
Purpose
Functions
Hmisc
Miscellaneous functions describe,ecdf,naclus,
rms
Modeling
Model presentation
Model validation
Note: All packages are available from CRAN
varclus,llist,spearman2
describe,impute,latex
datadist,psm,rcs,ols,fastbw
survplot,Newlabels,Function,
Mean,Quantile,nomogram
validate,calibrate
10-23
10
20
30
40
50
60
70
10-24
80
90
100
Points
MOSF w/Malig
Disease Group
Coma
ARF/MOSF w/Sepsis
Mean Arterial BP
0
6
20
10 11
40
60
80 120
12
Creatinine
5
Age
100
70
60
50
30
10
Heart Rate
300
SUPPORT Coma
Score
200
100
100
50
70 50 30 10
300
PaO2/(.01*FiO2)
0
5
50
100
200 500
700
900
ADL
4.5 2 1
0
65
60
55
50
45
40
35
30
Resp. Rate
0 5
15
Total Points
0
50
100
150
200
250
300
350
400
450
Linear Predictor
7
Median Survival Time

0.10.25
0.5 1 2 5 102040
Mean Survival Time

0.10.25
0.5 1 2 5 102040
Figure 10.12: Nomogram for predicting median and mean survival time, based on approximation of full model
Annotated Bibliography
[1] Paul D. Allison. Missing Data. Sage University Papers Series on Quantitative Applications in the Social Sciences,
07-136. Thousand Oaks CA: Sage, 2001 (cit. on pp. 3-1, 3-5).
[2] D. G. Altman.Categorising continuous covariates (letter to the editor). In: Brit J Cancer 64 (1991), p. 975 (cit. on
p. 2-13).
[3] D. G. Altman and P. K. Andersen. Bootstrap investigation of the stability of a Cox regression model. In: Stat
Med 8 (1989), pp. 771783 (cit. on p. 4-15).
[4] D. G. Altman et al. Dangers of using optimal cutpoints in the evaluation of prognostic factors. In: J Nat Cancer
Inst 86 (1994), pp. 829835 (cit. on pp. 2-13, 2-15).
[5] Douglas G. Altman. Suboptimal analysis using optimal cutpoints. In: Brit J Cancer 78 (1998), pp. 556557
(cit. on p. 2-13).
[6] A. C. Atkinson. A note on the generalized information criterion for choice of a model. In: Biometrika 67 (1980),
pp. 413418 (cit. on pp. 2-26, 4-14).
[7] Peter C. Austin. Bootstrap model selection had similar performance for selecting authentic and noise variables
compared to backward variable elimination: a simulation study. In: J Clin Epi 61 (2008), pp. 10091017 (cit. on
p. 4-15).
in general, a bootstrap model selection method had comparable performance to conventional backward variable elimination for identifying the true regression model. In most settings, both methods performed poorly at correctly identifying the correct regression
model.
.
[8] Peter C. Austin, Jack V. Tu, and Douglas S. Lee. Logistic regression had superior performance compared with
regression trees for predicting in-hospital mortality in patients hospitalized with heart failure. In: J Clin Epi 63
(2010), pp. 11451155 (cit. on p. 2-32).
ROC areas for logistic models varied from 0.747 to 0.775 whereas they varied from 0.620-0.651 for recursive partitioning;repeated data
simulation showed large variation in tree structure
.
[9] Sunni A. Barnes, Stacy R. Lindborg, and John W. Seaman.Multiple imputation techniques in small sample clinical
trials. In: Stat Med 25 (2006), pp. 233245 (cit. on p. 3-15).
bad performance of LOCF including high bias and poor confidence interval coverage;simulation setup;longitudinal data;serial data;RCT;dropout;assumed
missing at random (MAR);approximate Bayesian bootstrap;Bayesian least squares;missing data;nice background summary;new completion score method based on fitting a Poisson model for the number of completed clinic visits and using donors and approximate
Bayesian bootstrap
.
[10] Federica Barzi and Mark Woodward. Imputations of missing values in practice: Results from imputations of serum
cholesterol in 28 cohort studies. In: Am J Epi 160 (2004), pp. 3445 (cit. on pp. 3-7, 3-14).
excellent review article for multiple imputation;list of variables to include in imputation model;Imputation models should ideally include
all covariates that are related to the missing data mechanism, have distributions that differ between the respondents and nonrespondents,
are associated with cholesterol, and will be included in the analyses of the final complete data sets;detailed comparison of results
(cholesterol effect and confidence limits) for various imputation methods
.
[11] Heiko Belcher. The concept of residual confounding in regression models and some applications. In: Stat Med 11
(1992), pp. 17471758 (cit. on p. 2-13).
[12] D. A. Belsley, E. Kuh, and R. E. Welsch. Regression Diagnostics: Identifying Influential Data and Sources of
Collinearity. New York: Wiley, 1980 (cit. on p. 4-37).
11-1
ANNOTATED BIBLIOGRAPHY
11-2
[13] David A. Belsley. Conditioning Diagnostics: Collinearity and Weak Data in Regression. New York: Wiley, 1991
(cit. on p. 4-22).
[14] Jacqueline K. Benedetti et al. Effective sample size for tests of censored survival data. In: Biometrika 69 (1982),
pp. 343349 (cit. on p. 4-17).
[15] Kiros Berhane, Michael Hauptmann, and Bryan Langholz. Using tensor product splines in modeling exposure
timeresponse relationships: Application to the Colorado Plateau Uranium Miners cohort. In: Stat Med 27 (2008),
pp. 54845496 (cit. on p. 2-44).
discusses taking product of all univariate spline basis functions
.
[16] Maria Blettner and Willi Sauerbrei. Influence of model-building strategies on the results of a case-control study.
In: Stat Med 12 (1993), pp. 13251338 (cit. on p. 5-22).
[17] James G. Booth and Somnath Sarkar. Monte Carlo approximation of bootstrap variances. In: Am Statistician 52
(1998), pp. 354357 (cit. on p. 5-10).
number of resamples required to estimate variances, quantiles; 800 resamples may be required to guarantee with 0.95 confidence that
the relative error of a variance estimate is 0.1;Efrons original suggestions for as low as 25 resamples were based on comparing stability
of bootstrap estimates to sampling error, but small relative effects can significantly change P -values;number of bootstrap resamples
.
[18] Robert Bordley. Statistical decisionmaking without math. In: Chance 20.3 (2007), pp. 3944 (cit. on p. 1-7).
[19] L. Breiman and J. H. Friedman. Estimating optimal transformations for multiple regression and correlation (with
discussion). In: J Am Stat Assoc 80 (1985), pp. 580619 (cit. on p. 4-30).
[20] Leo Breiman. The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction
error. In: J Am Stat Assoc 87 (1992), pp. 738754 (cit. on pp. 4-14, 4-15, 5-16).
[21] Leo Breiman et al. Classification and Regression Trees. Pacific Grove, CA: Wadsworth and Brooks/Cole, 1984
(cit. on p. 2-31).
[22] William M. Briggs and Russell Zaretzki. The skill plot: A graphical technique for evaluating continuous diagnostic
tests (with discussion). In: Biometrics 64 (2008), pp. 250261 (cit. on p. 1-7).
statistics such as the AUC are not especially relevant to someone who must make a decision about a particular x c. ... ROC curves
lack or obscure several quantities that are necessary for evaluating the operational effectiveness of diagnostic tests. ... ROC curves were
first used to check how radio receivers (like radar receivers) operated over a range of frequencies. ... This is not how most ROC curves
are used now, particularly in medicine. The receiver of a diagnostic measurement ... wants to make a decision based on some x c, and is
not especially interested in how well he would have done had he used some different cutoff.; in the discussion David Hand states when
integrating to yield the overall AUC measure, it is necessary to decide what weight to give each value in the integration. The AUC
implicitly does this using a weighting derived empirically from the data. This is nonsensical. The relative importance of misclassifying
a case as a noncase, compared to the reverse, cannot come from the data itself. It must come externally, from considerations of the
severity one attaches to the different kinds of misclassifications.; see Lin, Kvam, Lu Stat in Med 28:798-813;2009
.
[23] David Brownstone. Regression strategies. In: Proceedings of the 20th Symposium on the Interface between Computer Science and Statistics. Washington, DC: American Statistical Association, 1988, pp. 7479 (cit. on p. 5-22).
[24] Petra Buettner, Claus Garbe, and Irene Guggenmoos-Holzmann. Problems in defining cutoff points of continuous
prognostic factors: Example of tumor thickness in primary cutaneous melanoma. In: J Clin Epi 50 (1997), pp. 1201
1210 (cit. on p. 2-13).
choice of cut point depends on marginal distribution of predictor
.
[25] Stef Buuren. Flexible imputation of missing data. Boca Raton, FL: Chapman & Hall/CRC, 2012 (cit. on pp. 3-1,
3-14, 3-19).
[26] Centers for Disease Control and Prevention CDC. National Center for Health Statistics NCHS. National Health and
Nutrition Examination Survey. Hyattsville, MD, 2010. url: http://www.cdc.gov/nchs/nhanes/nhanes20092010/nhanes09_10.htm (cit. on p. 9-4).
[27] John M. Chambers and Trevor J. Hastie, eds. Statistical Models in S. Pacific Grove, CA: Wadsworth and Brooks/Cole, 1992 (cit. on p. 2-45).
[28] C. Chatfield. Avoiding statistical pitfalls (with discussion). In: Statistical Sci 6 (1991), pp. 240268 (cit. on
p. 4-38).
[29] C. Chatfield. Model uncertainty, data mining and statistical inference (with discussion). In: J Roy Stat Soc A 158
(1995), pp. 419466 (cit. on pp. 4-10, 5-22).
11-3
bias by selecting model because it fits the data well; bias in standard errors;P. 420: ... need for a better balance in the literature and in
statistical teaching between techniques and problem solving strategies. P. 421: It is well known to be logically unsound and practically
misleading (Zhang, 1992) to make inferences as if a model is known to be true when it has, in fact, been selected from the same data
to be used for estimation purposes. However, although statisticians may admit this privately (Breiman (1992) calls it a quiet scandal),
they (we) continue to ignore the difficulties because it is not clear what else could or should be done. P. 421: Estimation errors for
regression coefficients are usually smaller than errors from failing to take into account model specification. P. 422: Statisticians must
stop pretending that model uncertainty does not exist and begin to find ways of coping with it. P. 426: It is indeed strange that we
often admit model uncertainty by searching for a best model but then ignore this uncertainty by making inferences and predictions as
if certain that the best fitting model is actually true. P. 427: The analyst needs to assess the model selection process and not just the
best fitting model. P. 432: The use of subset selection methods is well known to introduce alarming biases. P. 433: ... the AIC can
be highly biased in data-driven model selection situations. P. 434: Prediction intervals will generally be too narrow. In the discussion,
Jamal R. M. Ameen states that a model should be (a) satisfactory in performance relative to the stated objective, (b) logically sound,
(c) representative, (d) questionable and subject to on-line interrogation, (e) able to accommodate external or expert information and
(f) able to convey information.
.
[30] Samprit Chatterjee and Ali S. Hadi. Regression Analysis by Example. Fifth. New York: Wiley, 2012. isbn: 0470905840
(cit. on p. 4-21).
[31] Marie Chavent et al. ClustOfVar: An R package for the clustering of variables. In: J Stat Software 50.13 (Sept.
2012), pp. 116 (cit. on p. 4-26).
[32] A. Ciampi et al. Stratification by stepwise regression, correspondence analysis and recursive partition. In: Comp
Stat Data Analysis 1986 (1986), pp. 185204 (cit. on p. 4-26).
[33] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. In: J Am Stat Assoc 74 (1979),
pp. 829836 (cit. on p. 2-28).
[34] E. Francis Cook and Lee Goldman. Asymmetric stratification: An outline for an efficient method for controlling
confounding in cohort studies. In: Am J Epi 127 (1988), pp. 626639 (cit. on p. 2-32).
[35] J. B. Copas. Cross-validation shrinkage of regression predictors. In: J Roy Stat Soc B 49 (1987), pp. 175183
(cit. on p. 5-20).
[36] J. B. Copas.Regression, prediction and shrinkage (with discussion). In: J Roy Stat Soc B 45 (1983), pp. 311354
(cit. on pp. 4-19, 4-20).
[37] David R. Cox. Regression models and life-tables (with discussion). In: J Roy Stat Soc B 34 (1972), pp. 187220
(cit. on p. 2-48).
[38] Sybil L. Crawford, Sharon L. Tennstedt, and John B. McKinlay.A comparison of analytic methods for non-random
missingness of outcome data. In: J Clin Epi 48 (1995), pp. 209219 (cit. on pp. 3-4, 4-44).
[39] N. J. Crichton and J. P. Hinde.Correspondence analysis as a screening method for indicants for clinical diagnosis.
[40] Ralph B. DAgostino et al. Development of health risk appraisal functions in the presence of multiple indicators:
The Framingham Study nursing home institutionalization model. In: Stat Med 14 (1995), pp. 17571770 (cit. on
pp. 4-22, 4-25).
[41] C. E. Davis et al. An example of dependencies among variables in a conditional logistic regression. In: Modern
Statistical Methods in Chronic Disease Epi. Ed. by S. H. Moolgavkar and R. L. Prentice. New York: Wiley, 1986,
pp. 140147 (cit. on p. 4-22).
[42] Charles S. Davis. Statistical Methods for the Analysis of Repeated Measurements. New York: Springer, 2002 (cit. on
p. 7-16).
[43] S. Derksen and H. J. Keselman.Backward, forward and stepwise automated subset selection algorithms: Frequency
of obtaining authentic and noise variables. In: British J Math Stat Psych 45 (1992), pp. 265282 (cit. on p. 4-10).
[44] T. F. Devlin and B. J. Weeks. Spline functions for logistic regression modeling. In: Proceedings of the Eleventh
Annual SAS Users Group International Conference. Cary, NC: SAS Institute, Inc., 1986, pp. 646651 (cit. on
p. 2-21).
[45] Peter J. Diggle et al. Analysis of Longitudinal Data. second. Oxford UK: Oxford University Press, 2002 (cit. on
p. 7-10).
[46] Donders et al. Review: A gentle introduction to imputation of missing values. In: J Clin Epi 59 (2006), pp. 1087
1091 (cit. on pp. 3-1, 3-5).
simple demonstration of failure of the add new category method (indicator variable)
.
[47] William D. Dupont. Statistical Modeling for Biomedical Researchers. second. Cambridge, UK: Cambridge University
Press, 2008 (cit. on p. 11-13).
11-4
[48] S. Durrleman and R. Simon. Flexible regression models with cubic splines. In: Stat Med 8 (1989), pp. 551561
(cit. on p. 2-25).
[49] B. Efron. Estimating the error rate of a prediction rule: Improvement on cross-validation. In: J Am Stat Assoc 78
(1983), pp. 316331 (cit. on pp. 5-17, 5-20).
suggested need at least 200 models to get an average that is adequate, i.e., 20 repeats of 10-fold cv
.
[50] Bradley Efron and Robert Tibshirani. An Introduction to the Bootstrap. New York: Chapman and Hall, 1993 (cit. on
p. 5-20).
[51] Bradley Efron and Robert Tibshirani. Improvements on cross-validation: The .632+ bootstrap method. In: J Am
Stat Assoc 92 (1997), pp. 548560 (cit. on p. 5-20).
[52] Juanjuan Fan and Richard A. Levine. To amnio or not to amnio: That is the decision for Bayes. In: Chance 20.3
(2007), pp. 2632 (cit. on p. 1-7).
[53] David Faraggi and Richard Simon. A simulation study of cross-validation for selecting an optimal cutpoint in
univariate survival analysis. In: Stat Med 15 (1996), pp. 22032213 (cit. on p. 2-13).
bias in point estimate of effect from selecting cutpoints based on P -value; loss of information from dichotomizing continuous predictors
.
[54] J. J. Faraway. The cost of data analysis. In: J Comp Graph Stat 1 (1992), pp. 213229 (cit. on pp. 4-46, 5-19,
5-22).
[55] Valerii Fedorov, Frank Mannino, and Rongmei Zhang.Consequences of dichotomization. In: Pharm Stat 8 (2009),
pp. 5061. doi: 10.1002/pst.331. url: http://dx.doi.org/10.1002/pst.331 (cit. on pp. 1-6, 2-13).
optimal cutpoint depends on unknown parameters;should only entertain dichotomization when estimating a value of the cumulative
distribution and when the assumed model is very different from the true model;nice graphics
.
[56] D. Freedman, W. Navidi, and S. Peters. On the Impact of Variable Selection in Fitting Regression Equations.
In: Lecture Notes in Economics and Mathematical Systems. New York: Springer-Verlag, 1988, pp. 116 (cit. on
p. 5-20).
[57] J. H. Friedman. A variable span smoother. Technical Report 5. Laboratory for Computational Statistics, Department
of Statistics, Stanford University, 1984 (cit. on p. 4-30).
[58] Mitchell H. Gail and Ruth M. Pfeiffer. On criteria for evaluating models of absolute risk. In: Biostatistics 6.2
(2005), pp. 227239 (cit. on p. 1-7).
[59] Joseph C. Gardiner, Zhehui Luo, and Lee A. Roman. Fixed effects, random effects and GEE: What are the
differences? In: Stat Med 28 (2009), pp. 221239 (cit. on p. 7-9).
nice comparison of models; econometrics; different use of the term fixed effects model
.
[60] A. Giannoni et al. Do optimal prognostic thresholds in continuous physiological variables really exist? Analysis of
origin of apparent thresholds, with systematic review for peak oxygen consumption, ejection fraction and BNP. In:
PLoS ONE 9.1 (2014). doi: 10.1371/journal.pone.0081699. url: http://dx.doi.org/10.1371/journal.
pone.0081699 (cit. on pp. 2-13, 2-15).
[61] John H. Giudice, John R. Fieberg, and Mark S. Lenarz. Spending degrees of freedom in a poor economy: A case
study of building a sightability model for moose in northeastern minnesota. In: J Wildlife Manage (2011). doi:
10.1002/jwmg.213. url: http://dx.doi.org/10.1002/jwmg.213 (cit. on p. 4-1).
[62] Tilmann Gneiting and Adrian E. Raftery. Strictly proper scoring rules, prediction, and estimation. In: J Am Stat
Assoc 102 (2007), pp. 359378 (cit. on p. 1-7).
wonderful review article except missing references from Scandanavian and German medical decision making literature
.
[63] Harvey Goldstein. Restricted unbiased iterative generalized least-squares estimation. In: Biometrika 76.3 (1989),
pp. 622623 (cit. on pp. 7-6, 7-10).
derivation of REML
.
[64] Usha S. Govindarajulu et al. Comparing smoothing techniques in Cox models for exposure-response relationships.
11-5
authors wrote a SAS macro for restricted cubic splines even though such a macro as existed since 1984; would have gotten more useful
results had simulation been used so would know the true regression shape;measure of agreement of two estimated curves by computing
the area between them, standardized by average of areas under the two;penalized spline and rcs were closer to each other than to
fractional polynomials
.
[65] P. M. Grambsch and P. C. OBrien. The effects of transformations and preliminary tests for non-linearity in
regression. In: Stat Med 10 (1991), pp. 697709 (cit. on pp. 2-36, 4-10).
[66] Robert J. Gray. Flexible methods for analyzing survival data using splines, with applications to breast cancer
prognosis. In: J Am Stat Assoc 87 (1992), pp. 942951 (cit. on pp. 2-44, 4-19).
[67] Robert J. Gray. Spline-based tests in survival analysis. In: Biometrics 50 (1994), pp. 640652 (cit. on p. 2-44).
[68] Michael J. Greenacre. Correspondence analysis of multivariate categorical data by weighted least-squares. In:
Biometrika 75 (1988), pp. 457467 (cit. on p. 4-26).
[69] Sander Greenland. When should epidemiologic regressions use random coefficients? In: Biometrics 56 (2000),
pp. 915921. doi: 10.1111/j.0006- 341X.2000.00915.x. url: http://dx.doi.org/10.1111/j.0006341X.2000.00915.x (cit. on pp. 4-10, 4-41).
use of statistics in epidemiology is largely primitive;stepwise variable selection on confounders leaves important confounders uncontrolled;composition matrix;example with far too many significant predictors with many regression coefficients absurdly inflated when
overfit;lack of evidence for dietary effects mediated through constituents;shrinkage instead of variable selection;larger effect on confidence interval width than on point estimates with variable selection;uncertainty about variance of random effects is just uncertainty
about prior opinion;estimation of variance is pointless;instead the analysis should be repeated using different values;if one feels compelled to estimate 2 , I would recommend giving it a proper prior concentrated amount contextually reasonable values;claim about
ordinary MLE being unbiased is misleading because it assumes the model is correct and is the only model entertained;shrinkage towards
compositional model;models need to be complex to capture uncertainty about the relations...an honest uncertainty assessment requires
parameters for all effects that we know may be present. This advice is implicit in an antiparsimony principle often attributed to L. J.
Savage All models should be as big as an elephant (see Draper, 1995). See also gus06per.
.
[70] Jian Guo et al. Principal component analysis with sparse fused loadings. In: J Comp Graph Stat 19.4 (2011),
pp. 930946 (cit. on p. 4-26).
incorporates blocking structure in the variables;selects different variables for different components;encourages loadings of highly correlated variables to have same magnitude, which aids in interpretation
.
[71] D. Hand and M. Crowder. Practical Longitudinal Data Analysis. London: Chapman & Hall, 1996.
[72] Ofer Harel and Xiao-Hua Zhou. Multiple imputation: Review of theory, implementation and software. In: Stat
Med 26 (2007), pp. 30573077 (cit. on pp. 3-1, 3-7, 3-12, 3-15).
failed to review aregImpute;excellent overview;ugly S code;nice description of different statistical tests including combining likelihood
ratio tests (which appears to be complex, requiring an out-of-sample log likelihood computation);congeniality of imputation and analysis
models;Bayesian approximation or approximate Bayesian bootstrap overview;Although missing at random (MAR) is a non-testable
assumption, it has been pointed out in the literature that we can get very close to MAR if we include enough variables in the imputation
models ... it would be preferred if the missing data modelling was done by the data constructors and not by the users... MI yields valid
inferences not only in congenial settings, but also in certain uncongenial ones as wellwhere the imputers model (1) is more general
(i.e. makes fewer assumptions) than the complete-data estimation method, or when the imputers model makes additional assumptions
that are well-founded.
.
[73] F. E. Harrell. The LOGIST Procedure. In: SUGI Supplemental Library Users Guide. Version 5. Cary, NC: SAS
Institute, Inc., 1986, pp. 269293 (cit. on p. 4-13).
[74] F. E. Harrell, K. L. Lee, and B. G. Pollock.Regression models in clinical studies: Determining relationships between
predictors and response. In: J Nat Cancer Inst 80 (1988), pp. 11981202 (cit. on p. 2-29).
[75] F. E. Harrell et al. Regression modeling strategies for improved prognostic prediction. In: Stat Med 3 (1984),
pp. 143152 (cit. on p. 4-16).
[76] F. E. Harrell et al. Regression models for prognostic prediction: Advantages, problems, and suggested solutions.
In: Ca Trt Rep 69 (1985), pp. 10711077 (cit. on p. 4-16).
[77] Frank E. Harrell, Kerry L. Lee, and Daniel B. Mark. Multivariable prognostic models: Issues in developing models,
evaluating assumptions and adequacy, and measuring and reducing errors. In: Stat Med 15 (1996), pp. 361387
(cit. on p. 4-1).
[78] Frank E. Harrell et al. Development of a clinical prediction model for an ordinal outcome: The World Health
Organization ARI Multicentre Study of clinical signs and etiologic agents of pneumonia, sepsis, and meningitis in
young infants. In: Stat Med 17 (1998), pp. 909944. url: http://onlinelibrary.wiley.com/doi/10.1002/
(SICI)1097-0258(19980430)17:8%3C909::AID-SIM753%3E3.0.CO;2-O/abstract (cit. on pp. 4-19, 4-44).
11-6
[79] Trevor J. Hastie and Robert J. Tibshirani. Generalized Additive Models. ISBN 9780412343902. Boca Raton, FL:
Chapman & Hall/CRC, 1990 (cit. on p. 2-35).
[80] Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. second. ISBN-10:
0387848576; ISBN-13: 978-0387848570. New York: Springer, 2008 (cit. on p. 2-34).
[81] Yulei He and Alan M. Zaslavsky. Diagnosing imputation models by applying target analyses to posterior replicates
of completed data. In: Stat Med 31.1 (2012), pp. 118. doi: 10.1002/sim.4413. url: http://dx.doi.org/10.
1002/sim.4413 (cit. on p. 3-17).
[82] S. G. Hilsenbeck and G. M. Clark. Practical p-value adjustment for optimally selected cutpoints. In: Stat Med 15
(1996), pp. 103112 (cit. on p. 2-13).
[83] W. Hoeffding.A non-parametric test of independence. In: Ann Math Stat 19 (1948), pp. 546557 (cit. on p. 4-26).
[84] Norbert Hollander, Willi Sauerbrei, and Martin Schumacher. Confidence intervals for the effect of a prognostic
factor after selection of an optimal cutpoint. In: Stat Med 23 (2004), pp. 17011713. doi: 10.1002/sim.1611.
url: http://dx.doi.org/10.1002/sim.1611 (cit. on pp. 2-13, 2-15).
true type I error can be much greater than nominal level;one example where nominal is 0.05 and true is 0.5;minimum P-value
method;CART;recursive partitioning;bootstrap method for correcting confidence interval;based on heuristic shrinkage coefficient;It
should be noted, however, that the optimal cutpoint approach has disadvantages. One of these is that in almost every study where
this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible.
Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the
literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the
optimal cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival
in node-negative breast cancer patients, 12 studies were in included with 12 different cutpoints ... Interestingly, neither cathepsin-D nor
the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society
of Clinical Oncology.; dichotomization; categorizing continuous variables; refs alt94dan, sch94out, alt98sub
.
[85] Nicholas J. Horton and Ken P. Kleinman. Much ado about nothing: A comparison of missing data methods and
software to fit incomplete data regression models. In: Am Statistician 61.1 (2007), pp. 7990 (cit. on p. 3-15).
[86] C. M. Hurvich and C. L. Tsai. The impact of model selection on inference in linear regression. In: Am Statistician
44 (1990), pp. 214217 (cit. on p. 4-15).
[87] Lisa I. Iezzoni. Dimensions of Risk. In: Risk Adjustment for Measuring Health Outcomes. Ed. by Lisa I. Iezzoni.
Ann Arbor, MI: Foundation of the American College of Healthcare Executives, 1994. Chap. 2, pp. 29118 (cit. on
p. 1-12).
dimensions of risk factors to include in models
.
[88] K. J. Janssen et al. Missing covariate data in medical research: To impute is better than to ignore. In: J Clin Epi
63 (2010), pp. 721727 (cit. on p. 3-20).
[89] Michael P. Jones.Indicator and stratification methods for missing explanatory variables in multiple linear regression.
In: J Am Stat Assoc 91 (1996), pp. 222230 (cit. on p. 3-5).
[90] J. D. Kalbfleisch and R. L. Prentice. Marginal likelihood based on Coxs regression and life model. In: Biometrika
60 (1973), pp. 267278 (cit. on p. 9-25).
[91] Juha Karvanen and Frank E. Harrell.Visualizing covariates in proportional hazards model. In: Stat Med 28 (2009),
pp. 19571966 (cit. on p. 5-2).
[92] Michael G. Kenward, Ian R. White, and James R. Carpener. Should baseline be a covariate or dependent variable
in analyses of change from baseline in clinical trials? (letter to the editor). In: Stat Med 29 (2010), pp. 14551456
(cit. on p. 7-5).
sharp rebuke of liu09sho
.
[93] W. A. Knaus et al. The SUPPORT prognostic model: Objective estimates of survival for seriously ill hospitalized
adults. In: Ann Int Med 122 (1995), pp. 191203. doi: 10.7326/0003- 4819- 122- 3- 199502010- 00007. url:
http://dx.doi.org/10.7326/0003-4819-122-3-199502010-00007 (cit. on pp. 4-31, 10-1).
[94] Mirjam J. Knol et al. Unpredictable bias when using the missing indicator method or complete case analysis for
missing confounder values: an empirical example. In: J Clin Epi 63 (2010), pp. 728736 (cit. on p. 3-5).
[95] R. Koenker and G. Bassett. Regression quantiles. In: Econometrica 46 (1978), pp. 3350 (cit. on p. 9-11).
[96] Roger Koenker. Quantile Regression. ISBN-10: 0-521-60827-9; ISBN-13: 978-0-521-60827-5. New York: Cambridge
University Press, 2005 (cit. on p. 9-11).
11-7
[97] Roger Koenker. quantreg: Quantile Regression. R package version 4.38. 2009. url: http://CRAN.R-project.
org/package=quantreg (cit. on p. 9-11).
[98] Charles Kooperberg, Charles J. Stone, and Young K. Truong. Hazard regression. In: J Am Stat Assoc 90 (1995),
pp. 7894 (cit. on p. 10-18).
[99] Warren F. Kuhfeld. The PRINQUAL Procedure. In: SAS/STAT 9.2 Users Guide. Second. Cary, NC: SAS Publishing, 2009. url: http://support.sas.com/documentation/onlinedoc/stat (cit. on p. 4-27).
[100] B. Lausen and M. Schumacher. Evaluating the effect of optimized cutoff values in the assessment of prognostic
factors. In: Comp Stat Data Analysis 21.3 (1996), pp. 307326. doi: 10.1016/0167- 9473(95)00016- X. url:
http://dx.doi.org/10.1016/0167-9473(95)00016-X (cit. on p. 2-13).
[101] J. F. Lawless and K. Singhal. Efficient screening of nonnormal regression models. In: Biometrics 34 (1978),
pp. 318327 (cit. on p. 4-14).
[102] S. le Cessie and J. C. van Houwelingen. Ridge estimators in logistic regression. In: Appl Stat 41 (1992), pp. 191
201 (cit. on p. 4-19).
[103] A. Leclerc et al. Correspondence analysis and logistic modelling: Complementary use in the analysis of a health
survey among nurses. In: Stat Med 7 (1988), pp. 983995 (cit. on p. 4-26).
[104] Katherine J. Lee and John B. Carlin. Recovery of information from multiple imputation: a simulation study.
In: Emerging Themes in Epi 9.1 (June 2012), pp. 3+. issn: 1742-7622. doi: 10 . 1186 / 1742 - 7622 - 9 - 3. url:
http://dx.doi.org/10.1186/1742-7622-9-3 (cit. on p. 3-4).
Not sure that the authors satisfactorily dealt with nonlinear predictor effects in the absence of strong auxiliary information, there is little
to gain from multiple imputation with missing data in the exposure-of-interest. In fact, the authors went further to say that multiple
imputation can introduce bias not present in a complete case analysis if a poorly fitting imputation model is used [from Yong Hao Pua]
.
[105] Seokho Lee, Jianhua Z. Huang, and Jianhua Hu.Sparse logistic principal components analysis for binary data. In:
Ann Appl Stat 4.3 (2010), pp. 15791601 (cit. on p. 2-34).
[106] Chenlei Leng and Hansheng Wang. On general adaptive sparse principal component analysis. In: J Comp Graph
Stat 18.1 (2009), pp. 201215 (cit. on p. 2-34).
[107] Kung-Yee Liang and Scott L. Zeger. Longitudinal data analysis of continuous and discrete responses for pre-post
designs. In: Sankhya 62 (2000), pp. 134148 (cit. on p. 7-5).
makes an error in assuming the baseline variable will have the same univariate distribution as the response except for a shift;baseline
may have for example a truncated distribution based on a trials inclusion criteria;if correlation between baseline and response is zero,
ANCOVA will be twice as efficient as simple analysis of change scores;if correlation is one they may be equally efficient
.
[108] James K. Lindsey. Models for Repeated Measurements. Clarendon Press, 1997.
[109] Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. second. New York: Wiley, 2002
(cit. on pp. 3-4, 3-9, 3-17).
[110] Guanghan F. Liu et al. Should baseline be a covariate or dependent variable in analyses of change from baseline
in clinical trials? In: Stat Med 28 (2009), pp. 25092530 (cit. on p. 7-5).
seems to miss several important points, such as the fact that the baseline variable is often part of the inclusion/exclusion criteria and
so has a truncated distribution that is different from that of the follow-up measurements;sharp rebuke in ken10sho
.
[111] Richard Lockhart et al. A significance test for the lasso. Tech. rep. arXiv, 2013. arXiv: 1301.7161. url: http:
//arxiv.org/abs/1301.7161 (cit. on p. 4-10).
[112] Xiaohui Luo, Leonard A. Stfanski, and Dennis D. Boos. Tuning variable selection procedures by adding noise. In:
Technometrics 48 (2006), pp. 165175 (cit. on p. 1-14).
adding a known amount of noise to the response and studying 2 to tune the stopping rule to avoid overfitting or underfitting;simulation
setup
.
[113] Nathan Mantel.Why stepdown procedures in variable selection. In: Technometrics 12 (1970), pp. 621625 (cit. on
p. 4-14).
[114] S. E. Maxwell and H. D. Delaney. Bivariate median splits and spurious statistical significance. In: Psych Bull 113
(1993), pp. 181190. doi: 10.1037//0033- 2909.113.1.181. url: http://dx.doi.org/10.1037//00332909.113.1.181 (cit. on p. 2-13).
[115] George P. McCabe. Principal variables. In: Technometrics 26 (1984), pp. 137144 (cit. on p. 4-25).
11-8
[116] George Michailidis and Jan de Leeuw. The Gifi system of descriptive multivariate analysis. In: Statistical Sci 13
(1998), pp. 307336 (cit. on p. 4-26).
[117] Karel G. M. Moons et al. Using the outcome for imputation of missing predictor values was preferred. In: J Clin
Epi 59 (2006), pp. 10921101. doi: 10.1016/j.jclinepi.2006.01.009. url: http://dx.doi.org/10.1016/
j.jclinepi.2006.01.009 (cit. on p. 3-13).
use of outcome variable; excellent graphical summaries of simulations
.
[118] Barry K. Moser and Laura P. Coombs. Odds ratios for a continuous outcome variable without dichotomizing. In:
Stat Med 23 (2004), pp. 18431860 (cit. on p. 2-13).
large loss of efficiency and power;embeds in a logistic distribution, similar to proportional odds model;categorization;dichotomization of
a continuous response in order to obtain odds ratios often results in an inflation of the needed sample size by a factor greater than 1.5
.
[119] Raymond H. Myers. Classical and Modern Regression with Applications. Boston: PWS-Kent, 1990 (cit. on p. 4-21).
[120] N. J. D. Nagelkerke. A note on a general definition of the coefficient of determination. In: Biometrika 78 (1991),
pp. 691692 (cit. on p. 4-39).
[121] Todd G. Nick and J. Michael Hardin. Regression modeling strategies: An illustrative case study from medical
rehabilitation outcomes research. In: Am J Occ Ther 53 (1999), pp. 459470 (cit. on p. 4-1).
[122] David J. Nott and Chenlei Leng.Bayesian projection approaches to variable selection in generalized linear models.
In: Computational Statistics & Data Analysis 54.12 (Dec. 2010), pp. 32273241. issn: 01679473. doi: 10.1016/
j.csda.2010.01.036. url: http://dx.doi.org/10.1016/j.csda.2010.01.036 (cit. on p. 2-34).
[123] Debashis Paul et al. Preconditioning for feature selection and regression in high-dimensional problems. In: Ann
Stat 36.4 (2008), pp. 15951619. doi: 10.1214/009053607000000578. url: http://dx.doi.org/10.1214/
009053607000000578 (cit. on p. 2-34).
develop consistent Y using a latent variable structure, using for example supervised principal components. Then run stepwise regression
or lasso predicting Y (lasso worked better). Can run into problems when a predictor has importance in an adjusted sense but has no
marginal correlation with Y ;model approximation;model simplification
.
[124] Peter Peduzzi et al. A simulation study of the number of events per variable in logistic regression analysis. In: J
Clin Epi 49 (1996), pp. 13731379 (cit. on pp. 4-16, 4-17).
[125] Peter Peduzzi et al. Importance of events per independent variable in proportional hazards regression analysis. II.
Accuracy and precision of regression estimates. In: J Clin Epi 48 (1995), pp. 15031510 (cit. on p. 4-16).
[126] N. Peek et al. External validation of prognostic models for critically ill patients required substantial sample sizes.
In: J Clin Epi 60 (2007), pp. 491501 (cit. on p. 4-40).
large sample sizes need to obtain reliable external validations;inadequate power of DeLong, DeLong, and Clarke-Pearson test for
differences in correlated ROC areas (p. 498);problem with tests of calibration accuracy having too much power for large sample sizes
.
[127] Michael J. Pencina, Ralph B. DAgostino, and Olga V. Demler. Novel metrics for evaluating improvement in discrimination: net reclassification and integrated discrimination improvement for normal variables and nested models.
In: Stat Med 31.2 (2012), pp. 101113. doi: 10.1002/sim.4348. url: http://dx.doi.org/10.1002/sim.4348
(cit. on p. 4-40).
[128] Michael J. Pencina, Ralph B. DAgostino, and Ewout W. Steyerberg.Extensions of net reclassification improvement
calculations to measure usefulness of new biomarkers. In: Stat Med 30 (2011), pp. 1121. doi: 10.1002/sim.4085.
url: http://dx.doi.org/10.1002/sim.4085 (cit. on p. 4-40).
lack of need for NRI to be category-based;arbitrariness of categories;category-less or continuous NRI is the most objective and versatile
measure of improvement in risk prediction;authors misunderstood the inadequacy of three categories if categories are used;comparison
of NRI to change in C index;example of continuous plot of risk for old model vs. risk for new model
.
[129] Michael J. Pencina et al. Evaluating the added predictive ability of a new marker: From area under the ROC curve
to reclassification and beyond. In: Stat Med 27 (2008), pp. 157172 (cit. on p. 4-40).
small differences in ROC area can still be very meaningful;example of insignificant test for difference in ROC areas with very significant
results from new method;Yates discrimination slope;reclassification table;limiting version of this based on whether and amount by
which probabilities rise for events and lower for non-events when compare new model to old;comparing two models;see letter to the
editor by Van Calster and Van Huffel, Stat in Med 29:318-319, 2010 and by Cook and Paynter, Stat in Med 31:93-97, 2012
11-9
[130] Sanne A. Peters et al. Multiple imputation of missing repeated outcome measurements did not add to linear
mixed-effects models. In: J Clin Epi 65.6 (2012), pp. 686695. doi: 10.1016/j.jclinepi.2011.11.012. url:
http://dx.doi.org/10.1016/j.jclinepi.2011.11.012 (cit. on p. 7-9).
[131] Jose C. Pinheiro and Douglas M. Bates. Mixed-Effects Models in S and S-PLUS. New York: Springer, 2000 (cit. on
pp. 7-10, 7-12).
[132] Tjeerd van der Ploeg, Peter C. Austin, and Ewout W. Steyerberg.Modern modelling techniques are data hungry: a
simulation study for predicting dichotomous endpoints. In: BMC medical research methodology 14.1 (Dec. 2014),
pp. 137+. issn: 1471-2288. doi: 10.1186/1471-2288-14-137. url: http://dx.doi.org/10.1186/1471-228814-137 (cit. on pp. 2-4, 4-16).
Would be better to use proper accuracy scores in the assessment. Too much emphasis on optimism as opposed to final discrimination
measure. But much good practical information. Recursive partitioning fared poorly.
.
[133] Richard F. Potthoff and S. N. Roy. A generalized multivariate analysis of variance model useful especially for
growth curve problems. In: Biometrika 51 (1964), pp. 313326 (cit. on p. 7-6).
included an AR1 example
.
[134] Peter Radchenko and Gareth M. James.Variable inclusion and shrinkage algorithms. In: J Am Stat Assoc 103.483
(2008), pp. 13041315 (cit. on p. 2-33).
solves problem caused by lasso using the same penalty parameter for variable selection and shrinkage which causes lasso to have to
keep too many variables in the model to avoid overshrinking the remaining predictors;does not handle scaling issue well
.
[135] D. R. Ragland. Dichotomizing continuous outcome variables: Dependence of the magnitude of association and
statistical power on the cutpoint. In: Epi 3 (1992). See letters to editor May 1993 P. 274-, Vol 4 No. 3, pp. 434440.
doi: 10.1097/00001648-199209000-00009. url: http://dx.doi.org/10.1097/00001648-199209000-00009
(cit. on p. 2-13).
[136] Brendan M. Reilly and Arthur T. Evans.Translating clinical research into clinical practice: Impact of using prediction
rules to make decisions. In: Ann Int Med 144 (2006), pp. 201209 (cit. on p. 1-10).
impact analysis;example of decision aid being ignored or overruled making MD decisions worse;assumed utilities are constant across
subjects by concluding that directives have more impact than predictions;Goldman-Cook clinical prediction rule in AMI
.
[137] Ellen B. Roecker. Prediction error and its estimation for subset-selected models. In: Technometrics 33 (1991),
pp. 459468 (cit. on pp. 4-14, 5-16).
[138] Patrick Royston, Douglas G. Altman, and Willi Sauerbrei. Dichotomizing continuous predictors in multiple regression: a bad idea. In: Stat Med 25 (2006), pp. 127141. doi: 10.1002/sim.2331. url: http://dx.doi.org/10.
1002/sim.2331 (cit. on p. 2-13).
destruction of statistical inference when cutpoints are chosen using the response variable; varying effect estimates when change cutpoints;difficult to interpret effects when dichotomize;nice plot showing effect of categorization; PBC data
.
[139] D. Rubin and N. Schenker.Multiple imputation in health-care data bases: An overview and some applications. In:
Stat Med 10 (1991), pp. 585598 (cit. on p. 3-12).
[140] Warren Sarle. The VARCLUS Procedure. In: SAS/STAT Users Guide. fourth. Vol. 2. Cary, NC: SAS Institute,
Inc., 1990. Chap. 43, pp. 16411659. url: http://support.sas.com/documentation/onlinedoc/stat (cit. on
pp. 4-22, 4-25).
[141] Willi Sauerbrei and Martin Schumacher. A bootstrap resampling procedure for model building: Application to the
Cox regression model. In: Stat Med 11 (1992), pp. 20932109 (cit. on pp. 4-15, 5-17).
[142] G. Schulgen et al. Outcome-oriented cutpoints in quantitative exposure. In: Am J Epi 120 (1994), pp. 172184
(cit. on pp. 2-13, 2-15).
[143] E. Selvin et al. Glycated hemoglobin, diabetes, and cardiovascular risk in nondiabetic adults. In: NEJM 362.9
(Mar. 2010), pp. 800811. doi: 10.1056/NEJMoa0908359. url: http://dx.doi.org/10.1056/NEJMoa0908359
(cit. on p. 2-24).
[144] Stephen Senn. Change from baseline and analysis of covariance revisited. In: Stat Med 25 (2006), pp. 43344344
(cit. on p. 7-5).
11-10
shows that claims that in a 2-arm study it is not true that ANCOVA requires the population means at baseline to be identical;refutes
some claims of lia00lon;problems with counterfactuals;temporal additivity (amounts to supposing that despite the fact that groups
are difference at baseline they would show the same evolution over time);causal additivity;is difficult to design trials for which simple
analysis of change scores is unbiased, ANCOVA is biased, and a causal interpretation can be given;temporally and logically, a baseline
cannot be a response to treatment, so baseline and response cannot be modeled in an integrated framework as Laird and Wares
model has been used;one should focus clearly on outcomes as being the only values that can be influenced by treatment and examine
critically any schemes that assume that these are linked in some rigid and deterministic view to baseline values. An alternative tradition
sees a baseline as being merely one of a number of measurements capable of improving predictions of outcomes and models it in this
way.;You cannot establish necessary conditions for an estimator to be valid by nominating a model and seeing what the model implies
unless the model is universally agreed to be impeccable. On the contrary it is appropriate to start with the estimator and see what
assumptions are implied by valid conclusions.;this is in distinction to lia00lon
.
[145] Jun Shao.Linear model selection by cross-validation. In: J Am Stat Assoc 88 (1993), pp. 486494 (cit. on p. 5-17).
[146] Noah Simon et al. A sparse-group lasso. In: J Comp Graph Stat 22.2 (2013), pp. 231245. doi: 10 . 1080 /
10618600.2012.681250. eprint: http://www.tandfonline.com/doi/pdf/10.1080/10618600.2012.681250.
url: http://www.tandfonline.com/doi/abs/10.1080/10618600.2012.681250 (cit. on p. 2-34).
sparse effects both on a group and within group levels;can also be considered special case of group lasso allowing overlap between
groups
.
[147] Sean L. Simpson et al.A linear exponent AR(1) family of correlation structures. In: Stat Med 29 (2010), pp. 1825
1838 (cit. on p. 7-13).
[148] L. R. Smith, F. E. Harrell, and L. H. Muhlbaier. Problems and potentials in modeling survival. In: Medical
Effectiveness Research Data Methods (Summary Report), AHCPR Pub. No. 92-0056. Ed. by Mary L. Grady and
Harvey A. Schwartz. Rockville, MD: US Dept. of Health, Human Services, Agency for Health Care Policy, and
Research, 1992, pp. 151159. url: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/FrankHarrell/
smi92pro.pdf (cit. on p. 4-16).
[149] Ian Spence and Robert F. Garrison. A remarkable scatterplot. In: Am Statistician 47 (1993), pp. 1219 (cit. on
p. 4-38).
[150] D. J. Spiegelhalter. Probabilistic prediction in patient management and clinical trials. In: Stat Med 5 (1986),
pp. 421433 (cit. on pp. 4-19, 4-45, 5-20, 5-21).
z-test for calibration inaccuracy (implemented in Stata, and R Hmisc packages val.prob function)
.
[151] Ewout W. Steyerberg. Clinical Prediction Models. New York: Springer, 2009 (cit. on pp. x, 11-13).
[152] Ewout W. Steyerberg et al. Prognostic modeling with logistic regression analysis: In search of a sensible strategy
in small data sets. In: Med Decis Mak 21 (2001), pp. 4556 (cit. on p. 4-1).
[153] Ewout W. Steyerberg et al. Prognostic modelling with logistic regression analysis: A comparison of selection and
estimation methods in small data sets. In: Stat Med 19 (2000), pp. 10591079 (cit. on p. 2-33).
[154] C. J. Stone. Comment: Generalized additive models. In: Statistical Sci 1 (1986), pp. 312314 (cit. on p. 2-25).
[155] C. J. Stone and C. Y. Koo. Additive splines in statistics. In: Proceedings of the Statistical Computing Section
ASA. Washington, DC, 1985, pp. 4548 (cit. on pp. 2-21, 2-26).
[156] Samy Suissa and Lucie Blais. Binary regression with continuous outcomes. In: Stat Med 14 (1995), pp. 247255.
doi: 10.1002/sim.4780140303. url: http://dx.doi.org/10.1002/sim.4780140303 (cit. on p. 2-13).
[157] Guo-Wen Sun, Thomas L. Shook, and Gregory L. Kay.Inappropriate use of bivariable analysis to screen risk factors
for use in multivariable analysis. In: J Clin Epi 49 (1996), pp. 907916 (cit. on p. 4-17).
[158] Robert Tibshirani. Regression shrinkage and selection via the lasso. In: J Roy Stat Soc B 58 (1996), pp. 267288
(cit. on p. 2-33).
[159] Jos Twisk et al. Multiple imputation of missing values was not necessary before performing a longitudinal mixedmodel analysis. In: J Clin Epi 66.9 (2013), pp. 10221028. doi: 10 . 1016 / j . jclinepi . 2013 . 03 . 017. url:
http://dx.doi.org/10.1016/j.jclinepi.2013.03.017 (cit. on p. 3-3).
[160] Werner Vach and Maria Blettner. Missing Data in Epidemiologic Studies. In: Ency of Biostatistics. New York:
Wiley, 1998, pp. 26412654 (cit. on p. 3-5).
[161] S. van Buuren et al. Fully conditional specification in multivariate imputation. In: J Stat Computation Sim 76.12
(2006), pp. 10491064 (cit. on pp. 3-15, 3-16).
justification for chained equations alternative to full multivariate modeling
11-11
[162] Geert J. M. G. van der Heijden et al. Imputation of missing values is superior to complete case analysis and
the missing-indicator method in multivariable diagnostic research: A clinical example. In: J Clin Epi 59 (2006),
pp. 11021109. doi: 10.1016/j.jclinepi.2006.01.015. url: http://dx.doi.org/10.1016/j.jclinepi.
2006.01.015 (cit. on p. 3-5).
Invalidity of adding a new category or an indicator variable for missing values even with MCAR
.
[163] J. C. van Houwelingen and S. le Cessie. Predictive value of statistical models. In: Stat Med 9 (1990), pp. 1303
1325 (cit. on pp. 2-26, 4-19, 5-17, 5-20, 5-22).
[164] William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Fourth. New York: Springer-Verlag,
2003. isbn: 0387954570 (cit. on p. 9-2).
[165] Geert Verbeke and Geert Molenberghs. Linear Mixed Models for Longitudinal Data. New York: Springer, 2000.
[166] Pierre J. Verweij and Hans C. van Houwelingen. Penalized likelihood in Cox regression. In: Stat Med 13 (1994),
pp. 24272436 (cit. on p. 4-19).
[167] Andrew J. Vickers. Decision analysis for the evaluation of diagnostic tests, prediction models, and molecular
markers. In: Am Statistician 62.4 (2008), pp. 314320 (cit. on p. 1-7).
limitations of accuracy metrics;incorporating clinical consequences;nice example of calculation of expected outcome;drawbacks of conventional decision analysis, especially because of the difficulty of eliciting the expected harm of a missed diagnosis;use of a threshold
on the probability of disease for taking some action;decision curve;has other good references to decision analysis
.
[168] Eric Vittinghoff and Charles E. McCulloch. Relaxing the rule of ten events per variable in logistic and Cox regression. In: Am J Epi 165 (2006), pp. 710718 (cit. on p. 4-16).
the authors may have not been quite stringent enough in their assessment of adequacy of predictions;letter to the editor submitted
.
[169] Paul T. von Hippel. Regression with missing Ys: An improved strategy for analyzing multiple imputed data. In:
Soc Meth 37.1 (2007), pp. 83117 (cit. on p. 3-4).
[170] Howard Wainer. Finding what is not there through the unfortunate binning of results: The Mendel effect. In:
Chance 19.1 (2006), pp. 4956 (cit. on pp. 2-13, 2-16).
can find bins that yield either positive or negative association;especially pertinent when effects are small;With four parameters, I can
fit an elephant; with five, I can make it wiggle its trunk. - John von Neumann
.
[171] Hansheng Wang and Chenlei Leng. Unified LASSO estimation by least squares approximation. In: J Am Stat
Assoc 102 (2007), pp. 10391048. doi: 10.1198/016214507000000509. url: http://dx.doi.org/10.1198/
016214507000000509 (cit. on p. 2-33).
[172] S. Wang et al.Hierarchically penalized Cox regression with grouped variables. In: Biometrika 96.2 (2009), pp. 307
322 (cit. on p. 2-34).
[173] Yohanan Wax. Collinearity diagnosis for a relative risk regression analysis: An application to assessment of dietcancer relationship in epidemiological studies. In: Stat Med 11 (1992), pp. 12731287 (cit. on p. 4-22).
[174] Ian R. White and John B. Carlin.Bias and efficiency of multiple imputation compared with complete-case analysis
for missing covariate values. In: Stat Med 29 (2010), pp. 29202931 (cit. on p. 3-15).
[175] Ian R. White and Patrick Royston. Imputing missing covariate values for the Cox model. In: Stat Med 28 (2009),
pp. 19821998 (cit. on p. 3-3).
approach to using event time and censoring indicator as predictors in the imputation model for missing baseline covariates;recommended
an approximation using the event indicator and the cumulative hazard transformation of time, without their interaction
.
[176] Ian R. White, Patrick Royston, and Angela M. Wood. Multiple imputation using chained equations: Issues and
guidance for practice. In: Stat Med 30.4 (2011), pp. 377399 (cit. on pp. 3-1, 3-10, 3-15, 3-19).
practical guidance for the use of multiple imputation using chained equations;MICE;imputation models for different types of target
variables;PMM choosing at random from among a few closest matches;choosing number of multiple imputations by a reproducibility
argument, suggesting 100f imputations when f is the fraction of cases that are incomplete
.
[177] John Whitehead. Sample size calculations for ordered categorical data. In: Stat Med 12 (1993). See letter to
editor SM 15:1065-6 for binary case;see errata in SM 13:871 1994;see kol95com, jul96sam, pp. 22572271 (cit. on
p. 4-17).
11-12
[178] Ryan E. Wiegand. Performance of using multiple stepwise algorithms for variable selection. In: Stat Med 29
(2010), pp. 16471659 (cit. on p. 4-14).
fruitless to try different stepwise methods and look for agreement;the methods will agree on the wrong model
.
[179] Daniela M. Witten and Robert Tibshirani. Testing significance of features by lassoed principal components. In:
Ann Appl Stat 2.3 (2008), pp. 9861012 (cit. on p. 2-34).
reduction in false discovery rates over using a vector of t-statistics;borrowing strength across genes;one would not expect a single gene
to be associated with the outcome, since, in practice, many genes work together to effect a particular phenotype. LPC effectively downweights individual genes that are associated with the outcome but that do not share an expression pattern with a larger group of genes,
and instead favors large groups of genes that appear to be differentially-expressed.;regress principal components on outcome;sparse
principal components
.
[180] S. N. Wood. Generalized Additive Models: An Introduction with R. ISBN 9781584884743. Boca Raton, FL: Chapman & Hall/CRC, 2006 (cit. on p. 2-35).
[181] C. F. J. Wu. Jackknife, bootstrap and other resampling methods in regression analysis. In: Ann Stat 14.4 (1986),
pp. 12611350 (cit. on p. 5-17).
[182] Shifeng Xiong. Some notes on the nonnegative garrote. In: Technometrics 52.3 (2010), pp. 349361 (cit. on
p. 2-34).
... to select tuning parameters, it may be unnecessary to optimize a model selectin criterion repeatedly;natural selection of penalty
function
.
[183] Jianming Ye. On measuring and correcting the effects of data mining and model selection. In: J Am Stat Assoc
93 (1998), pp. 120131 (cit. on p. 1-14).
[184] F. W. Young, Y. Takane, and J. de Leeuw. The principal components of mixed measurement level multivariate
data: An alternating least squares method with optimal scaling features. In: Psychometrika 43 (1978), pp. 279281
(cit. on p. 4-26).
[185] Recai M. Yucel and Alan M. Zaslavsky. Using calibration to improve rounding in imputation. In: Am Statistician
62.2 (2008), pp. 125129 (cit. on p. 3-17).
using rounding to impute binary variables using techniques for continuous data;uses the method to solve for the cutpoint for a continuous
estimate to be converted into a binary value;method should be useful in more general situations;idea is to duplicate the entire dataset and
in the second half of the new datasets to set all non-missing values of the target variable to missing;multiply impute these now-missing
values and compare them to the actual values
.
[186] Hao H. Zhang and Wenbin Lu. Adaptive lasso for Coxs proportional hazards model. In: Biometrika 94 (2007),
pp. 691703 (cit. on p. 2-33).
penalty function has ratios against original MLE;scale-free lasso
.
[187] Hui Zhou, Trevor Hastie, and Robert Tibshirani. Sparse principal component analysis. In: J Comp Graph Stat 15
(2006), pp. 265286 (cit. on p. 2-34).
principal components analysis that shrinks some loadings to zero
.
[188] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. In: J Roy Stat Soc B 67.2
(2005), pp. 301320 (cit. on p. 2-33).
R packages written by FE Harrell are freely available from CRAN, and are managed at https://github.
com/harrelfe.
11-13
To obtain a book with detailed examples and case studies

and notes on the theory and applications of survival analysis,
logistic regression, ordinal regression, linear models, and longitudinal models, order Regression Modeling Strategies with
Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, 2nd Edition by FE Harrell from
Springer NY (2015). Steyerberg [151] and Dupont [47] are
excellent texts for accompanying the book.
To obtain a glossary of statistical terms and other handouts related to diagnostic and prognostic modeling,
point your Web browser to biostat.mc.vanderbilt.edu/ClinStat.

Course Regression Model Strategies PDF

Uploaded by

Copyright:

Available Formats

Course Regression Model Strategies PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Course Regression Model Strategies PDF

Uploaded by

Copyright:

Available Formats

Regression Modeling Strategies

Joint Statistical Meetings

Copyright 1995-2016 FE Harrell

All Rights Reserved

Hypothesis Testing, Estimation, and Prediction . . . . . . . . . . . . . 1-1

Examples of Uses of Predictive Multivariable Modeling . . . . . . . . . 1-3

Misunderstandings about Prediction vs. Classification . . . . . . . . . . 1-5

Planning for Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 1-10

Choice of the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-13

Model uncertainty / Data-driven Model Specification . . . . . . . . . . 1-14

2 General Aspects of Fitting Regression Models

Notation for Multivariable Regression Models . . . . . . . . . . . . . . 2-5

Model Formulations . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-6

Interpreting Model Parameters . . . . . . . . . . . . . . . . . . . . . . 2-7

Nominal Predictors . . . . . . . . . . . . . . . . . . . . . . . . 2-7

Example: Inference for a Simple Model . . . . . . . . . . . . . 2-9

Review of Composite (Chunk) Tests . . . . . . . . . . . . . . . 2-11

Relaxing Linearity Assumption for Continuous Predictors . . . . . . . . 2-13

Avoiding Categorization . . . . . . . . . . . . . . . . . . . . . 2-13

Simple Nonlinear Terms . . . . . . . . . . . . . . . . . . . . . 2-17

Cubic Spline Functions . . . . . . . . . . . . . . . . . . . . . . 2-20

Restricted Cubic Splines . . . . . . . . . . . . . . . . . . . . . 2-21

Choosing Number and Position of Knots . . . . . . . . . . . . 2-25

Nonparametric Regression . . . . . . . . . . . . . . . . . . . . 2-27

Advantages of Regression Splines over Other Methods . . . . . 2-29

Recursive Partitioning: Tree-Based Models . . . . . . . . . . . . . . . 2-31

New Directions in Predictive Modeling . . . . . . . . . . . . . 2-32

Multiple Degree of Freedom Tests of Association . . . . . . . . . . . . 2-36

Assessment of Model Fit . . . . . . . . . . . . . . . . . . . . . . . . . 2-38

Regression Assumptions . . . . . . . . . . . . . . . . . . . . . 2-38

Modeling and Testing Complex Interactions . . . . . . . . . . . 2-42

Fitting Ordinal Predictors . . . . . . . . . . . . . . . . . . . . 2-47

Distributional Assumptions . . . . . . . . . . . . . . . . . . . 2-48

Types of Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1

Prelude to Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2

Missing Values for Different Types of Response Variables . . . . . . . . 3-3

Problems With Simple Alternatives to Imputation

Strategies for Developing an Imputation Model . . . . . . . . . . . . . 3-7

Single Conditional Mean Imputation . . . . . . . . . . . . . . . . . . . 3-11

Predictive Mean Matching . . . . . . . . . . . . . . . . . . . . . . . . 3-12

Multiple Imputation . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12

3.10 Summary and Rough Guidelines . . . . . . . . . . . . . . . . . . . . . 3-19

Prespecification of Predictor Complexity Without Later Simplification . 4-3

Learning From a Saturated Model . . . . . . . . . . . . . . . . 4-5

Using Marginal Generalized Rank Correlations . . . . . . . . . 4-7

Checking Assumptions of Multiple Predictors Simultaneously . . . . . . 4-9

Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10

Overfitting and Limits on Number of Predictors

Data Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-23

Redundancy Analysis . . . . . . . . . . . . . . . . . . . . . . . 4-24

Variable Clustering . . . . . . . . . . . . . . . . . . . . . . . . 4-25

Transformation and Scaling Variables Without Using Y

Simultaneous Transformation and Imputation . . . . . . . . . . 4-28

Simple Scoring of Variable Clusters . . . . . . . . . . . . . . . 4-32

Simplifying Cluster Scores . . . . . . . . . . . . . . . . . . . . 4-33

How Much Data Reduction Is Necessary? . . . . . . . . . . . . 4-33

Other Approaches to Predictive Modeling . . . . . . . . . . . . . . . . 4-36

Overly Influential Observations . . . . . . . . . . . . . . . . . . . . . 4-36

4.10 Comparing Two Models . . . . . . . . . . . . . . . . . . . . . . . . . 4-39