A Guide To Model Selection For Survival Analysis
How to examine and evaluate models for time-to-event data
Aashish Nair · May 18, 2022
Selecting a suitable model for the time-to-event data of interest is an important step in survival
analysis.
Unfortunately, with the sheer number of models available for survival analysis, it is easy to
become inundated by all of the available information.
While it might be tempting to meticulously study every modeling approach, drowning yourself in
statistical jargon will only hamper efforts to properly find the best model.
Here, we explore a few popular models in survival analysis and then discuss the factors that
should be considered when identifying the best model for the time-to-event data of interest.
Key Terminology
Before covering any of the models, it is important to become familiar with some of the measures
that are of interest in survival analysis.
1. Survival function
The survival function is denoted by S(t), where:
S(t) = Pr {T > t}
Here, T denotes the time until the event occurs. The survival function shows the probability of a subject surviving (i.e., not experiencing the event) beyond time t. This is a non-increasing function.
2. Hazard function
The hazard function, also known as the hazard rate, is denoted by h(t), where:
h(t) = lim(Δt → 0) Pr {t ≤ T < t + Δt | T ≥ t} / Δt
It shows the instantaneous probability of the event occurring at time t, given that the event has not already occurred. The hazard function can be derived from the survival function, and vice versa.
3. Cumulative hazard function
The cumulative hazard function, H(t), is a non-decreasing function that shows the total accumulated risk of the event occurring by time t. In mathematical terms, it is the area under the hazard function:
H(t) = ∫₀ᵗ h(u) du
4. Hazard Ratio
The hazard ratio is the ratio of the hazard rate between two groups. This is a quantified measure
of how a covariate affects the survival duration of the subjects in the study.
Ultimately, picking the right model comes down to identifying the one whose assumptions befit the time-to-event data and that enables users to obtain the desired information from the subsequent analysis.
Popular Models in Survival Analysis
Out of the many models that can be used to analyze time-to-event data, four are the most prominent: the Kaplan-Meier model, the Exponential model, the Weibull model, and the Cox Proportional-Hazards model.
In the following demonstration, each modeling approach will be explored with Python’s lifelines
module. The models will be used to examine time-to-event data from a built-in dataset.
The week column shows the survival duration and the arrest column shows whether or not the
event (i.e., arrest) has occurred.
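The built-in dataset is not named here, but the week and arrest columns (along with the fin, age, and prio covariates used later) match lifelines' Rossi recidivism dataset. The sketch below assumes that dataset and loads it with load_rossi:

```python
from lifelines.datasets import load_rossi

# Load the built-in recidivism dataset (assumed here to be the one
# used in this article): one row per subject, with the survival
# duration in "week" and the event indicator in "arrest".
df = load_rossi()
print(df[["week", "arrest", "fin", "age", "prio"]].head())
```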
1 - Kaplan-Meier Model
The Kaplan-Meier model is classified as a non-parametric model, meaning that it does not assume a distribution for the data. It generates a survival function using only the information provided.
The Kaplan-Meier model computes the survival function with the formula:
S(t) = ∏ over event times tᵢ ≤ t of (1 − dᵢ / nᵢ)
where dᵢ is the number of events observed at time tᵢ and nᵢ is the number of subjects still at risk just before tᵢ.
Here’s how the Kaplan-Meier curve can be built from the time-to-event data:
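A minimal sketch using lifelines' KaplanMeierFitter, assuming the Rossi data loaded above:

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Fit the non-parametric Kaplan-Meier estimator to the durations and
# event indicators, then plot the estimated survival function.
kmf = KaplanMeierFitter()
kmf.fit(durations=df["week"], event_observed=df["arrest"])
kmf.plot_survival_function()
plt.title("Kaplan-Meier survival curve")
plt.show()
```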
The survival functions of different groups can be compared with the log-rank test, a non-
parametric hypothesis test.
As an example, we can compare the survival functions of subjects with and without financial aid
to see if financial aid affects the survival duration using the logrank_test method.
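A sketch of that comparison with lifelines' logrank_test, assuming the fin column of the Rossi data marks financial aid:

```python
from lifelines.statistics import logrank_test

# Split subjects by whether they received financial aid (fin == 1).
aid = df[df["fin"] == 1]
no_aid = df[df["fin"] == 0]

# Compare the two groups' survival functions with the log-rank test.
results = logrank_test(
    aid["week"], no_aid["week"],
    event_observed_A=aid["arrest"],
    event_observed_B=no_aid["arrest"],
)
print(results.p_value)  # a small p-value suggests the survival curves differ
```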
The benefit of the Kaplan-Meier model is that it is intuitive and easy to interpret. As it makes few underlying assumptions, this model is often used as a baseline in survival analysis.
Unfortunately, due to the model’s minimal complexity, it can be difficult to draw meaningful insights from it. It cannot be used to compare risk between groups or to compute metrics like the hazard ratio.
2 - Exponential Model
The Exponential model is another popular model in survival analysis. Unlike the Kaplan-Meier model, it is a parametric model, meaning that it assumes the data follows a specific distribution.
The survival function of the Exponential model is derived from the formula:
S(t) = exp(−t / λ)
(written here with λ as a scale parameter, matching the parametrization lifelines uses)
The hazard rate of the Exponential model is derived from the formula:
h(t) = 1 / λ
The exponential model assumes that the hazard rate is constant. In other words, the risk of the
event of interest occurring remains the same throughout the period of observation.
To determine the hazard metrics, one can compute the λ value of the fitted distribution. One can also directly get the hazard metrics from the properties offered in the lifelines package, which report the hazard rate and the cumulative hazard function as data frames (see the sketch below):
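A sketch of fitting the Exponential model with lifelines' ExponentialFitter, again assuming the Rossi data:

```python
from lifelines import ExponentialFitter

# Fit the parametric Exponential model to the same durations and events.
exf = ExponentialFitter()
exf.fit(df["week"], df["arrest"])

print(exf.lambda_)                    # fitted scale parameter lambda
print(exf.hazard_.head())             # constant hazard rate over time
print(exf.cumulative_hazard_.head())  # linearly increasing cumulative hazard
```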
Based on the output, the hazard rate remains constant, which is in line with the nature of the
Exponential model.
Overall, the Exponential model provides substantial information on the survival function and the
hazard function. Moreover, it can be used to compare the hazard rates of different groups.
However, it makes the strong assumption that the hazard rate is constant at any given time,
which may not suit the time-to-event data of interest.
3 - Weibull Model
Another parametric model one can consider is the Weibull model.
The data distribution in the Weibull model is determined by two parameters, λ and ρ. Its survival function takes the form:
S(t) = exp(−(t / λ)^ρ)
The λ parameter indicates how long it takes for 63.2% of the subjects to experience the event.
The ρ parameter indicates whether the hazard rate is increasing, decreasing, or constant. If ρ is greater than 1, the hazard rate increases monotonically over time. If ρ is less than 1, it decreases monotonically. If ρ equals 1, the hazard rate is constant and the model reduces to the Exponential model.
In other words, the Weibull model assumes that the hazard rate changes monotonically over time. The hazard rate can always increase, always decrease, or stay the same, but it cannot fluctuate up and down.
Since the fitted ρ value is greater than 1, the hazard rate in this model is always increasing. We can confirm this by deriving the hazard rate and the cumulative hazard function, as in the sketch below.
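A minimal sketch with lifelines' WeibullFitter, assuming the same Rossi data:

```python
from lifelines import WeibullFitter

# Fit the parametric Weibull model.
wbf = WeibullFitter()
wbf.fit(df["week"], df["arrest"])

print(wbf.lambda_, wbf.rho_)          # fitted scale and shape parameters
print(wbf.hazard_.head())             # monotonically changing hazard rate
print(wbf.cumulative_hazard_.head())  # accumulated risk over time
```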
Similar to the Exponential model, the Weibull model is capable of computing many of the
relevant metrics in the survival analysis.
However, its results are based on the strong assumption that the hazard rate changes monotonically across time, which may not suit the time-to-event data in question.
4 - Cox Proportional-Hazards Model
If the goal is to conduct a survival analysis that examines time-to-event data with respect to multiple variables at once, the Cox Proportional-Hazards model (also known as the Cox model) can be the preferable alternative.
The Cox model implements survival regression, a technique that regresses covariates against the
survival duration, to give insight into how the covariates affect survival duration.
This model is classified as semi-parametric since it incorporates both parametric and non-
parametric elements.
The Cox model allows the hazard rate to fluctuate, as opposed to the parametric models where
the hazard rate adheres to a fixed pattern.
The model is, however, dependent on the proportional hazards assumption. It assumes that the
hazard ratios between groups remain constant. In other words, no matter how the hazard rates of
the subjects change during the period of observation, the hazard rate of one group relative to the
other will always stay the same.
The lifelines module allows users to visualize the baseline survival curve, which illustrates the
survival function when all the covariates are set to their median value.
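A sketch of fitting the Cox model with lifelines' CoxPHFitter and plotting its baseline survival curve, assuming the Rossi data:

```python
import matplotlib.pyplot as plt
from lifelines import CoxPHFitter

# Regress all covariates in the Rossi data against survival duration.
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")

# Plot the baseline survival curve (covariates held at their baseline values).
cph.baseline_survival_.plot()
plt.title("Baseline survival curve")
plt.show()
```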
In addition, the Cox model can quantify the strength of the relationship between each covariate and the survival duration of the subjects through survival regression. The print_summary method in lifelines displays the results of the regression:
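Continuing from the fitted cph object above, a short sketch of obtaining that summary:

```python
# Display coefficients, hazard ratios (exp(coef)), confidence
# intervals, and p-values for every covariate.
cph.print_summary()

# The exp(coef) and p columns can also be pulled out directly:
print(cph.summary[["coef", "exp(coef)", "p"]])
```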
The output provides a lot of information, but the greatest insights can be obtained with the
exp(coef) column and the p column (p-value).
The p-value indicates which covariates have a significant effect on the survival duration. Based on the results, the fin, age, and prio covariates are statistically significant predictors of survival duration, given their coefficients' small p-values.
Moreover, the hazard ratio for each covariate is equivalent to e to the power of the covariate’s
coefficient (eᶜᵒᵉᶠ), which is already provided in the exp(coef) column.
As an example, the eᶜᵒᵉᶠ value of the fin covariate, which represents financial aid, is 0.68. Mathematically, this can be interpreted as follows:
HR = exp(coef) = exp(−0.38) ≈ 0.68
This means that receiving financial aid changes the hazard rate by a factor of 0.68 (i.e., a 32% decrease).
Of course, the Cox model is only suitable if its assumption of proportional hazards befits the
time-to-event data. To justify the use of the model, one can test this assumption using the
check_assumptions method.
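A sketch of running that check on the fitted cph model from earlier, assuming the same Rossi data:

```python
# Test the proportional hazards assumption; lifelines prints warnings
# and suggested remedies for covariates that appear to violate it.
cph.check_assumptions(df, p_value_threshold=0.05)
```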
Choosing the Best Model
To simplify your search for the best model, consider the following factors:
1. The objective
The first thing you should decide is what information you are looking to obtain from your survival analysis. With a specific objective in mind, it will be easier to home in on the ideal model.
2. Domain knowledge
There might be multiple models that can be used to derive the desired metrics for the time-to-
event data in question. In such a case, utilizing the domain knowledge of the subject can help
filter out inadequate models.
For instance, if you are conducting a survival analysis to study machine failure, you should
account for the fact that machines are subject to wear and tear. Due to the wear and tear, the risk
of machine failure should increase with time. For such a case, the Exponential model, which
assumes that the hazard rate is always constant, is not appropriate and should be removed from
consideration.
3. Model performance
If you find that there are multiple models that help achieve the aim of the study and conform to
the domain knowledge of the subject, you can identify the ideal model by gauging the models’
performance with evaluation metrics.
One suitable metric is the Akaike Information Criterion (AIC), which is an estimate of the
prediction error of the model. A lower AIC score corresponds to a better model.
As an example, if we are choosing between the Exponential model and the Weibull model for
analyzing the time-to-event data, we can determine the superior model by computing the AIC
scores for both models.
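A sketch of that comparison using the AIC_ property that lifelines' parametric fitters expose, assuming the fitted exf and wbf objects from the earlier sketches:

```python
# Lower AIC indicates a better trade-off between fit and complexity.
print(f"Exponential AIC: {exf.AIC_:.2f}")
print(f"Weibull AIC:     {wbf.AIC_:.2f}")
```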
Based on the AIC metric, the Weibull model is a better fit for analyzing the time-to-event data.
For the rest of this article, we’ll look at a fabricated example about the survival rate of
domesticated dogs on different diets.
Research questions range from general lifespan questions about a population, such as: how long do domesticated dogs live, on average? Survival analysis also provides tools to incorporate covariates and other predictors. Some example research questions in this case are:
How do various factors and covariates (e.g., genetics, diet, exercise, smoking, etc.) affect
lifespan?
Of patients diagnosed with a particular form of cancer, how do various medical treatments
affect lifespan, prognosis, or likelihood of remission?
How do manufacturing processes (e.g., temperature, time, material composition, etc.) affect the
failure rate of a product (such as a structural beam)?
With our simulated data, this graph indicates that for Diet 2, after 3 years, 70% of the dogs
remain, but after 4 years, only about 25% of dogs on Diet 2 survived. This is strikingly different
from Diet 1, which still has 90% surviving after 4 years.
Because the survival curves still have a greater-than-zero probability after 10 years have elapsed, this plot shows that some values were censored, meaning that some dogs were still alive at the conclusion of the study. For the censored observations, we can't know how long they will ultimately survive.
In practice, censoring is a very common occurrence. A study is designed and funded for a particular amount of time with the intention of observing the event of interest, but the event may not occur for every subject before the study ends. Also, a dog might enter the study after it has been running for seven years, so that dog is only observed for a maximum of three years.
In the discrete case, the survival function at time t, S(t), is S(t) = probability of surviving after
(not including) time t
Mathematically, the survival function is 1 - the cumulative distribution function (CDF), or:
S(t) = 1 - F(t) = 1 - Pr {T ≤ t}
This means that in the discrete case, the probability mass function gives the probability of the event occurring exactly at time t.
Intuitively, hazard functions give you a sense of the risk of the event occurring for an individual
at a current point in time. In our demo example, we only recorded data annually, so our data are
discrete. This makes the interpretation a little more challenging. Instead of an instantaneous rate
of death, we have something close to (but not exactly) an annual rate of death, which we call a
“hazard.”
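Under these conventions, a rough way to write this discrete (roughly annual) hazard at time t is:
h(t) = Pr {T = t | T ≥ t} = Pr {T = t} / Pr {T ≥ t}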
In our example, notice the hazard function for Diet 2 spikes in three locations (ages 4, 8, and 10).
This reflects the fact that on the survival curve, more dogs died after 4 years elapsed than
remained after 4 years. So clearly, that was a highly hazardous year, and the estimated hazard
function value of 1.3 reflects this. Similar situations occurred at years 8 and 10. Even though not
nearly as many dogs were surviving at that time, the proportion of dogs that died in years 8 and
10 was relatively large.
Both of these methods (the Kaplan-Meier method and the Cox proportional hazards model, discussed below) require that your data are a sample of independent observations from some “population of interest.” With our example, this means the domesticated dogs are randomly sampled and don’t have confounding relationships with other dogs in the study (such as being from the same litter, breeder, kennel, etc.).
The Kaplan-Meier method is intuitive and nonparametric and therefore requires few
assumptions. However, besides a treatment variable (control, treatment 1, treatment 2, …), it
cannot easily incorporate additional variables and predictors into the model.
The Cox proportional hazard model, on the other hand, easily incorporates predictor variables,
but it is more esoteric. The model has been around for decades, is tried and true, and continues to
perform well compared to other alternatives.
With our example data about domestic dogs on two different diets, we recorded the diet and the
year of death of each dog in the study. If we wanted to get an idea of survival rates and
probabilities, the most straightforward way to do that would be to just count up how many dogs
on each diet died each year. We can also easily aggregate the data to calculate the number of
dogs still alive at each time point.
In a nutshell, that’s the basis of the Kaplan-Meier method. It’s called a nonparametric method
because there are no distributional assumptions about the data. It’s just a fancy way of tabulating
and discussing the results.
If this sounds too simple, you are correct. This perspective oversimplifies Kaplan-Meier, but not
by a lot. For example, if some observations in the study don’t experience the event of
interest before the study ends, those values need to be represented appropriately in the
calculations.
Additionally, statisticians have worked out a mathematical theory that justifies the Kaplan-Meier
estimate as being a reasonable choice. Although not all that important in practice (besides giving
statisticians like us a job), this provides credence for the method. For example, the Kaplan-Meier
estimator for the survival curve is asymptotically unbiased, meaning that as the sample size goes
to infinity, the estimator converges on the true value.
However, we have additional (simulated) data about the breed of dogs and their level of activity.
Those are likely interesting and important confounding factors in the survival of dogs. We don’t
have a way of including them in the analysis with Kaplan-Meier, but we can with the Cox
proportional hazards model below.
The Kaplan-Meier Curve is an estimate for the survival curve, which is a graphical
representation of the proportion of observations that have not experienced the event of interest at
each time point.
What is the Cox proportional hazards model?
The industry standard for survival analysis is the Cox proportional hazards model (also called the
Cox regression model). To this day, when a new survival model is proposed, researchers
compare their model to this one.
It is a robust model, meaning that it works well even if some of the model assumptions are
violated. That’s a good thing because the assumptions are difficult to validate empirically, let
alone understand.
Rather than modeling the survival curve, which is the approach taken by the Kaplan-Meier
method, the Cox model estimates the hazard function. In general, hazard functions are more
stable and thus easier to model than survival curves. They depict the hazard, i.e. the
instantaneous rate of death (or failure) given that an individual has survived up to that time.
You may also see it referred to as simply the “proportional hazards model”; that’s just a more ambiguous name for the Cox proportional hazards model.
This constraint (that the hazard functions are proportional) also provides an easy way to add additional variables (covariates) to the model. With our simulated example of dogs on different diets, we can now include the additional information of breed (Great Pyrenees, Labrador, Neapolitan Mastiff) and activity level (Low, Medium, High).
Mathematically, the primary Cox model assumption is that the hazard function, h(t), can be written:
h(t) = h0(t) × exp(β1x1 + β2x2 + … + βpxp)
where h0(t) is the baseline hazard function and x1 through xp are the predictor variables.
Another way of saying that the hazard functions are proportional is that the predictor variables’
effects on the hazard function are multiplicative. That’s a major assumption that is difficult to
assess.
Unless we include interaction terms (such as activity by breed), this assumes, in our example,
that activity level has the same effect on the hazard regardless of how long the dog has been in
the study, what breed the dog is, or what diet it is on.
Interaction terms can be included, but greatly complicate interpretation, and introduce
multicollinearity, which makes the estimates unstable. As with many statistical models, George
Box’s quip that, “All models are wrong but some are useful,” applies here.
The baseline hazard function, h0(t), is key to David Cox’s formulation of the hazard function
because that value gets canceled out when taking a ratio of two different hazards (say for Diet 1
vs Diet 2 in our example).
Numerical results
The most informative part of the numerical results is the set of parameter estimates (and hazard ratios). If you are familiar with linear and logistic regression, the interpretation of the numerical results only requires a slight adjustment. The following estimates provide the core information needed to understand how each predictor variable affects the hazard function.
Mathematically, these parameter estimates are used to calculate the hazard function at different values (or levels) of the covariates using the equation:
h(t | x) = h0(t) × exp(β̂1x1 + β̂2x2 + … + β̂pxp)
The Cox model uses the data to find the maximum likelihood estimators for the regression (β)
coefficients in the hazard function. Each variable in the model (in our example, these are Diet,
Breed, and Activity) has its own regression coefficient and estimate. Categorical variables in the
model use reference level coding.
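As an illustration of reference level coding, here is a hedged sketch in Python with lifelines, assuming a hypothetical data frame of the simulated dog data with columns year, died, diet, breed, and activity (the file name and column names are illustrative, not from the original study):

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical data: year (duration), died (event indicator),
# plus the categorical predictors diet, breed, and activity.
dogs = pd.read_csv("dogs.csv")  # illustrative file name

# Reference level coding: each categorical variable is expanded into
# dummy columns, with one level dropped to serve as the baseline.
encoded = pd.get_dummies(dogs, columns=["diet", "breed", "activity"],
                         drop_first=True, dtype=float)

cph = CoxPHFitter()
cph.fit(encoded, duration_col="year", event_col="died")
cph.print_summary()  # coefficients are relative to the dropped (reference) levels
```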
It’s necessary to have a baseline reference with Cox regression models because all of the
interpretation is based on calculating proportional hazard functions to the baseline, h0(t).
For our example, the primary question of interest is: Do the two different diets have a significant
effect on the survival of dogs? From the parameter estimates and hazard ratio, we can see they
do, and, in fact, have quite a drastic difference. In particular (regardless of breed or activity level)
dogs on Diet 2 had a 4.322 times higher hazard than dogs on Diet 1, with a 95% confidence
interval of (2.720 to 6.953). Because the 95% CI does not include 1, we can also say that this
coefficient is statistically significant (p<0.05).
The value we reported above is the hazard ratio, which is just exp(β̂1) in this case.
The hazard ratio is used for interpreting the results of a Cox proportional hazards model and is
the multiplicative effect of a variable on the baseline hazard function. For continuous predictor
variables, this is the multiplicative effect of a 1-unit change in the predictor (e.g., if weight was a
predictor and was measured in kilograms, it would be the multiplicative effect per kilogram). For
categorical variables, it is the multiplicative effect that results from that level of the predictor
(e.g., Diet 2).
Graphical results
The main graphs for interpretation of the Cox regression model are the cumulative survival
functions for specific values of the predictor variables.
There are a number of interesting graphics to look at with our simulated data. For example, the
two plots below show the drastic differences between the survival rates of Diet 1 and Diet 2.
Here we fixed the activity level at medium and show the differences between breeds by color.
Notice the much steeper decline of Diet 2, which indicates a much lower survival rate. Because
there aren’t any interaction terms in the model, these survival curves don’t cross. Our data was
simulated to behave nicely, and interaction terms weren’t needed. Note that these survival rates
per breed are completely fictitious!
A second graphical example looks at the effect of diet and activity level within a single breed
(Great Pyrenees). Again, this clearly shows that Diet 1 has a much higher survival rate. It also
shows that as the activity level increases, the survival rate increases. Diet 2 is so much worse
than Diet 1, that even at a low activity level on Diet 1 there is a higher survival rate than a high
activity level on Diet 2.
Logistic regression, in contrast to the survival models discussed above, is a tool for predicting a binary response such as success/failure, present/absent, or yes/no. It also uses predictor variables, but only to ascertain whether or not the event occurs for a specific observational unit. In its standard form,
there is no element of time involved in the predictions. You could, for example, use logistic
regression to predict whether a student passes a class based on some predictor variables
(previous exam scores, age, head circumference, etc.).