Disease Association II and Attribution 2022

Measures of Disease Association II
and Measures of Attribution

• Measures of association in case-control studies
– If/how the OR estimates other ratio measures depends on
type of control sampling & nature of underlying cohort
• Strengths and weaknesses of case-control studies

• Interpreting the magnitude of a measure of association
– What we can learn from the “sufficient-component cause”
model of human disease
• Measures of attribution
– Attribution among the exposed
– Attribution among the entire population
Descriptive and Analytic Goals
• Goal of research may be “descriptive” or “analytic”
• Description (descriptive goal)
– How frequent/common are risk factors/exposures/conditions/diseases? How often does Y occur?
• Causation (analytic goal)
– The science of establishing causal relationships among biological, behavioral, environmental (etc.) factors among humans
– Does X cause Y?
• Attribution (analytic goal)
– What fraction of disease Y can be eliminated if a causal exposure X is eliminated?
• Mediation (analytic goal)
– Understanding the mechanisms of causation
– How does X cause Y?
• Interaction (analytic goal)
– When and for whom does X cause Y?
• Prediction (analytic goal)
– Do A, B, and C predict occurrence of Y? (e.g., diagnosis or prognosis)
• Everything in this course relates to designing studies that will
accomplish a descriptive or analytic goal.
How to measure disease association in
case-control designs?
• In cross-sectional or cohort studies, we compare
the occurrence of disease in exposed vs unexposed
– This is the most intuitive approach to evaluating
causal/predictive effect of the exposure on outcome
– It emphasizes temporality of exposure and disease.
Temporality refers to the timing of events. If one thing
comes before another, it is easier to reason that the first
thing caused the second thing (as opposed to the other
way around).
• In case-control design, we typically can’t do this
What can we estimate in a case-control study?
Case-control study of TZD use & incident fracture in
diabetes [TZD = thiazolidinedione, a class of diabetes
drugs]. Dynamic study base: UK General Practice Research
Database.
Study with 3-4 controls per case.

               Fracture   No fracture
TZD use            65         198
No TZD use        955        3530
Total            1020        3728

What happens if we try to estimate incidence of fracture in either exposure group?
Meier et al. Arch Intern Med 2008

Can’t estimate probability of event by exposure status
Case-control study of TZD use & fracture in diabetes

Conducted over 10 yrs. 3-4 CONTROLS PER CASE

               Fracture   No fracture   “Probability of fracture event, by exposure”
TZD use            65         198       65/(65+198) = 0.25
No TZD use        955        3530
Total            1020        3728

“Probability” of a fracture in TZD users is 0.25 (over 10 years).

Seems high but we can make it even higher…
Can’t estimate probability of event by exposure status
REDUCE CONTROLS BY 50%

               Fracture   No fracture   “Probability of fracture event, by exposure”
TZD use            65          99       65/(99+65) = 0.40
No TZD use        955        1765
Total            1020        1864

“Probability” of a fracture in TZD users is now 0.40

Illustrates why the probability (or odds) of disease, by exposure
status, can’t be simply estimated in a case-control design
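The arithmetic on the two preceding slides can be sketched in Python. This is a minimal illustration, not part of the original lecture; the function names are made up for this example. It shows the apparent "probability" of fracture among TZD users moving from about 0.25 to about 0.40 when the controls are halved, while the exposure odds ratio stays at about 1.21.

```python
def apparent_risk(cases, controls):
    """Naive 'probability' of disease in one exposure row of a case-control table."""
    return cases / (cases + controls)


def exposure_odds_ratio(a, b, c, d):
    """Odds of exposure in cases (a/c) divided by odds of exposure in controls (b/d)."""
    return (a / c) / (b / d)


# Counts from Meier et al.: a, c = exposed and unexposed cases; b, d = controls.
a, c = 65, 955
full = {"b": 198, "d": 3530}    # 3-4 controls per case
halved = {"b": 99, "d": 1765}   # controls reduced by 50%

for label, ctl in (("full", full), ("halved", halved)):
    risk = apparent_risk(a, ctl["b"])
    odds_ratio = exposure_odds_ratio(a, ctl["b"], c, ctl["d"])
    print(f"{label}: 'probability' = {risk:.2f}, exposure OR = {odds_ratio:.2f}")
```

The row-wise "probability" is an artifact of how many controls were sampled; the column-wise odds of exposure are not.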
What can we estimate in a case-control study?
               Fracture   No fracture
TZD use            65         198
No TZD use        955        3530
Total            1020        3728

If we try to estimate probability or odds of fracture in either exposure group (working across a row of the table), the result is nonsense. It depends on whether the study selected 4 controls per case, or 2 controls per case, etc. We can, however, work down the table.
Meier et al. Arch Intern Med 2008

Measure of Association in Case-Control Studies
• Can’t typically measure disease occurrence (prevalence, risk,
rate, or odds) in case-control design
• Can measure prevalence (i.e., probability) of exposure in
diseased and the non-diseased reference group
– But this is a dead end; these cannot be compared with our
familiar measures of disease association
• Yet, we can measure odds of exposure in diseased and odds
of exposure in some reference (non-diseased) group
– and hence odds ratio of exposure
– not immediately apparent that this is what we want
– until we take advantage of mathematical properties of OR to
obtain our desired measure
• the OR for development of disease
Favorable property of odds ratio #3:
OR is a lifesaver in case-control studies
• A useful property of OR:
OR of exposure (what we can get)
= OR of disease (what we want)
• Key methodologic advance in epidemiology
– Jerome Cornfield (1951) as method to
quantify smoking’s effect on lung cancer
• First published “modern” case-control study:

– Janet Lane-Claypon (1926) study of
breast cancer
Odds ratio comparing exposure odds in diseased and not diseased

                   Disease
                   Yes     No
Exposure   Yes     a       b
           No      c       d
           Total   a+c     b+d

$$ OR_{exp} = \frac{a/c}{b/d} $$
Important characteristic of an odds ratio

$$ OR_{exp} = \frac{a/c}{b/d} = \frac{a}{c} \times \frac{d}{b} = \frac{ad}{bc} = \frac{a/b}{c/d} = OR_{dis} $$

OR for exposure = OR for disease
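A quick numerical check of this identity, as a sketch in Python with exact fractions. The function names are illustrative only; the cell counts are those of the TZD and fracture table used earlier in these notes.

```python
from fractions import Fraction


def or_exposure(a, b, c, d):
    """Exposure odds in diseased (a/c) over exposure odds in non-diseased (b/d)."""
    return Fraction(a, c) / Fraction(b, d)


def or_disease(a, b, c, d):
    """Disease odds in exposed (a/b) over disease odds in unexposed (c/d)."""
    return Fraction(a, b) / Fraction(c, d)


# a, b = exposed cases and controls; c, d = unexposed cases and controls.
a, b, c, d = 65, 198, 955, 3530
print(or_exposure(a, b, c, d) == or_disease(a, b, c, d))  # both reduce to ad/bc
print(float(or_exposure(a, b, c, d)))
```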

Favorable property of odds ratio #3:
OR is a lifesaver in case-control studies
…but can we do even better?
• We can get OR of disease in a case-control study
• Yet, we spoke earlier about some of the problems with ORs (e.g., interpretation, non-collapsibility)
• Can we get OR in case-control design to represent/estimate even more interpretable measures of association (e.g., risk ratio, hazard ratio)?
What the OR in a case-control study estimates
Depends on underlying cohort and control sampling

Type of cohort: Fixed
Control sampling scheme | Interpretation of OR
Random sample of everyone at time zero (baseline); “case-cohort” | Consistent estimate of risk ratio or hazard ratio
Random sample of person-time at risk each time a case occurs; “incidence density” | Consistent estimate of hazard ratio
Random sample of non-cases after cases have been identified; “prevalent control”, “cumulative”, “epidemic”, “exclusive”, “traditional” | If disease incidence low in all exposure groups: approximation of risk ratio

Type of cohort: Dynamic
Control sampling | Assumptions about exposure prevalence | Interpretation of OR
Each time a case occurs | None | Consistent estimate of hazard ratio
At midpoint of study period | Exposure and covariate prevalence is at steady state or changing linearly over time | Consistent estimate of hazard ratio
At any point, including after all cases identified | a) Exposure and covariate prevalence is at steady state | Consistent estimate of hazard ratio
At any point, including after all cases identified | b) Exposure and/or covariates not at steady state | Hard to interpret odds ratio
OR as estimate of risk ratio
• How can the odds ratio in a case-control study, specifically in context of a fixed cohort study base, estimate the risk ratio?
Notation in a 2 x 2 table for fixed cohort
1 = exposed; 0 = not exposed. Disease occurs over some time T.

Exposure | Disease = Yes | Baseline group
Yes | E1: No. of events (cases) in exposed group | N1: No. of persons in exposed (baseline) group
No | E0: No. of events (cases) in unexposed group | N0: No. of persons in unexposed (baseline) group
Risk ratio in a cohort study
$$ \text{Risk ratio} = \frac{E_1/N_1}{E_0/N_0} = \frac{E_1}{E_0} \times \frac{N_0}{N_1} = \frac{E_1/E_0}{N_1/N_0} $$
In a fixed cohort study with complete and equal follow-up T
• We now have a ratio of 2 odds: Odds of exposure in
those with event (E1/E0) and odds of exposure in the
cohort at baseline (N1/N0).
• How can we estimate these two odds with a case-
control design?
Capturing the events with a case-control design
From a well-defined study base:
– Capture all the incident cases (E) that arise, measure exposure. Can then form the ratio E1/E0
– Or a random sample of the cases will give the same ratio
– This is the odds of exposure in the cases
Estimating risk ratio in a case-control study
$$ \text{Risk ratio} = \frac{E_1/N_1}{E_0/N_0} = \frac{E_1}{E_0} \times \frac{N_0}{N_1} = \frac{E_1/E_0}{N_1/N_0} $$
In a fixed cohort study with complete and equal follow-up T
We have this ratio with the cases: E1/E0
Now we just need an estimate of this ratio: N1/N0
Notation in a 2 x 2 table of a cohort study
Exposure | Disease = Yes | Baseline group
Yes | E1 | N1
No | E0 | N0
Estimating exposure in the baseline cohort
• N1/N0 is the ratio of exposed to unexposed amongst everyone in the study at baseline (time 0). It is the odds of exposure amongst the cohort at baseline.
• Obtain estimate of this ratio by taking a sample of the study at baseline.
Case-cohort study design:
Sample baseline of underlying cohort to get N1/N0
[Figure: timeline of a fixed cohort. Controls sampled at baseline estimate N1/N0; cases arising over time give E1/E0.]
Case-Cohort Sampling
• Control (reference) group is random sample of
cohort at baseline
• Controls used to estimate the odds of exposure in
the study base at time 0 (i.e., estimates N1/N0)
• Control group can be used for > 1 outcome
• Can use same controls later for longer follow-up
gathering more cases
• First formalized by Kupper (1975) and extended by
Prentice (1986)
• Odds ratio estimates risk ratio (or hazard ratio)
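A small simulation can make the case-cohort logic concrete. The sketch below is illustrative only: the cohort is hypothetical (not from any study cited here), the function name is made up, and the subcohort is drawn with a fixed random seed. Note that the disease is deliberately not rare (20% risk in the exposed), yet the OR still lands near the risk ratio.

```python
import random


def case_cohort_or(cohort, subcohort_size, seed=0):
    """OR comparing exposure odds in cases with exposure odds in a baseline subcohort.

    cohort: list of (exposed, is_case) tuples for everyone at baseline.
    """
    rng = random.Random(seed)
    cases = [p for p in cohort if p[1]]
    subcohort = rng.sample(cohort, subcohort_size)  # sampled regardless of later case status
    e1 = sum(1 for exposed, _ in cases if exposed)
    e0 = len(cases) - e1
    n1 = sum(1 for exposed, _ in subcohort if exposed)
    n0 = len(subcohort) - n1
    return (e1 / e0) / (n1 / n0)


# Hypothetical fixed cohort: 2,000 exposed (risk 20%) and 8,000 unexposed (risk 5%).
cohort = (
    [(True, True)] * 400 + [(True, False)] * 1600
    + [(False, True)] * 400 + [(False, False)] * 7600
)
true_risk_ratio = (400 / 2000) / (400 / 8000)
estimate = case_cohort_or(cohort, subcohort_size=1000)
print(f"risk ratio = {true_risk_ratio:.1f}, case-cohort OR = {estimate:.2f}")
```

The estimate differs from the true risk ratio only through sampling error in the subcohort's exposure odds.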
Stata: Case-cohort sampling
• Once incident cases are identified, need a random
sample of the baseline cohort
• Exclude prevalent cases at baseline
• Take random sample of all other participants
• Stata command for random sample:
• sample #, count
• For example, to obtain a sample of 200:
• sample 200, count
Case-cohort design and hazard ratio
For completeness, it is important to note:
• Case-cohort design can also provide an estimate of
the hazard ratio (HR)
• Estimate of HR can even be done with unequal
follow-up times – a more typical setting
• Requires statistical regression models to estimate
(e.g., modified form of proportional hazards model)
– We won’t derive the proof
• HR now the most common measure of association
derived from case-cohort studies
Example: Case-Cohort Design & Risk Ratio
• Question: Relationship between EKG pattern and death
• Study base: Cohort of Dutch civil servants assembled and
examined in 1953. All had EKGs.
– Cases: all deaths; Controls: random sample of cohort at time 0.
• Exposure: Type of ST segment elevation on EKG.
Required tedious manual interpretation of EKG.
– Case-cohort design required fewer EKG interpretations; saved $
• Outcome: All-cause mortality by 15 years
• Analysis: multivariable logistic regression, which yielded
odds ratios which were estimates of risk ratios:
– In women, risk ratio for 15-year all-cause mortality was 0.5 (95%
CI 0.2-1.0) comparing elevated with isoelectric ST elevation.
Schouten et al. BMJ 1992
Case-Cohort Study: Serum 25 Hydroxyvitamin D
and Fractures in Older Men
The present study is a case-cohort study nested within the prospective
design of MrOS. Men without sufficient serum for vitamin D assays
were excluded from all analyses. Of the 5,908 eligible participants,
we randomly selected 1608 men to serve as the sub-cohort. In this
subcohort, two participants were excluded: one participant with
insufficient serum, and another who had 25(OH) vitamin D levels >3
SD above the mean (75.6 ng/ml). The resulting 1606 men constituted
the subcohort for this study.
We observed 435 incident non spine fracture cases (including 81 hip
fractures) in the entire cohort over the 5.3 years of follow-up.
Among these cases, 112 individuals were also sampled within the
subcohort.
Cauley et al. JBMR 2009

Case-cohort sampling within MrOS Cohort
[Figure: timeline of the MrOS cohort. Cohort baseline = 5,908 participants. 1608 men randomly sampled at baseline for blood tests (controls, estimating N1/N0). 435 incident cases of non-spine fracture over follow-up (estimating E1/E0). 112 of 435 cases included in the 1608.]
Assays on 1608 + 435 - 112 = 1931 men. Efficient!
Serum 25 Hydroxyvitamin D and Fractures in Older Men:
Results
ABSTRACT
To test the hypothesis that low serum 25-hydroxyvitamin D
[(25(OH) vitamin D] levels are associated with an
increased risk of fracture we performed a case-cohort
study of 435 men with incident non-spine fractures
including 81 hip fractures and a random subcohort of 1608
men; average follow-up time 5.3 years. Serum 25(OH)
vitamin D2 and D3 were measured on baseline sera…
Modified Cox proportional hazards models were used to
estimate the hazard ratio (HR) of fracture with 95%
confidence intervals. …
Cauley et al. JBMR 2010

Results
[Results table not reproduced. Callout: clearly labelled reference group.]
*Base model adjusting for age, race, clinic, season of blood draw, physical
activity, height, and weight. ** Per SD decrease in Vitamin D
Describing results for quartiles of
vitamin D and fracture
• Highest quartile of vitamin D is the reference group.
Other quartiles of vitamin D are compared to this
reference group.
– Always label reference group, even with dichotomous variable
– Don’t make readers guess/assume the reference group
• HR comparing those in the first to those in the fourth quartile for incident fracture is 1.21.
• “Those in the lowest quartile of serum vitamin D have a
rate of non-spine fracture that is 1.21 times as high as
those in the highest quartile.”
Describing results for continuous exposure
• For continuous exposures placed in regression models in
their native form, HR = association for 1 unit increment in
the exposure. Note: need to describe the units.
• For exposures with a wide range of values, expressing the
measure of association for 1 unit increment in the scale
can often produce very small measures of association
– Humans find these hard to interpret.
• As an alternative, measures of association expressed as:
– “per standard deviation”
• e.g., HR = 1.07 is for a SD decrease in vitamin D
– per some larger increment in the scale
• e.g., “The rate of non-spine fracture is 1.11 times as high for
each 10 ng/ml decrease in vitamin D”
– Note use of decrease rather than increase in exposure
– In any case, the units must always be clearly labeled
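Converting between increments is simple arithmetic on the log scale. The following Python sketch is illustrative (the function name is made up) and assumes the log hazard is linear in the exposure, as when the exposure enters the regression model in its native form. It takes the reported HR of 1.11 per 10 ng/ml decrease and expresses it per 1 ng/ml decrease, which comes out at roughly 1.01 and shows why per-unit estimates can look uninformatively small.

```python
import math


def rescale_hr(hr, from_units, to_units):
    """Convert a hazard ratio per `from_units` of exposure to one per `to_units`.

    Assumes the log hazard is linear in the exposure, as in a proportional
    hazards model with the exposure entered in its native (untransformed) form.
    """
    beta_per_unit = math.log(hr) / from_units
    return math.exp(beta_per_unit * to_units)


hr_per_10 = 1.11  # reported: per 10 ng/ml decrease in vitamin D
hr_per_1 = rescale_hr(hr_per_10, from_units=10, to_units=1)
print(f"HR per 1 ng/ml decrease: {hr_per_1:.3f}")
```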
Some practical concerns in case-cohort design
• What % of baseline participants have specimens (or
other exposures: images, EKGs, etc.) available/archived?
• Are specimens (or images, etc.) missing randomly?

• Previous case-cohort or cross-sectional studies of the
baseline may have used specimens. What is the effect on
distribution of those remaining?
• If baseline accrual was lengthy, will different storage
times affect measurement of exposure and covariates?
OR as estimate of hazard ratio
• How else can the odds ratio in a case-control
study estimate the hazard ratio?
• Particularly in a dynamic cohort where case-cohort sampling is typically not feasible
Incidence density sampling
• Controls are matched to cases on time at risk
– Duration of follow-up time in fixed cohort
– Calendar time in dynamic cohort
• Each time a case occurs, those still in follow-up who did not experience the outcome earlier (the “risk set”) are sampled for controls
• This allows us to calculate an odds ratio, using
conditional logistic regression, which is a
consistent estimate of the hazard ratio
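Hazard ratio in cohort where N1T1 = exposed and N0T0 = unexposed person-time

$$ \text{Hazard ratio} = \frac{\lim_{\Delta t \to 0} \dfrac{E_1}{N_1 \times \Delta t_1}}{\lim_{\Delta t \to 0} \dfrac{E_0}{N_0 \times \Delta t_0}} = \frac{E_1 / E_0}{\lim_{\Delta t \to 0} \dfrac{N_1 \times \Delta t_1}{N_0 \times \Delta t_0}} = \frac{\text{Odds of exposure in cases}}{\text{Odds of exposed person-time in cohort}} $$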
So analogous to estimating risk ratio, we need to estimate the odds of exposed person-time.
If we can estimate this odds in a case-control study, we can estimate a hazard ratio
Hazard ratio estimation within a fixed cohort
[Figure: timeline of a fixed cohort.]
A case-control study with incidence density sampling
within a fixed cohort primary study base
(e.g., within Framingham Cohort)
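[Figure: fixed-cohort timeline. At the time of case i, E1/E0 in the case compared with controls sampled from the risk set gives ORi = HRi.]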
Controls are randomly sampled each time a case is diagnosed from risk set.
Exposed/unexposed person-time in risk set plus exposure odds in case yields
OR that estimates hazard ratio at time of case.
Incidence density sampling within a fixed cohort
primary study base (e.g., within Framingham Cohort)
[Figure: fixed-cohort timeline with HRi, HRii, HRiii and HRiv formed at successive cases and combined into an average HR.]
A weighted average of the “mini” hazard ratios, which are formed at the
occurrence of each case, estimates the overall hazard ratio.
Incidence Density Sampling In Dynamic Cohort
(e.g., SF County residents)
[Figure: dynamic cohort over calendar time, with new residents entering.]
Sampling in a dynamic cohort gives a consistent estimate of ratio of exposed to unexposed person-time in the same way as sampling in a fixed cohort.
OR estimates HR
Incidence density sampling
...In a population-based case-control study in Germany, the authors
determined the effect of alcohol consumption at low-to-moderate levels
on breast cancer risk among women up to age 50 years. The study
included 706 case women whose breast cancer had been newly
diagnosed in 1992-1995 and 1,381 controls matched on date, age,
and residence. In multivariate conditional logistic regression analysis,
the adjusted odds ratios for breast cancer were 0.71 (95% confidence
interval (CI): 0.54, 0.91) for average ethanol intake of 1-5 g/day, 0.67
(95% CI: 0.50, 0.91) for intake of 6-11 g/day, 0.73 (95% CI: 0.51, 1.05)
for 12-18 g/day, 1.10 (95% CI: 0.73, 1.65) for 19-30 g/day, and 1.94
(95% CI: 1.18, 3.20) for > or = 31 g/day. . . These data suggest that
low-level consumption of alcohol does not increase breast cancer risk
in premenopausal women.
Kropp S et al. Low-to-moderate alcohol consumption and breast cancer risk by age 50 years among women in Germany. Am J Epidemiol 2001
Selection of cases and controls
• Subjects eligible for participation were German-speaking women with no
former history of breast cancer who resided in one of two geographic areas
in southern Germany. We attempted to recruit all patients who were
under 51 years of age at the time of diagnosis of incident in-situ or
invasive breast cancer. We compiled cases diagnosed between January 1,
1992, and December 31, 1995, in the Rhein-Neckar-Odenwald study region
(popn of about 1.3 million) and between January 1, 1993, and December 31,
1995, in the Freiburg study region (popn of about 0.9 million), by surveying
38 hospitals that serve the populations of these two regions.
• Controls were selected from random lists of residents supplied by the
population registries. For every recruited patient, two controls matched
according to exact age and study region were immediately contacted by
letter.
• There were 1,020 eligible cases, of whom 1,005 were alive when identified.
Of these living case participants, 706 (70.2 percent) completed the study
questionnaire. Among the 2,257 eligible controls, 1,381 (61.2 percent)
participated.
• Note efficiency. 706 cases + 1381 controls = 2087 participants represent
2.2-million-person population in underlying dynamic real-world cohort.
Incidence density sampling within a dynamic cohort
(German population 1992-1995; 2.2 million)
[Figure: dynamic cohort over calendar time, with new residents entering. 706 incident cases of breast cancer; 1,381 age & residence matched controls.]
Random sample of population each time breast cancer diagnosed
Results reported as odds ratio
[Results table not reproduced. Callout: reference group clearly labeled.]
Hazard (rate) ratio is easier to understand than an odds ratio.

Plasma Insulinlike Growth Factor 1 and
Binding-Protein 3 and
Myocardial Infarction in Women:
A Prospective Study
• Case-control study nested in Nurses Health Study, a large
fixed cohort. Incidence density sampling.
• Appropriately, authors make this statement in the Methods,
with citations: “Conditional logistic regression was used to
estimate odds ratios, which were taken as direct estimates of
rate ratios ….”
Page et al. Clin Chem 2008

With incidence density sampling, report results as
rate ratio (or even better, hazard ratio)
Better title: Rate ratio of myocardial infarction…
Reference group clearly labeled
“those participants with values in the third quartile of IGF1 had 1.5 times the rate of myocardial infarction as those in the lowest quartile”
Practical considerations in incidence density sampling
• Exposure measurement (e.g., biological specimens) availability. If missing, why? Missing at random?
• Date that case occurred is key in order to define the risk
set. Not always easy to define for certain conditions.
• Frequency of observation of underlying cohort (e.g.,
every 6 or 12 months) may be important when nested in
research cohorts and has cost considerations
• Unlike case-cohort sampling, the control group cannot be
used for other case/disease outcomes
Stata: Incidence density sampling
• Stata command to identify controls matched to cases on follow-up time:
Identify as survival data: stset timevar, fail(failvar)
Sample controls from each risk set: sttocc
sttocc, n(3) [Will identify 3 controls per case]
• Adds 3 variables to dataset:
_case Control = 0, Case = 1
_set ID that matches case and control(s)
_time Follow-up time
Biliary Cirrhosis Dataset
First 9 observations
. stset time, fail(d)
“time” is follow-up time; “d” is outcome (death)

+--------------------+
| id time d |
1. | 51 1.582478 1 |
2. | 23 9.032169 0 |
3. | 40 2.286105 1 |
4. | 42 2.078029 1 |
5. | 45 5.31143 1 |
6. | 48 2.633812 0 |
7. | 50 3.931554 0 |
8. | 52 4.843258 0 |
9. | 54 2.105407 0 |
Biliary Cirrhosis Dataset
With 3 controls per case, matched on time
. sttocc, n(3)
| id time d _case _set _time |
|-----------------------------------------------|
1. | 907 6.047913 1 0 1 .02464066 |
2. | 75 9.500342 0 0 1 .02464066 |
3. | 74 6.844627 1 0 1 .02464066 |
4. | 950 .0246407 1 1 1 .02464066 |
5. | 936 1.36345 0 0 2 .02464066 |
|-----------------------------------------------|
6. | 145 3.655031 0 0 2 .02464066 |
7. | 915 2.475017 0 0 2 .02464066 |
8. | 213 .0246407 1 1 2 .02464066 |
9. | 97 5.952087 0 0 3 .05201916 |
10. | 166 3.386721 0 0 3 .05201916 |
|-----------------------------------------------|
11. | 265 6.830938 0 0 3 .05201916 |
12. | 922 .0520192 1 1 3 .05201916 |
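The risk-set sampling idea behind this output can be sketched in a few lines of Python. This is an illustration of the concept, not a description of how sttocc is implemented; the function name and tuple layout are made up for the example, and it uses only the first 9 observations listed earlier, so some risk sets hold fewer than 3 eligible controls.

```python
import random

# First 9 observations of the dataset listed earlier: (id, follow-up time, d).
records = [
    (51, 1.582478, 1), (23, 9.032169, 0), (40, 2.286105, 1),
    (42, 2.078029, 1), (45, 5.31143, 1), (48, 2.633812, 0),
    (50, 3.931554, 0), (52, 4.843258, 0), (54, 2.105407, 0),
]


def risk_set_sample(records, n_controls, seed=0):
    """Return rows (id, _case, _set, _time) from incidence density sampling."""
    rng = random.Random(seed)
    rows = []
    cases = sorted((r for r in records if r[2] == 1), key=lambda r: r[1])
    for set_id, (case_id, case_time, _) in enumerate(cases, start=1):
        # Risk set: everyone else still under follow-up when this case occurs,
        # including people who become cases later.
        risk_set = [r for r in records if r[1] >= case_time and r[0] != case_id]
        controls = rng.sample(risk_set, min(n_controls, len(risk_set)))
        rows += [(r[0], 0, set_id, case_time) for r in controls]
        rows.append((case_id, 1, set_id, case_time))
    return rows


for row in risk_set_sample(records, n_controls=3):
    print(row)
```

As in the Stata listing, a person who later becomes a case (d = 1) can serve as a control (_case = 0) in an earlier risk set.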
Risk or rate difference in case-control study?
• We can obtain an estimate of a risk ratio or hazard ratio in
an appropriately designed case-control study.
• Can we calculate a similar estimate of the risk or rate
difference from a case-control study? Depends on the
design:
• Case-cohort: Can derive disease incidence and difference
measures with proper weighting of contribution from
cases and random sub-cohort observations.
– Methods are becoming easier and starting to be seen in literature
• Incidence density: Theoretically possible if you know the
sampling fractions every time you select controls.
– Complex and almost never done in practice
Example: Cumulative incidence in case-control study
with case-cohort sampling
Lactate and Risk of Incident Diabetes in a Case-Cohort of
the Atherosclerosis Risk in Communities (ARIC) Study
ABSTRACT: We conducted a case-cohort study in the
Atherosclerosis Risk in Communities (ARIC) study at year 9 of
follow-up… Following adjustment for demographic factors, medical
history, physical activity, adiposity, and serum lipids, the hazard in the
highest quartile [of plasma lactate, a marker of oxidative capacity] was
2.05 times the hazard in the lowest quartile (95% CI: 1.28, 3.28).
In addition to the hazard ratio, the authors provide the cumulative incidence of diabetes by quartile of plasma lactate…
Juraschek et al. PLoS ONE 2013

Example: Cumulative incidence in case control study
with case-cohort sampling
Figure 2. Kaplan-Meier cumulative incidence plot with follow-up years as the time axis and incident diabetes as the outcome, stratified by baseline plasma lactate value. [Figure not reproduced.]
How was cumulative incidence derived for each category of baseline lactate level?
Juraschek et al. PLoS ONE 2013

Estimating cumulative incidence in a case-cohort study
• We have full details, from time 0 onward, on the sub-cohort and the incident cases.
• We also know the fraction of cases we sampled (typically 100%) and the fraction of the baseline cohort we sampled for the sub-cohort (usually 5-20%)
[Figure: cohort timeline showing cases and sub-cohort controls over time.]
• We can then reweight the observations for the cases and sub-cohort
to recreate the entire cohort.
• We won’t show the math, but this allows for estimation of measures
of incidence (risk and hazard) and risk and hazard differences
• Illustrates the power of the growing techniques in reweighting
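The reweighting idea can be sketched with simple inverse-probability weights. The Python example below is a deliberately simplified illustration, not the method of any paper cited here: it assumes a hypothetical cohort with complete follow-up (no censoring), all cases sampled, and a sub-cohort that contains exactly its expected number of non-cases. The function name is made up.

```python
def weighted_risk(n_cases, n_subcohort_noncases, sampling_fraction):
    """Risk estimate from case-cohort data by inverse-probability weighting.

    All cases are included (weight 1). Each non-case in the sub-cohort
    stands in for 1 / sampling_fraction non-cases in the full cohort.
    """
    noncases = n_subcohort_noncases / sampling_fraction
    return n_cases / (n_cases + noncases)


# Hypothetical cohort: 2,000 exposed with 400 cases; 8,000 unexposed with 400 cases.
# A 10% baseline sub-cohort is expected to hold 160 exposed and 760 unexposed non-cases.
risk_exposed = weighted_risk(400, 160, 0.10)
risk_unexposed = weighted_risk(400, 760, 0.10)
print(f"risks: {risk_exposed:.2f} vs {risk_unexposed:.2f}")
print(f"risk difference: {risk_exposed - risk_unexposed:.2f}")
```

With the full cohort recreated this way, difference measures become available in addition to ratios. Real analyses also have to handle unequal follow-up, which this sketch ignores.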
Estimating cumulative incidence from case-cohort sampling
We have complete information on a random sample of the complete cohort
• If all incident cases are included, we know when they occurred.
• Since controls are a random sample at baseline, they can represent the experience of full cohort. We know follow-up time for each control. Use this to estimate all participants still at risk when each case occurs. Can then obtain cumulative incidence.
• Can extend this to a design with a random sample of cases.
[Figure: cohort timeline showing cases and controls.]
What if you weren’t clever enough to perform case-
cohort or incidence density sampling in a fixed cohort?
• There still may be a salvageable interpretation of the OR
– Fixed cohort, random sample of non-cases after cases have been identified (“prevalent control”, “cumulative”, “epidemic”, “exclusive”, “traditional”): if disease incidence is low, the OR is an approximation of the risk ratio
“Prevalent control” sampling in a fixed cohort study base
• Odds ratio here is said to be a “close approximation” of risk ratio only if disease occurrence is rare in all groups
• OR is not an “unbiased” or “consistent” estimate of the risk ratio
– exception: risk ratio = 1, in which case the odds ratio = 1
[Figure: fixed-cohort timeline. Cases arise during the observation time; controls are sampled from non-cases at the time of the study.]
Inability to calculate unbiased estimate of risk ratio if controls sampled from non-cases
The E1/E0 ratio is known in all case-control designs.
Sampling only non-cases at a point in time after cases have occurred cannot get an unbiased estimate of N1/N0.
But, if E1 and E0 are small relative to N1 and N0, then we can get a close approximation.
Notation in a 2 x 2 table for a cohort study

                   Disease
                   Yes     No          Baseline total
Exposure   Yes     E1      N1 - E1     N1
           No      E0      N0 - E0     N0

We can only get the ratio (N1 - E1)/(N0 - E0) if we sample controls at end of follow-up.
We want the ratio N1/N0.
Sampling controls after cases identified at
the end of a fixed cohort
• If E1 is small relative to N1, then N1-E1 is very close to
N1 which is what you want. Likewise, if E0 is small
relative to N0, then N0 - E0 closely approximates N0
• If controls are selected among those without disease at end of study, the OR approximates risk ratio only with the rare disease assumption
• Rare disease assumption:
– disease incidence low in all exposure groups (<10%)
– i.e., exposure odds in non-case controls ≈ exposure odds in the cohort at baseline: (N1 - E1)/(N0 - E0) ≈ N1/N0
OR using controls from prevalent non-cases when incidence is high
Follow-up: 2 years

              Cases   Non-cases   Total
Exposed         40       60        100
Unexposed       10       90        100

Risk ratio = (40/100)/(10/100) = 4.0
Using all prevalent non-cases in cohort, the OR (of exposure) is:
OR = (40/10)/(60/90) = 6.0
A random sample of the non-cases would give the same OR.
OR is not an unbiased estimate of risk ratio. In this example, with high
incidence of disease, OR also not a close approximation of risk ratio.
OR using controls from prevalent non-cases when incidence is low
Follow-up: 2 years

              Cases   Non-cases   Total
Exposed          4       96        100
Unexposed        1       99        100

Risk ratio = (4/100)/(1/100) = 4.0
Using all prevalent non-cases in cohort, the OR would be:
OR = (4/1)/(96/99) = 4.13
A random sample from cells b (96) and d (99) will give a ratio equal
to 96/99 and therefore an OR = 4.13. With this low incidence of
disease, the odds ratio is close approximation to true risk ratio =
4.0 (but not an unbiased estimate).
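The two worked examples above can be reproduced with a short Python sketch. The function names are illustrative; the counts are the (E1, N1, E0, N0) values from the high- and low-incidence cohorts shown above.

```python
def prevalent_control_or(e1, n1, e0, n0):
    """OR when controls are drawn from non-cases at the end of follow-up."""
    return (e1 / e0) / ((n1 - e1) / (n0 - e0))


def risk_ratio(e1, n1, e0, n0):
    """Risk in the exposed divided by risk in the unexposed."""
    return (e1 / n1) / (e0 / n0)


# (E1, N1, E0, N0) for the high- and low-incidence cohorts shown above.
for label, counts in (("high incidence", (40, 100, 10, 100)),
                      ("low incidence", (4, 100, 1, 100))):
    print(f"{label}: risk ratio = {risk_ratio(*counts):.2f}, "
          f"OR = {prevalent_control_or(*counts):.3f}")
```

The risk ratio is 4.0 in both cohorts, but the OR only comes close to it when E1 and E0 are small relative to N1 and N0.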
Even if rare disease assumption holds, there are other
problems with this design
• Over time, participants are subject to drop-out and competing events
• Non-cases who are left at the time controls are selected are “battle-tested” in that they have not dropped out or developed a competing event
• Controls thus may not represent study base at time 0 (Result = bias)
[Figure: fixed-cohort timeline. Cases give E1/E0; controls sampled at the time of the study give (N1 - E1)/(N0 - E0).]
Summary
Which regression model to use?
Type of cohort: Fixed
Control sampling scheme | Regression model
Random sample of everyone at time zero (baseline); “case-cohort” | For risk ratio: logistic regression (if complete/equal follow-up). For hazard ratio: proportional hazards (modified)
Random sample of person-time at risk each time a case occurs; “incidence density” | Conditional logistic regression
Random sample of non-cases after cases have been identified; “prevalent control”, “cumulative”, “epidemic”, “exclusive”, “traditional” | Logistic regression
Summary
Which regression model to use?
Type of cohort: Dynamic
Control sampling | Assumptions about exposure prevalence | Regression model
Each time a case occurs | None | Conditional logistic regression
At midpoint of study period | Exposure and covariate prevalence is at steady state or changing linearly over time | Logistic regression
At any point, including after all cases identified | a) Exposure and covariate prevalence is at steady state | Logistic regression
At any point, including after all cases identified | b) Exposure and/or covariates not at steady state | (none listed)
Statistical penalties for sampling the study base
• Case-control design obtains a strategic sample of the
underlying cohort rather than entire underlying cohort
• Using a sample introduces some sampling error compared
to analysis using entire cohort. Manifestations are:
– Reduces precision of the risk ratio, hazard ratio or odds ratio
estimate (i.e., increases standard error and CIs) compared to
analysis of whole cohort
– Point estimate from case-control design will rarely be exactly
equal to the analysis of whole underlying cohort
• The difference is from sampling error; typically, the estimates are close
• Loss of precision offset by large reductions in cost and time of study
How many controls per case?
• Biggest gain from 1 to 2 controls. In general, >4 controls per case not cost effective.
[Figure: power to detect OR of 2 in study with 188 cases; exposure prevalence of 30%; 2-sided alpha 5%, by number of controls per case.]
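The diminishing return from extra controls can be illustrated with a rough power calculation. The Python sketch below is an assumption-laden approximation, not the method used to draw the figure: it treats the design as unmatched, takes the 30% exposure prevalence to apply to controls, and uses a normal approximation for the difference in exposure prevalence between cases and controls. It will not necessarily reproduce the figure's values. The function name is made up.

```python
from statistics import NormalDist


def approx_power(n_cases, controls_per_case, odds_ratio, p_exposed_controls, alpha=0.05):
    """Normal-approximation power for comparing exposure prevalence, cases vs controls."""
    p0 = p_exposed_controls
    # Exposure prevalence in cases implied by the odds ratio.
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    n_controls = n_cases * controls_per_case
    se = (p1 * (1 - p1) / n_cases + p0 * (1 - p0) / n_controls) ** 0.5
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p1 - p0) / se - z_alpha)


powers = {k: approx_power(188, k, odds_ratio=2, p_exposed_controls=0.30) for k in range(1, 7)}
for k, power in powers.items():
    print(f"{k} control(s) per case: power = {power:.3f}")
```

Under these assumptions, each added control per case buys less power than the one before it, because the cases' contribution to the standard error is fixed.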
Presenting results in a case-control study
• Case-cohort:
– If using logistic regression, declare OR is a consistent estimate of
risk ratio and describe as risk
– If using proportional hazards regression, report hazard ratio &
describe as hazard (or rates, if readers are phobic of hazards)
• Incidence density sampling:

– Declare OR is consistent estimate of hazard ratio and describe in
language of hazards (or rates, if readers are phobic of hazards)
• All other situations:

– Only describe as odds
– Mention, if appropriate, if OR is an approximation of the risk ratio
– If using OR, beware that language like “X times as likely to” implies
a comparison of probabilities, not odds
– Abstracts/press releases determine how results are seen by public
Common misunderstandings about
case-control studies
• They can only study one disease outcome.
– Case-cohort sampling can study multiple outcomes.
• Inference is not as valid as from a cohort
– They are equally valid if control sampling represents the
underlying cohort that gave rise to cases
• “Rare disease assumption” is required for OR from case-
control to estimate anything meaningful
– Depending on design, can obtain estimates of the risk ratio or
hazard ratio without rare disease assumption
• It is not possible to obtain exposure measurements that
occur before outcome
– Can use archived biospecimens, medical records, etc.
What is true about case-control studies
• There are typically more opportunities for bias in case-control than in cohort studies
• Relative ease with which they can be done has
encouraged a lot of badly designed studies
• Low cost and shorter time should be an
incentive for better, not worse, design
Case-control design recommendations
• Look for a primary study base that can be clearly
defined and has complete case ascertainment
– Know research study bases available in your field
– Use incidence density or case-cohort sampling in fixed cohort
– In dynamic cohort, use incidence density sampling if feasible.
• Use measurements recorded prior to the diagnosis when possible (medical records, etc.) or perform measurements on stored specimens/records
What does the magnitude of a measure of
association tell you?
• Larger magnitude values referred to as “strong” associations
• What does “large” or “strong” actually mean?
• For both ratio & difference measures, the larger the value, the
less likely the association is the result of occult (unrecognized) bias
– Reasoning: if bias were responsible, it would have to be
large and researchers would presumably have noticed it
• For difference measures, strong associations translate to
smaller numbers needed to treat/harm/protect
– Strong translates directly to public health/clinical impact
• But for neither difference nor ratio measures does “strong”
translate into any particular intrinsic biologic strength
Why does “strong” not translate into biologic meaning?
• “Sufficient-component cause” model of human disease
– Popularized in health research by Rothman (1976 – see CLE reading)
but originated earlier with philosophers
– A disease occurs in a given person because some set of
culprit exposures comes together
• A minimal set of culprit exposures is called a “sufficient cause”
• Each individual culprit exposure is called a “component cause”
– For most diseases, there are several “sufficient causes”
[Figure: A “sufficient cause”. Rothman & Greenland, AJPH 2005]
Why does “strong” not translate into biologic meaning?
Consequence of SCC model:
• Whether exposed persons get disease more often than unexposed
persons (which is what we see in our conventional measures of
association) mainly depends upon the prevalence of the
requisite complementary component causes
• Classic example: Disease called phenylketonuria (PKU)
– Exposures: mutation in gene for phenylalanine hydroxylase (PAH); and,
phenylalanine in diet (persons with mutation cannot break down phenylalanine)
– If phenylalanine is rare in the ambient diet, the PAH mutation genotype would rarely
manifest as PKU (would yield “weak” or absent measure of association)
– If phenylalanine is common in the ambient diet, the PAH mutation genotype would
commonly manifest as PKU (“strong” measure of association)
– Thus, the measures of association that we observe don’t fully tell us about
intrinsic biologic characteristics/ability of a given exposure in causing disease
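The PKU argument can be sketched as a toy calculation. The simplifications here are our assumptions, not part of the slide: disease occurs if and only if both component causes are present, dietary phenylalanine is independent of genotype, and there are no other sufficient causes. The function name and the prevalence values are hypothetical.

```python
def risks_by_genotype(prevalence_dietary_phe):
    """Return (risk in mutation carriers, risk in non-carriers).

    Assumes a single sufficient cause with two components:
    PAH mutation AND phenylalanine in the diet.
    """
    def risk(has_mutation):
        total = 0.0
        # Average over diet status, weighting by its prevalence.
        for has_phe, weight in ((True, prevalence_dietary_phe),
                                (False, 1 - prevalence_dietary_phe)):
            diseased = has_mutation and has_phe
            total += weight * diseased
        return total

    return risk(True), risk(False)


for prevalence in (0.01, 0.50, 0.99):  # illustrative diet prevalences
    carriers, non_carriers = risks_by_genotype(prevalence)
    print(f"diet prevalence {prevalence:.2f}: "
          f"risk difference for mutation = {carriers - non_carriers:.2f}")
```

The biology of the mutation is identical in all three scenarios. Only the prevalence of the complementary cause changes, and with it the observed "strength" of the association.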
An optional reading you might want to peruse
Numeracy
• The ability to understand and work with numbers
• Help your readers improve their numeracy by limiting the
digits in your publication-quality tables, figures, and text
– %’s (e.g., prevalence or cumulative incidence)
• 2 significant digits: 33%, 3.3%, 0.33%
– Measures of association
• 2 digits to right of decimal point, except when over 10:
– prevalence ratio = 1.82 (95% CI: 1.35 to 2.39)
– risk ratio = 0.35 (95% CI: 0.11 to 0.49)
– hazard ratio = 12.3 (95% CI: 10.5 to 14.7)
Measures of Attribution
• So far, we have introduced measures of association that
compare occurrence of an outcome by exposure status,
using ratio or absolute difference.
• We can also ask how relevant the exposure is in causing
the outcome, especially in comparison to other exposures
causing the outcome.
– i.e., how much would the outcome be reduced if
exposure were removed?
• Concept known as “attribution” of outcome to exposure

Terminology Alert
• No field in epidemiology is so full of ambiguous terminology
– Attributable risk
– Attributable rate
– Attributable risk percent
– Attributable fraction
– Excess fraction
– Etiologic fraction
– Excess caseload due to exposure
– Attributable risk in the exposed
– Percent attributable risk in the exposed
– Population attributable risk
– Population attributable risk percent
– Population attributable fraction
– Percent population attributable risk
– Rate fraction
• Terminology is so confusing that you cannot simply choose the term you
like and use it. You always need to spell out what you are doing.
Prerequisites & Introductory Comments
• Don’t bother to consider measure of attribution until:
– Pretty sure that exposure is “causally” related to outcome
– Measure of association free of bias (e.g., no selection,
measurement or confounding bias) (i.e., the number is right)
• Measures of attribution can be expressed in the
context of risks or rates
• All of the terminology boils down to 2 concepts:
– Attribution of outcome to exposure among the exposed
– Attribution of outcome to exposure in a wider population
• Stata’s epitab commands (e.g., csi) will automatically calculate both,
even if not justified
Measures of Attribution:
Terminology we like in context of cumulative incidence (“risk”)**
Scale | Among the exposed | Among a population (exposed and unexposed)
Absolute | “Attributable risk in the exposed”: ARexp* | “Attributable risk in population” or “Population attributable risk”: ARpop or Pop AR
Percentage | “Percent attributable risk in the exposed”: %ARexp | “Percent attributable risk in the population” or “Percent population attributable risk”: %ARpop or %Pop AR
* We earlier called this “risk difference”
** Analogous terms should be used in context of rates
Attributable Risk in the Exposed
[Figure: bar chart of risk at 5 years in unexposed vs. exposed, with ARexp marked]
Attribution among the exposed
• Can be expressed as difference measure, but not insightful
ARexp = risk difference = Incidence_exp - Incidence_unexp
• Or, even better, as a percent of incidence in exposed:
%ARexp = [(Inc_exp - Inc_unexp)/Inc_exp] x 100
• %ARexp can also be calculated from the risk ratio (RR):
%ARexp = [(RR-1)/RR] x 100
– Useful for case-control design where risk ratio is estimated but
not incidence
Example: %AR in exposed
          | Fracture | No fracture | Risk
Exposed   | 800      | 3200        | 0.20
Unexposed | 600      | 5400        | 0.10
Risk ratio = 0.20/0.10 = 2.0
* 5-year study
ARexp = risk difference = 0.20-0.10 = 0.10

% ARexp = [(0.20-0.10)/0.20] x 100 = 50%
= [(RR-1)/RR] x 100 = (2.0-1)/2.0 x 100
= 50%
In Stata: csi 800 600 3200 5400 ( ARexp is part of output)
• Stata feels AR is so important, it is part of default output
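For readers without Stata, the same arithmetic can be sketched in a few lines of Python. The function names are ours (hypothetical), and the counts are those of the example's 2x2 table. This is a minimal sketch of the two formulas, without confidence intervals.

```python
def percent_ar_exposed(risk_exposed, risk_unexposed):
    """Percent attributable risk in the exposed, from the two risks."""
    return (risk_exposed - risk_unexposed) / risk_exposed * 100


def percent_ar_exposed_from_rr(risk_ratio):
    """Same quantity from the risk ratio alone: [(RR - 1) / RR] x 100."""
    return (risk_ratio - 1) / risk_ratio * 100


# Counts from the example's 2x2 table (5-year follow-up).
risk_exposed = 800 / (800 + 3200)
risk_unexposed = 600 / (600 + 5400)
risk_ratio = risk_exposed / risk_unexposed

print(f"ARexp (risk difference): {risk_exposed - risk_unexposed:.2f}")
print(f"%ARexp from risks:       {percent_ar_exposed(risk_exposed, risk_unexposed):.0f}%")
print(f"%ARexp from risk ratio:  {percent_ar_exposed_from_rr(risk_ratio):.0f}%")
```

The second function needs only the risk ratio, which is why it remains usable in a case-control design where incidence itself is not estimated.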
Percent Attributable Risk in the Exposed:
Interpretation
• % ARexp = 50%
• If we remove exposure, the risk of the outcome
in the exposed over 5 years would be reduced by
50%, from 20% to 10%.
• You can say that at least 50% of disease among
the exposed was caused by the exposure. But,
you can’t say that this exposure caused the
outcome in only this 50%. Exposure might have
been a cause in as many as 100% of the
exposed.
Population Attributable Risk
[Figure: bar chart of risk at 5 years in unexposed, exposed, and entire population, with Pop AR marked]
Attribution in the Population
• Pop AR = Incidence_pop - Incidence_unexp
• Most relevant to express this as:
%Pop AR = [(Inc_pop - Inc_unexp)/Inc_pop] x 100
• Can also be calculated using risk ratio (RR) and
prevalence of exposure in population (pe):
%Pop AR = 100 x [pe x (RR-1)] / [pe x (RR-1) + 1]
• To calculate in a case-control study, need
knowledge of exposure prevalence in study base
– Can be gleaned from control group when incidence
density or case-cohort sampling performed
Example: Attribution in the population
          | 5-year risk
Exposed   | 0.20
Unexposed | 0.10
Risk ratio = 0.20/0.10 = 2.0
What is the overall risk in the population? Depends on
prevalence of exposure.
If 40% of the population is exposed, the risk of the outcome
in the population will be (this is a weighted average):
(0.40 x 0.20)+(0.60 x 0.10) = 0.08 + 0.06 = 0.14
Attribution in the population
      | Risk
Exp   | 0.20
Unexp | 0.10
Risk ratio = 0.20/0.10 = 2.0
Prevalence of exposure = 40%
Risk in popn = 0.14
Pop AR = 0.14-0.10 = 0.04

% Pop AR = (0.04/0.14) x 100 = 28.6%
= 100 x [pe x (RR-1)] / [pe x (RR-1) + 1]
= 100 x [0.40 x (2.0-1)] / [0.40 x (2.0-1)+1] = 100 x [0.40/1.40]
= 28.6%
[Our earlier example in Stata: csi 800 600 3200 5400]
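The two routes to %Pop AR can also be checked outside Stata. The sketch below is a minimal illustration with function names of our own choosing (hypothetical). It uses the example's risks (0.20 and 0.10) and its 40% exposure prevalence, and it ignores confidence intervals.

```python
def population_risk(prevalence_exposed, risk_exposed, risk_unexposed):
    """Overall risk as a prevalence-weighted average of the two risks."""
    return (prevalence_exposed * risk_exposed
            + (1 - prevalence_exposed) * risk_unexposed)


def percent_pop_ar(risk_population, risk_unexposed):
    """%Pop AR from risks: [(Inc_pop - Inc_unexp) / Inc_pop] x 100."""
    return (risk_population - risk_unexposed) / risk_population * 100


def percent_pop_ar_from_rr(prevalence_exposed, risk_ratio):
    """%Pop AR from exposure prevalence and the risk ratio."""
    excess = prevalence_exposed * (risk_ratio - 1)
    return 100 * excess / (excess + 1)


risk_pop = population_risk(0.40, 0.20, 0.10)
print(f"risk in population:     {risk_pop:.2f}")
print(f"Pop AR:                 {risk_pop - 0.10:.2f}")
print(f"%Pop AR from risks:     {percent_pop_ar(risk_pop, 0.10):.1f}%")
print(f"%Pop AR from RR and pe: {percent_pop_ar_from_rr(0.40, 2.0):.1f}%")
```

Both routes should agree, since the second formula is the first one rewritten in terms of the risk ratio and exposure prevalence.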
Percent Population Attributable Risk:
Interpretation
• % Pop AR = 29% over 5 years
• If we removed the exposure, the risk of the
outcome in the population over 5 years would be
reduced by 29%, from 14% to 10%.
• As was the case for % ARexp, we can say that at
least 29% of the disease among the population
was caused by the exposure. But, we cannot say
the exposure is responsible for only 29% of all
disease. Exposure might be involved in much
more than 29%.
Prevalence of Exposure and
Population Attributable Risk
[Figure: two bar charts of risk (0 to 0.3) in unexposed, exposed, and popn; left panel: low prevalence of exposure; right panel: high prevalence of exposure]
Higher prevalence of exposure -> larger %Pop AR for same risk ratio
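To see this dependence numerically, the sketch below tabulates %Pop AR over a few exposure prevalences and two risk ratios. The prevalences 0.05 and 0.80 and the risk ratio of 5 are illustrative assumptions. The function name is ours.

```python
def percent_pop_ar(prevalence_exposed, risk_ratio):
    """%Pop AR = 100 x [pe x (RR - 1)] / [pe x (RR - 1) + 1]."""
    excess = prevalence_exposed * (risk_ratio - 1)
    return 100 * excess / (excess + 1)


for risk_ratio in (2.0, 5.0):
    for prevalence in (0.05, 0.40, 0.80):
        value = percent_pop_ar(prevalence, risk_ratio)
        print(f"RR = {risk_ratio:.0f}, pe = {prevalence:.2f}: %Pop AR = {value:.1f}%")
```

For a fixed risk ratio the %Pop AR rises with exposure prevalence, and for a fixed prevalence it rises with the risk ratio.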
Percent Population Attributable Risk:
Example from Framingham
Wolf et al. Stroke 1991

% Population Attributable Risk depends
on exposure prevalence and risk ratio
Attribution among the exposed - Rates
• All of the same equations used for risks in the prior slides
can also be used for rates
• Can be expressed as difference measure, but not insightful
ARateexp = rate difference = Incidence rate_exp - Incidence rate_unexp
• Or, even better, as a percent of incidence rate (IR) in exposed:
%ARexp = [(IR_exp - IR_unexp)/IR_exp] x 100
Example: Percent Attributable Rate in the Exposed
• Background: There are many respiratory infections circulating in
what is known as ‘flu season’. The exact contribution of influenza
virus in causing symptomatic respiratory illness is unclear.
• Findings: Of those infected with influenza virus, there were 69
respiratory illnesses per 100 person-influenza-seasons compared
with 44 per 100 in those not infected with influenza. The
age-adjusted attributable rate of illness if infected was 23 illnesses
per 100 person-seasons (95% CI: 13–34).
– This is just an average incidence rate difference
• Our (better) description: Among persons infected with influenza
virus, if we could have prevented this infection, then we could
have decreased their rate of illness by 33% (95% CI: 19% to 49%).
– i.e., prevention of influenza useful, but there are other pathogens in play
Hayward et al. Lancet Respir Med 2014
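One way to arrive at the 33% figure is to divide the age-adjusted rate difference, and each of its confidence limits, by the rate among the infected. That this is how the percentages were derived is our assumption. It reproduces the reported values, but it treats the rate in the infected as fixed, which a formal interval would not.

```python
# Reported values (illnesses per 100 person-seasons).
rate_infected = 69.0
adjusted_rate_difference = 23.0
ci_low, ci_high = 13.0, 34.0


def percent_of_exposed_rate(difference, rate_exposed):
    """Express a rate difference as a percent of the rate in the exposed."""
    return difference / rate_exposed * 100


percent_attributable_rate = percent_of_exposed_rate(adjusted_rate_difference, rate_infected)
percent_ci = (percent_of_exposed_rate(ci_low, rate_infected),
              percent_of_exposed_rate(ci_high, rate_infected))

print(f"{percent_attributable_rate:.0f}% "
      f"(95% CI: {percent_ci[0]:.0f}% to {percent_ci[1]:.0f}%)")
```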
Measures of Attribution
• Can be expressed with risks or rates, but take on different meanings
depending upon which you use
• Main utility: inform where to most efficiently allocate resources
– Prioritize elimination of exposures with highest % pop AR
• assuming they don’t cost a fortune to alter/eliminate
• Remind us why we need accurate quantitative measures of
association (and not just the qualitative estimate).
– Prior graph: the AR for an RR of 2 differs considerably from that for an RR of 5
– We cannot blithely think that an OR of 5 is the same as a risk ratio of 5
• ARs across a set of exposures for a given disease typically will sum
to more than 100%.
– Explanation is apparent in Rothman’s sufficient-component cause model
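A toy version of that explanation, under assumptions of our own: disease has a single sufficient cause with two component causes, A and B, each present in half the population and independent of each other. Removing either component prevents every case, so each exposure has a %Pop AR of 100%.

```python
from itertools import product

P_A, P_B = 0.5, 0.5  # hypothetical, independent exposure prevalences


def diseased(has_a, has_b):
    """Single sufficient cause: both component causes are required."""
    return has_a and has_b


def population_risk(force_a=None, force_b=None):
    """Risk in the population, optionally forcing an exposure on or off."""
    total = 0.0
    for has_a, has_b in product((True, False), repeat=2):
        weight = (P_A if has_a else 1 - P_A) * (P_B if has_b else 1 - P_B)
        a = has_a if force_a is None else force_a
        b = has_b if force_b is None else force_b
        total += weight * diseased(a, b)
    return total


baseline = population_risk()
pct_pop_ar_a = (baseline - population_risk(force_a=False)) / baseline * 100
pct_pop_ar_b = (baseline - population_risk(force_b=False)) / baseline * 100

print(f"%Pop AR for A: {pct_pop_ar_a:.0f}%")
print(f"%Pop AR for B: {pct_pop_ar_b:.0f}%")
print(f"Sum:           {pct_pop_ar_a + pct_pop_ar_b:.0f}%")
```

The sum exceeds 100% because every case is counted once for each component cause it required.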
Reality Check
• Attribution is a theoretical construct
• Completely removing an exposure is easier said than done
• Even if you removed the exposure today, its prior cumulative effects
may last for years
– Thus, the benefit of removing an exposure today might not be fully
realized until everyone alive today has died
• Furthermore, these attribution measures assume that removing one
exposure would not influence others
– Warning: Among humans, one unhealthy volitional behavior, once
removed, might simply be replaced by another
Increasing use in medical literature
[Figure: number of articles per year (axis 0 to 5000), 1976 to 2020]
Summary of Measures of Association
Design | Ratio (often easiest to assess for strength of association) | Difference (for public health impact, especially of interventions)
Cross-sectional | prevalence ratio; prevalence odds ratio | prevalence difference*; prevalence odds difference*
Cohort | risk ratio; rate ratio; hazard ratio; incidence odds ratio* | risk difference; rate difference; hazard difference*; incidence odds difference*
Case-control | odds ratio (can estimate risk ratio or hazard ratio) | risk & rate difference in case-cohort if observations weighted properly*
* Rarely used
SUMMARY: Measures of Attribution:
In context of cumulative incidence (“risk”)
Scale | Among the exposed | Among a population (exposed and unexposed)
Absolute | “Attributable risk in the exposed”: ARexp* | “Attributable risk in population” or “Population attributable risk”: ARpop or Pop AR
Percentage | “Percent attributable risk in the exposed”: %ARexp | “Percent attributable risk in the population” or “Percent population attributable risk”: %ARpop or %Pop AR
* We earlier called this “risk difference”
** Analogous terms used in context of rates
Additional Slides
• Case-control study with incident cases in secondary study base
• Case-control study with prevalent cases in secondary study base
• Unbiased vs. Consistent Estimator
• Etiologic fraction
Secondary Study Base
• In the lecture, we looked at case-control designs in context of a
primary study base.
• Because researchers can “get their hands around” primary
study bases, they are relatively easy to sample to find controls.
Case-control study via a secondary study base:
• Identify incident cases
• Then attempt to identify study base that gave rise to cases in
order to sample controls
• Study base is always a dynamic cohort
• Possible to describe conceptually but difficult to identify
individual members of that study base in practice.
Case-Control Design in Secondary Study Base
[Figure: hypothetical cohort, with new members entering, contributing exposed (N1T1) and unexposed (N0T0) person-time; cases and controls are drawn from it]
Example: Case-Control Study with
Prevalent cases in Secondary Study Base
• Study of the effect of cortical porosity (on high
resolution CT scan) on fracture in older women
• “Subjects were eligible for inclusion as fracture cases if they had
a documented history of a low-trauma vertebral or nonvertebral
fracture that occurred after menopause… Control subjects had no
history of low-trauma fractures and no vertebral deformity on
lateral radiographs.”
• We can report an OR as measure of association
for cortical porosity and fracture, but it’s not an
approximation of risk ratio
• Why?
Stein et al. J Bone Miner Res. 2010
Unbiased vs. Consistent Estimator
• Recall that an “estimator” is some computation performed with
data from a study sample that attempts to estimate some
parameter in the source population
• In the statistics literature, estimators are defined as biased vs
unbiased; and consistent vs not consistent
• An estimator is unbiased if, on average, the estimate
corresponds to the true value. This is unrelated to sample size.
That is, the average value for any given sample size is not
systematically different from the true value.
• An estimator is consistent if, with increasing sample size, the
estimate converges towards the true value. This concept is
related to sample size.
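A small simulation can make the distinction concrete. The example is a standard textbook case and is not taken from the lecture: the variance estimator that divides by n (rather than n - 1) is biased at any fixed sample size, yet its bias shrinks as n grows. The seed, sample sizes, and replicate counts are arbitrary choices.

```python
import random


def variance_divide_by_n(sample):
    """Variance estimator that divides by n (biased for finite n)."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample) / len(sample)


def average_estimate(n, replicates, rng):
    """Average the estimator over many samples of size n from N(0, 1)."""
    total = 0.0
    for _ in range(replicates):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += variance_divide_by_n(sample)
    return total / replicates


rng = random.Random(2022)  # fixed seed so the run is repeatable
small = average_estimate(n=5, replicates=20000, rng=rng)
large = average_estimate(n=500, replicates=400, rng=rng)

# True variance is 1; the expected value of this estimator is (n - 1) / n.
print(f"average estimate, n = 5:   {small:.2f}")
print(f"average estimate, n = 500: {large:.2f}")
```

At n = 5 the average sits near 0.8 rather than 1, which is the bias. At n = 500 it is close to 1, in line with consistency (which also requires the spread of the estimates to shrink, not shown here).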
Perhaps the most meaningful metric of “strength” of an
exposure in causing disease is one we cannot estimate
• It would be useful to be able to fully describe all of the
“component causes” in all of the “sufficient causes” that
resulted in disease
[Figure: One “sufficient cause”]
• The fraction of all “sufficient causes” that contain a particular
component cause (say cause “B”) is known as the etiologic fraction
• Etiologic fraction has intuitive appeal; e.g., it tells what fraction of all
instances of disease were (in part) caused by exposure B.
• Unfortunately, in any given person with disease we cannot describe
the sufficient cause. Hence, we cannot estimate etiologic fraction.
– This is because we cannot fully describe how disease occurred in a given
person. This is a limitation of modern medicine, pathology, & epidemiology.
– The exception is when a cause is NECESSARY in all instances of disease
