Disease Association II and Attribution 2022
• Prediction
– Do A, B, and C predict occurrence of Y? (e.g., diagnosis or prognosis)
• Everything in this course relates to designing studies that will
accomplish a descriptive or analytic goal.
How to measure disease association in
case-control designs?
• In cross-sectional or cohort studies, we compare
the occurrence of disease in exposed vs unexposed
– This is the most intuitive approach to evaluating
causal/predictive effect of the exposure on outcome
– It emphasizes temporality of exposure and disease.
Temporality refers to the timing of events. If one thing
comes before another, it is easier to reason that the first
thing caused the second thing (as opposed to the other
way around).
• In case-control design, we typically can’t do this
What can we estimate in a case-control study?
Case-control study of TZD use & incident fracture in
diabetes [TZD = thiazolidinedione, a class of diabetes
drugs]. Dynamic study base: UK General Practice Research
Database.
Study with 3-4 controls per case.

                Fracture    No fracture
TZD use              65            198
No TZD use          955           3530
Total              1020           3728

What happens if we try to estimate incidence of fracture in either exposure group?
“Probability of fracture event, by exposure”

                Fracture    No fracture
TZD use              65            198
No TZD use          955           3530
Total              1020           3728

If we try to estimate probability or odds of fracture in either exposure group, the result is nonsense. It depends on whether the study selected 4 controls per case, or 2 controls per case, etc.
We can, however, work down the table.
                 Disease
                 Yes      No
Exposure  Yes     a        b
          No      c        d
Total            a+c      b+d

$$\mathrm{OR}_{exp} = \frac{a/c}{b/d}$$
Important characteristic of an odds ratio
$$\mathrm{OR}_{exp} = \frac{a/c}{b/d} = \frac{a}{c} \times \frac{d}{b} = \frac{a/b}{c/d} = \mathrm{OR}_{dis}$$
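The symmetry above can be checked numerically. The following is a minimal Python sketch using the TZD and fracture counts from the earlier table; the function name `odds_ratio` and the "half as many controls" scenario are assumptions added for illustration, not part of the lecture.

```python
def odds_ratio(a, b, c, d):
    """2x2 table: a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    exposure_or = (a / c) / (b / d)  # working down the columns
    disease_or = (a / b) / (c / d)   # working across the rows
    return exposure_or, disease_or

# TZD use and fracture counts from the slide
or_exp, or_dis = odds_ratio(65, 198, 955, 3530)
print(round(or_exp, 2), round(or_dis, 2))

# Hypothetical scenario: the study had sampled half as many controls
or_half, _ = odds_ratio(65, 198 / 2, 955, 3530 / 2)

# The naive "risk" in TZD users changes with the control sampling fraction,
# which is why it is not interpretable; the odds ratio does not change.
naive_risk_full = 65 / (65 + 198)
naive_risk_half = 65 / (65 + 198 / 2)
print(round(naive_risk_full, 3), round(naive_risk_half, 3), round(or_half, 2))
```

The two odds ratios agree and are unaffected by how many controls are drawn per case, whereas the row-wise "risk" moves with the sampling fraction.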
Sampling controls at any point, including after all cases identified:
a) Exposure and covariate prevalence is at steady state: consistent estimate of hazard ratio
b) Exposure and/or covariates not at steady state: hard to interpret odds ratio
OR as estimate of risk ratio
Notation:
E1 = No. of events (cases) in exposed group
N1 = No. of persons in exposed group (baseline)
E0 = No. of events (cases) in unexposed group
N0 = No. of persons in unexposed group (baseline)
Risk ratio in a cohort study
$$\text{Risk ratio} = \frac{E_1/N_1}{E_0/N_0} = \frac{E_1 N_0}{E_0 N_1} = \frac{E_1/E_0}{N_1/N_0}$$
In a fixed cohort study with complete and equal follow-up T
• We now have a ratio of 2 odds: Odds of exposure in
those with event (E1/E0) and odds of exposure in the
cohort at baseline (N1/N0).
• How can we estimate these two odds with a case-
control design?
Capturing the events with a case-control design
• All of the cases give the ratio E1/E0
– Or a random sample of the cases will give the same ratio
– This is the odds of exposure in the cases
Estimating risk ratio in a case-control study
$$\text{Risk ratio} = \frac{E_1/N_1}{E_0/N_0} = \frac{E_1 N_0}{E_0 N_1} = \frac{E_1/E_0}{N_1/N_0}$$
In a fixed cohort study with complete and equal follow-up T
We have this ratio with the cases: E1/E0
Now we just need an estimate of this ratio: N1/N0
Notation in a 2 x 2 table of a cohort study
                 Disease
                 Yes      No         Total
Exposure  Yes    E1       N1 - E1    N1
          No     E0       N0 - E0    N0
Estimating exposure in the baseline cohort
• N1/N0 is the ratio of exposed to unexposed amongst everyone in the study at baseline (time 0). It is the odds of exposure amongst the cohort at baseline.
[Figure: cohort followed over time; labels: cases (E1/E0), controls (N1/N0); x-axis: Time]
Case-Cohort Sampling
• Control (reference) group is random sample of
cohort at baseline
• Controls used to estimate the odds of exposure in
the study base at time 0 (i.e., estimates N1/N0)
• Control group can be used for > 1 outcome
• Can use same controls later for longer follow-up
gathering more cases
• First formalized by Kupper (1975) and extended by
Prentice (1986)
• Odds ratio estimates risk ratio (or hazard ratio)
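To see why the case-cohort odds ratio estimates the risk ratio, here is a small Python sketch. The cohort counts (100 exposed with 40 events, 100 unexposed with 10 events) and the idealized 20% subcohort that preserves the baseline exposure split exactly are assumptions chosen for illustration; a real random sample would reproduce N1/N0 only approximately.

```python
def risk_ratio(e1, n1, e0, n0):
    """Risk ratio in the full cohort: (E1/N1) / (E0/N0)."""
    return (e1 / n1) / (e0 / n0)

def case_cohort_or(e1, e0, sub_exposed, sub_unexposed):
    """Exposure odds in cases divided by exposure odds in the baseline subcohort."""
    return (e1 / e0) / (sub_exposed / sub_unexposed)

# Hypothetical full cohort
rr = risk_ratio(e1=40, n1=100, e0=10, n0=100)

# Idealized 20% subcohort drawn at baseline: 20 exposed, 20 unexposed
or_cc = case_cohort_or(e1=40, e0=10, sub_exposed=20, sub_unexposed=20)

print(rr, or_cc)
```

Because the subcohort is drawn from everyone at baseline (including people who later become cases), its exposure odds estimate N1/N0, so no rare disease assumption is needed here.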
Stata: Case-cohort sampling
• Once incident cases are identified, need a random
sample of the baseline cohort
• Exclude prevalent cases at baseline
• Take random sample of all other participants
• Stata command for random sample:
• sample #, count   (keeps a random sample of # observations; without the count option, # is read as a percentage)
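For readers without Stata, the same steps can be sketched in Python with the standard library. The field names `id` and `prevalent_case`, the function name `draw_subcohort`, and the toy cohort are all hypothetical; this is an illustration of the sampling logic, not the course's own code.

```python
import random

def draw_subcohort(participants, n_sub, seed=0):
    """Exclude prevalent cases at baseline, then draw a simple random
    sample of n_sub participants without replacement."""
    eligible = [p for p in participants if not p["prevalent_case"]]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(eligible, n_sub)

# Toy baseline cohort of 1000 people; every 10th is a prevalent case (assumption)
cohort = [{"id": i, "prevalent_case": (i % 10 == 0)} for i in range(1000)]
subcohort = draw_subcohort(cohort, n_sub=200)
print(len(subcohort))
```

Note that the subcohort is drawn without regard to who later becomes an incident case, so some subcohort members will also be cases.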
[Figure: case-cohort timeline; 435 incident cases of non-spine fracture; controls estimate N1/N0]
ABSTRACT
To test the hypothesis that low serum 25-hydroxyvitamin D
[25(OH) vitamin D] levels are associated with an
increased risk of fracture we performed a case-cohort
study of 435 men with incident non-spine fractures
including 81 hip fractures and a random subcohort of 1608
men; average follow-up time 5.3 years. Serum 25(OH)
vitamin D2 and D3 were measured on baseline sera…
Modified Cox proportional hazards models were used to
estimate the hazard ratio (HR) of fracture with 95%
confidence intervals. …
[Table annotation: clearly labelled reference group]
*Base model adjusting for age, race, clinic, season of blood draw, physical
activity, height, and weight. ** Per SD decrease in Vitamin D
Describing results for quartiles of
vitamin D and fracture
• Highest quartile of vitamin D is the reference group.
Other quartiles of vitamin D are compared to this
reference group.
– Always label reference group, even with dichotomous variable
– Don’t make readers guess/assume the reference group
Describing results for continuous exposure
• For continuous exposures placed in regression models in
their native form, HR = association for 1 unit increment in
the exposure. Note: need to describe the units.
• For exposures with a wide range of values, expressing the
measure of association for 1 unit increment in the scale
can often produce very small measures of association
– Humans find these hard to interpret.
• As an alternative, measures of association expressed as:
– “per standard deviation”
• e.g., HR = 1.07 is for a SD decrease in vitamin D
– per some larger increment in the scale
• e.g., “The rate of non-spine fracture is 1.11 times as high for
each 10 ng/ml decrease in vitamin D”
– Note use of decrease rather than increase in exposure
– In any case, the units must always be clearly labeled
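Re-expressing a hazard ratio for a different increment is simple arithmetic when the exposure enters the model linearly on the log-hazard scale, which is an assumption made here. A minimal Python sketch, using the slide's HR of 1.11 per 10 ng/ml decrease (the function name `rescale_hr` is hypothetical):

```python
import math

def rescale_hr(hr, from_increment, to_increment):
    """Convert a hazard ratio reported per `from_increment` units into one
    per `to_increment` units, assuming a log-linear exposure effect."""
    return math.exp(math.log(hr) * to_increment / from_increment)

hr_per_10 = 1.11                          # per 10 ng/ml decrease (from the slide)
hr_per_1 = rescale_hr(hr_per_10, 10, 1)   # per 1 ng/ml decrease
print(round(hr_per_1, 3))
```

The per-unit value sits very close to 1, which illustrates why a larger increment (or a standard deviation) is easier for readers to interpret.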
Some practical concerns in case-cohort design
• What % of baseline participants have specimens (or
other exposures: images, EKGs, etc.) available/archived?
Incidence density sampling
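$$\text{Hazard ratio} = \frac{\lim_{\Delta t \to 0} \dfrac{E_1}{N_1 \times \Delta t_1}}{\lim_{\Delta t \to 0} \dfrac{E_0}{N_0 \times \Delta t_0}} = \frac{E_1 / E_0}{\lim_{\Delta t \to 0} \dfrac{N_1 \times \Delta t_1}{N_0 \times \Delta t_0}} = \frac{\text{Odds of exposure in cases}}{\text{Odds of exposed person-time in cohort}}$$
[Figure: cohort timeline; x-axis: Time]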
A case-control study with incidence density sampling
within a fixed cohort primary study base
(e.g., within Framingham Cohort)
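[Figure: risk set at the time of each case; labels: E1/E0, OR_i = HR_i; x-axis: Time]
Each time a case is diagnosed, controls are randomly sampled from the risk set. The ratio of exposed to unexposed person-time in the risk set, combined with the exposure odds in the case, yields an OR that estimates the hazard ratio at the time of that case.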
Incidence density sampling within a fixed cohort
primary study base (e.g., within Framingham Cohort)
A weighted average of the “mini” hazard ratios, which are formed at the
occurrence of each case, estimates the overall hazard ratio.
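The sampling step can be sketched in Python as follows. The field names (`id`, `exit_time`, `is_case`), the function name, and the five-person toy cohort are hypothetical; the sketch only shows how a risk set is formed and sampled at each case's event time, not how the matched sets are analyzed.

```python
import random

def incidence_density_sample(cohort, controls_per_case=1, seed=0):
    """For each case, sample controls from the risk set at the case's event time.
    exit_time is the time of the event (for cases) or end of follow-up (otherwise)."""
    rng = random.Random(seed)
    matched_sets = []
    cases = sorted((p for p in cohort if p["is_case"]), key=lambda p: p["exit_time"])
    for case in cases:
        t = case["exit_time"]
        # Risk set: everyone still under follow-up at time t, other than the case.
        # People who become cases later are eligible as controls now.
        risk_set = [p for p in cohort if p["exit_time"] >= t and p["id"] != case["id"]]
        k = min(controls_per_case, len(risk_set))
        controls = rng.sample(risk_set, k)
        matched_sets.append((case["id"], [c["id"] for c in controls]))
    return matched_sets

cohort = [
    {"id": 1, "exit_time": 2.0, "is_case": True},
    {"id": 2, "exit_time": 5.0, "is_case": False},
    {"id": 3, "exit_time": 3.5, "is_case": True},
    {"id": 4, "exit_time": 1.0, "is_case": False},
    {"id": 5, "exit_time": 5.0, "is_case": False},
]
sets = incidence_density_sample(cohort, controls_per_case=2)
print(sets)
```

Person 4 leaves follow-up before the first case occurs and so can never be sampled, while person 3 can serve as a control for case 1 before becoming a case.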
Incidence Density Sampling In Dynamic Cohort
(e.g., SF County residents)
[Figure: dynamic cohort with new residents entering over calendar time]
OR estimates HR
Incidence density sampling
...In a population-based case-control study in Germany, the authors
determined the effect of alcohol consumption at low-to-moderate levels
on breast cancer risk among women up to age 50 years. The study
included 706 case women whose breast cancer had been newly
diagnosed in 1992-1995 and 1,381 controls matched on date, age,
and residence. In multivariate conditional logistic regression analysis,
the adjusted odds ratios for breast cancer were 0.71 (95% confidence
interval (CI): 0.54, 0.91) for average ethanol intake of 1-5 g/day, 0.67
(95% CI: 0.50, 0.91) for intake of 6-11 g/day, 0.73 (95% CI: 0.51, 1.05)
for 12-18 g/day, 1.10 (95% CI: 0.73, 1.65) for 19-30 g/day, and 1.94
(95% CI: 1.18, 3.20) for > or = 31 g/day. . . These data suggest that
low-level consumption of alcohol does not increase breast cancer risk
in premenopausal women.
[Figure: 706 incident cases of breast cancer and sampled controls over calendar time]
Random sample of population each time breast cancer diagnosed
Results reported as odds ratio
Reference group clearly labeled
How was cumulative incidence derived for each category of baseline lactate level?
[Figure: cases and controls over time]
• We have data, from time 0 onward, on the sub-cohort and the incident cases.
• We know what fraction of the cases we sampled (typically 100%) and what fraction of the baseline cohort we sampled into the sub-cohort (usually 5-20%)
• We can then reweight the observations for the cases and sub-cohort
to recreate the entire cohort.
• We won’t show the math, but this allows for estimation of measures
of incidence (risk and hazard) and risk and hazard differences
• Illustrates the power of the growing techniques in reweighting
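A rough idea of the reweighting, without the full math, can be given in Python. This sketch uses one simple inverse-probability weighting scheme (cases weighted by 1 over the case sampling fraction, non-cases by 1 over the sub-cohort sampling fraction) and ignores follow-up time entirely; the simulated cohort, the 5% risk, and all names are assumptions for illustration.

```python
import random

def weighted_cumulative_incidence(records, subcohort_fraction, case_fraction=1.0):
    """records: sampled data only (incident cases plus sub-cohort non-cases).
    Each observation is weighted by the inverse of its sampling probability."""
    w_case = 1.0 / case_fraction
    w_noncase = 1.0 / subcohort_fraction
    cases = sum(w_case for r in records if r["is_case"])
    noncases = sum(w_noncase for r in records if not r["is_case"])
    return cases / (cases + noncases)

rng = random.Random(1)
# Hypothetical full cohort of 20,000 with a true risk near 5%
full = [{"is_case": rng.random() < 0.05} for _ in range(20000)]
true_risk = sum(r["is_case"] for r in full) / len(full)

# Keep every case; keep each non-case with probability 10% (the sub-cohort)
frac = 0.10
sampled = [r for r in full if r["is_case"] or rng.random() < frac]
est = weighted_cumulative_incidence(sampled, frac)
print(round(true_risk, 4), round(est, 4))
```

Even though only about a seventh of the cohort is observed, the weighted estimate lands close to the risk in the full cohort.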
Estimating cumulative incidence from case-cohort sampling
• If all incident cases are included, we know when they occurred.
• We have complete information on a random sample of the complete cohort
• Since controls are a random sample at baseline, they can represent the experience of full cohort. We know follow-up time for each control. Use this to estimate follow-up time for all participants.
[Figure: cases and controls over time]
• OR is not an “unbiased” or
“consistent” estimate of the risk ratio
– exception: risk ratio = 1, in which
case the odds ratio = 1
                 Disease
                 Yes      No         Total
Exposure  Yes    E1       N1 - E1    N1
          No     E0       N0 - E0    N0

We want this ratio: N1/N0 (the Total column)
Sampling controls after cases identified at
the end of a fixed cohort
• If E1 is small relative to N1, then N1-E1 is very close to
N1 which is what you want. Likewise, if E0 is small
relative to N0, then N0 - E0 closely approximates N0
Follow-up: 2 years

              Cases    Non-cases    Total
Exposed         40        60         100
Unexposed       10        90         100

Risk ratio = (40/100)/(10/100) = 4.0
Using all prevalent non-cases in cohort, the OR (of exposure) is:
OR = (40/10)/(60/90) = 6.0
A random sample of the non-cases would give the same OR.
OR is not an unbiased estimate of risk ratio. In this example, with high
incidence of disease, OR also not a close approximation of risk ratio.
OR using controls from prevalent non-cases when incidence low

Follow-up: 2 years

              Cases    Non-cases
Exposed          4        96
Unexposed        1        99

Risk ratio = (4/100)/(1/100) = 4.0
Using all prevalent non-cases in cohort, the OR would be:
OR = (4/1)/(96/99) = 4.13
A random sample from cells b (96) and d (99) will give a ratio equal
to 96/99 and therefore an OR = 4.13. With this low incidence of
disease, the odds ratio is close approximation to true risk ratio =
4.0 (but not an unbiased estimate).
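The two worked examples can be reproduced with a few lines of Python. The function names are hypothetical; the counts are those on the slides.

```python
def risk_ratio(a, b, c, d):
    """a, b = exposed cases / non-cases; c, d = unexposed cases / non-cases."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Exposure odds in cases divided by exposure odds in non-cases."""
    return (a / c) / (b / d)

high_incidence = (40, 60, 10, 90)
low_incidence = (4, 96, 1, 99)

for label, table in [("high", high_incidence), ("low", low_incidence)]:
    print(label, round(risk_ratio(*table), 2), round(odds_ratio(*table), 3))
```

With the same risk ratio of 4.0 in both cohorts, the odds ratio moves from 6.0 to about 4.13 as incidence falls, which is the rare disease approximation at work.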
Even if rare disease assumption holds, there are other
problems with this design
[Figure: cases and controls over time; labels: E1, E0, N1, N0]
• The non-cases from which controls are selected are “battle-tested” in that they have not dropped out or developed a competing event
• Controls thus may not represent study base at time 0 (Result = bias)
Which regression model to use?
Type of Cohort | Control Sampling | Assumptions about Exposure Prevalence | Regression Model
Dynamic | Each time a case occurs (“incidence density”) | None | Conditional logistic regression
Dynamic | At any point, including after all cases identified | a) Exposure and covariate prevalence is at steady state; b) Exposure and/or covariates not at steady state | Logistic regression
Statistical penalties for sampling the study base
• Case-control design obtains a strategic sample of the
underlying cohort rather than entire underlying cohort
• Using a sample introduces some sampling error compared
to analysis using entire cohort. Manifestations are:
– Reduces precision of the risk ratio, hazard ratio or odds ratio
estimate (i.e., increases standard error and CIs) compared to
analysis of whole cohort
– Point estimate from case-control design will rarely be exactly
equal to the analysis of whole underlying cohort
• The difference is from sampling error; typically, the estimates are close
A “sufficient cause” (Rothman & Greenland, AJPH 2005)
Why does “strong” not translate into biologic meaning?
Consequence of the sufficient-component cause (SCC) model:
• Whether a given exposed person gets disease more often than an
unexposed person (which is what we see in our conventional
measures of association) mainly depends upon the prevalence of the
requisite complementary component causes
• Classic example: Disease called phenylketonuria (PKU)
– Exposures: mutation in gene for phenylalanine hydroxylase (PAH); and,
phenylalanine in diet (persons with mutation cannot break down phenylalanine)
– If phenylalanine rarely in ambient diet, PAH mutation genotype would rarely
manifest in disease-PKU (would yield “weak” or absent measure of association)
– If phenylalanine common in ambient diet, the PAH mutation genotype would
commonly manifest in disease-PKU (“strong” measure of association)
– Thus, the measures of association that we observe don’t fully tell us about
intrinsic biologic characteristics/ability of a given exposure in causing disease
An optional reading you might want to peruse
Numeracy
• The ability to understand and work with numbers
– Measures of association
• 2 digits to right of decimal point, except when over 10:
– prevalence ratio = 1.82 (95% CI: 1.35 to 2.39)
– risk ratio = 0.35 (95% CI: 0.11 to 0.49)
– hazard ratio = 12.3 (95% CI: 10.5 to 14.7)
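The rounding convention can be captured in a small Python helper. Using one decimal place for values of 10 or more is inferred from the hazard ratio example above, and treating exactly 10 as "over 10" is an assumption; the function name is hypothetical.

```python
def format_estimate(name, est, lo, hi):
    """Format a ratio measure with its 95% CI: two decimals below 10,
    one decimal at 10 or above (threshold handling is an assumption)."""
    def fmt(x):
        return f"{x:.1f}" if x >= 10 else f"{x:.2f}"
    return f"{name} = {fmt(est)} (95% CI: {fmt(lo)} to {fmt(hi)})"

print(format_estimate("prevalence ratio", 1.8234, 1.351, 2.388))
print(format_estimate("hazard ratio", 12.31, 10.52, 14.68))
```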
Measures of Attribution
• So far, we have introduced measures of association that
compare occurrence of an outcome by exposure status,
using ratio or absolute difference.
*We earlier called this “risk difference” ** analogous terms should be used in context of rates
Attributable Risk in the Exposed
[Figure: bar chart of risk at 5 years in unexposed vs. exposed; the difference between the bars is labeled ARexp]
Attribution among the exposed
• Can be expressed as difference measure, but not insightful
ARexp = risk difference = Incidence(exposed) - Incidence(unexposed)
[Figure: bar chart of risk; the difference between the bars is labeled Pop AR]
Attribution in the Population
• Pop AR = Incidence(population) - Incidence(unexposed)
Higher prevalence of exposure -> larger %Pop AR for same risk ratio
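The attribution measures, and the effect of exposure prevalence on %Pop AR, can be illustrated with a short Python sketch. The percent versions use the standard definitions (difference divided by incidence in the exposed or in the population), which are not spelled out in the text above; the risks of 0.25 and 0.10 and the two prevalences are hypothetical values, and the function name is an assumption.

```python
def attribution_measures(risk_exposed, risk_unexposed, prevalence_exposed):
    """Absolute and percent attributable risk, among the exposed and in the population."""
    # Population incidence is a prevalence-weighted average of the two group risks
    risk_pop = prevalence_exposed * risk_exposed + (1 - prevalence_exposed) * risk_unexposed
    ar_exp = risk_exposed - risk_unexposed
    pop_ar = risk_pop - risk_unexposed
    return {
        "ar_exp": ar_exp,
        "pct_ar_exp": 100 * ar_exp / risk_exposed,
        "pop_ar": pop_ar,
        "pct_pop_ar": 100 * pop_ar / risk_pop,
    }

# Same risk ratio (2.5) in both scenarios; only the exposure prevalence differs
low_prev = attribution_measures(0.25, 0.10, prevalence_exposed=0.10)
high_prev = attribution_measures(0.25, 0.10, prevalence_exposed=0.50)
print(round(low_prev["pct_pop_ar"], 1), round(high_prev["pct_pop_ar"], 1))
```

The attributable risk among the exposed is identical in both scenarios, while the percent population attributable risk is roughly three times larger when half the population is exposed.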
Percent Population Attributable Risk:
Example from Framingham
• All of the same equations used for risks in the prior slides
can also be used for rates
– i.e., prevention of influenza useful, but there are other pathogens in play
Hayward et al. Lancet Respir Med 2014
Measures of Attribution
• Can be expressed with risks or rates, but take on different meanings
depending upon which you use
• Main utility: inform where to most efficiently allocate resources
– Prioritize elimination of exposures with highest % pop AR
• assuming they don’t cost a fortune to alter/eliminate
• AR’s across a set of exposures for a given disease typically will sum
to more than 100%.
– Explanation is apparent in Rothman’s sufficient-component cause model
Measures of Attribution:
Reality Check
• Attribution is a theoretical construct
• Completely removing an exposure is easier said than done
• Even if you removed the exposure today, its prior cumulative effects
may last for years
– Thus, removing an exposure today might not take full effect
until everyone alive today has died
• Furthermore, these attribution measures assume that removing one
exposure would not influence others
– Warning: Among humans, one unhealthy volitional behavior, once removed,
might simply be replaced by another
Measures of Attribution:
Increasing use in medical literature
[Figure: number of articles per year, 1976 to 2020; y-axis: Number of articles, 0 to 5000]
Summary of Measures of Association
Design | Ratio (often easiest to assess for strength of association) | Difference (for public health impact, especially of interventions)
* Rarely used
SUMMARY: Measures of Attribution:
In context of cumulative incidence (“risk”)
Scale | Among the exposed | Among a population (exposed and unexposed)
Absolute | “Attributable risk in the exposed”: ARexp* | “Attributable risk in population” or “Population attributable risk”: ARpop or Pop AR
Percentage | “Percent attributable risk in the exposed”: %ARexp | “Percent attributable risk in the population” or “Percent population attributable risk”: %ARpop or %Pop AR
*We earlier called this “risk difference” ** analogous terms used in context of rates
Additional Slides
• Etiologic fraction
Secondary Study Base
• In the lecture, we looked at case-control designs in context of a
primary study base.
• Because researchers can “get their hands around” primary
study bases, they are relatively easy to sample to find controls.
Case-control study via a secondary study base:
• Identify incident cases
• Then attempt to identify study base that gave rise to cases in
order to sample controls
• Study base is always a dynamic cohort
• Possible to describe conceptually but difficult to identify
individual members of that study base in practice.
Case-Control Design in Secondary Study Base
[Figure: hypothetical dynamic cohort with new members entering; labels: cases (E1/E0), controls (N1T1/N0T0)]
Example: Case-Control Study with
Prevalent cases in Secondary Study Base
• Study of the effect of cortical porosity (on high
resolution CT scan) on fracture in older women
• “Subjects were eligible for inclusion as fracture cases if they had
a documented history of a low-trauma vertebral or nonvertebral
fracture that occurred after menopause… Control subjects had no
history of low-trauma fractures and no vertebral deformity on
lateral radiographs.”
• We can report an OR as measure of association
for cortical porosity and fracture, but it’s not an
approximation of risk ratio
• Why?
Stein et al. J Bone Miner Res. 2010
Unbiased vs. Consistent Estimator
• Recall that an “estimator” is some computation performed with
data from a study sample that attempts to estimate some
parameter in the source population
• In the statistics literature, estimators are defined as biased vs
unbiased; and consistent vs not consistent
• An estimator is unbiased if, on average, the estimate
corresponds to the true value. This is unrelated to sample size.
That is, the average value for any given sample size is not
systematically different from the true value.
• An estimator is consistent if, with increasing sample size, the
estimate converges towards the true value. This concept is
related to sample size.
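The distinction can be made concrete with a small simulation. The example below swaps in a standard textbook estimator that is not from the lecture: the sample variance computed with divisor n, which is biased at every finite sample size yet consistent. The seed, sample sizes, and function names are arbitrary choices for illustration.

```python
import random

def variance_divisor_n(sample):
    """Sample variance with divisor n (biased but consistent for the true variance)."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

rng = random.Random(42)

# Bias: average the estimate over many small samples (n = 5) from N(0, 1).
# The true variance is 1, but the long-run average is about (n - 1)/n = 0.8.
reps = 20000
avg_small_n = sum(
    variance_divisor_n([rng.gauss(0, 1) for _ in range(5)]) for _ in range(reps)
) / reps

# Consistency: a single estimate from one very large sample is close to 1.
one_large_sample = variance_divisor_n([rng.gauss(0, 1) for _ in range(100000)])

print(round(avg_small_n, 2), round(one_large_sample, 2))
```

Averaging over more small samples never removes the shortfall, whereas increasing the size of a single sample does shrink the error, which is the sense in which the two properties differ.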
Perhaps the most meaningful metric of “strength” of an
exposure in causing disease is one we cannot estimate
• It would be useful to be able to fully
describe all of the “component
causes” in all of the “sufficient
causes” that resulted in disease
[Figure: one “sufficient cause”]