Propensity Scores: A Practical Introduction Using R

ISSN 1556-8180
http://www.jmde.com

Antonio Olmos, University of Denver
Priyalatha Govindasamy, University of Denver

Background: This paper provides an introduction to propensity scores for evaluation practitioners.

Purpose: The purpose of this paper is to provide the reader with a conceptual and practical introduction to propensity scores, matching using propensity scores, and their implementation using the statistical program R.

Setting: Not applicable

Intervention: Not applicable

Research Design: Not applicable

Data Collection and Analysis: Not applicable

Findings: In this demonstration paper, we describe the context in which propensity scores are used, including the conditions under which their use is recommended, as well as the basic assumptions needed for a correct implementation of the technique. Next, we describe some of the more common techniques used to conduct propensity score matching. We conclude with a description of the recommended steps associated with the implementation of propensity score matching using several packages developed in R, including syntax and brief interpretations of the output associated with every step.

Keywords: propensity score analysis; propensity score matching; R.
Introduction
The aim of this paper is to provide the reader with a conceptual and practical introduction to propensity scores, matching using propensity scores, and their implementation using a statistical program. We start with a description of the context in which propensity scores have been used, the basic assumptions needed to use propensity scores, and a brief description of some of the most useful techniques for propensity score matching. We then provide a detailed description of how to estimate propensity scores, how to match using propensity scores, and brief examples of the results of implementing propensity score matching using several packages developed in R.
Given the importance of causality, and the
requisites needed to assign it with confidence,
evidence-based programs tend to rely on the use of
experimental approaches. The experimental
approach has two characteristics: 1) it manipulates
the independent variable, that is, whether or not an
individual receives the intervention under
scrutiny; and 2) individuals are randomly assigned
to the levels of the independent variable. The first characteristic
does not define the experimental approach: most
of the so-called quasi-experiments (Shadish et al.,
2002) also manipulate the independent variable.
What defines the experimental method is the use
of random assignment. In particular, the use of
random assignment helps to establish causality by
improving the chances that we have ruled out
alternative explanations. Another way to think
about the importance of random assignment is
that it increases the chances that groups are
probabilistically balanced on some variables that
otherwise may affect the final outcome
(D'Agostino & D'Agostino, 2007; Shadish et al.,
2002), and, therefore, the Neyman-Rubin
counterfactual framework holds. For example, in
an obesity-reduction program, there may be
several reasons for weight loss (such as peer
support, level of motivation), that are not
associated with the intervention, and that may
affect weight reduction. Balancing through
random assignment becomes important because
then we can determine with a high degree of
certainty that the reason we observe the weight
change in this obesity reduction program is
because of the intervention, and not because of
some other reason (Bonell et al., 2009).
However, if one of these two conditions
(manipulating the independent variable, or
random assignment to rule out alternative
explanations) is not met, our confidence about the
causal relationship between independent and
dependent variable is substantially reduced. There
are several reasons why we may not be able to
meet these two conditions: 1) Despite the use of
random assignment, equivalent groups are not
achieved. 2) Due to ethical or logistical reasons
random assignment is not possible (Bonell et al.,
2009).
The first reason is known as randomization
failure (Bonell et al., 2009), and sometimes can go
undetected. Usual reasons why randomization can
fail are associated with missing data that occur
in a systematic way. In the obesity
reduction example, some individuals in the control
group may drop out because they are not losing
weight. Or individuals in the treatment group may
drop out because they lost the weight they had as a
goal, and therefore are not motivated to continue
in the program. One alternative in these cases is the
regression discontinuity design, which preserves
our confidence about causality by selecting
individuals into either the control or treatment
condition based on a cutoff score. Another
alternative when random assignment fails, or
when we cannot randomly assign people to
treatment conditions because of ethical or
logistical reasons, is propensity scores.
Propensity Scores
Propensity score analysis is a statistical technique that has proven useful for evaluating treatment effects when using quasi-experimental or observational data (Austin, 2011; Rubin, 1983). Some of the benefits associated with propensity scores are: (a) they create adequate counterfactuals when random assignment is infeasible or unethical, or when we are interested in assessing treatment effects from survey, census, administrative, or other types of data, where we cannot assign individuals to treatment conditions (Austin, 2011); (b) the development and use of propensity scores reduces the number of covariates needed to control for external variables (thus reducing dimensionality) and increases the chances of a match for every individual in the treatment group; and (c) the development of a propensity score is associated with the selection model, not with the outcomes model; therefore, the adjustments are independent of the outcome.
Propensity scores are defined as the conditional probability of assigning a unit to a particular treatment condition (i.e., the likelihood of receiving treatment), given a set of observed covariates:

P(z = i | X)

where z = treatment, i = treatment condition, and X = covariates. In a two-group (treatment, control) experiment with random assignment, the probability of each individual in the sample being assigned to the treatment condition is P(z = 1 | X) = 0.5. In a quasi-experiment, the probability P(z = 1 | X) is unknown, but it can be estimated from the data using a logistic regression model, where treatment assignment is regressed on the set of observed covariates (the so-called selection model). The propensity score then allows matching of individuals in the control and treatment conditions with the same likelihood of receiving treatment. Thus, a pair of participants (one in the treatment group, one in the control group) sharing a similar propensity score are seen as equal, even though they may differ on the specific values of the covariates (Holmes, 2014).
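As a language-neutral sketch of the selection model just described (the paper's own examples use R; all data, names, and optimizer settings below are hypothetical), the following Python snippet fits a logistic regression of treatment assignment on a single covariate by gradient ascent and returns the estimated propensity scores:

```python
import numpy as np

def estimate_propensity(X, z, iters=2000, lr=0.5):
    """Selection model: logistic regression of treatment z on covariates X,
    fit by gradient ascent; returns the estimated P(z = 1 | X)."""
    Xb = np.column_stack([np.ones(len(z)), X])   # add an intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))     # current fitted P(z = 1 | X)
        beta += lr * Xb.T @ (z - p) / len(z)     # log-likelihood gradient step
    return 1.0 / (1.0 + np.exp(-Xb @ beta))

# Hypothetical data: treatment uptake increases with a standardized covariate.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(300, 1))
z = (rng.random(300) < 1.0 / (1.0 + np.exp(-1.5 * x[:, 0]))).astype(float)

ps = estimate_propensity(x, z)
# Treated units should, on average, carry higher propensity scores.
```

Matching then pairs treated and control units with similar values of `ps`, much as `matchit()` does with its default logistic-regression distance.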
Endogeneity
In order to work, regression models need to meet
some assumptions (Draper & Smith, 1998). One of
them calls for independence between the
independent variables in the model and the error
term. Violations of this assumption are usually
associated with omitted variables. That is, there is
some other variable that is not included in the
model which is correlated with both the dependent
and the independent variable(s). Omitted
variables are one of the major problems in non-experimental (observational/quasi-experimental)
studies, because if we do not take them into
account, they will create a biased estimate of the
effect. That is, our interpretation of the regression
model will either under-estimate or over-estimate
the relationship between the independent and
dependent variables. Omitted variables represent a
form of endogeneity which affects our ability to
establish accurate causal relationships.
Propensity scores rely on the Neyman-Rubin counterfactual model (Rubin, 2005). This framework relies on one important assumption, known as the Ignorable Treatment Assignment Assumption, which states that, conditional on covariates, the assignment of study participants to treatment conditions is independent of the (potential) outcomes:

(Y(1), Y(0)) ⊥ z | X
Under random assignment, this assumption holds,
but it does not necessarily hold under quasi-experiments, where it may be important to
investigate how participants were assigned to the
treatment conditions. Although some of the
processes by which individuals select/are assigned
to specific treatment conditions can be examined
empirically, a full account of treatment selection is
sometimes impossible (e.g., when subjects are
motivated to select one treatment condition, and
the researcher does not have a valid measure/is
not aware of their motivation).
If we cannot determine all the reasons why a
participant is assigned to a treatment, then we will
have an endogeneity problem (Morgan & Winship,
2012). Thus it is important to make sure that we
can identify all of the reasons why participants are
in the treatment or control conditions.
Conventional Matching in R
Conventional forms of matching allow researchers
to create two groups of individuals (control,
treatment groups) that are matched on variables
that are believed to be critical in the selection
process, thus creating counterfactuals for the
individuals in the opposite group. Below, we
briefly describe two of the conventional ways of
matching, as well as R code to conduct it.
However, a more thorough description of the
interpretation will be included in the section:
"Implementing a propensity score analysis with
R." The following code and examples use the
dataset lalonde, included and described in the
packages MatchIt (Ho, Imai, King & Stuart, 2011)
and Matching (Sekhon, 2011).
m.mahal <- matchit(treat ~ age + educ + nodegree + re74 + re75,
                   data = lalonde,
                   mahvars = c("age", "educ", "nodegree", "re74", "re75"),
                   caliper = 0.25,
                   replace = FALSE,
                   distance = "mahalanobis")
summary(m.mahal)
Figure 1. Conventional matching using Mahalanobis distance with the package MatchIt.
In this figure, it can be observed that a caliper is included (i.e., caliper = 0.25). As recommended by Rosenbaum and Rubin (1985), the default is 0.25σp (a quarter of a standard deviation of the propensity score).
Greedy Matching
This type of matching is called greedy because the
match for a participant in the treatment group is
based on the first case of the control group that
meets the criteria for matching. Even if that
participant in the control group would serve as a
better match for a subsequent participant in the
treatment group, the match will still be based on
the first case. Most algorithms will select
participants from both the control and treatment
groups at random; thus, the match from one run to
the next may differ.
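The greedy behavior described above can be illustrated with a small Python toy (the paper's examples use R; the function and unit names here are hypothetical). Each treatment case claims the closest unused control within the caliper, in order, even when that control would be a better match for a later case:

```python
def greedy_match(treated, controls, caliper=0.25):
    """Greedy 1:1 nearest-neighbor matching on propensity scores.
    treated, controls: dicts mapping unit id -> propensity score."""
    available = dict(controls)
    pairs = {}
    for t_id, t_ps in treated.items():
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        if abs(available[c_id] - t_ps) <= caliper:
            pairs[t_id] = c_id
            del available[c_id]   # greedy: this control is now spent
    return pairs

# T1 claims control B even though B is an exact match for T2.
pairs = greedy_match({"T1": 0.50, "T2": 0.52},
                     {"A": 0.30, "B": 0.52, "C": 0.90})
print(pairs)  # {'T1': 'B', 'T2': 'A'}
```

Because real implementations draw the processing order at random, repeated runs can yield different matched sets, which is the run-to-run variability noted above.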
Near Neighbor with Caliper Matching
Similarly to the case defined for Mahalanobis distance, in near-neighbor matching, individuals who are not close in terms of their propensity scores can sometimes be matched. Thus, in this case, a near-neighbor match can be considered only if the absolute distance between treatment and control participants meets the condition |Pi − Pj| < ε, where Pi and Pj are the propensity scores for the treatment and control participants, and ε is a predetermined caliper. Participants are randomly ordered; afterwards, the Mahalanobis distances for the participants in the control and treatment groups are calculated using the combination of variables (x) and the propensity score (Guo & Fraser, 2015). Figure 5 presents the commands to estimate the propensity scores, and the Mahalanobis distance plus propensity scores, using MatchIt. The steps are recounted below.
Optimal Matching
This is a more complex approach to propensity score matching, made feasible by fast computer processing speeds. The main goal of this approach is to find the matched sample that minimizes the total propensity score distance across all matched pairs, rather than matching each treatment case one at a time.
#---Full matching
m.fl <- matchit(treat ~ age + educ + nodegree + re74 + re75,
                data = lalonde,
                method = "full",
                min.controls = 1,
                max.controls = 10,
                discard = "both")
summary(m.fl)
Figure 7. Full matching using MatchIt.
Just like for optimal matching, MatchIt calls the package optmatch (Hansen et al., 2013). Figure 7 also shows that a minimum number and a maximum number of control cases can be specified.

Implementing a Propensity Score Analysis with R

In this section, we provide some suggestions for the implementation of a propensity score analysis. We use the statistical program R (R Core Team, 2014), because it has multiple packages intended to calculate propensity scores, interpret the results using both statistical and graphical procedures, and estimate post-matching outcome analyses. The analysis of quasi-experimental or observational data using propensity scores involves the development of two models: 1) the so-called selection model (which is intended to balance the groups using variables that affect only the selection process), and 2) the outcomes model (which will include variables that are associated with the outcomes only). The result of the selection model affects the outcome model through the propensity scores. Thus, several of the steps described below are aimed exclusively at the selection process, and others are intended for the outcomes model:
1. Preliminary analysis
2. Estimation of propensity scores
3. Propensity score matching
4. Outcome analysis
5. Sensitivity analysis
1. Preliminary Analysis
Before propensity scores are calculated, it is good practice to determine whether the two groups are balanced. Both statistical (e.g., the standardized difference, a chi-square test) and graphical procedures can be used to determine the degree of imbalance. For the standardized difference, absolute scores higher than 25% are considered suspect, and may indicate an imbalance for that specific variable (Stuart & Rubin, 2008). A statistically significant chi-square will indicate that at least one of the variables included in the model is creating an imbalance between the two groups. Variables that create imbalance should be included in the selection model.
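The standardized-difference check above can be computed directly. Here is a minimal Python sketch with hypothetical data (the paper's analyses are in R), using the common formula d = 100(mean_t − mean_c) / sqrt((var_t + var_c) / 2):

```python
import math

def standardized_difference(treated, control):
    """Standardized difference (in %) between group means."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):                      # sample variance
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    pooled_sd = math.sqrt((var(treated) + var(control)) / 2)
    return 100 * (mean(treated) - mean(control)) / pooled_sd

# Hypothetical ages: the groups clearly differ, so |d| far exceeds 25%.
d = standardized_difference([25, 27, 29, 31, 33], [35, 38, 40, 42, 45])
print(round(d, 1))  # -314.3
```

An absolute value above 25 on a covariate flags it for inclusion in the selection model.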
#---Attach the predicted propensity score to the datafile
lalonde$psvalue <- predict(ps, type = "response")

#---Back-to-back histogram
histbackback(split(lalonde$psvalue, lalonde$treat),
             main = "Propensity score before matching",
             xlab = c("control", "treatment"))
Figure 11. Back-to-back histogram using Hmisc.
Important parameters to determine the fit are not
only the shape, but also the degree of overlap between
the two distributions, known as the common
support region (Lehner, 2008). Matching is best
when there is a common support region.
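Under the usual min/max definition, the common support region is simply the overlap of the two propensity score ranges; a small Python sketch with hypothetical values (the paper's analyses are in R):

```python
def common_support(ps_treated, ps_control):
    """Common support: the overlap of the two propensity score ranges.
    Units with scores outside it have no plausible match."""
    low = max(min(ps_treated), min(ps_control))
    high = min(max(ps_treated), max(ps_control))
    if low > high:
        return None                   # the distributions do not overlap
    return (low, high)

print(common_support([0.4, 0.6, 0.9], [0.1, 0.3, 0.7]))  # (0.4, 0.7)
```

The wider this interval relative to both distributions, the more treatment cases can find a usable match.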
## [1] "To identify the units, use first mouse button; to stop, use second."
## integer(0)
Figure 13. Jitter-type plot, package MatchIt.
As can be observed in this figure, the section
labeled "Unmatched Control Units" shows that
most of the non-matched individuals were in the
lower (0.0 to 0.4) part of the propensity scores.
However, there were a few cases in a higher range
(0.5-0.6).
It is important to determine that the groups
are balanced, thus eliminating (or substantially
reducing) the initial selection bias. In step 1 in this
section (preliminary analysis) it was mentioned
that there are both statistical as well as graphical
approaches that can be used to determine the
degree of imbalance. After the match has been
conducted, both techniques are used again to
determine that all the critical variables have been
balanced. Figure 14 shows the output after the
original match.
histbackback(split(match.data$psvalue, match.data$treat),
             main = "Propensity score after matching",
             xlab = c("control", "treatment"))
Figure 15. Back-to-back histogram after matching.
As can be observed in this figure, there is a
remarkable improvement in the match between
the two distributions of propensity scores after the
match (compared to Figure 11, which shows the
histograms for the same data before the match).
This match suggests that the two groups are much
more similar in terms of their propensity scores,
and thus, the selection bias has been reduced
substantially.
4. Outcomes Analysis
Once the researcher is satisfied with the
propensity score matching, it is time to proceed
with the outcome model. Several of the more
frequently used techniques such as near neighbor,
and Mahalanobis distances, can be used with
analytic techniques such as linear regression
models, ANCOVA, or even matched t-tests.
However, the selection of any analytic approach to estimate the treatment effect and its statistical significance should take into account the fact that the matched observations are not independent.
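For instance, a matched t-test treats each treatment-control pair as a single observation of the outcome difference. A minimal Python sketch with hypothetical matched outcomes (the paper's analyses are in R):

```python
import math

def paired_t(y_treated, y_control):
    """Paired t statistic on matched outcomes: t = mean(d) / (sd(d) / sqrt(n)),
    where d are the within-pair differences."""
    d = [a - b for a, b in zip(y_treated, y_control)]
    n = len(d)
    dbar = sum(d) / n
    s2 = sum((x - dbar) ** 2 for x in d) / (n - 1)   # sample variance of d
    return dbar / math.sqrt(s2 / n)

# Five hypothetical matched pairs of post-treatment outcomes.
t = paired_t([10, 12, 9, 14, 11], [8, 11, 9, 12, 10])
print(round(t, 2))  # 3.21
```

The pairing is what distinguishes this from an independent-samples test: the error term uses the variance of the within-pair differences rather than the pooled group variances.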
5. Sensitivity Analysis
A question that any evaluator who uses propensity
score matching should ask herself is: how sensitive
are these results to hidden bias? Rosenbaum
(2002, 2005) recommends that researchers try to
answer this question by conducting a sensitivity
analysis. The idea is to determine how susceptible
the results presented might be to the presence of
biases not identified by the researcher or removed
by the matching. Rosenbaum (2002) developed
methods to determine bias through several nonparametric tests such as McNemar's and
Wilcoxon's signed rank tests. Keele (2015)
developed the package rbounds, which estimates
the sensitivity of the results to hidden bias.
rbounds can compute a sensitivity analysis straight
from the package Matching (Sekhon, 2011). For
other propensity scores packages, some file
reformatting needs to be completed before it can
be submitted to rbounds. Figure 17 shows the sensitivity analysis using Wilcoxon's signed rank test.
library("Matching")
attach(lalonde)
Y <- lalonde$re78
Tr <- lalonde$treat
ps <- glm(treat ~ age + educ + nodegree + re74 + re75 + married +
            black + hispan,
          data = lalonde, family = binomial())

#---Match without replacement
Match <- Match(Y = Y, Tr = Tr, X = ps$fitted, replace = FALSE)

#---Run the sensitivity test based on the matched sample
#---using Wilcoxon's signed rank test
psens(Match, Gamma = 2, GammaInc = 0.1)
##
##  Rosenbaum Sensitivity Test for Wilcoxon Signed Rank P-Value
##
##  Unconfounded estimate ....  0.2858
##
##  Gamma   Lower bound   Upper bound
##    1.0        0.2858        0.2858
##    1.1        0.1338        0.4904
##    1.2        0.0541        0.6809
##    1.3        0.0194        0.8227
##    1.4        0.0063        0.9113
##    1.5        0.0019        0.9595
##    1.6        0.0005        0.9828
##    1.7        0.0001        0.9932
##    1.8        0.0000        0.9975
##    1.9        0.0000        0.9991
##    2.0        0.0000        0.9997
##
##  Note: Gamma is Odds of Differential Assignment To
##  Treatment Due to Unobserved Factors
##
Figure 17. Sensitivity analysis using Wilcoxon's signed rank test.
In Figure 17, the value of Gamma is interpreted as the odds of differential treatment assignment due to hidden bias. A change in the lower/upper bounds from significant to non-significant (or vice-versa) indicates by how much the odds need to change before the statistical significance of the outcome shifts. For example, in Figure 17, the lower bound estimate changes from non-significant (0.0541) to significant (0.0194) when Gamma is 1.3. That is, a change of 0.3 in the odds will produce a change in the significance value. Rosenbaum (2002) defines a study as sensitive if values of Gamma close to 1 lead to changes in significance compared to those that could be obtained if the study is free of bias. Thus, results will be more robust to hidden bias if a very large change in the odds is needed before a change in statistical significance happens.

Limitations

In spite of the benefits described above, propensity scores have important limitations. Endogeneity problems are not controlled. That is, researchers still need to be able to identify and measure many if not all the variables associated with treatment
assignment/selection. There is a possibility that
one can use proxies to address some of these
variables; for example, using age as a proxy for
general state of health (Gagne, 2010). However, it
is not clear to what extent proxies can alleviate this
problem. Rosenbaum (2005) and Guo and Fraser
(2015) suggest the use of sensitivity analyses to
explore the extent to which the results can be
trusted, as identification of all associated variables
is unlikely.
Also important is the fact that there needs to
be a strong overlap of the distributions of
propensity scores between the two groups (the so-called common support region). If the overlap is small, such that there may not be enough participants in the control group to match all the participants in the treatment group, then propensity score matching will be no better than any standard form of matching.
Given that lack of overlap is most often associated with a similar (or smaller) sample for the control group, it is recommended that the sample size for the control group be at least 3-4 times larger than that for the treatment group to assure matches in the common support region. A larger
sample for the control group also increases the
number of matches for every treatment
participant.
Conclusion
Propensity scores can provide an alternative that
can strengthen quasi-experiments and
observational studies in their quest to demonstrate
causality. In particular, they are intended to
identify the probabilities associated with
assignment to treatment conditions, and match
participants based on those probabilities. This
matching in particular helps directly with one of
the four requirements associated with causal
inference (the counterfactual), and indirectly with
another (ruling out alternative explanations).
However, the use of propensity scores requires a
deep understanding and measurement of all the
variables that can affect selection into groups.
Furthermore, if any variable that can be critical for
the selection into treatment is not included in the
propensity scores, then the propensity scores will
not be able to eliminate selection bias. Finally, a
sensitivity analysis is always recommended as a
way to determine how robust the results are.
References
Austin, P. C., Grootendorst, P., & Anderson, G. M.
(2007). A comparison of the ability of different
propensity score models to balance measured
variables between treated and untreated
subjects: A Monte Carlo study. Statistics in
Medicine, 26, 734-753.
Austin, P. C. (2008). A critical appraisal of propensity score matching in the medical literature between 1996-2003. Statistics in Medicine, 27, 2037-2049. doi: 10.1002/sim.3150
Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46, 399-424. doi: 10.1080/00273171.2011.568786
Bai, H., & Clark, M. H. (2012, October). Propensity score matching: Theories and applications. Workshop presented at the American Evaluation Association, Minneapolis, MN.
Bowers, J., Fredrickson, M., & Hansen, B. (2014).
RItools: Randomization Inference Tools. R
package version 0.1-12.
Bonell, C. P., Hargreaves, J., Cousens, S., Ross, D.,
Hayes, R., Petticrew, M., & Kirkwood, B. R.
(2009). Journal of Epidemiology Community
Health, 1-6. doi: 10.1136/jech.2008.082602
Caliendo, M., & Kopeinig, S. (2008). Some practical guidance for the implementation of propensity score matching. Journal of Economic Surveys, 22(1), 31-72. doi: 10.1111/j.1467-6419.2007.00527.x
Campbell, D. T., & Stanley, J. C. (1963).
Experimental and quasi-experimental designs
for research. United States of America:
Houghton Mifflin Company.
Cochran, W. G., & Rubin, D. B. (1973). Controlling
bias in observational studies: A review. Sankhyā: The Indian Journal of Statistics, Series A, 35(4), 417-446.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin Company.
D'Agostino, R. B., & D'Agostino, R. B. (2007). Estimating treatment effects using observational data. Journal of the American Medical Association, 297(3), 314-316.
Drake, R. E., Goldman, H. H., Leff, H. S., Lehman, A. F., Dixon, L., Mueser, K. T., & Torrey, W. C. (2001). Implementing evidence-based practices in routine mental health service settings. Psychiatric Services, 52(2), 179-182.
Draper, N. R., & Smith, H. (1998). Applied
regression analysis. (3rd ed.). United States of
America: John Wiley & Sons, Inc.
Gagne, J. J. (2010). High-dimensional propensity
scores for comparative effectiveness research.
Presentation at the Lewin Summit, June 15, 2010.
Gliner, J. A., Morgan, G. A., & Leech, N. L. (2009). Research methods in applied settings (2nd ed.). Mahwah, NJ: Lawrence Erlbaum.
Gu, X. S., & Rosenbaum, P. R. (1993). Comparison
of multivariate matching methods: Structures,
distances, and algorithms. Journal of
Computational and Graphical Statistics, 2(4),
405-420.
Guo, S., & Fraser, M. W. (2015). Propensity
score analysis: Statistical methods and
applications (2nd ed.). Thousand Oaks, CA:
Sage Publications, Inc.
Guskey, T. (1999). The age of our accountability.
Journal of Staff Development, 19(4), 36-44.
Hansen, B. B., Fredrickson, M., Bertsekas, D., & Tseng, P. (2013). Package optmatch. R package version 0.8-1.
Hansen, B. B. (2004). Full matching in an observational study of coaching for the SAT. Journal of the American Statistical Association, 99(467). doi: 10.1198/016214504000000647
Hansen, B. B., & Bowers, J. (2008). Covariate
balance in simple, stratified and clustered
comparative studies. Statistical Science, 23(2),
219-236. doi:10.1214/08-STS254
Harrell, F. E. (2015). Hmisc: Harrell Miscellaneous. R package version 3.15-0.
Rubin, D. B. (1979). Using multivariate matched
sampling and regression adjustment to control
bias in observational studies. Journal of the
American Statistical Association, 74(366),
318-328.
Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469), 322-331.
Scriven, M. (1991). Evaluation Thesaurus.
Thousand Oaks, CA: Sage
Sekhon, J. S. (2011). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1-52.
Shadish, W. R., Cook, T. D., & Campbell, D. T.
(2002). Experimental and quasi-experimental
design for generalized causal inference.
Boston: Houghton Mifflin Company.
Stuart, E. A., & Rubin, D. B. (2008). Best practices
in quasi-experimental design: Matching
methods for causal inference. In Osborne, J.
Best Practices in Quantitative Methods (pp.
155-177). Thousand Oaks, CA: Sage.
Stuart, E. A. (2010). Matching methods for causal
inference: A review and a look forward.
Statistical Science, 25(1), 1-21.
Trochim, W. M. K. (1984). Research design for
program evaluation. Thousand Oaks, CA:
Sage.
Weiss, C. H. (1998). Evaluation: Methods for
Studying Programs and Policies. Upper Saddle River, NJ: Prentice Hall.
Zhao, Z. (2004). Using matching to estimate treatment effects: Data requirements, matching metrics, and Monte Carlo evidence. Review of Economics and Statistics, 86(1), 91-107.