23 STS890
Carly Lupton Brantner is a PhD Candidate, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, USA (e-mail: clupton1@jhu.edu). Ting-Hsuan Chang is a PhD Student, Department of Biostatistics, Columbia Mailman School of Public Health, New York, New York 10032, USA (e-mail: tc3255@cumc.columbia.edu). Trang Quynh Nguyen is an Associate Scientist, Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, USA (e-mail: trang.nguyen@jhu.edu). Hwanhee Hong is an Associate Professor, Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina 27710, USA (e-mail: hwanhee.hong@duke.edu). Leon Di Stefano is a PhD Candidate, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, USA (e-mail: lds@jhu.edu). Elizabeth A. Stuart is Professor, Departments of Biostatistics, Mental Health, and Health Policy and Management, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, USA (e-mail: estuart@jhu.edu).
1. INTRODUCTION
Identifying the right treatment for the right patient can improve quality of healthcare for individuals and populations. Treatments for disorders and diseases like depression (Trivedi et al., 2006), schizophrenia (Samara et al., 2019), and diabetes (Xie, Chan and Ma, 2018) can exhibit differential treatment effects across individuals due to effect moderators, defined as known and unknown individual, genetic, environmental, and other characteristics that are associated with the effectiveness of medical treatments (Baron and Kenny, 1986). Finding ways to identify
INTEGRATING DATA FOR EFFECT HETEROGENEITY 641
and leverage effect moderators at the point of care to facilitate clinical decision-making can improve efficiency, quality and outcomes of healthcare.
Although crucial for delivery of treatment and preventative medicine, detecting treatment effect heterogeneity is challenging with common study designs. Randomized trials yield comparable treatment groups on average but are typically under-powered to detect moderation. One rule-of-thumb is that study samples need to be four times larger to test an effect moderator than to detect the overall average effect (Enderlein, 1988). In addition, randomized trial samples are also often not representative of the target population for which treatment decisions will be made; for instance, Black individuals are on the whole underrepresented in pivotal clinical trials (Green et al., 2022). Therefore, conclusions from one particular trial might not reflect conclusions for a target population, and different trials might give conflicting results due to differences in their enrolled participants. On the other hand, large-scale non-experimental studies can have improved external validity, but these studies can suffer from confounding bias. Given power concerns in single randomized trials and bias concerns in non-randomized studies, much can be gained by combining multiple trials, or combining experimental and non-experimental studies, to examine effect moderation (Berlin et al., 2002, Brown et al., 2013).
Many methods have been proposed to examine effect moderation in a single study. One of the popular approaches is to prespecify a few key subgroups and fit models with treatment-subgroup interactions. This approach is limited in that data analysts could explore a range of possible subgroups and report only those that are statistically significant (Kent et al., 2010); additionally, this approach does not allow the contribution of multivariate factors in effect moderation. Another approach is “risk modeling” (Kent et al., 2010, 2020), where a risk score is created using the covariates to predict the outcome (usually outcome under the comparison/control condition), and the treatment effect is assessed based on the interaction between treatment and this risk score in a regression model of the outcome. This review focuses on what is sometimes called “effect modeling.” Effect modeling spans a spectrum that includes parametric approaches in which a few effect moderators are prespecified, and nonparametric approaches where effect moderation is assumed to be via some potentially complex function of a large set of covariates. Regression analyses and variable selection are common approaches for the former; machine learning methods for the latter.
In order to examine treatment effect heterogeneity based on observed characteristics, the target estimand in the present work is the conditional average treatment effect (CATE). Notation for this estimand is presented in the following section. The CATE is a general function of covariates that could be quite complex and so requires large sample sizes to estimate reliably. A key assumption when combining studies to estimate the conditional average treatment effect is that the CATE function is substantially similar across studies. When discussing the CATE, it is relevant to note that the CATE function is related to subgroup average treatment effects and identification of groups who benefit from treatment; these similar goals are mostly outside of the scope of this review. We therefore focus on the CATE and mention subgroup treatment effects and other similar topics briefly when relevant.
There have been recent statistical advances in modeling heterogeneous treatment effects and a separate burgeoning interest in combining data from multiple sources. A select few works have done both, simultaneously leveraging data from multiple studies to assess treatment effect heterogeneity. Methods like these are needed to best harness the available data to optimize and individualize treatments, and to leverage information from multiple studies to provide more systematic, comprehensive, and generalizable conclusions. This paper reviews these novel methods of assessing treatment effect heterogeneity using multiple studies in the form of multiple randomized trials, or one randomized trial with a large observational dataset. We focus on methods identifying which of two treatments is more likely to improve outcomes for an individual or subgroup, a causal question that sits at the core of clinical practice. In this review, we consider the situation where the variables are similarly defined and available from all studies. It is common, though, that different studies may have different sets of variables. In this more complicated case, either harmonization is needed on the variables or some shared structure is required on conceptually related variables. We will return to this point in the Discussion (Section 6).
Methods discussed in this paper are broken down based on data setting: aggregate-level data, federated learning, and individual participant-level data (IPD). The aggregate-level data setting occurs when researchers only have access to summary information from each study. With aggregate-level data, individual-level effect heterogeneity can only be truly assessed if each study estimated treatment-covariate interactions using the same statistical models (e.g., same link function, same set of covariates), which is not often feasible. In the federated learning setting, sensitive individual-level data are distributed across decentralized studies and cannot be shared beyond their original storage location (Vo et al., 2021). Finally, the IPD setting is the most straightforward and powerful scenario for assessing treatment effect heterogeneity, as individual-level covariates are available from all studies simultaneously. With IPD, we can harmonize covariates, estimate effect moderation by using the same statistical models in each study, and assess model assumptions consistently.
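To make the contrast between aggregate-level data and IPD concrete, the following minimal sketch may help. The records are made up, and the function names `arm_means` and `interaction` are hypothetical; none of this comes from the studies cited in this review. It shows two illustrative studies whose arm-level mean outcomes coincide even though their treatment-by-covariate interactions differ, which is why arm-level summaries alone cannot recover individual-level effect moderation.

```python
def arm_means(ipd):
    """Aggregate-level summary: mean outcome in each treatment arm."""
    out = {}
    for arm in (0, 1):
        ys = [y for a, z, y in ipd if a == arm]
        out[arm] = sum(ys) / len(ys)
    return out


def interaction(ipd):
    """Treatment effect in subgroup Z = 1 minus treatment effect in Z = 0."""
    def effect(level):
        treated = [y for a, z, y in ipd if a == 1 and z == level]
        control = [y for a, z, y in ipd if a == 0 and z == level]
        return sum(treated) / len(treated) - sum(control) / len(control)
    return effect(1) - effect(0)


# Each record is (A, Z, Y): treatment, binary covariate, outcome.
# Both data sets are invented purely for illustration.
study_a = [(1, 0, 2.0), (0, 0, 1.0), (1, 1, 4.0), (0, 1, 1.0)]
study_b = [(1, 0, 3.0), (0, 0, 1.0), (1, 1, 3.0), (0, 1, 1.0)]

print(arm_means(study_a) == arm_means(study_b))    # True: same arm-level summaries
print(interaction(study_a), interaction(study_b))  # 2.0 0.0: different moderation
```

With IPD, or with interaction terms reported by each study, the difference between the two studies is visible; with arm-level means only, it is not.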
642 C. L. BRANTNER ET AL.
Within each of these data settings, methods are primarily geared towards either combining multiple RCTs or one RCT with one observational dataset. We discuss the use of meta-analysis models with multiple RCTs (Debray et al., 2015, Burke, Ensor and Riley, 2017), along with the opportunity to employ variable selection approaches to identify effect moderators (Seo et al., 2021). When combining an RCT with observational data, we consider various methods that allow for complicated relationships to be included in the treatment effect function and account for potential bias from the observational data. These methods can involve estimating the CATE in the RCT and observational data separately and then combining them through an estimated weighting factor (Rosenman et al., 2022, 2020, Cheng and Cai, 2021, Yang, Zeng and Wang, 2020), or estimating the observational CATE and the confounding effect in the observational dataset (Kallus, Puli and Shalit, 2018, Yang, Zeng and Wang, 2020, Wu and Yang, 2021, Hatt et al., 2022). Colnet et al. (2021a) reviewed some methods that combine RCT and observational data, and we extend upon this review by focusing on this combination explicitly for treatment effect heterogeneity. We also add in more methods that combine RCT with observational data along with methods that focus on combining multiple RCTs. In general, there are many approaches outside of those we reference here that focus on estimating the average treatment effect by combining datasets, some of which are discussed by Colnet et al. (2021a); we choose to primarily focus on efforts to examine treatment effect heterogeneity in the present review.
To provide context to the methods discussed in this review, we can consider a few example scenarios. We first consider an assessment of the efficacy of surgery in stage IV breast cancer according to 15 studies where researchers combining the studies only had access to aggregate-level data (Petrelli and Barni, 2012). We also discuss a comparison of outcomes for veterans who received the Moderna versus the Pfizer vaccination for COVID-19 in five different sites where IPD was available within each site but could not be shared across sites, known as a “federated learning” situation (Han et al., 2021). Another setting investigates a diabetes medication, pioglitazone, versus placebo for individuals coming from one of six RCTs, where IPD was available in each trial (Hong et al., 2015). And finally, we discuss data assessing the treatment effect comparing two active treatments for major depression, duloxetine and vortioxetine, wherein we have access to IPD from a combination of RCT data and electronic health records (EHR) from a hospital system (Brantner et al., 2023a). These scenarios all could clearly benefit from combining data to examine heterogeneity in treatment effects, but they each require distinct considerations and statistical approaches to best integrate information. We will use these examples throughout the paper to ground the methods in specific applications.
Importantly, to effectively combine information from multiple datasets, the original studies need to have high transparency and reproducibility. Whether data are reported in aggregate or at the individual participant level, researchers using the data for additional analyses (such as those discussed here) need extensive information about how the data were collected, analyzed, and presented to be able to determine if and how to combine the information with other datasets. It is therefore vital to keep these ideas of transparency and reproducibility of data, code, and results at the forefront when applying these methods. Movements towards data sharing and reproducible research will greatly facilitate the types of research discussed here, which can lead to important new insights regarding effect heterogeneity that cannot be answered from single studies alone due to generalizability, sample size, or confounding concerns.
In the following section (Section 2), we introduce the estimand and assumptions. The next sections are then organized based on the level of data access so that researchers can determine available methods in their given data setting. Specifically, Section 3 discusses aggregate-level data; Section 4, federated learning; and Section 5, individual participant-level data (IPD). Finally, Section 6 compares methods and provides an overview of potential future areas for research.
2. NOTATION
2.1 Target Estimand
Our target estimand to assess effect heterogeneity is the conditional average treatment effect (CATE), defined using the potential outcomes framework under the Stable Unit Treatment Value assumption (Rubin, 1974). Suppose S is the categorical variable indicating study membership, A ∈ {0, 1} is a binary treatment variable, Y is the observed outcome, Y(1) and Y(0) are the potential outcomes under treatment and control respectively, X is a set of covariates, and Z is a subset of X containing the proposed effect moderators.
The CATE can be formally defined as a function of X:
$$\tau(X) = g\big(E[Y(1) \mid X]\big) - g\big(E[Y(0) \mid X]\big)$$
(Abrevaya, Hsu and Lieli, 2015, Künzel et al., 2019), where E[·|·] denotes conditional expectation in the target population of interest and g(·) is a link function that defines the scale on which the interactions occur, whether additive (mean or risk difference) or multiplicative (risk, rate, or odds ratio). In this paper, we primarily discuss a continuous outcome, in which case we use the identity link function and write the CATE as
$$\tau(X) = E[Y(1) - Y(0) \mid X]. \tag{1}$$
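As a minimal illustration of equation (1), the sketch below simulates a single randomized trial and estimates τ(X) as the difference between two fitted outcome regressions, one per arm (an approach sometimes called a T-learner). Everything in it is an assumption made for illustration rather than a method from the reviewed papers: the data-generating model with true CATE τ(x) = 1 + 2x, the single continuous covariate, the linear outcome models, and the helper names `fit_line`, `simulate_trial` and `estimate_cate`. It relies on randomization, so that E[Y(a) | X] can be estimated by E[Y | A = a, X].

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def simulate_trial(n, seed):
    """Hypothetical randomized trial whose true CATE is tau(x) = 1 + 2x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)          # single continuous covariate
        a = rng.randint(0, 1)               # randomized treatment, P(A = 1) = 0.5
        y0 = 0.5 * x + rng.gauss(0.0, 1.0)  # potential outcome under control
        y1 = y0 + 1.0 + 2.0 * x             # potential outcome under treatment
        data.append((x, a, a * y1 + (1 - a) * y0))  # only one outcome is observed
    return data


def estimate_cate(data):
    """Return x -> tau_hat(x), the difference of the two arm-specific fits."""
    fits = {}
    for arm in (0, 1):
        xs = [x for x, a, _ in data if a == arm]
        ys = [y for _, a, y in data if a == arm]
        fits[arm] = fit_line(xs, ys)
    (b0, s0), (b1, s1) = fits[0], fits[1]
    return lambda x: (b1 - b0) + (s1 - s0) * x


tau_hat = estimate_cate(simulate_trial(n=20000, seed=1))
# The true values under the assumed model are tau(0) = 1 and tau(1) = 3.
print(round(tau_hat(0.0), 2), round(tau_hat(1.0), 2))
```

The linear fits are used only to keep the sketch short; any regression method could stand in for `fit_line` when τ(·) is expected to be a more flexible function of the covariates.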
This τ(·) can often be assumed to be a flexible function in which all covariates are considered as potential moderators, so we do not have to a priori differentiate Z and X when methods allow for this flexibility.
One can also consider study-specific CATE functions. This is often the case when researchers are interested in assessing heterogeneity of the treatment effect functions across trials/datasets, or when this heterogeneity is high and it is potentially unreasonable to combine information across studies. We can denote study by S: in the case where data is being combined from one RCT and one observational dataset, S = 0 will indicate RCT and S = 1 observational data; otherwise, S will be a categorical variable ranging from 1 to K, where K is the number of RCTs. The above equation (1) defines a general CATE that is not study-specific. When estimating study-specific CATEs, equation (1) can be rewritten as
$$\tau_s(X) = E[Y(1) - Y(0) \mid X, S = s]. \tag{2}$$
In most of the methods to follow, the CATE is defined by conditioning on a set of available covariates, X. An alternative is to a priori define subgroups of interest and estimate subgroup-specific treatment effects. This approach is similar to the methods discussed in this review but somewhat distinct because subgroups must be specified first. The form of the estimand when examining subgroup-specific effect estimates is instead
$$\tau_k = E[Y(1) - Y(0) \mid K = k],$$
where K represents subgroup membership (Rosenman et al., 2020, 2022).
2.2 Assumptions
Across many methods, the key assumption that allows pooling data from multiple studies to estimate the treatment effect is that either the entire treatment effect function τ(X) or some of its components are shared across studies. This review also focuses solely on the case when there are only two treatments (or one treatment and one control/placebo) being compared. If there are more than two conditions being compared, different approaches would need to be used (e.g., network meta-analysis; Efthimiou et al., 2016, Debray et al., 2018, Hong et al., 2015). Aside from these overarching assumptions, individual methods employ their own specific assumptions. When multiple RCTs are included in meta-analyses, they are often assumed to have similar eligibility criteria (specifically in terms of the covariates thought to be effect modifiers) (Dahabreh et al., 2020), and distributional assumptions are made for model parameters (Debray et al., 2015). Broadly, parametric approaches require the assumption of a parametric relationship between covariates (including treatment, effect moderators, and interactions between the two) and outcomes; further, this parametric relationship is assumed to be approximately correctly specified (Debray et al., 2015, Yang, Zeng and Wang, 2022, 2020). Specifically in the meta-analytic framework when combining multiple RCTs, effect moderation is often assessed using treatment-covariate interaction terms. This approach typically uses an outcome model of the form
$$h\big(E(Y)\big) = \mu(X) + A \times \tau(Z),$$
where h(·) is a link function, μ(X) is the modelled mean of the outcomes under control, Z contains a subset of the variables in X that often needs to be prespecified, and τ(Z) is the CATE function:
$$\tau(Z) = \delta + \theta^{T} Z. \tag{3}$$
In this expression for τ(Z), δ corresponds to the effect of treatment A when Z = 0 (or when the covariates in Z equal their means if they have been centered), and θ corresponds to the coefficients of treatment-moderator interaction terms AZ in the h(E(Y)) model. Similarly to the general format of the CATE in equation (1), this parametric form of τ(Z) can be expressed as multiple study-specific functions:
$$\tau_s(Z) = \delta_s + \theta_s^{T} Z. \tag{4}$$
When combining an RCT with an observational dataset, there are a few within-study assumptions, including unconfoundedness (Assumption 1), positivity (Assumption 2), and consistency (Assumption 3) (Colnet et al., 2021a, Cheng and Cai, 2021).
ASSUMPTION 1. {Y(0), Y(1)} ⊥⊥ A | X within each study.
ASSUMPTION 2. For almost all X with π(X) = P(A = 1 | X) (the propensity score), there exists a constant c > 0 such that c < π(X) < 1 − c within each study.
ASSUMPTION 3. Y = AY(1) + (1 − A)Y(0) almost surely.
The unconfoundedness assumption (1) is satisfied by design in an RCT. Assumption 2 also holds by design in an RCT since the probability of treatment is independent of observed covariates and is prespecified.
When combining datasets, we expand upon the previous assumptions. In the setting where observational data is being combined with an RCT, the unconfoundedness assumption (1) can be relaxed in the observational data. This is because there are analysis possibilities with multiple datasets that include assessing whether this assumption is met or not and using the RCT to account for any confounding in the observational data (Cheng and Cai, 2021, Yang, Zeng and Wang, 2020, 2022). Assumption 3 in the multi-study setting implies that the treatments being compared are the same across all studies (since there is no s subscript) to ensure that the potential outcomes
Y(0) and Y(1) are well-defined. We also can introduce two other assumptions that are involved at some level in methods that combine an RCT with observational data; these assumptions include study membership positivity (Assumption 4) (Colnet et al., 2021a, Cheng and Cai, 2021) and unconfounded study membership (Assumption 5) (Hatt et al., 2022, Cheng and Cai, 2021, Kallus, Puli and Shalit, 2018).
ASSUMPTION 4. For almost all X, there exists a constant d > 0 such that d < P(S = s | X = x) < 1 − d.
ASSUMPTION 5. {Y(0), Y(1)} ⊥⊥ S | X.
The following sections break down methods based on available data.
3. AGGREGATE-LEVEL DATA
The broadest level of data access is in the form of aggregate-level data (AD), where individual studies have been carried out and analyzed, and only summary data (e.g., sample mean, standard deviation, or regression model coefficient estimates) are available. AD are often used in meta-analyses when IPD are unavailable. Meta-analysis with AD can estimate average effects effectively and provide results similar to those of meta-analysis with IPD (Burke, Ensor and Riley, 2017, Hong et al., 2015). However, aggregation bias (also known as the ecological fallacy), which occurs when conclusions are incorrectly drawn about individuals when the relationship is found at the group level, can easily be introduced if researchers want to make a conclusion about individual-level effect moderation when only AD is available (Berlin et al., 2002, Debray et al., 2015, Teramukai et al., 2004). This aggregation bias will not be present if each paper reports subgroup-specific outcomes for all necessary subgroups; however, this is rare in practice because subgroups are often defined by more than one covariate. AD therefore has limited power for detecting effect moderation (Lambert et al., 2002). However, IPD is not always easy to access or use, so the following section discusses what can be done with AD. In framing this discussion, one can think of the example assessing the effects of tumor-removal surgery in individuals with breast cancer (Petrelli and Barni, 2012) using aggregate data from several relevant studies.
3.1 Meta-Analysis of Interaction Terms
If AD is all that is available for a question of interest, there is still an opportunity to estimate individual-level effect moderation under specific circumstances. If all previous studies have performed similar analyses and have included a particular treatment-covariate interaction term using the IPD from that given study, then these interaction terms can be pooled at the aggregate level (Simmonds and Higgins, 2007, Kovalchik, 2013). For instance, although this approach was not taken by Petrelli and Barni (2012), if a treatment-age interaction term was estimated in each of the individual studies assessing the effect of surgery on mortality in individuals with stage IV breast cancer, then these interaction terms could be pooled together. In this way, researchers can estimate an individual-level effect moderation term across multiple studies and can combine such terms to estimate τ(Z) as in equation (3). However, this requires that the studies assess and report the interactions of interest consistently. Similarly, the aggregate data could include subgroup-specific treatment effects rather than interactions, which could also be pooled to describe effect moderation if the effects are reported in each study (Godolphin et al., 2023).
3.2 Meta-Regression
If such study-specific interaction coefficients are not available across all studies, AD can also be modeled through meta-regression with treatment-covariate interaction terms, where importantly only aggregate-level covariates (e.g., mean age, proportion female) are available. For example, the individual-level covariate of interest might be whether the person has severe disease or not; in an AD meta-regression, this covariate would become the percentage of individuals in the study who have severe disease. Meta-regression was the approach taken by Petrelli and Barni (2012) in their assessment of surgery efficacy. Specifically, they investigated hazard ratios of overall survival according to the 15 different studies and did so while including covariates such as median age and mastectomy rate.
AD analyses can handle study-level effect moderators well. However, the ability to assess individual-level moderators depends on the level of detail available in the AD. Multiple papers have assessed the differences between AD and IPD meta-regressions for estimating treatment effect heterogeneity. In an analysis by Berlin and colleagues (2002), models using IPD picked up on a key effect moderator that had been found in previous literature, but all models using AD missed this effect moderator at the group level. Extensive simulation studies also have shown that the power for detecting treatment effect moderation is much lower in meta-regression using AD; in these simulations, effect moderation was only effectively discovered in AD analyses when there were a large number of trials with large sample sizes (Lambert et al., 2002). Again, relationships that are picked up in an AD meta-regression cannot be immediately interpreted as individual-level effects; for example, if the percentage of individuals with severe disease is an effect moderator in the AD model, researchers cannot immediately conclude that the presence of severe disease is an effect moderator at the individual level.
Furthermore, the aggregate-level covariates also often do not vary much across studies. Since studies included in
more IPD has become accessible to researchers, allowing them to go a step further from AD and more effectively assess effect moderation. Having IPD available, such as in the example of assessing the effects of pioglitazone for individuals with diabetes (Hong et al., 2015), allows for baseline individual-level covariates to be used to study subgroup effects and effect moderation at the individual level.
5.1.1 Types of IPD meta-analyses. There are two commonly discussed IPD meta-analysis estimation methods: two-stage and one-stage. In two-stage IPD meta-analysis, aggregate statistics are calculated within each study (e.g., overall treatment effects, effects for each subgroup, interaction terms), and then these results are combined in a between-study model. In one-stage IPD meta-analysis, all individual-level data are put directly into a hierarchical or multilevel model (Burke, Ensor and Riley, 2017). Although results with respect to average treatment effects are often similar between the two approaches (Burke, Ensor and Riley, 2017, Debray et al., 2015, Tierney et al., 2015), model assumptions do differ, and choosing the approach that seems best suited to a specific research question is an important decision. In this paper, we focus on one-stage IPD meta-analysis because of its flexibility (Debray et al., 2015).
5.1.2 One-stage IPD meta-analysis. In one-stage IPD meta-analysis, a common technique is to use a generalized linear mixed model (GLMM) to estimate the mean outcome given covariates. The model can have the form
$$g\big(E(Y_{is})\big) = \alpha_s + \delta_s A_{is} + \beta_s^{T} X_{is} + \theta_s^{T} A_{is} Z_{is}, \tag{5}$$
where $Y_{is}$ is the outcome for individual i from study s, $\alpha_s \sim N(\alpha, \sigma_\alpha^2)$ is a study-specific intercept, $\delta_s \sim N(\delta, \sigma_\delta^2)$ is the study-specific treatment effect when the covariates are set to 0 (or their means, if centered), $\beta_s \sim N(\beta, \Sigma_\beta)$ is the study-specific vector of main effects of covariates on the outcome, and $\theta_s \sim N(\theta, \Sigma_\theta)$ is the study-specific vector of effect moderation terms (Seo et al., 2021). Here, $\sigma_\alpha^2$, $\sigma_\delta^2$ and the diagonal elements of $\Sigma_\beta$ and $\Sigma_\theta$ measure the between-study variability of the effects. $\beta_s$ and $\theta_s$ are often assumed to be uncorrelated in the literature; however, we can extend this model to allow for correlation between $\beta_s$ and $\theta_s$.
If the outcome is continuous (as assumed in this paper), g(·) is often set to be the identity function; if the outcome is binary, g(·) could be the logit link function. Key parameters of interest are δ, which indicates an overall measure of the treatment effect when the moderators are set to 0, and θ, which indicates the magnitude of the effect moderation. For easy interpretation, covariates can be centered at zero so that the $\delta_s$ represent the treatment effects at the mean value of each covariate (Dagne et al., 2016, Gelman, Hill and Vehtari, 2020).
The model above includes random effects for all coefficients, and so explicitly models between-study heterogeneity for each coefficient (the $\beta_s$'s and $\theta_s$'s). This approach can be thought of as interpolating between two extremes. The first of these is a “no-pooling” model, with the same structure as equation (5) but with study-specific coefficients fit as fixed effects independently to the data from each study. Such a model avoids the sharing of information across studies, but also includes more free parameters, which may be less stably estimated. This approach also does not ultimately provide a global treatment effect estimate across studies, as all studies are given their own fixed coefficients.
A simpler model would treat some coefficients as shared across studies. This might take the form of assuming a common intercept or slope (Thomas, Radji and Benedetti, 2014); for example, in equation (5), if between-study variability of the main covariate effects (represented by $\Sigma_\beta$) were small, a common coefficient could be estimated instead by replacing $\beta_s$ with β. In practice, θ is often assumed to be shared across studies. GLMMs can quickly become too complicated if many effects are allowed to vary across studies (especially when study sample sizes are small); on the other hand, the model might be misspecified if it ignores important variation that does exist. Therefore, each coefficient (and whether it should be treated as common across studies, modelled as random, or estimated independently within each study) should be considered carefully to ensure that the model effectively represents between-study variability while still being sufficiently simple.
GLMMs can be fit under both frequentist and Bayesian frameworks (Debray et al., 2015). If a Bayesian framework is used, prior distributions need to be assigned to each parameter; one option is to assign noninformative priors to all parameters of interest (McCandless, 2009). Informative priors can be used when information about the parameters is available from expert opinion or historical data analysis. Hong et al. (2015) utilize a Bayesian framework for their analysis of diabetes medication; however, they compare more than just two treatments and perform network meta-analysis, which is not the focus of this paper.
One other consideration in one-stage IPD meta-analysis is the option to decompose between-study and within-study variability. To avoid aggregation bias, some researchers (Hua et al., 2017, Debray et al., 2015, Donegan et al., 2012, Hong et al., 2015) suggest decomposing the interactions into two sources: individual-level (i.e., within-study effect) and aggregate-level (i.e., between-study effect) interactions. This model can be written by
involves estimating the CATE in both datasets. In several of these approaches, the final CATE estimate is a weighted combination of the two study-specific CATE estimates, where the weight is derived based on a method-specific estimate of bias in the observational data. This is the approach taken by Rosenman et al. in two papers (2022; 2020). In each paper, Rosenman and colleagues discuss the CATE in terms of average treatment effects within “strata,” or subgroups that can be defined as a complex function of covariates (Rosenman et al., 2022). The authors construct strata based on effect moderators and propensity score estimates from the observational data. They assume that within each stratum, the true average treatment effect is the same for both the observational and RCT data; however, the observational data may yield a biased estimate due to unobserved confounding. The base estimator used in their papers is a difference in mean outcomes between the treatment and control group within stratum k:
$$\hat\tau_k^{o} = \frac{\sum_{i \in O_k} A_i Y_i}{\sum_{i \in O_k} A_i} - \frac{\sum_{i \in O_k} (1 - A_i) Y_i}{\sum_{i \in O_k} (1 - A_i)}, \tag{6}$$
where o indicates observational study, k indexes strata, and $O_k$ is the set of individuals in the observational study belonging to stratum k. The same estimator can be established for the RCT by replacing o and $O_k$ with r and $R_k$, respectively. From this, Rosenman et al. (2022) construct a “spiked-in” estimator, in which individuals from the RCT are assigned to their corresponding strata with individuals from the observational data. Then the stratum-specific treatment effects are estimated as in equation (6) but including both RCT and observational data. They compare this “spiked-in” estimator with a dynamic weighted average in which stratum-specific treatment effects are estimated separately in the RCT and observational data, and then the weight for combining the RCT and observational stratum-specific treatment effects is constructed based on the variance of the RCT estimator and the mean squared error (MSE) of the observa-
minimizing the unbiased risk estimate such that
$$\hat\tau_k(\lambda) = \hat\tau_k^{r} - \lambda\,\big(\hat\tau_k^{r} - \hat\tau_k^{o}\big), \tag{7}$$
where r indexes the RCT estimator, o the observational estimator, k indexes strata, and $\hat\tau_k^{r}$ and $\hat\tau_k^{o}$ can be estimated as specified in equation (6). They also discuss an estimator that is the same as equation (7) but multiplies the difference $\lambda(\hat\tau_k^{r} - \hat\tau_k^{o})$ by the variance matrix from the RCT. Note that both of these approaches by Rosenman and colleagues are technically at the subgroup level; however, these subgroups can be complex functions of covariates, so the approach can be easily discussed in terms of covariates, X, instead of stratum membership.
A recent paper by Cheng and Cai (2021) incorporates a similar approach to the shrinkage estimation by Rosenman et al. (2020) by adaptively combining CATE functions between an RCT and observational dataset based on the estimated degree of bias in the observational estimator to yield study-specific CATE estimates that minimize MSE. Cheng and Cai (2021) also use a weighted linear combination of CATE estimators from the RCT, $\hat\tau_s^{r}(X)$, and the observational data, $\hat\tau_s^{o}(X)$:
$$\hat\tau_s(X) = \hat\tau_s^{r}(X) + \nu_X\,\big(\hat\tau_s^{o}(X) - \hat\tau_s^{r}(X)\big),$$
where s = 0, 1 denotes RCT and observational data, respectively, and $\nu_X$ is a weight function. To estimate CATE functions in each study separately, the authors use doubly-robust pseudo-outcomes (Kennedy, 2020) that are defined as influence functions for the average treatment effect (see more in the Supplementary Material, Brantner et al., 2023b). These influence functions are then regressed on the potential effect moderators, X, to estimate the CATE in both the RCT ($\hat\tau_s^{r}(X)$) and observational data ($\hat\tau_s^{o}(X)$) separately. The weight $\nu_X$ is estimated by minimizing a decomposition of an estimate of the mean squared error (MSE) for the CATE function and varies based on X. This strategy allows for the weight to heavily favor the RCT estimator when the observational data is biased and to combine both estimators efficiently to minimize asymptotic
tional data estimator. Ultimately, they discover that the variance in the presence of insignificant bias in the obser-
“spiked-in” estimator is only effective when the covari- vational data.
ate distributions are very similar across datasets and that Cheng and Cai’s method of estimating νX is similar to
their dynamic weighted average has low bias regardless Rosenman et al. (2020) approach of estimating λ using an
of whether the covariate distributions are similar or not. unbiased risk estimate. An important distinction between
In their second paper in this stratum-specific treatment the two approaches is that Rosenman et al. (2020) rep-
effect framework, Rosenman et al. (2020) utilize shrink- resent treatment effect heterogeneity through K distinct
age estimation to combine CATE estimators from the strata within which they assume that the treatment effect
RCT and observational dataset. They first determine a is common across the RCT and observational datasets.
structure for a given shrinkage factor, λ, and then optimize Cheng and Cai (2021) instead use individual covariates
an unbiased risk estimate to solve for this λ. They again as part of their CATE estimation, and they do not require
define stratum-specific average treatment effects under the treatment effects to be equivalent between the RCT
the assumption that treatment effect heterogeneity can be and observational datasets. Cheng and Cai (2021) also use
assessed by dividing up the dataset into strata. For exam- a different base estimation procedure for the initial esti-
ple, they define a common shrinkage factor λ selected by mates of τ in the RCT and observational data.
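To make the stratum-level estimators of Rosenman et al. concrete, the following Python sketch computes the difference in means of equation (6) within each stratum and then applies the shrinkage combination of equation (7). It is only an illustration under stated assumptions: the function names, the toy data, and the fixed value of the shrinkage factor are hypothetical, and the shrinkage factor is simply supplied by the user here rather than selected by minimizing an unbiased risk estimate as in Rosenman et al. (2020).

```python
import numpy as np


def stratum_effects(y, a, stratum, n_strata):
    """Equation (6): treated-minus-control difference in mean outcomes per stratum."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=int)
    stratum = np.asarray(stratum, dtype=int)
    tau = np.empty(n_strata)
    for k in range(n_strata):
        in_k = stratum == k
        # Assumes every stratum contains at least one treated and one control unit.
        tau[k] = y[in_k & (a == 1)].mean() - y[in_k & (a == 0)].mean()
    return tau


def shrink_toward_observational(tau_rct, tau_obs, lam):
    """Equation (7): tau_k(lam) = tau_k^r - lam * (tau_k^r - tau_k^o)."""
    tau_rct = np.asarray(tau_rct, dtype=float)
    tau_obs = np.asarray(tau_obs, dtype=float)
    return tau_rct - lam * (tau_rct - tau_obs)


# Hypothetical toy data: two strata, one treated and one control unit in each.
stratum = [0, 0, 1, 1]
a = [1, 0, 1, 0]
tau_r = stratum_effects([3.0, 1.0, 5.0, 2.0], a, stratum, n_strata=2)  # RCT
tau_o = stratum_effects([4.0, 1.0, 6.0, 2.0], a, stratum, n_strata=2)  # observational

# lam is fixed by hand here; the paper chooses it by minimizing a risk estimate.
tau_shrunk = shrink_toward_observational(tau_r, tau_o, lam=0.5)
```

In this sketch, setting `lam=0` returns the RCT-only stratum estimates and `lam=1` returns the observational estimates, so intermediate values trade the variance of the RCT estimator against the potential bias of the observational one.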
Finally, Yang, Zeng and Wang (2020) also combine separate estimates of the CATE from the RCT and observational data to minimize MSE under the assumptions of unconfoundedness in the RCT (Assumption 1 in the RCT; satisfied via randomization) and a structural model for the CATE ($\tau(X) = \tau_{\psi_0}(X)$). This approach uses elastic integration to combine the estimates based on a hypothesis test that determines whether the assumption of unconfoundedness in the observational data (Assumption 1 in the observational data) is sufficiently met or not (Yang, Zeng and Wang, 2020). To construct this test, Yang et al. (2020) introduce

$$H_{\psi_0}(X) = Y - \tau_{\psi_0}(X) A \tag{8}$$

such that $E(H_{\psi_0} \mid A, X, S) = E(Y(0) \mid A, X, S)$. From here, they introduce a semiparametric efficient score of the parameters $\psi_0$, which we will call $\mathrm{SES}_{\psi_0}$. This semiparametric efficient score is used in their hypothesis test with a null hypothesis of $E(\mathrm{SES}^{o}_{\psi_0}) = 0$, where $\mathrm{SES}^{o}_{\psi_0}$ is the score in the observational data. If this null hypothesis is rejected, the ultimate parameters for the CATE are determined solely from the RCT data; if not, parameters are solved for using an elastic integration of both the RCT and observational data. Estimating the parameters is discussed in more detail in Yang et al.’s (2020) paper; briefly, they solve

$$\frac{\sum_{i=1}^{N} \mathrm{SES}_{\psi}}{N} = 0$$

by plugging in estimators of unknown quantities and solving for $\psi$.

5.2.2 Estimating and accounting for the confounding bias in the observational data. Another category focuses on estimating the CATE in the observational data, together with the confounding bias as estimated by bringing in the RCT data, rather than estimating the CATE in each dataset. Kallus and colleagues (2018) estimate the CATE in the observational data first and then estimate a correction term to adjust for confounding. They focus on deriving a CATE estimator that is consistent. The approach assumes unconfoundedness (Assumption 1) in the RCT, but does not assume that the observational data fully overlaps with the RCT data (Kallus, Puli and Shalit, 2018, Colnet et al., 2021a). The authors note that the CATE function in the observational data, $\tau^{o}(X)$, does not equal the true CATE, $\tau(X)$, because of confounding, so they define the confounding effect to be

$$\eta(X) = \tau(X) - \tau^{o}(X)$$

and focus on estimating this $\eta$ to correct the observational CATE estimator. The observational CATE is estimated using any single-study approach, such as a causal forest (Athey, Tibshirani and Wager, 2019, Brantner et al., 2023b), and the confounding effect is estimated using the following equation. For the propensity score in the RCT, $\pi^{r}(X) = P(A = 1 \mid X, S = 0)$, Kallus et al. define

$$q^{r}(X_i) = \frac{A_i}{\pi^{r}(X_i)} - \frac{1 - A_i}{1 - \pi^{r}(X_i)}$$

for individuals in the RCT. This leads to the final equation to solve to estimate the confounding effect:

$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n^{r}} \left( q^{r}(X_i) Y_i - \hat{\tau}^{o}(X_i) - \theta^{T} X_i \right)^{2},$$

again applied to only individuals in the RCT, where $n^{r}$ is the total number of individuals in the RCT. Finally, they set $\hat{\eta}(X) = \hat{\theta}^{T} X$ and ultimately define

$$\hat{\tau}(X) = \hat{\tau}^{o}(X) + \hat{\eta}(X).$$

Yang, Zeng and Wang (2022) also estimate confounding in the observational study directly. They focus on the conditional average treatment effect on the treated (CATT), $\tau(X) = E[Y(1) - Y(0) \mid X, A = 1]$, and define a confounding function to estimate the effect of unobserved confounding in the observational data. They assume unconfoundedness in the RCT (Assumption 1), a structural model for both the CATT and the confounding function, $\zeta$, and that the RCT and observational data come from the same target population, though their covariate distributions need not overlap. Their confounding function is defined in the observational study as the difference in potential outcome means between treatment groups:

$$\zeta(X) = E\left[ Y(0) \mid A = 1, X, S = 1 \right] - E\left[ Y(0) \mid A = 0, X, S = 1 \right].$$

When all confounders are measured, $\zeta(X) = 0$, but in reality, unobserved confounders will lead the function to be nonzero. Yang, Zeng and Wang (2022) show that this function is only identifiable when the RCT data is used with the observational data.

To estimate the parameters for the CATT and the confounding function, Yang, Zeng and Wang (2022) utilize estimating equations and semiparametric efficiency theory, similar to the approach taken by Yang, Zeng and Wang (2020). Specifically, they define an equation similar to that of their previous work (Yang, Zeng and Wang, 2020) shown in equation (8):

$$H_{\psi_0} = Y - \tau_{\varphi_0}(X) A - S \zeta_{\phi_0}(X) \left\{ A - e(X, S) \right\},$$

where $\psi_0 = (\varphi_0, \phi_0)$ are parameters, and the final term in the equation only comes into play when $S = 1$, that is, in the observational data. They solve an estimating equation based around this $H$ to get a preliminary estimator of the parameters for $\tau$ and $\zeta$; next, they update this solution based on a semiparametric efficient score. The authors finally show that their estimator of the CATT, which integrates both datasets, is more efficient than the
CATT from the RCT data when the predictors from the CATT function and confounding function are linearly independent.

The “integrative R-learner” falls in a similar category of methods and is based on adapting the original R-learner by Nie and Wager (2021) (see Supplementary Material, Brantner et al., 2023b) to the setting with one RCT and one observational dataset (Wu and Yang, 2021). This approach minimizes loss and is consistent and asymptotically efficient compared to an RCT-only estimator. The authors use a very similar definition of the confounding function as in Yang, Zeng and Wang (2022), with a slight adjustment:

$$c(X) = E(Y \mid X, A = 1, S = 1) - E(Y \mid X, A = 0, S = 1) - \tau(X),$$

where $c(X) = 0$ when there is no unobserved confounding in the observational dataset (Assumption 1). Wu and Yang (2021) estimate this confounding function and $\tau(X)$ by minimizing an empirical loss function that has the Neyman orthogonality property, as found in the original R-learner (Nie and Wager, 2021).

Finally, Hatt et al. (2022) propose a method that utilizes the estimated confounding effect in the observational data through a representation learning approach. Under similar assumptions to previous methods such as consistency (Assumption 3), common support across the RCT and observational data (Assumption 4), and unconfoundedness in the RCT (Assumption 1) among others, Hatt et al. (2022) define $\phi^{*}$ to be a representation of the shared structure of covariates in both the RCT and the observational data. They also define $h_a^{r}$ and $h_a^{o}$ as “hypotheses” in the RCT and observational data, respectively, for $a = 0, 1$ indicating control or treatment. These so-called hypotheses are functions meant to be applied to the representation, $\phi^{*}$, where for $r$ representing membership in the RCT and $o$ in the observational data,

$$E\left[ Y^{r} \mid A^{r} = a, X^{r} = x \right] - E\left[ Y^{o} \mid A^{o} = a, X^{o} = x \right] = h_a^{r}\left( \phi^{*}(x) \right) - h_a^{o}\left( \phi^{*}(x) \right).$$

Similarly to previous methods, Hatt et al. (2022) use a confounding function to represent the bias, defined as $\gamma_a = h_a^{r} - h_a^{o}$. Their algorithm starts by estimating $\hat{\phi}$ and $\hat{h}_a^{o}$ for $a = 0, 1$ from the observational data by minimizing an empirical loss. Next, these estimates are applied to the RCT data and the empirical loss in this dataset is minimized to derive an estimate for the bias $\hat{\gamma}_a$, $a = 0, 1$. Finally, these estimates are combined using the fact that $\gamma_a = h_a^{r} - h_a^{o}$ to solve for $\hat{h}_a^{r} = \hat{\gamma}_a + \hat{h}_a^{o}$ and to ultimately estimate the CATE as

$$\hat{\tau}(X) = \hat{h}_1^{r}\left( \hat{\phi}(X) \right) - \hat{h}_0^{r}\left( \hat{\phi}(X) \right).$$

6. DISCUSSION

6.1 Comparison of Approaches

The recent influx of interest in studying treatment effect heterogeneity has led to novel and adapted methods that strive to improve the identification of tailored interventions. Furthermore, with the increase of IPD availability and the simultaneous research interests of combining data sources, assessing treatment effect heterogeneity in a reproducible manner is more feasible than before. Table 1 summarizes the aforementioned approaches, with a focus on their data setting, modeling approach, and motivation.

6.2 Parametric and Nonparametric Approaches

Meta-analyses have been in use for many years but are less often conceptualized in terms of identifying treatment effect moderation. This review and some other continuing work (e.g., Seo et al., 2021) have tied meta-analyses into this framework. Traditional methods for assessing moderation generally have involved parametric approaches that require prespecification of the potential moderators. However, parametric regression models are limited by the need to prespecify interaction terms, and complex nonlinearities might be missed in the ultimate CATE function. Variable shrinkage techniques (including priors) could help to ensure that the most important interactions are included without overfitting the model (Seo et al., 2021).

Newer approaches listed in Section 5.2 include flexible machine learning methods that allow for complicated functional forms for the covariates in the CATE and do not require that moderators be prespecified. The nonparametric side to estimation that is often employed when combining an RCT with observational data allows for the CATE function to be more complex, but there are some potential weaknesses of these methods compared with simpler parametric models. First, the resulting CATE estimates may be more difficult to interpret, particularly if the goal is to pick out individual effect moderators and assess their precise relationship with the treatment effect. Second, the desirable theoretical properties of these methods (consistency of the estimators, robustness against model misspecification, accuracy of the associated confidence intervals) are for the most part asymptotic, and so a priori one would expect that the nonparametric/machine learning methods are better suited to situations with enough data. The point at which the robustness of the nonparametric approaches is to be preferred over the explicitness and simplicity of the parametric approaches is perhaps best assessed using a combination of contextual or scientific background knowledge, simulation studies, data splitting techniques like cross-validation and training/test/validation sets, and real-world experience with the methods.
TABLE 1
Comparison of approaches to estimate CATE using multiple studies
AD = aggregate-level data, FL = federated learning, IPD = individual participant-level data, RCT = randomized controlled trial, OD = observational data
In conclusion, parametric models may suffer from model misspecification but are easy to interpret and apply. Although machine learning methods are relatively untested, their statistical properties are mostly asymptotic, and their implementation can be more computationally intensive, they incorporate a large amount of flexibility and could be ideal when complex nonlinear associations are expected with a large number of variables.

6.3 Current Shortcomings and Future Directions

Because this field is growing rapidly and the methods discussed are somewhat new, many methods have not been thoroughly compared to one another in simulation studies or illustrated using real trials and/or observational datasets. There is therefore a broad opening for future research that assesses these approaches in comparison to one another through data applications. For meta-analysis, many real-world applications exist, but not all go in-depth into treatment effect heterogeneity. The remaining approaches discussed in this study are all very recent, and the new methods have not been tried out extensively in real data. Real-world applications will be important for understanding the practical implications and considerations such as differential measurement across datasets, missing data, and more; such implications must be addressed for the methods to be fully useful in applications. Furthermore, any comparisons that have been done do not combine parametric and nonparametric approaches in this field of CATE estimation using multiple studies.

Another useful field of follow-up study is consolidating and evaluating assumptions. The assumptions of methods discussed here vary in whether they are required, relaxed, or unneeded. It would be helpful to be able to empirically evaluate the assumptions across datasets to examine their feasibility, although not all assumptions explored in this paper can be empirically assessed. Specific approaches for inference in the form of variance estimation and confidence intervals are also needed in many approaches. For parametric approaches discussed throughout the review, often standard methods such as Wald confidence intervals can be employed (Yang, Zeng and Wang, 2022), or bootstrapping can be used to estimate intervals and standard errors as well. However, there is an opening for more work to determine the best inference approaches in the parametric and nonparametric cases, and how these approaches vary depending on the method.

More work could also be done when it comes to the type of data being combined. One might be interested in determining how to apply the meta-analytic framework to the combination of trial and observational data; this field has been called cross-design synthesis and has been debated in the literature (Debray et al., 2015). On the other hand, the methods geared towards combining an RCT with observational data could be tailored to combine multiple RCTs, but this option was not discussed in the methods previously described aside from briefly in the federated learning setting (Tan, Chang and Tang, 2021).

In terms of specific data availability settings, aggregate-level data consistently provides a challenge for estimating individual-level effect moderation, and there are only a couple of limited settings in which this goal can be achieved. Therefore, more IPD access is the simplest solution to being able to derive an in-depth model to estimate the CATE. For the case when IPD is available but cannot be shared across studies (i.e., federated learning), the approaches discussed in this review could be tailored to deal with this. Very few methods exist in this field within federated learning; only one paper specifically discusses treatment effect heterogeneity when data is distributed privately across studies (Tan, Chang and Tang, 2021). Thus, future work could be done to derive approaches to estimate the CATE in federated learning.

Data availability also can vary within a given set of studies, and researchers often run into the issue of systematically missing covariates, that is, covariates available in some but not all data sources. Covariates also can be sporadically missing, where the covariate is present in all studies but missing for some individuals throughout the studies. Future development of the methods discussed previously should incorporate these considerations, as many of the new approaches leave this for future work. Some papers have looked into these types of missingness in a slightly separate context (Colnet et al., 2022); for example, Audigier et al. (2018) investigated the performance of multiple imputation procedures for systematically and sporadically missing data. Jolani et al. (2015) also describe a generalized imputation approach for IPD meta-analysis when covariates are systematically missing.

An appropriate follow-up question from this work is when to best implement each method. Because the machine learning methods have not been compared to one another in simulation studies, it is difficult to conclude which of the methods is optimal in which scenario. This review does attempt to clarify which type of data can be handled by each method, and whether the method works with RCT and observational data, or multiple RCTs. However, further study is needed to determine which approach will yield the most accurate predictions depending on the types of heterogeneity present in the study (i.e., heterogeneity across studies, heterogeneity within studies).

For those working in this field or those who want to learn more, it is important to continue to look out for new research that comes out, since this field is changing and growing rapidly. At the time of this review, many future directions of work are open for pursuit. The new methods mentioned throughout this review increase the feasibility of reproducible conclusions regarding individualized treatment decisions. Because we can employ data from multiple sources, we are developing a deeper understanding and can more effectively estimate individual treatment effects that are reliable and generalizable.

ACKNOWLEDGMENTS

The authors would like to thank the anonymous referees and the special issue Guest Editors for their constructive comments that improved the quality of this paper.

T.-H. Chang completed the work for this paper while employed as a Biostatistician at the Johns Hopkins Bloomberg School of Public Health.

FUNDING

Research reported in this publication was partially funded through a Patient-Centered Outcomes Research Institute (PCORI) Award (ME-2020C3-21145; PI: Stuart) and by the National Institute of Mental Health (R01MH126856; PI: Stuart). Ms. Brantner also received financial support in the form of a training grant through the National Institutes of Health (T32AG000247). The statements in this work are solely the responsibility of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors or Methodology Committee, or of the National Institute of Mental Health.

SUPPLEMENTARY MATERIAL

Single-Study CATE Estimation Methods (DOI: 10.1214/23-STS890SUPP; .pdf). This supplement provides an overview of approaches that estimate the conditional average treatment effect (CATE) in a single randomized controlled trial or observational dataset. Both parametric and nonparametric methods are included, and the nonparametric methods are grouped into classes to help differentiate the approaches.

REFERENCES

Abrevaya, J., Hsu, Y.-C. and Lieli, R. P. (2015). Estimating conditional average treatment effects. J. Bus. Econom. Statist. 33 485–505. MR3416596 https://doi.org/10.1080/07350015.2014.975555
Athey, S., Tibshirani, J. and Wager, S. (2019). Generalized random forests. Ann. Statist. 47 1148–1178. MR3909963 https://doi.org/10.1214/18-AOS1709
Audigier, V., White, I. R., Jolani, S., Debray, T. P. A., Quartagno, M., Carpenter, J., van Buuren, S. and Resche-Rigon, M. (2018). Multiple imputation for multilevel data with continuous and binary variables. Statist. Sci. 33 160–183. MR3797708 https://doi.org/10.1214/18-STS646
Baron, R. M. and Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. J. Pers. Soc. Psychol. 51 1173–1182. https://doi.org/10.1037/0022-3514.51.6.1173
Berlin, J. A., Santanna, J., Schmid, C. H., Szczech, L. A., Feldman, H. I. and Anti-Lymphocyte Antibody Induction Therapy Study Group (2002). Individual patient- versus group-level data meta-regressions for the investigation of treatment effect modifiers: Ecological bias rears its ugly head. Stat. Med. 21 371–387. https://doi.org/10.1002/sim.1023
Brantner, C. L., Nguyen, T. Q., Tang, T., Zhao, C., Hong, H. and Stuart, E. A. (2023a). Comparing machine learning methods for estimating heterogeneous treatment effects by combining data from multiple randomized controlled trials. arXiv preprint. Available at arXiv:2303.16299.
Brantner, C. L., Chang, T.-H., Nguyen, T. Q., Hong, H., Di Stefano, L. and Stuart, E. A. (2023b). Supplement to “Methods for integrating trials and non-experimental data to examine treatment effect heterogeneity.” https://doi.org/10.1214/23-STS890SUPP
Brown, C. H., Sloboda, Z., Faggiano, F., Teasdale, B., Keller, F., Burkhart, G., Vigna-Taglianti, F., Howe, G., Masyn, K. et al. (2013). Methods for synthesizing findings on moderation effects across multiple randomized trials. Prev. Sci. 14 144–156. https://doi.org/10.1007/s11121-011-0207-8
Burke, D. L., Ensor, J. and Riley, R. D. (2017). Meta-analysis using individual participant data: One-stage and two-stage approaches, and why they may differ. Stat. Med. 36 855–875. MR3597661 https://doi.org/10.1002/sim.7141
Cheng, D. and Cai, T. (2021). Adaptive combination of randomized and observational data. Available at arXiv:2111.15012.
Colnet, B., Josse, J., Varoquaux, G. and Scornet, E. (2022). Causal effect on a target population: A sensitivity analysis to handle missing covariates. J. Causal Inference 10 372–414. MR4512969 https://doi.org/10.1515/jci-2021-0059
Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., Vert, J., Josse, J. and Yang, S. (2021a). Causal inference methods for combining randomized trials and observational studies: A review. Available at arXiv:2011.08047.
Dagne, G. A., Brown, C. H., Howe, G., Kellam, S. G. and Liu, L. (2016). Testing moderation in network meta-analysis with individual participant data. Stat. Med. 35 2485–2502. MR3513700 https://doi.org/10.1002/sim.6883
Dahabreh, I. J., Petito, L. C., Robertson, S. E., Hernán, M. A. and Steingrimsson, J. A. (2020). Towards causally interpretable meta-analysis: Transporting inferences from multiple studies to a target population. Available at arXiv:1903.11455.
Debray, T. P. A., Moons, K. G. M., Valkenhoef, G., Efthimiou, O., Hummel, N., Groenwold, R. H. H. and Reitsma, J. B. (2015). Get real in individual participant data (IPD) meta-analysis: A review of the methodology. Res. Synth. Methods 6 293–309. https://doi.org/10.1002/jrsm.1160
Debray, T. P. A., Schuit, E., Efthimiou, O., Reitsma, J. B., Ioannidis, J. P. A., Salanti, G., Moons, K. G. M. and Workpackage, G. (2018). An overview of methods for network meta-analysis using individual participant data: When do benefits arise? Stat. Methods Med. Res. 27 1351–1364. MR3777761 https://doi.org/10.1177/0962280216660741
Donegan, S., Williamson, P., D’Alessandro, U. and Tudur Smith, C. (2012). Assessing the consistency assumption by exploring treatment by covariate interactions in mixed treatment comparison meta-analysis: Individual patient-level covariates versus aggregate trial-level covariates. Stat. Med. 31 3840–3857. MR3041777 https://doi.org/10.1002/sim.5470
Efthimiou, O., Debray, T. P. A., van Valkenhoef, G., Trelle, S., Panayidou, K., Moons, K., Reitsma, J. B., Shang, A. and Salanti, G. (2016). GetReal in network meta-analysis: A review of the methodology. Res. Synth. Methods 7 236–263. https://doi.org/10.1002/jrsm.1195
Enderlein, G. (1988). Fleiss, J. L.: The design and analysis of clinical experiments. Biom. J. 30 304–304. https://doi.org/10.1002/bimj.4710300308
Gelman, A., Hill, J. and Vehtari, A. (2020). Regression and Other Stories. Cambridge Univ. Press, Cambridge.
Godolphin, P. J., White, I. R., Tierney, J. F. and Fisher, D. J. (2023). Estimating interactions and subgroup-specific treatment effects in meta-analysis without aggregation bias: A within-trial framework. Res. Synth. Methods 14 68–78. https://doi.org/10.1002/jrsm.1590
Green, A. K., Trivedi, N., Hsu, J. J., Yu, N. L., Bach, P. B. and Chimonas, S. (2022). Despite the FDA’s five-year plan, black patients remain inadequately represented in clinical trials for drugs: Study examines FDA’s five-year action plan aimed at improving diversity in and transparency of pivotal clinical trials for newly-approved drugs. Health Aff. 41 368–374. https://doi.org/10.1377/hlthaff.2021.01432
Han, L., Hou, J., Cho, K., Duan, R. and Cai, T. (2021). Federated Adaptive Causal Estimation (FACE) of target treatment effects. Available at arXiv:2112.09313.
Hatt, T., Berrevoets, J., Curth, A., Feuerriegel, S. and van der Schaar, M. (2022). Combining observational and randomized data for estimating heterogeneous treatment effects. arXiv preprint. Available at arXiv:2202.12891.
Hayward, R. A., Gagnier, J. J., Borenstein, M., van der Heijden, G. J. M. G., Dahabreh, I. J., Sun, X., Sauerbrei, W., Walsh, M., Ioannidis, J. P. A. et al. (2020). Instrument for the Credibility of Effect Modification Analyses (ICEMAN) in randomized controlled trials and meta-analyses: Manual version 1.0.
Hong, H., Fu, H. and Carlin, B. P. (2018). Power and commensurate priors for synthesizing aggregate and individual patient level data in network meta-analysis. J. R. Stat. Soc. Ser. C. Appl. Stat. 67 1047–1069. MR3832263 https://doi.org/10.1111/rssc.12275
Hong, H., Fu, H., Price, K. L. and Carlin, B. P. (2015). Incorporation of individual-patient data in network meta-analysis for multiple continuous endpoints, with application to diabetes treatment. Stat. Med. 34 2794–2819. MR3375982 https://doi.org/10.1002/sim.6519
Hua, H., Burke, D. L., Crowther, M. J., Ensor, J., Tudur Smith, C. and Riley, R. D. (2017). One-stage individual participant data meta-analysis models: Estimation of treatment-covariate interactions must avoid ecological bias by separating out within-trial and across-trial information. Stat. Med. 36 772–789. MR3597655 https://doi.org/10.1002/sim.7171
Jolani, S., Debray, T. P. A., Koffijberg, H., van Buuren, S. and Moons, K. G. M. (2015). Imputation of systematically missing predictors in an individual participant data meta-analysis: A generalized approach using MICE. Stat. Med. 34 1841–1863. MR3334696 https://doi.org/10.1002/sim.6451
K ALLUS , N., P ULI , A. M. and S HALIT, U. (2018). Removing S ARAMAGO , P., S UTTON , A. J., C OOPER , N. J. and M ANCA , A.
hidden confounding by experimental grounding. Available at (2012). Mixed treatment comparisons using aggregate and individ-
arXiv:1810.11646. ual participant level data. Stat. Med. 31 3516–3536. MR3041828
KENNEDY, E. H. (2020). Optimal doubly robust estimation of heterogeneous causal effects. Available at arXiv:2004.14497.
KENT, D. M., PAULUS, J. K., VAN KLAVEREN, D., D’AGOSTINO, R., GOODMAN, S., HAYWARD, R., IOANNIDIS, J. P. A., PATRICK-LAKE, B., MORTON, S. et al. (2020). The predictive approaches to treatment effect heterogeneity (PATH) statement. Ann. Intern. Med. 172 35–45.
KENT, D. M., ROTHWELL, P. M., IOANNIDIS, J. P. A., ALTMAN, D. G. and HAYWARD, R. A. (2010). Assessing and reporting heterogeneity in treatment effects in clinical trials: A proposal. Trials 11 85. https://doi.org/10.1186/1745-6215-11-85
KOVALCHIK, S. A. (2013). Aggregate-data estimation of an individual patient data linear random effects meta-analysis with a patient covariate-treatment interaction term. Biostatistics 14 273–283. https://doi.org/10.1093/biostatistics/kxs035
KÜNZEL, S. R., SEKHON, J. S., BICKEL, P. J. and YU, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. In Proceedings of the National Academy of Sciences 116 4156–4165.
LAMBERT, P. C., SUTTON, A. J., ABRAMS, K. R. and JONES, D. R. (2002). A comparison of summary patient-level covariates in meta-regression with individual patient data meta-analysis. J. Clin. Epidemiol. 55 86–94. https://doi.org/10.1016/S0895-4356(01)00414-0
MCCANDLESS, L. (2009). Bayesian Methods for Data Analysis, 3rd ed. Bradley P. Carlin and Thomas A. Louis, Chapman & Hall/CRC, Boca Raton, 2008. ISBN 9781584886976.
NIE, X. and WAGER, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika 108 299–319. MR4259133 https://doi.org/10.1093/biomet/asaa076
PETRELLI, F. and BARNI, S. (2012). Surgery of primary tumors in stage IV breast cancer: An updated meta-analysis of published studies with meta-regression. Med. Oncol. 29 3282–3290. https://doi.org/10.1007/s12032-012-0310-0
RILEY, R. D., LAMBERT, P. C., STAESSEN, J. A., WANG, J., GUEYFFIER, F., THIJS, L. and BOUTITIE, F. (2008). Meta-analysis of continuous outcomes combining individual patient data and aggregate data. Stat. Med. 27 1870–1893. MR2420350 https://doi.org/10.1002/sim.3165
RILEY, R. D., STEWART, L. A. and TIERNEY, J. F. (2021). Individual participant data meta-analysis for healthcare research. Individual Participant Data Meta-Analysis: A Handbook for Healthcare Research 1–6.
ROSENMAN, E., BASSE, G., OWEN, A. and BAIOCCHI, M. (2020). Combining observational and experimental datasets using shrinkage estimators. Available at arXiv:2002.06708.
ROSENMAN, E. T. R., OWEN, A. B., BAIOCCHI, M. and BANACK, H. R. (2022). Propensity score methods for merging observational and experimental datasets. Stat. Med. 41 65–86. MR4376789 https://doi.org/10.1002/sim.9223
RUBIN, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66 688–701. https://doi.org/10.1037/h0037350
SAMARA, M. T., NIKOLAKOPOULOU, A., SALANTI, G. and LEUCHT, S. (2019). How many patients with schizophrenia do not respond to antipsychotic drugs in the short term? An analysis based on individual patient data from randomized controlled trials. Schizophr. Bull. 45 639–646. https://doi.org/10.1093/schbul/sby095
https://doi.org/10.1002/sim.5442
SEO, M., WHITE, I. R., FURUKAWA, T. A., IMAI, H., VALGIMIGLI, M., EGGER, M., ZWAHLEN, M. and EFTHIMIOU, O. (2021). Comparing methods for estimating patient-specific treatment effects in individual patient data meta-analysis. Stat. Med. 40 1553–1573. MR4212329 https://doi.org/10.1002/sim.8859
SILVA, S., GUTMAN, B. A., ROMERO, E., THOMPSON, P. A., ALTMANN, A. and LORENZI, M. (2019). Federated learning in distributed medical databases: Meta-analysis of large-scale subcortical brain data. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019) 270–274. IEEE, Los Alamitos, CA.
SIMMONDS, M. C. and HIGGINS, J. P. T. (2007). Covariate heterogeneity in meta-analysis: Criteria for deciding between meta-regression and individual patient data. Stat. Med. 26 2982–2999. MR2370988 https://doi.org/10.1002/sim.2768
TAN, X., CHANG, C.-C. H. and TANG, L. (2021). A tree-based federated learning approach for personalized treatment effect estimation from heterogeneous data sources. Available at arXiv:2103.06261.
TERAMUKAI, S., MATSUYAMA, Y., MIZUNO, S. and SAKAMOTO, J. (2004). Individual patient-level and study-level meta-analysis for investigating modifiers of treatment effect. Jpn. J. Clin. Oncol. 34 717–721. https://doi.org/10.1093/jjco/hyh138
THOMAS, D., RADJI, S. and BENEDETTI, A. (2014). Systematic review of methods for individual patient data meta-analysis with binary outcomes. BMC Med. Res. Methodol. 14. https://doi.org/10.1186/1471-2288-14-79
TIERNEY, J. F., VALE, C., RILEY, R., SMITH, C. T., STEWART, L., CLARKE, M. and ROVERS, M. (2015). Individual participant data (IPD) meta-analyses of randomised controlled trials: Guidance on their use. PLoS Med. 12 e1001855. https://doi.org/10.1371/journal.pmed.1001855
TRIVEDI, M. H., RUSH, A. J., WISNIEWSKI, S. R., NIERENBERG, A. A., WARDEN, D., RITZ, L., NORQUIST, G., HOWLAND, R. H., LEBOWITZ, B. et al. (2006). Evaluation of outcomes with citalopram for depression using measurement-based care in STAR*D: Implications for clinical practice. Am. J. Psychiatr. 163 28–40. https://doi.org/10.1176/appi.ajp.163.1.28
VO, T. V., HOANG, T. N., LEE, Y. and LEONG, T.-Y. (2021). Federated estimation of causal effects from observational data. Available at arXiv:2106.00456.
WU, L. and YANG, S. (2021). Integrative R-learner of heterogeneous treatment effects combining experimental and observational studies. In First Conference on Causal Learning and Reasoning.
XIE, F., CHAN, J. C. and MA, R. C. (2018). Precision medicine in diabetes prevention, classification and management. J. Diabetes Investig. 9 998–1015. https://doi.org/10.1111/jdi.12830
YANG, Q., LIU, Y., CHENG, Y., KANG, Y., CHEN, T. and YU, H. (2022). Federated Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 43. Springer, Cham. Reprint of the 2020 original. MR4592510 https://doi.org/10.1007/978-3-031-01585-4
YANG, S., ZENG, D. and WANG, X. (2020). Elastic integrative analysis of randomized trial and real-world data for treatment heterogeneity estimation. Available at arXiv:2005.10579.
YANG, S., ZENG, D. and WANG, X. (2022). Improved inference for heterogeneous treatment effects using real-world data subject to hidden confounding. Available at arXiv:2007.12922.