Figures
Abstract
Researchers are often interested in understanding the disease subtype heterogeneity by testing whether a risk exposure has the same level of effect on different disease subtypes. The polytomous logistic regression (PLR) model provides a flexible tool for such an evaluation. Disease subtype heterogeneity can also be investigated with a case-only study that uses a case-case comparison procedure to directly assess the difference between risk effects on two disease subtypes. Motivated by a large consortium project on the genetic basis of non-Hodgkin lymphoma (NHL) subtypes, we develop PolyGIM, a procedure to fit the PLR model by integrating individual-level data with summary data extracted from multiple studies under different designs. The summary data consist of coefficient estimates from working logistic regression models established by external studies. Examples of the working model include the case-case comparison model and the case-control comparison model, which compares the control group with a subtype group or a broad disease group formed by merging several subtypes. PolyGIM efficiently evaluates risk effects and provides a powerful test for disease subtype heterogeneity in situations when only summary data, instead of individual-level data, is available from external studies due to various informatics and privacy constraints. We investigate the theoretic properties of PolyGIM and use simulation studies to demonstrate its advantages. Using data from eight genome-wide association studies within the NHL consortium, we apply it to study the effect of the polygenic risk score defined by a lymphoid malignancy on the risks of four NHL subtypes. These results show that PolyGIM can be a valuable tool for pooling data from multiple sources for a more coherent evaluation of disease subtype heterogeneity.
Author summary
Researchers usually classify a disease condition into subtypes with different progression patterns and treatment responses. Multiple studies often investigate a complex disease, but not all of them consider the same set of subtypes. In addition, due to various informatics and privacy constraints, it can be challenging to pool individual data across all studies for more efficient analyses. On the other hand, summarized data, such as those generated from genetic association studies, can be easily accessed. We develop PolyGIM, a flexible statistical framework to integrate detailed individual-level data with summary data from multiple sources to comprehensively assess the risk effect on different disease subtypes. We use PolyGIM to understand the genetic basis underlying four major non-Hodgkin lymphoma subtypes.
Citation: Fu S, Purdue MP, Zhang H, Qin J, Song L, Berndt SI, et al. (2023) Improve the model of disease subtype heterogeneity by leveraging external summary data. PLoS Comput Biol 19(7): e1011236. https://doi.org/10.1371/journal.pcbi.1011236
Editor: Hongyu Zhao, Yale, UNITED STATES
Received: November 1, 2022; Accepted: June 2, 2023; Published: July 12, 2023
This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Data Availability: Data used in this paper can be accessed through dbGaP, with Study Accession id: phs000801.v2.p1. The R package developed for the method can be found at https://github.com/fushengstat/PolyGIM.
Funding: This work was supported by the Intramural Research Program, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, to MPP, SIB, and KY. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
The polytomous logistic regression (PLR) model is a standard approach to modeling the effects of risk factors on a multicategory outcome [1]. It can be applied to a retrospectively sampled case-control study with one control group and several case groups, such as those defined by different disease subtypes [2]. The PLR model provides a consistent estimate of subtype-specific odds ratio associated with risk exposure [3]. Besides subtype-specific odds ratio, researchers are often interested in understanding the disease subtype heterogeneity by testing whether a risk exposure has the same effect on different disease subtypes [4–6]. Demonstrated evidence of the non-uniform exposure effect would suggest that different etiologic mechanisms cause some disease subtypes.
Leveraging robust findings on genetic associations from large-scale genome-wide association studies (GWAS), recent studies have been using the polygenic risk score (PRS) [7–9] to dissect the complex genetic architecture underlying different disease subtypes [10–14]. We calculate PRS as a weighted average of genotypes on a set of trait-associated genetic markers, i.e., single nucleotide polymorphisms (SNPs), with weights being their effect sizes estimated from existing large-scale GWAS [9, 15, 16]. The PRS provides an estimate of the overall genetic influence on specific traits and can be used as an effective instrument to dissect the genetic architecture underlying different disease subtypes.
Multiple studies are often conducted to investigate a complex disease but do not consider the same subtypes. In addition, due to various informatics and privacy constraints, it can be challenging to share individual data among studies for a more powerful pooled analysis. Our method is motivated by an InterLymph Consortium project [17]. We intend to study the non-Hodgkin lymphoma (NHL) subtype heterogeneity using data generated from multiple GWAS on four major NHL subtypes, known as hronic lymphocytic leukemia (CLL), diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL), and marginal zone lymphoma (MZL) [18–23]. NHL is the most common hematological malignancy and has multiple subtypes with distinct morphologic, genetic, and clinical features [24, 25]. In particular, we are interested in evaluating whether the PRS of a lymphoid malignancy, such as Hodgkin lymphoma, exhibits different effects on the four considered NHL subtypes. In this project, we have individual-level genotype data from one study (the internal study) that consists of subjects from each of the four NHL subtype groups and controls. Instead of having individual-level data, we obtain the typical SNP-level summary data from the other seven GWAS (external studies). The SNP-level summary data consist of estimated regression coefficients, each representing an SNP’s marginal effect on a specific NHL subtype. This type of GWAS summary statistics is usually accessible from public databases and has become a valuable resource for future genetic studies [26–29]. Among those seven external case-control studies, one study has cases from three NHL subtype groups, and each of the remaining studies consists of cases of a unique subtype. We aim to integrate individual-level and summary data from all eight GWAS for a more efficient evaluation of the PRS effect and a more powerful heterogeneity test on whether the PRS has a common effect on different NHL subtypes.
Like SNP-level summary data, we consider model-based summary data, which consists of estimated coefficients from working models established by external studies. Working models can be quite different from the target model (i.e., the underlying risk model assumed by the internal study) and thus misspecified. In the NHL study mentioned above, SNP-level summary data are calculated from marginal models, each of which assesses the effect of a single SNP on an outcome. In contrast, the target model is the PRS effect model, which models the effect of a composite genotype score defined by a set of SNPs. In some applications, summary data consist of the estimated coefficient of a risk exposure measured in a different scale (e.g., in log scale) from the one used in the target model. Therefore, in general, we cannot use a standard meta-analysis procedure to pool summary data with estimates from the internal study to improve the inference on the target model.
A few procedures have been developed to integrate model-based summary data with individual-level data, but they mainly focused on studies with prospectively collected samples [30–37]. [38] recently developed an empirical likelihood approach to account for the sampling bias in the case-control study design to synthesize data from retrospective case-control studies. They focused on a binary outcome with a logistic regression model as the underlying risk model.
Here we expand the approach of [38] to multicategory outcomes and develop an efficient procedure to fit the PLR model using individual-level and summary data. Unlike the procedure of [38], the new approach allows different types of outcomes (i.e., one is multicategory, and the other is binary) to be studied by internal and external studies. The binary outcome considered by an external study can be any dichotomized version of the original multicategory outcome. For example, an external study can be a case-control study that adopts a logistic regression model to evaluate a risk exposure’s effect (odds ratio) on a specific disease subtype or a group of several disease subtypes. In another scenario, an external study might collect only cases (i.e., a case-only study) and uses a logistic regression model to directly assess the difference between the effects of a risk factor on two disease subtypes. We also implement the new method into a user-friendly R package, PolyGIM, which can incorporate summary data from multiple external studies with possibly overlapping subjects. We conduct theoretical and simulation studies to demonstrate the advantage of the proposed procedure. We apply the new procedure to the NHL study mentioned above study.
2 Method
2.1 Setup
Let’s assume that we have a case-control study of a multicategory outcome Y and a set of covariates X. The outcome Y takes integer value from 0 to K, with Y = 0 for controls, and Y > 0 for other K different outcomes. For example, in the motivation example we consider a case-control study of NHL with four subtypes. We can denote individual-level data from this study (called internal study) as {Xi, Yi, i = 1, …, n}, with nk subjects having outcome Y = k, and . In the following discussion, we refer Y = 1, …, K as different disease subtypes. We assume the following prospective PLR model as the underlying model for the effect of X on Y,
(1)
where ωk is the intercept and Mk(X; θk) is a given function, such as Mk(X; θk) = X⊤θk, with θk being the vector of parameters of interest. In the NHL example, we have Mk(X; θk) = θk ⋅ S(X), with S(X) being the PRS. S(X) is defined as ∑i wiXi, with Xi being the genotype at the k-th SNP, and wi being the weight defined by previous GWAS. Let
be the collection of all coefficients. We are interested in estimating θ or comparing θk among different disease subtypes.
Beside data from the internal study, we have summary data from multiple external studies. In the NHL example, we consider the PRS defined by a set of 21 SNPs [39]. For each of those SNPs, we have SNP-level summary statistics from seven external studies. The SNP-level summary data is the estimated SNP’s marginal association (the regression coefficient) with one NHL subtype based on a standard logistic regression model (called the working model). Therefore, for an external study consisting of three different NHL subtype cases and controls, its summary data consists of 21 × 3 estimated regression coefficients, each of which is derived from a separate working model. The goal is to fit the PLR model (1) using individual-level and summary data and to utilize the fitted model for more efficient inference of θk and disease subtype heterogeneity. An illustration of the proposed integration framework is given in Fig 1. Before considering this complicated scenario, we first present the methodology assuming that there is one external study, and that the summary data is derived from a single working model. We will describe the method under the more general setting in Section 2.7.
We consider a setting similar to the NHL study. The outcome Y takes values from {0, 1, 2, 3, 4}. The set of covariates X comprises measures on 21 genetic markers (SNPs). We have individual-level data (Y, X) from the internal study and summary data from three external studies. Summary data consists of coefficient estimates from marginal logistic regression models established by the k-th external study (k = 1, 2, 3), with the binary outcome (D(k)) of each external model defined by Y. The goal is to fit the polytomous regression model for the PRS effect given in the red box using individual-level and summary data.
More specifically, we consider summary data consisting of regression coefficient estimates derived from a working model of a binary outcome D. The binary outcome D is derived from of the multicategory outcome Y. The working model is assumed to be a standard logistic regression model. Several versions of D can occur in practice. For example, a working model can treat several disease subtypes as one broad disease group, that is, D = 0 if Y = 0, and D = 1 if Y ∈ {c1, …, cL}, with 1 ≤ cl ≤ K, 1 ≤ l ≤ L, where L is the total number of disease subtypes under study. We call this type of working model the grouped case-control (GC) model. We further assume the number of cases within each subtype group is known in the GC model. In some cases, such as in the NHL study, the GC model can consider just one subtype (i.e., L = 1). Another version of D arises from a case-only study, in which a logistic regression model is used to evaluate whether a risk factor has the same effect on two disease subtypes c1 and c2. We call this the case-case (CC) comparison model, with D = 0 for Y = c1, and D = 1 for Y = c2.
2.2 Likelihood for the internal study
Observations from the internal case-control study can be retrospectively sampled conditioning on the outcome Y from a source population. For some other case-control studies, cases are sampled from P(X|Y > 0), with their disease subtypes classified after the enrollment. As long as the sampling criterion is independent of X, we can consider both types of case-control studies as stratified samples from P(X|Y = 0) and P(X|Y = k), k = 1, …, K.
We consider the retrospective likelihood for the case-control data as this framework is convenient for incorporating external summary data. Let πk = P(Y = k) and , k = 1, …, K. We define the vector τ = (τ1, …, τK)⊤. By Bayes’ rule, model (1) has the following equivalent retrospective representation,
(2)
where Δk(X; ξ) = exp{τk + Mk(X; θk)}, k = 1, …, K, with ξ = (τ⊤, θ⊤)⊤. We denote ξ* = (τ*⊤, θ*⊤)⊤ as the true population value of ξ.
We use (2) to form the likelihood of observing X given Y. Denote as the empirical distribution of P(X|Y = 0) supported on samples from the internal study, and
as the indicator function. We can estimate ξ by maximizing the following empirical log-likelihood,
(3)
subject to constraints
and
for k = 1, …, K. Using the method of Lagrange multipliers to profile out
, we can infer ξ by finding the stationary point of the following profile log-likelihood,
(4)
where ρk = nk/n0 for k = 1, …, K. Let
be the maximum likelihood estimate (MLE) based on the internal data. When Mk(X; θk) = X⊤θk(k = 1, …, K), we can follow [40] to show that the empirical likelihood estimate of θ derived from (4) is exactly the same as the maximum likelihood estimate based on the standard prospective likelihood function specified by (1).
2.3 Properties of summary data
Here we will show the property of the summary data and its relationship with θ, the parameter of interest. We assume that the external study consists of observations {Xi, Di, i = n + 1, …, n + N}, representing stratified samples from P(X|D = 0) and P(X|D = 1). Suppose a standard logistic regression model is used as the working model to study the effect of X on D. We can represent this model by its equivalent retrospective formation as,
(5)
with P(X|D = 1) and P(X|D = 0) being distributions of X in the two groups. This model can be misspecified and thus inconsistent with (2). In model (5), all unknown parameters are divided into two parts,
and β. We use α to represent the set of nuisance parameters whose estimates are not given as part of the summary data. Note that the intercept term α0 is always assumed to be part of nuisance parameters. The summary data only consists of the estimate of β.
Let N0 and N1 be sample sizes in groups D = 0 and D = 1, respectively. Based on the working model (5), the log-likelihood function of the external study is
where ρ = N1/N0 and
.
is the solution of the following estimating equation,
(6)
Note that
is the same as the estimate from the standard package for the logistic regression model. For simplicity, let
and
. According to [41],
is a consistent estimate of (α*, β*), which satisfies the following stochastic constraint equation
(7)
where
and
are expectations over P(X|D = 0) and P(X|D = 1), respectively.
The estimate depends on how D is derived from Y. As mentioned in Section 2.1, two types of binary outcomes are typically encountered. One is used in the CC model, and the other is used in the GC model.
The CC model compares the distribution of X between disease subtype groups c0 and c1, c0 ≠ c1 ∈ {1, …, K}. Its binary outcome is defined as D = DCC, with DCC = 0 if Y = c0, and DCC = 1 if Y = c1. The CC model is commonly used for the study of disease subtype heterogeneity. Let μ = (τ⊤, θ⊤, α⊤, β⊤)⊤ be the vector of all unknown parameters, and μ* = (τ*⊤, θ*⊤, α*⊤, β*⊤)⊤ be their true population values. Notice that and
due to (2), we can express (7) as
(8)
with
where
is the expectation of over P(X|Y = 0). The asymptotic distribution of
is given by
(9)
with A and B shown in S1 Appendix. Based on (9), we know
, where Σ0 = (A−1B A−1)β β is the submatrix of A−1B A−1 corresponding to β. Notice that A and B are determined by the expectation defined by P(X|Y = 0). We will provide an empirical likelihood based estimate of P(X|Y = 0) later. According to (9),
is a consistent estimate of β*, whose relationship with θ*, the parameter of interest, is governed by (8). Therefore, summary data
contains information on θ*.
In the GC model, one or several disease subtypes are considered as one broad case group. The binary outcome is defined as D = DGC, with DGC = 0 if Y = 0, and DGC = 1 if Y belongs to {c1, c2, …, cL}. We further assume that the proportion of cases belonging to each disease subtype k is known, and denote it as qk(k = 1, …, K). Note that some qks can be zero for studies that do not collect cases with certain disease subtypes. Based on (2), we have
Then (8) still applies with g(X; μ) defined as
Furthermore, the asymptotic distribution of
takes the same form as (9), where A and B are defined in S1 Appendix. We can obtain
similarly as for the CC model.
In the following discussion, we will classify summary data into two distinct types, the regular and irregular summary data. Regular summary data has its corresponding g(X; μ*) to be not constantly 0 over X. Irregular summary data is the one with g(X; μ*) ≡ 0 for all X. The validity of the proposed procedure for integrating regular summary data requires some standard regularity conditions, which would not hold if g(X; μ*) ≡ 0. Therefore, a different integration procedure is needed for irregular summary data.
2.4 Procedure for integrating regular summary data
Here we extend the generalized integration model (GIM) of [38] to fit a PLR model by integrating individual-level and summary data. We call the new procedure PolyGIM.
The log-likelihood for the internal study is given by (3). The log-likelihood function of the summary data can be represented as . Since we do not know Σ0, we can replace it with a known matrix V (e.g., the identity matrix I) as the starting point. By combining the two into a joint (pseudo) log-likelihood function, we can estimate μ via solving the following optimization problem over
,
(10)
The last constraint equation in (10) is from (8) with a specific g.
We employ the Lagrange multiplier approach to solve (10). The Lagrange function can be written as,
where κ, λ and ν are the Lagrange multipliers. It can be seen that κ = 1 and
Let η = (λ⊤, ν⊤, μ⊤)⊤ be the vector of all variables. Therefore, the profiled log-likelihood function can be written as
(11)
Hence, solving the original problem (10) can be translated into finding the solution of a set of score equations (shown in S1 Appendix). Then we can apply the Newton-Raphson algorithm to to find the solution
. In S1 Appendix, we show in Lemma 1 that under some regularity conditions
is a consistent estimate of η* = (λ*⊤, ν*⊤, μ*⊤)⊤, with
and
. In particular, regularity condition C4 requires g(X; μ*) to be not constantly 0. More intuitively, if g(X; μ*) ≡ 0, the Lagrange multiplier ν in (11) is not identifiable. Consequently, the procedure based on (11) is only applicable to summary data that satisfy those regular conditions. We refer to such summary data as regular summary data, while summary data not meeting those conditions are called irregular summary data. Summary data derived from a GC model in general is regular. It is also regular if it is derived from a CC model that does not consider the same set of covariates as the one by the underlying PLR model, as in this setting g(X; μ) can not be constantly 0 for any μ.
Based on the estimate , we can obtain the estimated empirical distribution of P(X|Y = 0) as
(12)
Furthermore, we can estimate A and B in (9) by calculating the expectation over P(X|Y = 0) with
. An example is shown in S1 Appendix. Therefore,
can be estimated as
, with
given by
(13)
which is the submatrix of
corresponding to β.
Here is the summary of theoretic properties of the estimate.
Proposition 1 Under the regularity conditions given in S1 Appendix, assuming that N/n → γ, N1/N0 → ρ and nk/n → ρk for k = 1, …, K as n → ∞. We have is asymptotically normal, and its asymptotic variance-covariance matrix attains its minimum (in term of positive semidefinite) at V = Σ0. Furthermore,
is asymptotically more efficient than the internal data based MLE
.
Proofs are given in S1 Appendix. The optimality of estimate still holds when we replace Σ0 with its consistent estimate
. We propose the following iterative Algorithm 1 to obtain this optimal estimate.
Algorithm 1 Algorithm for PolyGIM
1: Based on the internal study, we fit the PLR model to obtain a consistent estimate of ξ, and fit the working model to obtain a consistent estimate of (α, β). We denote these estimates as , where
and
. Then an initial estimate of
is computed according to (13) at
.
2: Resolve the score equation with . Let the estimates be
.
3: Estimate the empirical probability via (12).
4: Update via (13).
5: Repeat Steps 2 to 4 until is converged.
We use the following strategy to choose the initial point in Step 1. We obtain by fitting the PLR model with the internal data. Since we formulate the PLR model with the empirical likelihood representation (4), we adjust estimates of intercept terms from the standard
package for the PLR model by subtracting log(ρi), i = 1, …, K. To obtain the initial estimate of (α, β), we first adjust for the sample size difference between the internal and external studies by assigning each subject from the internal study an appropriate weight. Then we fit a weighted logistic regression model to obtain the initial estimate of (α, β). More specifically, suppose that the external study fits a GC model based on N0 controls and N1 grouped cases, with ri proportion of them having disease subtype i(i = 1, …, K). We assign N0/n0 as the weight for each control and (N1ri)/ni as the weight for each subtype i case in the internal study. We can obtain
by fitting a weighted logistic regression model with those weights. Again, we need to adjust the estimate of the intercept term due to the use of the empirical likelihood representation.
2.5 Integrating irregular summary data
When summary data is derived from a working CC model that is consistent with the underlying risk model, it may become irregular with g(X; μ) ≡ 0. More specifically, if Mk(⋅) = X⊤θk in the underlying risk model (2) and m(⋅) = X⊤β in the working CC model (5) with two subtypes c0 and c1, based on the definitions of {φ0, φ1, Δk, δ} in Section 2.2 and 2.3, it can be shown that
So we have g(X; μ ≡ 0 if we let
(14)
Notice that the true value of μ satisfies (14) due to the consistency between the working model and the underlying risk model. Under these constraints (14), we can eliminate
from (10) and use the Lagrange multiplier approach to solve a modified version of (10). The resultant profile log-likelihood function can be written as
(15)
We can obtain the estimate
based on (15). Under regularity conditions given in S1 Appendix, we can show that
is consistent for any given positive definite V.
The asymptotic distribution of and the optimal choice of V are summarized by the following result, with proofs given in S1 Appendix.
Proposition 2 Under model (15) and regularity conditions given in S1 Appendix, we have is asymptotically normal, and its asymptotic variance-covariance matrix attains its minimum at V = Σ0. In particular, the asymptotic variance-covariance matrix of
has the following form,
where
is defined in S1 Appendix, and
with
and 1 being a vector of 1’s.
Similar to the regular PolyGIM procedure, we can obtain by an iterative algorithm. Even though we use different procedures for integrating regular and irregular summary data, in the following discussion we still call them the PolyGIM procedure when there is no confusion.
We can also use a restricted MLE (RMLE) approach to incorporate this irregular summary data. Let be the solution of the following constraint optimization problem,
(16)
This setup is different from the standard RMLE, as in the constraint equation has variability. We need to account for this uncertainty when estimating the variance-covariance matrix of
. In S1 Appendix, we prove the following result that shows the PolyGIM estimate is more efficient than other considered estimates.
Proposition 3 The estimate
based on (15) is asymptotically more efficient than both the internal data based MLE
and the RMLE
.
2.6 Test for disease subtype heterogeneity
As mentioned in the Introduction, researchers are often interested in testing whether a risk factor Xj has the same effect on different disease subtypes, with the null hypothesis being H0 : θj1 = θj2 = … = θjK. We can use PolyGIM to combine data from multiple sources for a more efficient test.
By Proposition 1, , the estimated coefficients corresponding to Xj, has a multivariate normal distribution
, where
is extracted from the estimated variance-covariance matrix of
. So its log-likelihood function can be written as
Under the null,
follows
. The MLE of a under the null can be derived as
. Thus, we can construct the following likelihood ratio test,
Under the null, this test follows a chi-square distribution with K − 1 degrees of freedom.
2.7 Summary data from multiple models
So far, we have considered summary data from a single working model based on one external study. Here we show how to incorporate summary data from multiple working models, some of which can be fitted with overlapping samples. For example, in the NHL example, we have summary data from seven independent external studies. From one study consisting of cases of three NHL subtypes, we have three summary statistics on each SNP, which are estimated from three GC models sharing a common set of controls. For notation simplicity, we focus on regular summary data in the following discussion.
First, we provide some theoretical insights on the PolyGIM procedure with multiple summary data. Suppose that we are given the summary data , from two external models. Let αi (i = 1, 2) be the corresponding nuisance parameters. Similar to Section 2.3, we can establish the following asymptotics
where
are the true values. The specific form of Σ depends on the two external models and whether there are overlapped samples used for fitting them. For example, if the two external models are fitted with data from two different studies, we have Σ12 = 0. If overlapped samples are used for fitting the two models, Σ12 ≠ 0. Later we will provide more details on how to estimate Σ.
Let be the optimal PolyGIM estimate of ξ using summary data
,
be the optimal estimate using summary data
. We can show theoretically that it is always beneficial by using more summary data (Proposition 4), see S1 Appendix.
More technical details on integrating summary data from multiple studies are presented in S1 Appendix.
3 Results
3.1 Simulation studies with summary data from one external study
To verify the theoretic properties of the proposed PolyGIM method, we first considered a simple scenario when summary data comes from one external study. Assume that X = (X1, X2) was a vector of two binary biomarkers and that the outcome Y had three classes, with 0 for the control group and 1/2 for the two disease subtypes. In the study population, the two biomarkers were correlated, with the joint probability of (X1, X2) = (0, 0), (0, 1), (1, 0), and (1, 1) specified as 0.28, 0.12, 0.18 and 0.42, respectively. The true disease risk model was chosen as
We let (ω1, ω2) = (−0.2, 0.1) and (θ11, θ12, θ21, θ22) = (1.5, −1.0, −0.5, 1.2). From this population, case-control studies were retrospectively generated conditioning on the outcome Y, with the internal study consisting of 2000 controls, 500 subtype 1 cases and 500 subtype 2 cases, and the external study consisting of 1000 controls, 600 subtype 1 cases and 400 subtype 2 cases.
We considered two classes of summary data, one from the CC model and the other from the GC model. The binary outcome D used in a CC model was defined as D = 0 for subtype 1 and D = 1 for subtype 2. Summary data from two types of CC working models (CC1 and CC2) were studied. The CC1 working model only included covariate X1 and thus was inconsistent with the underlying risk model. It generated regular summary data , which was the estimated coefficient of X1. The CC2 working model was consistent with the underlying risk model and included covariates (X1, X2). It generated irregular summary data
, which were estimated coefficients of X1 and X2.
The GC model merged the two disease subtypes into one group and compared it with the control group. Two GC working models (GC1 and GC2) were used to generate regular summary data. GC1 model included only X1 as a covariate and generated summary data . GC2 model had X1 and X2 as covariates and generated summary data
.
For each simulated dataset (including an internal study and summary data), we applied three versions of PolyGIM, one based on the most optimal estimate (called GIMopt), the others based on the estimate with two options of V, one called GIMI with V = I, and the other call
with V = Vσ, a diagonal matrix that has each diagonal element being
, with
being the variance of the i-th summary statistic from the summary data
. We applied different types of PolyGIM procedures depending on whether the summary data was regular or irregular. As a comparison, we also analyzed the internal study using the standard PLR model (called MLEint). We simulated 2000 datasets under each scenario to evaluate the performance of all considered methods. Tables 1 and 2 summarized simulation results in situations when summary data is generated from CC and GC models, respectively.
Table 1 for the CC models shows that all considered methods have expected performances, with unbiased estimates and their estimated standard errors matching well with corresponding empirical standard errors. When using irregular summary data , we can see that GIMopt provides more efficient estimates of {θk1, θk2, k = 1, 2} compared to GIMI,
and MLEint. These findings align with the conclusions of Propositions 2 and 3. If only
is used as summary data, GIMopt and
are more effective for estimating coefficients of X1, whereas all methods have similar efficiency levels for estimating coefficients of X2. Notably, results from Table 1 reveal that
has a similar level of efficiency as GIMopt when integrating only one summary statistic, but it becomes less efficient when integrating two correlated summary statistics. This efficiency loss arises because
selects the matrix V = Vσ in (10), assuming all summary statistics to be independent. This chosen Vσ is suboptimal when summary statistics are correlated, since the optimal matrix should be a consistent estimate of the variance-covariance matrix for these statistics. However, when the summary statistics are indeed independent, Vσ can serve as a satisfactory practical approximation for the optimal matrix. Although they differ theoretically in their diagonal terms when external models are mis-specified (e.g., the CC1 model), this discrepancy is more theoretical than practical in our context, as evident from S1 Table. From Table 2 and S2 Table. for the GC models, we can reach similar conclusions as those from Table 1 and S1 Table.
3.2 Simulation studies with summary data from multiple external studies
We conducted additional simulation studies under a more complex setting to mimic the NHL study. We considered 21 SNPs X = (X1, …, X21), and the outcome Y had five classes, with Y = 0 representing controls and Y = 1 to 4 representing four different disease subtypes. We selected the 21 SNPs identical to those used in the NHL example, with the exception that we assumed they were independent to simplify the simulation procedure. The characteristics of these SNPs are summarized in S3 Table. The true underlying risk model was defined as
with the PRS
. We considered the following two set of parameters for the true model:
- Null PRS Model: (ω1, ω2, ω3, ω4) = (−3.9, −4.1, −3.6, −3.8) and (θ1, θ2, θ3, θ4) = (0, 0, 0, 0).
- Alternative PRS Model: ωi = −3.8, i = 1, …, 4, and (θ1, θ2, θ3, θ4) = (0.019, 0.092, −0.12, 0.047).
Both models were designed with intercept terms to ensure the rarity of each disease subtype in the study population (each with a prevalence of less than 2%). For Alternative PRS Model, the effects of the PRS were chosen to match those observed in the NCI study of the NHL example.
In practice the weights (wi) used in PRS are estimated with uncertainty from other studies, leading to a PRS with measurement error. We aim to evaluate the performance of procedures under consideration while accounting for the measurement error. To accomplish this, we generated estimates () of the true weights (wi) for each simulated dataset, assuming that the true weights wi were identical to those used in the real NHL example (S3 Table). We randomly generated
from a normal distribution
, where sei was the standard error reported in the published GWAS from which the SNP was genome-wide significantly detected (see S3 Table). We used
to calculate the PRS with measurement error, denoted as
, and varied the level of measurement error by choosing the scaling factor c from {0, 1, 16}. In the analysis of each simulated dataset, we used
instead of S(X) to reflect the measurement error.
First, we considered performance of MLEint and GIMopt under the null model. We used summary data from External Studies 1–5 listed in Table 3. All these five external studies, as well as the internal study, were generated from the same source population (the internal study population). S3 Table provides the allele frequency of the effect allele “1” for each of the 21 SNPs in the study population. For each external study, we used a logistic regression model to estimate the marginal effect of each SNP (i.e., the regression coefficient) on the risk of a specific disease subtype. The summary data consisted of these coefficient estimates for all considered 21 SNPs. We simulated 2000 datasets under Null PRS model and analyzed each dataset with MLEint and GIMopt. To assess the impact of measurement error, we employed three sets of PRS in the analysis of each simulated dataset, including the PRS without measurement error (i.e., ), the PRS with measurement error at the same level as the real NHL study (i.e.,
), and the PRS with elevated measurement error (i.e.,
). The simulation results, presented in Table 4, indicate that the level of uncertainty in PRS does not have an impact on the statistical properties of MLEint and GIMopt. This is expected since likelihood models for both procedures remain valid under the null model (i.e., θ = 0), even when using PRS with high levels of measurement error. Consequently, the estimate of the effect of the PRS remains consistent. Additionally, we evaluated the type I errors of GIMopt at various significance levels, as shown in S4 Table, and generated the Q-Q plots of estimated Z-scores for the PRS effect on different subtypes in S1 Fig. These results demonstrate that GIMopt has well calibrated type I errors and P-values.
Next, we assessed the performance of the considered procedures under Alternative PRS Model with non-zero PRS effects, and we summarized the simulation results in Table 5. Here are some notable observations. Firstly, when the PRS has either no or relatively low measurement error, both MLEint and GIMopt exhibit desirable statistical properties in terms of consistency, the accuracy of standard error estimation, and 95% confidence interval coverage probability. In these cases, GIMopt is more efficient than MLEint. Secondly, when the PRS has increased measurement error with , estimates obtained using MLEint and GIMopt become inconsistent, with a noticeable increase in bias.
We further evaluated how the sample sizes of external studies affect the efficiency of GIMopt by increasing the sample size of the five external studies by 5 and 10 times, with a focus on PRS that have no or relatively low measurement error (i.e., ). The results, which are summarized in Table 6, demonstrate that the efficiency of GIMopt improves as the sample sizes of external studies increase. This same pattern is observed when using an internal study with only 10% of the original sample size (see S5 Table).
Next, we conducted simulations to compare strategies for integrating summary data when there were differences in the joint distribution of X between the internal and external study populations. Seven external studies were considered, as listed in Table 3. The first five were generated from the same source population as the internal study, while External Studies 6 and 7 were generated from two different external study populations. The allele frequency of allele “1” in External Study 6 was 0.2 lower than that in the internal study population for each of the first 10 SNPs listed in S3 Table, with a lower bound of 0.05. Similarly, in External Study 7, the allele frequency of allele “1” was 0.2 lower than that in the internal study population for each of the last 11 SNPs listed in S3 Table, with a lower bound of 0.05. All other SNPs in these two external studies had the same distribution as in the internal study population. It is worth noting that the discrepancy in the SNP distribution between the two studies is substantially higher than the discrepancy observed in the NHL study. We examined three strategies for integrating summary data from the seven external studies. The first strategy utilized summary data from only the first five external studies. The second strategy employed complete summary data, including summary statistics on the 21 SNPs across all seven studies. The third strategy used summary data from the first five external studies and partial summary data from the remaining two external studies, which included summary data only on SNPs having the same allele frequency as in the internal study population. This excluded summary data on the first 10 SNPs from External Study 6 and the last 11 SNPs from External Study 7. To investigate the impact of varying allele frequencies between the internal and external study populations, we conducted additional simulations. Specifically, we increased the sample sizes of External Studies 6 and 7 by a factor of five to further explore this effect.
Table 7 summarizes the results of our simulation study, from which several observations can be made. The GIMopt method, when using additional partial summary data from External Studies 6 and 7, shows desirable statistical properties and is more efficient than using only summary data from the first five external studies. This advantage becomes more pronounced as we increase the sample size of the two additional external studies. In contrast, GIMopt using complete summary data could lead to erroneous standard error estimates, especially when the sample size is increased in External Studies 6 and 7. This is an expected outcome, as the PolyGIM procedure is tailored to the scenario where the internal and external studies are conducted in the same source population. Regarding the use of partial summary data, we can show, using arguments similar to those presented in [42], that it is valid to use partial summary data on a subset of SNPs if the following conditions are met. First, the joint distribution of this subset of SNPs in the external study population must be the same as in the internal study population. Second, these SNPs must be independent of the remaining SNPs in both study populations. Third, the disease prevalence must be relatively rare in both populations so that the SNP distribution is similar between the control group and the general population. Our simulation study demonstrated that the use of partial summary data is valid and beneficial when these underlying assumptions are satisfied.
Finally, we carried out supplementary experiments to assess the computational efficiency and memory requirements of the PolyGIM package. In these experiments, we focused on the internal study and the external studies listed in Table 3, and analyzed a PRS with an increasing number of SNPs, ranging from 21 to 105. We considered summary data derived from the first 1, 3, or 5 external studies provided in Table 3. The computational time and memory requirements were recorded over 100 replications and are presented in S6 and S7 Tables. These tables offer a glimpse into the performance of the PolyGIM package under various conditions.
3.3 Real data application: The NHL study
NHL is the most common hematological malignancy with many subtypes with distinct molecular and clinical features. It has been hypothesized that various NHL subtypes and other lymphoid malignancies might share some degree of genetic susceptibility. Recently, [39] used eight GWAS within the InterLymph Consortium to study the genetic heritability in four major NHL subtypes, including CLL, DLBCL, FL, and MZL. To explore pleiotropy between NHL subtypes and other lymphoid malignancies (e.g., acute lymphoblastic leukemia and Hodgkin lymphoma), they generated PRS using SNPs that had been established as being associated with each lymphoid malignancy and tested their associations with risk for the four NHL subtypes.
We utilized PolyGIM to analyze the project described by [39], using the PRS for Hodgkin lymphoma (HL) as our primary example. This PRS was derived from 21 HL-associated SNPs that were selected by [39] due to their identification as genome-wide significant SNPs, each with a P-value of less than 5 × 10−8 from seven previously published GWAS [43–49]. Each SNP was selected as an index SNP to represent a nearby gene, and they were mostly independent of each other, with only two pairs having R2 > 0.01 (one pair with R2 = 0.024 and the other with R2 = 0.038. The PRS was calculated as , where Xj represents the genotype, and wj (weight) is the estimated effect by the j-th SNP on HL. Further information on the 21 SNPs and their estimated weights used in the PRS calculation can be found in S3 Table.
[39] collected data from eight GWAS with European ancestry, including six US-based studies and two European studies. Table 3 shows the sample sizes of cases with specific NHL subtypes and controls from each study. S2 Fig displays the minor allele frequency (MAF) of each of the 21 SNPs in the control groups of the eight studies. The plot exhibits a consistent distribution of MAF across the studies, with a maximum range of approximately 0.1. This suggests that it is reasonable to assume that all eight studies were conducted on the same population.
For our analysis, we obtained individual-level data from the US-based NCI study, which collected cases from all four NHL subtypes and controls, and treated it as the internal study. We obtained SNP-level summary statistics from the other seven studies (external studies), consisting of each SNP’s marginal effect (i.e., the estimated regression coefficient) on one NHL subtype. We assumed the following PLR model to assess the effect of PRS on the four NHL subtype,
We utilized PolyGIM to fit the model in two different ways. The first approach involved integrating summary data solely from the five US-based studies, while the second approach integrated summary data from all seven studies, which included the five US-based studies and two European studies.
Table 8 summarizes the estimates of the effects of PRS on the four subtypes of NHL. The two PolyGIM approaches provide consistent results, indicating that the PRS has a significant effect on DLBCL and FL subtypes. Using summary data from all seven studies yields slightly more significant results compared to using only summary data from the five US-based studies, as the two European studies contribute additional cases on DLBCL and FL subtypes. It is interesting to notice that the PRS is positively associated with the risk of DLBCL but inversely associated with the risk of FL. This observation indicates that SNPs that increase the risk of HL tend to be associated with an increased risk of DLBCL but a reduced risk of FL. On the other hand, the PRS has no significant effect on the risk of CLL or MZL, suggesting that it is less likely to have SNPs with pleiotropic effects on both HL and CLL/MZL. In Table 9, we compared PRS effects on each pair of NHL subtypes. A global test for disease subtype heterogeneity based on the one given in Section 2.6 is significant, with P-value = 6.97 × 10−11 based on PolyGIM using summary data from all seven external studies. We also present results based on the PLR model fitted with the internal study (the NCI study). As expected, PolyGIM estimates are more efficient than internal study-based MLE since summary data from external studies provide additional helpful information. However, the overall improvement is somewhat limited as the internal study has a much larger sample size in controls and each NHL subtype group.
As an experiment, we reduced the internal study sample size by randomly removing 2/3 of the control and each subtype group. Results based on this downsized internal study and the original summary data are given in S8 and S9 Tables. Compared with the results in Tables 8 and 9, the advantage of PolyGIM over MLEint becomes more evident. Finally, we considered the following model by allowing the PRS had a nonlinear effect on each subtype,
By applying PolyGIM, we found that the PRS had no significant nonlinear effect on any subtype, with P-values associated θk2 all being larger than 0.05.
4 Discussion
We developed PolyGIM, an integrative procedure based on the PLR model to study a disease outcome with multiple subtypes. PolyGIM fits the PLR model using individual-level data from the internal study while incorporating constraints on the parameter space imposed by the summary data derived from external studies. The summary data consist of coefficient estimates from working logistic regression models, which can be quite general as long as their targeted binary outcomes are functions of the original multicategory outcome. Examples of the working model include the case-case comparison model, which focuses on comparing two disease subtypes, and the grouped case-control comparison model, which compares the control group with a broad disease group formed by merging several subtypes. We established the theoretic properties of the procedure and demonstrated the advantage of PolyGIM using simulation studies. We applied PolyGIM to evaluate the effect of the HL-associated PRS on the risks of four major NHL subtypes. We found that the PRS has an uneven effect on different subtypes. As shown in the NHL study, the PolyGIM procedure provides a versatile tool for exploring genetic heterogeneity by leveraging SNP-level summary statistics generated by large-scale GWAS.
We can use PolyGIM to integrate summary data from the grouped case-control comparison model, assuming we know the mixture proportion of subtypes within the broad disease group. In some practices, the mixture proportion is unknown. It is possible to expand PolyGIM by treating the mixture proportions as unknown parameters in the likelihood formation and adjusting the constraint equations accordingly. But the mixture proportions could be unidentifiable unless the risk factor has very distinct effects on different subtypes. We are still investigating strategies to incorporate this unknown mixture proportion into the PolyGIM procedure, possibly with the help of certain additional information.
We utilized PolyGIM to evaluate the impact of PRS on different NHL subtypes. Since the weights used in PRS calculation are subject to uncertainty, the PRS used in the model is essentially a variable with measurement error. However, we demonstrate that it is acceptable to ignore the measurement error by treating the observed weights as their true values for the purpose of the association testing. PolyGIM can still maintain the proper type I error using PRS with any level of measurement error. However, we should exercise caution when attempting to estimate the true effect of PRS using this approach, particularly when the variability of each weight is relatively large. Addressing measurement error in nonlinear regression models is a significant challenge, and there is no standard approach to deal with it. As a result, future research efforts could focus on developing a measurement error model for PRS and integrating it into the PolyGIM framework.
In PolyGIM, the number of unknown parameters increases linearly with the dimension of the summary data. The current version of the PolyGIM package is capable of handling several hundred SNPs simultaneously. To study a PRS with a much larger number of SNPs, one solution is to adopt the strategy proposed by [42], provided that all considered SNPs can be divided into smaller, independent batches. For instance, it is reasonable to assume that SNPs on different chromosomes are independent. This approach entails partitioning the SNPs into separate groups while ensuring the independence of SNPs within each group from those in other groups. Subsequently, PolyGIM is applied to each batch to obtain estimates of the regression coefficients. These individual estimates are then combined using a meta-analysis procedure to derive the final estimate.
PolyGIM was proposed to incorporate summary data from external studies in the situation when their individual-level data are not readily available. In fact, it is also helpful even when researchers can collect individual-level data from all studies. Suppose we want to evaluate a PRS effect on a disease outcome, as in the setting of NHL study, certain participating studies might have missing (imputed) genotypes on some SNPs required in the PRS calculation. Thus, we can not directly fit the PRS model using data from those studies since the PRS on subjects in them are undefined. On the other hand, PolyGIM can still incorporate information from those studies by using their summary statistics on measured SNPs. Therefore, PolyGIM provides a solution to the missing data problem when gathering data from multiple studies.
PolyGIM was developed to analyze individual-level and summary data from studies conducted within the same population. However, our simulation results demonstrate that it is possible to utilize partial summary data from external studies conducted in different populations under certain conditions. Moreover, we can extend PolyGIM to integrate complete summary data from different external study populations, as we did for binary outcomes in our previous work [38]. To accomplish this, we require a reference set of X randomly chosen from controls in the external study population to estimate the empirical distribution of X in that population.
Large-scale GWAS often generate SNP-level summary data using logistic or linear mixed models to account for related subjects. However, PolyGIM is formulated within an empirical likelihood framework that assumes unrelated subjects within each study, making it unable to handle summary data derived from these mixed models. To address this, a heuristic solution would be to treat the summary data from logistic mixed models as if they were generated from a standard logistic regression model, fitted on a study with an “effective sample size” of independent subjects [50]. Alternatively, for summary data generated from linear mixed models, existing methods (e.g., [51]) can be used to convert them to the corresponding logistic regression model and then proceed with PolyGIM. Nonetheless, it is important to note that both approaches are heuristic and require further evaluation.
We have developed the PolyGIM package and made it available on GitHub for public use (https://github.com/fushengstat/PolyGIM). It can incorporate summary data sets from one external study or multiple studies with overlapping subjects. This package allows users to specify their target and working models using the
model formulae. Aided by this package, we expect PolyGIM to be a valuable tool for pooling data from multiple sources for a more coherent evaluation of disease subtype heterogeneity.
Supporting information
S1 Appendix. Supplementary material for mathematical details and technical proofs for Section 2.
https://doi.org/10.1371/journal.pcbi.1011236.s001
(PDF)
S1 Table. Simulation results in situations when summary data is derived from one external study based on case-case (CC) comparison models considering the rare disease and independent markers.
All numbers are multiplied by 100.
https://doi.org/10.1371/journal.pcbi.1011236.s002
(PDF)
S2 Table. Simulation results in situations when summary data is derived from one external study based on grouped case-control (GC) comparison models considering the rare disease and independent markers.
All numbers are multiplied by 100.
https://doi.org/10.1371/journal.pcbi.1011236.s003
(PDF)
S3 Table. Information on the 21 SNPs used in the definition of Hodgkin lymphoma associated polygenic risk score.
https://doi.org/10.1371/journal.pcbi.1011236.s004
(PDF)
S4 Table. Simulation results of type I errors under the null PRS model for the impact of measurement error.
Measurement errors are considered at three different levels: none (i.e., ), low (i.e.,
), and high (i.e.,
). Summary data are derived from five external studies with their sample sizes giving in Table 3. All numbers are multiplied by 100.
https://doi.org/10.1371/journal.pcbi.1011236.s005
(PDF)
S5 Table. Simulation results on the impact of sample sizes of external studies under the alternative PRS model.
We consider an internal study with a sample size reduced to 10% of its original size. Summary data are derived from five external studies, with their sample sizes varying from 1, 5 to 10 times the original sizes reported in Table 3. PRS with no or low measurement error is considered. All numbers are multiplied by 100.
https://doi.org/10.1371/journal.pcbi.1011236.s006
(PDF)
S6 Table. Summary of the average computational time (in minutes) over 100 replications under the Null PRS model.
Summary data are derived from the first 1, 3, or 5 external studies shown in Table 3, with the number of SNPs varying from 21 to 105 by simply stacking the original 21 SNPs.
https://doi.org/10.1371/journal.pcbi.1011236.s007
(PDF)
S7 Table. Summary of the average memory usage (in Gigabyte) over 100 replications under the Null PRS model.
Summary data are derived from the first 1, 3, or 5 external studies shown in Table 3, with the number of SNPs varying from 21 to 105 by simply stacking the original 21 SNPs.
https://doi.org/10.1371/journal.pcbi.1011236.s008
(PDF)
S8 Table. Estimated PRS effects on four NHL subtypes using 1/3 of the internal data from the NHL study.
https://doi.org/10.1371/journal.pcbi.1011236.s009
(PDF)
S9 Table. Disease subtype heterogeneity testing P-values based on the NHL study by randomly using 1/3 of the internal data from the NHL study.
Each P-value is for testing the null hypothesis that the PRS has the same effect on different NHL subtypes.
https://doi.org/10.1371/journal.pcbi.1011236.s010
(PDF)
S1 Fig. Q-Q plots of Z-scores generated by GIMopt.
Measurement errors for the PRS are considered at three different levels: none (i.e., ), low (i.e.,
), and high (i.e.,
).
https://doi.org/10.1371/journal.pcbi.1011236.s011
(PDF)
S2 Fig. Minor allele frequencies of the 21 SNPs in the eight NHL studies.
The minor allele frequency is estimated from the control group within each study.
https://doi.org/10.1371/journal.pcbi.1011236.s012
(PDF)
Acknowledgments
This study utilized the computational resources of the NIH Biowulf cluster (https://hpc.nih.gov/).
References
- 1.
Agresti A. Categorical Data Analysis. 3rd ed. John Wiley & Sons; 2012.
- 2. Dubin N, Pasternack BS. Risk assessment for case-control subgroups by polychotomous logistic regression. Am J Epidemiol. 1986;123(6):1101–1117. pmid:3706280
- 3. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66(3):403–411.
- 4. Wang M, Spiegelman D, Kuchiba A, Lochhead P, Kim S, Chan AT, et al. Statistical methods for studying disease subtype heterogeneity. Stat Med. 2016;35(5):782–800. pmid:26619806
- 5. Begg CB. A strategy for distinguishing optimal cancer subtypes. Int J Cancer. 2011;129(4):931–937. pmid:20949563
- 6. Begg CB, Zabor EC. Detecting and exploiting etiologic heterogeneity in epidemiologic studies. Am J Epidemiol. 2012;176(6):512–518. pmid:22922440
- 7. Meuwissen TH, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157(4):1819–1829. pmid:11290733
- 8. Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007;17(10):1520–1528. pmid:17785532
- 9. Purcell S, Wray N, Stone J, Visscher P, O’Donovan M, Sullivan P, et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature. 2009;460(7256):748–752. pmid:19571811
- 10. Allardyce J, Leonenko G, Hamshere M, Pardiñas AF, Forty L, Knott S, et al. Association between schizophrenia-related polygenic liability and the occurrence and level of mood-incongruent psychotic symptoms in bipolar disorder. JAMA Psychiatry. 2018;75(1):28–35. pmid:29167880
- 11. Duncan LE, Ratanatharathorn A, Aiello AE, Almli LM, Amstadter AB, Ashley-Koch AE, et al. Largest GWAS of PTSD (N = 20070) yields genetic overlap with schizophrenia and sex differences in heritability. Mol Psychiatry. 2018;23(3):666–673. pmid:28439101
- 12. Markota M, Coombes BJ, Larrabee BR, McElroy SL, Bond DJ, Veldic M, et al. Association of schizophrenia polygenic risk score with manic and depressive psychosis in bipolar disorder. Transl Psychiatry. 2018;8(1):1–7. pmid:30201969
- 13. Ruderfer DM, Ripke S, McQuillin A, Boocock J, Stahl EA, Pavlides JMW, et al. Genomic dissection of bipolar disorder and schizophrenia, including 28 subphenotypes. Cell. 2018;173(7):1705–1715.
- 14. Coombes BJ, Markota M, Mann JJ, Colby C, Stahl E, Talati A, et al. Dissecting clinical heterogeneity of bipolar disorder using multiple polygenic risk scores. Transl Psychiatry. 2020;10(1):1–8. pmid:32948743
- 15. Canela-Xandri O, Rawlik K, Tenesa A. An atlas of genetic associations in UK Biobank. Nat Genet. 2018;50(11):1593–1599. pmid:30349118
- 16. Lambert SA, Gil L, Jupp S, Ritchie SC, Xu Y, Buniello A, et al. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nat Genet. 2021;53(4):420–425. pmid:33692568
- 17. Morton LM, Sampson JN, Cerhan JR, Turner JJ, Vajdic CM, Wang SS, et al. Rationale and design of the international lymphoma epidemiology consortium (InterLymph) non-Hodgkin lymphoma subtypes project. J Natl Cancer Inst Monogr. 2014;2014(48):1–14. pmid:25174022
- 18. Conde L, Halperin E, Akers NK, Brown KM, Smedby KE, Rothman N, et al. Genome-wide association study of follicular lymphoma identifies a risk locus at 6p21. 32. Nat Genet. 2010;42(8):661–664. pmid:20639881
- 19. Slager SL, Rabe KG, Achenbach SJ, Vachon CM, Goldin LR, Strom SS, et al. Genome-wide association study identifies a novel susceptibility locus at 6p21. 3 among familial CLL. Blood. 2011;117(6):1911–1916. pmid:21131588
- 20. Berndt SI, Skibola CF, Joseph V, Camp NJ, Nieters A, Wang Z, et al. Genome-wide association study identifies multiple risk loci for chronic lymphocytic leukemia. Nat Genet. 2013;45(8):868–876. pmid:23770605
- 21. Cerhan JR, Berndt SI, Vijai J, Ghesquières H, McKay J, Wang SS, et al. Genome-wide association study identifies multiple susceptibility loci for diffuse large B cell lymphoma. Nat Genet. 2014;46(11):1233–1238. pmid:25261932
- 22. Skibola CF, Berndt SI, Vijai J, Conde L, Wang Z, Yeager M, et al. Genome-wide association study identifies five susceptibility loci for follicular lymphoma outside the HLA region. Am J Hum Genet. 2014;95(4):462–471. pmid:25279986
- 23. Vijai J, Wang Z, Berndt SI, Skibola CF, Slager SL, De Sanjose S, et al. A genome-wide association study of marginal zone lymphoma shows association to the HLA region. Nat Commun. 2015;6(1):1–7.
- 24. Swerdlow SH, Campo E, Pileri SA, Harris NL, Stein H, Siebert R, et al. The 2016 revision of the World Health Organization classification of lymphoid neoplasms. Blood. 2016;127(20):2375–2390. pmid:26980727
- 25. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71(3):209–249. pmid:33538338
- 26. Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nat Rev Genet. 2017;18(2):117–127. pmid:27840428
- 27. Hemani G, Zheng J, Elsworth B, Wade KH, Haberland V, Baird D, et al. The MR-Base platform supports systematic causal inference across the human phenome. Elife. 2018;7. pmid:29846171
- 28. Buniello A, MacArthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47(D1):D1005–D1012. pmid:30445434
- 29. MacArthur JA, Buniello A, Harris LW, Hayhurst J, McMahon A, Sollis E, et al. Workshop proceedings: GWAS summary statistics standards and sharing. Cell Genomics. 2021;1(1):100004. pmid:36082306
- 30. Imbens GW, Lancaster T. Combining micro and macro data in microeconometric models. Rev Econ Stud. 1994;61(4):655–680.
- 31. Chen J, Sitter R. A pseudo empirical likelihood approach to the effective use of auxiliary information in complex surveys. Statist Sin. 1999;9(2):385–406.
- 32. Qin J. Combining parametric and empirical likelihoods. Biometrika. 2000;87(2):484–490.
- 33. Chaudhuri S, Handcock MS, Rendall MS. Generalized linear models incorporating population level information: an empirical-likelihood-based approach. J R Stat Soc B. 2008;70(2):311–328. pmid:22740776
- 34. Chatterjee N, Chen YH, Maas P, Carroll RJ. Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources. J Am Stat Assoc. 2016;111(513):107–117. pmid:27570323
- 35. Han P, Lawless JF. Empirical likelihood estimation using auxiliary summary information with different covariate distributions. Statist Sin. 2019;29(3):1321–1342.
- 36. Zhang H, Deng L, Schiffman M, Qin J, Yu K. Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika. 2020;107(3):689–703.
- 37. Deng L, Fu S, Qin J, Yu K. On combining individual-level data with summary data in statistical inferences. Statist Sin. 2022.
- 38. Zhang H, Deng L, Wheeler W, Qin J, Yu K. Integrative analysis of multiple case-control studies. Biometrics. 2022;78(3):1080–1091. pmid:33768525
- 39. Berndt SI, Vijai J, Benavente Y, Camp NJ, Nieters A, Wang Z, et al. Distinct germline genetic susceptibility profiles identified for common non-Hodgkin lymphoma subtypes. Leukemia. 2022;36:2835–2844. pmid:36273105
- 40. Qin J, Zhang B. A goodness-of-fit test for logistic regression models based on case-control data. Biometrika. 1997;84(3):609–618.
- 41. White HL. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50(1):1–25.
- 42. Fu S, Deng L, Zhang H, Wheeler W, Qin J, Yu K. Integrative Analysis of Individual-Level Data and High-Dimensional Summary Statistics. Bioinformatics. 2023;39(4). pmid:36964712
- 43. Enciso-Mora V, Broderick P, Ma Y, Jarrett RF, Hjalgrim H, Hemminki K, et al. A genome-wide association study of Hodgkin’s lymphoma identifies new susceptibility loci at 2p16.1 (REL), 8q24.21 and 10p14 (GATA3). Nat Genet. 2010;42(12):1126–1130. pmid:21037568
- 44. Moutsianas L, Enciso-Mora V, Ma YP, Leslie S, Dilthey A, Broderick P, et al. Multiple Hodgkin lymphoma–associated loci within the HLA region at chromosome 6p21.3. Blood. 2011;118(3):670–674. pmid:21596858
- 45. Urayama KY, Jarrett RF, Hjalgrim H, Diepstra A, Kamatani Y, Chabrier A, et al. Genome-wide association study of classical Hodgkin lymphoma and Epstein–Barr virus status–defined subgroups. J Natl Cancer Inst. 2012;104(3):240–253. pmid:22286212
- 46. Frampton M, da Silva Filho MI, Broderick P, Thomsen H, Försti A, Vijayakrishnan J, et al. Variation at 3p24.1 and 6q23.3 influences the risk of Hodgkin’s lymphoma. Nat Commun. 2013;4(1):2549. pmid:24149102
- 47. Cozen W, Timofeeva MN, Li D, Diepstra A, Hazelett D, Delahaye-Sourdeix M, et al. A meta-analysis of Hodgkin lymphoma reveals 19p13.3 TCF3 as a novel susceptibility locus. Nat Commun. 2014;5(1):3856. pmid:24920014
- 48. Sud A, Thomsen H, Law PJ, Försti A, Filho MIdS, Holroyd A, et al. Genome-wide association study of classical Hodgkin lymphoma identifies key regulators of disease susceptibility. Nat Commun. 2017;8(1):1892. pmid:29196614
- 49. Sud A, Thomsen H, Orlando G, Foersti A, Law PJ, Broderick P, et al. Genome-wide association study implicates immune dysfunction in the development of Hodgkin lymphoma. Blood. 2018;132(19):2040–2052. pmid:30194254
- 50. Ziyatdinov A, Kim J, Prokopenko D, Privé F, Laporte F, Loh PR, et al. Estimating the effective sample size in association studies of quantitative traits. G3. 2021;11(6). pmid:33734375
- 51. Lloyd-Jones LR, Robinson MR, Yang J, Visscher PM. Transformation of summary statistics from linear mixed model association on all-or-none traits to odds ratio. Genetics. 2018;208(4):1397–1408. pmid:29429966