Privacy-Preserving and Lossless Distributed Estimation of High-Dimensional Generalized Additive Mixed Models
Abstract
Various privacy-preserving frameworks that respect the individual's privacy in the analysis of data have been developed in recent years. However, available model classes such as simple statistics or generalized linear models lack the flexibility required for a good approximation of the underlying data-generating process in practice. In this paper, we propose an algorithm for a distributed, privacy-preserving, and lossless estimation of generalized additive mixed models (GAMM) using component-wise gradient boosting (CWB). Making use of CWB allows us to reframe the GAMM estimation as a distributed fitting of base learners using the L2-loss. In order to account for the heterogeneity of different data location sites, we propose a distributed version of a row-wise tensor product that allows the computation of site-specific (smooth) effects. Our adaptation of CWB preserves all the important properties of the original algorithm, such as unbiased feature selection and the feasibility of fitting models in high-dimensional feature spaces, and yields model estimates equivalent to those of CWB on pooled data. In addition to deriving the equivalence of both algorithms, we also showcase the efficacy of our algorithm on a distributed heart disease data set and compare it with state-of-the-art methods.
Keywords: Distributed Computing, Functional Gradient Descent, Generalized Linear Mixed Model,
Machine Learning, Privacy-preserving Modelling
Such approaches directly operate at different sites and unite information without having to share sensitive parts of the data, thereby allowing privacy-preserving data analysis.

Distributed Data

Distributed data can be partitioned vertically or horizontally across different sites. Horizontally partitioned data means that observations are spread across different sites with access to all existing features of the available data point, while for vertically partitioned data, different sites have access to all observations but different features (covariates) for each of these observations. In this work, we focus on horizontally partitioned data (a small sketch illustrating both schemes is given at the end of this section). Existing approaches for horizontally partitioned data vary from fitting regression models such as generalized linear models (GLMs; Wu et al, 2012; Lu et al, 2015; Jones et al, 2013; Chen et al, 2018), to conducting distributed evaluations (Boyd et al, 2015; Ünal et al, 2021; Schalk et al, 2022b), to fitting artificial neural networks (McMahan et al, 2017). Furthermore, various software frameworks are available to run a comprehensive analysis of distributed data. One example is the collection of R (R Core Team, 2021) packages DataSHIELD (Gaye et al, 2014), which enables data management and descriptive data analysis as well as the secure fitting of simple statistical models in a distributed setup without leaking information from one site to the others.

Interpretability and Data Heterogeneity

In many research areas that involve critical decision-making, especially in medicine, methods should not only excel in predictive performance but also be interpretable. Models should provide information about the decision-making process, the feature effects, and the feature importance, as well as intrinsically select important features. Generalized additive models (GAMs; see, e.g., Wood, 2017) are one of the most flexible approaches in this respect, providing interpretable yet complex models that also allow for non-linearity in the data.

As longitudinal studies are often the most practical way to gather information in many research fields, methods should also be able to account for subject-specific effects and for the correlation of repeated measurements. Furthermore, when analyzing data originating from different sites, the assumption of having identically distributed observations across all sites often does not hold. In this case, a reasonable assumption for the data-generating process is a site-specific deviation from the general population mean. Adjusting models to this situation is called interoperability (Litwin et al, 1990), while ignoring it may lead to biased or wrong predictions.

1.1 Related Literature

Various approaches for distributed and privacy-preserving analysis have been proposed in recent years. In the context of statistical models, Karr et al (2005) describe how to calculate a linear model (LM) in a distributed and privacy-preserving fashion by sharing data summaries. Jones et al (2013) propose a similar approach for GLMs by communicating the Fisher information and score vector to conduct a distributed Fisher scoring algorithm. The site information is then globally aggregated to estimate the model parameters. Other privacy-preserving techniques include ridge regression (Chen et al, 2018), logistic regression, and neural networks (Mohassel and Zhang, 2017).

In machine learning, methods such as the naive Bayes classifier, trees, support vector machines, and random forests (Li et al, 2020a) exist with specific encryption techniques (e.g., the Paillier cryptosystem; Paillier, 1999) to conduct model updates. In these setups, a trusted third party is usually required. However, this is often unrealistic and difficult to implement, especially in a medical or clinical setup. Furthermore, as encryption is an expensive operation, its application is infeasible for complex algorithms that require many encryption calls (Naehrig et al, 2011). Existing privacy-preserving boosting techniques often focus on the AdaBoost algorithm by using aggregation techniques of the base classifier (Lazarevic and Obradovic, 2001; Gambs et al, 2007). A different approach to boosting decision trees in a federated learning setup was introduced by Li et al (2020b) using locality-sensitive hashing to obtain similarities between data sets without sharing private information. These algorithms focus on aggregating tree-based base components, making them difficult to interpret, and come with no inferential guarantees.
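As a minimal illustration of the two partitioning schemes mentioned above (a sketch on toy data; the site assignment and variable names are made up for this example), horizontal partitioning splits observations across sites, while vertical partitioning splits features:

    # Toy data set with 6 observations, 3 variables, and a made-up site label:
    dat <- data.frame(
      x1   = rnorm(6),
      x2   = rnorm(6),
      y    = rbinom(6, 1, 0.5),
      site = rep(c("A", "B", "C"), each = 2)
    )

    # Horizontal partitioning: each site holds all features for its own
    # subset of observations (the setting considered in this work).
    horizontal <- split(dat[, c("x1", "x2", "y")], dat$site)

    # Vertical partitioning: each site holds all observations but only a
    # subset of the features.
    vertical <- list(A = dat["x1"], B = dat["x2"], C = dat["y"])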
In order to account for repeated measurements, Luo et al (2022) propose a privacy-preserving and lossless way to fit linear mixed models (LMMs) to correct for heterogeneous site-specific random effects. Their concept of only sharing aggregated values is similar to our approach, but it is limited in the complexity of the model and only allows normally distributed outcomes. Other methods to estimate LMMs in a secure and distributed fashion are Zhu et al (2020), Anjum et al (2022), or Yan et al (2022).

Besides privacy-preserving and distributed approaches, integrative analysis is another technique, based on pooling the data sets into one and analyzing this pooled data set while considering challenges such as heterogeneity or the curse of dimensionality (Curran and Hussong, 2009; Bazeley, 2012; Mirza et al, 2019). While technically advanced by, e.g., outsourcing computationally demanding tasks such as the analysis of multi-omics data to cloud services (Augustyn et al, 2021), the existing statistical cloud-based methods only deal with basic statistics. The challenges of integrative analysis are similar to the ones tackled in this work; our approach, however, does not allow merging the data sets, in order to preserve privacy.

1.2 Our Contribution

This work presents a method to fit generalized additive mixed models (GAMMs) in a privacy-preserving and lossless manner to horizontally distributed data. (In this article, we call a distributed fitting procedure lossless if the model parameters of the algorithm are the same as the ones computed on the pooled data.) This not only allows the incorporation of site-specific random effects and accounts for repeated measurements as in LMMs, but also facilitates the estimation of mixed models with responses following any distribution from the exponential family and provides the possibility to estimate complex non-linear relationships between covariates and the response. To the best of our knowledge, we are the first to provide an algorithm to fit the class of GAMMs in a privacy-preserving and lossless fashion on distributed data.

Our approach is based on component-wise gradient boosting (CWB; Bühlmann and Yu, 2003). CWB can be used to estimate additive models, account for repeated measurements, compute feature importance, and conduct feature selection. Furthermore, CWB is suited for high-dimensional data situations (n ≪ p). CWB is therefore often used in practice for, e.g., predicting the development of oral cancer (Saintigny et al, 2011), classifying individuals with and without patellofemoral pain syndrome (Liew et al, 2020), or detecting synchronization in bioelectrical signals (Rügamer et al, 2018). However, there have so far not been any attempts to allow for a distributed, privacy-preserving, and lossless computation of the CWB algorithm. In this paper, we propose a distributed version of CWB that yields a model identical to the one produced by the original algorithm on pooled data and that accounts for site heterogeneity by including interactions between features and a site variable. This is achieved by adjusting the fitting process using 1) a distributed estimation procedure, 2) a distributed version of row-wise tensor product base learners, and 3) an adaption of the algorithm to conduct feature selection in the distributed setup. We implement our method in R using the DataSHIELD framework and demonstrate its application in an exemplary medical data analysis. Our distributed version of the original CWB algorithm does not introduce any additional hyperparameters and uses optimization strategies from previous research to define meaningful values for all remaining hyperparameters, effectively yielding a tuning-free method.

The remainder of this paper is structured as follows: First, we introduce the basic notation, terminology, and setup of GAMMs in Section 2. We then describe the original CWB algorithm in Section 2.3 and its link to GAMMs. In Section 3, we present the distributed setup and our novel extension of the CWB algorithm. Finally, Section 4 demonstrates both how our distributed CWB algorithm can be used in practice and how to interpret the obtained results.

Implementation

We implement our approach as an R package using the DataSHIELD framework and make it available on GitHub at github.com/schalkdaniel/dsCWB. The code for the analysis can also be found in the repository at github.com/schalkdaniel/dsCWB/blob/main/usecase/analyse.R.
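The package can be obtained directly from this repository; a minimal sketch, assuming a standard GitHub installation via the remotes package:

    # Assumed standard installation from the GitHub repository stated above:
    install.packages("remotes")
    remotes::install_github("schalkdaniel/dsCWB")
    library(dsCWB)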
2 Background

2.1 Notation and Terminology

Our proposed approach uses the CWB algorithm as fitting engine. Since this method was initially developed in machine learning, we introduce here both the statistical notation used for GAMMs as well as the respective machine learning terminology and explain how to relate the two concepts.

We assume a p-dimensional covariate or feature space X = (X_1 × … × X_p) ⊆ R^p and response or outcome values from a target space Y. The goal of boosting is to find the unknown relationship f between X and Y. In turn, GAMMs (as presented in Section 2.2) model the conditional distribution of an outcome variable Y with realizations y ∈ Y, given features x = (x_1, …, x_p) ∈ X. Given a data set D = {(x^(1), y^(1)), …, (x^(n), y^(n))} with n observations drawn (conditionally) independently from an unknown probability distribution P_xy on the joint space X × Y, we aim to estimate this functional relationship in CWB with f̂. The goodness-of-fit of a given model f̂ is assessed by calculating the empirical risk R_emp(f̂) = n^{-1} \sum_{(x,y) ∈ D} L(y, f̂(x)) based on a loss function L : Y × R → R and the data set D. Minimizing R_emp using this loss function is equivalent to estimating f using maximum likelihood by defining L(y, f(x)) = -ℓ(y, h(f(x))) with log-likelihood ℓ and response function h and minimizing the sum of log-likelihood contributions.

In the following, we also require the vector x_j = (x_j^(1), …, x_j^(n))^T ∈ X_j, which refers to the j-th feature. Furthermore, let x = (x_1, …, x_p) and y denote arbitrary members of X and Y, respectively. A special role is further given to a subset u = (u_1, …, u_q)^T, q ≤ p, of features x, which will be used to model the heterogeneity in the data.

2.2 Generalized Additive Mixed Models

A very flexible class of regression models to model the relationship between covariates and the response are GAMMs (see, e.g., Wood, 2017). In GAMMs, the response Y^(i) for observation i = 1, …, n_s of measurement unit (or site) s is assumed to follow some exponential family distribution such as the Poisson, binomial, or normal distributions (see, e.g., McCullagh and Nelder, 2019), conditional on features x^(i) and the realization of some random effects. The expectation µ^(i) := E(Y^(i) | x^(i), u^(i)) of the response Y^(i) for observations i = 1, …, n_s of measurement unit (or site) s in GAMMs is given by

    h^{-1}(\mu^{(i)}) = f^{(i)} = \sum_{j \in J_1} x_j^{(i)} \beta_j + \sum_{j \in J_2} u_j^{(i)} \gamma_{j,s} + \sum_{j \in J_3} \phi_j(x_j^{(i)}).    (1)

In (1), h is a smooth monotonic response function, f corresponds to the additive predictor, γ_{j,s} ~ N(0, ψ) are random effects accounting for heterogeneity in the data, and φ_j are non-linear effects of pre-specified covariates. The different index sets J_1, J_2, J_3 ⊆ {1, …, p} ∪ ∅ indicate which features are modeled as fixed effects, random effects, or non-linear (smooth) effects, respectively. The modeler usually defines these sets. However, as we will also explain later, the use of CWB as fitting engine allows for automatic feature selection and therefore does not require explicitly defining these sets. In GAMMs, smooth effects are usually represented by (spline) basis functions, i.e., φ_j(x_j) ≈ (B_{j,1}(x_j), …, B_{j,d_j}(x_j))^T θ_j, where θ_j ∈ R^{d_j} are the basis coefficients corresponding to each basis function B_{j,d_j}. The coefficients are typically constrained in their flexibility by adding a quadratic (difference) penalty for (neighboring) coefficients to the objective function to enforce smoothness. GAMMs, as in (1), are not limited to univariate smooth effects φ_j, but allow for higher-dimensional non-linear effects φ(x_{j_1}, x_{j_2}, …, x_{j_k}). The most common higher-dimensional smooth interaction effects are bivariate effects (k = 2) and can be represented using a bivariate or a tensor product spline basis (see Section 2.3.1 for more details). Although higher-order splines with k > 2 are possible, models are often restricted to bivariate interactions for the sake of interpretability and computational feasibility. In Section 3, we will further introduce varying coefficient terms φ_{j,s}(x_j) in the model (1), i.e., smooth effects f varying with a second variable s. Analogous to random slopes, s can also be the index set defining observation units of random effects J_2. Using an appropriate distribution assumption for the basis coefficients θ_j, these varying coefficients can then be considered as random smooth effects.
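To make the basis representation of smooth effects concrete, the following minimal R sketch (toy data; the penalty strength is illustrative, and only base R plus the splines package is used) builds a cubic B-spline basis with a second-order difference penalty and computes a penalized least squares fit of the form φ̂(x) = Bθ̂:

    library(splines)

    set.seed(1)
    x <- runif(200)                            # one numeric feature
    B <- bs(x, df = 10, intercept = TRUE)      # cubic B-spline basis, d_j = 10
    D <- diff(diag(ncol(B)), differences = 2)  # second-order difference operator
    K <- crossprod(D)                          # quadratic difference penalty matrix

    y      <- sin(2 * pi * x) + rnorm(200, sd = 0.3)  # toy response
    lambda <- 2                                       # illustrative penalty strength
    theta  <- solve(crossprod(B) + lambda * K, crossprod(B, y))
    phi_hat <- B %*% theta                            # evaluated smooth effect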
3.2 Base Learners

For shared effects, we keep the original structure of CWB with base learners chosen from a set of possible learners B. Section 3.3.1 explains how these shared effects are estimated in the distributed setup. We further define a regularized categorical base learner b_0 with basis transformation g_0(x_0) = (1_{{1}}(x_0), …, 1_{{S}}(x_0))^T and design matrix Z_0 ∈ R^{n×S}. We use b_0 to extend B with a second set of base learners B^× = {b_0 × b | b ∈ B} to model site-specific random effects. All base learners in B^× are row-wise tensor product base learners with a site-wise block structure (cf. (3)), where Z_{l,s} are the distributed design matrices of b_l on sites s = 1, …, S; a short sketch of this construction is given below.

3.3 Fitting Algorithm

We now describe the adaptions required to allow for distributed computations of the CWB fitting routine. In Sections 3.3.1 and 3.3.2, we show the equality between our distributed fitting approach and CWB fitted on pooled data. Section 3.3.3 describes the remaining details such as distributed SSE calculations, distributed model updates, and pseudo residual updates in the distributed setup. Section 3.4 summarizes the distributed CWB algorithm and Section 3.5 elaborates on the communication costs of our algorithm.
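The row-wise tensor product design underlying B^× can be sketched in a few lines of R (made-up data; the helper rowwise_tensor is ours):

    # Row-wise tensor (Kronecker) product: row i of the result is
    # kronecker(Z0[i, ], Zl[i, ]).
    rowwise_tensor <- function(Z0, Zl) {
      t(sapply(seq_len(nrow(Z0)), function(i) kronecker(Z0[i, ], Zl[i, ])))
    }

    site <- factor(c("A", "A", "B", "B", "C"))
    Z0   <- model.matrix(~ 0 + site)   # n x S dummy-coded site design matrix
    Zl   <- cbind(1, rnorm(5))         # n x d_l design matrix of a base learner b_l

    # n x (S * d_l) design matrix; the rows of site s only fill block s,
    # which yields the block structure exploited in Section 3.3.2.
    Zlx <- rowwise_tensor(Z0, Zl)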
3.3.1 Distributed Shared Effects Computation

Fitting CWB in a distributed fashion requires adapting the fitting process of the base learner b_l in Algorithm 1 to distributed data. To allow for shared effects computations across different sites without jeopardizing privacy, we take advantage of CWB's update scheme, which boils down to a (penalized) least squares estimation per iteration for every base learner. This allows us to build upon existing work such as Karr et al (2005) to fit linear models in a distributed fashion by just communicating aggregated statistics between sites and the host.

In a first step, the aggregated matrices F_{l,s} = Z_{l,s}^T Z_{l,s} and vectors u_{l,s} = Z_{l,s}^T y_s are computed on each site. In our privacy setup (Section 2.4), communicating F_{l,s} and u_{l,s} is allowed as long as the privacy-aggregation level per site is met. In a second step, the site information is aggregated to the global information F_l = \sum_{s=1}^S F_{l,s} + K_l and u_l = \sum_{s=1}^S u_{l,s} and then used to estimate the model parameters θ̂_l = F_l^{-1} u_l. This approach, referred to as distFit, is explained again in detail in Algorithm 2 and used for the shared effects computations of the model by substituting θ̂_l^{[m]} = (Z_l^T Z_l + K_l)^{-1} Z_l^T r̃^{[m]} (Algorithm 1 line 6) with θ̂_l^{[m]} = distFit(Z_{l,1}, …, Z_{l,S}, r̃_1^{[m]}, …, r̃_S^{[m]}, K_l).

Note that the pseudo residuals r̃_s^{[m]} are also securely located at each site and are updated after each iteration. Details about the distributed pseudo residual updates are explained in Section 3.3.3. We also note that the computational complexity of fitting CWB can be drastically reduced by pre-calculating and storing (Z_l^T Z_l + K_l)^{-1} in a first initialization step, as the matrix is independent of iteration m, and reusing these pre-calculated matrices in all subsequent iterations (cf. Schalk et al, 2022a). Using pre-calculated matrices also reduces the amount of required communication between sites and host.

Algorithm 2 Distributed Effect Estimation. The line prefixes [S] and [H] indicate whether the operation is conducted at the sites ([S]) or at the host ([H]).
Input: site design matrices Z_{l,1}, …, Z_{l,S}, response vectors y_1, …, y_S, and an optional penalty matrix K_l.
Output: estimated parameter vector θ̂_l.
1: procedure distFit(Z_{l,1}, …, Z_{l,S}, y_1, …, y_S, K_l)
2:   for s ∈ {1, …, S} do
3:     [S] F_{l,s} = Z_{l,s}^T Z_{l,s}
4:     [S] u_{l,s} = Z_{l,s}^T y_s
5:     [S] Communicate F_{l,s} and u_{l,s} to the host
6:   end for
7:   [H] F_l = \sum_{s=1}^S F_{l,s} + K_l
8:   [H] u_l = \sum_{s=1}^S u_{l,s}
9:   [H] return θ̂_l = F_l^{-1} u_l
10: end procedure

3.3.2 Distributed Site-Specific Effects Computation

If we pretend that the fitting of the base learner b_{l^×} is performed on the pooled data, we obtain

    \hat{\theta}_{l^\times} = (Z_{l^\times}^T Z_{l^\times} + K_{l^\times})^{-1} Z_{l^\times}^T y
                            = \begin{pmatrix}
                                (Z_{l,1}^T Z_{l,1} + \lambda_0 I_{d_l} + K_l)^{-1} Z_{l,1}^T y_1 \\
                                \vdots \\
                                (Z_{l,S}^T Z_{l,S} + \lambda_0 I_{d_l} + K_l)^{-1} Z_{l,S}^T y_S
                              \end{pmatrix},    (4)

where (4) is due to the block structure described in (3) of Section 3.2. This shows that the fitting of the site-specific effects θ̂_{l^×} can be split up into the fitting of the individual parameters

    \hat{\theta}_{l^\times,s} = (Z_{l,s}^T Z_{l,s} + \lambda_0 I_{d_l} + K_l)^{-1} Z_{l,s}^T y_s.    (5)

It is thus possible to compute site-specific effects at the respective site without the need to share any information with the host. The host, in turn, only requires the SSE of the respective base learner (see Section 3.3.3) to perform the next iteration of CWB. Hence, during the fitting process, the parameter estimates remain at their sites and are just updated if the site-specific base learner is selected. This again minimizes the amount of data communication between sites and host and speeds
up the fitting process. After the fitting phase, the aggregated site-specific parameters are communicated once in a last communication step to obtain the final model.

3.3.3 Pseudo Residual Updates, SSE Calculation, and Base Learner Selection

The remaining challenges to run the distributed CWB algorithm are 1) the pseudo residual calculation (Algorithm 1 line 4), 2) the SSE calculation (Algorithm 1 line 7), and 3) the base learner selection (Algorithm 1 line 9).

Distributed pseudo residual updates

The site-specific response vector y_s containing the values y^(i), i ∈ {1, …, n_s}, is the basis of the pseudo residual calculation. We assume that every site s has access to all shared effects as well as to the site-specific information of all site-specific base learners b_{l^×}, only containing the respective parameters θ̂_{l^×,s}. Based on these base learners, it is thus possible to compute a site model f̂_s^{[m]} as a representative of f̂^{[m]} on every site s. The pseudo residual updates r̃_s^{[m]} per site are then based on f̂_s^{[m]} via

    \tilde{r}_s^{[m](i)} = -\nabla_f L(y^{(i)}, f(x^{(i)}))\big|_{f = \hat{f}_s^{[m-1]}}, \quad i \in \{1, \ldots, n_s\},

using D_s. Most importantly, all remaining steps of the distributed CWB fitting procedure do not share the pseudo residuals r̃_s^{[m]} in order to avoid information leakage about y_s.

Distributed SSE calculation and base learner selection

The SSE of a site-specific base learner b_{l^×} is calculated per site as

    SSE_{l^\times,s} = \sum_{i=1}^{n_s} \big(\tilde{r}_s^{[m](i)} - g_l(x^{(i)})^T \hat{\theta}_{l^\times,s}\big)^2.

The site-specific SSE values are then sent to the host and aggregated to SSE_l = \sum_{s=1}^S SSE_{l,s}. If privacy constraints have been met in all previous calculations, sharing the individual SSE values is not critical and does not violate any privacy constraints, as each value is an aggregation of all n_s observations of site s.

Having gathered all SSE values at the host location, selecting the best base learner in the current iteration is done in the exact same manner as for the non-distributed CWB algorithm by selecting l^{[m]} = arg min_{l ∈ {1, …, |B|, 1^×, …, |B|^×}} SSE_l. After the selection, the index l^{[m]} is shared with all sites to enable the update of the site-specific models f̂_s^{[m]}. If a shared effect is selected, the parameter vector θ̂_{l^{[m]}}^{[m]} is shared with all sites. Caution must be taken when the number of parameters of one base learner is equal to the number of observations, as this allows reverse-engineering private data. In the case of a site-specific effect selection, no parameter needs to be communicated, as the respective estimates are already located at each site.

3.4 Distributed CWB Algorithm with Site-Specific Effects

Assembling all pieces, our distributed CWB algorithm is summarized in Algorithm 3.
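To illustrate how the building blocks of Algorithms 2 and 3 interact, the following self-contained R sketch (simulated data; all names are ours, the L2 case is used in place of loss-specific pseudo residuals, and only a single base learner per set is considered) mimics one iteration of distributed estimation and base learner selection:

    set.seed(42)
    S <- 3
    sites <- lapply(seq_len(S), function(s) {
      n <- 50 + 10 * s
      list(Z = cbind(1, rnorm(n)),   # site design matrix Z_{l,s}
           r = rnorm(n))             # pseudo residuals, located at site s
    })
    K <- diag(1e-2, 2)               # penalty matrix K_l

    # Distributed shared effect estimation (Algorithm 2, distFit): the sites
    # only communicate the aggregates F_{l,s} = Z'Z and u_{l,s} = Z'r.
    F_list <- lapply(sites, function(d) crossprod(d$Z))       # [S]
    u_list <- lapply(sites, function(d) crossprod(d$Z, d$r))  # [S]
    F_l <- Reduce(`+`, F_list) + K                            # [H]
    u_l <- Reduce(`+`, u_list)                                # [H]
    theta_shared <- solve(F_l, u_l)                           # [H]

    # Site-specific estimation (eq. (5); the extra ridge penalty
    # lambda_0 * I is omitted for brevity): solved locally per site,
    # the parameters never leave the site.
    theta_site <- lapply(sites, function(d)
      solve(crossprod(d$Z) + K, crossprod(d$Z, d$r)))

    # Distributed SSE calculation and base learner selection: only the
    # aggregated SSE values reach the host, which takes the argmin.
    sse_shared <- sum(vapply(sites, function(d)
      sum((d$r - d$Z %*% theta_shared)^2), numeric(1)))
    sse_site <- sum(mapply(function(d, th)
      sum((d$r - d$Z %*% th)^2), sites, theta_site))
    best <- which.min(c(shared = sse_shared, site_specific = sse_site))

In an actual fit, these steps are repeated over all base learners in B and B^× and over M boosting iterations.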
3.5 Communication Costs

The communication costs of our algorithm scale linearly with the number of boosting iterations M, i.e., O(M). For the analysis of communication costs in terms of the number of base learners, we distinguish between the initialization phase and the fitting phase.

Initialization

As only the sites share F_{l,s} ∈ R^{d×d}, ∀l ∈ {1, …, |B|}, the transmitted amount of values is d²|B| for each site and therefore scales linearly with |B|, i.e., O(|B|). The host does not communicate any values during the initialization.

Fitting

In each iteration, every site shares its vector Z_{l,s}^T r̃_s^{[m]} ∈ R^d, ∀l ∈ {1, …, |B|}. Over the course of M boosting iterations, each site therefore shares dM|B| values. Every site also communicates the SSE values, i.e., 2 values (index and SSE value) for every base learner and thus 2M|B| values over all iterations and base learners. In total, each site communicates M|B|(d + 2) values, so the communication costs for all sites are O(|B|). The host, in turn, communicates the estimated parameters θ̂^{[m]} ∈ R^d of the |B| shared effects. Hence, dM|B| values as well as the index of the best base learner in each iteration are transmitted. In total, the host therefore communicates dM|B| + M values to the sites, and the costs are therefore also O(|B|).
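As a quick numerical illustration of these counts (a minimal helper; the values chosen for d, |B|, and M are made up):

    # Values transmitted per site (initialization and fitting) and by the
    # host, following the formulas derived above.
    comm_cost <- function(d, B, M) {
      c(site_init = d^2 * B,          # d^2 values per base learner
        site_fit  = M * B * (d + 2),  # d + 2 values per learner and iteration
        host_fit  = d * M * B + M)    # shared parameters plus selected index
    }
    comm_cost(d = 10, B = 8, M = 1000)
    #> site_init  site_fit  host_fit
    #>       800     96000     81000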
4 Application

We now showcase our algorithm on a heart disease data set that consists of patient data gathered all over the world. The data were collected at four different sites by the 1) Hungarian Institute of Cardiology, Budapest (Andras Janosi, M.D.), 2) University Hospital, Zurich, Switzerland (William Steinbrunn, M.D.), 3) University Hospital, Basel, Switzerland (Matthias Pfisterer, M.D.), and 4) V.A. Medical Center, Long Beach, and Cleveland Clinic Foundation (Robert Detrano, M.D., Ph.D.), and is thus suited for a multi-site distributed analysis. The individual data sets are freely available at https://archive.ics.uci.edu/ml/datasets/heart+disease (Dua and Graff, 2017). For our analysis, we set the privacy level (cf. Section 2.4) to 5, which is a common default.

4.1 Data Description

The raw data set contains 14 covariates, such as the chest pain type (cp), resting blood pressure (trestbps), maximum heart rate (thalach), sex, exercise-induced angina (exang), or ST depression (i.e., abnormal difference of the ST segment from the baseline on an electrocardiogram) induced by exercise relative to rest (oldpeak). A full list of covariates and their abbreviations is given on the data set's website. After removing non-informative covariates and columns with too many missing values at each site, we obtain n_cleveland = 303, n_hungarian = 292, n_switzerland = 116, and n_va = 140 observations and 8 covariates. A table containing the description of the abbreviations of these covariates is given in Table 1 in the Supplementary Material B.1. For our application, we assume that missing values are missing completely at random and that all data sets are exclusively located at their respective site. The task is to determine important risk factors for heart disease. The target variable is therefore a binary outcome indicating the presence of heart disease or not.

4.2 Analysis and Results

We follow the practices to set up CWB as mentioned in Section 2.3.2 and run the distributed CWB algorithm with a learning rate of 0.1 and a maximum number of 100000 iterations. To determine an optimal stopping iteration for CWB, we use 20 % of the data as validation data and set the patience to 5 iterations. In other words, the algorithm stops if no risk improvement on the validation data is observed in 5 consecutive iterations. For the numerical covariates, we use a P-spline with 10 cubic basis functions and second-order difference penalties. All base learners are penalized according to a global degree of freedom that we set to 2.2 (to obtain unbiased feature selection), while the random intercept is penalized according to 3 degrees of freedom (see the Supplementary Material B.2 for more details). Since we are modelling a binary response variable, h^{-1} is the inverse logit function logit^{-1}(f) = (1 + exp(-f))^{-1}. The model for an observation of site s, conditional on its random effects γ, is given in the Supplementary Material B.3.

Results

The algorithm stops after m_stop = 5578 iterations as the risk on the validation data set starts to increase (cf. Figure 1 in the Supplementary Material B.4). Out of these 5578 iterations, the distributed CWB algorithm selects a shared effect in 782 iterations and site-specific effects in 4796 iterations. This indicates that the data is rather heterogeneous and requires site-specific (random) effects. Figure 1 (Left) shows traces of how and when the different additive terms (base learners) entered the model during the fitting process and illustrates the selection process of CWB.
[Figure 1: model trace over boosting iterations (left) and variable importance (VIP; right) for the selected terms, including cp (site), oldpeak (site), sex (site), age, sex, restecg, exang, and thalach.]

Fig. 1: Left: Model trace showing how and when the four most selected additive terms entered the model. Right: Variable importance (cf. Au et al, 2019) of selected features in decreasing order.

The estimated effect of the most important feature, oldpeak (cf. Figure 1, Right), is further visualized in Figure 2. Looking at the shared effect, we find a negative influence on the risk of heart disease when increasing ST depression (oldpeak). When accounting for site-specific deviations, the effect becomes more diverse, particularly for Hungary.

[Figure 2: panels "Shared effect", "Site effects", and "Shared + site effects" of oldpeak, with site-specific curves for cleveland, hungarian, switzerland, and va.]

Fig. 2: Partial effects of oldpeak: the shared effect (left) and the site-specific effects for the four sites (middle). The plot on the right-hand side shows the sum of shared and site-specific effects.

The shared effect estimates coincide with those of CWB fitted on the pooled data (cf. Figure 3). This again underpins the lossless property of the proposed algorithm. The site-specific effects on the pooled data are fitted by defining a row-wise Kronecker base learner for all features and the site as a categorical variable. The same approach is used to estimate a GAMM using mgcv fitted on the pooled data with tensor products between the main feature and the categorical site variable. A comparison of all partial feature effects is given in the Supplementary Material B.7, showing good alignment between the different methods. For the oldpeak effect shown in Figure 3, we also see that the partial effects of the two CWB methods are very close to the mixed model-based estimation, with only smaller differences caused by a slightly different penalization strength of both approaches. The empirical risk is 0.4245 for our distributed CWB algorithm, 0.4245 for CWB on the pooled data, and 0.4441 for the GAMM on the pooled data.

[Figure 3: aggregated site-specific feature effects of oldpeak per site, comparing the estimates of dsCWB and mgcv.]
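A pooled-data reference model of this kind can be sketched with mgcv (a minimal sketch, assuming the pooled data sit in a data frame dat with a factor site and a binary response heart_disease; these column names are hypothetical, and the factor-smooth basis used here is one of several ways to encode site-specific smooth deviations rather than the exact construction from the paper):

    library(mgcv)

    m <- gam(
      heart_disease ~ s(site, bs = "re") +      # random intercept per site
        s(oldpeak) +                            # shared smooth effect
        s(oldpeak, site, bs = "fs", m = 1),     # site-specific smooth deviations
      data   = dat,
      family = binomial()
    )
    plot(m, pages = 1)                          # partial effect per model term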
Gambs S, Kégl B, Aïmeur E (2007) Privacy-preserving boosting. Data Mining and Knowledge Discovery 14(1):131–170

Gaye A, Marcon Y, Isaeva J, et al (2014) DataSHIELD: taking the analysis to the data, not the data to the analysis. International Journal of Epidemiology 43(6):1929–1944

Hoerl AE, Kennard RW (1970) Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55–67

Hofner B, Hothorn T, Kneib T, et al (2011) A framework for unbiased model selection based on boosting. Journal of Computational and Graphical Statistics 20(4):956–971

Jones EM, Sheehan NA, Gaye A, et al (2013) Combined analysis of correlated data when data cannot be pooled. Stat 2(1):72–85

Karr AF, Lin X, Sanil AP, et al (2005) Secure regression on distributed databases. Journal of Computational and Graphical Statistics 14(2):263–279

Lazarevic A, Obradovic Z (2001) The distributed boosting algorithm. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pp 311–316

Li J, Kuang X, Lin S, et al (2020a) Privacy preservation for machine learning training and classification based on homomorphic encryption schemes. Information Sciences 526:166–179

Li Q, Wen Z, He B (2020b) Practical federated gradient boosting decision trees. In: Proceedings of the AAAI conference on artificial intelligence, pp 4642–4649

Liew BX, Rügamer D, Abichandani D, et al (2020) Classifying individuals with and without patellofemoral pain syndrome using ground force profiles – Development of a method using functional data boosting. Gait & Posture 80:90–95

Litwin W, Mark L, Roussopoulos N (1990) Interoperability of multiple autonomous databases. ACM Computing Surveys (CSUR) 22(3):267–293

Lu CL, Wang S, Ji Z, et al (2015) WebDISCO: a web service for distributed Cox model learning without patient-level data sharing. Journal of the American Medical Informatics Association 22(6):1212–1219

Luo C, Islam M, Sheils NE, et al (2022) DLMM as a lossless one-shot algorithm for collaborative multi-site distributed linear mixed models. Nature Communications 13(1):1–10

McCullagh P, Nelder JA (2019) Generalized Linear Models. Routledge

McMahan B, Moore E, Ramage D, et al (2017) Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics, PMLR, pp 1273–1282

Mirza B, Wang W, Wang J, et al (2019) Machine learning and integrative analysis of biomedical big data. Genes 10(2):87

Mohassel P, Zhang Y (2017) SecureML: A system for scalable privacy-preserving machine learning. In: 2017 IEEE Symposium on Security and Privacy (SP), IEEE, pp 19–38

Naehrig M, Lauter K, Vaikuntanathan V (2011) Can homomorphic encryption be practical? In: Proceedings of the 3rd ACM Workshop on Cloud Computing Security, pp 113–124

Paillier P (1999) Public-key cryptosystems based on composite degree residuosity classes. In: International Conference on the Theory and Applications of Cryptographic Techniques, Springer, pp 223–238

R Core Team (2021) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, URL https://www.R-project.org/

Rügamer D, Brockhaus S, Gentsch K, et al (2018) Boosting factor-specific functional historical models for the detection of synchronization in bioelectrical signals. Journal of the Royal Statistical Society: Series C (Applied Statistics) 67(3):621–642. https://doi.org/10.1111/rssc.12241

Saintigny P, Zhang L, Fan YH, et al (2011) Gene expression profiling predicts the development of oral cancer. Cancer Prevention Research 4(2):218–229

Zhu R, Jiang C, Wang X, et al (2020) Privacy-preserving construction of generalized linear mixed model for biomedical computation. Bioinformatics 36(Supplement 1):i128–i135. https://doi.org/10.1093/bioinformatics/btaa478