Privacy-Preserving and Lossless Distributed Estimation of High-Dimensional Generalized Additive Mixed Models

Download as pdf or txt
Download as pdf or txt
You are on page 1of 16

Privacy-Preserving and Lossless Distributed Estimation of

High-Dimensional Generalized Additive Mixed Models


arXiv:2210.07723v2 [stat.ML] 10 Mar 2023

Schalk Daniel1,2*, Bischl Bernd1,2 and Rügamer David1,2,3


1* Department of Statistics, LMU Munich, Munich, Germany.
2 Munich
Center for Machine Learning (MCML).
3 Department of Statistics, TU Dortmund, Dortmung, Germany.

*Corresponding author(s). E-mail(s): daniel.schalk@stat.uni-muenchen.de;


Contributing authors: bernd.bischl@stat.uni-muenchen.de;
david.ruegamer@stat.uni-muenchen.de;

Abstract
Various privacy-preserving frameworks that respect the individual’s privacy in the analysis of data
have been developed in recent years. However, available model classes such as simple statistics or
generalized linear models lack the flexibility required for a good approximation of the underlying
data-generating process in practice. In this paper, we propose an algorithm for a distributed, privacy-
preserving, and lossless estimation of generalized additive mixed models (GAMM) using component-
wise gradient boosting (CWB). Making use of CWB allows us to reframe the GAMM estimation as
a distributed fitting of base learners using the L2 -loss. In order to account for the heterogeneity of
different data location sites, we propose a distributed version of a row-wise tensor product that allows
the computation of site-specific (smooth) effects. Our adaption of CWB preserves all the important
properties of the original algorithm, such as an unbiased feature selection and the feasibility to fit
models in high-dimensional feature spaces, and yields equivalent model estimates as CWB on pooled
data. Next to a derivation of the equivalence of both algorithms, we also showcase the efficacy of
our algorithm on a distributed heart disease data set and compare it with state-of-the-art methods.

Keywords: Distributed Computing, Functional Gradient Descent, Generalized Linear Mixed Model,
Machine Learning, Privacy-preserving Modelling

1 Introduction makes data analysis challenging, particularly if


methods require or notably benefit from incorpo-
More than ever, data is collected to record the rating all available (but distributed) information.
ubiquitous information in our everyday life. How- For example, personal patient information is typ-
ever, on many occasions, the physical location ically distributed over several hospitals, while
of data points is not confined to one place (one sharing or merging different data sets in a central
global site) but distributed over different locations location is prohibited. To overcome this limita-
(sites). This is the case for, e.g., patient records tion, different approaches have been developed to
that are gathered at different hospitals but usu-
ally not shared between hospitals or other facilities
due to the sensitive information they contain. This

1
directly operate at different sites and unite infor- different sites, the assumption of having iden-
mation without having to share sensitive parts of tically distributed observations across all sites
the data to allow privacy-preserving data analysis. often does not hold. In this case, a reasonable
assumption for the data-generating process is a
Distributed Data site-specific deviation from the general population
Distributed data can be partitioned vertically mean. Adjusting models to this situation is called
or horizontally across different sites. Horizon- interoperability (Litwin et al, 1990), while ignor-
tally partitioned data means that observations ing it may lead to biased or wrong predictions.
are spread across different sites with access to
all existing features of the available data point, 1.1 Related Literature
while for vertically partitioned data, different sites
Various approaches for distributed and privacy-
have access to all observations but different fea-
preserving analysis have been proposed in recent
tures (covariates) for each of these observations.
years. In the context of statistical models, Karr
In this work, we focus on horizontally partitioned
et al (2005) describe how to calculate a lin-
data. Existing approaches for horizontally parti-
ear model (LM) in a distributed and privacy-
tioned data vary from fitting regression models
preserving fashion by sharing data summaries.
such as generalized linear models (GLMs; Wu
Jones et al (2013) propose a similar approach for
et al, 2012; Lu et al, 2015; Jones et al, 2013;
GLMs by communicating the Fisher information
Chen et al, 2018), to conducting distributed eval-
and score vector to conduct a distributed Fisher
uations (Boyd et al, 2015; Ünal et al, 2021;
scoring algorithm. The site information is then
Schalk et al, 2022b), to fitting artificial neural
globally aggregated to estimate the model param-
networks (McMahan et al, 2017). Furthermore,
eters. Other privacy-preserving techniques include
various software frameworks are available to run
ridge regression (Chen et al, 2018), logistic regres-
a comprehensive analysis of distributed data. One
sion, and neural networks (Mohassel and Zhang,
example is the collection of R (R Core Team, 2021)
2017).
packages DataSHIELD (Gaye et al, 2014), which
In machine learning, methods such as the naive
enables data management and descriptive data
Bayes classifier, trees, support vector machines,
analysis as well as securely fitting of simple statis-
and random forests (Li et al, 2020a) exist with
tical models in a distributed setup without leaking
specific encryption techniques (e.g., the Paillier
information from one site to the others.
cryptosystem; Paillier, 1999) to conduct model
updates. In these setups, a trusted third party is
Interpretability and Data Heterogeneity
usually required. However, this is often unrealistic
In many research areas that involve critical and difficult to implement, especially in a medi-
decision-making, especially in medicine, meth- cal or clinical setup. Furthermore, as encryption
ods should not only excel in predictive perfor- is an expensive operation, its application is infea-
mance but also be interpretable. Models should sible for complex algorithms that require many
provide information about the decision-making encryption calls (Naehrig et al, 2011). Existing
process, the feature effects, and the feature impor- privacy-preserving boosting techniques often focus
tance as well as intrinsically select important on the AdaBoost algorithm by using aggregation
features. Generalized additive models (GAMs; see, techniques of the base classifier (Lazarevic and
e.g., Wood, 2017) are one of the most flexible Obradovic, 2001; Gambs et al, 2007). A different
approaches in this respect, providing an inter- approach to boosting decision trees in a federated
pretable yet complex models that also allow for learning setup was introduced by (Li et al, 2020b)
non-linearity in the data. using a locality-sensitive hashing to obtain simi-
As longitudinal studies are often the most larities between data sets without sharing private
practical way to gather information in many information. These algorithms focus on aggregat-
research fields, methods should also be able to ing tree-based base components, making them
account for subject-specific effects and account for difficult to interpret, and come with no inferential
the correlation of repeated measurements. Fur- guarantees.
thermore, when analyzing data originating from
3

In order to account for repeated measure- account for repeated measurements, compute fea-
ments, Luo et al (2022) propose a privacy- ture importance, and conduct feature selection.
preserving and lossless way to fit linear mixed Furthermore, CWB is suited for high-dimensional
models (LMMs) to correct for heterogeneous site- data situations (n  p). CWB is therefore often
specific random effects. Their concept of only shar- used in practice for, e.g., predicting the develop-
ing aggregated values is similar to our approach, ment of oral cancer (Saintigny et al, 2011), classi-
but is limited in the complexity of the model and fying individuals with and without patellofemoral
only allows normally distributed outcomes. Other pain syndrome (Liew et al, 2020), or detecting
methods to estimate LMMs in a secure and dis- synchronization in bioelectrical signals (Rügamer
tributed fashion are Zhu et al (2020), Anjum et al et al, 2018). However, there have so far not
(2022), or Yan et al (2022). been any attempts to allow for a distributed,
Besides privacy-preserving and distributed privacy-preserving, and lossless computation of
approaches, integrative analysis is another tech- the CWB algorithm. In this paper, we propose
nique based on pooling the data sets into one and a distributed version of CWB that yields the
analyzing this pooled data set while considering identical model produced by the original algo-
challenges such as heterogeneity or the curse of rithm on pooled data and that accounts for site
dimensionality (Curran and Hussong, 2009; Baze- heterogeneity by including interactions between
ley, 2012; Mirza et al, 2019). While advanced from features and a site variable. This is achieved by
a technical perspective by, e.g., outsourcing com- adjusting the fitting process using 1) a distributed
putational demanding tasks such as the analysis of estimation procedure, 2) a distributed version of
multi-omics data to cloud services (Augustyn et al, row-wise tensor product base learners, and 3) an
2021), the existing statistical cloud-based meth- adaption of the algorithm to conduct feature selec-
ods only deal with basic statistics. The challenges tion in the distributed setup. We implement our
of integrative analysis are similar to the ones tack- method in R using the DataSHIELD framework and
led in this work, our approach, however, does not demonstrate its application in an exemplary med-
allow merging the data sets in order to preserve ical data analysis. Our distributed version of the
privacy. original CWB algorithm does not have any addi-
tional HPs and uses optimization strategies from
1.2 Our Contribution previous research results to define meaningful val-
ues for all hyperparameters, effectively yielding a
This work presents a method to fit generalized tuning-free method.
additive mixed models (GAMMs) in a privacy- The remainder of this paper is structured as
preserving and lossless manner1 to horizontally follows: First, we introduce the basic notation,
distributed data. This not only allows the incorpo- terminology, and setup of GAMMs in Section 2.
ration of site-specific random effects and accounts We then describe the original CWB algorithm
for repeated measurements in LMMs, but also in Section 2.3 and its link to GAMMs. In
facilitates the estimation of mixed models with Section 3, we present the distributed setup and our
responses following any distribution from the novel extension of the CWB algorithm. Finally,
exponential family and provides the possibility to Section 4 demonstrates both how our distributed
estimate complex non-linear relationships between CWB algorithm can be used in practice and how
covariates and the response. To the best of our to interpret the obtained results.
knowledge, we are the first to provide an algorithm
to fit the class of GAMMs in a privacy-preserving Implementation
and lossless fashion on distributed data.
We implement our approach as an R package using
Our approach is based on component-wise gra-
the DataSHIELD framework and make it available
dient boosting (CWB; Bühlmann and Yu, 2003).
on GitHub2 . The code for the analysis can also be
CWB can be used to estimate additive models,
found in the repository3 .

1
In this article, we define a distributed fitting procedure as 2
github.com/schalkdaniel/dsCWB
lossless if the model parameters of the algorithm are the same 3
github.com/schalkdaniel/dsCWB/blob/main/usecase/
as the ones computed on the pooled data. analyse.R
2 Background 2019), conditional on features x(i) and the real-
ization of some random effects. The expectation
2.1 Notation and Terminology µ := E(Y (i) |x(i) , u(i) ) of the response Y (i) for
Our proposed approach uses the CWB algorithm observations i = 1, . . . , ns of measurement unit (or
as fitting engine. Since this method was initially site) s in GAMMs is given by
developed in machine learning, we introduce here
h−1 (µ(i) ) = f (i)
both the statistical notation used for GAMMs as X (i) X (i) (i)
X
well as the respective machine learning terminol- = x j βj + uj γj,s + φj (xj ).
ogy and explain how to relate the two concepts. j∈J1 j∈J2 j∈J3
We assume a p-dimensional covariate or fea- (1)
ture space X = (X1 × . . . × Xp ) ⊆ Rp and response
or outcome values from a target space Y. The goal In (1), h is a smooth monotonic response function,
of boosting is to find the unknown relationship f f corresponds to the additive predictor, γj,s ∼
between X and Y. In turn, GAMMs (as presented N (0, ψ) are random effects accounting for hetero-
in Section 2.2) model the conditional distribu- geneity in the data, and φj are non-linear effects
tion of an outcome variable Y with realizations of pre-specified covariates. The different index sets
y ∈ Y, given features = (x1 , .. . , xp ) ∈ X . Given
 x (1)  J1 , J2 , J3 ⊆ {1, . . . , p} ∪ ∅ indicate which fea-
a data set D = x , y (1) , . . . , x(n) , y (n) tures are modeled as fixed effects, random effects,
with n observations drawn (conditionally) inde- or non-linear (smooth) effects, respectively. The
pendently from an unknown probability distri- modeler usually defines these sets. However, as we
bution Pxy on the joint space X × Y, we aim will also explain later, the use of CWB as a fit-
to estimate this functional relationship in CWB ting engine allows for automatic feature selection
with fˆ. The goodness-of-fit of a given model and therefore does not require explicitly defin-
fˆ is assessed byPcalculating the empirical risk ing these sets. In GAMMs, smooth effects are
Remp (fˆ) = n−1 (x,y)∈D L(y, fˆ(x)) based on a usually represented by (spline) basis functions,
loss function L : Y × R → R and the data set D. i.e., φj (xj ) ≈ (Bj,1 (xj ), . . . , Bj,dj (xj ))> θj , where
Minimizing Remp using this loss function is equiv- θj ∈ Rdj are the basis coefficients corresponding
alent to estimating f using maximum likelihood to each basis function Bj,dj . The coefficients are
by defining L(y, f (x)) = −`(y, h(f (x))) with log- typically constrained in their flexibility by adding
likelihood `, response function h and minimizing a quadratic (difference) penalty for (neighboring)
the sum of log-likelihood contributions. coefficients to the objective function to enforce
In the following, we also require the vector smoothness. GAMMs, as in (1), are not limited to
(1) (n)
xj = (xj , . . . , xj )T ∈ Xj , which refers to the univariate smooth effects φj , but allow for higher-
j th feature. Furthermore, let x = (x1 , . . . , xp ) and dimensional non-linear effects φ(xj1 , xj2 , . . . , xjk ).
y denote arbitrary members of X and Y, respec- The most common higher-dimensional smooth
tively. A special role is further given to a subset interaction effects are bivariate effects (k = 2) and
u = (u1 , . . . , uq )> , q ≤ p, of features x, which will can be represented using a bivariate or a tensor
be used to model the heterogeneity in the data. product spline basis (see Section 2.3.1 for more
details). Although higher-order splines with k > 2
2.2 Generalized Additive Mixed are possible, models are often restricted to bivari-
ate interactions for the sake of interpretability and
Models computational feasibility. In Section 3, we will fur-
A very flexible class of regression models to ther introduce varying coefficient terms φj,s (xj ) in
model the relationship between covariates and the the model (1), i.e., smooth effects f varying with a
response are GAMMs (see, e.g., Wood, 2017). second variable s. Analogous to random slopes, s
In GAMMs, the response Y (i) for observation can also be the index set defining observation units
i = 1, . . . , ns of measurement unit (or site) s is of random effects J2 . Using an appropriate dis-
assumed to follow some exponential family distri- tribution assumption for the basis coefficients θj ,
bution such as the Poisson, binomial, or normal these varying coefficients can then be considered
distributions (see, e.g., McCullagh and Nelder, as random smooth effects.
5

2.3 Component-Wise Boosting (Regularized) linear base learners


A linear base learner is used to include linear
Component-wise (gradient) boosting (CWB; effects of a features xj1 , . . . , xjdl into the model.
Bühlmann and Yu, 2003; Bühlmann et al, 2007) The basis transformation is given by gl (x) =
is a an iterative algorithm that performs block- (gl,1 (x), . . . , gl,dl +1 (x))T = (1, xj1 , . . . , xjdl )T . Lin-
coordinate descent steps with blocks (or base ear base learners can be regularized by incorpo-
learners) corresponding to the additive terms in rating a ridge penalization (Hoerl and Kennard,
(1). With a suitable choice of base learners and 1970) with tunable penalty parameter λl as an
objective function, CWB allows efficient optimiza- HP αl . Fitting a ridge penalized linear base
tion of GAMMs, even in high-dimensional settings learner to a response vector y ∈ Rn results
with p  n. We will first introduce the concept in the penalized least squares estimator θ̂l =
of base learners that embed additive terms of the (ZlT Zl + Kl )−1 ZlT y with penalty matrix Kl =
GAMM into boosting and subsequently describe λl Idl +1 , where Id denotes the d-dimensional iden-
the actual fitting routine of CWB. Lastly, we tity matrix. Often, an unregularized linear base
will describe the properties of the algorithm and learner is also included to model the contribution
explain its connection to model (1). of one feature xj as a linear base learner with-
out penalization. The basis transformation is then
2.3.1 Base Learners given by gl (x) = (1, xj )T and λl = 0.

In CWB, the lth base learner bl : X → R is Spline base learners


used to model the contribution of one or mul-
These base learners model smooth effects using
tiple features in the model. In this work, we
univariate splines. A common choice is penal-
investigate parametrized base learners bl (x, θl )
ized B-splines (P-Splines; Eilers and Marx,
with parameters θl ∈ Rdl . For simplicity, we
1996), where the feature xj is transformed
will use θ as a wildcard for the coefficients
using a B-spline basis transformation gl (x) =
of either fixed effects, random effects, or spline
(Bl,1 (xj ), . . . , Bl,dl (xj ))T with dl basis functions
bases in the following. We assume that each base
gl,m = Bl,m , m = 1, . . . , dl . In this case, the
learner can be represented by a generic basis
choice of the spline order B, the number of basis
representation gl : X → Rdl , x 7→ gl (x) =
functions dl , the penalization term λl , and the
(gl,1 (x), . . . , gl,dl (x))T and is linear in the parame-
order v of the difference penalty (represented by
ters, i.e., bl (x, θl ) = gl (x)T θl . For n observations,
a matrix Dl ∈ Rdl−v ×dl ) are considered HPs αl
we define the design matrix of a base learner bl
of the base learner. The base learner’s parameter
as Zl := (gl (x(1) ), . . . , gl (x(n) ))T ∈ Rn×dl . Note
estimator in general is given by the penalized least
that base learners are typically not defined on the
squares solution θ̂l = (ZlT Zl + Kl )−1 ZlT y, with
whole feature space but on a subset Xl ⊆ X . For
penalization matrix Kl = λl Dl> Dl in the case of
example, a common choice for CWB is to define
P-splines.
one base learner for every feature xl ∈ Xl to model
the univariate contributions of that feature. Categorical and random effect base
A base learner bl (x, θl ) can depend on HPs learners
αl that are set prior to the fitting process. For
example, choosing a base learner using a P-spline Categorical features xj ∈ {1, . . . , G} with G ∈
(Eilers and Marx, 1996) representation requires N, G ≥ 2 classes are handled by a binary
setting the degree of the basis functions, the order encoding gl (x) = (1{1} (xj ), . . . , 1{G} (xj ))T with
of the difference penalty term, and a parameter λl the indicator function 1A (x) = 1 if x ∈ A
determining the smoothness of the spline. In order and 1A (x) = 0 if x ∈ / A. A possible alter-
to represent GAMMs in CWB, the following four native encoding is the dummy encoding with
base learner types are used. ğl (x) = (1, 1{1} (xj ), . . . , 1{G−1} (xj ))T with refer-
ence group G. Similar to linear and spline base
learners, it is possible incorporate a ridge penal-
ization with HP αl = λl . This results in the
base learner’s penalized least squared estimator
θ̂l = (ZlT Zl +Kl )−1 ZlT y with penalization matrix chosen based on its sum of squared errors (SSE)
Kl = λl IG . Due to the mathematical equiva- when regressing the pseudo residuals r̃ [m] =
lence of ridge penalized linear effects and random (r [m](1) , . . . , r [m](n) )T onto the base learner’s fea-
effects with normal prior (see, e.g., Brumback tures using the L2 -loss. Further details of CWB
et al, 1999), this base learner can further be used are given in Algorithm 1 (see, e.g., Schalk et al,
to estimate random effect predictions γ̂j when 2022a).
using categorical features uj and thereby account
for heterogeneity in the data. Controling HPs of CWB
Good estimation performance can be achieved by
Row-wise tensor product base learners selecting a sufficiently small learning rate, e.g.,
This type of base learner is used to model a pair- 0.01, as suggested in Bühlmann et al (2007), and
wise interaction between two features xj and xk . adaptively selecting the number of boosting iter-
Given two base learners bj and bk with basis ations via early stopping on a validation set. To
representations gj (x) = (gj,1 (xj ), . . . , gj,dj (xj ))T enforce a fair selection of model terms and thus
and gk (x) = (gk,1 (xk ), . . . , gk,dk (xk ))T , the unbiased effect estimation, regularization param-
basis representation of the row-wise tensor eters are set such that all base learners have
product base learner bl = bj × bk is the same degrees-of-freedom (Hofner et al, 2011).
defined as gl (x) = (gj (x)T ⊗ gk (x)T )T = As noted by Bühlmann et al (2007), choosing
(gj,1 (xj )gk (x)T , . . . , gj,dj (xj )gk (x)T )T ∈ Rdl with smaller degrees-of-freedom induces more penaliza-
dl = dj dk . The HPs αl = {αj , αk } of a row-wise tion (and thus, e.g., smoother estimated function
tensor product base learner are induced by the for spline base learners), which yields a model with
HPs αj and αk of the respective individual base lower variance at the cost of a larger bias. This
learners. Analogously to other base learners, the bias induces a shrinkage in the estimated coeffi-
penalized least squared estimator in this case is cients towards zero but can be reduced by running
θ̂l = (ZlT Zl +Kl )−1 ZlT y with penalization matrix the optimization process for additional iterations.
Kl = τj Kj ⊗ Idk + Idj ⊗ τk Kk ∈ Rdl ×dl . This
Kronecker sum penalty, in particular, allows for
Algorithm 1 Vanilla CWB algorithm
anisotropic smoothing with penalties τj and τk
when using two spline bases for gj and gk , and Input Train data D, learning rate ν, number of
varying coefficients or random splines when com- boosting iterations M , loss function L,
bining a (penalized) categorical base learner and set of base learner B
a spline base learner. Output Model fˆ[M ] defined by fitted parameters
θ̂ [1] , . . . , θ̂ [M ]
2.3.2 Fitting Algorithm 1: procedure CWB(D, ν, L, B)
2: Initialize: fˆ[0] (x) = arg minc∈R Remp (c)
CWB first initializes an estimate fˆ of the addi- 3: for m ∈ {1, . . . , M } do
tive predictor with a loss-optimal constant value 4: r̃[m](i) = −∇f L(y (i) , f (x(i) ))|f =fˆ[m−1] ,
fˆ[0] = arg minc∈R Remp (c). It then proceeds and ∀i ∈ {1, . . . , n}
estimates Eq. (1) using an iterative steepest 5: for l ∈ {1, . . . , |B|} do
descent minimization in function space by fit-  −1
[m]
6: θ̂l = ZT l Zl + Kl ZTl r̃
[m]
ting the previously defined base learners to the Pn [m]
model’s functional gradient ∇f L(y, f ) evaluated 7: SSEl = i=1 (r̃[m](i) − bl (x(i) , θ̂l ))2
at the current model estimate fˆ. Let fˆ[m] denote 8: end for
9: l[m] = arg minl∈{1,...,|B|} SSEl
the model estimation after m ∈ N iterations. In
[m]
each step in CWB, the pseudo residuals r̃[m](i) = 10: fˆ[m] (x) = fˆ[m−1] (x) + νbl[m] (x, θ̂l[m] )
−∇f L(y (i) , f (x(i) ))|f =fˆ[m−1] for i ∈ {1, . . . , n} 11: end for
are first computed. CWB then selects the best- 12: return fˆ = fˆ[M ]
13: end procedure
fitting base learner from a pre-defined pool of
base-learners denoted by B = {bl }l∈{1,...,|B|} and
adds the base learner’s contribution to the pre-
vious model fˆ[m] . The selected base learner is
7

2.3.3 Properties and Link to is located at a different site s and potentially


Generalized Additive Mixed follows a different data distributions Pxy,s . The
Models union of all data sets yields the whole data set
D = ∪Ss=1 Ds with mutually exclusive data sets
The estimated coefficients θ̂ resulting from run- Ds ∩ Dl = ∅ ∀l, s ∈ {1, . . . , S}, l 6= s. The vector
ning the CWB algorithm are known to converge of realizations per site is denoted by ys ∈ Y ns .
to the maximum likelihood solution (see, e.g., In this distributed setup, multiple ways exist to
Schmid and Hothorn, 2008) for M → ∞ under cer- communicate information without revealing indi-
tain conditions. This is due to the fact that CWB vidual information. More complex methods such
performs a coordinate gradient descent update as differential privacy (Dwork, 2006), homomor-
of a model defined by its additive base learn- phic encryption (e.g., the Paillier cryptosystem;
ers that exactly represent the structure of an Paillier, 1999), or k-anonymity (Samarati and
additive mixed model (when defining the base Sweeney, 1998) allow sharing information without
learners according to Section 2.3.1) and by the violating an individual’s privacy. An alternative
objective function that corresponds to the neg- option is to only communicate aggregated statis-
ative (penalized) log-likelihood. Two important tics. This is one of the most common approaches
properties of this algorithm are 1) its coordinate- and is also used by DataSHIELD (Gaye et al, 2014)
wise update routine, and 2) the nature of model for GLMs or by Luo et al (2022) for LMMs.
updates using the L2 -loss. Due to the first prop- DataSHIELD, for example, uses a privacy level that
erty, CWB can be used in settings with p  indicates how many individual values must be
n, as only a single additive term is fitted onto aggregated to allow the communication of aggre-
the pseudo-residuals in every iteration. This not gated values. For example, setting the privacy
only reduces the computational complexity of the level to a value of 5 enables sharing of sum-
algorithm for an increasing number of additive mary statistics such as sums, means, variances,
predictors (linear instead of quadratic) but also etc. if these are computed on at least 5 elements
allows variable selection when stopping the rou- (observations).
tine early (e.g., based on a validation data set), as
not all the additive components might have been Host and Site Setup
selected into the model. In particular, this allows
Throughout this article, we assume the 1, . . . , S
users to specify the full GAMM model without
sites or servers to have access to their respective
manual specification of the type of feature effect
data set Ds . Each server is allowed to communi-
(fixed or random, linear or non-linear) and then
cate with a host server that is also the analyst’s
automatically sparsify this model by an objec-
machine. In this setting, the analyst can poten-
tive and data-driven feature selection. The second
tially see intermediate data used when running
property, allows fitting models of the class of gen-
the algorithms, and hence each message communi-
eralized linear/additive (mixed) models using only
cated from the servers to the host must not allow
the L2 -loss instead of having to work with some
any reconstruction of the original data. The host
iterative weighted least squares routine. In partic-
server is responsible for aggregating intermediate
ular, this allows performing the proposed lossless
results and communicating these results back to
distributed computations described in this paper,
the servers.
as we will discuss in Section 3.

2.4 Distributed Computing Setup 3 Distributed


and Privacy Protection Component-Wise Boosting
Before presenting our main results, we now intro- We now present our distributed version of the
duce the distributed data setup we will work CWB algorithm to fit privacy-preserving and loss-
with throughout the remainder of this paper. The less GAMMs. In the following, we first describe
data set D n is horizontally
 partitioned
 intooS data further specifications of our setup in Section 3.1,
(1) (1) (ns ) (ns )
sets Ds = xs , ys , . . . , xs , ys ,s= elaborate on the changes made to the set of base
1, . . . , S with ns observations. Each data set Ds
learners in Section 3.2, and then show how to learners bl× = b0 × bl of the regularized categor-
adapt CWB’s fitting routine in Section 3.3. ical base learner b0 dummy-encoding every site
and all other existing base learners bl ∈ B. This
3.1 Setup allows for potential inclusion of random effects for
every fixed effect in the model. More specifically,
In the following, we distinguish between site- the lth site-specific effect given by the row-wise
specific and shared effects. As effects estimated tensor product base learner bl× uses the basis
across sites typically correspond to fixed effects transformation gl× = g0 ⊗ gl
and effects modeled for each site separately are
usually represented using random effects, we use
gl× (x̃) = g0 (x0 )T ⊗ gl (x)T
the terms as synonyms in the following, i.e., shared
effects and fixed effects are treated interchange- = (1{1} (x0 )gl (x)T , . . . , 1{S} (x0 )gl (x)T )T ,
ably and the same holds for site-specific effects
| {z } | {z }
=gl× ,1 =gl× ,S
and random effects. We note that this is only for (2)
ease of presentation and our approach also allows
for site-specific fixed effects and random shared where the basis transformation gl is equal for all S
effects. As the data is not only located at dif- sites. After distributed computation (see Eq. (4)
ferent sites but also potentially follows different in the next section), the estimated coefficients
data distributions Pxy,s at each site s, we extend are θ̂l× = (θ̂Tl× ,1 , . . . , θ̂Tl× ,S )T with θ̂l× ,s ∈ Rdl .
Eq. (1) to not only include random effects per The regularization of the row-wise Kronecker base
site, but also site-specific smooth (random) effects learners not only controls their flexibility but also
φj,s (xj ), s = 1, . . . , S for all features xj with assures identifiable when additionally including a
j ∈ J3 . For every of these smooth effects φj,s shared (fixed) effect for the same covariate. The
we assume an existing shared effect fj,shared that penalty matrix Kl× = λ0 K0 ⊗ Idl + IS ⊗ λl× Kl ∈
is equal for all sites. These assumptions – par- RSdl ×Sdl is given as Kronecker sum of the penalty
ticularly the choice of site-specific effects – are matrix of the categorical site effect and the penalty
made for demonstration purposes. In a real-world matrices K0 and Kl with respective regularization
application, the model structure can be defined strengths λ0 , λl× . As K0 = λ0 IS is a diago-
individually to match the given data situation. nal matrix, Kl× is a block matrix with entries
However, note again that CWB intrinsically per- λ0 Idl + λl× Kl on the diagonal blocks. Moreover,
forms variable selection, and there is thus no need as g0 is a binary vector, we can also express the
to manually define the model structure in practice. design matrix Zl× ∈ Rn×Sdl as a block matrix,
In order to incorporate the site information into yielding
(i)
the model, we add a variable x0 ∈ {1, . . . , S} for
(i)
the site to the data by setting x̃(i) = (x0 , x(i) ). Zl× = diag(Zl,1 , . . . , Zl,S ), Kl×
The site variable is a categorical feature with S = diag(λ0 Idl + λl× Kl , . . . , λ0 Idl + λl× Kl ),
classes. (3)

3.2 Base Learners where Zl,k are the distributed design matrices of
bl on sites s = 1, . . . , S.
For shared effects, we keep the original struc-
ture of CWB with base learners chosen from a
set of possible learners B. Section 3.3.1 explains 3.3 Fitting Algorithm
how these shared effects are estimated in the dis- We now describe the adaptions required to allow
tributed setup. We further define a regularized for distributed computations of the CWB fitting
categorical base learner b0 with basis transfor- routine. In Sections 3.3.1 and 3.3.2, we show the
mation g0 (x0 ) = (1{1} (x0 ), . . . , 1{S} (x0 ))T and equality between our distributed fitting approach
design matrix Z0 ∈ Rn×S . We use b0 to extend B and CWB fitted on pooled data. Section 3.3.3
with a second set of base learners B× = {b0 ×b | b ∈ describes the remaining details such as distributed
B} to model site-specific random effects. All base SSE calculations, distributed model updates, and
learners in B× are row-wise tensor product base pseudo residual updates in the distributed setup.
9

Section 3.4 summarizes the distributed CWB algo- Algorithm 2 Distributed Effect Estimation.
rithm and Section 3.5 elaborates on the commu- The line prefixes [S] and [H] indicate whether the
nication costs of our algorithm. operation is conducted at the sites ([S]) or at the
host ([H]).
3.3.1 Distributed Shared Effects Input Sites design matrices Zl,1 , . . . , Zl,S ,
Computation response vectors y1 , . . . , yS and
an optional penalty matrix Kl .
Fitting CWB in a distributed fashion requires
Output Estimated parameter vector θ̂l .
adapting the fitting process of the base learner bl
in Algorithm 1 to distributed data. To allow for 1: procedure distFit(Zl,1 , . . . , Zl,S , y1 , . . . , yS , Kl )
shared effects computations across different sites 2: for s ∈ {1, . . . , S} do
3: [S] Fl,s = ZT l,s Zl,s
without jeopardizing privacy, we take advantage
of CWB’s update scheme, which boils down to a 4: [S] ul,s = ZTl,s ys
5: [S] Communicate Fl,s and ul,s to the host
(penalized) least squares estimation per iteration
6: end for P
for every base learner. This allows us to build upon
7: [H] Fl = S Fl,s + Kl
existing work such as Karr et al (2005) to fit linear Ps=1
8: [H] ul = S s=1 ul,s
models in a distributed fashion by just communi-
9: [H] return θ̂l = Fl−1 ul
cating aggregated statistics between sites and the
10: end procedure
host.
In a first step, the aggregated matrices Fl,s =
T T
Zl,s Zl,s and vectors ul,s = Zl,s ys are computed
on each site. In our privacy setup (Section 2.4),
3.3.2 Distributed Site-specific Effects
communicating Fl,s and ul,s is allowed as long as Computation
the privacy-aggregation level per site is met. In If we pretend that the fitting of the base learner
a second step, the site information PS is aggregated bl× is performed on the pooled data, we obtain
to a global information Fl = s=1 Fl,s + Kl and
PS −1
ul = s=1 ul,s and then used to estimate the

θ̂l× = ZlT× Zl× + Kl× ZlT× y
model parameters θ̂l = Fl−1 ul . This approach,
T
Zl,1 + λ0 Idl + Kl )−1 Zl,1
T
 
referred to as distFit, is explained again in detail (Zl,1 y1
in Algorithm 2 and used for the shared effect com- = ..
,
 
[m] .
putations of the model by substituting θ̂l = T −1 T
 −1 T [m] (Zl,S Zl,S + λ0 Idl + Kl ) Zl,S yS
ZlT Zl + Kl Zl r̃ (Algorithm 1 line 6) with (4)
[m] [m] [m]
θ̂l = distFit(Zl,1 , . . . , Zl,S , r̃1 , . . . , r̃S , Kl ).
[m]
Note that the pseudo residuals r̃k are also where (4) is due to the block structure, as
securely located at each site and are updated described in (3) of Section 3.2. This shows that the
after each iteration. Details about the dis- fitting of the site-specific effects θ̂l× can be split
tributed pseudo residuals updates are explained up into the fitting of individual parameters
in Section 3.3.3. We also note that the com-
putational complexity of fitting CWB can be
T
θ̂l× ,s = (Zl,s Zl,s + λ0 Idl + Kl )−1 Zl,s
T
ys . (5)
drastically reduced by pre-calculating and stor-
ing (ZlT Zl + Kl )−1 in a first initialization step, It is thus possible to compute site-specific effects
as the matrix is independent of iteration m, and at the respective site without the need to share any
reusing these pre-calculated matrices in all subse- information with the host. The host, in turn, only
quent iterations (cf. Schalk et al, 2022a). Using requires the SSE of the respective base learner
pre-calculated matrices also reduces the amount of (see next Section 3.3.3) to perform the next iter-
required communication between sites and host. ation of CWB. Hence, during the fitting process,
the parameter estimates remain at their sites and
are just updated if the site-specific base learner is
selected. This again minimizes the amount of data
communication between sites and host and speeds
ns
up the fitting process. After the fitting phase, the X [m]
= (r̃s[m](i) − gl (x(i) )T θ̂l× ,s )2 .
aggregated site-specific parameters are communi-
i=1
cated once in a last communication step to obtain
the final model. The site-specific SSE values are then PS sent to the
host and aggregated to SSEl = s=1 SSEl,s . If
3.3.3 Pseudo Residual Updates, SSE privacy constraints have been met in all previous
Calculation, and Base Learner calculations, sharing the individual SSE values is
Selection not critical and does not violate any privacy con-
straints as the value is an aggregation of all ns
The remaining challenges to run the distributed
observations for all sites s.
CWB algorithm are 1) the pseudo residual calcu-
Having gathered all SSE values at the host
lation (Algorithm 1 line 4), 2) the SSE calculation
location, selecting the best base learner in the cur-
(Algorithm 1 line 7), and 3) base learner selection
rent iteration is done in the exact same manner as
(Algorithm 1 line 9).
for the non-distributed CWB algorithm by select-
Distributed pseudo residual updates ing l[m] = arg minl∈{1,...,|B|,1× ,...,|B|× } SSEl . After
the selection, the index l[m] is shared with all sites
The site-specific response vector ys containing
to enable the update of the site-specific models
the values y (i) , i ∈ {1, . . . , ns } is the basis [m]
of the pseudo residual calculation. We assume fˆs . If a shared effect is selected, the parame-
[m]
that every site s has access to all shared effects ter vector θ̂l[m] is shared with all sites. Caution
as well as the site-specific information of all must be taken when the number of parameters of
site-specific base learners bl× only containing one base learner is equal to the number of obser-
the respective parameters θ̂l× ,s . Based on these vations, as this allows reverse-engineering private
base learners, it is thus possible to compute a data. In the case of a site-specific effect selection,
[m] no parameter needs to be communicated, as the
site model fˆs as a representative of fˆ[m] on
[m] respective estimates are already located at each
every site s. The pseudo residual updates r̃s site.
[m] [m](i)
per site are then based on fˆs via r̃s =
−∇f L(y (i) , f (x(i) ))|f =fˆ[m−1] , i ∈ {1, . . . , ns } 3.4 Distributed CWB Algorithm
s
using Ds . Most importantly, all remaining steps
with Site-Specific Effects
of the distributed CWB fitting procedure do not
[m] Assembling all pieces, our distributed CWB algo-
share the pseudo residuals r̃s in order to avoid
information leakage about ys . rithm is summarized in Algorithm 3.

Distributed SSE calculation and base 3.5 Communication Costs


learner selection
While the CWB iterations themselves can be per-
After fitting all base learners bl ∈ B and bl× ∈ B× formed in parallel on every site and do not slow
[m] [m] [m]
to r̃s , we obtain θ̂l , l = 1, . . . , |B|, and θ̂l× , down the process compared to a pooled calcu-
l× = 1× , . . . , |B× |. Calculating the SSE distribu- lation, it is worth discussing the communication
tively for the lth and l× th base learner bl and bl× , costs of distrCWB. During the initialization, data
respectively, requires calculating 2S site-specific is shared just once, while the fitting phase requires
SSE values: the communication of data in each iteration. Let
d = maxl dl be the maximum number of basis
ns  2
X [m] functions (or, alternatively, assume d basis func-
SSEl,s = r̃s[m](i) − bl (x(i)
s , θ̂l ) tions for all base learners). The two main drivers of
i=1
ns
the communication costs are the number of boost-
[m] ing iterations M and the number of base learners
X
= (r̃[m](i) − gl (x(i) )T θ̂l )2 ,
i=1
|B|. Because of the iterative nature of CWB with
ns  2 a single loop over the boosting iterations, the
[m]
X
SSEl× ,s = r̃s[m](i) − bl× (x(i)
s , θ̂ l× ) communication costs (both for the host and each
i=1 site) scale linearly with the number of boosting
11

Algorithm 3 Distributed CWB Algorithm.


The line prefixes [S] and [H] indicate whether the operation is conducted at the sites ([S]) or at the
host ([H]).
Input Sites with site data Dk , learning rate ν, number of boosting iterations M , loss
function L, set of shared effects B and respective site-specific effects B×
Output Prediction model fˆ
1: procedure distrCWB(ν, L, B, B× )
2: Initialization:
[0]
3: [H] Initialize shared model fˆshared (x) = arg minc∈R Remp (c)
4: [S] Calculate Zl,s and Fl,s = ZT l,s Zl,s , ∀l ∈ {1, . . . , |B|}, s ∈ {1, . . . , S}
ˆ[0] ˆ [0]
5: [S] Set fs = fshared
6: for m ∈ {1, . . . , M } or while an early stopping criterion is not met do
7: [S] Update pseudo residuals:
[m](i)
8: [S] r̃s = −∇f L(y (i) , f (x(i) ))|f =fˆ[m−1] , ∀i ∈ {1, . . . , ns }
s
9: for l ∈ {1, . . . , |B|} do
[m]
10: [H] Calculate shared effect: θ̂l = distFit(Zl,1 , . . . , Zl,S , y1 , . . . , yS , Kl )
[m]
11: [H] Communicate θ̂l to the sites
12: for k ∈ {1, . . . , S} do
[m] [m]
13: [S] Fit lth site-specific effect: θ̂l× ,s = (Fl,s + λ0 Idl + Kl )−1 Zl,s r̃s
14: [S] Calculate the SSE for the lth shared and site-specific effect:
[m]
SSEl,s = n
P s [m](i)
15: [S] i=1 (r̃ − gl (x(i) )T θ̂l )2
Pns [m](i) [m]
16: [S] SSEl× ,s = i=1 (r̃s − gl (x(i) )T θ̂l× ,s )2
17: [S] Send SSEl,s and SSEl× ,s to the host
18: end for
[H] Aggregate SSE values: SSEl = S
P PS
19: s=1 SSEl,s and SSEl× = s=1 SSEl× ,s
20: end for
21: [H] Select best base learner: l[m] = arg minl∈{1,...,|B|,1× ,...,|B|×} SSEl
22: if bl[m] is a shared effect then
[m] [m−1] [m]
23: [H] Update model: fˆ shared(x) = fˆ (x) + νb [m] (x, θ̂
shared l )
l[m]
[m]
24: [H] Upload model update θ̂l[m] to the sites.
25: end if
[m] [m]
26: [S] Update site model fˆs via parameter updates θ̂l[m] = θ̂l[m] + ν θ̂l[m]
27: end for
28: [S] Communicate site-specific effects θ̂1× , . . . , θ̂|B|× to the host
[M ]
29: [H] Add site-specific effects to the model of shared effects fˆshared to obtain the full model fˆ[M ]
[M ]
30: [H] return fˆ = fˆ
31: end procedure

iterations M , i.e., O(M ). For the analysis of com- is d2 |B| for each site and therefore scales lin-
munication costs in terms of the number of base early with |B|, i.e., O(|B|). The host does not
learners, we distinguish between the initialization communicate any values during the initialization.
phase and the fitting phase.
Fitting
Initialization In each iteration, every site shares its vector
As only the sites share Fl,s ∈ Rd×d , ∀l ∈ T [m]
Zl,s r̃s ∈ Rd , ∀l ∈ {1, . . . , |B|}. Over the course
{1, . . . , |B|}, the transmitted amount of values of M boosting iterations, each site therefore shares
dM |B| values. Every site also communicates the
SSE values, i.e., 2 values (index and SSE value)
for every base learner and thus 2M |B| values for each sites. The task is to determine important risk
all iterations and base learners. In total, each site factors for heart diseases. The target variable is
communicates M |B|(d + 2) values. The communi- therefore a binary outcome indicating the presence
cation costs for all sites are therefore O(|B|). The of heart disease or not.
host, in turn, communicates the estimated param-
eters θ̂ [m] ∈ Rd of the |B| shared effects. Hence, 4.2 Analysis and Results
dM |B| values as well as the index of the best
base learner in each iteration are transmitted. In We follow the practices to setup CWB as men-
total, the host therefore communicates dM |B|+M tioned in Section 2.3.2 and run the distributed
values to the sites, and costs are therefore also CWB algorithm with a learning rate of 0.1 and a
O(|B|). maximum number of 100000 iterations. To deter-
mine an optimal stopping iteration for CWB, we
use 20 % of the data as validation data and set
4 Application the patience to 5 iterations. In other words, the
algorithm stops if no risk improvement on the vali-
We now showcase our algorithm on a heart disease
dation data is observed in 5 consecutive iterations.
data set that consists of patient data gathered all
For the numerical covariates, we use a P-spline
over the world. The data were collected at four
with 10 cubic basis functions and second-order dif-
different sites by the 1) Hungarian Institute of
ference penalties. All base learners are penalized
Cardiology, Budapest (Andras Janosi, M.D.), 2)
accordingly to a global degree of freedom that we
University Hospital, Zurich, Switzerland (William
set to 2.2 (to obtain unbiased feature selection)
Steinbrunn, M.D.), 3) University Hospital, Basel,
while the random intercept is penalized according
Switzerland (Matthias Pfisterer, M.D.), and 4)
to 3 degrees of freedom (see the Supplementary
V.A. Medical Center, Long Beach, and Cleveland
Material B.2 for more details). Since we are mod-
Clinic Foundation (Robert Detrano, M.D., Ph.D.),
elling a binary response variable, h−1 is the inverse
and is thus suited for a multi-site distributed
logit function logit−1 (f ) = (1 + exp(−f ))−1 . The
analysis. The individual data sets are freely avail-
model for an observation of site s, conditional on
able at https://archive.ics.uci.edu/ml/datasets/
its random effects γ, is given in the Supplementary
heart+disease (Dua and Graff, 2017). For our
Material B.3.
analysis, we set the privacy level (cf. Section 2.4)
to 5 which is a common default.
Results
4.1 Data Description The algorithm stops after mstop = 5578 itera-
tions as the risk on the validation data set starts
The raw data set contains 14 covariates, such as to increase (cf. Figure 1 in the Supplementary
the chest pain type (cp), resting blood pressure Material B.4). Out of these 5578 iterations, the
(trestbps), maximum heart rate (thalach), sex, distributed CWB algorithm selects a shared effect
exercise-induced angina (exang), or ST depression in 782 iterations and site-specific effects in 4796
(i.e., abnormal difference of the ST segment from iterations. This indicates that the data is rather
the baseline on an electrocardiogram) induced heterogeneous and requires site-specific (random)
by exercise relative to rest (oldpeak). A full effects. Figure 1 (Left) shows traces of how and
list of covariates and their abbreviations is given when the different additive terms (base learners)
on the data set’s website. After removing non- entered the model during the fitting process and
informative covariates and columns with too many illustrates the selection process of CWB.
missing values at each site, we obtain ncleveland =
303, nhungarian = 292, nswitzerland = 116, and
nva = 140 observations and 8 covariates. A table
containing the description of the abbreviations of
these covariates is given in Table 1 in the Sup-
plementary Material B.1. For our application, we
assume that missing values are completely at ran-
dom and all data sets are exclusively located at
13

added base learners


0.4 oldpeak CWB on pooled data show a perfect overlap (cf.
Proportion of

0.4294 a age (site) cp


0.3 trestbps

0.2 0.1262
a cp (site)
age
sex
Figure 3). This again underpins the lossless prop-
a oldpeak (site)
0.1 0.1024
0.0762 a sex (site)
restecg
exang erty of the proposed algorithm. The site-specific
0.0 thalach
0 2000 4000 0 300 600 900 1200 effects on the pooled data are fitted by defining
Iteration VIP
a row-wise Kronecker base learner for all features
Fig. 1: Left: Model trace showing how and when and the site as a categorical variable. The same
the four most selected additive terms entered the approach is used to estimate a GAMM using mgcv
model. Right: Variable importance (cf. Au et al, fitted on the pooled data with tensor products
2019) of selected features in decreasing order. between the main feature and the categorical site
variable. A comparison of all partial feature effects
is given in the Supplementary Material B.7 show-
The estimated effect of the most important ing good alignment between the different methods.
feature oldpeak (cf. Figure 1, Right) found is For the oldpeak effect shown in Figure 3, we
further visualized in Figure 2. Looking at the also see that the partial effects of the two CWB
shared effect, we find a negative influence on the methods are very close to the mixed model-based
risk of heart disease when increasing ST depres- estimation, with only smaller differences caused by
sion (oldpeak). When accounting for site-specific a slightly different penalization strength of both
deviations, the effect becomes more diverse, par- approaches. The empirical risk is 0.4245 for our
ticularly for Hungary. distributed CWB algorithm, 0.4245 for CWB on
the pooled data, and 0.4441 for the GAMM on the
Shared effect Site effects Shared + site effects
pooled data.
Partial feature

Partial feature

Partial feature

cleveland
0.0 0.5
0.5
effect

effect

effect

−0.1 hungarian
−0.2 0.0 0.0
−0.3 −0.5 −0.5 switzerland

−2.5 0.0 2.5 5.0 −2.5 0.0 2.5 5.0 −2.5 0.0 2.5 5.0 va Aggregated site−specific feature effects
oldpeak oldpeak oldpeak
oldpeak
Partial feature effect

Fig. 2: Decomposition of the effect of oldpeak cleveland hungarian switzerland va


1
into the shared (left) and the site-specific effects 0
−1
compboost

dsCWB
(middle). The plot on the right-hand side shows −2
−3 mgcv
−2.5 0.0 2.5 5.0 −2.5 0.0 2.5 5.0 −2.5 0.0 2.5 5.0 −2.5 0.0 2.5 5.0
the sum of shared and site-specific effects. oldpeak

Fig. 3: Comparison of the site-specific effects


for oldpeak between the distributed (dsCWB) and
In the Supplementary Material B.5 and B.6, pooled CWB approach (compboost) as well as
we provide the partial effects for all features and estimates of from mgcv.
showcase the conditional predictions of the fitted
GAMM model for a given site.

Comparison of Estimation Approaches 5 Discussion


The previous example shows partial feature effects
that exhibit shrinkage due to the early stopping of We proposed a novel algorithm for distributed,
CWB’s fitting routine. While this prevents overfit- lossless, and privacy-preserving GAMM estima-
ting and induces a sparse model, we can also run tion to analyze horizontally partitioned data. To
CWB for a very large amount of iterations without account for data heterogeneity of different sites we
early stopping to approximate the unregularized introduced site-specific (smooth) random effects.
and hence unbiased maximum likelihood solution. Using CWB as the fitting engine allows esti-
We illustrate this in the following by training mation in high-dimensional settings and fosters
CWB and our distributed version for 100000 iter- variable as well as effect selection. This also
ations and compare its partial effects to the ones includes a data-driven selection of shared and
of a classical mixed model-based estimation rou- site-specific features, providing additional data
tine implemented in the R package mgcv (Wood, insights. Owing to the flexibility of boosting and
2017). its base learners, our algorithm is easy to extend
Results of the estimated partial effects of and can also account for interactions, functional
our distributed CWB algorithm and the original
regression settings (Brockhaus et al, 2020), or Bazeley P (2012) Integrative analysis strategies
modeling survival tasks (Bender et al, 2020). for mixed data sources. American Behavioral
An open challenge for the practical use of our Scientist 56(6):814–828
approach is its high communication costs. For
larger iterations (in the 10 or 100 thousands), Bender A, Rügamer D, Scheipl F, et al (2020)
computing a distributed model can take several A general machine learning framework for sur-
hours. One option to reduce the total runtime is to vival analysis. In: Joint European Conference on
incorporate accelerated optimization recently pro- Machine Learning and Knowledge Discovery in
posed in Schalk et al (2022a). Another driver that Databases, Springer, pp 158–173
influences the runtime is the latency of the tech-
nical setup. Future improvements could reduce Boyd K, Lantz E, Page D (2015) Differential
the number of communications, e.g., via multi- privacy for classifier evaluation. In: Proceed-
ple fitting rounds at the different sites before ings of the 8th ACM Workshop on Artificial
communicating the intermediate results. Intelligence and Security, pp 15–23
A possible future extension of our approach is
Brockhaus S, Rügamer D, Greven S (2020) Boost-
to account for both horizontally and vertically dis-
ing functional regression models with fdboost.
tributed data. Since the algorithm is performing
Journal of Statistical Software 94(10):1 – 50
component-wise (coordinate-wise) updates, the
extension to vertically distributed data naturally Brumback BA, Ruppert D, Wand MP (1999)
falls into the scope of its fitting procedure. This Variable selection and function estimation in
would, however, require a further advanced techni- additive nonparametric regression using a data-
cal setup and the need to ensure consistency across based prior: Comment. Journal of the American
sites. Statistical Association 94(447):794–797

Declarations Bühlmann P, Yu B (2003) Boosting with the L2


loss: regression and classification. Journal of the
The authors declare that they have no known com- American Statistical Association 98(462):324–
peting financial interests or personal relationships 339
that could have appeared to influence the work
reported in this paper. Bühlmann P, Hothorn T, et al (2007) Boost-
ing algorithms: Regularization, prediction and
model fitting. Statistical science 22(4):477–505
References
Anjum MM, Mohammed N, Li W, et al (2022) Chen YR, Rezapour A, Tzeng WG (2018)
Privacy preserving collaborative learning of gen- Privacy-preserving ridge regression on dis-
eralized linear mixed model. Journal of Biomed- tributed data. Information Sciences 451:34–49
ical Informatics 127:104,008
Curran PJ, Hussong AM (2009) Integrative data
Au Q, Schalk D, Casalicchio G, et al (2019) analysis: the simultaneous analysis of multiple
Component-wise boosting of targets for data sets. Psychological methods 14(2):81
multi-output prediction. arXiv preprint
Dua D, Graff C (2017) UCI machine learning
arXiv:190403943
repository. URL http://archive.ics.uci.edu/ml
Augustyn DR, Wyciślik L, Mrozek D (2021)
Dwork C (2006) Differential privacy. In: Inter-
Perspectives of using cloud computing in inte-
national Colloquium on Automata, Languages,
grative analysis of multi-omics data. Briefings
and Programming, Springer, pp 1–12
in Functional Genomics 20(4):198–206. https:
//doi.org/10.1093/bfgp/elab007, URL https:// Eilers PH, Marx BD (1996) Flexible smoothing
doi.org/10.1093/bfgp/elab007 with B-splines and penalties. Statistical science
pp 89–102
15

Gambs S, Kégl B, Aı̈meur E (2007) Privacy- ACM Computing Surveys (CSUR) 22(3):267–
preserving boosting. Data Mining and Knowl- 293
edge Discovery 14(1):131–170
Lu CL, Wang S, Ji Z, et al (2015) Webdisco: a
Gaye A, Marcon Y, Isaeva J, et al (2014) web service for distributed cox model learning
Datashield: taking the analysis to the data, not without patient-level data sharing. Journal of
the data to the analysis. International journal the American Medical Informatics Association
of epidemiology 43(6):1929–1944 22(6):1212–1219

Hoerl AE, Kennard RW (1970) Ridge regression: Luo C, Islam M, Sheils NE, et al (2022) Dlmm
Biased estimation for nonorthogonal problems. as a lossless one-shot algorithm for collabora-
Technometrics 12(1):55–67 tive multi-site distributed linear mixed models.
Nature Communications 13(1):1–10
Hofner B, Hothorn T, Kneib T, et al (2011) A
framework for unbiased model selection based McCullagh P, Nelder JA (2019) Generalized linear
on boosting. Journal of Computational and models. Routledge
Graphical Statistics 20(4):956–971
McMahan B, Moore E, Ramage D, et al (2017)
Jones EM, Sheehan NA, Gaye A, et al (2013) Communication-efficient learning of deep net-
Combined analysis of correlated data when data works from decentralized data. In: Artificial
cannot be pooled. Stat 2(1):72–85 intelligence and statistics, PMLR, pp 1273–1282

Karr AF, Lin X, Sanil AP, et al (2005) Secure Mirza B, Wang W, Wang J, et al (2019) Machine
regression on distributed databases. Journal learning and integrative analysis of biomedical
of Computational and Graphical Statistics big data. Genes 10(2):87
14(2):263–279
Mohassel P, Zhang Y (2017) Secureml: A system
Lazarevic A, Obradovic Z (2001) The distributed for scalable privacy-preserving machine learn-
boosting algorithm. In: Proceedings of the sev- ing. In: 2017 IEEE symposium on security and
enth ACM SIGKDD international conference privacy (SP), IEEE, pp 19–38
on Knowledge discovery and data mining, pp
311–316 Naehrig M, Lauter K, Vaikuntanathan V (2011)
Can homomorphic encryption be practical? In:
Li J, Kuang X, Lin S, et al (2020a) Privacy Proceedings of the 3rd ACM workshop on Cloud
preservation for machine learning training and computing security workshop, pp 113–124
classification based on homomorphic encryption
schemes. Information Sciences 526:166–179 Paillier P (1999) Public-key cryptosystems based
on composite degree residuosity classes. In:
Li Q, Wen Z, He B (2020b) Practical feder- International conference on the theory and
ated gradient boosting decision trees. In: Pro- applications of cryptographic techniques,
ceedings of the AAAI conference on artificial Springer, pp 223–238
intelligence, pp 4642–4649
R Core Team (2021) R: A Language and Envi-
Liew BX, Rügamer D, Abichandani D, et al ronment for Statistical Computing. R Founda-
(2020) Classifying individuals with and with- tion for Statistical Computing, Vienna, Austria,
out patellofemoral pain syndrome using ground URL https://www.R-project.org/
force profiles – Development of a method using
functional data boosting. Gait & Posture 80:90– Rügamer D, Brockhaus S, Gentsch K, et al
95 (2018) Boosting factor-specific functional
historical models for the detection of syn-
Litwin W, Mark L, Roussopoulos N (1990) Inter- chronization in bioelectrical signals. Journal
operability of multiple autonomous databases. of the Royal Statistical Society: Series C
(Applied Statistics) 67(3):621–642. https: Zhu R, Jiang C, Wang X, et al (2020)
//doi.org/https://doi.org/10.1111/rssc.12241, Privacy-preserving construction of gener-
URL https://rss.onlinelibrary.wiley.com/doi/ alized linear mixed model for biomedical
abs/10.1111/rssc.12241 computation. Bioinformatics 36(Supple-
ment 1):i128–i135. https://doi.org/10.
Saintigny P, Zhang L, Fan YH, et al (2011) Gene 1093/bioinformatics/btaa478, URL https:
expression profiling predicts the development //doi.org/10.1093/bioinformatics/btaa478
of oral cancer. Cancer Prevention Research
4(2):218–229

Samarati P, Sweeney L (1998) Protecting pri-


vacy when disclosing information: k-anonymity
and its enforcement through generalization and
suppression

Schalk D, Bischl B, Rügamer D (2022a) Accel-


erated componentwise gradient boosting using
efficient data representation and momentum-
based optimization. Journal of Computational
and Graphical Statistics 0(ja):1–27. https:
//doi.org/10.1080/10618600.2022.2116446,
URL https://doi.org/10.1080/10618600.2022.
2116446

Schalk D, Hoffmann VS, Bischl B, et al (2022b)


Distributed non-disclosive validation of pre-
dictive models by a modified roc-glm. arXiv
preprint arXiv:220310828

Schmid M, Hothorn T (2008) Boosting addi-


tive models using component-wise p-splines.
Computational Statistics & Data Analysis
53(2):298–311

Ünal AB, Pfeifer N, Akgün M (2021) ppaurora:


Privacy preserving area under receiver oper-
ating characteristic and precision-recall curves
with secure 3-party computation. ArXiv 2102

Wood SN (2017) Generalized additive models: an


introduction with R. Chapman and Hall/CRC

Wu Y, Jiang X, Kim J, et al (2012) G rid binary lo


gistic re gression (glore): building shared models
without sharing data. Journal of the American
Medical Informatics Association 19(5):758–764

Yan Z, Zachrison KS, Schwamm LH, et al


(2022) Fed-glmm: A privacy-preserving and
computation-efficient federated algorithm for
generalized linear mixed models to analyze cor-
related electronic health records data. medRxiv

You might also like