1907 07157v91

Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

The Tradeoff Between Privacy and Accuracy in Anomaly Detection

Using Federated XGBoost


Mengwei Yang1 , Linqi Song∗1 , Jie Xu2 and Congduan Li3 , Guozhen Tan4
1
City University of Hong Kong
2
University of Miami
3
Sun Yat-sen University
4
Dalian University of Technology
mengweiy.ray@gmail.com, linqi.song@cityu.edu.hk, jiexu@miami.edu, licongd@mail.sysu.edu.cn,
gztan@dlut.edu.cn ∗
arXiv:1907.07157v1 [cs.LG] 16 Jul 2019

Abstract 1 Introduction
Privacy has raised considerable concerns recently, Nowadays, many giant internet companies, like Google,
especially with the advent of information explosion Amazon, and Alibaba, have established large scale informa-
and numerous data mining techniques to explore tion technology infrastructures to cope with the current huge
the information inside large volumes of data. In data stream and to provide numerous services to customers.
this context, a new distributed learning paradigm However, the large volume of data will also bring a number
termed federated learning becomes prominent re- of serious privacy issues [Chen and Zhao, 2012] and com-
cently to tackle the privacy issues in distributed puting problems. For example, in social networks like Face-
learning, where only learning models will be trans- book, there is a growing concern of privacy risk in collecting
mitted from the distributed nodes to servers without a large amount of users’ private data, including various per-
revealing users’ own data and hence protecting the sonal information, texts, pictures, and video data. Leveraging
privacy of users. these large volumes of human data, these companies will uti-
lize them to train various machine learning models for vari-
In this paper, we propose a horizontal federated ous data intensive applications. However, users can do little
XGBoost algorithm to solve the federated anomaly to protect their data. As such, in May 2018, the European
detection problem, where the anomaly detection Union has began to implement the General Data Protection
aims to identify abnormalities from extremely un- Regulation (GDPR) to protect individual privacy [Voigt and
balanced datasets and can be considered as a spe- Von dem Bussche, 2017], which is deemed the most impor-
cial classification problem. Our proposed federated tant change in data privacy regulation in 20 years.
XGBoost algorithm incorporates data aggregation Even though in some areas, data can be shared between
and sparse federated update processes to balance different companies or concentrated on some cloud servers, it
the tradeoff between privacy and learning perfor- still carries dramatic risks and transmission issues. On one
mance. In particular, we introduce the virtual data hand, the transfer of private data between different parties
sample by aggregating a group of users’ data to- makes it more likely to leak or to be hacked. On the other
gether at a single distributed node. We compute pa- hand, the transmission of large amounts of data leads to in-
rameters based on these virtual data samples in the efficiency. In this context, the federated learning framework
local nodes and aggregate the learning model in the has been proposed and plays an indispensable role in solv-
central server. In the learning model upgrading pro- ing these problems [Hard et al., 2018]. Instead of transmit-
cess, we focus more on the wrongly classified data ting raw data, federated learning transmits pre-trained learn-
before in the virtual sample and hence to gener- ing models from users to servers, while keeping the users data
ate sparse learning model parameters. By carefully locally. Thus, the user privacy can be protected; computing
controlling the size of these groups of samples, we resources in the user side can be efficiently utilized; and the
can achieve a tradeoff between privacy and learning communication cost is reduced.
performance. Our experimental results show the ef- Recently, federated learning has attracted broader atten-
fectiveness of our proposed scheme by comparing tion and three categories was put forward in [Yang et al.,
with existing state-of-the-arts. 2019], including horizontal federated learning, vertical fed-
erated learning and federated transfer learning. In [Cheng et
∗ al., 2019], the SecureBoost was presented, which achieves
The corresponding author is L. Song. This work was supported
in part by the City University of Hong Kong Grant (No. 7200594), vertical federated learning with a tree-boosting algorithm.
the Hundred Talents Program of Sun Yat-sen University (Project No. In this work, we propose a horizontal federated XGBoost
76150-18841214), and the Key Program of the National Natural Sci- algorithm in anomaly detection with an application in detect-
ence Foundation of China under Grant No. U1808206. ing the fraudulence in bank credit card transactions; and study
the trade-off behavior between the privacy preserving and of the number of virtual samples in data aggregation so that
the anomaly detection performance. Compared with Secure- the privacy of users will be better protected and the learn-
Boost, which was deployed in the vertical federated learning ing performance in federated XGBoost will be reduced to a
framework, our federated XGBoost is a horizontal federated minimum. We show that our proposed algorithm achieves up
learning algorithm where different data samples with all fea- to 5% performance gains in terms of F1-core compared with
tures are distributed among the distributed nodes. Though existing state-of-the-art algorithms, while effectively keeping
tree-boosting algorithm is also utilized in this work, this hor- user privacy.
izontal federated XGBoost is deployed in a totally different
way. First, the biggest difference is the transfer of parameters.
It is far from enough to only pass parameters gi and hi in hori-
zontal federated XGBoost. Because when using tree-boosting
algorithms, different nodes have to obtain instances of every
feature so that the gain of split can be calculated and the split
point can be acquired. Another key difference is to use a
two step (data aggregation and federated update) method to
preserve individual data’s anonymity which we will describe
later. Furthermore, the sparse federated update by focusing
on wrongly classified data is utilized in federated XGBoost
to improve the process of federated update.
In particular, to transfer features of users in a privacy and
efficient way, our proposed two-step method is described as
follows. The first step is Data Aggregation: First of all, the Figure 1: The process of data aggregating in federated XGBoost
privacy of users should be protected and thus users’ informa- framework.
tion can’t be passed directly. Hence, for the purpose of cal-
culating the gain of split mentioned above, features of users
are entailed. In this paper, for the consideration of protecting 2 Related Work
users’ information, instead of directly transmitting all exact Privacy-preserving and Federated Learning: The trans-
data in each feature, the original data in each feature is pro- fer of data will bring the problem of data leakage [Shokri
jected in an anonymous way by using modified K-Anonymity, and Shmatikov, 2015]. Consequently, decentralized meth-
shown in Figure 1, where a group of data samples have been ods (i.e., data is only stored locally) are used to process the
mapped to a virtual data sample. The projection is imple- data and then the risk of data leakage is reduced [Wang et al.,
mented under every feature. Because while finding the split 2010]. Some works use encryption-based federated learning
point, the purpose of original data for tree-boosting algorithm frameworks, like homomorphic encryption [Gilad-Bachrach
is to get the sequence under every feature. So, after passing et al., 2016]. Homomorphic encryption means certain opera-
the number of virtual data samples in each feature, the gain tions can be acted directly on encrypted data without decrypt-
of split can be calculated. Consequently, by doing this, not ing it. However, homomorphic encryption has its disadvan-
only the privacy of users will be protected, but also the tree- tages. Take Paillier-based encryption schemes as an example,
boosting model can decide the split point and build the tree. the cost of generating threshold decryption keys is very high
A second step is the Federated Update: In reality, the [Bonawitz et al., 2017].
amount of data is quite large, it is inefficient to transfer all Federated learning is a new distributed learning paradigm
data and also not all data is valuable for update. In that case, proposed recently to utilize the user-end computing resources
it is necessary to filter data so as to better update models. and preserve user’s privacy by transmitting only model pa-
Though the tree-boost model can implement well in predic- rameters, instead of raw data, to the server [Liu et al., 2018;
tion by building trees, there are still many instances that can- Zhuo et al., 2019]. In federated learning, a general model
not be classified correctly. Hence, in this paper, wrongly clas- will be firstly trained, and then the model will be distributed
sified instances will be processed with more focus and then to each node acting as a local model [Yang et al., 2019;
be transferred to server for federated update. The reason is McMahan et al., 2016; Konečnỳ et al., 2016]. Three cat-
that firstly, these instances are more valuable and will help egories was put forward in [Yang et al., 2019], including
the model improve itself better. Also because the data used horizontal federated learning, vertical federated learning and
in anomaly detection is extremely unbalanced, the boosting federated transfer learning. The federated secure XGBoost
algorithm can solve skewed problem in some degree and ele- framework using vertical federated learning was proposed in
vate the generalization ability of the model. Secondly, if the [Cheng et al., 2019].
correctly classified data is not filtered, these instances will af- In contrast, in this work, we firstly preprocess data where
fect the process of split and the construction of trees in the instances can be merged together to learn an aggregate gradi-
process of federated update, which has an adverse impact on ent such that the communication and computation cost will be
the improvement of the model. significantly reduced. Next, we generate local models and ag-
We show a trade-off behavior between the detection accu- gregate those models in the central server to update the origi-
racy and the the privacy measured in terms of k-anonymity. nal model. By doing so, we can show a tradeoff between the
Through simulation experiments, we find a reasonable size privacy of the user and the learning performance.
Anomaly Detection: Anomaly detection [Patcha and Park, Prediction outcome
2007] is the identification of events or observations that do p n total
not match the expected pattern or other items in the dataset
(i.e., outliers) during data mining. Outliers can be divided p0 True False P0

Actual
into point exceptions, context exceptions, and collective ex- Positive Negative

value
ceptions [Agrawal and Agrawal, 2015]. Anomaly detection
methods include SMOTE algorithm [Chawla et al., 2002]
and various machine learning models, such as K-Nearest n0 False True N0
Neighbors algorithm [Liao and Vemuri, 2002], Random For- Positive Negative
est [Zhang et al., 2008], Support Vector Machine (SVM) [Li
et al., 2003], Gradient Boosting Classification Tree (GBT) total P N
[Krauss et al., 2017], XGBoost [Chen and Guestrin, 2016],
and deep learning neural network models [Mukkamala et al., Figure 3: Illustration of confusion matrix. T P = predicted positive
and it is true; F P = predicted positive but it is negative; F N =
2002]. In this paper, a point exception of fraud detection
predicted negative but it is positive; T N = predicted negative and
in credit card transactions will be focused and the dataset of it is true. For the Precision-Recall curve, the X-axis is Recall and
credit card transactions will be used to train the model. Y-axis is Precision. P recision = T P/(T P + F P ); Recall =
T P/(T P + F N ); The AUPRC is mainly used for the judgment of
3 Problem Formulation unbalanced dataset. The F1-score is introduced to judge the result of
the model prediction: F1-score = 2 ∗ T P/(2 ∗ T P + F P + F N ).
We consider the federated learning in an anomaly detection
problem as follows. There are D distributed nodes, e.g., bank represent for the overall distribution. In that case, each node
institutions, denoted by 1, 2, . . . , D. In each node j, the lo- needs to share data in the whole federated learning frame-
cal data Xj is given with nj data instances (i.e., data exam- work, which can help to improve the existed model in each
ples) and d features, i.e., Xj = {(xi , yi )} (|Xj | = nj , xi ∈ node. The goal is to train a federated learning model to de-
Rd , yi ∈ R). We denote the union of all local data to be X. tect the abnormalities using the federated learning system as
There is a center server node to aggregate the learning model. described above.
The entire system architecture is shown in Fig. 2. In this paper, we ask the question: if the local nodes
choose different learning model parameters to be trans-
mitted to the server, then what is the tradeoff between the
machine learning performance and the user privacy.
Performance Metrics Next, we describe the performance
metrics of the anomaly detection and the user privacy.
• Modified k-Anonymity k-anonymity is a property pos-
sessed by certain anonymized data, where one cannot dis-
tinguish a user out of k candidates from this set of data
[Sweeney, 2002]. Here, since local nodes transmit learn-
ing model parameters to the server, we define a modified k-
anonymity metric as the privacy property that we cannot dis-
tinguish a user out of other candidates from the transmitted
learning parameters instead of a set of data.
• Measurement for Unbalanced Data In anomaly detec-
tion, the data is unbalanced and usually it is not a good idea
to simply use accuracy to measure the learning performance
Figure 2: The illustration of federated learning framework. as classifying all data into the normal category will result in
a sufficient high accuracy. For example, in the bank credit
The federated learning process is as follows. First, the lo-
card fraudulence detection dataset that we use in experiments,
cal nodes preprocess the data (the local data) and send some
fraud cases account for only 0.172% of the total data. In
learning model parameters to the server. Second, the server
actual banking transactions, fraudulent transactions are still
will integrate those received model parameters and obtain a
the minority [Phua et al., 2004]. Consequently, we will use
new global model. The new model will be transmitted to local
the confusion matrix, F1-score and AUPRC (the area un-
nodes as well. This process will be iterated over time again
der Precision-Recall curve) to measure the anomaly detection
and again to train a sufficiently good model. Through this
performance. An illustration of these concepts are shown in
learning process, the local data do not need to be exchanged
Fig. 3.
and the user privacy is protected.
In the anomaly detection problem, the data is often very
unbalanced; namely, in most of the cases, the data points are 4 The Federated XGBoost Framework
normal (i.e., most samples with a label 0), and in rare cases, In this section, we will talk about the federated XGBoost al-
the data points are abnormal (i.e., rear samples with a label 1). gorithm that we will use in the anomaly detection problem.
In this case, the extremely skewed data on each node can not We will first give a recap of the XGBoost algorithm, a general
description of the federated XGBoost algorithm, and some eŷi ). So that
specific design tailored to the anomaly detection problem. 1 1 1
gi = (t−1)
−yi , hi = (t−1)
∗(1− (t−1)
).
−ŷ −ŷ −ŷ
4.1 Preliminaries of XGBoost 1+e i 1+e i 1+e i
We give a brief overview of the XGBoost algorithm, and one These parameters will be used for parameters passing in the
can refer to [Chen and Guestrin, 2016] for more details. federated learning framework.
For machine learning problems, such as the classification
or regression, given a dataset {(xi , yi )} of n examples and d 4.2 Federated XGBoost
features, the goal is to train a learning model with parameters In the federated learning framework, to implement the XG-
θ to minimize the objective loss function as follows Boost algorithm, one simple idea is to calculate the parame-
ters gi and hi of each data sample at each local node, and then
Obj(θ) = L(θ) + Ω(θ) (1) transmit these parameters to the center server to determine an
optimal split.
where L(θ) is the training loss and Ω(θ) is the regulariza- Note that in vertical data partition [Cheng et al., 2019],
tion term. For XGBoost algorithm, it utilizes K regres- different node holds one part of the same instance, so that by
sion/classification trees to predict the output, where the pre- only passing parameters between each other, the model can
dicted output for the i-th data example is make predictions in cooperation with other nodes. In this pa-
K
per, we consider the horizontally partitioned data in different
X local nodes, which means that data provided in different node
ŷi = φ (xi ) = fk (xi ) . (2)
have the same feature dimension and one node holds all fea-
k=1
tures of an instance. It is not easy to update other models in
So for the objective function of XGBoost: different nodes if only transmitting parameters (gi and hi ).
X X Though by averaging the parameters of different models does
L(φ) = l (yi , ŷi ) + Ω (fk ) (3) help, it still cannot ameliorate the model a lot in each node.
i k Here, instead of simply transmitting model parameters gi
and hi , we make two revisions that are tailored to the specific
where Ω (fk ) = γT + 12 λ||w||2 with component wj of w be-
anomaly detection setting: a data aggregation process and a
ing the score/weight on j-th leaf of the tree. Since the newly
sparse federated update process.
generated tree needs to fit the last predicted residual. So ŷi
(t) (t−1) First, in the data aggregation process, we map a range of
can be written as ŷi = ŷi + ft (x) for the t-th iteration. data samples that are close to each other into a virtual data
Also, take Taylor expansion of the objective as follows: sample (or a cluster of samples). Taking into account each
virtual data sample as a new data sample I1 , we sum up the
n    1
 gi s and hi s (i ∈ I1 ) in this cluster to obtain the gI1 and hI1 for
(t−1)
X
L(t) ' l yi , ŷi + gi ft (xi ) + hi ft2 (xi ) +Ω (ft ) this virtual data sample. We then transmit parameters corre-
i=1
2 sponding to these virtual samples to the central server to train
(4) the model. However, when new learning models are obtained,
(t−1) 2 (t−1)
where gi = ∂ŷ(t−1) l(yi , ŷi ), hi = ∂ (t−1) l(yi , ŷi ). these data samples will calculate their losses and parameters
i ŷi
gi and hi separately. By controlling the size of the virtual
Here, gi and hi represent the first and second gradient statis-
sample, we can achieve a tradeoff between learning perfor-
tics of the loss function. By using greedy algorithm to search
mance and the privacy in terms of modified k-anonymity.
the best split which aims to maximize the learning gain at
In our modified k-anonymity, instances in every feature
each iteration:
will be mapped into different virtual data nodes. As shown in
Figure 1, we use the sequence number of virtual data nodes
G2L G2R (GL + GR )2
 
1 (from 1, 2, . . . , D) to represent a cluster of samples’ exact in-
Gain = + − −γ formation, which means individual values in every feature are
2 HL + λ HR + λ (HL + HR ) + λ
(5) replaced by a new category. Since every virtual data node rep-
resents a range of samples’ values, samples inside every node
Gj are anonymous and attackers cannot distinguish a user out of
wj∗ = − (6) other candidates. Also, even though attackers get instance
Hj + λ
P P A’s exact values, they still cannot acquire instance A’s other
where GL = g , GR = i∈IR gi , HL = sensitive information because of the adoption of a new cate-
P P i∈IL i
i∈IL h i , H R = i∈IR hi . Here, I L and IR represent the gory. Therefore, our modified k-anonymity in this work can
left and right sets of data sample indices. The equation of protect users’ privacy in an anonymous way.
Gain is used for evaluating the split point and wj∗ denotes Second, to further improve the communication efficiency
the weight of leaf. When searching for the best split point, in- and the anomaly detection performance, in the sparse feder-
stances’ gi and hi in the left and right space will be calculated ated update process, we will focus on tackling these wrongly
for getting the value of Gain.
Without loss of generality, we consider a particular logistic
classified instances. Our assumption Pt is that at iteration t, for
most data samples, the function k=1 fk (xi ) will give suf-
(t−1)
loss function l(yi , ŷi ) = yi ln(1 + e−ŷi ) + (1 − yi ) ln(1 + ficient accurate estimations. Therefore, at iteration t + 1, we
Algorithm 1 The Federated XGBoost Framework (local scribe the characteristics of the dataset and then show our al-
node) gorithm’s performance (we use F-XGBoost to represent our
Input: I, Instance space of current node algorithm) compared with other existing state-of-the-arts.
Input: d, Feature dimension Credit Card Fraud Dataset1 : This dataset contains trans-
Input: v, Mapping dimension of sequence in every feature actions generated by credit card and it has 492 frauds and
Input: {gi , hi }i∈l , {giL , hL 284807 transactions, which is greatly unbalanced. It is a
i }i∈l
1: for n = 0 to d do dataset that contains 30 features. Features V1, V2, ... V28
2: // Original instances of model are the principal components obtained with PCA and only
3: Propose Sn ={sn1 , sn2 ,... snv } by sequence on feature two features, Time and Amount, are kept original.
n. Experimental Setting: We split the dataset into two parts:
one for basic model training and the other for simulating the
P
4: Gnv = i∈{i|sn,v ≥xi,n >sn,v−1 } gi
5:
P
Hnv = i∈{i|sn,v ≥xi,n >sn,v−1 } hi situation of newly acquired and also wrongly classified in-
stances. Nearly 1/5 of the dataset (59875 tuples) is used for
6: end for
updating existed models in the federated learning settings and
7: for n = 0 to d do
4/5 of the dataset will be divided into testing (45569 tuples)
8: // New wrongly classified instances and training data (179363 tuples). In the experiment, the XG-
9: Propose SnL ={sL L L
n1 , sn2 ,... snv } by sequence on feature Boost2 , GBDT3 and Random Forest4 are utilized for perfor-
n. mance comparisons and the parameter settings of XGBoost
GL giL
P
10: nv = i∈{i|sL L L
n,v ≥xi,n >sn,v−1 } and federated XGBoost framework are the same. Learning
L
hL
P
11: Hnv = i∈{i|sLn,v ≥xL >sL rate is set as 0.1, maximum depth is 4, and min child weight
n,v−1 }
i,n
i
12: end for is 2.
13: // Integrate instances Experimental Results: When the virtual cluster’s size is
14: GN L
nv ← Gnv + Gnv
larger, the protection of privacy will be better. However, it
N
15: Hnv ← Hnv L
+ Hnv is a trade-off between the cluster size and the accuracy. In
Output: GN nv Hnv
, N the original dataset, the number of samples (sample clusters)
in each feature is 275665. As it shown in Figure 4, Line B
means in the original 275665 dimension, the F1-Score of fed-
Algorithm 2 The Federated XGBoost Framework (Server) erated XGBoost framework will be 0.901408. In Figure 4,
Input: I, Instance space of current node we can see a tradeoff behavior between the learning perfor-
Input: d, Feature dimension mance and the privacy, where the horizontal axis represents
Input: v, Mapping dimension of sequence in every feature the number of clusters and the vertical axis represents the F1-
Input: GN N
nv , Hnv from local node
score. We show that with the increase in sample clusters, the
1: for n = 0 to d do privacy preserving ability is decreased while the learning per-
2: // Enumerate all featuresP formance is improved. In Figure 4, we can see that when the
Pv v
3: G ← k=0 GN nk , H ←
N
k=0 Hnk
number of clusters is 405, the F1-Score is 0.895105.
4: GL ← 0, HL ← 0 In Table 1, the high accuracy of all models can be noticed,
5: for k = 0 to v do which means the evaluation parameters such as accuracy is
6: // Enumerate the value of Gain in each split point not appropriate to fully evaluate models for this extremely
7: GL ← GL + GN nk , HL ← HL + Hnk
N unbalanced dataset. With emphasis, some good evaluating
8: GR ← G - GL , HR ← H - HL parameters like F1-Score, AUC and AUPRC can be deployed.
G2L G2 G2 From Table 1, we also show the AUC and F-1 curve for
9: score ← max(score, HL +λ + HRR+λ − H+λ )
10: end for different algorithms. Compared with Random Forest, GBDT
11: end for and the federated XGBoost before Update (Original Dimen-
Output: The direction of split in federated learning settings sion), the updated federated XGBoost framework (Original
Dimension) has obviously good performance over F1-Score.
We can see that the proposed algorithm outperforms exist-
will just focus on the wrongly classified samples. Therefore, ing methods by up to 3.4% in AUC and 5% in F1-score. For
we will calculate the gI1 and hI1 for cluster I1 by summing up federated XGBoost, the dimension of 405 achieves a reason-
only gi s and hi s of those samples i that are wrongly classified able trade-off between privacy and accuracy, though the F1-
from the learning model thus far. score of federated XGBoost framework (Dimension:405) is
The details of our proposed federated XGBoost algorithm 0.63% lower than federated XGBoost framework (Original
are shown in Algorithms 1 and 2, where the first one is for Dimension). Also, the AUPRC displays the improvement of
local node to compute local models and the second one is for 1
the server to aggregate the global model. https://www.kaggle.com/mlg-ulb/creditcardfraud
2
https://github.com/dmlc/xgboost
3
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.
5 Experimental Results GradientBoostingClassifier.html
In this section, we present our experimental results over a 4
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.
real dataset for credit card fraud detection. We will first de- Random ForestClassifier.html
federated learning model compared with itself. The AUPRC
performance is shown in Figures 6 and 7 for training and test
data sets. For the train loss in Figure 5, the learning curve of
federated XGBoost framework shows how the learning works
and we can see that the training loss of our proposed federated
XGBoost decreases faster than the GBDT algorithm.

Figure 7: AUPRC (The Area under Precision-Recall Curve) of The


federated XGBoost framework after update; Left adopts Train set;
Right adopts Test set.

Table 1: Performance of the Federated XGBoost (F-XGBoost)


Framework

Model Accuracy
GBDT 0.9995
Ramdom Forest 0.9995
Figure 4: The F1-Score of Federated XGBoost Framework in differ-
ent dimensions of mapping.
F-XGBoost before update (Original Dimension) 0.9994
F-XGBoost before update (Dimension:405) 0.9996
F-XGBoost after update (Original Dimension) 0.9997
F-XGBoost after update (Dimension:405) 0.9997
Model AUC
GBDT 0.9700
Ramdom Forest 0.9456
F-XGBoost before update (Original Dimension) 0.9214
F-XGBoost before update (Dimension:405) 0.9641
F-XGBoost after update (Original Dimension) 0.9789
F-XGBoost after update (Dimension:405) 0.9794
Model F1-Score
GBDT 0.8571
Ramdom Forest 0.8591
Figure 5: Train Loss of Federated XGBoost Framework and XG- F-XGBoost before update (Original Dimension) 0.8169
Boost F-XGBoost before update (Dimension:405) 0.8652
F-XGBoost after update (Original Dimension) 0.9014
F-XGBoost after update (Dimension:405) 0.8951

In experimental results, we show some reasonable working


points which achieve a balance between the privacy and ac-
curacy. Also, by comparing with other algorithms, the effec-
tiveness of this federated XGBoost framework can be clearly
found with up to 5% performance gains. In our proposed
federated XGBoost framework, we use two techniques, data
aggregation and sparse federated update, to reduce the com-
Figure 6: AUPRC (The Area under Precision-Recall Curve) of The munication and computing cost while improving the anomaly
federated XGBoost framework before update; Left adopts Train set; detection ability. More importantly, the privacy of users is
Right adopts Test set. protected and through the process of data aggregation, the
risk of users’ information leakage is avoided.
In our future work, we will attempt to experiment and de-
6 Conclusion and Future Work ploy differential privacy in federated XGBoost so as to better
protect users’ privacy. Also, it is believed that there are still a
In this paper, we proposed a federated XGBoost algorithm
lot of details should be considered and more research should
to solve the anomaly detection problem. We show a tradeoff
be done on federated learning to make it more significant.
behavior between the learning performance and the privacy.
References [Liao and Vemuri, 2002] Yihua Liao and V Rao Vemuri. Use
[Agrawal and Agrawal, 2015] Shikha Agrawal and Jitendra of k-nearest neighbor classifier for intrusion detection1.
Agrawal. Survey on anomaly detection using data min- Computers & Security, 21(5):439–448, 2002.
ing techniques. Procedia Computer Science, 60:708–713, [Liu et al., 2018] Yang Liu, Tianjian Chen, and Qiang Yang.
2015. Secure federated transfer learning. arXiv preprint
[Bonawitz et al., 2017] Keith Bonawitz, Vladimir Ivanov, arXiv:1812.03337, 2018.
Ben Kreuter, Antonio Marcedone, H Brendan McMahan, [McMahan et al., 2016] H Brendan McMahan, Eider Moore,
Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. Daniel Ramage, Seth Hampson, et al. Communication-
Practical secure aggregation for privacy-preserving ma- efficient learning of deep networks from decentralized
chine learning. In Proceedings of the 2017 ACM SIGSAC data. arXiv preprint arXiv:1602.05629, 2016.
Conference on Computer and Communications Security, [Mukkamala et al., 2002] Srinivas Mukkamala, Guadalupe
pages 1175–1191. ACM, 2017. Janoski, and Andrew Sung. Intrusion detection using neu-
[Chawla et al., 2002] Nitesh V Chawla, Kevin W Bowyer, ral networks and support vector machines. In Neural Net-
Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: works, 2002. IJCNN’02. Proceedings of the 2002 Interna-
synthetic minority over-sampling technique. Journal of tional Joint Conference on, pages 1702–1707. IEEE, 2002.
Artificial Intelligence Research, 16:321–357, 2002. [Patcha and Park, 2007] Animesh Patcha and Jung-Min
[Chen and Guestrin, 2016] Tianqi Chen and Carlos Guestrin. Park. An overview of anomaly detection techniques: Ex-
Xgboost: A scalable tree boosting system. In Proceedings isting solutions and latest technological trends. Computer
of the 22nd ACM SIGKDD International Conference on Networks, 51(12):3448–3470, 2007.
Knowledge Discovery and Data Mining, pages 785–794. [Phua et al., 2004] Clifton Phua, Damminda Alahakoon, and
ACM, 2016. Vincent Lee. Minority report in fraud detection: classifica-
[Chen and Zhao, 2012] Deyan Chen and Hong Zhao. Data tion of skewed data. ACM SIGKDD Explorations Newslet-
security and privacy protection issues in cloud computing. ter, 6(1):50–59, 2004.
In Computer Science and Electronics Engineering (ICC- [Shokri and Shmatikov, 2015] Reza Shokri and Vitaly
SEE), 2012 International Conference on, volume 1, pages Shmatikov. Privacy-preserving deep learning. In Proceed-
647–651. IEEE, 2012. ings of the 22nd ACM SIGSAC Conference on Computer
[Cheng et al., 2019] Kewei Cheng, Tao Fan, Yilun Jin, Yang and Communications Security, pages 1310–1321. ACM,
Liu, Tianjian Chen, and Qiang Yang. Secureboost: A 2015.
lossless federated learning framework. arXiv preprint [Sweeney, 2002] Latanya Sweeney. k-anonymity: A
arXiv:1901.08755, 2019. model for protecting privacy. International Journal of
[Gilad-Bachrach et al., 2016] Ran Gilad-Bachrach, Nathan Uncertainty, Fuzziness and Knowledge-Based Systems,
Dowlin, Kim Laine, Kristin Lauter, Michael Naehrig, and 10(05):557–570, 2002.
John Wernsing. Cryptonets: Applying neural networks to [Voigt and Von dem Bussche, 2017] Paul Voigt and Axel
encrypted data with high throughput and accuracy. In In- Von dem Bussche. The eu general data protection regu-
ternational Conference on Machine Learning, pages 201– lation (gdpr). A Practical Guide, 1st Ed., Cham: Springer
210, 2016. International Publishing, 2017.
[Hard et al., 2018] Andrew Hard, Kanishka Rao, Rajiv [Wang et al., 2010] Cong Wang, Qian Wang, Kui Ren, and
Mathews, Françoise Beaufays, Sean Augenstein, Hubert Wenjing Lou. Privacy-preserving public auditing for data
Eichner, Chloé Kiddon, and Daniel Ramage. Federated storage security in cloud computing. In Infocom, 2010 pro-
learning for mobile keyboard prediction. arXiv preprint ceedings ieee, pages 1–9. Ieee, 2010.
arXiv:1811.03604, 2018.
[Yang et al., 2019] Qiang Yang, Yang Liu, Tianjian Chen,
[Konečnỳ et al., 2016] Jakub Konečnỳ, H Brendan McMa- and Yongxin Tong. Federated machine learning: Concept
han, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and applications. ACM Transactions on Intelligent Systems
and Dave Bacon. Federated learning: Strategies for and Technology (TIST), 10(2):12, 2019.
improving communication efficiency. arXiv preprint
[Zhang et al., 2008] Jiong Zhang, Mohammad Zulkernine,
arXiv:1610.05492, 2016.
and Anwar Haque. Random-forests-based network intru-
[Krauss et al., 2017] Christopher Krauss, Xuan Anh Do, and sion detection systems. IEEE Transactions on Systems,
Nicolas Huck. Deep neural networks, gradient-boosted Man, and Cybernetics, Part C (Applications and Reviews),
trees, random forests: Statistical arbitrage on the s&p 500. 38(5):649–659, 2008.
European Journal of Operational Research, 259(2):689– [Zhuo et al., 2019] Hankz Hankui Zhuo, Wenfeng Feng,
702, 2017.
Qian Xu, Qiang Yang, and Yufeng Lin. Federated re-
[Li et al., 2003] Kun-Lun Li, Hou-Kuan Huang, Sheng-Feng inforcement learning. arXiv preprint arXiv:1901.08277,
Tian, and Wei Xu. Improving one-class SVM for anomaly 2019.
detection. In Machine Learning and Cybernetics, 2003 In-
ternational Conference on, pages 3077–3081. IEEE, 2003.

You might also like