Towards Federated Learning With Byzantine-Robust Client Weighting
amitport@post.bgu.ac.il yoavti@post.bgu.ac.il
Danny Hendler
Department of Computer Science
Ben-Gurion University of the Negev
hendlerd@cs.bgu.ac.il
Abstract
1 Introduction
Federated Learning (FL) [Konečnỳ et al., 2015, McMahan et al., 2017, Kairouz et al., 2019, Bonawitz
et al., 2019] is a distributed machine learning paradigm where training data resides at autonomous
client machines and the learning process is facilitated by a central server. The server maintains a
shared model and alternates between requesting clients to try and improve it and integrating their
suggested improvements back into that shared model.
A few interesting challenges arise from this model. First, the need for communication efficiency,
both in terms of the size of data transferred and the number of required messages for reaching
convergence. Second, clients are outside of the control of the server and as such may be unreliable,
or even malicious. Third, while classical learning models generally assume that data is homogeneous,
here privacy and the aforementioned communication concerns force us to deal with the data as it is
seen by the clients; that is 1) non-IID (identically and independently distributed) – data may depend
on the client it resides at, and 2) unbalanced – different clients may possess different amounts of data.
In previous works [Ghosh et al., 2019, Alistarh et al., 2018, Li et al., 2019a, Haddadpour and Mahdavi,
2019, Pillutla et al., 2019], unbalancedness is either ignored or is represented by a collection of a
2 Problem setup
2.1 Optimization goal
We are given K clients where each client k has a local collection Zk of nk samples taken IID from some unknown distribution over sample space Z. We denote the unified sample collection as Z = ∪k∈[K] Zk and the total number of samples as n (i.e., n = |Z| = Σk∈[K] nk). Our objective is global empirical risk minimization (ERM) for some loss function class ℓ(w; ·) : Z → R, parameterized by w ∈ Rd:¹

$$\min_{w \in \mathbb{R}^d} F(w), \quad \text{where } F(w) := \frac{1}{n} \sum_{z \in Z} \ell(w; z). \tag{1}$$
In the following sections we denote the vector of client sample sizes as N = (n1 , n2 , . . . , nK ) and
assume, w.l.o.g., that it is sorted in increasing order.
We restrict ourselves to the FL paradigm, which leaves the training data distributed on client machines,
and learns a shared model by iterating between client updates and server aggregation.
Additionally, a subset of the clients, marked B, can be Byzantine, meaning they can send arbitrary and
possibly malicious results on their local updates. Moreover, unlike previous works, we also consider
clients’ sample sizes to be unreliable because they are reported by possibly Byzantine clients. When
the distinction is important, values that are sent by clients are marked with an overdot to signify that
they are unreliable (e.g., ṅk ), while values that have been preprocessed in some way are marked with
a tilde (e.g., ñk ).
We build upon the baseline federated averaging algorithm (FedAvg) described by McMahan et al. [2017]. There, it is suggested that, in order to save communication rounds, clients perform multiple stochastic gradient descent (SGD) steps while a central server occasionally averages the parameter vectors.
The intuition behind this approach becomes clearer when we denote the k-th client's ERM objective function by Fk(w) := (1/nk) Σz∈Zk ℓ(w; z) and observe that the objective function in equation (1) can be rewritten as a weighted average of clients' objectives:

$$F(w) := \frac{1}{n} \sum_{k \in [K]} n_k F_k(w). \tag{2}$$
¹ We note that some previous FL works specify a more generic finite-sum objective [McMahan et al., 2017]. However, this work investigates client-declared sample sizes, whose meaning is clear under the ERM interpretation but seems meaningless in the finite-sum objective setting.
Similarly to previous works [Pillutla et al., 2019, Chen et al., 2019b,a], we capture a large set of algorithms by abstracting FedAvg into a meta-algorithm for FL (Algorithm 2). We require three procedures to be specified by any concrete algorithm:

(a) Preprocess – receives possibly Byzantine ṅk's from clients and produces secure estimates marked as ñk's. To the best of our knowledge, previous works ignore this procedure and assume the nk's are correct.

(b) ClientUpdate – per-client wk computation. In FedAvg, this corresponds to a few local mini-batch SGD rounds. See Algorithm 1 for pseudocode.

(c) Aggregate – the server's strategy for updating w. In FedAvg, this corresponds to the weighted arithmetic mean, i.e., $w \leftarrow \frac{1}{\dot{n}} \sum_{k \in [K]} \dot{n}_k \dot{w}_k$.
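For concreteness, the weighted-mean Aggregate step of item (c) can be sketched as follows (a minimal illustration using NumPy; the function name is ours, and the weights are the client-reported, possibly unreliable, sample sizes):

```python
import numpy as np

def fedavg_aggregate(client_weights, client_models):
    """Weighted arithmetic mean of client parameter vectors.

    client_weights: reported sample sizes (the possibly unreliable n_k's).
    client_models:  one parameter vector w_k per client.
    """
    n = sum(client_weights)
    return sum(w_k * (n_k / n) for n_k, w_k in zip(client_weights, client_models))

# Two clients: the one reporting more samples pulls the average toward itself.
w = fedavg_aggregate([30, 10], [np.array([1.0, 0.0]), np.array([0.0, 1.0])])
print(w)  # → [0.75 0.25]
```

Because the weights enter the mean directly, a client that exaggerates its sample size can dominate this step, which is exactly the vulnerability studied below.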
Previous works on robust aggregation [Ghosh et al., 2019, Alistarh et al., 2018, Li et al., 2019a, Haddadpour and Mahdavi, 2019, Zhao et al., 2018] either used Assumption 1 without considering the unbalancedness, or implicitly used Assumption 2. However, we observe that Assumption 2 is unattainable in practice, since Byzantine clients can often influence their weight, thus rendering their weight proportion unbounded.
We address this gap with the following definition and an appropriate Preprocess procedure.
Definition 1 (mwp). Given a proportion p and a vector V = (v1, ..., v|V|) sorted in increasing order, the maximal weight proportion, mwp(V, p), is the maximum combined weight for any p-proportion of the values of V:

$$\mathrm{mwp}(V, p) := \frac{1}{\sum_{v \in V} v} \sum_{(1-p)|V| < i} v_i.$$

Note that this is just the weight proportion of the p|V| clients with the largest sample sizes.
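A minimal sketch of the mwp computation (our illustrative code; here the p-proportion is realized as the ⌈p|V|⌉ largest values):

```python
import math

def mwp(V, p):
    """Maximal weight proportion: the combined weight share of the
    ceil(p * |V|) largest values of V."""
    V = sorted(V)
    top = math.ceil(p * len(V))
    return sum(V[len(V) - top:]) / sum(V)

# One client out of four holds 70 of 100 samples, so a 25%-proportion
# of clients can control 70% of the total weight.
print(mwp([10, 10, 10, 70], 0.25))  # → 0.7
```

On a perfectly balanced vector, mwp(V, p) is simply p, which is the setting implicitly assumed by Assumption 2.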
In the rest of this work, we assume Assumption 1 and design a Preprocess procedure that ensures the following:
Observe that this requirement enables the use of weighted robust mean estimators in a realistic setting
by ensuring that Assumption 2 holds for the preprocessed client sample sizes. Also note that here, α is
our assumption about the proportion of Byzantine clients while α∗ relates to an analytical property of
the underlying robust algorithm. For example, we may replace the federated average with a weighted
median as suggested by Chen et al. [2017], in which case, α∗ must be less than 1/2.
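As an illustration of why α∗ constrains the aggregation rule, here is a sketch of a weighted median (in the spirit of Chen et al. [2017]; the implementation is ours, applied here to scalars, while in FL it would be applied per model coordinate). Once the combined weight attributable to Byzantine clients reaches half the total, they control the median:

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    order = np.argsort(values)
    v, w = np.asarray(values)[order], np.asarray(weights)[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, cum[-1] / 2.0)]

# With honest (equal) weights the outlier is ignored...
print(weighted_median([1.0, 2.0, 100.0], [1, 1, 1]))   # → 2.0
# ...but a Byzantine client declaring a huge weight captures the median,
# which is why alpha* must stay below 1/2 *after* preprocessing.
print(weighted_median([1.0, 2.0, 100.0], [1, 1, 10]))  # → 100.0
```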
Our suggested preprocessing procedure uses element-wise truncation of the values of N by some value U, marked trunc(N, U) = (min(n1, U), min(n2, U), . . . , min(nK, U)). Given α and α∗, we search for the maximal truncation value U∗ that satisfies (3):
Here, α and U∗ present a trade-off: a higher α means more Byzantine tolerance but requires a smaller truncation value U∗, which may cause slower and less accurate convergence, as we demonstrate empirically and theoretically in Theorem 3.2.
We note that, given α and α∗, truncating N by the solution of (4) is optimal in the sense that any other Preprocess procedure that adheres to (3) has an equal or larger L1 distance from the original N. This follows immediately from the observation that, when truncating the values of N, the entire distance is due to the truncated elements; if there were another applicable vector closer to N, we could redistribute the difference to the largest elements and increase U∗, in contradiction to its maximality.
If one has an estimate for α, it is easy to calculate U∗: for example, by going over the values of N in decreasing order (i.e., from index K downwards) until finding a value that satisfies the inequality in (4). We then mark the index of this value by u and use the fact that, in the range [nu, nu+1], we can express mwp(trunc(N, U), α) as a simple function of U of the form (a + bU)/(c + dU):

$$\mathrm{mwp}(\mathrm{trunc}(N, U), \alpha) = \frac{\sum_{(1-\alpha)K < i \le u} n_i + |\{n_i : i > \max(u, (1-\alpha)K)\}| \cdot U}{\sum_{i \le u} n_i + |\{n_i : i > u\}| \cdot U},$$

where a, b, c, d denote the corresponding constants in the numerator and denominator. Equating this expression to α∗ and solving for U yields

$$U^* \leftarrow \frac{a - c\alpha^*}{d\alpha^* - b}. \tag{5}$$
[Figure: U∗ as a function of α, for α∗ = 50%.]

Algorithm 3 Report (α, U∗) Pairs
  α ← α∗
  for u ← 1 to K − 1 do
    while mwp(trunc(N, nu+1), α) > α∗ do
      U∗ ← solve (5) for U ∈ [nu, nu+1]
      report (α, U∗)
      α ← α − 1/K
    end while
  end for
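The search for U∗ can also be sketched numerically (our illustrative code; it replaces the closed-form solve of (5) with bisection, relying on mwp(trunc(N, U), α) being non-decreasing in U):

```python
import math

def mwp(V, p):
    """Weight share of the ceil(p * |V|) largest values of V."""
    V = sorted(V)
    top = math.ceil(p * len(V))
    return sum(V[len(V) - top:]) / sum(V)

def trunc(N, U):
    """Element-wise truncation of the sample sizes by U."""
    return [min(n, U) for n in N]

def solve_u_star(N, alpha, alpha_star, tol=1e-6):
    """Maximal U with mwp(trunc(N, U), alpha) <= alpha_star, found by
    bisection; mwp(trunc(N, U), alpha) is non-decreasing in U because
    raising the cap only raises the largest (top) values."""
    lo, hi = 0.0, float(max(N))
    if mwp(trunc(N, hi), alpha) <= alpha_star:
        return hi  # no truncation needed for this alpha
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mwp(trunc(N, mid), alpha) <= alpha_star:
            lo = mid
        else:
            hi = mid
    return lo

# Nine small clients plus one huge one: capping at 45 keeps the top
# 10% of clients at exactly half of the total weight.
N = [5] * 9 + [1000]
print(solve_u_star(N, alpha=0.1, alpha_star=0.5))  # ≈ 45.0
```

Sweeping α downward from α∗ and recording (α, solve_u_star(N, α, α∗)) reproduces the (α, U∗) pairs that Algorithm 3 reports.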
When K is very large, we may want to sample only k ≪ K elements IID from N. In this case, we need to test that the inequality in (4) holds with high probability.
We consider k discrete random variables taken IID from N after truncation by U , that is, taken from
a distribution over {0, 1, . . . , U }. We mark these random variables as X1 , X2 , . . . , Xk , and their
order statistic as X(1) , X(2) , . . . , X(k) where X(1) ≤ X(2) ≤ · · · ≤ X(k) .
Theorem 3.1. Given parameter δ > 0 and

$$\varepsilon_1 = \sqrt{\frac{\ln(3/\delta)}{2k}}, \quad \varepsilon_2 = U\sqrt{\frac{\ln(3/\delta)}{2(k(\alpha - \varepsilon_1) + 1)}}, \quad \varepsilon_3 = U\sqrt{\frac{\ln(3/\delta)}{2k}},$$

we have that mwp(trunc(N, U), α) ≤ α∗ is true with 1 − δ confidence if the following holds:

$$\frac{\alpha \left( \frac{\sum_{i = \lceil (1-(\alpha-\varepsilon_1))k \rceil}^{k} X_{(i)}}{k - \lceil (1-(\alpha-\varepsilon_1))k \rceil + 1} + \varepsilon_2 \right)}{\frac{1}{k} \sum_{i \in [k]} X_i - \varepsilon_3} \le \alpha^* \tag{6}$$
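The left-hand side of the test (6) can be computed directly from a subsample (an illustrative sketch; the function name and the synthetic test data are ours):

```python
import math
import random

def certificate_lhs(X, U, alpha, delta):
    """Left-hand side of the Theorem 3.1 test (6), computed from an IID
    subsample X of the truncated client sample sizes (values in {0,...,U})."""
    k = len(X)
    log_term = math.log(3 / delta)
    eps1 = math.sqrt(log_term / (2 * k))
    eps2 = U * math.sqrt(log_term / (2 * (k * (alpha - eps1) + 1)))
    eps3 = U * math.sqrt(log_term / (2 * k))
    Xs = sorted(X)
    lo = math.ceil((1 - (alpha - eps1)) * k)  # 1-based index of first top order statistic
    top_avg = sum(Xs[lo - 1:]) / (k - lo + 1)
    return alpha * (top_avg + eps2) / (sum(X) / k - eps3)

random.seed(0)
X = [min(random.randint(1, 200), 100) for _ in range(10_000)]  # truncated at U = 100
lhs = certificate_lhs(X, U=100, alpha=0.1, delta=0.01)
# If lhs <= alpha_star, then mwp(trunc(N, U), alpha) <= alpha_star holds
# with confidence 1 - delta.
```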
After applying our Preprocess procedure, we have the truncated number of samples per client, marked {ñk}k∈[K]. We can trivially ensure that any algorithm instance works as expected by requiring that clients ignore samples that were truncated. That is, even if an honest (non-Byzantine) client k has nk samples, it may use only ñk samples during its ClientUpdate.
Although this solution always preserves the semantics of any underlying algorithm, it does hurt
convergence guarantees since the total number of samples decreases [Kairouz et al. 2019, Tables 5
and 6; Yin et al. 2018; Haddadpour and Mahdavi 2019]. Interestingly, Li et al. [2019b, Theorem 3]
analyze the baseline FedAvg and show that its convergence bound decreases with max nk (marked
there as ν). This suggests that in some cases truncation itself may mitigate the decrease in total
sample size.
Additionally, we note that in practice, the performance of federated-averaging-based algorithms improves when honest clients use all their original nk samples. Intuitively, this follows from the observation that Aggregate procedures are generally composite mean estimators and ClientUpdate calls are likely to produce more accurate results given more samples.
Lastly, as we have mentioned before, convergence is guaranteed, but we note that the optimization
goal itself is inevitably skewed in our Byzantine scenario. The following theorem bounds this
difference between original weighted optimization goal (2) and the new goal after truncation. In
order to emphasize the necessity of this bound (in terms of Assumption 2), we use overdot and tilde
to signify unreliable and truncated values, respectively, as previously described in Subsection 2.2.
Theorem 3.2. Given the same setup as in (1) and a truncation bound U, the following holds for all w ∈ Rd:

$$\left\| \frac{1}{\dot{n}} \sum_{i \in [K]} \dot{n}_i F_i(w) - \frac{1}{\tilde{n}} \sum_{i \in [K]} \tilde{n}_i F_i(w) \right\| \le \left\| \sum_{i : \dot{n}_i > U} \left( \frac{\dot{n}_i}{\dot{n}} - \frac{1}{K} \right) F_i(w) + \left( \frac{1}{\dot{n}} - \frac{1}{\tilde{n}} \right) \sum_{i : \dot{n}_i \le U} L(Z_i) \right\|,$$

where L(Zi) is defined as Σz∈Zi ℓ(w; z).
From the bound in Theorem 3.2 we can clearly see how the coefficients in the left term, (ṅi /ṅ−1/K),
stem from unbalancedness in the values above the truncation threshold while the coefficient in the
right term, (1/ṅ−1/ñ), accounts for the increase of relative weight of the values below the truncation
threshold. Additionally, note that this formulation demonstrates how a single Byzantine client can
increase this difference arbitrarily by increasing its ṅi . Lastly, observe how both terms vanish as U
increases, which motivates our selection of U ∗ as the maximal truncation threshold for any given α
and α∗ .
4 Evaluation
In this section we demonstrate how truncating N is a crucial requirement for Byzantine robustness.
That is, we show that no matter the specific attack or aggregation method, using N "as-is" categorically voids any robustness guarantees.
The code for the experiments is based on the TensorFlow machine learning library [Abadi et al., 2015]. Specifically, the code for the Shakespeare experiments is based on the TensorFlow Federated sub-library of TensorFlow. It is available under the Apache License 2.0. Our code can be found at https://github.com/amitport/Towards-Federated-Learning-With-Byzantine-Robust-Client-Weighting. We performed the experiments using a single NVIDIA GeForce RTX 2080 Ti GPU, but the results are easily reproducible on any device.
MNIST: digit recognition with synthetic client partitioning. The MNIST database [LeCun et al., 2010] (available under the Creative Commons Attribution-ShareAlike 3.0 license) includes 28×28 grayscale labeled images of handwritten digits, split into a 60,000-image training set and a 10,000-image test set. We randomly partition the training set among 100 clients. The partition sizes are determined by taking 100 samples from a Lognormal distribution with µ = 1.5, σ = 3.45, and then interpolating corresponding integers that sum to 60,000. This produces a right-skewed, fat-tailed partition-size distribution that emphasizes the importance of correctly weighting aggregation rules and the effects of truncation. Clients train a classifier using a 64-unit perceptron with ReLU activation and 20% dropout, followed by a softmax layer. Following Yin et al. [2018], on every communication round, all clients perform mini-batch SGD with 10% of their examples.

Figure 2: Histogram of the sample partitions of the MNIST (left) and Shakespeare (right) datasets.
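One way to realize the described partitioning is the following sketch (the exact interpolation scheme used in our code may differ; the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def lognormal_partition(n_total=60_000, n_clients=100, mu=1.5, sigma=3.45):
    """Draw client sizes from Lognormal(mu, sigma) and rescale them to
    integers that sum to n_total, yielding a right-skewed, fat-tailed split."""
    raw = rng.lognormal(mean=mu, sigma=sigma, size=n_clients)
    sizes = np.floor(raw / raw.sum() * n_total).astype(int)
    sizes[np.argmax(sizes)] += n_total - sizes.sum()  # assign rounding remainder
    return sizes

sizes = lognormal_partition()
print(sizes.sum())  # → 60000
```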
Note that the Shakespeare and MNIST synthetic tasks were selected because they are relatively
simple, unbalanced tasks. Simple, because we want to evaluate a preprocessing phase and avoid
tuning of the underlying algorithms we compare. Unbalanced, since as can be understood from
Theorem 3.2, when the client sample sizes are spread mostly evenly, ignoring the client sample size
altogether is a viable approach. See Figure 2 for the histograms of the partitions.
We show three Aggregate procedures. Arithmetic mean, as used by the original F edAvg, and two
additional procedures that replace the arithmetic mean with robust mean estimators. The first of
the latter uses the coordinatewise median [Chen et al., 2017, Yin et al., 2018]. That is, each server
model coordinate is taken as the median of the clients’ corresponding coordinates. The second
robust aggregation method uses the coordinatewise trimmed mean [Yin et al., 2018] that, for a
given hyperparameter β, first removes β-proportion lowest and β-proportion highest values in each
coordinate and only then calculates the arithmetic mean of the remaining values.
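The two robust Aggregate variants can be sketched per coordinate as follows (our illustrative code; unweighted versions, applied to stacked client parameter vectors):

```python
import numpy as np

def coordinatewise_median(client_models):
    """Each server coordinate is the median of the clients' coordinates."""
    return np.median(np.stack(client_models), axis=0)

def coordinatewise_trimmed_mean(client_models, beta):
    """Drop the beta-proportion lowest and highest values per coordinate,
    then average the remaining values."""
    X = np.sort(np.stack(client_models), axis=0)
    m = int(beta * len(client_models))
    return X[m:len(client_models) - m].mean(axis=0)

models = [np.array([1.0, 1.0]), np.array([2.0, 2.0]),
          np.array([3.0, 3.0]), np.array([100.0, -100.0])]  # one outlier client
print(coordinatewise_median(models))              # → [2.5 1.5]
print(coordinatewise_trimmed_mean(models, 0.25))  # → [2.5 1.5]
```

Note how both estimators discard the outlier in each coordinate independently, which is exactly what breaks down once a Byzantine client can also inflate its weight.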
When preprocessing the client-declared sample sizes, we compare three options: we either ignore client sample sizes, truncate according to α = 10% and α∗ = 50%, or simply pass through client sample sizes as reported.
For the Shakespeare experiments, we examine a model negation attack [Blanchard et al., 2017]. In this
attack, each attacker “pushes" the model towards zero by always returning a negation of the server’s
model. When the data distribution is balanced, this attack is easily neutralized since Byzantine clients
typically send extreme values. However, in our unbalanced case, we demonstrate that without our
preprocessing step, this attack cannot be mitigated even by robust aggregation methods.
For MNIST, in order to provide comparability, we follow the experiment shown by Yin et al. [2018]
in which 10% of the clients use a label shifting attack. In this attack, Byzantine clients train normally
except for the fact that they replace every training label y with 9−y. The values sent by these clients
are then incorrect but are relatively moderate in value making their attack somewhat harder to detect.
This is in addition to the model negation attacks, already shown in the Shakespeare experiments.
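Both attacks are simple to state in code (an illustrative sketch; function names are ours):

```python
import numpy as np

def model_negation_attack(server_model):
    """Byzantine client always returns the negation of the server's model,
    pushing the aggregate toward zero."""
    return -server_model

def label_shift(labels, num_classes=10):
    """Byzantine MNIST client trains normally but on shifted labels y -> 9 - y,
    producing incorrect yet moderate updates that are harder to detect."""
    return (num_classes - 1) - np.asarray(labels)

print(label_shift([0, 4, 9]))  # → [9 5 0]
```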
We first execute our experiment without any attacks for every server aggregation and preprocessing
combination. Then, for each attack type, we repeat the process two additional times: 1) with a single
attacker that declares 10 million samples, and 2) with 10% attackers that declare 1 million samples
each.
[Figure 3 plot: columns Mean, Median, Trimmed mean (β = 10%); curves Passthrough, Truncate (α = 10%, α∗ = 50%), Ignore.]
Figure 3: Accuracy by round without any attackers for the Shakespeare experiments. Curves
correspond to preprocessing procedures and columns correspond to different aggregation methods. It
can be seen that our method (dashed orange curve) remains comparable to the properly weighted mean
estimators (solid blue curve) while ignoring clients’ sample sizes (dotted green curve) is sub-optimal.
This effect is pronounced when the unweighted median is used, since with our unbalanced partition it
is generally very far from the mean. Figure 5 in Appendix B shows similar results for the MNIST
experiments.
4.2 Results
The Shakespeare experiments without any attackers are shown in Figure 3, and the executions with attackers are shown in Figure 4. The results of the MNIST experiments were almost identical and are deferred to Appendix B for brevity.
The results from the first experiment, running without any attackers (Figure 3), demonstrate that ignoring client sample size results in reduced accuracy, especially when median aggregation is used, whereas truncating according to our procedure is significantly better and is on par with properly using all weights. These results highlight the importance of using sample-size weights when performing server aggregations.
While Figure 3 shows that truncation-based preprocessing performs on par with taking all weights into consideration when all clients are honest, Figure 4 demonstrates that the results are very different when there is an attack. In this case, we see that when even a single attacker reports a highly exaggerated sample size and the server relies on all the values of N, the performance of all aggregation methods, including the robust median and trimmed mean, quickly degrades.
In contrast, in our experiments robustness is maintained when truncation-based preprocessing is
used in conjunction with robust mean aggregations, even when Byzantine clients attain the maximal
supported proportion (α = 10%).
Our method is based on truncating the weight values reported by clients in a manner that bounds from
above the proportion α∗ of weights that can be attributed to Byzantine clients, given an upper bound
on the proportion of clients α that may be Byzantine. Different values of the parameter α represent different points in the trade-off between model quality and Byzantine-robustness, where higher values increase robustness when attacks do occur but decrease the convergence rate even in the absence of attacks.
We evaluated the performance of our truncation method empirically when applied as a preprocessing
stage, prior to several aggregation methods. The results of our experiments establish that: 1) in the
absence of attacks, model convergence is on par with that of properly using all reported weights, and
2) when attacks do occur, combining truncation-based preprocessing with robust aggregations incurs almost no penalty in comparison with the performance of using all weights in the absence of attacks, whereas without preprocessing, even robust aggregation methods collapse to performance worse than that of a random classifier.
[Figure 4 plot: columns Mean, Median, Trimmed mean (β = 10%); rows: one attacker, 10% attackers; curves Passthrough, Truncate (α = 10%, α∗ = 50%), Ignore.]
Figure 4: Accuracy by round under Byzantine attacks for the Shakespeare experiments. Curves
correspond to preprocessing procedures and columns correspond to different aggregation methods.
In the two rows of the experiment the Byzantine clients perform a model negation attack with one
and 10% attackers, respectively.
We observe that even with a single attacker performing a trivial attack (first row), using the weights directly (solid blue curve) is devastating, whereas when our preprocessing method is used in conjunction with robust mean aggregations (dashed orange curve, last two columns), convergence remains stable even when there are actually α (= 10%) attackers (second row). We note that in some cases our method may be slightly less efficient compared with the preprocessing method that ignores sample size altogether (dotted green curve, second row, leftmost column). This is to be expected because we allow Byzantine clients to potentially obtain close to an α∗-proportion (50%, in this case) of the weight.
However, our method is significantly closer to the optimal solution when there are no or only a few
attackers (see Figure 3). Moreover, when used in conjunction with robust mean aggregation methods
it maintains their robustness properties. Figure 6 in Appendix B shows similar results for the MNIST
experiments.
When the number of clients is very large, performing preprocessing and aggregation on the server may become computationally infeasible. We prove that, in this case, truncation-based preprocessing can achieve the same upper bound on α∗ w.h.p. based on the weight values reported by a sufficiently large number of clients selected IID.
As with many Byzantine-robust algorithms, the selection of α has a significant impact on the
underlying model and, specifically, on fairness towards clients that hold underrepresented data, which
may inadvertently be considered outliers. In future work, we plan to analyze further the trade-off
between robustness and the usage of client sample size in rectifying data unbalancedness. We also
plan to investigate alternative forms of estimating client importance that may avoid client sample size
altogether.
References
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew
Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath
Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah,
Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent
Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg,
Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on
heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from
tensorflow.org.
Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4613–4623. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7712-byzantine-stochastic-gradient-descent.pdf.
Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning
with adversaries: Byzantine tolerant gradient descent. In I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information
Processing Systems 30, pages 119–129. Curran Associates, Inc., 2017.
Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir
Ivanov, Chloe Kiddon, Jakub Konecny, Stefano Mazzocchi, H Brendan McMahan, et al. Towards
federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.
Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan
McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings, 2019.
Xiangyi Chen, Tiancong Chen, Haoran Sun, Zhiwei Steven Wu, and Mingyi Hong. Distributed
training with heterogeneous data: Bridging median- and mean-based algorithms, 2019a.
Yang Chen, Xiaoyan Sun, and Yaochu Jin. Communication-efficient federated deep learning with
asynchronous model update and temporally weighted aggregation, 2019b.
Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings:
Byzantine gradient descent. Proc. ACM Meas. Anal. Comput. Syst., 1(2), December 2017. doi:
10.1145/3154503. URL https://doi.org/10.1145/3154503.
Avishek Ghosh, Justin Hong, Dong Yin, and Kannan Ramchandran. Robust federated learning in a heterogeneous environment. CoRR, abs/1906.06629, 2019. URL http://arxiv.org/abs/1906.06629.
Farzin Haddadpour and Mehrdad Mahdavi. On the convergence of local descent methods in federated
learning, 2019.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780,
1997.
Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin
Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L.
D’Oliveira, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih
Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan
Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub
Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu,
Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova,
Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng
Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong,
Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in
federated learning, 2019.
Jakub Konečnỳ, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed
optimization beyond the datacenter. arXiv preprint arXiv:1511.03575, 2015.
Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
Liping Li, Wei Xu, Tianyi Chen, Georgios B. Giannakis, and Qing Ling. RSA: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1544–1551, 2019a.
Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of
fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019b.
H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282, 2017.
Krishna Pillutla, Sham M. Kakade, and Zaid Harchaoui. Robust aggregation for federated learning,
2019.
Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečný,
Sanjiv Kumar, and H. Brendan McMahan. Adaptive federated optimization, 2020.
W. Shakespeare. The Complete Works of William Shakespeare. Complete Works Series. Wordsworth Editions, 1996. ISBN 9781853268953. URL https://books.google.co.il/books?id=D02LWkkBxn4C.
Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed
learning: Towards optimal statistical rates. In Jennifer Dy and Andreas Krause, editors, Proceedings
of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine
Learning Research, pages 5650–5659, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018.
PMLR.
Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated
learning with non-iid data, 2018.
Appendix
A Proofs
A.1 Proof of Theorem 3.1
Then we note that membership in top(trunc(N, U), α) can be viewed as a simple Bernoulli random variable with probability α, for which we obtain the following bound using Hoeffding's inequality, for any t ≥ 0:

$$\Pr\left[ |\{ i \in [k] : X_i \in \mathrm{top}(\mathrm{trunc}(N, U), \alpha) \}| \le (\alpha - t)k \right] \le e^{-2t^2 k} \tag{8}$$

Therefore, with t = ε1, we have the following with 1 − δ/3 confidence:

$$\sum \mathrm{top}(\{X_i \mid i \in [k]\}, \alpha) \le \sum \{ X_{(i)} \mid \lceil (1 - (\alpha - \varepsilon_1))k \rceil \le i \le k \} \tag{9}$$

Using Hoeffding's inequality again, we can bound the expectation of X(i), ⌈(1 − (α − ε1))k⌉ ≤ i ≤ k, by ε2 with 1 − δ/3 confidence and, together with (9), obtain:

$$E[\mathrm{top}(\mathrm{trunc}(N, U), \alpha)] \le \frac{\sum_{i = \lceil (1-(\alpha-\varepsilon_1))k \rceil}^{k} X_{(i)}}{k - \lceil (1-(\alpha-\varepsilon_1))k \rceil + 1} + \varepsilon_2 \tag{10}$$

Then, using Hoeffding's inequality a third time, E[trunc(N, U)] is bounded from below with 1 − δ/3 confidence:

$$E[\mathrm{trunc}(N, U)] \ge \frac{1}{k} \sum_{i \in [k]} X_i - \varepsilon_3 \tag{11}$$
The proof is concluded by applying (9-11) to (7) using the union bound.
B MNIST experiment results
This section provides ancillary results for our experiments conducted on the MNIST dataset. These results are similar to those of the Shakespeare experiments.
The experiment without any attackers is shown in Figure 5 and the executions with attackers are
shown in Figure 6.
[Figure 5 plot: curves Passthrough, Truncate (α = 10%, α∗ = 50%), Ignore.]
Figure 5: Accuracy by round without any attackers for the MNIST experiments. Curves correspond
to preprocessing procedures and columns correspond to different aggregation methods. It can be seen
that our method (dashed orange curve) remains comparable to the properly weighted mean estimators
(solid blue curve) while ignoring clients’ sample sizes (dotted green curve) is sub-optimal. This
effect is pronounced when the unweighted median is used, since with our unbalanced partition it is
generally very far from the mean.
[Figure 6 plot: columns Mean, Median, Trimmed mean (β = 10%); rows Label shift (1 attacker), Label shift (10% attackers), Model negation (1 attacker), Model negation (10% attackers); curves Passthrough, Truncate (α = 10%, α∗ = 50%), Ignore.]
Figure 6: Accuracy by round under Byzantine attacks for the MNIST experiments. Curves correspond
to preprocessing procedures and columns correspond to different aggregation methods. In the first
two rows Byzantine clients perform a label shifting attack with one and 10% attackers, respectively.
In the last two rows we repeat the experiment with a model negation attack.
We observe that even with a single attacker performing a trivial attack (first and third rows), using the weights directly (solid blue curve) is devastating, whereas when our preprocessing method is used in conjunction with robust mean aggregations (dashed orange curve, last two columns), convergence remains stable even when there are actually α (= 10%) attackers (second and fourth rows). We note that in some cases our method may be slightly less efficient compared with the preprocessing method that ignores sample size altogether (dotted green curve, second row, last column). This is to be expected because we allow Byzantine clients to potentially obtain close to an α∗-proportion (50%, in this case) of the weight. However, our method is significantly closer to the optimal solution when there are no or only a few attackers (see Figure 5). Moreover, when used in conjunction with robust mean aggregation methods, it maintains their robustness properties.