Tackling System and Statistical Heterogeneity For Federated Learning With Adaptive Client Sampling

Abstract—… a fraction of clients in each round (partial participation) when the number of participants is large and the server's communication bandwidth is limited. Recent works on the convergence analysis of FL have focused on unbiased client sampling, e.g., sampling uniformly at random, which suffers from slow wall-clock time for convergence due to high degrees of system heterogeneity and statistical heterogeneity. This paper aims to design an adaptive client sampling algorithm that tackles both system and statistical heterogeneity to minimize the wall-clock convergence time. We obtain a new tractable convergence bound for FL algorithms with arbitrary client sampling probabilities. Based on the bound, we analytically establish the relationship between the total learning time and the sampling probabilities, which results in a non-convex optimization problem for training time minimization. We design an efficient algorithm for learning the unknown parameters in the convergence bound and develop a low-complexity algorithm to approximately solve the non-convex problem. Experimental results from both a hardware prototype and simulation demonstrate that our proposed sampling scheme significantly reduces the convergence time compared to several baseline sampling schemes. Notably, our scheme in the hardware prototype spends 73% less time than the uniform sampling baseline for reaching the same target loss.

[Figure 1: schematic of one FL training round, showing the server sampling K of N clients with heterogeneous round times and heterogeneous local data.]
Fig. 1. An FL training round with system and statistical heterogeneity, with K out of N clients sampled according to probability q = {q1, . . . , qN}.

The work of Bing Luo, Wenli Xiao, and Jianwei Huang was supported by the Shenzhen Science and Technology Program (JCYJ20210324120011032), the Shenzhen Institute of Artificial Intelligence and Robotics for Society, and the Presidential Fund from the Chinese University of Hong Kong, Shenzhen. The work of Leandros Tassiulas was supported by the AI Institute for Edge Computing Leveraging Next Generation Networks (Athena) under Grant NSF CNS-2112562 and Grant NRL N00173-21-1-G006. (Corresponding author: Jianwei Huang.)

I. INTRODUCTION

Federated learning (FL) enables many clients1 to collaboratively train a model under the coordination of a central server while keeping the training data decentralized and private (e.g., [1]–[3]). Compared to traditional distributed machine learning techniques, FL has two unique features (e.g., [4]–[9]), as shown in Fig. 1. First, clients are massively distributed, with diverse and low communication rates (known as system heterogeneity), where stragglers can slow down the physical training time.2 Second, the training data are distributed in a non-i.i.d. and unbalanced fashion across the clients (known as statistical heterogeneity), which negatively affects the convergence behavior.

1 We use "device" and "client" interchangeably in this paper.
2 As suggested in [10]–[14], we consider mainstream synchronized FL in this paper due to its composability with other techniques (such as secure aggregation protocols and differential privacy).

Due to the limited communication bandwidth across geographically dispersed devices, FL algorithms (e.g., the de facto FedAvg algorithm in [3]) usually perform multiple local iterations on a fraction of randomly sampled clients (known as partial participation) and then periodically aggregate the resulting local model updates via the central server [3]–[6]. Recent works have provided theoretical convergence analyses that demonstrate the effectiveness of FL with partial participation in various non-i.i.d. settings [15]–[19].

However, these prior works [15]–[19] have focused on sampling schemes that select clients uniformly at random or proportional to the clients' data sizes, which often suffer from slow convergence with respect to wall-clock (physical) time3 due to high degrees of system and statistical heterogeneity. This is because the total FL time depends on both the number of training rounds for reaching the target precision and the physical time in each round [20]. Although uniform sampling guarantees that the aggregated model update in each round is unbiased towards that with full client participation, the aggregated model may have a high variance due to data heterogeneity, thus requiring more training rounds to converge to a target precision.

3 We use wall-clock time to distinguish from the number of training rounds.
Moreover, considering clients' heterogeneous communication delays, uniform sampling also suffers from the straggling effect, as the probability of sampling a straggler within the sampled subset in each round can be relatively high,4 thus yielding a long per-round time.

One effective way of speeding up convergence with respect to the number of training rounds is to choose clients according to a sampling distribution in which "important" clients have high probabilities [21]–[24]. For example, recent works adopted importance sampling approaches based on clients' statistical properties [25]–[28]. However, their sampling schemes did not account for the heterogeneous physical time in each round, especially under straggling circumstances. Another line of works aims to minimize the learning time by optimizing client selection and scheduling based on clients' heterogeneous system resources [20], [29]–[41]. However, their optimization schemes did not consider how client selection influences the convergence behavior under data heterogeneity and thus may negatively affect the total learning time.

In a nutshell, the fundamental limitation of existing works is the lack of joint consideration of the impact of the inherent system heterogeneity and statistical heterogeneity on client sampling. In other words, clients with valuable data may have poor communication capabilities, whereas those who communicate fast may have low-quality data. This motivates us to study the following key question.

Key Question: How to design an optimal client sampling scheme that tackles both system and statistical heterogeneity to achieve fast convergence with respect to wall-clock time?

The challenge of this question is threefold: (1) it is difficult to obtain an analytical FL convergence result for arbitrary client sampling probabilities; (2) the total learning time minimization problem can be complex and non-convex due to the straggling effect; and (3) the optimal client sampling solution contains unknown parameters from the convergence result, which we can only estimate during the learning process (known as the chicken-and-egg problem).

In light of the above discussion, we state the main results and key contributions of this paper as follows:
• Optimal Client Sampling for Heterogeneous FL: We study how to design the optimal client sampling strategy to minimize the FL wall-clock time with convergence guarantees. To the best of our knowledge, this is the first work that aims to optimize client sampling probabilities to address both system and statistical heterogeneity.
• Convergence Bound for Arbitrary Sampling: Using an adaptive client sampling and model aggregation design, we obtain a new tractable convergence upper bound for FL algorithms with arbitrary client sampling probabilities. This enables us to establish the analytical relationship between the total learning time and the client sampling probabilities and to formulate a non-convex training time minimization problem.
• Optimization Algorithm and Sampling Principle: We propose a low-cost substitute sampling approach to learn the convergence-related unknown parameters and develop an efficient algorithm to approximately solve the non-convex problem with low computational complexity. Our solution characterizes the impact of communication time (system heterogeneity) and of data quantity and quality (statistical heterogeneity) on the optimal client sampling design.
• Simulation and Prototype Experimentation: We evaluate the performance of our proposed algorithms in both a simulated environment and a hardware prototype. Experimental results demonstrate that, for both convex and non-convex learning models, our proposed sampling scheme significantly reduces the convergence time compared to several baseline sampling schemes. For example, with our hardware prototype and the EMNIST dataset, our sampling scheme spends 73% less time than baseline uniform sampling for reaching the same target loss.

II. RELATED WORK

Active client sampling and selection play a crucial role in addressing the statistical and system heterogeneity challenges in cross-device FL. In the existing literature, research efforts on speeding up the training process mainly focus on two aspects: importance sampling and resource-aware optimization-based approaches.

The goal of importance sampling is to reduce the variance in traditional optimization algorithms based on stochastic gradient descent (SGD), where SGD draws data samples uniformly at random during the learning process (e.g., [21]–[24]). Recent works have adopted this idea in FL systems to improve communication efficiency via the design of client sampling strategies. Specifically, clients with "important" data have higher probabilities of being sampled in each round. For example, existing works use clients' local gradient information (e.g., [25]–[27]) or local losses (e.g., [28]) to measure the importance of clients' data. However, these schemes did not consider the speed of error convergence with respect to wall-clock time, especially the straggling effect due to heterogeneous transmission delays.

Another line of works aims to minimize wall-clock time via resource-aware optimization-based approaches, such as CPU frequency allocation (e.g., [29]), communication bandwidth allocation (e.g., [30], [31]), straggler-aware client scheduling (e.g., [20], [32]–[37]), parameter control (e.g., [36]–[39]), and task offloading (e.g., [40], [41]). While these papers provided some novel insights, their optimization approaches did not consider how client sampling affects the total wall-clock time and are thus orthogonal to our work.

Unlike all the above-mentioned works, our work focuses on designing the optimal client sampling strategy that tackles both system and statistical heterogeneity to minimize the wall-clock time with convergence guarantees. In addition, most existing works on FL are based on computer simulations. In contrast, we implement our algorithm in an actual hardware prototype with resource-constrained devices, which allows us to capture real system operations.

4 For example, suppose there are 100 clients with only 5 stragglers; then the probability of sampling at least one straggler when uniformly sampling 10 clients in each round is more than 40%.
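The figure in footnote 4 can be verified directly: the probability that a uniformly drawn subset of 10 out of 100 clients (5 of them stragglers) contains at least one straggler is 1 − C(95,10)/C(100,10) ≈ 0.416. A minimal check in Python (the function name is ours, for illustration):

```python
from math import comb

def prob_at_least_one_straggler(n_clients: int, n_stragglers: int, k: int) -> float:
    """P(a uniform random subset of k clients contains >= 1 straggler)."""
    return 1 - comb(n_clients - n_stragglers, k) / comb(n_clients, k)

# 100 clients, 5 stragglers, sample 10 per round -> more than 40%
print(round(prob_at_least_one_straggler(100, 5, 10), 3))  # 0.416
```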
The organization of the rest of the paper is as follows. Section III introduces the system model and problem formulation. Section IV presents our new error-convergence bound with arbitrary client sampling. Section V gives the optimal client sampling algorithm and solution insights. Section VI provides the simulation and prototype experimental results. We conclude this paper in Section VII.

III. PRELIMINARIES AND SYSTEM MODEL

We start by summarizing the basics of FL and its de facto algorithm FedAvg with unbiased client sampling. Then, we introduce the proposed adaptive client sampling for statistical and system heterogeneity based on FedAvg. Finally, we present our formulated optimization problem.

A. Federated Learning (FL)

Consider a federated learning system involving a set N = {1, . . . , N} of clients, coordinated by a central server. Each client i has n_i local training data samples (x_{i,1}, . . . , x_{i,n_i}), and the total number of training data samples across the N devices is n_tot := \sum_{i=1}^{N} n_i. Further, define f(·, ·) as a loss function, where f(w; x_{i,j}) indicates how the machine learning model parameter w performs on the input data sample x_{i,j}. Thus, the local loss function of client i can be defined as

    F_i(w) := \frac{1}{n_i} \sum_{j=1}^{n_i} f(w; x_{i,j}).    (1)

Denote p_i = n_i / n_tot as the weight of the i-th device, such that \sum_{i=1}^{N} p_i = 1. Then, denoting F(w) as the global loss function, the goal of FL is to solve the following optimization problem [1]:

    \min_{w} F(w) := \sum_{i=1}^{N} p_i F_i(w).    (2)

The most popular and de facto optimization algorithm for solving (2) is FedAvg [3]. Denoting r as the index of an FL round, one round (e.g., the r-th) of the FedAvg algorithm proceeds as follows:
1) The server samples a subset of K clients uniformly at random (i.e., K := |K^r| with K^r ⊆ N) and broadcasts the latest model w^r to the selected clients.
2) Each sampled client i sets w_i^{r,0} = w^r and runs E steps5 of local SGD on (1) to compute an updated model w_i^{r,E}. Then, the sampled client sets w_i^{r+1} = w_i^{r,E} and sends it back to the server.
3) The server aggregates (with weight p_i) the clients' updated models and computes a new global model w^{r+1}.

5 E is originally defined as the number of epochs of SGD in [3]. In this paper, we denote by E the number of local iterations for theoretical analysis.

The above process repeats for many rounds until the global loss converges.

Recent works have demonstrated the effectiveness of FedAvg with theoretical convergence guarantees in various settings [15]–[19]. However, these works assume that the server samples clients either uniformly at random or proportional to data size, which may slow down the wall-clock time for convergence due to the straggling effect and non-i.i.d. data [20]. Thus, a careful client sampling design should tackle both system and statistical heterogeneity for fast convergence.

B. System Model of FL with Client Sampling q

We aim to sample clients according to a probability distribution q = {q_i, ∀i ∈ N}, where 0 < q_i < 1 and \sum_{i=1}^{N} q_i = 1. Through optimizing q, we want to address system and statistical heterogeneity so as to minimize the wall-clock time for convergence. We describe the system model as follows.

1) Sampling Model: Following recent works [6], [15]–[19], we assume that the server establishes the sampled client set K(q)^r by sampling K times with replacement from the total N clients, where K(q)^r is a multiset in which a client may appear more than once. The aggregation weight of each client i is multiplied by the number of times it appears in K(q)^r.

2) Statistical Heterogeneity Model: We consider the standard FL setting where the training data are distributed in an unbalanced and non-i.i.d. fashion among clients.

3) System Heterogeneity Model: Following the same setup as [9] and [20], we denote t_i as the round time of client i, which includes both the local model computation time and the global communication time. For simplicity, we assume that t_i remains the same across different rounds for each client i, while t_i and t_j can differ for different clients i and j. The extension to time-varying t_i is left for future work. Without loss of generality, as illustrated in Fig. 1, we sort all N clients in ascending order of t_i, such that

    t_1 ≤ t_2 ≤ · · · ≤ t_i ≤ · · · ≤ t_N.    (3)

4) Total Wall-clock Time Model: We consider the mainstream synchronized FL model, where each sampled client performs multiple (e.g., E) steps of local SGD before sending its model update back to the server (e.g., [3], [4], [15]–[19]). For synchronous FL, the per-round time is limited by the slowest sampled client (known as the straggler). Thus, the per-round time T^{(r)}(q) of the FL process is

    T^{(r)}(q) := \max_{i ∈ K(q)^r} { t_i }.    (4)

Therefore, the total learning time T_tot(q, R) after R rounds is

    T_tot(q, R) = \sum_{r=1}^{R} T^{(r)}(q) = \sum_{r=1}^{R} \max_{i ∈ K(q)^r} { t_i }.    (5)

C. Problem Formulation

Our goal is to minimize the expected total learning time E[T_tot(q, R)] while ensuring that the expected global loss E[F(w^R(q))] converges to the minimum value F* with ε-precision, where w^R(q) is the aggregated global model after R rounds with client sampling probabilities q. This translates into the following problem:

    P1:  \min_{q, R}  E[T_tot(q, R)]
         s.t.  E[F(w^R(q))] − F* ≤ ε,
               \sum_{i=1}^{N} q_i = 1,    (6)
               q_i > 0, ∀i ∈ N,  R ∈ Z^+.

The expectations in E[T_tot(q, R)] and E[F(w^R(q))] in (6) are due to the randomness in the client sampling q and in local SGD. Solving Problem P1, however, is challenging in two aspects:
1) It is generally impossible to find out how q and R affect the final model w^R(q) and the corresponding loss E[F(w^R(q))] before actually training the model. Hence, we need to obtain an analytical expression with respect to q and R to predict how they affect w^R(q) and E[F(w^R(q))].
2) The objective E[T_tot(q, R)] is complicated to optimize due to the straggling effect in (5), which can result in a non-convex optimization problem even in the simplest cases, as we will show later.

In Section IV and Section V, we address these two challenges, respectively, and propose approximate algorithms to find an approximate solution to Problem P1 efficiently.

Algorithm 1: FL with Arbitrary Client Sampling
  Input: Sampling probabilities q = {q_1, . . . , q_N}, K, E, precision ε, initial model w^0
  Output: Final model parameter w^R
  1: for r ← 0, 1, 2, . . . , R do
  2:   The server randomly samples a subset of clients K(q)^r according to q and sends the current global model w^r to the selected clients;    // Sampling
  3:   Each sampled client i sets w_i^{r,0} ← w^r, performs w_i^{r,j+1} ← w_i^{r,j} − η^r ∇F_i(w_i^{r,j}, ξ_i^{r,j}) for j = 0, 1, . . . , E−1, and sets w_i^{r+1} ← w_i^{r,E};    // Computation
  4:   Each sampled client i sends the updated model w_i^{r+1} back to the server;    // Communication
  5:   The server computes the new global model parameter as w^{r+1} ← w^r + \sum_{i ∈ K(q)^r} \frac{p_i}{K q_i} (w_i^{r+1} − w^r);    // Aggregation
IV. CONVERGENCE BOUND FOR ARBITRARY SAMPLING
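Before turning to the results below, it helps to see the arbitrary-sampling aggregation of Algorithm 1 in code. The following is a minimal NumPy sketch of one round on toy quadratic client objectives F_i(w) = ½‖w − c_i‖² (the dimensions, weights, and objectives are illustrative assumptions, not the paper's setup). The point is line 5 of Algorithm 1: each sampled client's model delta is scaled by p_i/(K q_i), which keeps the aggregated update an unbiased estimate of the full-participation update Σ_i p_i (w_i^{r+1} − w^r) for any sampling distribution q.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, E, eta, d = 5, 2, 3, 0.1, 4             # clients, draws/round, local steps, lr, model dim

p = rng.dirichlet(np.ones(N))                  # data-size weights p_i (illustrative)
q = rng.dirichlet(np.ones(N))                  # sampling probabilities q_i (illustrative)
c = rng.normal(size=(N, d))                    # per-client optimum of F_i(w) = 0.5*||w - c_i||^2

w = np.zeros(d)                                # global model w^r
sampled = rng.choice(N, size=K, p=q)           # K draws WITH replacement: the multiset K(q)^r

update = np.zeros(d)
for i in sampled:                              # a client drawn twice contributes twice
    w_i = w.copy()
    for _ in range(E):                         # E local gradient steps on F_i
        w_i -= eta * (w_i - c[i])
    update += p[i] / (K * q[i]) * (w_i - w)    # line 5: importance-weighted model delta

w_next = w + update                            # new global model w^{r+1}
```

Taking the expectation over a single draw i ∼ q gives E[(p_i/(K q_i))(w_i − w)] = (1/K) Σ_i p_i (w_i − w); summing over the K draws recovers the full-participation update, which is the unbiasedness property the convergence analysis relies on.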
TABLE I
Wall-clock time for reaching the target loss under different sampling schemes.

| Setup | proposed sampling | statistical sampling | weighted sampling | uniform sampling | full participation |
|---|---|---|---|---|---|
| Prototype Setup (EMNIST dataset) | 733.2 s | 2095.0 s (2.9×) | 2221.7 s (3.0×) | 2691.5 s (3.7×)† | 2748.4 s (3.7×) |
| Simulation Setup 1 (Synthetic dataset) | 445.5 s | 952.4 s (2.1×) | 940.2 s (2.1×) | 933.8 s (2.1×) | 1526.8 s (3.4×) |
| Simulation Setup 2 (MNIST dataset) | 245.5 s | 373.8 s (1.5×) | 542.9 s (2.2×) | NA | 898.1 s (3.7×) |

† "3.7×" is the wall-clock time ratio of uniform sampling over proposed sampling for reaching the target loss, i.e., proposed sampling takes 73% less time than uniform sampling.

[Figure 3: three panels — (a) loss vs. wall-clock time, (b) accuracy vs. wall-clock time, (c) loss vs. number of rounds — comparing the sampling schemes.]
Fig. 3. Performance of Prototype Setup with logistic regression model, EMNIST dataset, uniform communication time, and target loss 1.19.
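The wall-clock gaps in Table I follow from the per-round model in (4): with K with-replacement draws from sorted delays t_1 ≤ · · · ≤ t_N, P(T^{(r)} ≤ t_i) = (Σ_{j≤i} q_j)^K, so E[T^{(r)}(q)] has a closed form. The sketch below (the delay values are hypothetical, not the measured ones) shows how shifting probability mass away from a straggler shrinks the expected round time:

```python
import numpy as np

def expected_round_time(t_sorted, q, K):
    """E[max_{i in K(q)^r} t_i] for K with-replacement draws from q (t_sorted ascending)."""
    cdf = np.cumsum(q)                                    # P(one draw has delay <= t_i)
    prev = np.concatenate(([0.0], cdf[:-1]))
    p_slowest = cdf**K - prev**K                          # P(the round's max delay equals t_i)
    return float(np.dot(t_sorted, p_slowest))

t = np.array([1.0, 1.0, 1.0, 1.0, 10.0])                  # one straggler among five clients
uniform = np.full(5, 0.2)
deprioritized = np.array([0.24, 0.24, 0.24, 0.24, 0.04])  # down-weight the straggler
print(round(expected_round_time(t, uniform, K=3), 3))        # 5.392
print(round(expected_round_time(t, deprioritized, K=3), 3))  # 2.037
```

Of course, concentrating all mass on the fastest clients would bias learning toward their data; the paper's optimization trades this per-round speedup against the extra rounds caused by statistical heterogeneity.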
adopted the widely used MNIST dataset and EMNIST dataset [6]. For the synthetic dataset, we follow a similar setup to that in [18], which generates 60-dimensional random vectors as input data. We adopt both the convex multinomial logistic regression model and the non-convex convolutional neural network (CNN) model with the LeNet-5 architecture [44].

3) Implementation: We consider three experimental setups.
• Prototype Setup: We conduct the first experiment on the prototype system using logistic regression and the EMNIST dataset. To generate a heterogeneous data partition, similar to [6], we randomly subsample 33,036 lower-case character samples from the EMNIST dataset and distribute them among N = 40 edge devices in an unbalanced (i.e., different devices have different numbers of data samples, following a power-law distribution) and non-i.i.d. fashion (i.e., each device has a randomly chosen number of classes, ranging from 1 to 10).12
• Simulation Setup 1: We conduct the second experiment in the simulated system using logistic regression and the Synthetic dataset. To simulate a heterogeneous setting, we use the non-i.i.d. Synthetic (1, 1) setting. We generate 20,509 data samples and distribute them among N = 100 clients in an unbalanced power-law distribution.
• Simulation Setup 2: We conduct the third experiment in the simulated system using CNN and the MNIST dataset, where we randomly subsample 15,129 data samples from MNIST and distribute them among N = 100 clients in an unbalanced (following the power-law distribution) and non-i.i.d. (i.e., each device has 1–6 classes) fashion.

12 The number of samples and the number of classes are randomly matched, such that clients with more data samples may not have more classes.

4) Training Parameters: For all experiments, we initialize our model with w0 = 0 and use an SGD batch size of b = 24. We use an initial learning rate of η0 = 0.1 with a decay rate of η0/(1+r), where r is the communication round index. We adopt similar FedAvg settings as in [4], [18], [25], [32]: we sample 10% of all clients in each round, i.e., K = 4 for the Prototype Setup and K = 10 for the Simulation Setups, with each client performing E = 50 local iterations.13

13 We also conduct experiments on both the Prototype and Simulation Setups with varying E and K, which show performance similar to the experiments in this paper; due to page limitations, we do not illustrate them all.

5) Heterogeneous System Parameters: For the Prototype Setup, to enable heterogeneous communication times, we control clients' communication bandwidth and generate a uniform distribution ti ∼ U(0.187, 7.159) seconds, with a mean of 3.648 seconds and a standard deviation of 2.071 seconds. For the simulation system, we generate the client transmission delays with an exponential distribution, i.e., ti ∼ exp(1) seconds, with both mean and standard deviation equal to 1 second.
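Under the simulation delay model t_i ∼ exp(1), the straggling effect of uniform sampling is easy to quantify: if a round's K sampled delays are modeled as i.i.d. exp(1) draws, the expected per-round time equals the harmonic number H_K = Σ_{k=1}^{K} 1/k (a standard property of exponential order statistics), i.e., about 2.93 s for K = 10 versus the 1 s mean delay. A quick sanity check (our own illustration, not an experiment from the paper):

```python
import numpy as np

K = 10
h_k = sum(1.0 / k for k in range(1, K + 1))        # E[max of K i.i.d. exp(1)] = H_10 ≈ 2.929

rng = np.random.default_rng(1)
rounds = rng.exponential(scale=1.0, size=(200_000, K))
simulated = rounds.max(axis=1).mean()               # Monte Carlo estimate of E[per-round time]

print(f"analytic: {h_k:.3f} s, simulated: {simulated:.3f} s")
```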
[Figure 4: three panels — (a) loss vs. wall-clock time, (b) accuracy vs. wall-clock time, (c) loss vs. number of rounds — comparing the sampling schemes.]
Fig. 4. Performance of Simulation Setup 1 with logistic regression model, Synthetic (1, 1) dataset, exponential communication time, and target loss 0.78.

[Figure 5: three panels — (a) loss vs. wall-clock time, (b) accuracy vs. wall-clock time, (c) loss vs. number of rounds — comparing the sampling schemes.]
Fig. 5. Performance of Simulation Setup 2 with CNN model, MNIST dataset, exponential communication time, and target loss 0.1.
B. Performance Results

We evaluate the wall-clock time performance of both the global training loss and the test accuracy of the aggregated model in each round for all sampling schemes. We average each experiment over 50 independent runs. For a fair comparison, we use the same random seed to compare sampling schemes in a single run and vary random seeds across different runs.

Figs. 3–5 show the results of the Prototype Setup, Simulation Setup 1, and Simulation Setup 2, respectively. We summarize the key observations as follows.

1) Loss with Wall-clock Time: As predicted by our theory, Figs. 3(a)–5(a) show that our proposed sampling scheme achieves the same target loss in significantly less time than the baseline sampling schemes. Specifically, for the Prototype Setup in Fig. 3(a), our proposed sampling scheme spends around 73% less time than full sampling and uniform sampling, and around 66% less time than weighted sampling and statistical sampling, for reaching the same target loss. Fig. 5(a) highlights that our proposed sampling works well with the non-convex CNN model, under which naive uniform sampling cannot reach the target loss within 900 seconds, indicating the importance of a careful client sampling design. Table I summarizes the superior wall-clock time performance of our proposed sampling scheme for reaching the target loss in all three setups.

2) Accuracy with Wall-clock Time: As shown in Figs. 3(b)–5(b), our proposed sampling scheme achieves the target test accuracy14 much faster than the other benchmarks. Notably, for Simulation Setup 1 with the target test accuracy of 75.3% in Fig. 4(b), our proposed sampling scheme takes around 70% less time than full sampling and around 46% less time than the other sampling schemes. We can also observe the superior test accuracy performance of our proposed sampling scheme in the Prototype Setup and the non-convex Simulation Setup 2 in Fig. 3(b) and Fig. 5(b), respectively.

14 In Fig. 3(b), Fig. 4(b), and Fig. 5(b), the target test accuracy corresponds to the test accuracy at the point where our proposed scheme reaches the target loss.

3) Loss with Number of Rounds: Figs. 3(c)–5(c) show that our proposed sampling scheme requires more training rounds to reach the target loss than the baseline statistical sampling and full participation schemes. This observation is expected, since our proposed sampling scheme aims to minimize the wall-clock time instead of the number of rounds. Nevertheless, we notice that statistical sampling performs better than the other sampling schemes, which verifies Corollary 1, since the performance of loss with respect to the number of rounds is equivalent to that with respect to wall-clock time for homogeneous systems.

VII. CONCLUSION AND FUTURE WORK

In this work, we studied the optimal client sampling strategy that addresses the system and statistical heterogeneity in FL to minimize the wall-clock convergence time. We obtained a new tractable convergence bound for FL algorithms with arbitrary client sampling probabilities. Based on the bound, we formulated a non-convex wall-clock time minimization problem. We developed an efficient algorithm to learn the unknown parameters in the convergence bound and designed a low-complexity algorithm to approximately solve the non-convex problem. Our solution characterizes the interplay between clients' communication delays (system heterogeneity) and data importance (statistical heterogeneity), and their impact on the optimal client sampling design. Experimental results validated the superiority of our proposed scheme compared to several baselines in speeding up the wall-clock convergence time.
REFERENCES

[1] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., "Advances and open problems in federated learning," arXiv preprint arXiv:1912.04977, 2019.
[2] Q. Yang, Y. Liu, T. Chen, and Y. Tong, "Federated machine learning: Concept and applications," ACM Transactions on Intelligent Systems and Technology, vol. 10, no. 2, pp. 1–19, 2019.
[3] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[4] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečnỳ, S. Mazzocchi, H. B. McMahan et al., "Towards federated learning at scale: System design," in Proceedings of Machine Learning and Systems (MLSys), 2019.
[5] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
[6] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," in Proceedings of Machine Learning and Systems (MLSys), 2020.
[7] H. Yu, S. Yang, and S. Zhu, "Parallel restarted SGD for non-convex optimization with faster convergence and less communication," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[8] H. Yu, R. Jin, and S. Yang, "On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization," in Proceedings of the International Conference on Machine Learning (ICML). PMLR, 2019, pp. 7184–7193.
[9] J. Wang and G. Joshi, "Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD," in Proceedings of Machine Learning and Systems (MLSys), 2019.
[10] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, "Practical secure aggregation for federated learning on user-held data," in NeurIPS Workshop on Private Multi-Party Machine Learning, 2016.
[11] B. Avent, A. Korolova, D. Zeber, T. Hovden, and B. Livshits, "BLENDER: Enabling local search with a hybrid differential privacy model," in USENIX Security Symposium (USENIX Security), 2017, pp. 747–764.
[12] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," in NeurIPS Workshop on Private Multi-Party Machine Learning, 2016.
[13] M. Zhang, E. Wei, and R. Berry, "Faithful edge federated learning: Scalability and privacy," IEEE Journal on Selected Areas in Communications, vol. 39, no. 12, pp. 3790–3804, 2021.
[14] P. Sun, H. Che, Z. Wang, Y. Wang, T. Wang, L. Wu, and H. Shao, "Pain-FL: Personalized privacy-preserving incentive for federated learning," IEEE Journal on Selected Areas in Communications, vol. 39, no. 12, pp. 3805–3820, 2021.
[15] F. Haddadpour and M. Mahdavi, "On the convergence of local descent methods in federated learning," arXiv preprint arXiv:1910.14425, 2019.
[16] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, "SCAFFOLD: Stochastic controlled averaging for on-device federated learning," arXiv preprint arXiv:1910.06378, 2019.
[17] H. Yang, M. Fang, and J. Liu, "Achieving linear speedup with partial worker participation in non-IID federated learning," arXiv preprint arXiv:2101.11203, 2021.
[18] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of FedAvg on non-IID data," in Proceedings of the International Conference on Learning Representations (ICLR), 2019.
[19] Z. Qu, K. Lin, J. Kalagnanam, Z. Li, J. Zhou, and Z. Zhou, "Federated learning's blessing: FedAvg has linear speedup," arXiv preprint arXiv:2007.05690, 2020.
[20] A. Reisizadeh, I. Tziotis, H. Hassani, A. Mokhtari, and R. Pedarsani, "Straggler-resilient federated learning: Leveraging the interplay between statistical accuracy and system heterogeneity," arXiv preprint arXiv:2012.14453, 2020.
[21] P. Zhao and T. Zhang, "Stochastic optimization with importance sampling for regularized loss minimization," in Proceedings of the International Conference on Machine Learning (ICML), 2015, pp. 1–9.
[22] D. Needell, R. Ward, and N. Srebro, "Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm," in Advances in Neural Information Processing Systems, 2014, pp. 1017–1025.
[23] G. Alain, A. Lamb, C. Sankar, A. Courville, and Y. Bengio, "Variance reduction in SGD by distributed importance sampling," arXiv preprint arXiv:1511.06481, 2015.
[24] S. Gopal, "Adaptive sampling for SGD by exploiting side information," in Proceedings of the International Conference on Machine Learning (ICML). PMLR, 2016, pp. 364–372.
[25] W. Chen, S. Horvath, and P. Richtarik, "Optimal client sampling for federated learning," arXiv preprint arXiv:2010.13723, 2020.
[26] E. Rizk, S. Vlaski, and A. H. Sayed, "Federated learning under importance sampling," arXiv preprint arXiv:2012.07383, 2020.
[27] H. T. Nguyen, V. Sehwag, S. Hosseinalipour, C. G. Brinton, M. Chiang, and H. V. Poor, "Fast-convergent federated learning," IEEE Journal on Selected Areas in Communications, vol. 39, no. 1, pp. 201–218, 2021.
[28] Y. J. Cho, J. Wang, and G. Joshi, "Client selection in federated learning: Convergence analysis and power-of-choice selection strategies," arXiv preprint arXiv:2010.01243, 2020.
[29] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong, "Federated learning over wireless networks: Optimization model design and analysis," in Proceedings of the IEEE Conference on Computer Communications (INFOCOM), 2019, pp. 1387–1395.
[30] M. Chen, H. V. Poor, W. Saad, and S. Cui, "Convergence time optimization for federated learning over wireless networks," IEEE Transactions on Wireless Communications, vol. 20, no. 4, pp. 2457–2471, 2020.
[31] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, "Joint device scheduling and resource allocation for latency constrained wireless federated learning," IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 453–467, 2021.
[32] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in Proceedings of the IEEE International Conference on Communications (ICC), 2019, pp. 1–7.
[33] Z. Chai, A. Ali, S. Zawad, S. Truex, A. Anwar, N. Baracaldo, Y. Zhou, H. Ludwig, F. Yan, and Y. Cheng, "TiFL: A tier-based federated learning system," in Proceedings of the International Symposium on High-Performance Parallel and Distributed Computing, 2020, pp. 125–136.
[34] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, "Scheduling policies for federated learning in wireless networks," IEEE Transactions on Communications, vol. 68, no. 1, pp. 317–333, 2019.
[35] Y. Jin, L. Jiao, Z. Qian, S. Zhang, S. Lu, and X. Wang, "Resource-efficient and convergence-preserving online participant selection in federated learning," in Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS), 2020.
[36] B. Luo, X. Li, S. Wang, J. Huang, and L. Tassiulas, "Cost-effective federated learning in mobile edge networks," IEEE Journal on Selected Areas in Communications, vol. 39, no. 12, pp. 3606–3621, 2021.
[37] H. Wang, Z. Kaplan, D. Niu, and B. Li, "Optimizing federated learning on non-IID data with reinforcement learning," in Proceedings of the IEEE Conference on Computer Communications (INFOCOM), 2020, pp. 1698–1707.
[38] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, 2019.
[39] B. Luo, X. Li, S. Wang, J. Huang, and L. Tassiulas, "Cost-effective federated learning design," in Proceedings of the IEEE Conference on Computer Communications (INFOCOM), 2021, pp. 1–10.
[40] Y. Tu, Y. Ruan, S. Wagle, C. G. Brinton, and C. Joe-Wong, "Network-aware optimization of distributed learning for fog computing," in Proceedings of the IEEE Conference on Computer Communications (INFOCOM), 2020.
[41] S. Wang, M. Lee, S. Hosseinalipour, R. Morabito, M. Chiang, and C. G. Brinton, "Device sampling for heterogeneous federated learning: Theory, algorithms, and implementation," in Proceedings of the IEEE Conference on Computer Communications (INFOCOM), 2021.
[42] S. U. Stich, "Local SGD converges fast and communicates little," in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[43] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.