Tackling System and Statistical Heterogeneity For Federated Learning With Adaptive Client Sampling

Abstract—… a fraction of clients in each round (partial participation) when the number of participants is large and the server's communication bandwidth is limited. Recent works on the convergence analysis of FL have focused on unbiased client sampling, e.g., sampling uniformly at random, which suffers from slow wall-clock time for convergence due to high degrees of system heterogeneity and statistical heterogeneity. This paper aims to design an adaptive client sampling algorithm that tackles both system and statistical heterogeneity to minimize the wall-clock convergence time. We obtain a new tractable convergence bound for FL algorithms with arbitrary client sampling probabilities. Based on the bound, we analytically establish the relationship between the total learning time and the sampling probabilities, which results in a non-convex optimization problem for training time minimization. We design an efficient algorithm for learning the unknown parameters in the convergence bound and develop a low-complexity algorithm to approximately solve the non-convex problem. Experimental results from both a hardware prototype and simulation demonstrate that our proposed sampling scheme significantly reduces the convergence time compared to several baseline sampling schemes. Notably, our scheme in the hardware prototype spends 73% less time than the uniform sampling baseline for reaching the same target loss.

[Figure 1: schematic of one FL training round, showing the server sampling K of N clients with heterogeneous round times and heterogeneous local data.]
Fig. 1. An FL training round with system and statistical heterogeneity, with K out of N clients sampled according to probability q = {q1, . . . , qN}.

The work of Bing Luo, Wenli Xiao, and Jianwei Huang was supported by the Shenzhen Science and Technology Program (JCYJ20210324120011032), the Shenzhen Institute of Artificial Intelligence and Robotics for Society, and the Presidential Fund from the Chinese University of Hong Kong, Shenzhen. The work of Leandros Tassiulas was supported by the AI Institute for Edge Computing Leveraging Next Generation Networks (Athena) under Grant NSF CNS-2112562 and Grant NRL N00173-21-1-G006. (Corresponding author: Jianwei Huang.)

I. INTRODUCTION

Federated learning (FL) enables many clients1 to collaboratively train a model under the coordination of a central server while keeping the training data decentralized and private (e.g., [1]–[3]). Compared to traditional distributed machine learning techniques, FL has two unique features (e.g., [4]–[9]), as shown in Fig. 1. First, clients are massively distributed, with diverse and low communication rates (known as system heterogeneity), where stragglers can slow down the physical training time.2 Second, the training data are distributed in a non-i.i.d. and unbalanced fashion across the clients (known as statistical heterogeneity), which negatively affects the convergence behavior.

1 We use "device" and "client" interchangeably in this paper.
2 As suggested in [10]–[14], we consider mainstream synchronized FL in this paper due to its composability with other techniques (such as secure aggregation protocols and differential privacy).

Due to the limited communication bandwidth across geographically dispersed devices, FL algorithms (e.g., the de facto FedAvg algorithm in [3]) usually perform multiple local iterations on a fraction of randomly sampled clients (known as partial participation) and then periodically aggregate the resulting local model updates via the central server [3]–[6]. Recent works have provided theoretical convergence analyses that demonstrate the effectiveness of FL with partial participation in various non-i.i.d. settings [15]–[19].

However, these prior works [15]–[19] have focused on sampling schemes that select clients uniformly at random or proportional to the clients' data sizes, which often suffer from slow convergence with respect to wall-clock (physical) time3 due to high degrees of system and statistical heterogeneity. This is because the total FL time depends on both the number of training rounds for reaching the target precision and the physical time in each round [20]. Although uniform sampling guarantees that the aggregated model update in each round is unbiased towards that with full client participation, the aggregated model may have a high variance due to data heterogeneity, thus requiring more training rounds to converge to a target precision.

3 We use wall-clock time to distinguish from the number of training rounds.
Moreover, considering clients' heterogeneous communication delays, uniform sampling also suffers from the straggling effect, as the probability of sampling a straggler within the sampled subset in each round can be relatively high,4 thus yielding a long per-round time.

One effective way of speeding up convergence with respect to the number of training rounds is to choose clients according to a sampling distribution in which "important" clients have high probabilities [21]–[24]. For example, recent works adopted importance sampling approaches based on clients' statistical properties [25]–[28]. However, their sampling schemes did not account for the heterogeneous physical time in each round, especially under straggling circumstances. Another line of works aims to minimize the learning time by optimizing client selection and scheduling based on clients' heterogeneous system resources [20], [29]–[41]. However, their optimization schemes did not consider how client selection influences the convergence behavior under data heterogeneity and thus may negatively affect the total learning time.

In a nutshell, the fundamental limitation of existing works is the lack of joint consideration of the impact of the inherent system heterogeneity and statistical heterogeneity on client sampling. In other words, clients with valuable data may have poor communication capabilities, whereas those who communicate fast may have low-quality data. This motivates us to study the following key question.

Key Question: How to design an optimal client sampling scheme that tackles both system and statistical heterogeneity to achieve fast convergence with respect to wall-clock time?

The challenge of this question is threefold: (1) it is difficult to obtain an analytical FL convergence result for arbitrary client sampling probabilities; (2) the total learning time minimization problem can be complex and non-convex due to the straggling effect; and (3) the optimal client sampling solution contains unknown parameters from the convergence result, which we can only estimate during the learning process (known as the chicken-and-egg problem).

In light of the above discussion, we state the main results and key contributions of this paper as follows:
• Optimal Client Sampling for Heterogeneous FL: We study how to design the optimal client sampling strategy to minimize the FL wall-clock time with convergence guarantees. To the best of our knowledge, this is the first work that aims to optimize client sampling probabilities to address both system and statistical heterogeneity.
• Convergence Bound for Arbitrary Sampling: Using an adaptive client sampling and model aggregation design, we obtain a new tractable convergence upper bound for FL algorithms with arbitrary client sampling probabilities. This enables us to establish the analytical relationship between the total learning time and the client sampling probabilities and to formulate a non-convex training time minimization problem.
• Optimization Algorithm and Sampling Principle: We propose a low-cost substitute sampling approach to learn the convergence-related unknown parameters and develop an efficient algorithm to approximately solve the non-convex problem with low computational complexity. Our solution characterizes the impact of communication time (system heterogeneity) and of data quantity and quality (statistical heterogeneity) on the optimal client sampling design.
• Simulation and Prototype Experimentation: We evaluate the performance of our proposed algorithms in both a simulated environment and a hardware prototype. Experimental results demonstrate that, for both convex and non-convex learning models, our proposed sampling scheme significantly reduces the convergence time compared to several baseline sampling schemes. For example, with our hardware prototype and the EMNIST dataset, our sampling scheme spends 73% less time than baseline uniform sampling for reaching the same target loss.

II. RELATED WORK

Active client sampling and selection play a crucial role in addressing the statistical and system heterogeneity challenges in cross-device FL. In the existing literature, research efforts on speeding up the training process mainly focus on two aspects: importance sampling and resource-aware optimization-based approaches.

The goal of importance sampling is to reduce the variance in traditional optimization algorithms based on stochastic gradient descent (SGD), where SGD draws data samples uniformly at random during the learning process (e.g., [21]–[24]). Recent works have adopted this idea in FL systems to improve communication efficiency via the design of client sampling strategies. Specifically, clients with "important" data have higher probabilities of being sampled in each round. For example, existing works use clients' local gradient information (e.g., [25]–[27]) or local losses (e.g., [28]) to measure the importance of clients' data. However, these schemes did not consider the speed of error convergence with respect to wall-clock time, especially the straggling effect due to heterogeneous transmission delays.

Another line of works aims to minimize wall-clock time via resource-aware optimization-based approaches, such as CPU frequency allocation (e.g., [29]), communication bandwidth allocation (e.g., [30], [31]), straggler-aware client scheduling (e.g., [20], [32]–[37]), parameter control (e.g., [36]–[39]), and task offloading (e.g., [40], [41]). While these papers provided some novel insights, their optimization approaches did not consider how client sampling affects the total wall-clock time and are thus orthogonal to our work.

Unlike all the above-mentioned works, our work focuses on designing the optimal client sampling strategy that tackles both system and statistical heterogeneity to minimize the wall-clock time with convergence guarantees. In addition, most existing works on FL are based on computer simulations. In contrast, we implement our algorithm in an actual hardware prototype with resource-constrained devices, which allows us to capture real system operations.

4 For example, suppose there are 100 clients with only 5 stragglers; then the probability of sampling at least one straggler when uniformly sampling 10 clients in each round is more than 40%.
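The figure in footnote 4 can be verified directly: the probability that a uniformly drawn subset of 10 out of 100 clients (5 of them stragglers) contains at least one straggler is 1 − C(95,10)/C(100,10) ≈ 0.416. A minimal check in Python (the function name is ours, for illustration):

```python
from math import comb

def prob_at_least_one_straggler(n_clients: int, n_stragglers: int, k: int) -> float:
    """P(a uniform random subset of k clients contains >= 1 straggler)."""
    return 1 - comb(n_clients - n_stragglers, k) / comb(n_clients, k)

# 100 clients, 5 stragglers, sample 10 per round -> more than 40%
print(round(prob_at_least_one_straggler(100, 5, 10), 3))  # 0.416
```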
The organization of the rest of the paper is as follows. Section III introduces the system model and problem formulation. Section IV presents our new error-convergence bound with arbitrary client sampling. Section V gives the optimal client sampling algorithm and solution insights. Section VI provides the simulation and prototype experimental results. We conclude this paper in Section VII.

III. PRELIMINARIES AND SYSTEM MODEL

We start by summarizing the basics of FL and its de facto algorithm FedAvg with unbiased client sampling. Then, we introduce the proposed adaptive client sampling for statistical and system heterogeneity based on FedAvg. Finally, we present our formulated optimization problem.

A. Federated Learning (FL)

Consider a federated learning system involving a set N = {1, . . . , N} of clients, coordinated by a central server. Each client i has n_i local training data samples (x_{i,1}, . . . , x_{i,n_i}), and the total number of training data samples across the N devices is n_tot := \sum_{i=1}^{N} n_i. Further, define f(·, ·) as a loss function, where f(w; x_{i,j}) indicates how the machine learning model parameter w performs on the input data sample x_{i,j}. Thus, the local loss function of client i can be defined as

    F_i(w) := \frac{1}{n_i} \sum_{j=1}^{n_i} f(w; x_{i,j}).    (1)

Denote p_i = n_i / n_tot as the weight of the i-th device, such that \sum_{i=1}^{N} p_i = 1. Then, denoting F(w) as the global loss function, the goal of FL is to solve the following optimization problem [1]:

    \min_{w} F(w) := \sum_{i=1}^{N} p_i F_i(w).    (2)

The most popular and de facto optimization algorithm for solving (2) is FedAvg [3]. Denoting r as the index of an FL round, one round (e.g., the r-th) of the FedAvg algorithm proceeds as follows:
1) The server samples a subset of K clients uniformly at random (i.e., K := |K^r| with K^r ⊆ N) and broadcasts the latest model w^r to the selected clients.
2) Each sampled client i sets w_i^{r,0} = w^r and runs E steps5 of local SGD on (1) to compute an updated model w_i^{r,E}. Then, the sampled client sets w_i^{r+1} = w_i^{r,E} and sends it back to the server.
3) The server aggregates (with weight p_i) the clients' updated models and computes a new global model w^{r+1}.

5 E is originally defined as the number of epochs of SGD in [3]. In this paper, we denote by E the number of local iterations for theoretical analysis.

The above process repeats for many rounds until the global loss converges.

Recent works have demonstrated the effectiveness of FedAvg with theoretical convergence guarantees in various settings [15]–[19]. However, these works assume that the server samples clients either uniformly at random or proportional to data size, which may slow down the wall-clock time for convergence due to the straggling effect and non-i.i.d. data [20]. Thus, a careful client sampling design should tackle both system and statistical heterogeneity for fast convergence.

B. System Model of FL with Client Sampling q

We aim to sample clients according to a probability distribution q = {q_i, ∀i ∈ N}, where 0 < q_i < 1 and \sum_{i=1}^{N} q_i = 1. Through optimizing q, we want to address system and statistical heterogeneity so as to minimize the wall-clock time for convergence. We describe the system model as follows.

1) Sampling Model: Following recent works [6], [15]–[19], we assume that the server establishes the sampled client set K(q)^r by sampling K times with replacement from the total N clients, where K(q)^r is a multiset in which a client may appear more than once. The aggregation weight of each client i is multiplied by the number of times it appears in K(q)^r.

2) Statistical Heterogeneity Model: We consider the standard FL setting where the training data are distributed in an unbalanced and non-i.i.d. fashion among clients.

3) System Heterogeneity Model: Following the same setup as [9] and [20], we denote t_i as the round time of client i, which includes both the local model computation time and the global communication time. For simplicity, we assume that t_i remains the same across different rounds for each client i, while t_i and t_j can differ for different clients i and j. The extension to time-varying t_i is left for future work. Without loss of generality, as illustrated in Fig. 1, we sort all N clients in ascending order of t_i, such that

    t_1 ≤ t_2 ≤ · · · ≤ t_i ≤ · · · ≤ t_N.    (3)

4) Total Wall-clock Time Model: We consider the mainstream synchronized FL model, where each sampled client performs multiple (e.g., E) steps of local SGD before sending its model update back to the server (e.g., [3], [4], [15]–[19]). For synchronous FL, the per-round time is limited by the slowest sampled client (known as the straggler). Thus, the per-round time T^{(r)}(q) of the FL process is

    T^{(r)}(q) := \max_{i ∈ K(q)^r} { t_i }.    (4)

Therefore, the total learning time T_tot(q, R) after R rounds is

    T_tot(q, R) = \sum_{r=1}^{R} T^{(r)}(q) = \sum_{r=1}^{R} \max_{i ∈ K(q)^r} { t_i }.    (5)

C. Problem Formulation

Our goal is to minimize the expected total learning time E[T_tot(q, R)] while ensuring that the expected global loss E[F(w^R(q))] converges to the minimum value F* with ε-precision, where w^R(q) is the aggregated global model after R rounds with client sampling probabilities q. This translates into the following problem:

    P1:  \min_{q, R}  E[T_tot(q, R)]
         s.t.  E[F(w^R(q))] − F* ≤ ε,
               \sum_{i=1}^{N} q_i = 1,    (6)
               q_i > 0, ∀i ∈ N,  R ∈ Z^+.

The expectations in E[T_tot(q, R)] and E[F(w^R(q))] in (6) are due to the randomness in the client sampling q and in local SGD. Solving Problem P1, however, is challenging in two aspects:
1) It is generally impossible to find out how q and R affect the final model w^R(q) and the corresponding loss E[F(w^R(q))] before actually training the model. Hence, we need to obtain an analytical expression with respect to q and R to predict how they affect w^R(q) and E[F(w^R(q))].
2) The objective E[T_tot(q, R)] is complicated to optimize due to the straggling effect in (5), which can result in a non-convex optimization problem even in the simplest cases, as we will show later.

In Section IV and Section V, we address these two challenges, respectively, and propose approximate algorithms to find an approximate solution to Problem P1 efficiently.

Algorithm 1: FL with Arbitrary Client Sampling
  Input: Sampling probabilities q = {q_1, . . . , q_N}, K, E, precision ε, initial model w^0
  Output: Final model parameter w^R
  1: for r ← 0, 1, 2, . . . , R do
  2:   The server randomly samples a subset of clients K(q)^r according to q and sends the current global model w^r to the selected clients;    // Sampling
  3:   Each sampled client i sets w_i^{r,0} ← w^r, performs w_i^{r,j+1} ← w_i^{r,j} − η^r ∇F_i(w_i^{r,j}, ξ_i^{r,j}) for j = 0, 1, . . . , E−1, and sets w_i^{r+1} ← w_i^{r,E};    // Computation
  4:   Each sampled client i sends the updated model w_i^{r+1} back to the server;    // Communication
  5:   The server computes the new global model parameter as w^{r+1} ← w^r + \sum_{i ∈ K(q)^r} \frac{p_i}{K q_i} (w_i^{r+1} − w^r);    // Aggregation
IV. CONVERGENCE BOUND FOR ARBITRARY SAMPLING
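Before turning to the results below, it helps to see the arbitrary-sampling aggregation of Algorithm 1 in code. The following is a minimal NumPy sketch of one round on toy quadratic client objectives F_i(w) = ½‖w − c_i‖² (the dimensions, weights, and objectives are illustrative assumptions, not the paper's setup). The point is line 5 of Algorithm 1: each sampled client's model delta is scaled by p_i/(K q_i), which keeps the aggregated update an unbiased estimate of the full-participation update Σ_i p_i (w_i^{r+1} − w^r) for any sampling distribution q.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, E, eta, d = 5, 2, 3, 0.1, 4             # clients, draws/round, local steps, lr, model dim

p = rng.dirichlet(np.ones(N))                  # data-size weights p_i (illustrative)
q = rng.dirichlet(np.ones(N))                  # sampling probabilities q_i (illustrative)
c = rng.normal(size=(N, d))                    # per-client optimum of F_i(w) = 0.5*||w - c_i||^2

w = np.zeros(d)                                # global model w^r
sampled = rng.choice(N, size=K, p=q)           # K draws WITH replacement: the multiset K(q)^r

update = np.zeros(d)
for i in sampled:                              # a client drawn twice contributes twice
    w_i = w.copy()
    for _ in range(E):                         # E local gradient steps on F_i
        w_i -= eta * (w_i - c[i])
    update += p[i] / (K * q[i]) * (w_i - w)    # line 5: importance-weighted model delta

w_next = w + update                            # new global model w^{r+1}
```

Taking the expectation over a single draw i ∼ q gives E[(p_i/(K q_i))(w_i − w)] = (1/K) Σ_i p_i (w_i − w); summing over the K draws recovers the full-participation update, which is the unbiasedness property the convergence analysis relies on.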
TABLE I
Wall-clock time for reaching the target loss under different sampling schemes.

| Setup | proposed sampling | statistical sampling | weighted sampling | uniform sampling | full participation |
|---|---|---|---|---|---|
| Prototype Setup (EMNIST dataset) | 733.2 s | 2095.0 s (2.9×) | 2221.7 s (3.0×) | 2691.5 s (3.7×)† | 2748.4 s (3.7×) |
| Simulation Setup 1 (Synthetic dataset) | 445.5 s | 952.4 s (2.1×) | 940.2 s (2.1×) | 933.8 s (2.1×) | 1526.8 s (3.4×) |
| Simulation Setup 2 (MNIST dataset) | 245.5 s | 373.8 s (1.5×) | 542.9 s (2.2×) | NA | 898.1 s (3.7×) |

† "3.7×" is the wall-clock time ratio of uniform sampling over proposed sampling for reaching the target loss, i.e., proposed sampling takes 73% less time than uniform sampling.

[Figure 3: three panels — (a) loss vs. wall-clock time, (b) accuracy vs. wall-clock time, (c) loss vs. number of rounds — comparing the sampling schemes.]
Fig. 3. Performance of Prototype Setup with logistic regression model, EMNIST dataset, uniform communication time, and target loss 1.19.
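The wall-clock gaps in Table I follow from the per-round model in (4): with K with-replacement draws from sorted delays t_1 ≤ · · · ≤ t_N, P(T^{(r)} ≤ t_i) = (Σ_{j≤i} q_j)^K, so E[T^{(r)}(q)] has a closed form. The sketch below (the delay values are hypothetical, not the measured ones) shows how shifting probability mass away from a straggler shrinks the expected round time:

```python
import numpy as np

def expected_round_time(t_sorted, q, K):
    """E[max_{i in K(q)^r} t_i] for K with-replacement draws from q (t_sorted ascending)."""
    cdf = np.cumsum(q)                                    # P(one draw has delay <= t_i)
    prev = np.concatenate(([0.0], cdf[:-1]))
    p_slowest = cdf**K - prev**K                          # P(the round's max delay equals t_i)
    return float(np.dot(t_sorted, p_slowest))

t = np.array([1.0, 1.0, 1.0, 1.0, 10.0])                  # one straggler among five clients
uniform = np.full(5, 0.2)
deprioritized = np.array([0.24, 0.24, 0.24, 0.24, 0.04])  # down-weight the straggler
print(round(expected_round_time(t, uniform, K=3), 3))        # 5.392
print(round(expected_round_time(t, deprioritized, K=3), 3))  # 2.037
```

Of course, concentrating all mass on the fastest clients would bias learning toward their data; the paper's optimization trades this per-round speedup against the extra rounds caused by statistical heterogeneity.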
adopted the widely used MNIST dataset and EMNIST dataset [6]. For the synthetic dataset, we follow a similar setup to that in [18], which generates 60-dimensional random vectors as input data. We adopt both the convex multinomial logistic regression model and the non-convex convolutional neural network (CNN) model with the LeNet-5 architecture [44].

3) Implementation: We consider three experimental setups.
• Prototype Setup: We conduct the first experiment on the prototype system using logistic regression and the EMNIST dataset. To generate a heterogeneous data partition, similar to [6], we randomly subsample 33,036 lower-case character samples from the EMNIST dataset and distribute them among N = 40 edge devices in an unbalanced (i.e., different devices have different numbers of data samples, following a power-law distribution) and non-i.i.d. fashion (i.e., each device has a randomly chosen number of classes, ranging from 1 to 10).12
• Simulation Setup 1: We conduct the second experiment in the simulated system using logistic regression and the Synthetic dataset. To simulate a heterogeneous setting, we use the non-i.i.d. Synthetic (1, 1) setting. We generate 20,509 data samples and distribute them among N = 100 clients in an unbalanced power-law distribution.
• Simulation Setup 2: We conduct the third experiment in the simulated system using CNN and the MNIST dataset, where we randomly subsample 15,129 data samples from MNIST and distribute them among N = 100 clients in an unbalanced (following the power-law distribution) and non-i.i.d. (i.e., each device has 1–6 classes) fashion.

12 The number of samples and the number of classes are randomly matched, such that clients with more data samples may not have more classes.

4) Training Parameters: For all experiments, we initialize our model with w0 = 0 and use an SGD batch size of b = 24. We use an initial learning rate of η0 = 0.1 with a decay rate of η0/(1+r), where r is the communication round index. We adopt similar FedAvg settings as in [4], [18], [25], [32]: we sample 10% of all clients in each round, i.e., K = 4 for the Prototype Setup and K = 10 for the Simulation Setups, with each client performing E = 50 local iterations.13

13 We also conduct experiments on both the Prototype and Simulation Setups with varying E and K, which show performance similar to the experiments in this paper; due to page limitations, we do not illustrate them all.

5) Heterogeneous System Parameters: For the Prototype Setup, to enable heterogeneous communication times, we control clients' communication bandwidth and generate a uniform distribution ti ∼ U(0.187, 7.159) seconds, with a mean of 3.648 seconds and a standard deviation of 2.071 seconds. For the simulation system, we generate the client transmission delays with an exponential distribution, i.e., ti ∼ exp(1) seconds, with both mean and standard deviation equal to 1 second.
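Under the simulation delay model t_i ∼ exp(1), the straggling effect of uniform sampling is easy to quantify: if a round's K sampled delays are modeled as i.i.d. exp(1) draws, the expected per-round time equals the harmonic number H_K = Σ_{k=1}^{K} 1/k (a standard property of exponential order statistics), i.e., about 2.93 s for K = 10 versus the 1 s mean delay. A quick sanity check (our own illustration, not an experiment from the paper):

```python
import numpy as np

K = 10
h_k = sum(1.0 / k for k in range(1, K + 1))        # E[max of K i.i.d. exp(1)] = H_10 ≈ 2.929

rng = np.random.default_rng(1)
rounds = rng.exponential(scale=1.0, size=(200_000, K))
simulated = rounds.max(axis=1).mean()               # Monte Carlo estimate of E[per-round time]

print(f"analytic: {h_k:.3f} s, simulated: {simulated:.3f} s")
```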
[Figure 4: three panels — (a) loss vs. wall-clock time, (b) accuracy vs. wall-clock time, (c) loss vs. number of rounds — comparing the sampling schemes.]
Fig. 4. Performance of Simulation Setup 1 with logistic regression model, Synthetic (1, 1) dataset, exponential communication time, and target loss 0.78.

[Figure 5: three panels — (a) loss vs. wall-clock time, (b) accuracy vs. wall-clock time, (c) loss vs. number of rounds — comparing the sampling schemes.]
Fig. 5. Performance of Simulation Setup 2 with CNN model, MNIST dataset, exponential communication time, and target loss 0.1.
B. Performance Results

We evaluate the wall-clock time performance of both the global training loss and the test accuracy of the aggregated model in each round for all sampling schemes. We average each experiment over 50 independent runs. For a fair comparison, we use the same random seed to compare sampling schemes in a single run and vary random seeds across different runs.

Figs. 3–5 show the results of the Prototype Setup, Simulation Setup 1, and Simulation Setup 2, respectively. We summarize the key observations as follows.

1) Loss with Wall-clock Time: As predicted by our theory, Figs. 3(a)–5(a) show that our proposed sampling scheme achieves the same target loss in significantly less time than the baseline sampling schemes. Specifically, for the Prototype Setup in Fig. 3(a), our proposed sampling scheme spends around 73% less time than full sampling and uniform sampling, and around 66% less time than weighted sampling and statistical sampling, for reaching the same target loss. Fig. 5(a) highlights that our proposed sampling works well with the non-convex CNN model, under which naive uniform sampling cannot reach the target loss within 900 seconds, indicating the importance of a careful client sampling design. Table I summarizes the superior wall-clock time performance of our proposed sampling scheme for reaching the target loss in all three setups.

2) Accuracy with Wall-clock Time: As shown in Figs. 3(b)–5(b), our proposed sampling scheme achieves the target test accuracy14 much faster than the other benchmarks. Notably, for Simulation Setup 1 with the target test accuracy of 75.3% in Fig. 4(b), our proposed sampling scheme takes around 70% less time than full sampling and around 46% less time than the other sampling schemes. We can also observe the superior test accuracy performance of our proposed sampling scheme in the Prototype Setup and the non-convex Simulation Setup 2 in Fig. 3(b) and Fig. 5(b), respectively.

14 In Fig. 3(b), Fig. 4(b), and Fig. 5(b), the target test accuracy corresponds to the test accuracy at the point where our proposed scheme reaches the target loss.

3) Loss with Number of Rounds: Figs. 3(c)–5(c) show that our proposed sampling scheme requires more training rounds to reach the target loss than the baseline statistical sampling and full participation schemes. This observation is expected, since our proposed sampling scheme aims to minimize the wall-clock time instead of the number of rounds. Nevertheless, we notice that statistical sampling performs better than the other sampling schemes, which verifies Corollary 1, since the performance of loss with respect to the number of rounds is equivalent to that with respect to wall-clock time for homogeneous systems.

VII. CONCLUSION AND FUTURE WORK

In this work, we studied the optimal client sampling strategy that addresses the system and statistical heterogeneity in FL to minimize the wall-clock convergence time. We obtained a new tractable convergence bound for FL algorithms with arbitrary client sampling probabilities. Based on the bound, we formulated a non-convex wall-clock time minimization problem. We developed an efficient algorithm to learn the unknown parameters in the convergence bound and designed a low-complexity algorithm to approximately solve the non-convex problem. Our solution characterizes the interplay between clients' communication delays (system heterogeneity) and data importance (statistical heterogeneity), and their impact on the optimal client sampling design. Experimental results validated the superiority of our proposed scheme compared to several baselines in speeding up the wall-clock convergence time.
REFERENCES

[1] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., "Advances and open problems in federated learning," arXiv preprint arXiv:1912.04977, 2019.
[2] Q. Yang, Y. Liu, T. Chen, and Y. Tong, "Federated machine learning: Concept and applications," ACM Transactions on Intelligent Systems and Technology, vol. 10, no. 2, pp. 1–19, 2019.
[3] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[4] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečnỳ, S. Mazzocchi, H. B. McMahan et al., "Towards federated learning at scale: System design," in Proceedings of Machine Learning and Systems (MLSys), 2019.
[5] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 50–60, 2020.
[6] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," in Proceedings of Machine Learning and Systems (MLSys), 2020.
[7] H. Yu, S. Yang, and S. Zhu, "Parallel restarted SGD for non-convex optimization with faster convergence and less communication," in Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
[8] H. Yu, R. Jin, and S. Yang, "On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization," in Proceedings of the International Conference on Machine Learning (ICML). PMLR, 2019, pp. 7184–7193.
[9] J. Wang and G. Joshi, "Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD," in Proceedings of Machine Learning and Systems (MLSys), 2019.
[10] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, "Practical secure aggregation for federated learning on user-held data," in NeurIPS Workshop on Private Multi-Party Machine Learning, 2016.
[11] B. Avent, A. Korolova, D. Zeber, T. Hovden, and B. Livshits, "BLENDER: Enabling local search with a hybrid differential privacy model," in USENIX Security Symposium (USENIX Security), 2017, pp. 747–764.
[12] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," in NeurIPS Workshop on Private Multi-Party Machine Learning, 2016.
[13] M. Zhang, E. Wei, and R. Berry, "Faithful edge federated learning: Scalability and privacy," IEEE Journal on Selected Areas in Communications, vol. 39, no. 12, pp. 3790–3804, 2021.
[14] P. Sun, H. Che, Z. Wang, Y. Wang, T. Wang, L. Wu, and H. Shao, "Pain-FL: Personalized privacy-preserving incentive for federated learning," IEEE Journal on Selected Areas in Communications, vol. 39, no. 12, pp. 3805–3820, 2021.
[15] F. Haddadpour and M. Mahdavi, "On the convergence of local descent methods in federated learning," arXiv preprint arXiv:1910.14425, 2019.
[16] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, "SCAFFOLD: Stochastic controlled averaging for on-device federated learning," arXiv preprint arXiv:1910.06378, 2019.
[17] H. Yang, M. Fang, and J. Liu, "Achieving linear speedup with partial worker participation in non-IID federated learning," arXiv preprint arXiv:2101.11203, 2021.
[18] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of FedAvg on non-IID data," in Proceedings of the International Conference on Learning Representations (ICLR), 2019.
[19] Z. Qu, K. Lin, J. Kalagnanam, Z. Li, J. Zhou, and Z. Zhou, "Federated learning's blessing: FedAvg has linear speedup," arXiv preprint arXiv:2007.05690, 2020.
[20] A. Reisizadeh, I. Tziotis, H. Hassani, A. Mokhtari, and R. Pedarsani, "Straggler-resilient federated learning: Leveraging the interplay between statistical accuracy and system heterogeneity," arXiv preprint arXiv:2012.14453, 2020.
[21] P. Zhao and T. Zhang, "Stochastic optimization with importance sampling for regularized loss minimization," in Proceedings of the International Conference on Machine Learning (ICML), 2015, pp. 1–9.
[22] D. Needell, R. Ward, and N. Srebro, "Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm," in Advances in Neural Information Processing Systems, 2014, pp. 1017–1025.
[23] G. Alain, A. Lamb, C. Sankar, A. Courville, and Y. Bengio, "Variance reduction in SGD by distributed importance sampling," arXiv preprint arXiv:1511.06481, 2015.
[24] S. Gopal, "Adaptive sampling for SGD by exploiting side information," in Proceedings of the International Conference on Machine Learning (ICML). PMLR, 2016, pp. 364–372.
[25] W. Chen, S. Horvath, and P. Richtarik, "Optimal client sampling for federated learning," arXiv preprint arXiv:2010.13723, 2020.
[26] E. Rizk, S. Vlaski, and A. H. Sayed, "Federated learning under importance sampling," arXiv preprint arXiv:2012.07383, 2020.
[27] H. T. Nguyen, V. Sehwag, S. Hosseinalipour, C. G. Brinton, M. Chiang, and H. V. Poor, "Fast-convergent federated learning," IEEE Journal on Selected Areas in Communications, vol. 39, no. 1, pp. 201–218, 2021.
[28] Y. J. Cho, J. Wang, and G. Joshi, "Client selection in federated learning: Convergence analysis and power-of-choice selection strategies," arXiv preprint arXiv:2010.01243, 2020.
[29] N. H. Tran, W. Bao, A. Zomaya, M. N. H. Nguyen, and C. S. Hong, "Federated learning over wireless networks: Optimization model design and analysis," in Proceedings of the IEEE Conference on Computer Communications (INFOCOM), 2019, pp. 1387–1395.
[30] M. Chen, H. V. Poor, W. Saad, and S. Cui, "Convergence time optimization for federated learning over wireless networks," IEEE Transactions on Wireless Communications, vol. 20, no. 4, pp. 2457–2471, 2020.
[31] W. Shi, S. Zhou, Z. Niu, M. Jiang, and L. Geng, "Joint device scheduling and resource allocation for latency constrained wireless federated learning," IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 453–467, 2021.
[32] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in Proceedings of the IEEE International Conference on Communications (ICC), 2019, pp. 1–7.
[33] Z. Chai, A. Ali, S. Zawad, S. Truex, A. Anwar, N. Baracaldo, Y. Zhou, H. Ludwig, F. Yan, and Y. Cheng, "TiFL: A tier-based federated learning system," in Proceedings of the International Symposium on High-Performance Parallel and Distributed Computing, 2020, pp. 125–136.
[34] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, "Scheduling policies for federated learning in wireless networks," IEEE Transactions on Communications, vol. 68, no. 1, pp. 317–333, 2019.
[35] Y. Jin, L. Jiao, Z. Qian, S. Zhang, S. Lu, and X. Wang, "Resource-efficient and convergence-preserving online participant selection in federated learning," in Proceedings of the IEEE International Conference on Distributed Computing Systems (ICDCS), 2020.
[36] B. Luo, X. Li, S. Wang, J. Huang, and L. Tassiulas, "Cost-effective federated learning in mobile edge networks," IEEE Journal on Selected Areas in Communications, vol. 39, no. 12, pp. 3606–3621, 2021.
[37] H. Wang, Z. Kaplan, D. Niu, and B. Li, "Optimizing federated learning on non-IID data with reinforcement learning," in Proceedings of the IEEE Conference on Computer Communications (INFOCOM), 2020, pp. 1698–1707.
[38] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, 2019.
[39] B. Luo, X. Li, S. Wang, J. Huang, and L. Tassiulas, "Cost-effective federated learning design," in Proceedings of the IEEE Conference on Computer Communications (INFOCOM), 2021, pp. 1–10.
[40] Y. Tu, Y. Ruan, S. Wagle, C. G. Brinton, and C. Joe-Wong, "Network-aware optimization of distributed learning for fog computing," in Proceedings of the IEEE Conference on Computer Communications (INFOCOM), 2020.
[41] S. Wang, M. Lee, S. Hosseinalipour, R. Morabito, M. Chiang, and C. G. Brinton, "Device sampling for heterogeneous federated learning: Theory, algorithms, and implementation," in Proceedings of the IEEE Conference on Computer Communications (INFOCOM), 2021.
[42] S. U. Stich, "Local SGD converges fast and communicates little," in Proceedings of the International Conference on Learning Representations (ICLR), 2018.
[43] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[44] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.