Fast-Convergent Federated Learning With Adaptive Weighting
Hongda Wu and Ping Wang
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, VOL. 7, NO. 4, DECEMBER 2021
Abstract—Federated learning (FL) enables resource-constrained edge nodes to collaboratively learn a global model under the orchestration of a central server while keeping privacy-sensitive data locally. The non-independent-and-identically-distributed (non-IID) data samples across participating nodes slow model training and impose additional communication rounds for FL to converge. In this paper, we propose the Federated Adaptive Weighting (FedAdp) algorithm, which aims to accelerate model convergence in the presence of nodes with non-IID datasets. Through theoretical and empirical analysis, we observe an implicit connection between the contribution of a node to the global model aggregation and the data distribution on that node. We then propose to adaptively assign different weights for updating the global model in each training round based on node contribution. The contribution of a participating node is first measured by the angle between its local gradient vector and the global gradient vector, and the weight is then quantified by a designed non-linear mapping function. This simple yet effective strategy reinforces positive (and suppresses negative) node contributions dynamically, drastically reducing the number of communication rounds. Its superiority over the commonly adopted Federated Averaging (FedAvg) is verified both theoretically and experimentally. With extensive experiments performed in PyTorch and PySyft, we show that FL training with FedAdp can reduce the number of communication rounds by up to 54.1% on the MNIST dataset and up to 45.4% on the FashionMNIST dataset, as compared to the FedAvg algorithm.

Index Terms—Federated learning, communication efficiency, mobile edge computing, Internet of Things.

Manuscript received November 30, 2020; revised March 20, 2021; accepted May 20, 2021. Date of publication May 27, 2021; date of current version December 9, 2021. This work was supported by the Canada NSERC Discovery Grant under Grant RGPIN-2019-06375. The associate editor coordinating the review of this article and approving it for publication was M. Pan. (Corresponding author: Ping Wang.) The authors are with the Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Toronto, ON M3J 1P3, Canada (e-mail: hwu1226@eecs.yorku.ca; pingw@yorku.ca). Digital Object Identifier 10.1109/TCCN.2021.3084406

I. INTRODUCTION

THE RAPID advancement of edge devices (e.g., Internet of Things (IoT) devices, mobile phones) is constantly generating an unprecedented amount of data [1]. These devices are currently equipped with enhanced sensors, computing, and communication capability. Coupled with the rise of Deep Learning (DL) [2], edge devices unfold countless opportunities for various tasks of modern society, e.g., road congestion prediction [3] and environmental monitoring [4].

In traditional cloud-centric approaches, data generated and collected by edge devices is uploaded to and processed in a data center. Since it is predicted that the data generation rate will exceed the capacity of today's Internet in the near future [5], Mobile Edge Computing (MEC) has naturally been proposed to move data processing outside the cloud [6], [7]. With computing and storage capability, MEC systems generally adopt an end-edge-server architecture. Multiple edge servers are capable of performing large-scale distributed tasks involving local processing and remote execution under the coordination of a remote cloud. MEC approaches strike a compromise between training efficiency and communication cost by bringing model training towards where the data is generated. However, computation offloading and data processing at the edge server still involve the transmission of sensitive data.

In either centralized cloud training or MEC approaches, collecting raw data for model training is unrealistic from a privacy, security, regulatory, or necessity perspective. To keep privacy-sensitive data local and to facilitate collaborative machine learning (ML) among distributed nodes, Federated Learning (FL) has emerged as an attractive paradigm, in which local nodes collaboratively train a task model under the orchestration of a central server without the server accessing end-user data [8], [9]. In FL, local nodes cooperatively train an ML model required by the central server by utilizing their local data. Since only local model updates are transferred to the central server for model aggregation, and a global model is acquired for local training rather than raw data being sent, user data privacy is well protected. As such, FL differs from conventional approaches in data acquisition, storage, and training. FL has been deployed by major service providers and plays an important role in supporting privacy-sensitive applications, including computer vision, natural language processing, and medical databases [10].

Even though the FL approach shows good convergence performance, communication cost becomes a critical bottleneck in the FL context, owing to the limited connectivity of wireless networks, the availability of local nodes, and stragglers among participating nodes, since several iterations are generally required before the model converges [8]–[10]. Another fundamental challenge for FL is strongly non-independent-and-identically-distributed (non-IID) and highly skewed data across local nodes. The presence of non-IID data significantly degrades the performance of federated learning: it makes model training take more rounds to converge, and the variance caused by non-IID data brings instability to the training process [11]–[13]. Since the completion time of federated learning is largely determined by the communication time, reducing the number of communication rounds needed for model convergence in FL, especially with participating nodes holding non-IID datasets, is an urgent problem to address.
In this paper, to surmount the slow convergence of vanilla Federated Averaging (FedAvg) [8] in the presence of non-IID datasets, we propose the Federated Adaptive Weighting (FedAdp) algorithm, which aims to improve the performance of federated learning by assigning a distinct weight to each participating node for updating the global model. We observe that nodes with heterogeneous datasets make different contributions to the global model aggregation. Therefore, our main intuition is to measure the contribution of each participating node based on the gradient information from local nodes, and then assign different weights accordingly and adaptively at each communication round for global model aggregation. By accounting for node contribution, the proposed adaptive weighting strategy is capable of reducing the expected training loss of FL in each communication round in the presence of non-IID nodes, which accelerates model convergence. Our main contributions in this paper are as follows:
• We identify that the presence of nodes with non-independent-and-identically-distributed (non-IID) data distributions slows the convergence speed of federated learning. In addition, we analyze gradient-descent based federated learning from a theoretical perspective and derive a convergence bound that incorporates both the non-IID data distribution across participating nodes and the weighting strategy for model updating.
• We observe an implicit connection between the data distribution on a node and the contribution from that node to the global model aggregation, measured at the central server side by inferring the gradient information of participating nodes. The convergence bound is lowered, and the convergence speed is accelerated, by a carefully designed weighting strategy, formalized as Federated Adaptive Weighting (FedAdp), which assigns different weights to nodes for global model aggregation in each round of communication.
• We empirically evaluate the performance of the proposed weighting algorithm via extensive experiments using different real datasets with different learning objectives (i.e., convex and non-convex loss functions). Our experimental results show that FL training with FedAdp can drastically reduce the number of communication rounds compared with the commonly adopted FedAvg algorithm.

The rest of this paper is organized as follows. Section II discusses the related works. Section III provides the preliminaries of federated learning and the impact of non-IID data on FL. In Section IV, the convergence analysis and the proposed weighting algorithm are presented. Experimental results are shown in Section V. Section VI presents the conclusion.

II. RELATED WORK

Generally, the FL algorithm adopts synchronous aggregation and randomly selects a subset of nodes to participate in each round, to avoid long-tailed waiting time due to network uncertainty and stragglers. To boost convergence and reduce the number of communication rounds, tuning the number of local updates [8], [13], [14], [15] and selecting appropriate nodes for FL training [12], [16], [17] are the usually adopted approaches.

In particular, McMahan et al. [8] presented the vanilla Federated Averaging (FedAvg) algorithm, which increases the number of local updates instead of updating the local model once per round. Li et al. [13] proposed to allow participating nodes to perform a variable number of local updates, rather than applying the same amount of workload to each node [8], so as to overcome system heterogeneity. Similar to [13], the authors in [15] also used the local accuracy of participating nodes, determined by their limited computing resources, as an index to steer the number of local updates performed. Different from [13], [15], the work in [14] proposed an analytical model to dynamically adapt the number of local updates between two consecutive global aggregations in real time, minimizing the learning loss under a fixed resource budget of the edge computing system. Regarding node selection, Nishio and Yonetani [16] proposed the FedCS algorithm to perform node selection intentionally rather than randomly, based on the resource conditions of local nodes. The authors in [17] utilized gradient information for node selection: a node whose gradient vector has a negative inner product with the global gradient vector is excluded from FL training.

To handle the non-IID data distribution, Zhao et al. [11] quantified the weight divergence by the earth mover's distance between the data distribution on nodes and the population distribution. However, the strategy of pushing a small set of uniformly distributed data to participating nodes in [11] violates the privacy premise of FL and imposes extra communication cost. It was proposed in [12] that communication rounds can be reduced effectively by selecting nodes based on their uploaded model weights, which profile the data distribution on those nodes. In contrast, Wang et al. [18] proposed to identify, at the node side, the irrelevant updates caused by differing data distributions. The communication cost is accordingly reduced by precluding nodes with irrelevant updates before update transmission. However, local nodes are required to check relevance in each round using the global model kept from the previous round, which is in contravention of the FL principle and brings computational burdens to local nodes.

Regarding the weighting strategy, the authors in [19] proposed to adaptively assign different weights for global model aggregation by considering the time difference at which a model update is completed in a layerwise asynchronous manner. Chai et al. [20] designed a tier-based FL system by dividing the participating nodes into tiers according to their response time, and devised a scheme to adaptively assign weights to different tiers for model aggregation, since the updating frequency differs across tiers. Both methods in [19], [20] aim to weigh local updates across different communication rounds.

To enhance the convergence of FL in the presence of non-IID nodes, and different from [11], [12], which measure model weights, we find that nodes contribute differently to the global model aggregation owing to their different data distributions, and that there exists an implicit connection between data distribution and gradient information. In this paper, we propose to measure the node contribution quantitatively by the angle between the local gradient of each participating node and the global gradient across all participating nodes at the server side.
With the quantified contribution, the weight for aggregating the global model can be devised discriminatively across the nodes and adaptively in each round according to node contribution. The proposed adaptive weighting strategy can effectively speed up the convergence of FL in the presence of non-IID data. Different from [17], [18], our method does not impose additional communication or computation burden on local nodes. Besides, our adaptive weighting strategy is applied within each communication round, which is orthogonal to the methods proposed in [19], [20].

III. PRELIMINARIES

In this section, we briefly introduce the key ingredients behind the recent method for federated learning, FedAvg, and show how non-IID data impacts model convergence.

A. Standard Federated Learning

In general, federated learning methods [8], [10] are designed to handle the consensus learning task in a decentralized manner, where a central server coordinates the global learning objective and multiple devices train the local model with locally collected data. In particular, assume that we have N local nodes with datasets \mathcal{D}_1, \ldots, \mathcal{D}_i, \ldots, \mathcal{D}_N, and we define D_i \triangleq |\mathcal{D}_i| as the number of data samples owned by each node, where |\cdot| denotes the cardinality of a set. FL methods aim to minimize:

$$\min_w F(w) \triangleq \sum_{i=1}^{N} \psi_i F_i(w), \qquad (1)$$

where w is the global model weight, \psi_i = D_i / \sum_{i'=1}^{N} D_{i'} is the weight for aggregation in FL training, and the global objective function F(w) is surrogated by the local objective functions F_i(w), which are defined, as an example, in the context of a C-class classification problem hereinafter. In particular, the C-class classification problem is defined over a feature space \mathcal{X} and a label space \mathcal{Y} = [C], where [C] = \{1, \ldots, C\}. For each labeled data sample \{x, y\}, the predicted probability vector \hat{y} is obtained using a mapping function f : \mathcal{X} \to \hat{\mathcal{Y}}, where \hat{\mathcal{Y}} = \{\hat{y} \mid \sum_{j=1}^{C} \hat{y}_j = 1, \hat{y}_j \ge 0, \forall j \in [C]\}. As such, F_i(w) commonly measures the local empirical risk over the possibly different data distribution p^{(i)} of node i, which is defined using the cross entropy for C-class classification as follows,

$$\min_w F_i(w) \triangleq \mathbb{E}_{x,y \sim p^{(i)}}\Big[-\sum_{j=1}^{C} \mathbf{1}_{y=j} \log f_j(x, w)\Big] = -\sum_{j=1}^{C} p^{(i)}(y=j)\, \mathbb{E}_{x \mid y=j}\big[\log f_j(x, w)\big], \qquad (2)$$

where f_j(x, w) denotes the probability that the data sample x is classified as the j-th class given model w, and p^{(i)}(y=j) denotes the data distribution on node i over class j ∈ [C].

In the general federated learning setting (e.g., FedAvg), the participating nodes perform local training with the same training configuration (e.g., optimizer, learning rate, etc.). At each communication round t, a subset of the nodes S_t, with |S_t| = K ≤ N, is selected, and the global model w(t−1) from the previous iteration is sent to the selected nodes. Each participating node i performs stochastic gradient descent (SGD) training to optimize its local objective F_i(w):

$$w_i(t) = w(t-1) - \eta \nabla F_i(w(t-1)), \qquad (3)$$

where \eta is the learning rate and \nabla F_i(\cdot) is the gradient at node i. Equation (3) gives a general principle of SGD optimization. w_i(t) could be the result after one or several local updates of SGD (e.g., \tau = 1 in FedSGD [8] or \tau > 1 in FedAvg [8], [14], with \tau denoting the number of local updates between two consecutive global rounds). Hereinafter, SGD is applied to mini-batches of data samples of size \bar{B}. As such, the local model is updated \tau = \frac{D_i}{\bar{B}} E times, where D_i and E are the number of training samples on node i and the number of local training epochs, respectively.

The nodes then communicate their local model updates \Delta_i(t) = w_i(t) - w(t-1) to the central server,^1 which aggregates them and updates the global model accordingly,

$$\Delta(t) = \sum_{i=1}^{|S_t|} \psi_i \Delta_i(t), \qquad w(t) = w(t-1) + \Delta(t). \qquad (4)$$

^1 Typically there are two ways for nodes to upload their local models to the server: either by uploading the model parameters w_i(t) or by uploading the model difference \Delta_i(t). Although the same amount of data is sent in both ways, conveying \Delta_i(t) is proven to be more amenable to compression [9].

B. FedAvg for Non-IID Data

The independent and identically distributed (IID) sampling condition on training data is important so that the stochastic gradient is an unbiased estimate of the full gradient [14]. FedAvg is shown to be effective given that the data distribution across different nodes is the same as that of centrally collected data. However, the data distribution, determined by the usage patterns of local nodes, is typically non-IID, i.e., p^{(i)} differs across participating nodes.

Since the local objective F_i(w) is closely related to the data distribution p^{(i)}, a large number of local updates lead the model towards the optima of its local objective F_i(w) as opposed to the global objective F(w). The inconsistency between the local models w_i and the global model w accumulates along with local training, leading to more communication rounds before training converges. As such, local training with multiple local updates potentially hurts convergence and can even lead to divergence in the presence of non-IID data [8], [11].

We conduct an experiment to demonstrate the impact of non-IID data on model convergence. We train a two-layer CNN model with the same neural network architecture as in [8] using PyTorch on the MNIST dataset (containing 60,000 samples from 10 classes) until the model achieves 95% test accuracy. 10 nodes are selected, each with 600 samples that are selected according to its label criteria. If a node is in the IID setting, its 600 samples are randomly selected over the whole training set. If a node is in the x-class non-IID setting, its 600 samples are randomly selected over a subset composed of x classes of data samples, where each of the x classes is selected at random.
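To make the setup above concrete, the following is a minimal NumPy sketch of the x-class non-IID partition and of one FedAvg-style round following Eqs. (3)–(4). It is an illustrative sketch rather than the authors' released code: the function names, the local_grad_fn hook, and all default values are assumptions.

```python
import numpy as np

def x_class_partition(labels, num_nodes=10, samples_per_node=600, x=2,
                      num_classes=10, seed=0):
    """Give each node `samples_per_node` examples drawn from only `x` randomly
    chosen classes (x = num_classes reproduces the IID setting)."""
    rng = np.random.default_rng(seed)
    node_indices = []
    for _ in range(num_nodes):
        classes = rng.choice(num_classes, size=x, replace=False)  # classes this node sees
        pool = np.where(np.isin(labels, classes))[0]              # sample indices of those classes
        node_indices.append(rng.choice(pool, size=samples_per_node, replace=False))
    return node_indices

def fedavg_round(w_global, node_data, local_grad_fn, lr=0.01, tau=1):
    """One communication round of Eqs. (3)-(4): tau local SGD steps per node,
    then aggregation with weights proportional to local data size."""
    deltas, sizes = [], []
    for X, y in node_data:
        w = w_global.copy()
        for _ in range(tau):                      # local SGD steps, Eq. (3)
            w = w - lr * local_grad_fn(w, X, y)
        deltas.append(w - w_global)               # Delta_i(t) = w_i(t) - w(t-1)
        sizes.append(len(y))
    psi = np.asarray(sizes, dtype=float) / sum(sizes)    # psi_i = D_i / sum of D_i'
    delta = sum(p * d for p, d in zip(psi, deltas))      # Eq. (4)
    return w_global + delta
```

For example, with the MNIST labels loaded as a NumPy array, x_class_partition(labels, x=1) produces the 1-class non-IID split used in the experiment above, while x=10 recovers the IID case.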
communication round, which suggests assigning different weights \psi_i to different nodes for the global model aggregation. As such, the corresponding objective is formally stated as enlarging

$$\mathbb{E}_{i|t}\left[\frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}\right] = \sum_{i=1}^{|S_t|} \frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}\cdot \psi_i(t)$$

by designing \psi_i under the inherent constraint \sum_{i=1}^{|S_t|} \psi_i(t) = 1, \psi_i(t) \ge 0, \forall i, t.

Considering that the node contribution is measured by (7), a natural weighting design aiming to enlarge this expectation should follow the criterion that nodes with higher contribution deserve higher weights for aggregation in each global round. We characterize this contribution-regulated weighting strategy, applied adaptively to the global aggregation in each global round, as Federated Adaptive Weighting (FedAdp).

Assigning adaptive weights for updating the global model in the proposed FedAdp algorithm includes two steps:

1) Non-Linear Mapping Function: We design a non-linear mapping function to first quantify the contribution of each node based on its angle information. Inspired by the sigmoid function, we use a variant of the Gompertz function [21], which is a non-linear decreasing function defined as

$$f\big(\tilde{\theta}_i(t)\big) = \alpha\Big(1 - e^{-e^{-\alpha(\tilde{\theta}_i(t) - 1)}}\Big), \qquad (9)$$

where \tilde{\theta}_i(t) is the smoothed angle in radians, e denotes the exponential constant, and \alpha is a constant explained in the following.

The designed mapping function has several properties that are important for the subsequent weight calculation:
• \lim_{\tilde{\theta}_i(t) \to \pi/2} f(\tilde{\theta}_i(t)) = \epsilon, where \epsilon \propto \frac{1}{\alpha} is a constant;
• \lim_{\tilde{\theta}_i(t) \to \upsilon} f(\tilde{\theta}_i(t)) = \alpha, where \upsilon \propto \alpha is a constant.

\alpha controls the decreasing rate of f(\tilde{\theta}_i(t)) from \alpha to \epsilon as \tilde{\theta}_i(t) increases from \upsilon to \pi/2. For example, a small \alpha \in \mathbb{Z}^+ indicates a lower decreasing rate of f(\tilde{\theta}_i(t)), which decreases from \alpha to \epsilon \propto \frac{1}{\alpha} as \tilde{\theta}_i(t) increases from \upsilon \propto \alpha to \pi/2. As \alpha increases, the gap between a small angle and a large angle is amplified (e.g., f(\tilde{\theta}_i(t)) changes within a relatively large range [\epsilon, \alpha] as \tilde{\theta}_i(t) increases within the range [\upsilon, \pi/2]), and so is the difference in contribution among those nodes. However, keeping increasing \alpha is not consistently effective in distinguishing the difference in contributions across nodes. Since \upsilon is proportional to \alpha, a large \alpha narrows the interval [\upsilon, \frac{\pi}{2}] over which the node contribution is actually differentiated, making the contributions of nodes whose angles lie within [0, \upsilon] indistinguishable. The choice of \alpha is empirically verified in Section V-B.

2) Weighting: After mapping the contribution from the smoothed angle of each node, we use the Softmax function to finally calculate the weights of the participating nodes for global model aggregation as follows:

$$\psi_i(t) = \begin{cases} \dfrac{e^{f(\tilde{\theta}_i(t))}}{\sum_{i'=1}^{|S_t|} e^{f(\tilde{\theta}_{i'}(t))}}, & D_m = D_n,\ \forall m, n \in S_t \\[2ex] \dfrac{D_i\, e^{f(\tilde{\theta}_i(t))}}{\sum_{i'=1}^{|S_t|} D_{i'}\, e^{f(\tilde{\theta}_{i'}(t))}}, & D_m \ne D_n,\ \exists m, n \in S_t. \end{cases} \qquad (10)$$

From the first line of (10), if all the participating nodes have the same number of data samples, the proposed FedAdp algorithm assigns weights solely based on their contributions, quantified by e^{f(\tilde{\theta}_i(t))}. From the second line of (10), FedAdp assigns weights based on both the contribution and the data size.

Remark 4: Different from FedAvg, where the weight for aggregation is solely proportional to the size of the local dataset (e.g., \psi_i = D_i / \sum_{i'=1}^{|S_t|} D_{i'}), FedAdp takes both the data size and the node contribution into consideration when assigning weights for model aggregation.

The reason for adopting the Softmax function is twofold: i) the output of the Softmax function is a normalized value, with a larger angle corresponding to a smaller weight; ii) using the Softmax function, each node's contribution can be reinforced or suppressed, depending on the smoothed angle between its gradient and the global gradient.

The complete procedure of the proposed FedAdp algorithm is presented in Algorithm 1, and FedAdp with the adaptive weighting strategy leads to the following theorem.

Algorithm 1 Federated Adaptive Weighting (FedAdp)
procedure FEDERATEDOPTIMIZATION
Input: node set S, E, \bar{B}, T, \eta
1: Server initializes the global model w(0), the global update \Delta(0), and the smoothed angle \tilde{\theta}_i(0), i \in S
2: for t = 1, \ldots, T − 1 do
3:   for node i \in S_t in parallel do
4:     \Delta_i(t) \leftarrow LOCALUPDATE(i, w(t − 1))
5:   w(t) \leftarrow GLOBALUPDATE(\Delta_1(t), \Delta_2(t), \ldots, \Delta_{|S_t|}(t))
procedure LOCALUPDATE
Input: node index i, model w(t − 1)
6: Perform \tau = \frac{D_i}{\bar{B}} E local SGD updates with step size \eta on F_i(w) and obtain w_i(t) using (3)
7: Calculate the model difference \Delta_i(t) = w_i(t) − w(t − 1)
8: return \Delta_i(t)
procedure GLOBALUPDATE
Input: local updates \Delta_1(t), \Delta_2(t), \ldots, \Delta_{|S_t|}(t)
9: Calculate the global gradient \nabla F(w(t)) = \sum_{i=1}^{|S_t|} \big(D_i / \sum_{i'=1}^{|S_t|} D_{i'}\big) \nabla F_i(w(t)), where \nabla F_i(w(t)) = -\Delta_i(t)/\eta
10: Calculate the instantaneous angle \theta_i(t) by (7)
11: Update the smoothed angle \tilde{\theta}_i(t) by (8)
12: Calculate the weights for model aggregation by (9), (10)
13: Update the global model w(t) = w(t − 1) + \sum_{i=1}^{|S_t|} \psi_i(t) \Delta_i(t)
14: return w(t)

Theorem 2: FedAdp with the weight design \psi_i(t) in (10) achieves a tighter bound on the FL loss decrease in Theorem 1 than FedAvg with weight \psi_i = D_i / \sum_{i'} D_{i'}.

The proof of Theorem 2 is presented in Appendix B.

Compared to FedAvg, FedAdp adopts a simple yet effective strategy that measures node contribution by quantifying the correlation between the local gradient and the global gradient. The weights for the global model update can be adaptively assigned based on node contribution rather than by even averaging, which results in a greater FL loss reduction in each global round and consequently accelerates model convergence, as confirmed by our experimental results.
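The server-side weighting of FedAdp can be summarized in a few lines. The sketch below assumes flattened NumPy gradient vectors and recovers local gradients as −Δ_i(t)/η (step 9 of Algorithm 1); since Eq. (8) for the smoothed angle is not reproduced in this excerpt, a simple running average over rounds is used as a stand-in, and that choice is an assumption.

```python
import numpy as np

def gompertz_map(theta, alpha=5.0):
    """Non-linear mapping of Eq. (9): decreasing in the smoothed angle theta (radians)."""
    return alpha * (1.0 - np.exp(-np.exp(-alpha * (theta - 1.0))))

def fedadp_weights(deltas, data_sizes, lr, theta_smooth, t, alpha=5.0):
    """Compute FedAdp aggregation weights psi_i(t) from the local updates Delta_i(t)."""
    grads = [-d / lr for d in deltas]                     # grad_i = -Delta_i(t) / eta
    sizes = np.asarray(data_sizes, dtype=float)
    g_global = sum((s / sizes.sum()) * g for s, g in zip(sizes, grads))

    # instantaneous angle between each local gradient and the global gradient
    cos = np.array([np.dot(g, g_global) /
                    (np.linalg.norm(g) * np.linalg.norm(g_global) + 1e-12) for g in grads])
    theta_inst = np.arccos(np.clip(cos, -1.0, 1.0))

    # smoothed angle: running average over rounds (placeholder for Eq. (8))
    if theta_smooth is None:
        theta_smooth = theta_inst
    else:
        theta_smooth = (t / (t + 1.0)) * theta_smooth + (1.0 / (t + 1.0)) * theta_inst

    f = gompertz_map(theta_smooth, alpha)                 # Eq. (9)
    scores = np.exp(f) if np.all(sizes == sizes[0]) else sizes * np.exp(f)
    psi = scores / scores.sum()                           # Softmax-style weights, Eq. (10)
    return psi, theta_smooth
```

The returned psi can then be plugged into the aggregation step w(t) = w(t − 1) + Σ_i psi_i Δ_i(t), replacing the size-only weights of FedAvg.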
Fig. 3. Test accuracy over communication rounds of FedAdp and FedAvg with heterogeneous data distribution over participating nodes using MLR model.
Upper and lower subplots correspond to training performance on MNIST and FashionMNIST datasets, respectively.
V. EVALUATION AND ANALYSIS

To evaluate the performance of our proposed adaptive weighting algorithm, we implemented FedAdp with the PyTorch framework and the PySyft library, and studied the image classification task. We evaluated FedAdp by training typical convex and non-convex learning models on two datasets: MNIST and FashionMNIST. Similar to the experiment in Section III-B, when different degrees of skewness of the non-IID dataset are present, we first investigate how FedAdp outperforms FedAvg [8] by assigning adaptive weights for model aggregation. Note that our proposed algorithm does not rely on the presence of IID nodes and can be applied to a general scenario with data heterogeneity, as verified in Section V-A. Then, the choice of \alpha for the non-linear mapping in FedAdp is discussed in Section V-B. Finally, by tracking the divergence of gradients on participating nodes, we show that FedAdp alleviates the impact brought by data heterogeneity compared to FedAvg, which is beneficial for reducing the FL loss in each round and accelerating FL model convergence, as discussed in Section V-C. We briefly describe our experiment settings as follows.

We consider a Multinomial Logistic Regression^3 (MLR) model and a CNN model^4 to represent convex and non-convex learning objectives, respectively. We use the number of communication rounds for the FL model to reach a target test accuracy as the performance metric. Unless otherwise specified, the target accuracy is set to 95% for training on MNIST and 80% for training on FashionMNIST. The number of participating nodes is |S_t| = 10, with D_i = 600, \bar{B} = 50 for MLR and \bar{B} = 32 for CNN, E = 1, T = 300, \eta = 0.01 with decay rate 0.995, and the constant in the non-linear mapping function set to \alpha = 5. The skewness of the dataset is measured by the x-class non-IID setting. The datasets for the nodes are generated in the same way as in Section III-B.

^3 For the MLR model, the input is a flattened 784-dimensional (28 × 28) image, and the output is a class label between 0 and 9. Note that the MLR model can be extended to the strongly-convex setting by adding a regularization term [22].
^4 The CNN has 7 layers with the following structure: 5 × 5 × 32 Convolutional → 2 × 2 MaxPool → 5 × 5 × 64 Convolutional → 2 × 2 MaxPool → 1024 × 512 Fully connected → 512 × 10 Fully connected → Softmax (1,663,370 total parameters). All Convolutional and Fully connected layers use ReLU activation. The configuration is similar to [8].

A. Data Heterogeneity

We investigate different numbers of non-IID nodes with different skewness levels of non-IID data to test the efficiency of FedAdp. For non-IID data, two skewness cases, x = 1, 2, are considered. We plot the test accuracy versus the communication rounds of federated learning in Fig. 3 and Fig. 4 when the MLR and CNN models are adopted, respectively.

1) MLR Model: Given that the learning capability of MLR is limited, instead of setting a target accuracy, we simply train the model over 50 global rounds. We plot the test accuracy versus the communication rounds of the federated learning algorithms in Fig. 3. From Fig. 3, we can see that FedAdp always outperforms FedAvg when nodes with non-IID datasets are present. In addition, FedAdp converges very fast in the early training stage, and the superiority of FedAdp is more prominent when the proportion of nodes with non-IID datasets is larger. It is noted that the gap between FedAdp and FedAvg over 50 global rounds is not conspicuous because of the simplicity of the MLR model: different weighting strategies will not make much difference when the model is reaching its learning capability. In contrast, the weighting strategy will consistently impact the FL training process when a more complex neural network model is applied, as shown in the following experiment.

2) CNN Model: We plot the test accuracy versus the communication rounds of federated learning in Fig. 4. From Fig. 4, we can see that FedAdp always outperforms FedAvg when nodes with non-IID datasets are present. In particular, FedAdp converges very fast in the early training stage, since the gradient divergence is more obvious in the initial rounds, which makes the effect of assigning adaptive weights for updating the global model even more significant.

To measure the effectiveness of FedAdp, we count the number of communication rounds needed to reach a target accuracy when FedAdp is adopted.
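For reference, a PyTorch sketch of a CNN matching the stated parameter count is given below. Footnote 4 lists a 1024 × 512 fully connected layer, but with 28 × 28 inputs the quoted total of 1,663,370 parameters (which matches the CNN in [8]) corresponds to same-padded convolutions and a 3136 → 512 layer, so that variant is assumed here.

```python
import torch.nn as nn

class MnistCNN(nn.Module):
    """Two 5x5 conv layers (32 and 64 channels), each followed by 2x2 max pooling,
    then two fully connected layers; softmax is applied by the loss at training time."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = MnistCNN()
print(sum(p.numel() for p in model.parameters()))  # 1,663,370
```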
Fig. 4. Test accuracy over communication rounds of FedAdp and FedAvg with heterogeneous data distribution over participating nodes using CNN model.
Upper and lower subplots correspond to training performance on MNIST and FashionMNIST datasets, respectively.
TABLE I: Number of communication rounds to reach a target accuracy for FedAdp versus FedAvg [8], within 300 rounds. N/A refers to the case where an algorithm cannot reach the target accuracy before termination, in which case the highest test accuracy is shown.
Each entry in Table I shows the number of communication rounds necessary to achieve a test accuracy of 95% for the CNN on MNIST and 80% on FashionMNIST. The bold numbers indicate the better result achieved by FedAdp, as compared to FedAvg. FedAdp decreases the number of communication rounds by up to 54.1% and 43.2% for the MNIST task when non-IID nodes are at the 1-class and 2-class non-IID settings, respectively. For the FashionMNIST task, the corresponding decreases are up to 43.7% and 45.4%, respectively. In the cases where the target accuracy is not reachable within 300 rounds, FedAdp always terminates with a higher test accuracy.

So far, two extreme skewness cases, x = 1, 2, have been considered, but the superiority of the proposed weighting strategy is not limited to extreme cases. To verify the proposed weighting strategy in a more general data heterogeneity scenario, we consider the CNN model on the MNIST dataset in the following two cases.
• Case 1: The number of classes of data samples owned by node i, denoted by x_i, is randomly selected from the set {1, 2, ..., 10} without overlapping. Thereafter, the data samples on each node are randomly selected from the x_i-class subset of the training dataset.
• Case 2: For half of the nodes, x_i (i.e., the number of classes of data samples) is selected following the uniform distribution U(1, 5), whereas for the other half, x_i follows the uniform distribution U(6, 10). The data samples on each node are randomly selected from the x_i-class subset of the training dataset.

From Fig. 5, we can see that FedAdp outperforms FedAvg in both cases. In both cases, the convergence performance is worse than the results in Fig. 4 because the number of IID nodes is small and the local dissimilarity is greater. However, it is clear that, by measuring node contribution, FedAdp is more rapid in reducing the FL loss in each global round and thus accelerates model convergence, even without the participation of IID nodes.
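The per-node class counts x_i for the two cases above can be generated as in the following short sketch (assuming ten nodes and treating U(1, 5) and U(6, 10) as discrete uniform draws, which is an interpretation of the description rather than the authors' exact code); the resulting x_i can then be fed to the x-class partition sketched in Section III-B.

```python
import numpy as np

def class_counts_case1(num_nodes=10, seed=0):
    # Case 1: each node gets a distinct class count, a random permutation of {1,...,10}
    rng = np.random.default_rng(seed)
    return rng.permutation(np.arange(1, num_nodes + 1))

def class_counts_case2(num_nodes=10, seed=0):
    # Case 2: half of the nodes draw x_i from {1,...,5}, the other half from {6,...,10}
    rng = np.random.default_rng(seed)
    half = num_nodes // 2
    return np.concatenate([rng.integers(1, 6, size=half),
                           rng.integers(6, 11, size=num_nodes - half)])
```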
Therefore,

$$\|w(t+1) - w(t)\|^2 = \big\|\mathbb{E}_{i|t}[w_i(t+1) - w(t)]\big\|^2 = \eta^2 \big\|\mathbb{E}_{i|t}[\nabla F_i(w(t))]\big\|^2 \overset{(1)}{\le} \eta^2\, \mathbb{E}_{i|t}\big[\|\nabla F_i(w(t))\|^2\big], \qquad (A4)$$

where inequality (1) holds by the Cauchy-Schwarz inequality.

• Bounding \langle \nabla F(w(t)), w(t+1) - w(t)\rangle: Again, by the definition of the global aggregation for w(t+1) and (A3), we have

$$\langle \nabla F(w(t)), w(t+1) - w(t)\rangle = -\eta\, \mathbb{E}_{i|t}\big[\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle\big]. \qquad (A5)$$

The expectation term in (A5) can be further rewritten as

$$\mathbb{E}_{i|t}\big[\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle\big] = \mathbb{E}_{i|t}\left[\frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}\cdot \|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|\right] \overset{(2)}{\ge} \mathbb{E}_{i|t}\left[\frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}\cdot \frac{\|\nabla F_i(w(t))\|^2}{B}\right], \qquad (A6)$$

where inequality (2) comes from Assumption 2 that the local dissimilarity is upper bounded by B.

Plugging (A6) into (A5), the last two terms on the right-hand side of (A1) are expressed as

$$\langle \nabla F(w(t)), w(t+1) - w(t)\rangle + \frac{\beta}{2}\|w(t+1) - w(t)\|^2 \le -\eta\, \mathbb{E}_{i|t}\left[\frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}\cdot \frac{\|\nabla F_i(w(t))\|^2}{B}\right] + \frac{\beta\eta^2}{2}\, \mathbb{E}_{i|t}\big[\|\nabla F_i(w(t))\|^2\big] \overset{(3)}{\le} -\eta\, \mathbb{E}_{i|t}\left[\frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|} - \frac{B\beta\eta}{2}\right]\cdot \frac{A^2}{B}\, \|\nabla F(w(t))\|^2, \qquad (A7)$$

where inequality (3) holds because of Assumption 2 that the local dissimilarity is lower bounded by A.

Finally, Theorem 1 is proved by substituting (A7) into (A1).

APPENDIX B
PROOF OF THEOREM 2

We consider the general case that participating nodes have different numbers of data samples. For node i with data size D_i, we create D_i virtual nodes, each with a unit sample size. Hereinafter, we use the index (i, j), j \in \{1, \ldots, D_i\}, to denote the j-th virtual node split from participating node i, i \in S_t, where the gradient information is kept on virtual nodes as on the participating node (e.g., \nabla F_{i,j}(w(t)) = \nabla F_i(w(t)), \theta_{i,j} = \theta_i). As such, all virtual nodes split from node i share the same weight (i.e., \psi_{i,j}(t) = \psi_{i,k}(t), \forall j, k \in \{1, \ldots, D_i\}), where \psi_{i,j}(t) denotes the weight for virtual node (i, j). The weight of node i is \psi_i(t) = \sum_{j=1}^{D_i} \psi_{i,j}(t) = D_i \psi_{i,j}(t).

From (7), \theta_{i,j} = \theta_i monotonically decreases with \frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}. From (9), f(\cdot) is a decreasing function of \theta. Thus, since \psi_{i,j}(t) = \frac{e^{f(\theta_{i,j}(t))}}{\sum_{i'=1}^{|S_t|} D_{i'}\, e^{f(\theta_{i'}(t))}}, we can see that \psi_{i,j}(t) monotonically increases with \frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}. Therefore, a generic \psi_{i,j}(t) satisfies the following criterion

$$\psi_{i,j}(t) \propto \frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}, \qquad \psi_{i,j}(t) \ge 0\ \ \forall i, j, t, \qquad \sum_{i=1}^{|S_t|}\sum_{j=1}^{D_i} \psi_{i,j}(t) = \sum_{i=1}^{|S_t|} \psi_i(t) = 1, \qquad (B1)$$

with the corresponding bound of the expected loss being

$$F(w(t+1)) \le F(w(t)) - \eta \sum_{i=1}^{|S_t|} \psi_i(t) \left[\frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|} - \frac{B\beta\eta}{2}\right]\cdot \frac{A^2}{B}\, \|\nabla F(w)\|^2, \qquad (B2)$$

where \psi_i(t) is defined as in (10).

In order to compare the expected loss achieved by FedAdp and FedAvg, one can simply measure the expectation term in (5). We use u_{i,j} to denote the contribution from virtual node j of participating node i to the model aggregation. In each global round, we sort the contributions of all the virtual nodes, measured by the correlation \frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|} between the local gradient and the global gradient, in descending order, that is, u_{1,1} = u_{1,2} = \cdots = u_{1,D_1} \ge u_{2,1} = u_{2,2} = \cdots = u_{2,D_2} \ge \cdots \ge u_{|S_t|,1} = u_{|S_t|,2} = \cdots = u_{|S_t|,D_{|S_t|}}. Apparently, the weights assigned to the virtual nodes in FedAdp should follow the same order, \psi_{1,1} = \psi_{1,2} = \cdots = \psi_{1,D_1} \ge \psi_{2,1} = \psi_{2,2} = \cdots = \psi_{2,D_2} \ge \cdots \ge \psi_{|S_t|,1} = \psi_{|S_t|,2} = \cdots = \psi_{|S_t|,D_{|S_t|}}, with \sum_i \sum_j \psi_{i,j} = 1. As such, by Chebyshev's inequality [23], we have the following hold for any u_{m,j}, u_{n,j},

$$\bar{\psi}\,\big(u_{m,j} - u_{n,j}\big)\left(\frac{\psi_{m,j}}{\bar{\psi}_{m,j}} - \frac{\psi_{n,j}}{\bar{\psi}_{n,j}}\right) \ge 0 \iff \bar{\psi}\big(u_{m,j}\,\psi_{m,j}\,\bar{\psi}_{n,j} + u_{n,j}\,\psi_{n,j}\,\bar{\psi}_{m,j}\big) \ge \bar{\psi}\big(u_{m,j}\,\psi_{n,j}\,\bar{\psi}_{m,j} + u_{n,j}\,\psi_{m,j}\,\bar{\psi}_{n,j}\big), \qquad (B3)$$

where \bar{\psi} = \bar{\psi}_{m,j} = \bar{\psi}_{n,j} = \frac{1}{D} denotes the weight of FedAvg for all virtual nodes, with D = \sum_{i=1}^{|S_t|} D_i.

Adding all the D^2 inequalities, we have

$$\bar{\psi}\left[\sum_{m=1}^{|S_t|}\sum_{j=1}^{D_m}\sum_{n=1}^{|S_t|}\sum_{j=1}^{D_n} \big(u_{m,j}\,\psi_{m,j}\,\bar{\psi}_{n,j} + u_{n,j}\,\psi_{n,j}\,\bar{\psi}_{m,j}\big)\right] \ge \bar{\psi}\left[\sum_{m=1}^{|S_t|}\sum_{j=1}^{D_m}\sum_{n=1}^{|S_t|}\sum_{j=1}^{D_n} \big(u_{m,j}\,\psi_{n,j}\,\bar{\psi}_{m,j} + u_{n,j}\,\psi_{m,j}\,\bar{\psi}_{n,j}\big)\right]$$
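The Chebyshev-type comparison above can be checked numerically: when the weights are ordered in the same way as the contributions (and both weight vectors sum to the same total), the weighted sum of contributions is never smaller than under the uniform FedAvg weight 1/D. The values below are randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 12                                            # total number of unit-size virtual nodes
u = np.sort(rng.random(D))[::-1]                  # contributions, descending order
psi = np.sort(rng.dirichlet(np.ones(D)))[::-1]    # FedAdp-style weights, same order, sum to 1
psi_bar = np.full(D, 1.0 / D)                     # FedAvg weight: 1/D per virtual node

# similarly ordered weights do at least as well as uniform weights
assert np.dot(psi, u) >= np.dot(psi_bar, u) - 1e-12
print(np.dot(psi, u), np.dot(psi_bar, u))
```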
REFERENCES

[1] K. L. Lueth. (Aug. 2019). State of the IoT 2018: Number of IoT Devices Now at 7B—Market Accelerating. [Online]. Available: https://iot-analytics.com/state-of-the-iot-update-q1-q2-2018-number-of-iot-devices-now-7b/
[2] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[3] O.-E.-K. Aktouf, T. Zhang, J. Gao, and T. Uehara, "Testing location-based function services for mobile applications," in Proc. IEEE Symp. Serv. Orient. Syst. Eng. (SOSE), 2015, pp. 308–314.
[4] R. K. Ganti, F. Ye, and H. Lei, "Mobile crowdsensing: Current state and future challenges," IEEE Commun. Mag., vol. 49, no. 11, pp. 32–39, Nov. 2011.
[5] M. Chiang and T. Zhang, "Fog and IoT: An overview of research opportunities," IEEE Internet Things J., vol. 3, no. 6, pp. 854–864, Dec. 2016.
[6] Z. Xiong, Y. Zhang, D. Niyato, P. Wang, and Z. Han, "When mobile blockchain meets edge computing," IEEE Commun. Mag., vol. 56, no. 8, pp. 33–39, Aug. 2018.
[7] X. Wang, Y. Han, V. C. M. Leung, D. Niyato, X. Yan, and X. Chen, "Convergence of edge computing and deep learning: A comprehensive survey," IEEE Commun. Surveys Tuts., vol. 22, no. 2, pp. 869–904, 2nd Quart., 2020.
[8] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. Artif. Intell. Statist. Conf. (AISTATS), 2017, pp. 1273–1282.
[9] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," 2016. [Online]. Available: arXiv:1610.05492.

Hongda Wu (Student Member, IEEE) received the M.A.Sc. degree in electrical engineering from the Communication University of China in 2019. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering and Computer Science, York University, Canada. His research interests include federated learning, reinforcement learning, wireless networks, and the Internet of Things.

Ping Wang (Senior Member, IEEE) received the Bachelor and Master degrees in electrical and computer engineering from the Huazhong University of Science and Technology, in 1994 and 1997, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Waterloo, Canada, in 2008. She joined York University as an Associate Professor in August 2018. Prior to that, she worked with Nanyang Technological University, Singapore, from 2008 to July 2018. Her research interests are mainly in wireless communication networks, cloud computing, and the Internet of Things. Her scholarly works have been widely disseminated through top-ranked IEEE journals/conferences and received the Best Paper Awards from the IEEE Wireless Communications and Networking Conference in 2012 and 2020, from the IEEE Communication Society: Green Communications and Computing Technical Committee in 2018, and from the IEEE International Conference on Communications in 2007.