Fast-Convergent Federated Learning With Adaptive Weighting
Hongda Wu and Ping Wang
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, VOL. 7, NO. 4, DECEMBER 2021
Abstract—Federated learning (FL) enables resource-constrained edge nodes to collaboratively learn a global model under the orchestration of a central server while keeping privacy-sensitive data locally. The non-independent-and-identically-distributed (non-IID) data samples across participating nodes slow model training and impose additional communication rounds for FL to converge. In this paper, we propose the Federated Adaptive Weighting (FedAdp) algorithm, which aims to accelerate model convergence in the presence of nodes with non-IID datasets. Through theoretical and empirical analysis, we observe an implicit connection between the contribution of a node to the global model aggregation and the data distribution on that node. We then propose to adaptively assign different weights for updating the global model in each training round based on node contribution. The contribution of a participating node is first measured by the angle between its local gradient vector and the global gradient vector, and the weight is then quantified by a designed non-linear mapping function. This simple yet effective strategy reinforces positive (and suppresses negative) node contributions dynamically, drastically reducing the number of communication rounds. Its superiority over the commonly adopted Federated Averaging (FedAvg) is verified both theoretically and experimentally. With extensive experiments performed in PyTorch and PySyft, we show that FL training with FedAdp can reduce the number of communication rounds by up to 54.1% on the MNIST dataset and up to 45.4% on the FashionMNIST dataset, as compared to the FedAvg algorithm.

Index Terms—Federated learning, communication efficiency, mobile edge computing, Internet of Things.

Manuscript received November 30, 2020; revised March 20, 2021; accepted May 20, 2021. Date of publication May 27, 2021; date of current version December 9, 2021. This work was supported by the Canada NSERC Discovery Grant under Grant RGPIN-2019-06375. The associate editor coordinating the review of this article and approving it for publication was M. Pan. (Corresponding author: Ping Wang.) The authors are with the Department of Electrical Engineering and Computer Science, Lassonde School of Engineering, York University, Toronto, ON M3J 1P3, Canada (e-mail: hwu1226@eecs.yorku.ca; pingw@yorku.ca). Digital Object Identifier 10.1109/TCCN.2021.3084406

I. INTRODUCTION

THE RAPID advancement of edge devices (e.g., Internet of Things (IoT) devices, mobile phones) is constantly generating an unprecedented amount of data [1]. These devices are currently equipped with enhanced sensors, computing, and communication capability. Coupled with the rise of Deep Learning (DL) [2], edge devices unfold countless opportunities for various tasks of modern society, e.g., road congestion prediction [3] and environmental monitoring [4].

In traditional cloud-centric approaches, data generated and collected by edge devices is uploaded to and processed in a data center. Since it is predicted that the data generation rate will exceed the capacity of today's Internet in the near future [5], Mobile Edge Computing (MEC) has naturally been proposed to move data processing outside the cloud [6], [7]. With computing and storage capability, MEC systems generally adopt an end-edge-server architecture. Multiple edge servers are capable of performing large-scale distributed tasks involving local processing and remote execution under the coordination of a remote cloud. MEC approaches strike a compromise between training efficiency and communication cost by bringing model training towards where the data is generated. However, computation offloading and data processing at the edge server still involve the transmission of sensitive data.

In either centralized cloud training or MEC approaches, collecting raw data for model training is unrealistic from a privacy, security, regulatory, or necessity perspective. To keep privacy-sensitive data local and to facilitate collaborative machine learning (ML) among distributed nodes, Federated Learning (FL) has emerged as an attractive paradigm, in which local nodes collaboratively train a task model under the orchestration of a central server without the server accessing end-user data [8], [9]. In FL, local nodes cooperatively train an ML model required by the central server by utilizing their local data. Since only local model updates are transferred to the central server for model aggregation, and a global model is acquired for local training rather than raw data being sent, user data privacy is well protected. As such, FL differs from conventional approaches in data acquisition, storage, and training. FL has been deployed by major service providers and plays an important role in supporting privacy-sensitive applications, including computer vision, natural language processing, and medical databases [10].

Even though the FL approach shows good convergence performance, communication cost becomes a critical bottleneck in the FL context, owing to the limited connectivity of wireless networks, the availability of local nodes, and stragglers among participating nodes, since several iterations are generally required before the model converges [8]–[10]. Another fundamental challenge for FL is strongly non-independent-and-identically-distributed (non-IID) and highly skewed data across local nodes. The presence of non-IID data significantly degrades the performance of federated learning: it makes model training take more rounds to converge, and the variance caused by non-IID data brings instability to the training process [11]–[13]. Since the completion time of federated learning is largely determined by the communication time, reducing the number of communication rounds needed for model convergence in FL, especially with participating nodes holding non-IID datasets, is an urgent problem to address.
In this paper, to surmount the slow convergence of vanilla Federated Averaging (FedAvg) [8] in the presence of non-IID datasets, we propose the Federated Adaptive Weighting (FedAdp) algorithm, which aims to improve the performance of federated learning by assigning a distinct weight to each participating node for updating the global model. We observe that nodes with heterogeneous datasets make different contributions to the global model aggregation. Therefore, our main intuition is to measure the contribution of each participating node based on the gradient information from local nodes, and then assign different weights accordingly and adaptively at each communication round for global model aggregation. By accounting for node contribution, the proposed adaptive weighting strategy is capable of reducing the expected training loss of FL in each communication round in the presence of non-IID nodes, which accelerates model convergence. Our main contributions in this paper are as follows:
• We identify that the presence of nodes with non-independent-and-identically-distributed (non-IID) data distributions slows the convergence speed of federated learning. In addition, we analyze gradient-descent based federated learning from a theoretical perspective and derive a convergence bound that incorporates both the non-IID data distribution across participating nodes and the weighting strategy for model updating.
• We observe an implicit connection between the data distribution on a node and the contribution from that node to the global model aggregation, measured at the central server side by inferring the gradient information of participating nodes. The convergence bound is lowered, and the convergence speed is accelerated, by a carefully designed weighting strategy, formalized as Federated Adaptive Weighting (FedAdp), which assigns different weights to nodes for global model aggregation in each round of communication.
• We empirically evaluate the performance of the proposed weighting algorithm via extensive experiments using different real datasets with different learning objectives (i.e., convex and non-convex loss functions). Our experimental results show that FL training with FedAdp can drastically reduce the number of communication rounds compared with the commonly adopted FedAvg algorithm.

The rest of this paper is organized as follows. Section II discusses the related works. Section III provides the preliminaries of federated learning and the impact of non-IID data on FL. In Section IV, the convergence analysis and the proposed weighting algorithm are presented. Experimental results are shown in Section V. Section VI presents the conclusion.

II. RELATED WORK

Generally, the FL algorithm adopts synchronous aggregation and randomly selects a subset of nodes to participate in each round, to avoid long-tailed waiting time due to network uncertainty and stragglers. To boost convergence and reduce the number of communication rounds, tuning the number of local updates [8], [13], [14], [15] and selecting appropriate nodes for FL training [12], [16], [17] are the usually adopted approaches.

In particular, McMahan et al. [8] presented the vanilla Federated Averaging (FedAvg) algorithm, which increases the number of local updates instead of updating the local model once per round. Li et al. [13] proposed to allow participating nodes to perform a variable number of local updates, rather than applying the same amount of workload to each node [8], so as to overcome system heterogeneity. Similar to [13], the authors in [15] also used the local accuracy of participating nodes, determined by their limited computing resources, as an index to steer the number of local updates performed. Different from [13], [15], the work in [14] proposed an analytical model to dynamically adapt the number of local updates between two consecutive global aggregations in real time, minimizing the learning loss under a fixed resource budget of the edge computing system. Regarding node selection, Nishio and Yonetani [16] proposed the FedCS algorithm to perform node selection intentionally rather than randomly, based on the resource conditions of local nodes. The authors in [17] utilized gradient information for node selection: a node whose gradient vector has a negative inner product with the global gradient vector is excluded from FL training.

To handle the non-IID data distribution, Zhao et al. [11] quantified the weight divergence by the earth mover's distance between the data distribution on nodes and the population distribution. However, the strategy of pushing a small set of uniformly distributed data to participating nodes in [11] violates the privacy premise of FL and imposes extra communication cost. It was proposed in [12] that communication rounds can be reduced effectively by selecting nodes based on their uploaded model weights, which profile the data distribution on those nodes. In contrast, Wang et al. [18] proposed to identify, at the node side, the irrelevant updates caused by differing data distributions. The communication cost is accordingly reduced by precluding nodes with irrelevant updates before update transmission. However, local nodes are required to check relevance in each round using the global model kept from the previous round, which is in contravention of the FL principle and brings computational burdens to local nodes.

Regarding the weighting strategy, the authors in [19] proposed to adaptively assign different weights for global model aggregation by considering the time difference at which a model update is completed in a layerwise asynchronous manner. Chai et al. [20] designed a tier-based FL system by dividing the participating nodes into tiers according to their response time, and devised a scheme to adaptively assign weights to different tiers for model aggregation, since the updating frequency differs across tiers. Both methods in [19], [20] aim to weigh local updates across different communication rounds.

To enhance the convergence of FL in the presence of non-IID nodes, and different from [11], [12], which measure model weights, we find that nodes contribute differently to the global model aggregation owing to their different data distributions, and that there exists an implicit connection between data distribution and gradient information. In this paper, we propose to measure the node contribution quantitatively by the angle between the local gradient of each participating node and the global gradient across all participating nodes at the server side.
With the quantified contribution, the weight for aggregating the global model can be devised discriminatively across the nodes and adaptively in each round according to node contribution. The proposed adaptive weighting strategy can effectively speed up the convergence of FL in the presence of non-IID data. Different from [17], [18], our method does not impose additional communication or computation burden on local nodes. Besides, our adaptive weighting strategy is applied within each communication round, which is orthogonal to the methods proposed in [19], [20].

III. PRELIMINARIES

In this section, we briefly introduce the key ingredients behind the recent method for federated learning, FedAvg, and show how non-IID data impacts model convergence.

A. Standard Federated Learning

In general, federated learning methods [8], [10] are designed to handle the consensus learning task in a decentralized manner, where a central server coordinates the global learning objective and multiple devices train the local model with locally collected data. In particular, assume that we have N local nodes with datasets \mathcal{D}_1, \ldots, \mathcal{D}_i, \ldots, \mathcal{D}_N, and we define D_i \triangleq |\mathcal{D}_i| as the number of data samples owned by each node, where |\cdot| denotes the cardinality of a set. FL methods aim to minimize:

$$\min_w F(w) \triangleq \sum_{i=1}^{N} \psi_i F_i(w), \qquad (1)$$

where w is the global model weight, \psi_i = D_i / \sum_{i'=1}^{N} D_{i'} is the weight for aggregation in FL training, and the global objective function F(w) is surrogated by the local objective functions F_i(w), which are defined, as an example, in the context of a C-class classification problem hereinafter. In particular, the C-class classification problem is defined over a feature space \mathcal{X} and a label space \mathcal{Y} = [C], where [C] = \{1, \ldots, C\}. For each labeled data sample \{x, y\}, the predicted probability vector \hat{y} is obtained using a mapping function f : \mathcal{X} \to \hat{\mathcal{Y}}, where \hat{\mathcal{Y}} = \{\hat{y} \mid \sum_{j=1}^{C} \hat{y}_j = 1, \hat{y}_j \ge 0, \forall j \in [C]\}. As such, F_i(w) commonly measures the local empirical risk over the possibly different data distribution p^{(i)} of node i, which is defined using the cross entropy for C-class classification as follows,

$$\min_w F_i(w) \triangleq \mathbb{E}_{x,y \sim p^{(i)}}\Big[-\sum_{j=1}^{C} \mathbf{1}_{y=j} \log f_j(x, w)\Big] = -\sum_{j=1}^{C} p^{(i)}(y=j)\, \mathbb{E}_{x \mid y=j}\big[\log f_j(x, w)\big], \qquad (2)$$

where f_j(x, w) denotes the probability that the data sample x is classified as the j-th class given model w, and p^{(i)}(y=j) denotes the data distribution on node i over class j ∈ [C].

In the general federated learning setting (e.g., FedAvg), the participating nodes perform local training with the same training configuration (e.g., optimizer, learning rate, etc.). At each communication round t, a subset of the nodes S_t, with |S_t| = K ≤ N, is selected, and the global model w(t−1) from the previous iteration is sent to the selected nodes. Each participating node i performs stochastic gradient descent (SGD) training to optimize its local objective F_i(w):

$$w_i(t) = w(t-1) - \eta \nabla F_i(w(t-1)), \qquad (3)$$

where \eta is the learning rate and \nabla F_i(\cdot) is the gradient at node i. Equation (3) gives a general principle of SGD optimization. w_i(t) could be the result after one or several local updates of SGD (e.g., \tau = 1 in FedSGD [8] or \tau > 1 in FedAvg [8], [14], with \tau denoting the number of local updates between two consecutive global rounds). Hereinafter, SGD is applied to mini-batches of data samples of size \bar{B}. As such, the local model is updated \tau = \frac{D_i}{\bar{B}} E times, where D_i and E are the number of training samples on node i and the number of local training epochs, respectively.

The nodes then communicate their local model updates \Delta_i(t) = w_i(t) - w(t-1) to the central server,^1 which aggregates them and updates the global model accordingly,

$$\Delta(t) = \sum_{i=1}^{|S_t|} \psi_i \Delta_i(t), \qquad w(t) = w(t-1) + \Delta(t). \qquad (4)$$

^1 Typically there are two ways for nodes to upload their local models to the server: either by uploading the model parameters w_i(t) or by uploading the model difference \Delta_i(t). Although the same amount of data is sent in both ways, conveying \Delta_i(t) is proven to be more amenable to compression [9].

B. FedAvg for Non-IID Data

The independent and identically distributed (IID) sampling condition on training data is important so that the stochastic gradient is an unbiased estimate of the full gradient [14]. FedAvg is shown to be effective given that the data distribution across different nodes is the same as that of centrally collected data. However, the data distribution, determined by the usage patterns of local nodes, is typically non-IID, i.e., p^{(i)} differs across participating nodes.

Since the local objective F_i(w) is closely related to the data distribution p^{(i)}, a large number of local updates lead the model towards the optima of its local objective F_i(w) as opposed to the global objective F(w). The inconsistency between the local models w_i and the global model w accumulates along with local training, leading to more communication rounds before training converges. As such, local training with multiple local updates potentially hurts convergence and can even lead to divergence in the presence of non-IID data [8], [11].

We conduct an experiment to demonstrate the impact of non-IID data on model convergence. We train a two-layer CNN model with the same neural network architecture as in [8] using PyTorch on the MNIST dataset (containing 60,000 samples from 10 classes) until the model achieves 95% test accuracy. 10 nodes are selected, each with 600 samples that are selected according to its label criteria. If a node is in the IID setting, its 600 samples are randomly selected over the whole training set. If a node is in the x-class non-IID setting, its 600 samples are randomly selected over a subset composed of x classes of data samples, where each of the x classes is selected at random.
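To make the setup above concrete, the following is a minimal NumPy sketch of the x-class non-IID partition and of one FedAvg-style round following Eqs. (3)–(4). It is an illustrative sketch rather than the authors' released code: the function names, the local_grad_fn hook, and all default values are assumptions.

```python
import numpy as np

def x_class_partition(labels, num_nodes=10, samples_per_node=600, x=2,
                      num_classes=10, seed=0):
    """Give each node `samples_per_node` examples drawn from only `x` randomly
    chosen classes (x = num_classes reproduces the IID setting)."""
    rng = np.random.default_rng(seed)
    node_indices = []
    for _ in range(num_nodes):
        classes = rng.choice(num_classes, size=x, replace=False)  # classes this node sees
        pool = np.where(np.isin(labels, classes))[0]              # sample indices of those classes
        node_indices.append(rng.choice(pool, size=samples_per_node, replace=False))
    return node_indices

def fedavg_round(w_global, node_data, local_grad_fn, lr=0.01, tau=1):
    """One communication round of Eqs. (3)-(4): tau local SGD steps per node,
    then aggregation with weights proportional to local data size."""
    deltas, sizes = [], []
    for X, y in node_data:
        w = w_global.copy()
        for _ in range(tau):                      # local SGD steps, Eq. (3)
            w = w - lr * local_grad_fn(w, X, y)
        deltas.append(w - w_global)               # Delta_i(t) = w_i(t) - w(t-1)
        sizes.append(len(y))
    psi = np.asarray(sizes, dtype=float) / sum(sizes)    # psi_i = D_i / sum of D_i'
    delta = sum(p * d for p, d in zip(psi, deltas))      # Eq. (4)
    return w_global + delta
```

For example, with the MNIST labels loaded as a NumPy array, x_class_partition(labels, x=1) produces the 1-class non-IID split used in the experiment above, while x=10 recovers the IID case.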
communication round, which suggests assigning different weights \psi_i to different nodes for the global model aggregation. As such, the corresponding objective is formally stated as enlarging

$$\mathbb{E}_{i|t}\left[\frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}\right] = \sum_{i=1}^{|S_t|} \frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}\cdot \psi_i(t)$$

by designing \psi_i under the inherent constraint \sum_{i=1}^{|S_t|} \psi_i(t) = 1, \psi_i(t) \ge 0, \forall i, t.

Considering that the node contribution is measured by (7), a natural weighting design aiming to enlarge this expectation should follow the criterion that nodes with higher contribution deserve higher weights for aggregation in each global round. We characterize this contribution-regulated weighting strategy, applied adaptively to the global aggregation in each global round, as Federated Adaptive Weighting (FedAdp).

Assigning adaptive weights for updating the global model in the proposed FedAdp algorithm includes two steps:

1) Non-Linear Mapping Function: We design a non-linear mapping function to first quantify the contribution of each node based on its angle information. Inspired by the sigmoid function, we use a variant of the Gompertz function [21], which is a non-linear decreasing function defined as

$$f\big(\tilde{\theta}_i(t)\big) = \alpha\Big(1 - e^{-e^{-\alpha(\tilde{\theta}_i(t) - 1)}}\Big), \qquad (9)$$

where \tilde{\theta}_i(t) is the smoothed angle in radians, e denotes the exponential constant, and \alpha is a constant explained in the following.

The designed mapping function has several properties that are important for the subsequent weight calculation:
• \lim_{\tilde{\theta}_i(t) \to \pi/2} f(\tilde{\theta}_i(t)) = \epsilon, where \epsilon \propto \frac{1}{\alpha} is a constant;
• \lim_{\tilde{\theta}_i(t) \to \upsilon} f(\tilde{\theta}_i(t)) = \alpha, where \upsilon \propto \alpha is a constant.

\alpha controls the decreasing rate of f(\tilde{\theta}_i(t)) from \alpha to \epsilon as \tilde{\theta}_i(t) increases from \upsilon to \pi/2. For example, a small \alpha \in \mathbb{Z}^+ indicates a lower decreasing rate of f(\tilde{\theta}_i(t)), which decreases from \alpha to \epsilon \propto \frac{1}{\alpha} as \tilde{\theta}_i(t) increases from \upsilon \propto \alpha to \pi/2. As \alpha increases, the gap between a small angle and a large angle is amplified (e.g., f(\tilde{\theta}_i(t)) changes within a relatively large range [\epsilon, \alpha] as \tilde{\theta}_i(t) increases within the range [\upsilon, \pi/2]), and so is the difference in contribution among those nodes. However, keeping increasing \alpha is not consistently effective in distinguishing the difference in contributions across nodes. Since \upsilon is proportional to \alpha, a large \alpha narrows the interval [\upsilon, \frac{\pi}{2}] over which the node contribution is actually differentiated, making the contributions of nodes whose angles lie within [0, \upsilon] indistinguishable. The choice of \alpha is empirically verified in Section V-B.

2) Weighting: After mapping the contribution from the smoothed angle of each node, we use the Softmax function to finally calculate the weights of the participating nodes for global model aggregation as follows:

$$\psi_i(t) = \begin{cases} \dfrac{e^{f(\tilde{\theta}_i(t))}}{\sum_{i'=1}^{|S_t|} e^{f(\tilde{\theta}_{i'}(t))}}, & D_m = D_n,\ \forall m, n \in S_t \\[2ex] \dfrac{D_i\, e^{f(\tilde{\theta}_i(t))}}{\sum_{i'=1}^{|S_t|} D_{i'}\, e^{f(\tilde{\theta}_{i'}(t))}}, & D_m \ne D_n,\ \exists m, n \in S_t. \end{cases} \qquad (10)$$

From the first line of (10), if all the participating nodes have the same number of data samples, the proposed FedAdp algorithm assigns weights solely based on their contributions, quantified by e^{f(\tilde{\theta}_i(t))}. From the second line of (10), FedAdp assigns weights based on both the contribution and the data size.

Remark 4: Different from FedAvg, where the weight for aggregation is solely proportional to the size of the local dataset (e.g., \psi_i = D_i / \sum_{i'=1}^{|S_t|} D_{i'}), FedAdp takes both the data size and the node contribution into consideration when assigning weights for model aggregation.

The reason for adopting the Softmax function is twofold: i) the output of the Softmax function is a normalized value, with a larger angle corresponding to a smaller weight; ii) using the Softmax function, each node's contribution can be reinforced or suppressed, depending on the smoothed angle between its gradient and the global gradient.

The complete procedure of the proposed FedAdp algorithm is presented in Algorithm 1, and FedAdp with the adaptive weighting strategy leads to the following theorem.

Algorithm 1 Federated Adaptive Weighting (FedAdp)
procedure FEDERATEDOPTIMIZATION
Input: node set S, E, \bar{B}, T, \eta
1: Server initializes the global model w(0), the global update \Delta(0), and the smoothed angle \tilde{\theta}_i(0), i \in S
2: for t = 1, \ldots, T − 1 do
3:   for node i \in S_t in parallel do
4:     \Delta_i(t) \leftarrow LOCALUPDATE(i, w(t − 1))
5:   w(t) \leftarrow GLOBALUPDATE(\Delta_1(t), \Delta_2(t), \ldots, \Delta_{|S_t|}(t))
procedure LOCALUPDATE
Input: node index i, model w(t − 1)
6: Perform \tau = \frac{D_i}{\bar{B}} E local SGD updates with step size \eta on F_i(w) and obtain w_i(t) using (3)
7: Calculate the model difference \Delta_i(t) = w_i(t) − w(t − 1)
8: return \Delta_i(t)
procedure GLOBALUPDATE
Input: local updates \Delta_1(t), \Delta_2(t), \ldots, \Delta_{|S_t|}(t)
9: Calculate the global gradient \nabla F(w(t)) = \sum_{i=1}^{|S_t|} \big(D_i / \sum_{i'=1}^{|S_t|} D_{i'}\big) \nabla F_i(w(t)), where \nabla F_i(w(t)) = -\Delta_i(t)/\eta
10: Calculate the instantaneous angle \theta_i(t) by (7)
11: Update the smoothed angle \tilde{\theta}_i(t) by (8)
12: Calculate the weights for model aggregation by (9), (10)
13: Update the global model w(t) = w(t − 1) + \sum_{i=1}^{|S_t|} \psi_i(t) \Delta_i(t)
14: return w(t)

Theorem 2: FedAdp with the weight design \psi_i(t) in (10) achieves a tighter bound on the FL loss decrease in Theorem 1 than FedAvg with weight \psi_i = D_i / \sum_{i'} D_{i'}.

The proof of Theorem 2 is presented in Appendix B.

Compared to FedAvg, FedAdp adopts a simple yet effective strategy that measures node contribution by quantifying the correlation between the local gradient and the global gradient. The weights for the global model update can be adaptively assigned based on node contribution rather than by even averaging, which results in a greater FL loss reduction in each global round and consequently accelerates model convergence, as confirmed by our experimental results.
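The server-side weighting of FedAdp can be summarized in a few lines. The sketch below assumes flattened NumPy gradient vectors and recovers local gradients as −Δ_i(t)/η (step 9 of Algorithm 1); since Eq. (8) for the smoothed angle is not reproduced in this excerpt, a simple running average over rounds is used as a stand-in, and that choice is an assumption.

```python
import numpy as np

def gompertz_map(theta, alpha=5.0):
    """Non-linear mapping of Eq. (9): decreasing in the smoothed angle theta (radians)."""
    return alpha * (1.0 - np.exp(-np.exp(-alpha * (theta - 1.0))))

def fedadp_weights(deltas, data_sizes, lr, theta_smooth, t, alpha=5.0):
    """Compute FedAdp aggregation weights psi_i(t) from the local updates Delta_i(t)."""
    grads = [-d / lr for d in deltas]                     # grad_i = -Delta_i(t) / eta
    sizes = np.asarray(data_sizes, dtype=float)
    g_global = sum((s / sizes.sum()) * g for s, g in zip(sizes, grads))

    # instantaneous angle between each local gradient and the global gradient
    cos = np.array([np.dot(g, g_global) /
                    (np.linalg.norm(g) * np.linalg.norm(g_global) + 1e-12) for g in grads])
    theta_inst = np.arccos(np.clip(cos, -1.0, 1.0))

    # smoothed angle: running average over rounds (placeholder for Eq. (8))
    if theta_smooth is None:
        theta_smooth = theta_inst
    else:
        theta_smooth = (t / (t + 1.0)) * theta_smooth + (1.0 / (t + 1.0)) * theta_inst

    f = gompertz_map(theta_smooth, alpha)                 # Eq. (9)
    scores = np.exp(f) if np.all(sizes == sizes[0]) else sizes * np.exp(f)
    psi = scores / scores.sum()                           # Softmax-style weights, Eq. (10)
    return psi, theta_smooth
```

The returned psi can then be plugged into the aggregation step w(t) = w(t − 1) + Σ_i psi_i Δ_i(t), replacing the size-only weights of FedAvg.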
Fig. 3. Test accuracy over communication rounds of FedAdp and FedAvg with heterogeneous data distribution over participating nodes using MLR model.
Upper and lower subplots correspond to training performance on MNIST and FashionMNIST datasets, respectively.
V. EVALUATION AND ANALYSIS

To evaluate the performance of our proposed adaptive weighting algorithm, we implemented FedAdp with the PyTorch framework and the PySyft library, and studied the image classification task. We evaluated FedAdp by training typical convex and non-convex learning models on two datasets: MNIST and FashionMNIST. Similar to the experiment in Section III-B, when different degrees of skewness of the non-IID dataset are present, we first investigate how FedAdp outperforms FedAvg [8] by assigning adaptive weights for model aggregation. Note that our proposed algorithm does not rely on the presence of IID nodes and can be applied to a general scenario with data heterogeneity, as verified in Section V-A. Then, the choice of \alpha for the non-linear mapping in FedAdp is discussed in Section V-B. Finally, by tracking the divergence of gradients on participating nodes, we show that FedAdp alleviates the impact brought by data heterogeneity compared to FedAvg, which is beneficial for reducing the FL loss in each round and accelerating FL model convergence, as discussed in Section V-C. We briefly describe our experiment settings as follows.

We consider a Multinomial Logistic Regression^3 (MLR) model and a CNN model^4 to represent convex and non-convex learning objectives, respectively. We use the number of communication rounds for the FL model to reach a target test accuracy as the performance metric. Unless otherwise specified, the target accuracy is set to 95% for training on MNIST and 80% for training on FashionMNIST. The number of participating nodes is |S_t| = 10, with D_i = 600, \bar{B} = 50 for MLR and \bar{B} = 32 for CNN, E = 1, T = 300, \eta = 0.01 with decay rate 0.995, and the constant in the non-linear mapping function set to \alpha = 5. The skewness of the dataset is measured by the x-class non-IID setting. The datasets for the nodes are generated in the same way as in Section III-B.

^3 For the MLR model, the input is a flattened 784-dimensional (28 × 28) image, and the output is a class label between 0 and 9. Note that the MLR model can be extended to the strongly-convex setting by adding a regularization term [22].
^4 The CNN has 7 layers with the following structure: 5 × 5 × 32 Convolutional → 2 × 2 MaxPool → 5 × 5 × 64 Convolutional → 2 × 2 MaxPool → 1024 × 512 Fully connected → 512 × 10 Fully connected → Softmax (1,663,370 total parameters). All Convolutional and Fully connected layers use ReLU activation. The configuration is similar to [8].

A. Data Heterogeneity

We investigate different numbers of non-IID nodes with different skewness levels of non-IID data to test the efficiency of FedAdp. For non-IID data, two skewness cases, x = 1, 2, are considered. We plot the test accuracy versus the communication rounds of federated learning in Fig. 3 and Fig. 4 when the MLR and CNN models are adopted, respectively.

1) MLR Model: Given that the learning capability of MLR is limited, instead of setting a target accuracy, we simply train the model over 50 global rounds. We plot the test accuracy versus the communication rounds of the federated learning algorithms in Fig. 3. From Fig. 3, we can see that FedAdp always outperforms FedAvg when nodes with non-IID datasets are present. In addition, FedAdp converges very fast in the early training stage, and the superiority of FedAdp is more prominent when the proportion of nodes with non-IID datasets is larger. It is noted that the gap between FedAdp and FedAvg over 50 global rounds is not conspicuous because of the simplicity of the MLR model: different weighting strategies will not make much difference when the model is reaching its learning capability. In contrast, the weighting strategy will consistently impact the FL training process when a more complex neural network model is applied, as shown in the following experiment.

2) CNN Model: We plot the test accuracy versus the communication rounds of federated learning in Fig. 4. From Fig. 4, we can see that FedAdp always outperforms FedAvg when nodes with non-IID datasets are present. In particular, FedAdp converges very fast in the early training stage, since the gradient divergence is more obvious in the initial rounds, which makes the effect of assigning adaptive weights for updating the global model even more significant.

To measure the effectiveness of FedAdp, we count the number of communication rounds needed to reach a target accuracy when FedAdp is adopted.
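For reference, a PyTorch sketch of a CNN matching the stated parameter count is given below. Footnote 4 lists a 1024 × 512 fully connected layer, but with 28 × 28 inputs the quoted total of 1,663,370 parameters (which matches the CNN in [8]) corresponds to same-padded convolutions and a 3136 → 512 layer, so that variant is assumed here.

```python
import torch.nn as nn

class MnistCNN(nn.Module):
    """Two 5x5 conv layers (32 and 64 channels), each followed by 2x2 max pooling,
    then two fully connected layers; softmax is applied by the loss at training time."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = MnistCNN()
print(sum(p.numel() for p in model.parameters()))  # 1,663,370
```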
Fig. 4. Test accuracy over communication rounds of FedAdp and FedAvg with heterogeneous data distribution over participating nodes using CNN model.
Upper and lower subplots correspond to training performance on MNIST and FashionMNIST datasets, respectively.
TABLE I: Number of communication rounds to reach a target accuracy for FedAdp versus FedAvg [8], within 300 rounds. N/A refers to the case where an algorithm cannot reach the target accuracy before termination, in which case the highest test accuracy is shown.
Each entry in Table I shows the number of communication rounds necessary to achieve a test accuracy of 95% for the CNN on MNIST and 80% on FashionMNIST. The bold numbers indicate the better result achieved by FedAdp, as compared to FedAvg. FedAdp decreases the number of communication rounds by up to 54.1% and 43.2% for the MNIST task when non-IID nodes are at the 1-class and 2-class non-IID settings, respectively. For the FashionMNIST task, the corresponding decreases are up to 43.7% and 45.4%, respectively. In the cases where the target accuracy is not reachable within 300 rounds, FedAdp always terminates with a higher test accuracy.

So far, two extreme skewness cases, x = 1, 2, have been considered, but the superiority of the proposed weighting strategy is not limited to extreme cases. To verify the proposed weighting strategy in a more general data heterogeneity scenario, we consider the CNN model on the MNIST dataset in the following two cases.
• Case 1: The number of classes of data samples owned by node i, denoted by x_i, is randomly selected from the set {1, 2, ..., 10} without overlapping. Thereafter, the data samples on each node are randomly selected from the x_i-class subset of the training dataset.
• Case 2: For half of the nodes, x_i (i.e., the number of classes of data samples) is selected following the uniform distribution U(1, 5), whereas for the other half, x_i follows the uniform distribution U(6, 10). The data samples on each node are randomly selected from the x_i-class subset of the training dataset.

From Fig. 5, we can see that FedAdp outperforms FedAvg in both cases. In both cases, the convergence performance is worse than the results in Fig. 4 because the number of IID nodes is small and the local dissimilarity is greater. However, it is clear that, by measuring node contribution, FedAdp is more rapid in reducing the FL loss in each global round and thus accelerates model convergence, even without the participation of IID nodes.
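The per-node class counts x_i for the two cases above can be generated as in the following short sketch (assuming ten nodes and treating U(1, 5) and U(6, 10) as discrete uniform draws, which is an interpretation of the description rather than the authors' exact code); the resulting x_i can then be fed to the x-class partition sketched in Section III-B.

```python
import numpy as np

def class_counts_case1(num_nodes=10, seed=0):
    # Case 1: each node gets a distinct class count, a random permutation of {1,...,10}
    rng = np.random.default_rng(seed)
    return rng.permutation(np.arange(1, num_nodes + 1))

def class_counts_case2(num_nodes=10, seed=0):
    # Case 2: half of the nodes draw x_i from {1,...,5}, the other half from {6,...,10}
    rng = np.random.default_rng(seed)
    half = num_nodes // 2
    return np.concatenate([rng.integers(1, 6, size=half),
                           rng.integers(6, 11, size=num_nodes - half)])
```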
Therefore,

$$\|w(t+1) - w(t)\|^2 = \big\|\mathbb{E}_{i|t}[w_i(t+1) - w(t)]\big\|^2 = \eta^2 \big\|\mathbb{E}_{i|t}[\nabla F_i(w(t))]\big\|^2 \overset{(1)}{\le} \eta^2\, \mathbb{E}_{i|t}\big[\|\nabla F_i(w(t))\|^2\big], \qquad (A4)$$

where inequality (1) holds by the Cauchy-Schwarz inequality.

• Bounding \langle \nabla F(w(t)), w(t+1) - w(t)\rangle: Again, by the definition of the global aggregation for w(t+1) and (A3), we have

$$\langle \nabla F(w(t)), w(t+1) - w(t)\rangle = -\eta\, \mathbb{E}_{i|t}\big[\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle\big]. \qquad (A5)$$

The expectation term in (A5) can be further rewritten as

$$\mathbb{E}_{i|t}\big[\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle\big] = \mathbb{E}_{i|t}\left[\frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}\cdot \|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|\right] \overset{(2)}{\ge} \mathbb{E}_{i|t}\left[\frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}\cdot \frac{\|\nabla F_i(w(t))\|^2}{B}\right], \qquad (A6)$$

where inequality (2) comes from Assumption 2 that the local dissimilarity is upper bounded by B.

Plugging (A6) into (A5), the last two terms on the right-hand side of (A1) are expressed as

$$\langle \nabla F(w(t)), w(t+1) - w(t)\rangle + \frac{\beta}{2}\|w(t+1) - w(t)\|^2 \le -\eta\, \mathbb{E}_{i|t}\left[\frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}\cdot \frac{\|\nabla F_i(w(t))\|^2}{B}\right] + \frac{\beta\eta^2}{2}\, \mathbb{E}_{i|t}\big[\|\nabla F_i(w(t))\|^2\big] \overset{(3)}{\le} -\eta\, \mathbb{E}_{i|t}\left[\frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|} - \frac{B\beta\eta}{2}\right]\cdot \frac{A^2}{B}\, \|\nabla F(w(t))\|^2, \qquad (A7)$$

where inequality (3) holds because of Assumption 2 that the local dissimilarity is lower bounded by A.

Finally, Theorem 1 is proved by substituting (A7) into (A1).

APPENDIX B
PROOF OF THEOREM 2

We consider the general case that participating nodes have different numbers of data samples. For node i with data size D_i, we create D_i virtual nodes, each with a unit sample size. Hereinafter, we use the index (i, j), j \in \{1, \ldots, D_i\}, to denote the j-th virtual node split from participating node i, i \in S_t, where the gradient information is kept on virtual nodes as on the participating node (e.g., \nabla F_{i,j}(w(t)) = \nabla F_i(w(t)), \theta_{i,j} = \theta_i). As such, all virtual nodes split from node i share the same weight (i.e., \psi_{i,j}(t) = \psi_{i,k}(t), \forall j, k \in \{1, \ldots, D_i\}), where \psi_{i,j}(t) denotes the weight for virtual node (i, j). The weight of node i is \psi_i(t) = \sum_{j=1}^{D_i} \psi_{i,j}(t) = D_i \psi_{i,j}(t).

From (7), \theta_{i,j} = \theta_i monotonically decreases with \frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}. From (9), f(\cdot) is a decreasing function of \theta. Thus, since \psi_{i,j}(t) = \frac{e^{f(\theta_{i,j}(t))}}{\sum_{i'=1}^{|S_t|} D_{i'}\, e^{f(\theta_{i'}(t))}}, we can see that \psi_{i,j}(t) monotonically increases with \frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}. Therefore, a generic \psi_{i,j}(t) satisfies the following criterion

$$\psi_{i,j}(t) \propto \frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|}, \qquad \psi_{i,j}(t) \ge 0\ \ \forall i, j, t, \qquad \sum_{i=1}^{|S_t|}\sum_{j=1}^{D_i} \psi_{i,j}(t) = \sum_{i=1}^{|S_t|} \psi_i(t) = 1, \qquad (B1)$$

with the corresponding bound of the expected loss being

$$F(w(t+1)) \le F(w(t)) - \eta \sum_{i=1}^{|S_t|} \psi_i(t) \left[\frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|} - \frac{B\beta\eta}{2}\right]\cdot \frac{A^2}{B}\, \|\nabla F(w)\|^2, \qquad (B2)$$

where \psi_i(t) is defined as in (10).

In order to compare the expected loss achieved by FedAdp and FedAvg, one can simply measure the expectation term in (5). We use u_{i,j} to denote the contribution from virtual node j of participating node i to the model aggregation. In each global round, we sort the contributions of all the virtual nodes, measured by the correlation \frac{\langle \nabla F(w(t)), \nabla F_i(w(t))\rangle}{\|\nabla F(w(t))\|\,\|\nabla F_i(w(t))\|} between the local gradient and the global gradient, in descending order, that is, u_{1,1} = u_{1,2} = \cdots = u_{1,D_1} \ge u_{2,1} = u_{2,2} = \cdots = u_{2,D_2} \ge \cdots \ge u_{|S_t|,1} = u_{|S_t|,2} = \cdots = u_{|S_t|,D_{|S_t|}}. Apparently, the weights assigned to the virtual nodes in FedAdp should follow the same order, \psi_{1,1} = \psi_{1,2} = \cdots = \psi_{1,D_1} \ge \psi_{2,1} = \psi_{2,2} = \cdots = \psi_{2,D_2} \ge \cdots \ge \psi_{|S_t|,1} = \psi_{|S_t|,2} = \cdots = \psi_{|S_t|,D_{|S_t|}}, with \sum_i \sum_j \psi_{i,j} = 1. As such, by Chebyshev's inequality [23], we have the following hold for any u_{m,j}, u_{n,j},

$$\bar{\psi}\,\big(u_{m,j} - u_{n,j}\big)\left(\frac{\psi_{m,j}}{\bar{\psi}_{m,j}} - \frac{\psi_{n,j}}{\bar{\psi}_{n,j}}\right) \ge 0 \iff \bar{\psi}\big(u_{m,j}\,\psi_{m,j}\,\bar{\psi}_{n,j} + u_{n,j}\,\psi_{n,j}\,\bar{\psi}_{m,j}\big) \ge \bar{\psi}\big(u_{m,j}\,\psi_{n,j}\,\bar{\psi}_{m,j} + u_{n,j}\,\psi_{m,j}\,\bar{\psi}_{n,j}\big), \qquad (B3)$$

where \bar{\psi} = \bar{\psi}_{m,j} = \bar{\psi}_{n,j} = \frac{1}{D} denotes the weight of FedAvg for all virtual nodes, with D = \sum_{i=1}^{|S_t|} D_i.

Adding all the D^2 inequalities, we have

$$\bar{\psi}\left[\sum_{m=1}^{|S_t|}\sum_{j=1}^{D_m}\sum_{n=1}^{|S_t|}\sum_{j=1}^{D_n} \big(u_{m,j}\,\psi_{m,j}\,\bar{\psi}_{n,j} + u_{n,j}\,\psi_{n,j}\,\bar{\psi}_{m,j}\big)\right] \ge \bar{\psi}\left[\sum_{m=1}^{|S_t|}\sum_{j=1}^{D_m}\sum_{n=1}^{|S_t|}\sum_{j=1}^{D_n} \big(u_{m,j}\,\psi_{n,j}\,\bar{\psi}_{m,j} + u_{n,j}\,\psi_{m,j}\,\bar{\psi}_{n,j}\big)\right]$$
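The Chebyshev-type comparison above can be checked numerically: when the weights are ordered in the same way as the contributions (and both weight vectors sum to the same total), the weighted sum of contributions is never smaller than under the uniform FedAvg weight 1/D. The values below are randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 12                                            # total number of unit-size virtual nodes
u = np.sort(rng.random(D))[::-1]                  # contributions, descending order
psi = np.sort(rng.dirichlet(np.ones(D)))[::-1]    # FedAdp-style weights, same order, sum to 1
psi_bar = np.full(D, 1.0 / D)                     # FedAvg weight: 1/D per virtual node

# similarly ordered weights do at least as well as uniform weights
assert np.dot(psi, u) >= np.dot(psi_bar, u) - 1e-12
print(np.dot(psi, u), np.dot(psi_bar, u))
```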
REFERENCES

[1] K. L. Lueth. (Aug. 2019). State of the IoT 2018: Number of IoT Devices Now at 7B—Market Accelerating. [Online]. Available: https://iot-analytics.com/state-of-the-iot-update-q1-q2-2018-number-of-iot-devices-now-7b/
[2] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[3] O.-E.-K. Aktouf, T. Zhang, J. Gao, and T. Uehara, "Testing location-based function services for mobile applications," in Proc. IEEE Symp. Serv. Orient. Syst. Eng. (SOSE), 2015, pp. 308–314.
[4] R. K. Ganti, F. Ye, and H. Lei, "Mobile crowdsensing: Current state and future challenges," IEEE Commun. Mag., vol. 49, no. 11, pp. 32–39, Nov. 2011.
[5] M. Chiang and T. Zhang, "Fog and IoT: An overview of research opportunities," IEEE Internet Things J., vol. 3, no. 6, pp. 854–864, Dec. 2016.
[6] Z. Xiong, Y. Zhang, D. Niyato, P. Wang, and Z. Han, "When mobile blockchain meets edge computing," IEEE Commun. Mag., vol. 56, no. 8, pp. 33–39, Aug. 2018.
[7] X. Wang, Y. Han, V. C. M. Leung, D. Niyato, X. Yan, and X. Chen, "Convergence of edge computing and deep learning: A comprehensive survey," IEEE Commun. Surveys Tuts., vol. 22, no. 2, pp. 869–904, 2nd Quart., 2020.
[8] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. Artif. Intell. Statist. Conf. (AISTATS), 2017, pp. 1273–1282.
[9] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," 2016. [Online]. Available: arXiv:1610.05492.

Hongda Wu (Student Member, IEEE) received the M.A.Sc. degree in electrical engineering from the Communication University of China in 2019. He is currently pursuing the Ph.D. degree with the Department of Electrical Engineering and Computer Science, York University, Canada. His research interests include federated learning, reinforcement learning, wireless networks, and the Internet of Things.

Ping Wang (Senior Member, IEEE) received the Bachelor and Master degrees in electrical and computer engineering from the Huazhong University of Science and Technology, in 1994 and 1997, respectively, and the Ph.D. degree in electrical and computer engineering from the University of Waterloo, Canada, in 2008. She joined York University as an Associate Professor in August 2018. Prior to that, she worked with Nanyang Technological University, Singapore, from 2008 to July 2018. Her research interests are mainly in wireless communication networks, cloud computing, and the Internet of Things. Her scholarly works have been widely disseminated through top-ranked IEEE journals/conferences and received the Best Paper Awards from the IEEE Wireless Communications and Networking Conference in 2012 and 2020, from the IEEE Communication Society: Green Communications and Computing Technical Committee in 2018, and from the IEEE International Conference on Communications in 2007.