
Communication-Efficient Federated Learning for Wireless Edge Intelligence in IoT

Jed Mills, Jia Hu, and Geyong Min

Manuscript received September 23, 2019; revised November 15, 2019; accepted November 21, 2019. Date of publication November 28, 2019; date of current version July 10, 2020. This work was supported by an EPSRC DTP Studentship. (Corresponding authors: Jia Hu; Geyong Min.) The authors are with the College of Engineering, Maths and Physical Science, University of Exeter, Exeter EX4 4QJ, U.K. (e-mail: jm729@exeter.ac.uk; j.hu@exeter.ac.uk; g.min@exeter.ac.uk). Digital Object Identifier 10.1109/JIOT.2019.2956615

Abstract—The rapidly expanding number of Internet of Things (IoT) devices is generating huge quantities of data, but public concern over data privacy means users are apprehensive to send data to a central server for machine learning (ML) purposes. The easily changed behavior of edge infrastructure that software-defined networking (SDN) provides makes it possible to collate IoT data at edge servers and gateways, where federated learning (FL) can be performed: building a central model without uploading data to the server. FedAvg is an FL algorithm which has been the subject of much study; however, it suffers from a large number of rounds to convergence with non-independent identically distributed (non-IID) client data sets and high communication costs per round. We propose adapting FedAvg to use a distributed form of Adam optimization, greatly reducing the number of rounds to convergence, along with novel compression techniques, to produce Communication-Efficient FedAvg (CE-FedAvg). We perform extensive experiments with the MNIST/CIFAR-10 data sets, IID/non-IID client data, varying numbers of clients, client participation rates, and compression rates. These show that CE-FedAvg can converge to a target accuracy in up to 6× fewer rounds than similarly compressed FedAvg, while uploading up to 3× less data, and is more robust to aggressive compression. Experiments on an edge-computing-like testbed using Raspberry Pi clients also show that CE-FedAvg is able to reach a target accuracy in up to 1.7× less real time than FedAvg.

Index Terms—Compression, distributed computing, edge computing, federated learning (FL), Internet of Things (IoT).

I. INTRODUCTION

THERE are over 8 billion Internet of Things (IoT) devices worldwide, as of 2019 [1]. These are typically low-powered, embedded devices used for data collection. The expansion of the number of IoT devices has therefore led to an explosion in the quantity of data available for organizations to use for machine learning (ML) purposes. These organizations can use this data for insights, consumer products, and scientific research. However, there is growing public concern over the distribution of private data. Many IoT devices, such as security cameras, fitness and health trackers, and smartphones, gather sensitive data that users would not wish to share with a central server.

The combination of increasing IoT devices and data, along with user privacy concerns, presents a series of opportunities and challenges for ML. There is a potential to use the large number of IoT devices to perform distributed ML, rather than training in a central server, distributing expensive training and inference while keeping private data on user devices. However, distributed ML schemes are typically designed for the datacenter, assuming high-performance computing and networking hardware, in contrast to the highly heterogeneous computing and communication capabilities of low-powered IoT devices. Also, IoT devices collect data from different users, so the distribution over devices can be highly non-independent identically distributed (non-IID), making it difficult to create ML models with good performance.

Software-defined networking (SDN) has the potential to help alleviate some of these distributed ML problems. An edge network architecture typically involves IoT devices connected to IoT gateways. Gateways and similar devices often possess much more computing and storage capacity than IoT devices and are located at the network edge, closer to user devices, meaning data do not need to be centrally collected. SDN could be used to easily alter the behavior of IoT gateways: collecting data from a group of local IoT devices and performing distributed ML, or delivering these data securely to nearby edge servers.

To address the issue of distributed ML, the concept of federated learning (FL) [2], collaboratively training a model across devices without sharing their data, was introduced. McMahan et al. [3] proposed an implementation of FL with their FedAvg algorithm, designed for user devices such as smartphones. In FedAvg, clients independently train deep neural networks (DNNs) on their local data and periodically average them. FedAvg has been shown to work in real-world settings by Google with their GBoard [4] next-word-prediction and emoji-prediction [5] software. FedAvg can be used in the IoT/SDN framework described above to provide distributed ML at the network edge.

Modern DNNs contain a huge number of weights, in the order of tens to hundreds of millions. As FedAvg requires uploading and downloading these models between the clients and the central server, several works have been published on reducing the amount of communication performed by FedAvg to reduce bottlenecks and help ease network use.

A standard method of training DNNs is minibatch stochastic gradient descent (mb-SGD): the model weights are updated with their gradients multiplied by a fixed learning rate. Most works on FL use mb-SGD for their DNNs; however, there are more sophisticated optimization techniques based on mb-SGD, such as AdaGrad [6] and RMSProp [7]. Adam [8] is one popular technique that features both per-parameter learning rates and momentum. It has been shown to be a very efficient optimizer for many tasks and has the potential to improve the convergence rate of FedAvg.

FedAvg, therefore, has the following problems when considered in the IoT edge-computing scenario. First is the high communication cost per round, due to the large models being sent between the clients and the server, and second is the high number of rounds to convergence, especially in non-IID settings. This article proposes Communication-Efficient FedAvg (CE-FedAvg), which reduces the number of rounds to convergence, and the total data uploaded per round, over FedAvg. We, therefore, make the following contributions.

1) We propose the CE-FedAvg algorithm, which is composed of two parts: a) distributed Adam optimization and b) compression of uploaded models. CE-FedAvg reduces the number of communication rounds taken to reach a target accuracy, and the total data uploaded per round, compared to uncompressed FedAvg.
2) We develop two novel schemes for the quantization of communicated weights and moment values of Adam (uniform and exponential quantization). These two schemes are used alongside sparsification and Golomb encoding (GE) for the compression in CE-FedAvg.
3) We perform extensive simulated experiments of CE-FedAvg using the MNIST and CIFAR-10 data sets, three DNN architectures, and varying numbers of clients and client participation rates. These show that CE-FedAvg is able to reach a target accuracy in far fewer rounds than FedAvg in non-IID scenarios, and that it is more robust to aggressive communication reduction.
4) We perform further experimentation on an edge-computing-like testbed using Raspberry Pi (RPi) clients, showing that CE-FedAvg reduces the real time to convergence over FedAvg.

The remainder of this article is structured as follows. We outline the related work in the fields of FL and ML for the IoT. We then describe our proposed algorithm, CE-FedAvg, in detail. After that, we outline the simulation and testbed experiments we ran comparing CE-FedAvg to FedAvg. We finally give our conclusions of this article in the last section.

II. RELATED WORK

A. Federated Learning

McMahan et al. [3] proposed the original FedAvg algorithm to train an aggregate model without uploading client data to a server. FedAvg drastically reduces total communicated data compared to datacenter-style distributed SGD (where each worker performs a single step of SGD before aggregation).

FL considers clients with non-IID data distributions, which is an obstacle to training a good central model. Zhao et al. [9] proposed sharing a small amount of data between non-IID clients to reduce the difference between clients' distributions, increasing the maximum accuracy their FedAvg models were able to attain.

Devices participating in FL are assumed to have highly heterogeneous computing and networking resources. Wang et al. [10] addressed this with a control algorithm that dynamically changes the number of local SGD updates that clients perform before uploading their models, based on a tradeoff between the computing power and networking bandwidth available to the client.

Several papers have been written on the task of reducing the amount of data communicated during FedAvg. Konečný et al. [11] introduced the concepts of structured and sketched model updates, and were able to significantly reduce the quantity of uploaded data during training. Similarly, Sattler et al. [12] demonstrated their sparse binary compression system against uncompressed FedAvg, and were able to achieve better accuracy while communicating much less data. Lin et al. proposed deep gradient compression [13], a system applicable to FedAvg, where gradient updates are accumulated at clients until their magnitude is greater than a threshold. Sattler et al. [14] proposed sparse ternary compression to compress weight updates during FedAvg, including using Golomb encoding [15] to compress the indices of weights in the sparse matrices produced by sparsification.

As well as compression of weights, other techniques to reduce communication during FedAvg have been proposed. Chen et al. [16] aggregated shallow layers more frequently than deep layers (they assert that shallow layers learn general features and deep layers learn more specific features). Their scheme was able to converge to a target accuracy with up to 7× less communication than FedAvg. Liu et al. [17] proposed having clients aggregate with intermediate servers, which then aggregate with a main server, with two given frequencies. Their experiments showed that increasing aggregation frequency causes the global model to converge faster (as would be expected), and that the global aggregation frequency has the biggest impact on the convergence rate.

Leroy et al. [18] proposed adapting the Adam optimizer to the FedAvg algorithm. In their system, clients perform SGD using all their local data and send their weight updates to the server. The server holds the first and second moment values for each weight, which it uses to compute the central model weight updates using the Adam update rule. However, the way we use Adam differs from [18]: our method uses Adam optimization at clients, and the Adam values are also aggregated at the server, as opposed to using Adam solely at the server. We also propose compression schemes for our version, which Leroy et al. do not.

B. Internet of Things and Machine Learning

Some works have been published combining deep learning and the IoT. One seminal work by Liu et al. [19] created a system for identifying foods, in which image preprocessing and segmentation were performed on edge devices, and classification was done on a central server, reducing the latency at inference time. Li et al. [20] put forward a similar idea where an ML model was trained in the cloud, then initial layers of the model were distributed to edge servers. At inference time, the edge servers compute the first layers before sending their result to the cloud for the final inference, resulting in less data sent to the cloud than sending the raw input.


For training near the network edge, Lyu et al. [21] created fog privacy-preserving deep learning (FPPDL), which combines privacy-preserving techniques with intermediate-layer aggregation of IoT data at "Fog" nodes before aggregation with a central server.

ML has been applied to a large variety of edge-like devices. Reina et al. [22] treated unmanned aerial vehicles (UAVs) as edge devices, and solved the multiobjective optimization problem for their area coverage using a multisubpopulation genetic algorithm.

Pathinarupothi et al. [23] developed a unique IoT-based system to provide health alerts to doctors for patients. Patients were fitted with multiple physiological sensors, which could be connected to the Internet using the patient's smartphone as a gateway. The gateways had software to determine if doctors needed to be alerted about patients. This system shows the ability for smartphones to be used as intelligent, programmable gateways for nearby IoT devices, which could be applicable to the scheme proposed in this article.

Other authors have investigated the resource consumption of DNNs on IoT and edge devices. Chandakkar et al. [24] developed a system for retraining and pruning networks at the edge as nodes receive data over time. Similarly, DeepIoT [25] was created to reduce the size of DNNs for inference by using a second network to determine the best weights to drop from the original network. Guo et al. [26] proposed a novel approach that involved training a DNN and using automata to gradually prune the network weights over time. They were able to slightly improve the MNIST performance of dense networks with this method while pruning > 40% of weights.

Most current work exploring ML and the IoT focuses on centralized training and using IoT/edge computing for inference, or on decreasing the size of DNNs for these devices. This article, however, investigates decentralized training using full DNNs.

III. CE-FEDAVG

The FedAvg algorithm has a single master model that is an aggregate of the client models. For each round of communication, the server selects a subset of clients and pushes the master model to these clients. Each client then performs a predetermined number of rounds of gradient descent using the client's local data and pushes their model weight deltas to the server, and the server then averages these updates to become the new master model.

FedAvg features a fixed learning rate across devices as per normal mb-SGD. As shown below, this can result in a very large number of rounds required to converge to a given accuracy. mb-SGD can result in low convergence rates in later rounds of training because some weights need finer "tuning" than others. Adam optimization is a popular alternative to standard mb-SGD. It stores two values for each model weight: m (the first moment) and v (the second moment), which are used along with gradients computed by backpropagation and global decaying learning rates to update the model weights for each minibatch. Adam reduces both the problems of weights requiring different degrees of tuning (having adaptive rates for each weight) and local minima (via momentum).

Fig. 1. One communication round of CE-FedAvg. 1) Clients download the global model weights, first and second moments (ω, m, v). 2) Clients perform training on their IoT-derived data sets. 3) Clients compress their models. 4) Clients upload their compressed parameters and indices (ω′, m′, v′, i). 5) The server decompresses the model weights and moments. 6) The server aggregates all client models before starting a new round of training.

Alongside this, in FedAvg, the entire DNN model is sent from clients to server each round. Modern DNNs have a huge number of weights, and the upload bandwidth of edge clients is typically far lower than the download bandwidth. Therefore, uploading models to the server is a significant potential bottleneck of the system. Previous works have shown that these uploaded weight tensors can be compressed without significantly harming the performance of the model [11]–[14]. However, these schemes typically increase the number of rounds FedAvg takes to converge, albeit with lower total data uploaded by clients.

To address the above problems, we propose CE-FedAvg: a scheme that both reduces the number of rounds taken to converge to a given accuracy and decreases the total data uploaded during training over FedAvg. Fig. 1 shows the operation of one round of CE-FedAvg and the interaction between IoT devices and edge servers. Algorithm 1 shows the details of CE-FedAvg. The UniQ, UniDQ, ExpQ, and ExpDQ functions are shown in Algorithms 2 and 3.

In CE-FedAvg, the server first initializes the global model with random weights and 0 values for the first and second Adam moments (line 3). Each communication round, the server selects a subset of clients (line 5) and sends the weights and moments to them. In our algorithm, clients are selected at random, but in reality, clients would be selected based on their power/communication properties at the time, as in [4]. These clients then perform their local Adam SGD (lines 7 and 8) and upload their data to the server. The server dequantizes and reconstructs the sparse updates from the clients (lines 10–17), averages the updates (lines 20–22), and starts the next round. Each client updates by replacing their current model with the downloaded global weights and moments (line 27), performing E epochs of Adam SGD (lines 28–34), sparsifying and then quantizing the model deltas and sparse indexes (lines 36–40), and uploading the model to the server. Barring compression, CE-FedAvg would communicate 3× the data of FedAvg (the shapes of m and v are the same as ω). However, as shown later, ω, m, and v can be aggressively compressed in CE-FedAvg while still taking fewer rounds to converge than FedAvg.

Algorithm 1 CE-FedAvg
 1: Server executes:
 2:   // Global weights, 1st and 2nd moments
 3:   initialize ω; m ← 0; v ← 0
 4:   while termination condition not met do
 5:     S ← random set of max(C · K, 1) clients
 6:     for each k ∈ S in parallel do
 7:       (ωq, mq, vq, g, b*, lzmin, lzmax, gzmin,
 8:         gzmax, bm, bv) ← ClientUpdate_k(ω, m, v)
 9:       // Decompress deltas and indexes
10:       ωk ← 0
11:       mk ← 0
12:       vk ← 0
13:       idxs ← GDecode(g, b*)
14:       ωk,idxs ← UniDQ(ωq, lzmin,
15:         lzmax, gzmin, gzmax)
16:       mk,idxs ← ExpDQ(mq, bm)
17:       vk,idxs ← ExpDQ(vq, bv)
18:     end for
19:     // Update global weights
20:     ω ← ω + Σ_{k=1}^{K} (nk/n) ωk
21:     m ← m + Σ_{k=1}^{K} (nk/n) mk
22:     v ← v + Σ_{k=1}^{K} (nk/n) vk
23:   end while
24:
25: function ClientUpdate_k(ω, m, v)
26:   // Client weights, 1st and 2nd moments
27:   ωk ← ω; mk ← m; vk ← v
28:   for epoch ← 1 : E do
29:     batches ← (data Pk in batches of size B)
30:     for each b ∈ batches do
31:       // Perform Adam gradient descent
32:       ωk, mk, vk ← AdamSGD(ωk, mk, vk)
33:     end for
34:   end for
35:   // Compress ωk, mk, vk deltas and indexes
36:   ωs, idxs ← Sparsify(ωk − ω)
37:   g, b* ← GEncode(idxs)
38:   ωq, lzmin, lzmax, gzmin, gzmax ← UniQ(ωs)
39:   mq, bm ← ExpQ(mk − m)
40:   vq, bv ← ExpQ(vk − v)
41:   Return (ωq, mq, vq, g, b*, lzmin, lzmax,
42:     gzmin, gzmax, bm, bv) to server
43: end function

Algorithm 2 Uniform Quantization (UniQ) and Dequantization (UniDQ)
 1: function UniQ(a)
 2:   lzmin ← min(a−)
 3:   lzmax ← max(a−)
 4:   gzmin ← min(a+)
 5:   gzmax ← max(a+)
 6:   q ← 0_{|a|}   // initialize q, type 8-bit int
 7:   q_{a<0} ← round(127 · (a_{a<0} − lzmin) / (lzmax − lzmin))
 8:   q_{a>0} ← 128 + round(127 · (a_{a>0} − gzmin) / (gzmax − gzmin))
 9:   return q, lzmin, lzmax, gzmin, gzmax
10: end function
11:
12: function UniDQ(q, lzmin, lzmax, gzmin, gzmax)
13:   d ← 0_{|q|}   // initialize d, type 32-bit float
14:   d_{q<128} ← ((lzmax − lzmin)/127) · q_{q<128} + lzmin
15:   d_{q≥128} ← ((gzmax − gzmin)/127) · (q_{q≥128} − 128) + gzmin
16:   return d
17: end function

Algorithm 3 Exponential Quantization (ExpQ) and Dequantization (ExpDQ)
 1: function ExpQ(a)
 2:   b ← min(abs(a))^(−1/127)
 3:   q ← 0_{|a|}   // initialize q, type 8-bit int
 4:   p ← round(log(abs(a)) / log(b))
 5:   q_{a<0} ← abs(p_{a<0})
 6:   q_{a>0} ← 128 + abs(p_{a>0})
 7:   return q, b
 8: end function
 9:
10: function ExpDQ(q, b)
11:   d ← 0_{|q|}   // initialize d, type 32-bit float
12:   d_{0<q<128} ← −b^(−q_{0<q<128})
13:   d_{q≥128} ← b^(128 − q_{q≥128})
14:   return d
15: end function

CE-FedAvg provides other practical benefits over FedAvg. Due to the adaptive learning rates inherited from Adam, CE-FedAvg works well with the default Adam parameters. FedAvg, on the other hand, requires finding learning rates for each specific data set and scenario [3]. This is not only time consuming and costly, but in the FL setting the server does not have access to the client data sets, so it is unclear how this would be done in reality. Also, as CE-FedAvg reduces the number of rounds of communication needed to reach a target accuracy, the total amount of computation required of clients is also reduced: the extra cost of performing one round of CE-FedAvg over FedAvg is outweighed by the reduced rounds of learning. We are not aware of any work on FL that reduces the amount of computation at clients in this way.
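To make the server side of Algorithm 1 concrete, the sketch below shows the aggregation step (lines 19–22) once the client deltas have been dequantized and scattered back into dense tensors. The function name and data structures are our own illustration rather than the authors' implementation; tensors are assumed to be NumPy arrays so that in-place addition works.

```python
def aggregate(global_w, global_m, global_v, client_updates, n_total):
    """client_updates: list of (n_k, dw_k, dm_k, dv_k), where each d*_k is a list
    of dense delta tensors (already dequantized) and n_k is client k's sample count."""
    for n_k, dw_k, dm_k, dv_k in client_updates:
        scale = n_k / n_total  # weight each client's delta by its share of the data
        for w, dw in zip(global_w, dw_k):
            w += scale * dw
        for m, dm in zip(global_m, dm_k):
            m += scale * dm
        for v, dv in zip(global_v, dv_k):
            v += scale * dv
    return global_w, global_m, global_v
```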


IV. COMPRESSION STRATEGIES

Previous works compressing the models uploaded by clients typically rely on sparsification of weights and/or quantization of weights from typical 32-bit floats.

Our proposed technique comprises sparsification followed by quantization. To sparsify gradients, for each tensor, the top (1 − s) fraction of deltas with the largest absolute value are chosen, and they are extracted along with their (flattened) indexes. In CE-FedAvg, the corresponding m and v deltas are also sent along with the weight deltas, using the same indexes. We found that if noncorresponding m and v values are sent (i.e., the m and v tensors are sparsified independently, in the same manner as the weight tensors), this quickly produces exploding gradients at the clients, likely due to stale m and v values producing problems in the Adam optimization algorithm.
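A minimal sketch of this sparsification step, keeping the top (1 − s) fraction of a delta tensor by magnitude together with its flattened indices (function and variable names are illustrative):

```python
import numpy as np

def sparsify(delta, s):
    flat = delta.ravel()
    k = max(1, int(round((1.0 - s) * flat.size)))  # number of entries to keep
    idxs = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the k largest magnitudes
    idxs = np.sort(idxs)                           # sorted, so index gaps can be Golomb encoded
    return flat[idxs], idxs
```

The same idxs would then be used to select the corresponding m and v deltas.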
After sparsification, the weight deltas, m and v, are quantized from 32-bit floats to 8-bit unsigned ints. The weight deltas contain the positive and negative values with the greatest (1 − s) fraction of magnitudes, and are quantized as per the uniform quantization scheme shown in Algorithm 2. To quantize using uniform quantization, the greatest positive and greatest negative, and lowest positive and lowest negative values are chosen (lines 2–5). These are used to map all values lower than zero to the integers 0–127 (line 7) and values greater than zero to 128–255 (line 8). To dequantize, the reverse functions are applied (lines 14 and 15) to return a vector of floats.
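A NumPy sketch of the uniform quantization and dequantization of Algorithm 2 (it assumes the input contains both negative and positive values with nondegenerate ranges):

```python
import numpy as np

def uni_q(a):
    neg, pos = a < 0, a > 0
    lz_min, lz_max = a[neg].min(), a[neg].max()  # range of the negative values
    gz_min, gz_max = a[pos].min(), a[pos].max()  # range of the positive values
    q = np.zeros(a.shape, dtype=np.uint8)
    q[neg] = np.rint(127 * (a[neg] - lz_min) / (lz_max - lz_min))
    q[pos] = 128 + np.rint(127 * (a[pos] - gz_min) / (gz_max - gz_min))
    return q, (lz_min, lz_max, gz_min, gz_max)

def uni_dq(q, ranges):
    lz_min, lz_max, gz_min, gz_max = ranges
    d = np.zeros(q.shape, dtype=np.float32)
    lo, hi = q < 128, q >= 128
    d[lo] = (lz_max - lz_min) / 127 * q[lo] + lz_min
    d[hi] = (gz_max - gz_min) / 127 * (q[hi] - 128) + gz_min
    return d
```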
Analysis of the m and v values produced by Adam shows that there is a large range in the scale of these values: the authors observed values with exponents ranging from 10^−1 to 10^−35. Attempts to quantize the m and v deltas show that they are very sensitive to errors in their exponent, and quantizing them using uniform quantization or the schemes from other works results in exploding gradients within a few rounds. Therefore, we propose a different method of quantization, dubbed exponential quantization, for these deltas.
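The exponential scheme, described in the next paragraph and in Algorithm 3, stores each delta as a rounded exponent of a per-tensor base. The following is a NumPy sketch under the assumptions that all entries are nonzero and have magnitude below 1, as is the case for the m and v deltas discussed above:

```python
import numpy as np

def exp_q(a):
    b = np.abs(a).min() ** (-1.0 / 127)         # base chosen so the smallest magnitude maps to exponent -127
    p = np.rint(np.log(np.abs(a)) / np.log(b))  # per-value exponents, in [-127, 0]
    q = np.zeros(a.shape, dtype=np.uint8)
    q[a < 0] = np.abs(p[a < 0])
    q[a > 0] = 128 + np.abs(p[a > 0])
    return q, b

def exp_dq(q, b):
    d = np.zeros(q.shape, dtype=np.float32)
    neg, pos = (q > 0) & (q < 128), q >= 128
    d[neg] = -(b ** (-q[neg].astype(np.float64)))  # negative deltas: -b^(-q)
    d[pos] = b ** (128.0 - q[pos])                 # positive deltas:  b^(128-q)
    return d
```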
In exponential quantization, negative values are again mapped to the range 0–127, and positive values to 128–255. Algorithm 3 shows the procedure. To quantize a tensor, the smallest absolute value, δ, is found, and b = δ^(−1/127) is computed (line 2). This provides the base for quantization: the minimum base with an exponent of 127 able to represent the value with the smallest magnitude in the tensor. Using the minimum possible base provides the highest resolution for the quantized values. The logarithm with base b of all values in the tensor is then found and rounded to the nearest integer (line 4), with 128 added to positive values to map them to 128–255 (line 6). To dequantize, b is simply raised to the power of q for each value, multiplied by −1 for q < 128 (lines 12 and 13).

The last items to be compressed are the indexes of the values in the sparse arrays sent from the clients to the server. For these values, we use the lossless Golomb encoding [15] technique as used by Sattler et al. [14] in their sparse ternary compression scheme.

Using these techniques, compressed FedAvg therefore uploads, for each weight tensor: the weight deltas (8-bit integers); the four min/max values from uniform quantization (4 × 32 bits); their indexes (Golomb encoded); and b* (a 32-bit float used for Golomb encoding). CE-FedAvg must communicate all of these, plus the m and v deltas (both 8-bit integers) and the base b for each of these from exponential quantization (32-bit floats). The total number of uploaded bits after compression, per client per round, is therefore

FedAvg:     bits_up = 160|W| + (1 − s)(8 + g_s) Σ_{ω∈W} |ω|        (1)

CE-FedAvg:  bits_up = 224|W| + (1 − s)(24 + g_s) Σ_{ω∈W} |ω|       (2)

where s is the sparsity, W is the set of weight tensors comprising the network, and g_s is the expected number of bits needed to Golomb encode one index value for a given sparsity.

The implementation of GE [15] is the same as used in [14]. The expected number of bits per value for a given sparsity, g_s, is given as

g_s = b* + 1 / (1 − s^(2^b*))                                      (3)

b* = 1 + ⌊log2(log(φ − 1) / log(s))⌋                               (4)

where φ = (√5 + 1)/2 ≈ 1.62 is the golden ratio. The value of b* is sent from client to server to convert the GE bit string back to integer values.

Fig. 2. Top: expected number of bits per value, g_s, using Golomb encoding versus sparsity. Bottom: compression ratio of CE-FedAvg and FedAvg, and CE-FedAvg:FedAvg uploaded data ratio versus sparsity.

Plotting g_s [Fig. 2 (top)], the second terms of the above equations, and the nominal (without the |W| terms) ratio of CE-FedAvg to FedAvg uploaded data for different s [Fig. 2 (bottom)] shows that CE-FedAvg compresses more than FedAvg using this scheme. This means the total amount of data uploaded by a CE-FedAvg client in a single round is between 2.6× and 2× that of FedAvg for 0.5 ≤ s < 1.0, as opposed to 3× if there were no compression. However, despite communicating 2–2.6× more data than FedAvg per client per round, due to the reduction in rounds achieved by CE-FedAvg, clients still upload less total data than FedAvg in many cases.
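As a quick check of this accounting, the sketch below evaluates (1)–(4) for a list of tensor sizes; natural logarithms are assumed in (4), and the helper names are illustrative.

```python
import math

PHI = (5 ** 0.5 + 1) / 2  # golden ratio

def golomb_bits(s):
    """Golomb parameter b* and expected bits per index g_s for sparsity s, per (3)-(4)."""
    b_star = 1 + math.floor(math.log2(math.log(PHI - 1) / math.log(s)))
    g_s = b_star + 1.0 / (1.0 - s ** (2 ** b_star))
    return b_star, g_s

def upload_bits(tensor_sizes, s, ce_fedavg=False):
    """Per-client, per-round upload in bits, per (1)-(2); tensor_sizes holds |w| per tensor."""
    _, g_s = golomb_bits(s)
    per_tensor = 224 if ce_fedavg else 160      # min/max values and bases, 32 bits each
    per_value = (24 if ce_fedavg else 8) + g_s  # 8-bit deltas (x3 for CE-FedAvg) + encoded index
    return per_tensor * len(tensor_sizes) + (1 - s) * per_value * sum(tensor_sizes)
```

For s = 0.9 this gives roughly 25× compression of the FedAvg upload relative to 32-bit weights, in line with the figures quoted in Section V.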


TABLE I: Rounds required to reach target test accuracies for FedAvg (gray) and CE-FedAvg (white), with upload sparsity s = 0.6.

V. EXPERIMENTS

A. Simulation Setup

A series of experiments was conducted to evaluate CE-FedAvg against FedAvg. The experiments were image classification tasks using the MNIST [27] and CIFAR-10 [28] data sets. Models were implemented using TensorFlow [29].

MNIST: 28 × 28 gray-scale images of handwritten digits in 10 classes. Two models were trained on this data set. The first (MNIST-2NN) had two fully connected layers of 200 neurons with ReLU activation, and a softmax output layer. The second (MNIST-CNN) was a convolutional network consisting of: two 5 × 5 convolutional layers with 32 and 64 output neurons, respectively, each followed by 2 × 2 max pooling and ReLU activation; a fully connected layer with 512 neurons and ReLU activation; and a softmax output layer.

CIFAR-10: 32 × 32 color images of objects in ten classes. One network (CIFAR-CNN) was trained on this data set, with: two 3 × 3 convolutional layers with 32 output neurons, L2 regularization, and each followed by batch normalization; a dropout layer with d = 0.2; two 3 × 3 convolutional layers with 64 output neurons, L2 regularization, and batch normalization; a second dropout layer with d = 0.3; two final 3 × 3 convolutional layers with L2 regularization and batch normalization; a final d = 0.4 dropout layer; and a softmax output layer.
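For illustration, a possible tf.keras definition of the MNIST-CNN just described; padding, initializers, and other details not stated in the text are assumptions.

```python
import tensorflow as tf

def mnist_cnn():
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 5, padding="same", input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(2),  # 2x2 max pooling after each conv layer
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(64, 5, padding="same"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
```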
The networks were trained on the data sets using differing numbers of clients, classes per client, client participation rates, compression rates, and either FedAvg or CE-FedAvg, until a given target accuracy was achieved. The models had the following target accuracies: MNIST-2NN, 97%; MNIST-CNN, 99%; and CIFAR-CNN, 60%. For each setting, values of E were tested to find the best for that setting, and when using FedAvg, different learning rates were also tested.

The entire data set was split across all clients in each case. Therefore, increasing the number of clients resulted in fewer samples per worker. To produce IID data, the data sets were shuffled and each client was given an equal portion of the data. For non-IID data, the data sets were sorted by class, divided into slices, and each client was given two slices. This resulted in most clients having data from only two classes. The test data for each data set were taken from the official test sets.
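A sketch of this partitioning (the assignment of shards to clients is randomized here; the exact assignment used by the authors is not specified):

```python
import numpy as np

def partition(x, y, n_clients, iid=True, shards_per_client=2, seed=0):
    rng = np.random.RandomState(seed)
    # IID: shuffle all samples; non-IID: sort by label so each shard covers few classes.
    order = rng.permutation(len(y)) if iid else np.argsort(y, kind="stable")
    shards = np.array_split(order, n_clients * shards_per_client)
    assignment = rng.permutation(len(shards))
    clients = []
    for c in range(n_clients):
        take = assignment[c * shards_per_client:(c + 1) * shards_per_client]
        idx = np.concatenate([shards[i] for i in take])
        clients.append((x[idx], y[idx]))
    return clients
```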
B. Simulation Results

We ran experiments using FedAvg and CE-FedAvg to reach a given target accuracy as described above. The networks were run with IID (Y = 10) and non-IID (Y = 2) data partitioning, where Y is the number of classes per worker. For FedAvg, multiple global learning rates were also trialed for each scenario. Tables I–III show the average number of rounds to reach a given target accuracy for the best parameter setup in each case. "NC" is used where no set of parameters could be found to get the algorithm to converge to the target. Sparsity rates of s = {0.6, 0.9, 0.95} provide approximately {7, 25, 53}× compression for FedAvg, and approximately {9, 33, 68}× compression for CE-FedAvg, resulting in a FedAvg:CE-FedAvg uploaded data ratio per round of ≈ 1:2.3.

Table I shows that more rounds are required to converge in non-IID scenarios than in IID ones, as is consistent with other FL works. As clients' distributions are very different in the non-IID scenarios, their models diverge more between aggregations, harming the global model. Table I also shows that, with moderate compression, CE-FedAvg was able to reach the target in significantly fewer rounds in the non-IID scenarios. This may be because of CE-FedAvg's adaptive learning rates: Adam was able to make more fine-tuned changes to the worker models between aggregations, so these models did not diverge as much. CE-FedAvg did comparatively better with more workers/decreasing data set size per worker. CE-FedAvg reduced the rounds taken by ≥ 2.3× in ten cases, including all the MNIST-CNN non-IID cases. In the MNIST-CNN, W = 40, C = 0.5, Y = 2 case, CE-FedAvg reduced the number of rounds by 5.3× over FedAvg.

Tables II and III show that higher compression rates increase the number of rounds to reach the target, especially in the non-IID cases. Again, in all non-IID cases, CE-FedAvg is able to reach the target far faster than FedAvg. For s = 0.9, CE-FedAvg reached the target in ≥ 2.3× fewer rounds in 7 cases. For s = 0.95, CE-FedAvg took ≥ 2.3× fewer rounds in 5 cases. Taking the same MNIST-CNN case, with s = 0.9, CE-FedAvg reaches the target in 6× fewer rounds, and 4.3× fewer for s = 0.95.

For all non-IID CIFAR-10 s = 0.9 and 0.95 cases, compressed FedAvg could not converge within 1000 rounds. The CIFAR-10 problem is much more complex than MNIST. It is likely that FedAvg could not converge due to its fixed learning rate. CE-FedAvg, on the other hand, could reliably converge even in these extreme settings.


TABLE II: Rounds required to reach target test accuracies for FedAvg (gray) and CE-FedAvg (white), with upload sparsity s = 0.9. NC denotes cases unable to converge.

TABLE III: Rounds required to reach target test accuracies for FedAvg (gray) and CE-FedAvg (white), with upload sparsity s = 0.95. NC denotes cases unable to converge.

Not considering these cases, CE-FedAvg was less likely to diverge during training than FedAvg. Over the total 962 FedAvg and 1034 CE-FedAvg experiments conducted (after finding suitable E values and learning rates for FedAvg, and not including the NC FedAvg cases), FedAvg diverged before reaching the target in 2% of the experiments, whereas CE-FedAvg diverged in 0.1% of cases (a single case). This reliability may be due to the adaptive gradients of Adam: discrepancies between the model weights downloaded from the server and what is suitable for the specific client's learning problem are more easily overcome with the adaptive learning rates.

While also being more reliable than FedAvg, CE-FedAvg was able to achieve this using the default parameters for Adam optimization in every case. This resulted in much faster experiment set times for CE-FedAvg, as multiple learning rates did not have to be trialed. Tuning the Adam parameters may have achieved even better results than those listed above. FL considers ML where a central server does not have access to training data due to client privacy. Therefore, this presents a major advantage over FedAvg: without a central test/validation set of data, it would be infeasible to test multiple learning rates for FedAvg before conducting the actual FL, whereas CE-FedAvg, for all of the above experiments, worked out of the box with no parameter tuning.

Fig. 3. Top: compressed uploaded data per client per round for FedAvg and CE-FedAvg with different sparsities, for the MNIST-CNN W = 40, C = 1.0, and Y = 2 scenario. Bottom: total uploaded data during training for the same scenario.

Fig. 3 shows the compressed size of the MNIST-CNN updates for FedAvg and CE-FedAvg, including an uncompressed case (s = 0). It is interesting to see that while CE-FedAvg uploads more data per client per round (due to the extra variables from Adam) in all cases, the total data uploaded by CE-FedAvg is far lower than that of FedAvg in all cases. This is due to the large decrease in rounds to convergence that CE-FedAvg gives.

C. Testbed Setup

To test the real-time convergence of CE-FedAvg over FedAvg, we used an RPi testbed to simulate a heterogeneous, low-powered edge-computing scenario. The testbed consisted of 5 RPi 2Bs and 5 RPi 3Bs. A desktop acted as the server, over a wireless network to emulate lower bandwidth networking.


The work of the server in these experiments was small: receiving and decompressing, aggregating, and resending models to clients. Therefore, the server had a small impact on the time experiments took to run, and the vast majority of time taken in the FedAvg/CE-FedAvg algorithm was on the RPi clients. The RPi clients all had an install of Raspbian OS. Software was written in Python using TensorFlow 1.12.

Experiments with ten workers, the MNIST-2NN and MNIST-CNN models, and sparsity rate s = 0.6 were performed to evaluate the runtime of CE-FedAvg and FedAvg. Round times were taken and averaged, and then used with the total rounds from the relevant parts of Table I to determine the total time that the experiment would take on the testbed. Time taken to complete rounds was very consistent on the testbed, making this a reliable estimator of total time.

D. Testbed Results

We ran experiments to get the estimated real time on an RPi testbed for a set of 4 experiments to reach given target accuracies. The results are shown in Fig. 4.

Fig. 4. Estimated time on the RPi testbed of different FL scenarios.

The times in Fig. 4 show that CE-FedAvg is able to converge to a given target accuracy in less real time than FedAvg with similar compression. Although the time taken per round is greater for CE-FedAvg than FedAvg (due in small part to the extra computation required by Adam, but mostly due to the increased communication per round compared to similarly compressed FedAvg), the number of rounds taken to converge, as per Table I, was lower in all the given experiments. Fig. 4 shows CE-FedAvg was able to converge 1.2–1.7× faster than FedAvg.

VI. CONCLUSION

FL can allow distributed ML to be performed at the network edge using data generated by IoT devices. We adapted FedAvg (a popular FL algorithm) with Adam optimization and compression to produce CE-FedAvg, which reduces total uploaded data and rounds compared to similarly compressed FedAvg. Extensive experiments on the MNIST and CIFAR-10 data sets showed CE-FedAvg was generally able to reach a target accuracy in far fewer communication rounds than FedAvg in non-IID settings (up to 6× fewer). These experiments showed CE-FedAvg was also far more robust to aggressive compression of uploaded data, and able to converge with up to 3× less total uploaded data per client. Further experiments using an RPi testbed showed CE-FedAvg could converge in up to 1.7× less real time. CE-FedAvg, therefore, presented the benefits of being able to train a model in fewer communication rounds (reducing the overall data and computing cost of training), in less real time, and with less uploaded data than uncompressed FedAvg, a unique result considering most schemes using compression reduce uploaded data at the cost of more rounds to convergence. Future work in this area could investigate other SGD-type algorithms applied to FL, and compressing the models downloaded by clients from the server.

REFERENCES

[1] K. L. Lueth. (Aug. 2018). State of the IoT 2018: Number of IoT Devices Now at 7B—Market Accelerating. [Online]. Available: iot-analytics.com/state-of-the-iot-update-q1-q2-2018-number-of-iot-devices-now-7b/
[2] H. B. McMahan and D. Ramage, Federated Learning: Collaborative Machine Learning Without Centralized Training Data, Google, Mountain View, CA, USA, Apr. 2017. [Online]. Available: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
[3] H. B. McMahan, E. Moore, D. Ramage, and B. A. Y. Arcas, "Federated learning of deep networks using model averaging," arXiv preprint arXiv:1602.05629, 2016.
[4] A. Hard et al., "Federated learning for mobile keyboard prediction," arXiv preprint arXiv:1811.03604, 2018.
[5] S. Ramaswamy, R. Mathews, K. Rao, and F. Beaufays, "Federated learning for emoji prediction in a mobile keyboard," arXiv preprint arXiv:1906.04329, 2019.
[6] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," J. Mach. Learn. Res., vol. 12, pp. 2121–2159, Jul. 2011.
[7] T. Tieleman and G. Hinton, "Coursera: Neural networks for machine learning," Univ. Toronto, Toronto, ON, Canada, Rep., 2012.
[8] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–13.
[9] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, "Federated learning with non-IID data," arXiv preprint arXiv:1806.00582, 2018.
[10] S. Wang et al., "Adaptive federated learning in resource constrained edge computing systems," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
[11] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," in Proc. NIPS Workshop Private Multi-Party Mach. Learn., 2016.
[12] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, "Sparse binary compression: Towards distributed deep learning with minimal communication," in Proc. Int. Joint Conf. Neural Netw., 2019, pp. 1–8.
[13] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally, "Deep gradient compression: Reducing the communication bandwidth for distributed training," in Proc. Int. Conf. Learn. Represent., 2018.
[14] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, "Robust and communication-efficient federated learning from non-IID data," arXiv preprint arXiv:1903.02891, 2019.
[15] S. Golomb, "Run-length encodings (corresp.)," IEEE Trans. Inf. Theory, vol. 12, no. 3, pp. 399–401, 1966.
[16] Y. Chen, X. Sun, and Y. Jin, "Communication-efficient federated deep learning with asynchronous model update and temporally weighted aggregation," arXiv preprint arXiv:1903.07424, 2019.
[17] L. Liu, J. Zhang, S. H. Song, and K. B. Letaief, "Edge-assisted hierarchical federated learning with non-IID data," arXiv preprint arXiv:1905.06641, 2019.


[18] D. Leroy, A. Coucke, T. Lavril, T. Gisselbrecht, and J. Dureau, "Federated learning for keyword spotting," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2019, pp. 6341–6345.
[19] C. Liu et al., "A new deep learning-based food recognition system for dietary assessment on an edge computing service infrastructure," IEEE Trans. Services Comput., vol. 11, no. 2, pp. 249–261, Mar./Apr. 2018.
[20] H. Li, K. Ota, and M. Dong, "Learning IoT in edge: Deep learning for the Internet of Things with edge computing," IEEE Netw., vol. 32, no. 1, pp. 96–101, Jan./Feb. 2018.
[21] L. Lyu, J. C. Bezdek, X. He, and J. Jin, "Fog-embedded deep learning for the Internet of Things," IEEE Trans. Ind. Informat., vol. 15, no. 7, pp. 4206–4215, Jul. 2019.
[22] D. G. Reina, H. Tawfik, and S. L. Toral, "Multi-subpopulation evolutionary algorithms for coverage deployment of UAV-networks," Ad Hoc Netw., vol. 68, pp. 16–32, Jan. 2018.
[23] R. K. Pathinarupothi, P. Durga, and E. S. Rangan, "IoT-based smart edge for global health: Remote monitoring with severity detection and alerts transmission," IEEE Internet Things J., vol. 6, no. 2, pp. 2449–2462, Apr. 2019.
[24] P. S. Chandakkar, Y. Li, P. L. K. Ding, and B. Li, "Strategies for re-training a pruned neural network in an edge computing paradigm," in Proc. IEEE Int. Conf. Edge Comput., Honolulu, HI, USA, 2017, pp. 244–247.
[25] S. Yao, Y. Zhao, A. Zhang, L. Su, and T. Abdelzaher, "DeepIoT: Compressing deep neural network structures for sensing systems with a compressor-critic framework," in Proc. 15th ACM Conf. Embedded Netw. Sensor Syst., 2017, pp. 1–14.
[26] H. Guo, S. Li, B. Li, Y. Ma, and X. Ren, "A new learning automata-based pruning method to train deep neural networks," IEEE Internet Things J., vol. 5, no. 5, pp. 3263–3269, Oct. 2018.
[27] Y. LeCun and C. Cortes. (2010). MNIST Handwritten Digit Database. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[28] A. Krizhevsky, "Learning multiple layers of features from tiny images," Univ. Toronto, Toronto, ON, Canada, Rep. TR-2009, 2009.
[29] M. Abadi et al. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. [Online]. Available: tensorflow.org

Jed Mills received the B.Sc. degree in natural science from the University of Exeter, Exeter, U.K., in 2018, where he is currently pursuing the Ph.D. degree in computer science with the College of Engineering, Maths and Physical Science. His research interests are in machine learning, distributed machine learning, and mobile edge computing.

Jia Hu received the B.Eng. and M.Eng. degrees in electronic engineering from the Huazhong University of Science and Technology, Wuhan, China, in 2006 and 2004, respectively, and the Ph.D. degree in computer science from the University of Bradford, Bradford, U.K., in 2010. He is a Lecturer of computer science with the University of Exeter, Exeter, U.K. His research interests include cloud and edge computing, resource optimization, applied machine learning, and network security.

Geyong Min received the B.Sc. degree in computer science from the Huazhong University of Science and Technology, Wuhan, China, in 1995, and the Ph.D. degree in computing science from the University of Glasgow, Glasgow, U.K., in 2003. He is a Professor of high performance computing and networking with the Department of Computer Science, College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter, U.K. His research interests include future Internet, computer networks, wireless communications, multimedia systems, information security, high-performance computing, ubiquitous computing, modeling, and performance engineering.
