
Energy Efficient Federated Learning Over Wireless Communication Networks

Zhaohui Yang, Mingzhe Chen, Walid Saad, Fellow, IEEE, Choong Seon Hong, Senior Member, IEEE, and Mohammad Shikh-Bahaei, Senior Member, IEEE

Abstract—In this paper, the problem of energy efficient transmission and computation resource allocation for federated learning (FL) over wireless communication networks is investigated. In the considered model, each user exploits limited local computational resources to train a local FL model with its collected data and, then, sends the trained FL model to a base station (BS) which aggregates the local FL models and broadcasts the result back to all of the users. Since FL involves an exchange of a learning model between users and the BS, both computation and communication latencies are determined by the learning accuracy level. Meanwhile, due to the limited energy budget of the wireless users, both local computation energy and transmission energy must be considered during the FL process. This joint learning and communication problem is formulated as an optimization problem whose goal is to minimize the total energy consumption of the system under a latency constraint. To solve this problem, an iterative algorithm is proposed where, at every step, closed-form solutions for time allocation, bandwidth allocation, power control, computation frequency, and learning accuracy are derived. Since the iterative algorithm requires an initial feasible solution, we construct the completion time minimization problem and propose a bisection-based algorithm to obtain its optimal solution, which is a feasible solution to the original energy minimization problem. Numerical results show that the proposed algorithms can reduce energy consumption by up to 59.5% compared to the conventional FL method.

Index Terms—Federated learning, resource allocation, energy efficiency.

This work was supported in part by the EPSRC through the Scalable Full Duplex Dense Wireless Networks (SENSE) grant EP/P003486/1, and by the U.S. National Science Foundation under Grant CNS-1814477. A preliminary version of this work was presented at the IEEE ICML 2020 Workshop [1]. Z. Yang and M. Shikh-Bahaei are with the Centre for Telecommunications Research, Department of Engineering, King's College London, WC2R 2LS, UK, Emails: yang.zhaohui@kcl.ac.uk, m.sbahaei@kcl.ac.uk. M. Chen is with the Electrical Engineering Department of Princeton University, NJ, 08544, USA, and also with the Chinese University of Hong Kong, Shenzhen, 518172, China, Email: mingzhec@princeton.edu. W. Saad is with Wireless@VT, Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, 24060, USA, Email: walids@vt.edu. C. Hong is with the Department of Computer Science and Engineering, Kyung Hee University, Yongin-si, Gyeonggi-do 17104, Rep. of Korea, Email: cshong@khu.ac.kr.

I. INTRODUCTION

In future wireless systems, due to privacy constraints and limited communication resources for data transmission, it is impractical for all wireless devices to transmit all of their collected data to a data center that can use the collected data to implement centralized machine learning algorithms for data analysis and inference [2]–[7]. To this end, distributed learning frameworks are needed that enable the wireless devices to collaboratively build a shared learning model by training on their collected data locally [6], [8]–[16]. One of the most promising distributed learning algorithms is the emerging federated learning (FL) framework that will be adopted in future Internet of Things (IoT) systems [17]–[26]. In FL, wireless devices can cooperatively execute a learning task by uploading only their local learning models to the base station (BS) instead of sharing the entirety of their training data [27]. Using gradient sparsification, a digital transmission scheme based on gradient quantization was investigated in [28]. To implement FL over wireless networks, the wireless devices must transmit their local training results over wireless links [29], which can affect the performance of FL due to limited wireless resources (such as time and bandwidth). In addition, the limited energy of wireless devices is a key challenge for deploying FL. Indeed, because of these resource constraints, it is necessary to optimize the energy efficiency for FL implementation.

Some of the challenges of FL over wireless networks have been studied in [7], [30]–[35]. To minimize latency, a broadband analog aggregation multi-access scheme was designed in [30] for FL by exploiting the waveform-superposition property of a multi-access channel. An FL training minimization problem was investigated in [31] for cell-free massive multiple-input multiple-output (MIMO) systems. For FL with redundant data, an energy-aware user scheduling policy was proposed in [32] to maximize the average number of scheduled users. To improve the statistical learning performance for on-device distributed training, the authors in [33] developed a novel sparse and low-rank modeling approach. The work in [34] introduced an energy-efficient strategy for bandwidth allocation under learning performance constraints. However, the works in [30]–[34] focused on the delay/energy for wireless transmission without considering the delay/energy tradeoff between learning and transmission. Recently, the works in [7] and [35] considered both the local learning energy and the wireless transmission energy. In [7], we investigated the FL loss function minimization problem taking into account packet errors over wireless links. However, this prior work ignored the computation delay of the local FL model. The authors in [35] considered the sum computation and transmission energy minimization problem for FL. However, the solution in [35] requires all users to upload their learning models synchronously. Meanwhile, the work in [35] did not provide any convergence analysis for FL.

The main contribution of this paper is a novel energy efficient computation and transmission resource allocation scheme for FL over wireless communication networks. Our key contributions include:

• We study the performance of the FL algorithm over wireless communication networks for a scenario in which each user locally computes its FL model under a given learning accuracy and the BS broadcasts the aggregated FL model to all users. For the considered FL algorithm, we first derive the convergence rate.

Fig. 1. Illustration of the considered model for FL over wireless communication networks.

Fig. 2. The FL procedure between users and the BS.
• We formulate a joint computation and transmission optimization problem aiming to minimize the total energy consumption for local computation and wireless transmission. To solve this problem, an iterative algorithm is proposed with low complexity. At each step of this algorithm, we derive new closed-form solutions for the time allocation, bandwidth allocation, power control, computation frequency, and learning accuracy.
• To obtain a feasible solution for the total energy minimization problem, we construct the FL completion time minimization problem. We theoretically show that the completion time is a convex function of the learning accuracy. Based on this theoretical finding, we propose a bisection-based algorithm to obtain the optimal solution for the FL completion time minimization.
• Simulation results show that the proposed scheme that jointly considers computation and transmission optimization can achieve up to 59.5% energy reduction compared to the conventional FL method.

The rest of this paper is organized as follows. The system model is described in Section II. Section III provides the problem formulation and the resource allocation for total energy minimization. The algorithm to find a feasible solution of the original energy minimization problem is given in Section IV. Simulation results are analyzed in Section V. Conclusions are drawn in Section VI.

II. SYSTEM MODEL

Consider a cellular network that consists of one BS serving K users, as shown in Fig. 1. Each user k has a local dataset D_k with D_k data samples. For each dataset D_k = {x_{kl}, y_{kl}}_{l=1}^{D_k}, x_{kl} ∈ R^d is an input vector of user k and y_{kl} is its corresponding output¹. The BS and all users cooperatively perform an FL algorithm over wireless networks for data analysis and inference. Hereinafter, the FL model that is trained by each user's dataset is called the local FL model, while the FL model that is generated by the BS using the local FL model inputs from all users is called the global FL model.

¹For simplicity, we consider an FL algorithm with a single output. In future work, our approach will be extended to the case with multiple outputs.

A. Computation and Transmission Model

The FL procedure between the users and their serving BS is shown in Fig. 2. From this figure, the FL procedure contains three steps at each iteration: local computation at each user (using several local iterations), local FL model transmission for each user, and result aggregation and broadcast at the BS. The local computation step is essentially the phase during which each user calculates its local FL model by using its local dataset and the received global FL model.

1) Local Computation: Let f_k be the computation capacity of user k, which is measured by the number of CPU cycles per second. The computation time at user k needed for data processing is:

\tau_k = \frac{I_k C_k D_k}{f_k}, \quad ∀k ∈ K,   (1)

where C_k (cycles/sample) is the number of CPU cycles required for computing one sample of data at user k and I_k is the number of local iterations at user k. According to Lemma 1 in [36], the energy consumption for computing a total number of C_k D_k CPU cycles at user k is:

E_{k1}^C = κ C_k D_k f_k^2,   (2)

where κ is the effective switched capacitance that depends on the chip architecture. To compute the local FL model, user k needs to compute C_k D_k CPU cycles with I_k local iterations, which means that the total computation energy at user k is:

E_k^C = I_k E_{k1}^C = κ I_k C_k D_k f_k^2.   (3)

2) Wireless Transmission: After local computation, all users upload their local FL models to the BS via frequency domain multiple access (FDMA). The achievable rate of user k is:

r_k = b_k \log_2\left(1 + \frac{g_k p_k}{N_0 b_k}\right), \quad ∀k ∈ K,   (4)

where b_k is the bandwidth allocated to user k, p_k is the average transmit power of user k, g_k is the channel gain between user k and the BS, and N_0 is the power spectral density of the Gaussian noise. Note that the Shannon capacity (4) serves as an upper bound on the transmission rate. Due to the limited bandwidth of the system, we have \sum_{k=1}^K b_k ≤ B, where B is the total bandwidth.

In this step, user k needs to upload the local FL model to the BS. Since the dimensions of the local FL model are fixed for all users, the data size that each user needs to upload is constant, and can be denoted by s. To upload data of size s within transmission time t_k, we must have t_k r_k ≥ s. For a transmission of duration t_k, the wireless transmit energy of user k will be E_k^T = t_k p_k.
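For concreteness, the per-user latency and energy model in (1)-(4) can be evaluated numerically. The short Python sketch below does this for a single user; it is an illustration only, and all numerical values (as well as the helper names) are assumptions rather than parameters taken from this paper.

    import math

    N0 = 10 ** (-174 / 10) * 1e-3  # noise PSD: -174 dBm/Hz converted to W/Hz

    def local_computation(I_k, C_k, D_k, f_k, kappa=1e-28):
        """Computation time (1) and total computation energy (3) of user k."""
        tau_k = I_k * C_k * D_k / f_k                # seconds per global round
        E_k_C = kappa * I_k * C_k * D_k * f_k ** 2   # Joules per global round
        return tau_k, E_k_C

    def uplink_transmission(s, b_k, p_k, g_k):
        """Rate (4), minimum upload time from t_k r_k >= s, and energy E_k^T."""
        r_k = b_k * math.log2(1 + g_k * p_k / (N0 * b_k))  # bits/s
        t_k = s / r_k                                      # seconds
        return t_k, t_k * p_k                              # (time, Joules)

    # Assumed example: C_k = 2e4 cycles/sample, D_k = 500 samples, I_k = 20,
    # f_k = 1 GHz, s = 28.1 kbits, b_k = 0.4 MHz, p_k = 10 mW, g_k = 1e-10.
    tau_k, E_C = local_computation(20, 2e4, 500, 1e9)
    t_k, E_T = uplink_transmission(28.1e3, 4e5, 1e-2, 1e-10)

Multiplying the per-round sums by the number of global iterations I_0 then gives the total energy in (5) below.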

Algorithm 1 FL Algorithm
1: Initialize the global model w^{(0)} and the global iteration number n = 0.
2: repeat
3:    Each user k computes ∇F_k(w^{(n)}) and sends it to the BS.
4:    The BS computes ∇F(w^{(n)}) = \frac{1}{K} \sum_{k=1}^K ∇F_k(w^{(n)}), which is broadcast to all users.
5:    parallel for user k ∈ K = {1, · · · , K}
6:       Initialize the local iteration number i = 0 and set h_k^{(n),(0)} = 0.
7:       repeat
8:          Update h_k^{(n),(i+1)} = h_k^{(n),(i)} − δ∇G_k(w^{(n)}, h_k^{(n),(i)}) and i = i + 1.
9:       until the accuracy η of the local optimization problem (11) is obtained.
10:      Denote h_k^{(n)} = h_k^{(n),(i)} and each user sends h_k^{(n)} to the BS.
11:   end for
12:   The BS computes
         w^{(n+1)} = w^{(n)} + \frac{1}{K} \sum_{k=1}^K h_k^{(n)},   (10)
      and broadcasts the value to all users.
13:   Set n = n + 1.
14: until the accuracy ǫ_0 of problem (9) is obtained.

3) Information Broadcast: In this step, the BS aggregates the global FL model and broadcasts it to all users in the downlink. Due to the high transmit power at the BS and the high bandwidth that can be used for data broadcasting, the downlink time is neglected compared to the uplink data transmission time. It can be observed that the local data D_k is not accessed by the BS, so as to protect the privacy of the users, as is required by FL.

According to the above FL model, the energy consumption of each user includes both the local computation energy E_k^C and the wireless transmission energy E_k^T. Denoting the number of global iterations by I_0, the total energy consumption of all users that participate in FL will be:

E = I_0 \sum_{k=1}^K (E_k^C + E_k^T).   (5)

Fig. 3. An implementation for the FL algorithm via FDMA.

Hereinafter, the total time needed for completing the execution of the FL algorithm is called the completion time. The completion time of each user includes the local computation time and the transmission time, as shown in Fig. 3. Based on (1), the completion time T_k of user k will be:

T_k = I_0(\tau_k + t_k) = I_0\left(\frac{I_k C_k D_k}{f_k} + t_k\right).   (6)

Letting T be the maximum completion time for training the entire FL algorithm, we have:

T_k ≤ T, \quad ∀k ∈ K.   (7)

B. FL Model

We define a vector w to capture the parameters related to the global FL model. We introduce the loss function f(w, x_{kl}, y_{kl}) that captures the FL performance over the input vector x_{kl} and the output y_{kl}. For different learning tasks, the loss function will be different. For example, f(w, x_{kl}, y_{kl}) = \frac{1}{2}(x_{kl}^T w − y_{kl})^2 for linear regression and f(w, x_{kl}, y_{kl}) = −\log(1 + \exp(−y_{kl} x_{kl}^T w)) for logistic regression. Since the dataset of user k is D_k, the total loss function of user k will be:

F_k(w, x_{k1}, y_{k1}, · · · , x_{kD_k}, y_{kD_k}) = \frac{1}{D_k} \sum_{l=1}^{D_k} f(w, x_{kl}, y_{kl}).   (8)

Note that f(w, x_{kl}, y_{kl}) is the loss function of user k for one data sample, while F_k(w, x_{k1}, y_{k1}, · · · , x_{kD_k}, y_{kD_k}) is the total loss function of user k over the whole local dataset. In the following, F_k(w, x_{k1}, y_{k1}, · · · , x_{kD_k}, y_{kD_k}) is denoted by F_k(w) for simplicity of notation.

In order to deploy an FL algorithm, it is necessary to train the underlying model. Training is done in order to generate a unified FL model for all users without sharing any datasets. The FL training problem can be formulated as [2], [20], [27]:

\min_w F(w) ≜ \sum_{k=1}^K \frac{D_k}{D} F_k(w) = \frac{1}{D} \sum_{k=1}^K \sum_{l=1}^{D_k} f(w, x_{kl}, y_{kl}),   (9)

where D = \sum_{k=1}^K D_k is the total number of data samples of all users. To solve problem (9), we adopt the distributed approximate Newton (DANE) algorithm in [27], which is summarized in Algorithm 1. Note that gradient descent (GD) is used at each user in Algorithm 1. One can use stochastic gradient descent (SGD) to decrease the computation complexity for cases in which a relatively low accuracy can be tolerated; however, if high accuracy is needed, GD is preferred [27]. Moreover, the number of global iterations is higher in SGD than in GD. Due to limited wireless resources, SGD may not be efficient since it requires more iterations, and hence more wireless transmissions, compared to GD. The DANE algorithm is designed to solve a general local optimization problem before averaging the solutions of all users. It relies on the similarity of the Hessians of the local objectives, representing its iterations as an average of inexact Newton steps. In Algorithm 1, we can see that, at every FL iteration, each user downloads the global FL model from the BS for local computation, while the BS periodically gathers the local FL models from all users and sends the updated global FL model back to all users.
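The training loop of Algorithm 1 can be sketched compactly in Python. The sketch below uses the linear regression loss f(w, x, y) = (1/2)(x^T w − y)^2 from Section II-B on synthetic data; the data, the fixed iteration counts, and the values of δ and ξ are illustrative assumptions, and the fixed inner loop stands in for the accuracy-η stopping rule of Algorithm 1.

    import numpy as np

    K, d, D_k = 10, 5, 50
    rng = np.random.default_rng(0)
    X = [rng.standard_normal((D_k, d)) for _ in range(K)]  # local inputs x_kl
    Y = [x @ rng.standard_normal(d) for x in X]            # local outputs y_kl

    def grad_Fk(k, w):
        """Gradient of the local loss F_k(w) in (8) for linear regression."""
        return X[k].T @ (X[k] @ w - Y[k]) / D_k

    xi, delta = 0.1, 0.1
    w = np.zeros(d)
    for n in range(100):                                   # global iterations
        grad_F = sum(grad_Fk(k, w) for k in range(K)) / K  # BS aggregation (step 4)
        h = []
        for k in range(K):                 # performed in parallel at the users
            h_k = np.zeros(d)
            g_k0 = grad_Fk(k, w)
            for i in range(30):            # gradient steps (12) on G_k in (11)
                h_k -= delta * (grad_Fk(k, w + h_k) - g_k0 + xi * grad_F)
            h.append(h_k)
        w = w + sum(h) / K                 # global model update (10)

With equal local dataset sizes D_k, the uniform 1/K aggregation above coincides with the D_k/D weighting in (9).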

According to Algorithm 1, since the local optimization problem is solved with several local iterations at each global iteration, the number of full gradient computations will be smaller than the number of local updates.

We define w^{(n)} as the global FL model at a given iteration n. In practice, each user solves the local optimization problem:

\min_{h_k ∈ R^d} G_k(w^{(n)}, h_k) ≜ F_k(w^{(n)} + h_k) − (∇F_k(w^{(n)}) − ξ∇F(w^{(n)}))^T h_k.   (11)

In problem (11), ξ is a constant value. Any solution h_k of problem (11) represents the difference between the global FL model and the local FL model for user k, i.e., w^{(n)} + h_k is the local FL model of user k at iteration n. The local optimization problem in (11) depends only on the local data and the gradient of the global loss function. The local optimization problem is then solved, and the updates from the individual users are averaged to form a new iteration. This approach allows any algorithm to be used to solve the local optimization problem. Note that the number of iterations needed to upload the local FL model (i.e., the number of global iterations) in Algorithm 1 is always smaller than in the conventional FL algorithms that do not consider a local optimization problem (11) [27]. To solve the local optimization problem in (11), we use the gradient method:

h_k^{(n),(i+1)} = h_k^{(n),(i)} − δ∇G_k(w^{(n)}, h_k^{(n),(i)}),   (12)

where δ is the step size, h_k^{(n),(i)} is the value of h_k at the i-th local iteration with given vector w^{(n)}, and ∇G_k(w^{(n)}, h_k^{(n),(i)}) is the gradient of the function G_k(w^{(n)}, h_k) at the point h_k = h_k^{(n),(i)}. For a small step size δ, based on equation (12), we can obtain a set of solutions h_k^{(n),(0)}, h_k^{(n),(1)}, h_k^{(n),(2)}, · · · , h_k^{(n),(i)}, which satisfy:

G_k(w^{(n)}, h_k^{(n),(0)}) ≥ · · · ≥ G_k(w^{(n)}, h_k^{(n),(i)}).   (13)

To provide the convergence condition for the gradient method, we introduce the definition of local accuracy, i.e., the solution h_k^{(n),(i)} of problem (11) with accuracy η means that:

G_k(w^{(n)}, h_k^{(n),(i)}) − G_k(w^{(n)}, h_k^{(n)*}) ≤ η (G_k(w^{(n)}, h_k^{(n),(0)}) − G_k(w^{(n)}, h_k^{(n)*})),   (14)

where h_k^{(n)*} is the optimal solution of problem (11). Each user is assumed to solve the local optimization problem (11) with a target accuracy η. Next, in Lemma 1, we derive a lower bound on the number of local iterations needed to achieve a local accuracy η in (14).

Lemma 1: Let v = \frac{2}{(2−Lδ)δγ}. If we set the step size δ < \frac{2}{L} and run the gradient method for

i ≥ v \log_2(1/η)   (15)

iterations at each user, we can solve the local optimization problem (11) with an accuracy η.

Proof: See Appendix A.

The lower bound derived in (15) reflects the growing trend of the number of local iterations with respect to the accuracy η. In the following, we use this lower bound to approximate the number of iterations I_k needed for local computations at each user.

In Algorithm 1, the iterative method involves a number of global iterations (i.e., the value of n in Algorithm 1) to achieve a global accuracy ǫ_0 for the global FL model. In other words, the solution w^{(n)} of problem (9) with accuracy ǫ_0 is a point such that

F(w^{(n)}) − F(w^*) ≤ ǫ_0 (F(w^{(0)}) − F(w^*)),   (16)

where w^* is the actual optimal solution of problem (9).

To analyze the convergence rate of Algorithm 1, we make the following two assumptions on the loss function.
- A1: Function F_k(w) is L-Lipschitz, i.e., ∇^2 F_k(w) ⪯ L I.
- A2: Function F_k(w) is γ-strongly convex, i.e., ∇^2 F_k(w) ⪰ γ I.

The values of γ and L are determined by the loss function. These assumptions can be easily satisfied by widely used FL loss functions such as the linear or logistic loss functions [19]. Under assumptions A1 and A2, we provide the following theorem about the convergence rate of Algorithm 1 (i.e., the number of iterations needed for Algorithm 1 to converge), where each user solves its local optimization problem with a given accuracy.

Theorem 1: If we run Algorithm 1 with 0 < ξ ≤ \frac{γ}{L} for

n ≥ \frac{a}{1−η}   (17)

iterations with a = \frac{2L^2}{γ^2 ξ} \ln\frac{1}{ǫ_0}, we have F(w^{(n)}) − F(w^*) ≤ ǫ_0 (F(w^{(0)}) − F(w^*)).

Proof: See Appendix B.

From Theorem 1, we observe that the number of global iterations n increases with the local accuracy η at the rate of 1/(1−η). From Theorem 1, we can also see that the FL performance depends on the parameters L, γ, ξ, ǫ_0 and η. Note that the prior work in [37, Eq. (9)] only studied the number of iterations needed for FL convergence in the special case in which η = 0. Theorem 1 provides a general convergence rate for FL with an arbitrary η. Since the FL algorithm involves both the accuracy of the local computation and the result aggregation, it is hard to calculate the exact number of iterations needed for convergence. In the following, we use \frac{a}{1−η} to approximate the number I_0 of global iterations.

For the parameter setup, one way is to choose a very small value of ξ, i.e., ξ ≈ 0 satisfying 0 < ξ ≤ \frac{γ}{L}, for an arbitrary learning task and loss function. However, based on Theorem 1, the number of iterations becomes very large for a very small value of ξ. As a result, we first choose a large value of ξ and decrease the value of ξ to ξ/2 whenever the loss does not decrease over time (i.e., the value of ξ is too large). Note that in the simulations, the value of ξ always changes at most three times. Theorem 1 is suitable for the situation in which the value of ξ is fixed, i.e., after at most three iterations of Algorithm 1.
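The bounds in Lemma 1 and Theorem 1 make the role of η explicit: a small η means accurate local solutions (many local steps, few global rounds), while a large η means the opposite. The snippet below evaluates (15) and (17) under assumed values of L, γ, δ, ξ, and ǫ_0 that satisfy δ < 2/L and 0 < ξ ≤ γ/L.

    import math

    # Illustrative assumptions, not values used in the paper.
    L, gamma, delta, xi, eps0 = 4.0, 1.0, 0.2, 0.2, 1e-3
    v = 2.0 / ((2.0 - L * delta) * delta * gamma)                # Lemma 1
    a = 2.0 * L ** 2 / (gamma ** 2 * xi) * math.log(1.0 / eps0)  # Theorem 1

    for eta in (0.05, 0.3, 0.6, 0.9):
        I_k = math.ceil(v * math.log2(1.0 / eta))  # local iterations, bound (15)
        I_0 = math.ceil(a / (1.0 - eta))           # global iterations, bound (17)
        print(f"eta={eta:.2f}  I_k >= {I_k}  I_0 >= {I_0}")

This computation/communication tradeoff in η is exactly what the optimization problem of Section III exploits.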

C. Extension to Nonconvex Loss Functions

We replace the convexity assumption A2 with the following condition:
- B2: Function F_k(w) is of γ-bounded nonconvexity (or γ-nonconvex), i.e., all the eigenvalues of ∇^2 F_k(w) lie in [−γ, L], for some γ ∈ (0, L].

Due to the nonconvexity of the function F_k(w), we respectively replace F_k(w) and F(w) with their regularized versions [38]:

\tilde F_k^{(n)}(w) = F_k(w) + γ\|w − w^{(n)}\|^2, \quad \tilde F^{(n)}(w) = \sum_{k=1}^K \frac{D_k}{D} \tilde F_k^{(n)}(w).   (18)

Based on assumption B2 and (18), both functions \tilde F_k^{(n)}(w) and \tilde F^{(n)}(w) are γ-strongly convex, i.e., ∇^2 \tilde F_k^{(n)}(w) ⪰ γ I and ∇^2 \tilde F^{(n)}(w) ⪰ γ I. Moreover, it can be proved that both functions \tilde F_k^{(n)}(w) and \tilde F^{(n)}(w) are (L + 2γ)-Lipschitz. The convergence analysis in Section II-B can then be directly applied.

III. RESOURCE ALLOCATION FOR ENERGY MINIMIZATION

In this section, we formulate the energy minimization problem for FL. Since it is challenging to obtain the globally optimal solution due to nonconvexity, an iterative algorithm with low complexity is proposed to solve the energy minimization problem.

A. Problem Formulation

Our goal is to minimize the total energy consumption of all users under a latency constraint. This energy efficient optimization problem can be posed as follows:

\min_{t, b, f, p, η} E,   (19)
s.t. \frac{a}{1−η}\left(\frac{A_k \log_2(1/η)}{f_k} + t_k\right) ≤ T, \quad ∀k ∈ K,   (19a)
t_k b_k \log_2\left(1 + \frac{g_k p_k}{N_0 b_k}\right) ≥ s, \quad ∀k ∈ K,   (19b)
\sum_{k=1}^K b_k ≤ B,   (19c)
0 ≤ f_k ≤ f_k^{max}, \quad ∀k ∈ K,   (19d)
0 ≤ p_k ≤ p_k^{max}, \quad ∀k ∈ K,   (19e)
0 ≤ η ≤ 1,   (19f)
t_k ≥ 0, \; b_k ≥ 0, \quad ∀k ∈ K,   (19g)

where t = [t_1, · · · , t_K]^T, b = [b_1, · · · , b_K]^T, f = [f_1, · · · , f_K]^T, p = [p_1, · · · , p_K]^T, A_k = v C_k D_k is a constant, and f_k^{max} and p_k^{max} are respectively the maximum local computation capacity and the maximum value of the average transmit power of user k. Constraint (19a) is obtained by substituting I_k = v \log_2(1/η) from (15) and I_0 = \frac{a}{1−η} from (17) into (6). Constraint (19b) is derived according to (4) and t_k r_k ≥ s. Constraint (19a) indicates that the execution time of the local tasks and the transmission time for all users should not exceed the maximum completion time of the whole FL algorithm. Since the total number of iterations is the same for each user, the constraint in (19a) also captures a maximum time constraint for all users at each iteration if we divide both sides of (19a) by the total number of iterations. The data transmission constraint is given by (19b), while the bandwidth constraint is given by (19c). Constraints (19d) and (19e) respectively represent the maximum local computation capacity and average transmit power limits of all users. The local accuracy constraint is given by (19f).

B. Iterative Algorithm

The proposed iterative algorithm mainly contains two steps at each iteration. To optimize (t, b, f, p, η) in problem (19), we first optimize (t, η) with fixed (b, f, p); then, (b, f, p) is updated based on the (t, η) obtained in the previous step. The advantage of this iterative algorithm lies in that we can obtain the optimal solution of (t, η) or (b, f, p) in each step.

In the first step, given (b, f, p), problem (19) becomes:

\min_{t, η} \frac{a}{1−η} \sum_{k=1}^K \left(κ A_k \log_2(1/η) f_k^2 + t_k p_k\right),   (20)
s.t. \frac{a}{1−η}\left(\frac{A_k \log_2(1/η)}{f_k} + t_k\right) ≤ T, \quad ∀k ∈ K,   (20a)
t_k ≥ t_k^{min}, \quad ∀k ∈ K,   (20b)
0 ≤ η ≤ 1,   (20c)

where

t_k^{min} = \frac{s}{b_k \log_2\left(1 + \frac{g_k p_k}{N_0 b_k}\right)}, \quad ∀k ∈ K.   (21)

The optimal solution of (20) can be derived using the following theorem.

Theorem 2: The optimal solution (t^*, η^*) of problem (20) satisfies:

t_k^* = t_k^{min}, \quad ∀k ∈ K,   (22)

and η^* is the optimal solution to:

\min_η \frac{α_1 \log_2(1/η) + α_2}{1−η}   (23)
s.t. η^{min} ≤ η ≤ η^{max},   (23a)

where

η^{min} = \max_{k ∈ K} η_k^{min}, \quad η^{max} = \min_{k ∈ K} η_k^{max},   (24)

β_k(η_k^{min}) = β_k(η_k^{max}) = t_k^{min} with η_k^{min} ≤ η_k^{max}, and α_1, α_2 and β_k(η) are defined in (C.2).

Proof: See Appendix C.

Theorem 2 shows that it is optimal to transmit with the minimum time for each user. Based on this finding, problem (20) is equivalent to problem (23) with only one variable. Obviously, the objective function in (23) has a fractional form, which is generally hard to solve directly. By using the parametric approach in [39], [40], we consider the following problem:

H(ζ) = \min_{η^{min} ≤ η ≤ η^{max}} α_1 \log_2(1/η) + α_2 − ζ(1 − η).   (25)

It has been proved in [39] that solving (23) is equivalent to finding the root of the nonlinear function H(ζ). Since (25) with fixed ζ is convex, the optimal solution η^* can be obtained by setting the first-order derivative to zero, yielding η^* = \frac{α_1}{(\ln 2)ζ}. Thus, problem (23) can be solved by using the Dinkelbach method in [39] (shown as Algorithm 2).
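A minimal Python sketch of this Dinkelbach iteration (formalized as Algorithm 2 below) is the following; α_1, α_2 and the interval [η^min, η^max] from Theorem 2 are taken as given inputs, and the simple threshold on |H(ζ)| is an assumed simplification of the ratio test used in Algorithm 2.

    import math

    def dinkelbach_eta(alpha1, alpha2, eta_min, eta_max, tol=1e-9):
        """Solve (23): min (alpha1*log2(1/eta) + alpha2) / (1 - eta) via (25)."""
        zeta = 1.0  # any zeta^(0) > 0
        while True:
            # Minimizer of (25) for fixed zeta: eta* = alpha1 / ((ln 2)*zeta),
            # projected onto [eta_min, eta_max] since (25) is convex in eta.
            eta = min(max(alpha1 / (math.log(2.0) * zeta), eta_min), eta_max)
            H = alpha1 * math.log2(1.0 / eta) + alpha2 - zeta * (1.0 - eta)
            if abs(H) < tol:  # at the root of H, zeta equals the optimum of (23)
                return eta
            zeta = (alpha1 * math.log2(1.0 / eta) + alpha2) / (1.0 - eta)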

Algorithm 2 The Dinkelbach Method
1: Initialize ζ^{(0)} > 0, iteration number n = 0, and set the accuracy ǫ_3.
2: repeat
3:    Calculate the optimal η^* = \frac{α_1}{(\ln 2)ζ^{(n)}} of problem (25).
4:    Update ζ^{(n+1)} = \frac{α_1 \log_2(1/η^*) + α_2}{1 − η^*}.
5:    Set n = n + 1.
6: until |H(ζ^{(n+1)})| / |H(ζ^{(n)})| < ǫ_3.

Algorithm 3 Iterative Algorithm
1: Initialize a feasible solution (t^{(0)}, b^{(0)}, f^{(0)}, p^{(0)}, η^{(0)}) of problem (19) and set l = 0.
2: repeat
3:    With given (b^{(l)}, f^{(l)}, p^{(l)}), obtain the optimal (t^{(l+1)}, η^{(l+1)}) of problem (20).
4:    With given (t^{(l+1)}, η^{(l+1)}), obtain the optimal (b^{(l+1)}, f^{(l+1)}, p^{(l+1)}) of problem (26).
5:    Set l = l + 1.
6: until the objective value of problem (19) converges.

In the second step, given the (t, η) calculated in the first step, problem (19) can be simplified as:

\min_{b, f, p} \frac{a}{1−η} \sum_{k=1}^K \left(κ A_k \log_2(1/η) f_k^2 + t_k p_k\right),   (26)
s.t. \frac{a}{1−η}\left(\frac{A_k \log_2(1/η)}{f_k} + t_k\right) ≤ T, \quad ∀k ∈ K,   (26a)
t_k b_k \log_2\left(1 + \frac{g_k p_k}{N_0 b_k}\right) ≥ s, \quad ∀k ∈ K,   (26b)
\sum_{k=1}^K b_k ≤ B,   (26c)
0 ≤ p_k ≤ p_k^{max}, \quad ∀k ∈ K,   (26d)
0 ≤ f_k ≤ f_k^{max}, \quad ∀k ∈ K.   (26e)

Since both the objective function and the constraints can be decoupled, problem (26) can be decoupled into two subproblems:

\min_f \frac{aκ \log_2(1/η)}{1−η} \sum_{k=1}^K A_k f_k^2,   (27)
s.t. \frac{a}{1−η}\left(\frac{A_k \log_2(1/η)}{f_k} + t_k\right) ≤ T, \quad ∀k ∈ K,   (27a)
0 ≤ f_k ≤ f_k^{max}, \quad ∀k ∈ K,   (27b)

and

\min_{b, p} \frac{a}{1−η} \sum_{k=1}^K t_k p_k,   (28)
s.t. t_k b_k \log_2\left(1 + \frac{g_k p_k}{N_0 b_k}\right) ≥ s, \quad ∀k ∈ K,   (28a)
\sum_{k=1}^K b_k ≤ B,   (28b)
0 ≤ p_k ≤ p_k^{max}, \quad ∀k ∈ K.   (28c)

According to (27), it is always efficient to utilize the minimum computation capacity f_k. To minimize (27), the optimal f_k^* can be obtained from (27a) held with equality, which gives:

f_k^* = \frac{a A_k \log_2(1/η)}{T(1−η) − a t_k}, \quad ∀k ∈ K.   (29)

We solve problem (28) using the following theorem.

Theorem 3: The optimal solution (b^*, p^*) of problem (28) satisfies:

b_k^* = \max\{b_k(µ), b_k^{min}\},   (30)

and

p_k^* = \frac{N_0 b_k^*}{g_k}\left(2^{\frac{s}{t_k b_k^*}} − 1\right),   (31)

where

b_k^{min} = \frac{−(\ln 2)s}{t_k\left[W\left(−\frac{(\ln 2)N_0 s}{g_k p_k^{max} t_k} e^{−\frac{(\ln 2)N_0 s}{g_k p_k^{max} t_k}}\right) + \frac{(\ln 2)N_0 s}{g_k p_k^{max} t_k}\right]},   (32)

W(·) denotes the Lambert-W function, b_k(µ) is the solution to

\frac{N_0}{g_k}\left(e^{\frac{(\ln 2)s}{t_k b_k(µ)}} − 1 − \frac{(\ln 2)s}{t_k b_k(µ)} e^{\frac{(\ln 2)s}{t_k b_k(µ)}}\right) + µ = 0,   (33)

and µ satisfies

\sum_{k=1}^K \max\{b_k(µ), b_k^{min}\} = B.   (34)

Proof: See Appendix D.

By iteratively solving problems (20) and (26), the overall procedure that solves problem (19) is given as Algorithm 3. Since the optimal solution of problem (20) or (26) is obtained in each step, the objective value of problem (19) is nonincreasing over the iterations. Moreover, the objective value of problem (19) is lower bounded by zero. Thus, Algorithm 3 always converges to a locally optimal solution.

C. Complexity Analysis

To solve the energy minimization problem (19) by using Algorithm 3, the major complexity in each step lies in solving problems (20) and (26). To solve problem (20), the major complexity lies in obtaining the optimal η^* according to Theorem 2, which involves a complexity of O(K \log_2(1/ǫ_1)) with accuracy ǫ_1 by using the Dinkelbach method. To solve problem (26), the two subproblems (27) and (28) need to be optimized. For subproblem (27), the complexity is O(K) according to (29). For subproblem (28), the complexity is O(K \log_2(1/ǫ_2) \log_2(1/ǫ_3)), where ǫ_2 and ǫ_3 are respectively the accuracies of solving (32) and (33) [41]. As a result, the total complexity of the proposed Algorithm 3 is O(L_{it} S K), where L_{it} is the number of iterations for iteratively optimizing (t, η) and (b, f, p), and S = \log_2(1/ǫ_1) + \log_2(1/ǫ_2) \log_2(1/ǫ_3).

The conventional successive convex approximation (SCA) method can also be used to solve problem (19). The complexity of the SCA method is O(L_{sca} K^3), where L_{sca} is the total number of iterations of the SCA method. Compared to SCA, the complexity of the proposed Algorithm 3 grows only linearly with the number of users K.

It should be noted that Algorithm 3 is run at the BS side before executing the FL scheme in Algorithm 1. To implement Algorithm 3, the BS needs to gather the information g_k, p_k^{max}, f_k^{max}, C_k, D_k, and s, which can be uploaded by all users before the FL process. Due to the small data size, the transmission delay of this information can be neglected. The BS then broadcasts the obtained solution to all users. Since the BS has high computation capacity, the latency of implementing Algorithm 3 at the BS will not affect the latency of the FL process.
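The second step of Algorithm 3 also admits a direct numerical sketch: (29) gives the frequencies in closed form, and the bandwidth allocation of Theorem 3 reduces to nested bisections, one for b_k(µ) in (33) and an outer one for µ in (34). In the illustrative Python sketch below, b_k^min is obtained by bisecting p_k^* = p_k^max in (31) instead of evaluating the Lambert-W expression (32); all brackets, overflow guards, and helper names are assumptions.

    import math

    N0 = 10 ** (-174 / 10) * 1e-3  # noise PSD in W/Hz

    def f_star(a, A_k, eta, T, t_k):
        """Eq. (29): smallest CPU frequency meeting the latency constraint."""
        return a * A_k * math.log2(1.0 / eta) / (T * (1.0 - eta) - a * t_k)

    def p_star(b, t, s, g):
        """Eq. (31): minimum power to deliver s bits in time t over bandwidth b."""
        u = s / (t * b)
        return float("inf") if u > 500 else N0 * b / g * (2.0 ** u - 1.0)

    def bisect(f, lo, hi, iters=200):
        """Root of an increasing function f on [lo, hi]."""
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
        return 0.5 * (lo + hi)

    def dL_db(b, t, s, g, mu):
        """Eq. (33): derivative (D.7) of the Lagrangian, increasing in b."""
        u = math.log(2.0) * s / (t * b)
        if u > 500:
            return -1e300  # tends to -infinity as b -> 0
        return N0 / g * (math.exp(u) - 1.0 - u * math.exp(u)) + mu

    def bandwidth_allocation(users, B):
        """Theorem 3: pick mu so that sum_k max{b_k(mu), b_k^min} = B, eq. (34)."""
        def b_all(mu):
            out = []
            for (t, s, g, p_max) in users:  # users: list of (t_k, s, g_k, p_k^max)
                b_mu = bisect(lambda b: dL_db(b, t, s, g, mu), 1e-2, 1e10)
                b_min = bisect(lambda b: p_max - p_star(b, t, s, g), 1e-2, 1e10)
                out.append(max(b_mu, b_min))  # eq. (30)
            return out
        # sum_k b_k(mu) decreases in mu, so B - sum(...) is increasing in mu.
        mu = bisect(lambda m: B - sum(b_all(m)), 1e-15, 1e6)
        return b_all(mu)

The optimal powers then follow from (31) by evaluating p_star at the returned bandwidths.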

IV. ALGORITHM TO FIND A FEASIBLE SOLUTION OF PROBLEM (19)

Since the iterative algorithm that solves problem (19) requires an initial feasible solution, we provide an efficient algorithm to find a feasible solution of problem (19) in this section. To obtain a feasible solution of (19), we construct the completion time minimization problem:

\min_{T, t, b, f, p, η} T,   (35)
s.t. (19a) − (19g).   (35a)

We define (T^*, t^*, b^*, f^*, p^*, η^*) as the optimal solution of problem (35). Then, (t^*, b^*, f^*, p^*, η^*) is a feasible solution of problem (19) if T^* ≤ T, where T is the maximum delay in constraint (19a). Otherwise, problem (19) is infeasible. Although the completion time minimization problem (35) is still nonconvex due to constraints (19a)-(19b), we show that the globally optimal solution can be obtained by using the bisection method.

A. Optimal Resource Allocation

Lemma 2: Problem (35) with T < T^* does not have a feasible solution (i.e., it is infeasible), while problem (35) with T > T^* always has a feasible solution (i.e., it is feasible).

Proof: See Appendix E.

According to Lemma 2, we can use the bisection method to obtain the optimal solution of problem (35).

With a fixed T, we still need to check whether there exists a feasible solution satisfying constraints (19a)-(19g). From constraints (19a) and (19d), we can see that it is always efficient to utilize the maximum computation capacity, i.e., f_k^* = f_k^{max}, ∀k ∈ K. From (19b) and (19e), we can see that minimizing the completion time occurs when p_k^* = p_k^{max}, ∀k ∈ K. Substituting the maximum computation capacity and the maximum transmit power into (35), the completion time minimization problem becomes:

\min_{T, t, b, η} T   (36)
s.t. t_k ≤ \frac{(1−η)T}{a} + \frac{A_k \log_2 η}{f_k^{max}}, \quad ∀k ∈ K,   (36a)
\frac{s}{t_k} ≤ b_k \log_2\left(1 + \frac{g_k p_k^{max}}{N_0 b_k}\right), \quad ∀k ∈ K,   (36b)
\sum_{k=1}^K b_k ≤ B,   (36c)
0 ≤ η ≤ 1,   (36d)
t_k ≥ 0, \; b_k ≥ 0, \quad ∀k ∈ K.   (36e)

Next, we provide the sufficient and necessary condition for the feasibility of the set (36a)-(36e).

Lemma 3: With a fixed T, the set (36a)-(36e) is nonempty if and only if

B ≥ \min_{0 ≤ η ≤ 1} \sum_{k=1}^K u_k(v_k(η)),   (37)

where

u_k(η) = \frac{−(\ln 2)η}{W\left(−\frac{(\ln 2)N_0 η}{g_k p_k^{max}} e^{−\frac{(\ln 2)N_0 η}{g_k p_k^{max}}}\right) + \frac{(\ln 2)N_0 η}{g_k p_k^{max}}},   (38)

and

v_k(η) = \frac{s}{\frac{(1−η)T}{a} + \frac{A_k \log_2 η}{f_k^{max}}}.   (39)

Proof: See Appendix F.

To effectively solve (37) in Lemma 3, we provide the following lemma.

Lemma 4: In (38), u_k(v_k(η)) is a convex function.

Proof: See Appendix G.

Lemma 4 implies that the optimization problem in (37) is a convex problem, which can be effectively solved. By finding the optimal solution of (37), the sufficient and necessary condition for the feasibility of the set (36a)-(36e) can be simplified using the following theorem.

Theorem 4: The set (36a)-(36e) is nonempty if and only if

B ≥ \sum_{k=1}^K u_k(v_k(η^*)),   (40)

where η^* is the unique solution to \sum_{k=1}^K u_k'(v_k(η^*)) v_k'(η^*) = 0.

Theorem 4 directly follows from Lemmas 3 and 4. Due to the convexity of the function u_k(v_k(η)), \sum_{k=1}^K u_k'(v_k(η)) v_k'(η) is an increasing function of η. As a result, the unique solution η^* to \sum_{k=1}^K u_k'(v_k(η^*)) v_k'(η^*) = 0 can be effectively obtained via the bisection method. Based on Theorem 4, the algorithm for obtaining the minimum completion time is summarized in Algorithm 4. Theorem 4 shows that the optimal FL accuracy level η^* meets the first-order condition \sum_{k=1}^K u_k'(v_k(η^*)) v_k'(η^*) = 0, i.e., the optimal η should be neither too small nor too large for FL. This is because, for small η, the local computation time (number of local iterations) becomes high, as shown in Lemma 1, while for large η, the transmission time is long due to the large number of global iterations required, as shown in Theorem 1.

Algorithm 4 Completion Time Minimization
1: Initialize T_{min}, T_{max}, and the tolerance ǫ_5.
2: repeat
3:    Set T = (T_{min} + T_{max})/2.
4:    Check the feasibility condition (40).
5:    If the set (36a)-(36e) has a feasible solution, set T_{max} = T. Otherwise, set T_{min} = T.
6: until (T_{max} − T_{min})/T_{max} ≤ ǫ_5.

B. Complexity Analysis

The major complexity of Algorithm 4 at each iteration lies in checking the feasibility condition (40). To check the inequality in (40), the optimal η^* needs to be obtained by using the bisection method, which involves a complexity of O(K \log_2(1/ǫ_4)) with accuracy ǫ_4. As a result, the total complexity of Algorithm 4 is O(K \log_2(1/ǫ_4) \log_2(1/ǫ_5)), where ǫ_5 is the accuracy of the bisection method used in the outer layer. The complexity of Algorithm 4 is low since O(K \log_2(1/ǫ_4) \log_2(1/ǫ_5)) grows linearly with the total number of users.
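Algorithm 4 itself is a one-dimensional bisection whose inner feasibility test is Theorem 4. In the illustrative Python sketch below, u_k is computed by numerically inverting z_k (the right hand side of (36b)) rather than through the Lambert-W form (38), and the convex sum over η from Lemma 4 is minimized by a simple grid search; all brackets and parameter names are assumptions.

    import math

    N0 = 10 ** (-174 / 10) * 1e-3

    def u_k(c, g, p_max, hi=1e10):
        """Bandwidth needed to support rate c with power p_max, i.e., the
        inverse of z(b) = b*log2(1 + g*p_max/(N0*b)), which increases in b."""
        lo = 0.0
        for _ in range(200):
            b = 0.5 * (lo + hi)
            lo, hi = (b, hi) if b * math.log2(1 + g * p_max / (N0 * b)) < c else (lo, b)
        return 0.5 * (lo + hi)

    def v_k(eta, T, a, A_k, f_max, s):
        """Eq. (39): uplink rate required under accuracy eta and deadline T."""
        denom = (1 - eta) * T / a + A_k * math.log2(eta) / f_max
        return float("inf") if denom <= 0 else s / denom

    def feasible(T, users, a, s, B):
        """Theorem 4: T is feasible iff min over eta of the total bandwidth
        sum_k u_k(v_k(eta)) does not exceed B (grid search; the sum is convex)."""
        def total(eta):
            return sum(u_k(v_k(eta, T, a, A, fm, s), g, pm)
                       for (A, fm, g, pm) in users)
        return min(total(i / 200.0) for i in range(1, 200)) <= B

    def min_completion_time(users, a, s, B, T_lo=1e-2, T_hi=1e5, tol=1e-3):
        """Algorithm 4: bisection on T with the feasibility check (40)."""
        while (T_hi - T_lo) / T_hi > tol:
            T = 0.5 * (T_lo + T_hi)
            T_lo, T_hi = (T_lo, T) if feasible(T, users, a, s, B) else (T, T_hi)
        return T_hi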

Fig. 4. Value of the loss function as the number of total computations varies for FL algorithms with IID and non-IID data.

Fig. 5. Comparison of the exact number of required iterations with the upper bound derived in Theorem 1.

Similar to Algorithm 3 in Section III, Algorithm 4 is run at the BS side before executing the FL scheme in Algorithm 1, and therefore does not affect the latency of the FL process.

V. NUMERICAL RESULTS

For our simulations, we deploy K = 50 users uniformly in a square area of size 500 m × 500 m with the BS located at its center. The path loss model is 128.1 + 37.6 log10 d (d is in km) and the standard deviation of shadow fading is 8 dB. In addition, the noise power spectral density is N_0 = −174 dBm/Hz. We use the real open blog feedback dataset in [42]. This dataset, with a total number of 60,000 data samples, originates from blog posts, and the dimension of each data sample is 281. The prediction task associated with the data is the prediction of the number of comments in the upcoming 24 hours. Parameter C_k is uniformly distributed in [1, 3] × 10^4 cycles/sample. The effective switched capacitance in local computation is κ = 10^{−28} [36]. In Algorithm 1, we set ξ = 1/10, δ = 1/10, and ǫ_0 = 10^{−3}. Unless specified otherwise, we choose an equal maximum average transmit power p_1^{max} = · · · = p_K^{max} = p^{max} = 10 dB, an equal maximum computation capacity f_1^{max} = · · · = f_K^{max} = f^{max} = 2 GHz, a transmit data size s = 28.1 kbits, and a bandwidth B = 20 MHz. Each user has D_k = 500 data samples, which are randomly selected from the dataset with equal probability. All statistical results are averaged over 1000 independent runs.

A. Convergence Behavior

Fig. 4 shows the value of the loss function as the number of total computations varies with IID and non-IID data. For the IID case, all data samples are first shuffled and then partitioned into 60,000/500 = 120 equal parts, and each device is assigned one particular part. For the non-IID case [30], all data samples are first sorted by label and then divided into 240 shards of size 250, and each device is assigned two shards. Note that the latter is a pathological non-IID partition since most devices only obtain samples from two labels. From this figure, it is observed that the FL algorithm with non-IID data shows similar convergence behavior to the FL algorithm with IID data. Besides, we find that the FL algorithm with multiple local updates requires more total computations than the FL algorithm with only one local update.

In Fig. 5, we compare the exact number of required iterations with the upper bound derived in Theorem 1. According to this figure, the gap between the exact number of required iterations and the upper bound decreases as the accuracy ǫ_0 decreases, which indicates that the upper bound is tight for small values of ǫ_0.

Fig. 6. Completion time versus maximum average transmit power of each user.

B. Completion Time Minimization

We compare the proposed FL scheme with the FL FDMA scheme with equal bandwidth b_1 = · · · = b_K (labelled as 'EB-FDMA'), the FL FDMA scheme with fixed local accuracy η = 1/2 (labelled as 'FE-FDMA'), and the FL TDMA scheme in [35] (labelled as 'TDMA'). Fig. 6 shows how the completion time changes as the maximum average transmit power of each user varies. We can see that the completion time of all schemes decreases with the maximum average transmit power of each user. This is because a large maximum average transmit power can decrease the transmission time between the users and the BS. We can clearly see that the proposed FL scheme achieves the best performance among all schemes. This is because the proposed approach jointly optimizes the bandwidth and the local accuracy η, while the bandwidth is fixed in EB-FDMA and η is not optimized in FE-FDMA. Compared to TDMA, the proposed approach can reduce the completion time by up to 27.3% due to the following two reasons. First, each user can directly transmit result data to the BS after local computation in FDMA, while in TDMA the wireless transmission must wait until after the local computation for all users, which needs a longer time compared to FDMA.

Second, the noise power for users in FDMA is lower than in TDMA, since each user is allocated part of the bandwidth in FDMA while each user occupies the whole bandwidth in TDMA, which indicates that the transmission time in TDMA is longer than in FDMA.

Fig. 7. Comparison of completion time with different batch sizes.

Fig. 7 shows the completion time versus the maximum average transmit power with different batch sizes [43], [44]. The number of local iterations is kept fixed for the three schemes in Fig. 7. From this figure, it is found that the SGD method with a smaller batch size (125 or 250) always outperforms the GD scheme, which indicates that it is efficient to use a small batch size with the SGD scheme.

Fig. 8. Communication and computation time versus maximum average transmit power of each user.

Fig. 8 shows how the communication and computation time change as the maximum average transmit power of each user varies. It can be seen from this figure that both the communication and computation time decrease with the maximum average transmit power of each user. We can also see that the computation time is always larger than the communication time, and that the communication time decreases faster than the computation time.

C. Total Energy Minimization

Fig. 9. Total energy versus maximum average transmit power of each user with T = 100 s.

Fig. 9 shows the total energy as a function of the maximum average transmit power of each user. In this figure, the EXH-FDMA scheme is an exhaustive search method that can find a near optimal solution of problem (19); it refers to the proposed iterative algorithm run from 1000 initial starting points. There are 1000 solutions obtained by using EXH-FDMA, and the solution with the best objective value is treated as the near optimal solution. From this figure, we can observe that the total energy decreases with the maximum average transmit power of each user. Fig. 9 also shows that the proposed FL scheme outperforms the EB-FDMA, FE-FDMA, and TDMA schemes. Moreover, the EXH-FDMA scheme achieves almost the same performance as the proposed FL scheme, which indicates that the proposed approach attains a near optimal solution.

Fig. 10. Communication and computation energy versus maximum average transmit power of each user.

Fig. 10 shows the communication and computation energy as functions of the maximum average transmit power of each user. In this figure, it is found that the communication energy increases with the maximum average transmit power of each user, while the computation energy decreases with it. For a low maximum average transmit power (less than 15 dB), FDMA outperforms TDMA in terms of both computation and communication energy consumption. For a high maximum average transmit power (higher than 17 dB), FDMA outperforms TDMA in terms of communication energy consumption, while TDMA outperforms FDMA in terms of computation energy consumption.

Fig. 11. Total energy versus completion time.

Fig. 11 shows the tradeoff between total energy consumption and completion time. In this figure, we compare the proposed scheme with the random selection (RS) scheme, where 25 users are randomly selected out of the K = 50 users at each iteration. We can see that FDMA outperforms RS in terms of total energy consumption, especially for low completion times.

This is because, in FDMA, all users can transmit data to the BS in each global iteration, while in RS only half of the users can transmit data to the BS in each global iteration. As a result, the average transmission time for each user in FDMA is larger than that in RS, which can lead to lower transmission energy for the proposed algorithm. In particular, for the same completion time, the proposed FL scheme can reduce energy by up to 12.8%, 53.9%, and 59.5% compared to EB-FDMA, FE-FDMA, and RS, respectively.

VI. CONCLUSIONS

In this paper, we have investigated the problem of energy efficient computation and transmission resource allocation for FL over wireless communication networks. We have derived the time and energy consumption models for FL based on the convergence rate. With these models, we have formulated a joint learning and communication problem so as to minimize the total computation and transmission energy of the network. To solve this problem, we have proposed an iterative algorithm with low complexity, for which, at each iteration, we have derived closed-form solutions for the computation and transmission resources. Numerical results have shown that the proposed scheme outperforms conventional schemes in terms of total energy consumption, especially for small maximum average transmit power.

APPENDIX A
PROOF OF LEMMA 1

Based on (12), we have:

G_k(w^{(n)}, h_k^{(n),(i+1)})
\overset{A1}{≤} G_k(w^{(n)}, h_k^{(n),(i)}) − δ\|∇G_k(w^{(n)}, h_k^{(n),(i)})\|^2 + \frac{Lδ^2}{2}\|∇G_k(w^{(n)}, h_k^{(n),(i)})\|^2
= G_k(w^{(n)}, h_k^{(n),(i)}) − \frac{(2−Lδ)δ}{2}\|∇G_k(w^{(n)}, h_k^{(n),(i)})\|^2.   (A.1)

Similar to (B.2) in Appendix B, we can prove that:

\|∇G_k(w^{(n)}, h_k)\|^2 ≥ γ(G_k(w^{(n)}, h_k) − G_k(w^{(n)}, h_k^{(n)*})),   (A.2)

where h_k^{(n)*} is the optimal solution of problem (11). Based on (A.1) and (A.2), we have:

G_k(w^{(n)}, h_k^{(n),(i+1)}) − G_k(w^{(n)}, h_k^{(n)*})
≤ G_k(w^{(n)}, h_k^{(n),(i)}) − G_k(w^{(n)}, h_k^{(n)*}) − \frac{(2−Lδ)δγ}{2}(G_k(w^{(n)}, h_k^{(n),(i)}) − G_k(w^{(n)}, h_k^{(n)*}))
≤ \left(1 − \frac{(2−Lδ)δγ}{2}\right)^{i+1}(G_k(w^{(n)}, 0) − G_k(w^{(n)}, h_k^{(n)*}))
≤ \exp\left(−(i+1)\frac{(2−Lδ)δγ}{2}\right)(G_k(w^{(n)}, 0) − G_k(w^{(n)}, h_k^{(n)*})),   (A.3)

where the last inequality follows from the fact that 1 − x ≤ \exp(−x). To ensure that G_k(w^{(n)}, h_k^{(n),(i)}) − G_k(w^{(n)}, h_k^{(n)*}) ≤ η(G_k(w^{(n)}, 0) − G_k(w^{(n)}, h_k^{(n)*})), we have (15).

APPENDIX B
PROOF OF THEOREM 1

Before proving Theorem 1, the following lemma is provided.

Lemma 5: Under assumptions A1 and A2, the following conditions hold:

\frac{1}{L}\|∇F_k(w^{(n)} + h^{(n)}) − ∇F_k(w^{(n)})\|^2 ≤ (∇F_k(w^{(n)} + h^{(n)}) − ∇F_k(w^{(n)}))^T h^{(n)} ≤ \frac{1}{γ}\|∇F_k(w^{(n)} + h^{(n)}) − ∇F_k(w^{(n)})\|^2,   (B.1)

and

\|∇F(w)\|^2 ≥ γ(F(w) − F(w^*)).   (B.2)

Proof: According to the mean value theorem, there always exists a \bar w such that

∇F_k(w^{(n)} + h^{(n)}) − ∇F_k(w^{(n)}) = ∇^2 F_k(\bar w) h^{(n)}.   (B.3)

Combining assumptions A1, A2, and (B.3) yields (B.1).

For the optimal solution w^* of problem (9), we always have ∇F(w^*) = 0. Combining (9) and (B.1), we also have γI ⪯ ∇^2 F(w), which indicates that

\|∇F(w) − ∇F(w^*)\| ≥ γ\|w − w^*\|,   (B.4)

and

F(w^*) ≥ F(w) + ∇F(w)^T(w^* − w) + \frac{γ}{2}\|w^* − w\|^2.   (B.5)

As a result, we have:

\|∇F(w)\|^2 = \|∇F(w) − ∇F(w^*)\|^2
\overset{(B.4)}{≥} γ\|∇F(w) − ∇F(w^*)\|\|w − w^*\|
≥ γ(∇F(w) − ∇F(w^*))^T(w − w^*)
= γ∇F(w)^T(w − w^*)
\overset{(B.5)}{≥} γ(F(w) − F(w^*)),   (B.6)

which proves (B.2).

For the optimal solution of problem (11), the first-order optimality condition always holds, i.e.,

∇F_k(w^{(n)} + h_k^{(n)*}) − ∇F_k(w^{(n)}) + ξ∇F(w^{(n)}) = 0.   (B.7)

We are now ready to prove Theorem 1. With the above inequalities and equalities, we have:
F(w^{(n+1)})
\overset{(10),A1}{≤} F(w^{(n)}) + \frac{1}{K}\sum_{k=1}^K ∇F(w^{(n)})^T h_k^{(n)} + \frac{L}{2K^2}\left\|\sum_{k=1}^K h_k^{(n)}\right\|^2
\overset{(11)}{=} F(w^{(n)}) + \frac{1}{Kξ}\sum_{k=1}^K \left(G_k(w^{(n)}, h_k^{(n)}) + ∇F_k(w^{(n)})^T h_k^{(n)} − F_k(w^{(n)} + h_k^{(n)})\right) + \frac{L}{2K^2}\left\|\sum_{k=1}^K h_k^{(n)}\right\|^2
\overset{A2}{≤} F(w^{(n)}) + \frac{1}{Kξ}\sum_{k=1}^K \left(G_k(w^{(n)}, h_k^{(n)}) − F_k(w^{(n)}) − \frac{γ}{2}\|h_k^{(n)}\|^2\right) + \frac{L}{2K^2}\left\|\sum_{k=1}^K h_k^{(n)}\right\|^2.   (B.8)

According to the triangle inequality and the mean inequality, we have

\left\|\frac{1}{K}\sum_{k=1}^K h_k^{(n)}\right\|^2 ≤ \left(\frac{1}{K}\sum_{k=1}^K \|h_k^{(n)}\|\right)^2 ≤ \frac{1}{K}\sum_{k=1}^K \|h_k^{(n)}\|^2.   (B.9)

Combining (B.8) and (B.9) yields:

F(w^{(n+1)})
≤ F(w^{(n)}) + \frac{1}{Kξ}\sum_{k=1}^K \left(G_k(w^{(n)}, h_k^{(n)}) − F_k(w^{(n)}) − \frac{γ−Lξ}{2}\|h_k^{(n)}\|^2\right)
\overset{(11)}{=} F(w^{(n)}) + \frac{1}{Kξ}\sum_{k=1}^K \left(G_k(w^{(n)}, h_k^{(n)}) − G_k(w^{(n)}, h_k^{(n)*}) − (G_k(w^{(n)}, 0) − G_k(w^{(n)}, h_k^{(n)*})) − \frac{γ−Lξ}{2}\|h_k^{(n)}\|^2\right)
\overset{(14)}{≤} F(w^{(n)}) − \frac{1}{Kξ}\sum_{k=1}^K \left(\frac{γ−Lξ}{2}\|h_k^{(n)}\|^2 + (1−η)(G_k(w^{(n)}, 0) − G_k(w^{(n)}, h_k^{(n)*}))\right)
\overset{(11)}{=} F(w^{(n)}) − \frac{1}{Kξ}\sum_{k=1}^K \left(\frac{γ−Lξ}{2}\|h_k^{(n)}\|^2 + (1−η)(F_k(w^{(n)}) − F_k(w^{(n)} + h_k^{(n)*}) + (∇F_k(w^{(n)}) − ξ∇F(w^{(n)}))^T h_k^{(n)*})\right)
\overset{(B.7)}{=} F(w^{(n)}) − \frac{1}{Kξ}\sum_{k=1}^K \left(\frac{γ−Lξ}{2}\|h_k^{(n)}\|^2 + (1−η)(F_k(w^{(n)}) − F_k(w^{(n)} + h_k^{(n)*}) + ∇F_k(w^{(n)} + h_k^{(n)*})^T h_k^{(n)*})\right).   (B.10)

Based on assumption A2, we can obtain:

F_k(w^{(n)}) ≥ F_k(w^{(n)} + h_k^{(n)*}) + ∇F_k(w^{(n)} + h_k^{(n)*})^T(−h_k^{(n)*}) + \frac{γ}{2}\|h_k^{(n)*}\|^2.   (B.11)

Applying (B.11) to (B.10), we can obtain:

F(w^{(n+1)}) ≤ F(w^{(n)}) − \frac{1}{Kξ}\sum_{k=1}^K \left(\frac{γ−Lξ}{2}\|h_k^{(n)}\|^2 + \frac{(1−η)γ}{2}\|h_k^{(n)*}\|^2\right).   (B.12)

From (B.1), the following relationship is obtained:

\|h_k^{(n)*}\|^2 ≥ \frac{1}{L^2}\|∇F_k(w^{(n)} + h_k^{(n)*}) − ∇F_k(w^{(n)})\|^2.   (B.13)

For the constant parameter ξ, we choose

γ − Lξ > 0.   (B.14)

According to (B.14) and (B.12), we can obtain:

F(w^{(n+1)}) ≤ F(w^{(n)}) − \frac{(1−η)γ}{2Kξ}\sum_{k=1}^K \|h_k^{(n)*}\|^2
\overset{(B.13)}{≤} F(w^{(n)}) − \frac{(1−η)γ}{2KL^2ξ}\sum_{k=1}^K \|∇F_k(w^{(n)} + h_k^{(n)*}) − ∇F_k(w^{(n)})\|^2
\overset{(B.7)}{=} F(w^{(n)}) − \frac{(1−η)γξ}{2L^2}\|∇F(w^{(n)})\|^2
\overset{(B.2)}{≤} F(w^{(n)}) − \frac{(1−η)γ^2ξ}{2L^2}(F(w^{(n)}) − F(w^*)).   (B.15)

Based on (B.15), we get:

F(w^{(n+1)}) − F(w^*) ≤ \left(1 − \frac{(1−η)γ^2ξ}{2L^2}\right)(F(w^{(n)}) − F(w^*))
≤ \left(1 − \frac{(1−η)γ^2ξ}{2L^2}\right)^{n+1}(F(w^{(0)}) − F(w^*))
≤ \exp\left(−(n+1)\frac{(1−η)γ^2ξ}{2L^2}\right)(F(w^{(0)}) − F(w^*)),   (B.16)

where the last inequality follows from the fact that 1 − x ≤ \exp(−x). To ensure that F(w^{(n+1)}) − F(w^*) ≤ ǫ_0(F(w^{(0)}) − F(w^*)), we have (17).

APPENDIX C
PROOF OF THEOREM 2

According to (20), transmitting with the minimal time is always energy efficient, i.e., the optimal time allocation is t_k^* = t_k^{min}. Substituting t_k^* = t_k^{min} into problem (20) yields:

\min_η \frac{α_1 \log_2(1/η) + α_2}{1−η}   (C.1)
s.t. t_k^{min} ≤ β_k(η), \quad ∀k ∈ K,   (C.1a)
0 ≤ η ≤ 1,   (C.1b)

where

α_1 = a\sum_{k=1}^K κ A_k f_k^2, \quad α_2 = a\sum_{k=1}^K t_k^{min} p_k, \quad β_k(η) = \frac{(1−η)T}{a} + \frac{A_k \log_2 η}{f_k^{max}}.   (C.2)

From (C.2), it can be verified that β_k(η) is a concave function. Due to the concavity of β_k(η), constraints (C.1a) can be equivalently transformed to:

η_k^{min} ≤ η ≤ η_k^{max},   (C.3)

where β_k(η_k^{min}) = β_k(η_k^{max}) = t_k^{min} and η_k^{min} ≤ η_k^{max}. Since β_k(1) = 0, \lim_{η→0^+} β_k(η) = −∞ and t_k^{min} > 0, we always have 0 < η_k^{min} ≤ η_k^{max} < 1. With the help of (C.3), problem (C.1) can be simplified as (23).

APPENDIX D
PROOF OF THEOREM 3

To minimize \sum_{k=1}^K t_k p_k, the transmit power p_k needs to be minimized. Minimizing p_k subject to (28a) gives:

p_k^* = \frac{N_0 b_k}{g_k}\left(2^{\frac{s}{t_k b_k}} − 1\right).   (D.1)

The first-order and second-order derivatives of p_k^* with respect to b_k are respectively given by

\frac{∂p_k^*}{∂b_k} = \frac{N_0}{g_k}\left(e^{\frac{(\ln 2)s}{t_k b_k}} − 1 − \frac{(\ln 2)s}{t_k b_k} e^{\frac{(\ln 2)s}{t_k b_k}}\right),   (D.2)

and

\frac{∂^2 p_k^*}{∂b_k^2} = \frac{N_0 (\ln 2)^2 s^2}{g_k t_k^2 b_k^3} e^{\frac{(\ln 2)s}{t_k b_k}} ≥ 0.   (D.3)

From (D.3), we can see that \frac{∂p_k^*}{∂b_k} is an increasing function of b_k. Since \lim_{b_k→+∞} \frac{∂p_k^*}{∂b_k} = 0, we have \frac{∂p_k^*}{∂b_k} < 0 for 0 < b_k < ∞, i.e., p_k^* in (D.1) is a decreasing function of b_k. Thus, the maximum transmit power constraint p_k^* ≤ p_k^{max} is equivalent to:

b_k ≥ b_k^{min} ≜ \frac{−(\ln 2)s}{t_k\left[W\left(−\frac{(\ln 2)N_0 s}{g_k p_k^{max} t_k} e^{−\frac{(\ln 2)N_0 s}{g_k p_k^{max} t_k}}\right) + \frac{(\ln 2)N_0 s}{g_k p_k^{max} t_k}\right]},   (D.4)

where b_k^{min} is defined in (32).

Substituting (D.1) into problem (28), we can obtain:

\min_b \sum_{k=1}^K \frac{N_0 t_k b_k}{g_k}\left(2^{\frac{s}{t_k b_k}} − 1\right),   (D.5)
s.t. \sum_{k=1}^K b_k ≤ B,   (D.5a)
b_k ≥ b_k^{min}, \quad ∀k ∈ K.   (D.5b)

According to (D.3), problem (D.5) is a convex problem, which can be effectively solved by using the Karush-Kuhn-Tucker (KKT) conditions. The Lagrange function of (D.5) is:

L(b, µ) = \sum_{k=1}^K \frac{N_0 t_k b_k}{g_k}\left(2^{\frac{s}{t_k b_k}} − 1\right) + µ\left(\sum_{k=1}^K b_k − B\right),   (D.6)

where µ is the Lagrange multiplier associated with constraint (D.5a). The first-order derivative of L(b, µ) with respect to b_k is:

\frac{∂L(b, µ)}{∂b_k} = \frac{N_0}{g_k}\left(e^{\frac{(\ln 2)s}{t_k b_k}} − 1 − \frac{(\ln 2)s}{t_k b_k} e^{\frac{(\ln 2)s}{t_k b_k}}\right) + µ.   (D.7)

We define b_k(µ) as the unique solution to \frac{∂L(b, µ)}{∂b_k} = 0. Given constraint (D.5b), the optimal b_k^* can be found from (30). Since the objective function of (D.5) is a decreasing function of b_k, constraint (D.5a) always holds with equality for the optimal solution, which shows that the optimal Lagrange multiplier is obtained by solving (34).

APPENDIX E
PROOF OF LEMMA 2

Assume that problem (35) with T = \bar T < T^* is feasible, and that the feasible solution is (\bar T, \bar t, \bar b, \bar f, \bar p, \bar η). Then, the solution (\bar T, \bar t, \bar b, \bar f, \bar p, \bar η) is feasible with a lower value of the objective function than the solution (T^*, t^*, b^*, f^*, p^*, η^*), which contradicts the fact that (T^*, t^*, b^*, f^*, p^*, η^*) is the optimal solution.

For problem (35) with T = \bar T > T^*, we can always construct a feasible solution (\bar T, t^*, b^*, f^*, p^*, η^*) to problem (35), as can be verified by checking all constraints.

APPENDIX F
PROOF OF LEMMA 3

To prove this, we first define the function

y = x \ln\left(1 + \frac{1}{x}\right), \quad x > 0.   (F.1)

Then, we have

y' = \ln\left(1 + \frac{1}{x}\right) − \frac{1}{x+1}, \quad y'' = −\frac{1}{x(x+1)^2} < 0.   (F.2)

According to (F.2), y' is a decreasing function. Since \lim_{x→+∞} y' = 0, we can obtain that y' > 0 for all 0 < x < +∞. As a result, y is an increasing function, i.e., the right hand side of (36b) is an increasing function of the bandwidth b_k.

To ensure that the maximum bandwidth constraint (36c) can be satisfied, the left hand side of (36b) should be as small as possible, i.e., t_k should be as long as possible. Based on (36a), the optimal time allocation should be:

t_k^* = \frac{(1−η)T}{a} + \frac{A_k \log_2 η}{f_k^{max}}, \quad ∀k ∈ K.   (F.3)

Substituting (F.3) into (36b), we can construct the following problem:

\min_{b, η} \sum_{k=1}^K b_k   (F.4)
s.t. v_k(η) ≤ b_k \log_2\left(1 + \frac{g_k p_k^{max}}{N_0 b_k}\right), \quad ∀k ∈ K,   (F.4a)
0 ≤ η ≤ 1,   (F.4b)
b_k ≥ 0, \quad ∀k ∈ K,   (F.4c)

where v_k(η) is defined in (39). We can observe that the set (36a)-(36e) is nonempty if and only if the optimal objective value of (F.4) is less than B. Since the right hand side of (36b) is an increasing function of b_k, (36b) should hold with equality for the optimal solution of problem (F.4). Setting (36b) to equality, problem (F.4) reduces to (37).

APPENDIX G
PROOF OF LEMMA 4

We first prove that v_k(η) is a convex function. To show this, we define:

φ(η) = \frac{s}{η}, \quad 0 ≤ η ≤ 1,   (G.1)

and

ϕ_k(η) = \frac{(1−η)T}{a} + \frac{A_k \log_2 η}{f_k^{max}}, \quad 0 ≤ η ≤ 1.   (G.2)

According to (39), we have v_k(η) = φ(ϕ_k(η)). Then, the second-order derivative of v_k(η) is:

v_k''(η) = φ''(ϕ_k(η))(ϕ_k'(η))^2 + φ'(ϕ_k(η))ϕ_k''(η).   (G.3)

According to (G.1) and (G.2), we have: [8] W. Saad, M. Bennis, and M. Chen, “A vision of 6G wireless systems:
s 2s Applications, trends, technologies, and open research problems,” IEEE
φ′ (η) = − 2 ≤ 0, φ′′ (η) = 3 ≥ 0, (G.4) Network, to appear, 2019.
η η [9] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network
Ak intelligence at the edge,” Proceedings of the IEEE, vol. 107, no. 11, pp.
ϕ′′k (η) = − ≤ 0. (G.5)
(ln 2)fkmax η 2 2204–2239, Nov. 2019.
[10] M. Chen, O. Semiari, W. Saad, X. Liu, and C. Yin, “Federated echo state
Combining (G.3)-(G.5), we can find that vk′′ (η) ≥ 0, i.e., vk (η) learning for minimizing breaks in presence in wireless virtual reality
is a convex function. networks,” IEEE Trans. Wireless Commun., to appear, 2019.
Then, we can show that uk(η) is an increasing and convex function. According to Appendix B, uk(η) is the inverse function of the right hand side of (36b). If we further define the function:

$$z_k(\eta)=\eta\log_2\!\left(1+\frac{g_k p_k^{\max}}{N_0\,\eta}\right),\quad \eta\ge 0,\qquad\text{(G.6)}$$

then uk(η) is the inverse function of zk(η), which gives uk(zk(η)) = η.

According to (F.1) and (F.2) in Appendix F, the function zk(η) is increasing and concave, i.e., zk′(η) ≥ 0 and zk′′(η) ≤ 0. Since zk(η) is an increasing function, its inverse function uk(η) is also an increasing function.

Based on the definition of a concave function, for any η1 ≥ 0, η2 ≥ 0, and 0 ≤ θ ≤ 1, we have:

$$z_k(\theta\eta_1+(1-\theta)\eta_2)\ge\theta z_k(\eta_1)+(1-\theta)z_k(\eta_2).\qquad\text{(G.7)}$$

Applying the increasing function uk(η) to both sides of (G.7) yields:

$$\theta\eta_1+(1-\theta)\eta_2\ge u_k\big(\theta z_k(\eta_1)+(1-\theta)z_k(\eta_2)\big).\qquad\text{(G.8)}$$

Denote η̄1 = zk(η1) and η̄2 = zk(η2), i.e., we have η1 = uk(η̄1) and η2 = uk(η̄2). Thus, (G.8) can be rewritten as:

$$\theta u_k(\bar\eta_1)+(1-\theta)u_k(\bar\eta_2)\ge u_k\big(\theta\bar\eta_1+(1-\theta)\bar\eta_2\big),\qquad\text{(G.9)}$$

which indicates that uk(η) is a convex function. As a result, we have proven that uk(η) is an increasing and convex function, which shows:

$$u_k'(\eta)\ge 0,\qquad u_k''(\eta)\ge 0.\qquad\text{(G.10)}$$
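Because uk is defined only implicitly, a quick numerical experiment helps: invert the increasing function zk by bisection, then test both the round trip uk(zk(η)) = η and the midpoint form of (G.9) at random points. The values of gk, pkmax, and N0 below are assumed for illustration:

```python
import numpy as np

# Illustrative parameters (assumed): channel gain g_k, max transmit power
# p_k^max (W), noise power N0. Note z_k saturates at g_k p_max / (N0 ln 2).
g_k, p_max, N0 = 1e-8, 0.1, 1e-10

z_k = lambda eta: eta * np.log2(1 + g_k * p_max / (N0 * eta))  # (G.6)

def u_k(r, lo=1e-12, hi=1e6, iters=200):
    # u_k = z_k^{-1}: z_k is increasing, so invert it by bisection.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if z_k(mid) < r else (lo, mid)
    return 0.5 * (lo + hi)

print(u_k(z_k(0.3)))  # round trip: ~0.3

rng = np.random.default_rng(0)
for _ in range(5):
    r1, r2 = rng.uniform(0.1, 10.0, size=2)  # stay below the saturation level
    theta = rng.uniform()
    lhs = theta * u_k(r1) + (1 - theta) * u_k(r2)
    rhs = u_k(theta * r1 + (1 - theta) * r2)
    print(bool(lhs >= rhs - 1e-9))  # midpoint form of (G.9): expect True
```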
To show the convexity of uk(vk(η)), we have:

$$\big(u_k(v_k(\eta))\big)''=u_k''(v_k(\eta))\big(v_k'(\eta)\big)^2+u_k'(v_k(\eta))\,v_k''(\eta)\ge 0,\qquad\text{(G.11)}$$

according to vk′′(η) ≥ 0 and (G.10). As a result, uk(vk(η)) is a convex function. ∎
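Finally, the convexity of the composition can be spot-checked end to end. Here the data size s is normalized to 1 so that the required rate vk(η) stays inside the invertible range of zk, which saturates at gk pkmax/(N0 ln 2); all parameter values remain illustrative assumptions:

```python
import numpy as np

# Illustrative parameters (assumed); s normalized so v_k stays below the
# saturation level of z_k (about 14.43 here).
T, a, A_k, f_max, s = 0.5, 1.0, 1e7, 1e9, 1.0
g_k, p_max, N0 = 1e-8, 0.1, 1e-10

v_k = lambda eta: s / ((1 - eta) * T / a + A_k * np.log2(eta) / f_max)
z_k = lambda x: x * np.log2(1 + g_k * p_max / (N0 * x))

def u_k(r, lo=1e-12, hi=1e6, iters=200):
    # Invert the increasing function z_k by bisection.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if z_k(mid) < r else (lo, mid)
    return 0.5 * (lo + hi)

eta = np.linspace(0.2, 0.8, 401)
comp = np.array([u_k(v_k(e)) for e in eta])
h = eta[1] - eta[0]
d2 = (comp[2:] - 2 * comp[1:-1] + comp[:-2]) / h ** 2
print("min second difference of u_k(v_k):", d2.min())  # expected >= 0
```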