Energy Efficient Federated Learning Over Wireless
Abstract—In this paper, the problem of energy efficient transmission and computation resource allocation for federated learning (FL) over wireless communication networks is investigated.

arXiv:1911.02417v2 [cs.IT] 18 Nov 2020

... federated learning (FL) framework that will be adopted in future Internet of Things (IoT) systems [17]–[26]. In FL, wireless devices can cooperatively execute a learning task by ...
Fig. 1. Illustration of the considered model for FL over wireless communication networks.

Fig. 2. The FL procedure between users and the BS.
... derive the convergence rate.
• We formulate a joint computation and transmission optimization problem aiming to minimize the total energy consumption for local computation and wireless transmission. To solve this problem, an iterative algorithm with low complexity is proposed. At each step of this algorithm, we derive new closed-form solutions for the time allocation, bandwidth allocation, power control, computation frequency, and learning accuracy.
• To obtain a feasible solution for the total energy minimization problem, we construct the FL completion time minimization problem. We theoretically show that the completion time is a convex function of the learning accuracy. Based on this theoretical finding, we propose a bisection-based algorithm to obtain the optimal solution for the FL completion time minimization.
• Simulation results show that the proposed scheme, which jointly considers computation and transmission optimization, can achieve up to 59.5% energy reduction compared to the conventional FL method.

The rest of this paper is organized as follows. The system model is described in Section II. Section III provides the problem formulation and the resource allocation for total energy minimization. The algorithm to find a feasible solution of the original energy minimization problem is given in Section IV. Simulation results are analyzed in Section V. Conclusions are drawn in Section VI.

II. SYSTEM MODEL

Consider a cellular network that consists of one BS serving $K$ users, as shown in Fig. 1. Each user $k$ has a local dataset $\mathcal{D}_k$ with $D_k$ data samples. For each dataset $\mathcal{D}_k = \{\boldsymbol{x}_{kl}, y_{kl}\}_{l=1}^{D_k}$, $\boldsymbol{x}_{kl} \in \mathbb{R}^d$ is an input vector of user $k$ and $y_{kl}$ is its corresponding output.¹ The BS and all users cooperatively perform an FL algorithm over wireless networks for data analysis and inference. Hereinafter, the FL model that is trained by each user's dataset is called the local FL model, while the FL model that is generated by the BS using local FL model inputs from all users is called the global FL model.

¹For simplicity, we consider an FL algorithm with a single output. In future work, our approach will be extended to the case with multiple outputs.

A. Computation and Transmission Model

The FL procedure between the users and their serving BS is shown in Fig. 2. From this figure, the FL procedure contains three steps at each iteration: local computation at each user (using several local iterations), local FL model transmission for each user, and result aggregation and broadcast at the BS. The local computation step is essentially the phase during which each user calculates its local FL model by using its local dataset and the received global FL model.

1) Local Computation: Let $f_k$ be the computation capacity of user $k$, which is measured by the number of CPU cycles per second. The computation time at user $k$ needed for data processing is:

$$\tau_k = \frac{I_k C_k D_k}{f_k}, \quad \forall k \in \mathcal{K}, \qquad (1)$$

where $C_k$ (cycles/sample) is the number of CPU cycles required for computing one sample of data at user $k$ and $I_k$ is the number of local iterations at user $k$. According to Lemma 1 in [36], the energy consumption for computing a total number of $C_k D_k$ CPU cycles at user $k$ is:

$$E_{k1}^{\mathrm{C}} = \kappa C_k D_k f_k^2, \qquad (2)$$

where $\kappa$ is the effective switched capacitance that depends on the chip architecture. To compute the local FL model, user $k$ needs to compute $C_k D_k$ CPU cycles with $I_k$ local iterations, which means that the total computation energy at user $k$ is:

$$E_k^{\mathrm{C}} = I_k E_{k1}^{\mathrm{C}} = \kappa I_k C_k D_k f_k^2. \qquad (3)$$

2) Wireless Transmission: After local computation, all users upload their local FL models to the BS via frequency domain multiple access (FDMA). The achievable rate of user $k$ can be given by:

$$r_k = b_k \log_2\!\left(1 + \frac{g_k p_k}{N_0 b_k}\right), \quad \forall k \in \mathcal{K}, \qquad (4)$$

where $b_k$ is the bandwidth allocated to user $k$, $p_k$ is the average transmit power of user $k$, $g_k$ is the channel gain between user $k$ and the BS, and $N_0$ is the power spectral density of the Gaussian noise. Note that the Shannon capacity (4) serves as an upper bound of the transmission rate. Due to the limited bandwidth of the system, we have $\sum_{k=1}^K b_k \le B$, where $B$ is the total bandwidth.

In this step, user $k$ needs to upload the local FL model to the BS. Since the dimensions of the local FL model are fixed for all users, the data size that each user needs to upload is constant, and can be denoted by $s$. To upload data of size $s$ within transmission time $t_k$, we must have $t_k r_k \ge s$. To transmit data of size $s$ within a time duration $t_k$, the wireless transmit energy of user $k$ will be $E_k^{\mathrm{T}} = t_k p_k$.
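For concreteness, the computation and transmission models in (1)-(4) can be evaluated numerically. The sketch below is illustrative only: the CPU frequency, switched capacitance κ, channel gain, noise density, and model size s are assumed values, not the simulation settings used in the paper.

```python
import math

def local_time_and_energy(I_k, C_k, D_k, f_k, kappa):
    """Computation time (1) and total computation energy (3) for one user."""
    tau_k = I_k * C_k * D_k / f_k                # eq. (1): seconds per global round
    E_comp = kappa * I_k * C_k * D_k * f_k**2    # eq. (3): Joules
    return tau_k, E_comp

def uplink_time_and_energy(s, b_k, p_k, g_k, N0):
    """Minimum transmission time implied by t_k * r_k >= s with rate (4),
    and the resulting transmit energy E_k^T = t_k * p_k."""
    r_k = b_k * math.log2(1.0 + g_k * p_k / (N0 * b_k))  # eq. (4): bits/s
    t_k = s / r_k                                        # tightest feasible t_k
    return t_k, t_k * p_k

# Assumed parameters: 500 samples, 1e4 cycles/sample, 1 GHz CPU, kappa = 1e-28,
# s = 28.1 kbit, b_k = 1 MHz, p_k = 0.1 W, channel/noise giving unit SNR.
tau, E_c = local_time_and_energy(I_k=10, C_k=1e4, D_k=500, f_k=1e9, kappa=1e-28)
t, E_t = uplink_time_and_energy(s=28.1e3, b_k=1e6, p_k=0.1, g_k=1e-7, N0=1e-14)
```

With these assumed numbers, computation and communication costs are of comparable magnitude, which is the regime in which the joint optimization studied here matters.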
Fig. 3. An implementation for the FL algorithm via FDMA.

3) Information Broadcast: In this step, the BS aggregates the global FL model. The BS broadcasts the global FL model to all users in the downlink. Due to the high transmit power at the BS and the high bandwidth that can be used for data broadcasting, the downlink time is neglected compared to the uplink data transmission time. It can be observed that the local data $\mathcal{D}_k$ is not accessed by the BS, which protects the privacy of the users, as is required by FL.

According to the above FL model, the energy consumption of each user includes both the local computation energy $E_k^{\mathrm{C}}$ and the wireless transmission energy $E_k^{\mathrm{T}}$. Denote the number of global iterations by $I_0$; the total energy consumption of all users that participate in FL will be:

$$E = I_0 \sum_{k=1}^{K} \left(E_k^{\mathrm{C}} + E_k^{\mathrm{T}}\right). \qquad (5)$$

Hereinafter, the total time needed for completing the execution of the FL algorithm is called the completion time. The completion time of each user includes the local computation time and the transmission time, as shown in Fig. 3. Based on (1), the completion time $T_k$ of user $k$ will be:

$$T_k = I_0(\tau_k + t_k) = I_0\!\left(\frac{I_k C_k D_k}{f_k} + t_k\right). \qquad (6)$$

Let $T$ be the maximum completion time for training the entire FL algorithm; then we have:

$$T_k \le T, \quad \forall k \in \mathcal{K}. \qquad (7)$$

B. FL Model

We define a vector $\boldsymbol{w}$ to capture the parameters related to the global FL model. We introduce the loss function $f(\boldsymbol{w}, \boldsymbol{x}_{kl}, y_{kl})$ that captures the FL performance over input vector $\boldsymbol{x}_{kl}$ and output $y_{kl}$. For different learning tasks, the loss function will be different. For example, $f(\boldsymbol{w}, \boldsymbol{x}_{kl}, y_{kl}) = \frac{1}{2}(\boldsymbol{x}_{kl}^T \boldsymbol{w} - y_{kl})^2$ for linear regression and $f(\boldsymbol{w}, \boldsymbol{x}_{kl}, y_{kl}) = \log(1 + \exp(-y_{kl}\boldsymbol{x}_{kl}^T \boldsymbol{w}))$ for logistic regression. Since the dataset of user $k$ is $\mathcal{D}_k$, the total loss function of user $k$ will be:

$$F_k(\boldsymbol{w}, \boldsymbol{x}_{k1}, y_{k1}, \cdots, \boldsymbol{x}_{kD_k}, y_{kD_k}) = \frac{1}{D_k}\sum_{l=1}^{D_k} f(\boldsymbol{w}, \boldsymbol{x}_{kl}, y_{kl}). \qquad (8)$$

Note that the function $f(\boldsymbol{w}, \boldsymbol{x}_{kl}, y_{kl})$ is the loss function of user $k$ with one data sample, while $F_k(\boldsymbol{w}, \boldsymbol{x}_{k1}, y_{k1}, \cdots, \boldsymbol{x}_{kD_k}, y_{kD_k})$ is the total loss function of user $k$ over its whole local dataset. In the following, $F_k(\boldsymbol{w}, \boldsymbol{x}_{k1}, y_{k1}, \cdots, \boldsymbol{x}_{kD_k}, y_{kD_k})$ is denoted by $F_k(\boldsymbol{w})$ for simplicity of notation.

In order to deploy an FL algorithm, it is necessary to train the underlying model. Training is done in order to generate a unified FL model for all users without sharing any datasets. The FL training problem can be formulated as [2], [20], [27]:

$$\min_{\boldsymbol{w}} \; F(\boldsymbol{w}) \triangleq \sum_{k=1}^{K} \frac{D_k}{D} F_k(\boldsymbol{w}) = \frac{1}{D}\sum_{k=1}^{K}\sum_{l=1}^{D_k} f(\boldsymbol{w}, \boldsymbol{x}_{kl}, y_{kl}), \qquad (9)$$

where $D = \sum_{k=1}^{K} D_k$ is the total number of data samples of all users.

To solve problem (9), we adopt the distributed approximate Newton (DANE) algorithm in [27], which is summarized in Algorithm 1.

Algorithm 1 FL Algorithm
1: Initialize the global model $\boldsymbol{w}^{(0)}$ and the global iteration number $n = 0$.
2: repeat
3:   Each user $k$ computes $\nabla F_k(\boldsymbol{w}^{(n)})$ and sends it to the BS.
4:   The BS computes $\nabla F(\boldsymbol{w}^{(n)}) = \frac{1}{K}\sum_{k=1}^{K} \nabla F_k(\boldsymbol{w}^{(n)})$, which is broadcast to all users.
5:   parallel for user $k \in \mathcal{K} = \{1, \cdots, K\}$
6:     Initialize the local iteration number $i = 0$ and set $\boldsymbol{h}_k^{(n),(0)} = \boldsymbol{0}$.
7:     repeat
8:       Update $\boldsymbol{h}_k^{(n),(i+1)} = \boldsymbol{h}_k^{(n),(i)} - \delta \nabla G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n),(i)})$ and $i = i + 1$.
9:     until the accuracy $\eta$ of the local optimization problem (11) is obtained.
10:    Denote $\boldsymbol{h}_k^{(n)} = \boldsymbol{h}_k^{(n),(i)}$; each user sends $\boldsymbol{h}_k^{(n)}$ to the BS.
11:  end for
12:  The BS computes
$$\boldsymbol{w}^{(n+1)} = \boldsymbol{w}^{(n)} + \frac{1}{K}\sum_{k=1}^{K} \boldsymbol{h}_k^{(n)}, \qquad (10)$$
     and broadcasts the value to all users.
13:  Set $n = n + 1$.
14: until the accuracy $\epsilon_0$ of problem (9) is obtained.

Note that gradient descent (GD) is used at each user in Algorithm 1. One can use stochastic gradient descent (SGD) to decrease the computation complexity for cases in which a relatively low accuracy can be tolerated. However, if high accuracy is needed, GD is preferred [27]. Moreover, the number of global iterations is higher in SGD than in GD. Due to limited wireless resources, SGD may not be efficient since it requires more rounds of wireless transmission than GD. The DANE algorithm is designed to solve a general local optimization problem before averaging the solutions of all users. It relies on the similarity of the Hessians of the local objectives, representing its iterations as an average of inexact Newton steps. In Algorithm 1, we can see that, at every FL iteration, each user downloads the global FL model from the BS for local computation, while the BS periodically gathers the local FL models from all users and sends the updated global FL model back to all users. According to Algorithm 1, since the local optimization problem is solved with several local iterations at each global iteration, the number of full gradient computations will be smaller than the number of local updates.

In Algorithm 1, the iterative method involves a number of global iterations (i.e., the value of $n$ in Algorithm 1) to achieve a global accuracy $\epsilon_0$ for the global FL model. In other words, the solution $\boldsymbol{w}^{(n)}$ of problem (9) with accuracy $\epsilon_0$ is a point such that

$$F(\boldsymbol{w}^{(n)}) - F(\boldsymbol{w}^*) \le \epsilon_0 \left(F(\boldsymbol{w}^{(0)}) - F(\boldsymbol{w}^*)\right). \qquad (16)$$

We define $\boldsymbol{w}^{(n)}$ as the global FL model at a given iteration $n$. In practice, each user solves the local optimization problem:
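The structure of Algorithm 1 can be sketched numerically. Since the local objective $G_k$ of problem (11) is not reproduced above, the sketch below uses the standard DANE-style surrogate whose stationarity condition matches (B.7), i.e., $\nabla_{\boldsymbol{h}} G_k(\boldsymbol{w}, \boldsymbol{h}) = \nabla F_k(\boldsymbol{w} + \boldsymbol{h}) - \nabla F_k(\boldsymbol{w}) + \xi\nabla F(\boldsymbol{w})$; treat this, along with the toy linear-regression data, step size $\delta$, and iteration counts, as assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, Dk = 3, 5, 40
# Toy per-user linear-regression data (an assumption for illustration).
X = [rng.normal(size=(Dk, d)) for _ in range(K)]
w_true = rng.normal(size=d)
Y = [Xk @ w_true + 0.01 * rng.normal(size=Dk) for Xk in X]

def grad_Fk(k, w):
    # Gradient of the local loss F_k(w) = (1 / (2 D_k)) * ||X_k w - y_k||^2.
    return X[k].T @ (X[k] @ w - Y[k]) / Dk

xi, delta, local_iters = 1.0, 0.05, 30
w = np.zeros(d)
for n in range(20):                                   # global iterations
    gF = sum(grad_Fk(k, w) for k in range(K)) / K     # BS aggregates gradients
    h_list = []
    for k in range(K):                                # local steps on surrogate G_k
        hk = np.zeros(d)
        for _ in range(local_iters):
            g = grad_Fk(k, w + hk) - grad_Fk(k, w) + xi * gF   # matches (B.7)
            hk = hk - delta * g
        h_list.append(hk)
    w = w + sum(h_list) / K                           # BS update, eq. (10)
```

On this quadratic toy problem the global model converges toward the data-generating parameter over a few tens of rounds, mirroring the download-compute-upload-aggregate cycle described above.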
- B2: The function $F_k(\boldsymbol{w})$ is of $\gamma$-bounded nonconvexity (or $\gamma$-nonconvex), i.e., all the eigenvalues of $\nabla^2 F_k(\boldsymbol{w})$ lie in $[-\gamma, L]$, for some $\gamma \in (0, L]$.

Due to the nonconvexity of the function $F_k(\boldsymbol{w})$, we respectively replace $F_k(\boldsymbol{w})$ and $F(\boldsymbol{w})$ with their regularized versions [38]:

$$\tilde{F}_k^{(n)}(\boldsymbol{w}) = F_k(\boldsymbol{w}) + \gamma\|\boldsymbol{w} - \boldsymbol{w}^{(n)}\|^2, \quad \tilde{F}^{(n)}(\boldsymbol{w}) = \sum_{k=1}^{K}\frac{D_k}{D}\tilde{F}_k^{(n)}(\boldsymbol{w}). \qquad (18)$$

Based on assumption A2 and (18), both functions $\tilde{F}_k^{(n)}(\boldsymbol{w})$ and $\tilde{F}^{(n)}(\boldsymbol{w})$ are $\gamma$-strongly convex, i.e., $\nabla^2 \tilde{F}_k^{(n)}(\boldsymbol{w}) \succeq \gamma\boldsymbol{I}$ and $\nabla^2 \tilde{F}^{(n)}(\boldsymbol{w}) \succeq \gamma\boldsymbol{I}$. Moreover, it can be proved that both functions $\tilde{F}_k^{(n)}(\boldsymbol{w})$ and $\tilde{F}^{(n)}(\boldsymbol{w})$ are $(L+2\gamma)$-Lipschitz. The convergence analysis in Section II-B can be directly applied.

III. RESOURCE ALLOCATION FOR ENERGY MINIMIZATION

In this section, we formulate the energy minimization problem for FL. Since it is challenging to obtain the globally optimal solution due to nonconvexity, an iterative algorithm with low complexity is proposed to solve the energy minimization problem.

A. Problem Formulation

Our goal is to minimize the total energy consumption of all users under a latency constraint. This energy efficient optimization problem can be posed as follows:

$$\min_{\boldsymbol{t}, \boldsymbol{b}, \boldsymbol{f}, \boldsymbol{p}, \eta} \; E, \qquad (19)$$
$$\text{s.t.} \; \frac{a}{1-\eta}\left(\frac{A_k \log_2(1/\eta)}{f_k} + t_k\right) \le T, \quad \forall k \in \mathcal{K}, \qquad (19a)$$
$$t_k b_k \log_2\!\left(1 + \frac{g_k p_k}{N_0 b_k}\right) \ge s, \quad \forall k \in \mathcal{K}, \qquad (19b)$$
$$\sum_{k=1}^{K} b_k \le B, \qquad (19c)$$
$$0 \le f_k \le f_k^{\max}, \quad \forall k \in \mathcal{K}, \qquad (19d)$$
$$0 \le p_k \le p_k^{\max}, \quad \forall k \in \mathcal{K}, \qquad (19e)$$
$$0 \le \eta \le 1, \qquad (19f)$$
$$t_k \ge 0, \; b_k \ge 0, \quad \forall k \in \mathcal{K}, \qquad (19g)$$

where $\boldsymbol{t} = [t_1, \cdots, t_K]^T$, $\boldsymbol{b} = [b_1, \cdots, b_K]^T$, $\boldsymbol{f} = [f_1, \cdots, f_K]^T$, $\boldsymbol{p} = [p_1, \cdots, p_K]^T$, $A_k = vC_kD_k$ is a constant, and $f_k^{\max}$ and $p_k^{\max}$ are respectively the maximum local computation capacity and the maximum value of the average transmit power of user $k$. Constraint (19a) is obtained by substituting $I_k = v\log_2(1/\eta)$ from (15) and $I_0 = \frac{a}{1-\eta}$ from (17) into (6). Constraint (19b) is derived according to (4) and $t_k r_k \ge s$. Constraint (19a) indicates that the execution time of the local tasks and the transmission time for all users should not exceed the maximum completion time for the whole FL algorithm. Since the total number of iterations is the same for each user, the constraint in (19a) captures a maximum time constraint for all users at each iteration if we divide both sides of (19a) by the total number of iterations. The data transmission constraint is given by (19b), while the bandwidth constraint is given by (19c). Constraints (19d) and (19e) respectively represent the maximum local computation capacity and average transmit power limits of all users. The local accuracy constraint is given by (19f).

B. Iterative Algorithm

The proposed iterative algorithm mainly contains two steps at each iteration. To optimize $(\boldsymbol{t}, \boldsymbol{b}, \boldsymbol{f}, \boldsymbol{p}, \eta)$ in problem (19), we first optimize $(\boldsymbol{t}, \eta)$ with fixed $(\boldsymbol{b}, \boldsymbol{f}, \boldsymbol{p})$; then $(\boldsymbol{b}, \boldsymbol{f}, \boldsymbol{p})$ is updated based on the $(\boldsymbol{t}, \eta)$ obtained in the previous step. The advantage of this iterative algorithm lies in the fact that we can obtain the optimal solution of $(\boldsymbol{t}, \eta)$ or $(\boldsymbol{b}, \boldsymbol{f}, \boldsymbol{p})$ in each step.

In the first step, given $(\boldsymbol{b}, \boldsymbol{f}, \boldsymbol{p})$, problem (19) becomes:

$$\min_{\boldsymbol{t}, \eta} \; \frac{a}{1-\eta}\sum_{k=1}^{K}\left(\kappa A_k \log_2(1/\eta) f_k^2 + t_k p_k\right), \qquad (20)$$
$$\text{s.t.} \; \frac{a}{1-\eta}\left(\frac{A_k \log_2(1/\eta)}{f_k} + t_k\right) \le T, \quad \forall k \in \mathcal{K}, \qquad (20a)$$
$$t_k \ge t_k^{\min}, \quad \forall k \in \mathcal{K}, \qquad (20b)$$
$$0 \le \eta \le 1, \qquad (20c)$$

where

$$t_k^{\min} = \frac{s}{b_k \log_2\!\left(1 + \frac{g_k p_k}{N_0 b_k}\right)}, \quad \forall k \in \mathcal{K}. \qquad (21)$$

The optimal solution of (20) can be derived using the following theorem.

Theorem 2: The optimal solution $(\boldsymbol{t}^*, \eta^*)$ of problem (20) satisfies:

$$t_k^* = t_k^{\min}, \quad \forall k \in \mathcal{K}, \qquad (22)$$

and $\eta^*$ is the optimal solution to:

$$\min_{\eta} \; \frac{\alpha_1 \log_2(1/\eta) + \alpha_2}{1-\eta} \qquad (23)$$
$$\text{s.t.} \; \eta^{\min} \le \eta \le \eta^{\max}, \qquad (23a)$$

where

$$\eta^{\min} = \max_{k \in \mathcal{K}} \eta_k^{\min}, \quad \eta^{\max} = \min_{k \in \mathcal{K}} \eta_k^{\max}, \qquad (24)$$

$\beta_k(\eta_k^{\min}) = \beta_k(\eta_k^{\max}) = t_k^{\min}$, $t_k^{\min} \le t_k^{\max}$, and $\alpha_1$, $\alpha_2$, and $\beta_k(\eta)$ are defined in (C.2).

Proof: See Appendix C.

Theorem 2 shows that it is optimal to transmit with the minimum time for each user. Based on this finding, problem (20) is equivalent to problem (23) with only one variable. Obviously, the objective function in (23) has a fractional form, which is generally hard to solve. By using the parametric approach in [39], [40], we consider the following problem:

$$H(\zeta) = \min_{\eta^{\min} \le \eta \le \eta^{\max}} \; \alpha_1 \log_2(1/\eta) + \alpha_2 - \zeta(1-\eta). \qquad (25)$$

It has been proved in [39] that solving (23) is equivalent to finding the root of the nonlinear function $H(\zeta)$. Since (25) with fixed $\zeta$ is convex, the optimal solution $\eta^*$ can be obtained by setting the first-order derivative to zero, yielding the optimal solution $\eta^* = \frac{\alpha_1}{(\ln 2)\zeta}$. Thus, problem (23) can be solved by using the Dinkelbach method in [39] (shown as Algorithm 2).

In the second step, given the $(\boldsymbol{t}, \eta)$ calculated in the first step, problem (19) can be simplified as:
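Returning to the first step, the fractional subproblem (23) admits a compact Dinkelbach iteration: each pass solves the convex inner problem (25) in closed form via $\eta = \alpha_1/((\ln 2)\zeta)$ projected onto $[\eta^{\min}, \eta^{\max}]$, then updates $\zeta$ to the current objective value until $H(\zeta) \approx 0$. The numerical values of $\alpha_1$, $\alpha_2$, and the accuracy bounds below are illustrative assumptions.

```python
import math

def dinkelbach(alpha1, alpha2, eta_min, eta_max, tol=1e-9, max_iter=100):
    """Solve min over [eta_min, eta_max] of (alpha1*log2(1/eta) + alpha2)/(1 - eta)
    by finding the root of H(zeta) in (25)."""
    obj = lambda e: (alpha1 * math.log2(1.0 / e) + alpha2) / (1.0 - e)
    eta = 0.5 * (eta_min + eta_max)      # initial point
    zeta = obj(eta)
    for _ in range(max_iter):
        # Inner problem (25) is convex; its unconstrained minimizer is
        # eta = alpha1 / (ln(2) * zeta), projected onto the accuracy bounds.
        eta = min(max(alpha1 / (math.log(2) * zeta), eta_min), eta_max)
        H = alpha1 * math.log2(1.0 / eta) + alpha2 - zeta * (1.0 - eta)
        if abs(H) < tol:                 # H(zeta) = 0 at the optimum
            break
        zeta = obj(eta)                  # Dinkelbach parameter update
    return eta, zeta

eta_opt, val = dinkelbach(alpha1=1.0, alpha2=2.0, eta_min=0.01, eta_max=0.9)
```

The iteration typically converges in a handful of passes, consistent with the superlinear convergence of the Dinkelbach method.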
... has high computation capacity, the latency of implementing Algorithm 3 at the BS will not affect the latency of the FL process.

IV. ALGORITHM TO FIND A FEASIBLE SOLUTION OF PROBLEM (19)

Since the iterative algorithm to solve problem (19) requires an initial feasible solution, we provide an efficient algorithm to find a feasible solution of problem (19) in this section. To obtain a feasible solution of (19), we construct the completion time minimization problem:

$$\min_{T, \boldsymbol{t}, \boldsymbol{b}, \boldsymbol{f}, \boldsymbol{p}, \eta} \; T, \qquad (35)$$
$$\text{s.t.} \; (19a)\text{--}(19g). \qquad (35a)$$

We define $(T^*, \boldsymbol{t}^*, \boldsymbol{b}^*, \boldsymbol{f}^*, \boldsymbol{p}^*, \eta^*)$ as the optimal solution of problem (35). Then, $(\boldsymbol{t}^*, \boldsymbol{b}^*, \boldsymbol{f}^*, \boldsymbol{p}^*, \eta^*)$ is a feasible solution of problem (19) if $T^* \le T$, where $T$ is the maximum delay in constraint (19a). Otherwise, problem (19) is infeasible. Although the completion time minimization problem (35) is still nonconvex due to constraints (19a)-(19b), we show that the globally optimal solution can be obtained by using the bisection method.

A. Optimal Resource Allocation

Lemma 2: Problem (35) with $T < T^*$ does not have a feasible solution (i.e., it is infeasible), while problem (35) with $T > T^*$ always has a feasible solution (i.e., it is feasible).

Proof: See Appendix E.

According to Lemma 2, we can use the bisection method to obtain the optimal solution of problem (35).

With a fixed $T$, we still need to check whether there exists a feasible solution satisfying constraints (19a)-(19g). From constraints (19a) and (19c), we can see that it is always efficient to utilize the maximum computation capacity, i.e., $f_k^* = f_k^{\max}$, $\forall k \in \mathcal{K}$. From (19b) and (19d), we can see that minimizing the completion time occurs when $p_k^* = p_k^{\max}$, $\forall k \in \mathcal{K}$. Substituting the maximum computation capacity and the maximum transmission power into (35), the completion time minimization problem becomes:

$$\min_{T, \boldsymbol{t}, \boldsymbol{b}, \eta} \; T \qquad (36)$$
$$\text{s.t.} \; t_k \le \frac{(1-\eta)T}{a} + \frac{A_k \log_2 \eta}{f_k^{\max}}, \quad \forall k \in \mathcal{K}, \qquad (36a)$$
$$\frac{s}{t_k} \le b_k \log_2\!\left(1 + \frac{g_k p_k^{\max}}{N_0 b_k}\right), \quad \forall k \in \mathcal{K}, \qquad (36b)$$
$$\sum_{k=1}^{K} b_k \le B, \qquad (36c)$$
$$0 \le \eta \le 1, \qquad (36d)$$
$$t_k \ge 0, \; b_k \ge 0, \quad \forall k \in \mathcal{K}. \qquad (36e)$$

Next, we provide the sufficient and necessary condition for the feasibility of the set (36a)-(36e).

Lemma 3: With a fixed $T$, the set (36a)-(36e) is nonempty if and only if

$$B \ge \min_{0 \le \eta \le 1} \sum_{k=1}^{K} u_k(v_k(\eta)), \qquad (37)$$

where

$$u_k(\eta) = -\frac{(\ln 2)\eta}{W\!\left(-\frac{(\ln 2)N_0\eta}{g_k p_k^{\max}}\, e^{-\frac{(\ln 2)N_0\eta}{g_k p_k^{\max}}}\right) + \frac{(\ln 2)N_0\eta}{g_k p_k^{\max}}}, \qquad (38)$$

with $W(\cdot)$ denoting the Lambert W function, and

$$v_k(\eta) = \frac{s}{\frac{(1-\eta)T}{a} + \frac{A_k \log_2 \eta}{f_k^{\max}}}. \qquad (39)$$

Proof: See Appendix F.

To effectively solve (37) in Lemma 3, we provide the following lemma.

Lemma 4: In (38), $u_k(v_k(\eta))$ is a convex function.

Proof: See Appendix G.

Lemma 4 implies that the optimization problem in (37) is a convex problem, which can be effectively solved. By finding the optimal solution of (37), the sufficient and necessary condition for the feasibility of the set (36a)-(36e) can be simplified using the following theorem.

Theorem 4: The set (36a)-(36e) is nonempty if and only if

$$B \ge \sum_{k=1}^{K} u_k(v_k(\eta^*)), \qquad (40)$$

where $\eta^*$ is the unique solution to $\sum_{k=1}^{K} u_k'(v_k(\eta^*)) v_k'(\eta^*) = 0$.

Theorem 4 directly follows from Lemmas 3 and 4. Due to the convexity of the function $u_k(v_k(\eta))$, $\sum_{k=1}^{K} u_k'(v_k(\eta)) v_k'(\eta)$ is an increasing function of $\eta$. As a result, the unique solution $\eta^*$ of $\sum_{k=1}^{K} u_k'(v_k(\eta^*)) v_k'(\eta^*) = 0$ can be effectively obtained via the bisection method. Based on Theorem 4, the algorithm for obtaining the minimum completion time is summarized in Algorithm 4.

Algorithm 4 Completion Time Minimization
1: Initialize $T_{\min}$, $T_{\max}$, and the tolerance $\epsilon_5$.
2: repeat
3:   Set $T = \frac{T_{\min} + T_{\max}}{2}$.
4:   Check the feasibility condition (40).
5:   If the set (36a)-(36e) has a feasible solution, set $T_{\max} = T$. Otherwise, set $T_{\min} = T$.
6: until $(T_{\max} - T_{\min})/T_{\max} \le \epsilon_5$.

Theorem 4 shows that the optimal FL accuracy level $\eta^*$ meets the first-order condition $\sum_{k=1}^{K} u_k'(v_k(\eta^*)) v_k'(\eta^*) = 0$, i.e., the optimal $\eta$ should be neither too small nor too large for FL. This is because, for small $\eta$, the local computation time (number of local iterations) becomes high, as shown in Lemma 1. For large $\eta$, the transmission time is long due to the fact that a large number of global iterations is required, as shown in Theorem 1.

B. Complexity Analysis

The major complexity of Algorithm 4 at each iteration lies in checking the feasibility condition (40). To check the inequality in (40), the optimal $\eta^*$ needs to be obtained by using the bisection method, which involves a complexity of $\mathcal{O}(K\log_2(1/\epsilon_4))$ with accuracy $\epsilon_4$. As a result, the total complexity of Algorithm 4 is $\mathcal{O}(K\log_2(1/\epsilon_4)\log_2(1/\epsilon_5))$, where $\epsilon_5$ is the accuracy of the bisection method used in the outer layer. The complexity of Algorithm 4 is low since $\mathcal{O}(K\log_2(1/\epsilon_4)\log_2(1/\epsilon_5))$ grows linearly with the total number of users.
Fig. 4. Value of the loss function as the number of total computations varies for FL algorithms with IID and non-IID data.

Fig. 5. Comparison of the exact number of required iterations with the upper bound derived in Theorem 1.
Similar to Algorithm 3 in Section III, Algorithm 4 is executed at the BS side before the FL scheme in Algorithm 1 is run, and thus it will not affect the latency of the FL process.

Fig. 7. Comparison of completion time with different batch sizes.

Fig. 9. Total energy versus maximum average transmit power of each user with T = 100 s.
Fig. 11. Total energy versus completion time.

... completion time. This is because, in FDMA, all users can transmit data to the BS in each global iteration, while in RS only half of the users can transmit data to the BS in each global iteration. As a result, the average transmission time for each user in FDMA is larger than that in RS, which can lead to lower transmission energy for the proposed algorithm. In particular, given the same completion time, the proposed FL can reduce energy by up to 12.8%, 53.9%, and 59.5% compared to EB-FDMA, FE-FDMA, and RS, respectively.

VI. CONCLUSIONS

In this paper, we have investigated the problem of energy efficient computation and transmission resource allocation for FL over wireless communication networks. We have derived the time and energy consumption models for FL based on the convergence rate. With these models, we have formulated a joint learning and communication problem so as to minimize the total computation and transmission energy of the network. To solve this problem, we have proposed an iterative algorithm with low complexity, for which, at each iteration, we have derived closed-form solutions for the computation and transmission resources. Numerical results have shown that the proposed scheme outperforms conventional schemes in terms of total energy consumption, especially for small maximum average transmit power.

APPENDIX A
PROOF OF LEMMA 1

Based on (12), we have:

$$G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n),(i+1)}) \overset{\text{A1}}{\le} G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n),(i)}) - \delta\left\|\nabla G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n),(i)})\right\|^2 + \frac{L\delta^2}{2}\left\|\nabla G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n),(i)})\right\|^2 = G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n),(i)}) - \frac{(2-L\delta)\delta}{2}\left\|\nabla G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n),(i)})\right\|^2. \qquad (A.1)$$

Similar to (B.2) in the following Appendix B, we can prove that:

$$\left\|\nabla G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k)\right\|^2 \ge \gamma\left(G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k) - G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n)*})\right), \qquad (A.2)$$

where $\boldsymbol{h}_k^{(n)*}$ is the optimal solution of problem (11). Based on (A.1) and (A.2), we have:

$$G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n),(i+1)}) - G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n)*}) \le G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n),(i)}) - G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n)*}) - \frac{(2-L\delta)\delta\gamma}{2}\left(G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n),(i)}) - G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n)*})\right) \le \left(1 - \frac{(2-L\delta)\delta\gamma}{2}\right)^{i+1}\left(G_k(\boldsymbol{w}^{(n)}, \boldsymbol{0}) - G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n)*})\right) \le \exp\!\left(-(i+1)\frac{(2-L\delta)\delta\gamma}{2}\right)\left(G_k(\boldsymbol{w}^{(n)}, \boldsymbol{0}) - G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n)*})\right), \qquad (A.3)$$

where the last inequality follows from the fact that $1 - x \le \exp(-x)$. To ensure that $G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n),(i)}) - G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n)*}) \le \eta\left(G_k(\boldsymbol{w}^{(n)}, \boldsymbol{0}) - G_k(\boldsymbol{w}^{(n)}, \boldsymbol{h}_k^{(n)*})\right)$, we have (15).

APPENDIX B
PROOF OF THEOREM 1

Before proving Theorem 1, the following lemma is provided.

Lemma 5: Under assumptions A1 and A2, the following conditions hold:

$$\frac{1}{L}\left\|\nabla F_k(\boldsymbol{w}^{(n)} + \boldsymbol{h}^{(n)}) - \nabla F_k(\boldsymbol{w}^{(n)})\right\|^2 \le \left(\nabla F_k(\boldsymbol{w}^{(n)} + \boldsymbol{h}^{(n)}) - \nabla F_k(\boldsymbol{w}^{(n)})\right)^T \boldsymbol{h}^{(n)} \le \frac{1}{\gamma}\left\|\nabla F_k(\boldsymbol{w}^{(n)} + \boldsymbol{h}^{(n)}) - \nabla F_k(\boldsymbol{w}^{(n)})\right\|^2, \qquad (B.1)$$

and

$$\left\|\nabla F(\boldsymbol{w})\right\|^2 \ge \gamma\left(F(\boldsymbol{w}) - F(\boldsymbol{w}^*)\right). \qquad (B.2)$$

Proof: According to the mean value theorem, there always exists a $\bar{\boldsymbol{w}}$ such that

$$\nabla F_k(\boldsymbol{w}^{(n)} + \boldsymbol{h}^{(n)}) - \nabla F_k(\boldsymbol{w}^{(n)}) = \nabla^2 F_k(\bar{\boldsymbol{w}})\boldsymbol{h}^{(n)}. \qquad (B.3)$$

Combining assumptions A1, A2, and (B.3) yields (B.1).

For the optimal solution $\boldsymbol{w}^*$ of $F(\boldsymbol{w})$, we always have $\nabla F(\boldsymbol{w}^*) = \boldsymbol{0}$. Combining (9) and (B.1), we also have $\gamma\boldsymbol{I} \preceq \nabla^2 F(\boldsymbol{w})$, which indicates that

$$\left\|\nabla F(\boldsymbol{w}) - \nabla F(\boldsymbol{w}^*)\right\| \ge \gamma\left\|\boldsymbol{w} - \boldsymbol{w}^*\right\|, \qquad (B.4)$$

and

$$F(\boldsymbol{w}^*) \ge F(\boldsymbol{w}) + \nabla F(\boldsymbol{w})^T(\boldsymbol{w}^* - \boldsymbol{w}) + \frac{\gamma}{2}\left\|\boldsymbol{w}^* - \boldsymbol{w}\right\|^2. \qquad (B.5)$$

As a result, we have:

$$\left\|\nabla F(\boldsymbol{w})\right\|^2 = \left\|\nabla F(\boldsymbol{w}) - \nabla F(\boldsymbol{w}^*)\right\|^2 \overset{(B.4)}{\ge} \gamma\left\|\nabla F(\boldsymbol{w}) - \nabla F(\boldsymbol{w}^*)\right\|\left\|\boldsymbol{w} - \boldsymbol{w}^*\right\| \ge \gamma\left(\nabla F(\boldsymbol{w}) - \nabla F(\boldsymbol{w}^*)\right)^T(\boldsymbol{w} - \boldsymbol{w}^*) = \gamma\nabla F(\boldsymbol{w})^T(\boldsymbol{w} - \boldsymbol{w}^*) \overset{(B.5)}{\ge} \gamma\left(F(\boldsymbol{w}) - F(\boldsymbol{w}^*)\right), \qquad (B.6)$$

which proves (B.2).

For the optimal solution of problem (11), the first-order derivative condition always holds, i.e.,

$$\nabla F_k(\boldsymbol{w}^{(n)} + \boldsymbol{h}_k^{(n)*}) - \nabla F_k(\boldsymbol{w}^{(n)}) + \xi\nabla F(\boldsymbol{w}^{(n)}) = \boldsymbol{0}. \qquad (B.7)$$

We are now ready to prove Theorem 1. With the above inequalities and equalities, we have:
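The geometric contraction used in (A.3) can be checked numerically: for an $L$-smooth, $\gamma$-strongly convex quadratic, one gradient step with step size $\delta$ shrinks the optimality gap by at least the factor $1 - (2-L\delta)\delta\gamma/2$. A toy check, with $\gamma$, $L$, $\delta$, and the quadratic itself chosen as assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
eigs = np.linspace(1.0, 4.0, d)       # Hessian spectrum: gamma = 1, L = 4 (assumed)
Q = np.diag(eigs)
gamma, L = 1.0, 4.0
delta = 1.0 / L                       # step size with L * delta <= 1

G = lambda h: 0.5 * h @ Q @ h         # quadratic objective, minimum value 0 at h = 0
h = rng.normal(size=d)
gap0 = float(G(h))
rho = 1.0 - (2.0 - L * delta) * delta * gamma / 2.0   # contraction factor in (A.3)
for i in range(50):
    h = h - delta * (Q @ h)           # one gradient-descent step
    assert G(h) <= rho ** (i + 1) * gap0 + 1e-12      # per-iteration bound of (A.3)
```

After 50 steps the gap has shrunk by many orders of magnitude, consistent with the $\exp(-(i+1)(2-L\delta)\delta\gamma/2)$ upper bound.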
According to (G.1) and (G.2), we have:

$$\phi'(\eta) = -\frac{s}{\eta^2} \le 0, \quad \phi''(\eta) = \frac{2s}{\eta^3} \ge 0, \qquad (G.4)$$

$$\varphi_k''(\eta) = -\frac{A_k}{(\ln 2)f_k^{\max}\eta^2} \le 0. \qquad (G.5)$$

Combining (G.3)-(G.5), we find that $v_k''(\eta) \ge 0$, i.e., $v_k(\eta)$ is a convex function.

Then, we show that $u_k(\eta)$ is an increasing and convex function. According to Appendix F, $u_k(\eta)$ is the inverse function of the right hand side of (36b). If we further define the function:

$$z_k(\eta) = \eta\log_2\!\left(1 + \frac{g_k p_k^{\max}}{N_0\eta}\right), \quad \eta \ge 0, \qquad (G.6)$$

then $u_k(\eta)$ is the inverse function of $z_k(\eta)$, which gives $u_k(z_k(\eta)) = \eta$.

According to (F.1) and (F.2) in Appendix F, the function $z_k(\eta)$ is an increasing and concave function, i.e., $z_k'(\eta) \ge 0$ and $z_k''(\eta) \le 0$. Since $z_k(\eta)$ is an increasing function, its inverse function $u_k(\eta)$ is also an increasing function.

Based on the definition of a concave function, for any $\eta_1 \ge 0$, $\eta_2 \ge 0$ and $0 \le \theta \le 1$, we have:

$$z_k(\theta\eta_1 + (1-\theta)\eta_2) \ge \theta z_k(\eta_1) + (1-\theta)z_k(\eta_2). \qquad (G.7)$$

Applying the increasing function $u_k(\eta)$ to both sides of (G.7) yields:

$$\theta\eta_1 + (1-\theta)\eta_2 \ge u_k(\theta z_k(\eta_1) + (1-\theta)z_k(\eta_2)). \qquad (G.8)$$

Denote $\bar{\eta}_1 = z_k(\eta_1)$ and $\bar{\eta}_2 = z_k(\eta_2)$, i.e., we have $\eta_1 = u_k(\bar{\eta}_1)$ and $\eta_2 = u_k(\bar{\eta}_2)$. Thus, (G.8) can be rewritten as:

$$\theta u_k(\bar{\eta}_1) + (1-\theta)u_k(\bar{\eta}_2) \ge u_k(\theta\bar{\eta}_1 + (1-\theta)\bar{\eta}_2), \qquad (G.9)$$

which indicates that $u_k(\eta)$ is a convex function. As a result, we have proven that $u_k(\eta)$ is an increasing and convex function, which shows:

$$u_k'(\eta) \ge 0, \quad u_k''(\eta) \ge 0. \qquad (G.10)$$

To show the convexity of $u_k(v_k(\eta))$, we have:

$$\left(u_k(v_k(\eta))\right)'' = u_k''(v_k(\eta))\left(v_k'(\eta)\right)^2 + u_k'(v_k(\eta))v_k''(\eta) \ge 0, \qquad (G.11)$$

according to $v_k''(\eta) \ge 0$ and (G.10). As a result, $u_k(v_k(\eta))$ is a convex function.

REFERENCES

[1] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, "Delay minimization for federated learning over wireless communication networks," Proc. Int. Conf. Machine Learning Workshop, July 2020.
[2] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "When edge meets learning: Adaptive control for resource-constrained distributed machine learning," in Proc. IEEE Conf. Computer Commun., Honolulu, HI, USA, Apr. 2018, pp. 63–71.
[3] Y. Sun, M. Peng, Y. Zhou, Y. Huang, and S. Mao, "Application of machine learning in wireless networks: Key techniques and open issues," IEEE Commun. Surveys & Tut., vol. 21, no. 4, pp. 3072–3108, June 2019.
[4] Y. Liu, S. Bi, Z. Shi, and L. Hanzo, "When machine learning meets big data: A wireless communication perspective," arXiv preprint arXiv:1901.08329, 2019.
[5] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan et al., "Towards federated learning at scale: System design," arXiv preprint arXiv:1902.01046, 2019.
[6] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, "Artificial neural networks-based machine learning for wireless networks: A tutorial," IEEE Commun. Surveys & Tut., vol. 21, no. 4, pp. 3039–3071, Fourth Quarter 2019.
[7] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," arXiv preprint arXiv:1909.07972, 2019.
[8] W. Saad, M. Bennis, and M. Chen, "A vision of 6G wireless systems: Applications, trends, technologies, and open research problems," IEEE Network, to appear, 2019.
[9] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, "Wireless network intelligence at the edge," Proceedings of the IEEE, vol. 107, no. 11, pp. 2204–2239, Nov. 2019.
[10] M. Chen, O. Semiari, W. Saad, X. Liu, and C. Yin, "Federated echo state learning for minimizing breaks in presence in wireless virtual reality networks," IEEE Trans. Wireless Commun., to appear, 2019.
[11] H. Tembine, Distributed Strategic Learning for Wireless Engineers. CRC Press, 2018.
[12] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah, "Distributed federated learning for ultra-reliable low-latency vehicular communications," arXiv preprint arXiv:1807.08127, 2018.
[13] D. Gündüz, P. de Kerret, N. D. Sidiropoulos, D. Gesbert, C. R. Murthy, and M. van der Schaar, "Machine learning in the air," IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2184–2199, Oct. 2019.
[14] M. S. H. Abad, E. Ozfatura, D. Gündüz, and O. Ercetin, "Hierarchical federated learning across heterogeneous cellular networks," arXiv preprint arXiv:1909.02362, 2019.
[15] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in Proc. IEEE Int. Conf. Commun., Shanghai, China, May 2019, pp. 1–7.
[16] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, "SCAFFOLD: Stochastic controlled averaging for on-device federated learning," arXiv preprint arXiv:1910.06378, 2019.
[17] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, "Communication-efficient learning of deep networks from decentralized data," arXiv preprint arXiv:1602.05629, 2016.
[18] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, "Federated multi-task learning," in Advances in Neural Information Processing Systems, 2017, pp. 4424–4434.
[19] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, "Scheduling policies for federated learning in wireless networks," IEEE Trans. Commun., to appear, 2019.
[20] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, June 2019.
[21] E. Jeong, S. Oh, J. Park, H. Kim, M. Bennis, and S. Kim, "Multi-hop federated private data augmentation with sample compression," arXiv preprint arXiv:1907.06426, 2019.
[22] J.-H. Ahn, O. Simeone, and J. Kang, "Wireless federated distillation for distributed edge learning with heterogeneous data," arXiv preprint arXiv:1907.02745, 2019.
[23] M. Chen, M. Mozaffari, W. Saad, C. Yin, M. Debbah, and C. S. Hong, "Caching in the sky: Proactive deployment of cache-enabled unmanned aerial vehicles for optimized quality-of-experience," IEEE J. Sel. Areas Commun., vol. 35, no. 5, pp. 1046–1061, 2017.
[24] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, "Mobile unmanned aerial vehicles (UAVs) for energy-efficient Internet of Things communications," IEEE Trans. Wireless Commun., vol. 16, no. 11, pp. 7574–7589, Nov. 2017.
[25] Z. Yang, C. Pan, M. Shikh-Bahaei, W. Xu, M. Chen, M. Elkashlan, and A. Nallanathan, "Joint altitude, beamwidth, location, and bandwidth optimization for UAV-enabled communications," IEEE Commun. Lett., vol. 22, no. 8, pp. 1716–1719, 2018.
[26] H. Long, M. Chen, Z. Yang, B. Wang, Z. Li, X. Yun, and M. Shikh-Bahaei, "Reflections in the sky: Joint trajectory and passive beamforming design for secure UAV networks with reconfigurable intelligent surface," arXiv preprint arXiv:2005.10559, 2020.
[27] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated optimization: Distributed machine learning for on-device intelligence," arXiv preprint arXiv:1610.02527, 2016.
[28] M. M. Amiri and D. Gündüz, "Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air," in Proc. IEEE Int. Symp. Information Theory, Paris, France, 2019, pp. 1432–1436.
[29] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, "Towards an intelligent edge: Wireless communication meets machine learning," arXiv preprint arXiv:1809.00343, 2018.
[30] G. Zhu, Y. Wang, and K. Huang, "Low-latency broadband analog aggregation for federated edge learning," arXiv preprint arXiv:1812.11494, 2018.
[31] T. T. Vu, D. T. Ngo, N. H. Tran, H. Q. Ngo, M. N. Dao, and R. H. Middleton, "Cell-free massive MIMO for wireless federated learning," arXiv preprint arXiv:1909.12567, 2019.