IEEE INTERNET OF THINGS JOURNAL, VOL. 7, NO. 6, JUNE 2020
Blockchain-Enabled Software-Defined
Industrial Internet of Things With Deep
Reinforcement Learning
Jia Luo, Qianbin Chen, Senior Member, IEEE, F. Richard Yu, Fellow, IEEE, and Lun Tang
Abstract—Recently, software-defined Industrial Internet of Things (SDIIoT), the integration of software-defined networking (SDN) and the Industrial Internet of Things (IIoT), has emerged. It is perceived as an effective way to manage the IIoT dynamically. Aiming to improve the scalability and flexibility of SDIIoT, multi-SDN has been applied to form a physically distributed control plane that handles the large amount of data generated by industrial devices. However, as the core of multi-SDN, reaching consensus among multiple SDN controllers is a thorny issue. To meet the required design principle, this article proposes a blockchain-enabled distributed SDIIoT to synchronize local views between distinct SDN controllers and finally reach consensus on the global view. On the other hand, both the cryptographic operations of the blockchain and the noncryptographic tasks draw on the same computational resource pool of the mobile edge cloud (MEC). In order to optimize the system energy efficiency, we adaptively allocate computational resources and the batch size of the block by jointly considering the trust features of SDN controllers and the resource requirements of noncryptographic operations. To implement the truly distributed manner of blockchain, we describe our problem as a partially observable Markov decision process (POMDP) and propose a novel deep reinforcement learning (DRL) approach to solve it. In the simulation results, we compare three different protocols of blockchain and show the effectiveness of our scheme in each of them.

Index Terms—Blockchain, deep reinforcement learning (DRL), Industrial Internet of Things (IIoT), software-defined networking (SDN).

Manuscript received December 10, 2019; revised February 23, 2020; accepted March 2, 2020. Date of publication March 5, 2020; date of current version June 12, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61571073, and in part by the Science and Technology Research Program of Chongqing Municipal Education Commission under Grant KJZD-M201800601. (Corresponding author: Qianbin Chen.)

Jia Luo, Qianbin Chen, and Lun Tang are with the Key Laboratory of Mobile Communication Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China (e-mail: cqb@cqupt.edu.cn).

F. Richard Yu is with the Department of Systems and Computer Engineering, Carleton University, Ottawa, ON K1S 5B6, Canada.

Digital Object Identifier 10.1109/JIOT.2020.2978516

I. INTRODUCTION

IN RECENT years, the explosive growth of smart devices and the demand for interconnection have fostered the rise of the Internet of Things (IoT). As the implementation of the IoT for industry, the Industrial IoT (IIoT), driven by the development of wireless communications, sensor networks, and embedded systems, has attracted great interest from both academia and industry [1]. Whereas the Consumer IoT (CIoT) is designed to meet the requirements of the general public in daily life (e.g., smart homes), the IIoT is mainly used in manufacturing, transportation, logistics, and the energy industry, and therefore demands a higher level of security, robustness, flexibility, and scalability. In order to facilitate scalability and flexibility in resource management, a new paradigm called software-defined IIoT (SDIIoT) has been proposed to integrate software-defined networking (SDN) into the IIoT [2]. SDN separates the control plane and the data plane, which are coupled together in traditional networks. As a result, the information of a certain area is collected by the corresponding SDN controller, which belongs to the control plane; the SDN controller can then manage this area in a centralized way, improving the flexibility and scalability of the IIoT. The original SDIIoT works under the premise that one central SDN controller has access to the global information of the whole network. However, with increasing network size, multiple SDN controllers are required to manage the huge number of widely distributed industrial devices. Hence, this trend transforms the classical centralized SDIIoT into a distributed paradigm in which reaching consensus among multiple SDN controllers is the core issue.

Consensus algorithms defined in the literature for distributed systems, such as RAFT [3], can be used to handle the consensus problem. However, these algorithms cannot guarantee consensus among multiple SDN controllers if malicious nodes exist, which is unacceptable for SDIIoT with its stringent security requirements. As a promising technology, blockchain is considered a viable solution to guarantee system consensus even under malicious attacks. The blockchain, whose concept originates from the cryptocurrency Bitcoin [4], has gained wide attention in recent years [5]. In general, a blockchain is a distributed digital ledger consisting of a series of data blocks generated by cryptography. Each block consists of various transactions recording not just financial information but virtually anything of value [6]. On this basis, all blockchain nodes hold copies of the existing authenticated ledger distributed among them, thereby providing a consensus about the transactions at any given time without a central authority. Any modification to a single transaction record requires the alteration of all subsequent records and the collusion of the whole network, making data in the blockchain tamper resistant. For the distributed SDIIoT, SDN controllers can be integrated with the blockchain function to synchronize the local views of distinct SDN controllers and finally obtain the global view of the whole network through the process of reaching consensus.
In addition to reaching consensus, blockchain makes the information (stored in the blocks) of SDIIoT tamper resistant and traceable, thus enhancing the security of the network. For instance, in an SDIIoT-based logistics network, transportation information is shared and stored by utilizing the blockchain technology. Due to the chain structure of the blockchain, historical transportation information can be traced to check possible problems regarding transportation, and the blockchain can guarantee that this historical information cannot be tampered with.

As a further matter, computational resources are necessary for the corresponding operations in the consensus protocol of blockchain. To fulfill this requirement, SDIIoT can be equipped with the mobile edge cloud (MEC), which has been proposed and studied in recent years [7]–[9]. Compared to the traditional cloud, MEC provides computational resources in close proximity to devices, thus offering faster response times for services [10]. According to [11], the energy consumption of MEC is not negligible. Therefore, in this article, we propose a distributed blockchain-enabled SDIIoT system and optimize its energy efficiency to use the energy of MEC efficiently. Our contributions are listed as follows.

1) We integrate the permissioned blockchain into distributed SDIIoT to reach consensus on the global view and propose an optimization framework to optimize the system energy efficiency. Specifically, we adaptively allocate computational resources and the batch size of the block by jointly considering the trust features of SDN controllers, as well as the resource requirements of noncryptographic tasks in MEC.

2) We design a novel method to add the state information of the control plane to the local view, which will be synchronized among multiple SDN controllers via the consensus protocol. Specifically, each SDN controller proposes its local view to the consensus procedure at every time slot to assist the distributed learning algorithm.

3) To the best of our knowledge, we are the first to use a partially observable Markov decision process (POMDP) to model such a framework for optimizing the system energy efficiency in blockchain-enabled SDIIoT, thus providing a truly distributed implementation that fulfills the decentralization feature of blockchain.

4) We adopt a deep reinforcement learning (DRL) algorithm to solve the corresponding POMDP problem. To be specific, we combine the deep recurrent Q-network (DRQN) with normalized advantage functions (NAFs) to handle the partial observability and the continuous action space in the problem. In the simulation results, we compare two different protocols of permissioned blockchain and show the effectiveness of our scheme in either of them.

The remainder of this article is organized as follows. We review some related works in Section II. Then, we present the network architecture, along with the analysis of the consensus protocol and the energy efficiency model, in Section III, followed by the problem formulation in Section IV. Section V describes the DRL-based algorithm. The simulation results are provided and discussed in Section VI. Finally, we conclude this article in Section VII.

II. RELATED WORKS

In this section, we briefly discuss some related works in four areas: 1) SDIIoT; 2) distributed SDN; 3) blockchain; and 4) DRL.

A. Software-Defined Industrial Internet of Things

In order to facilitate scalability, ubiquitous accessibility, and flexibility in resource management, a slew of literature has integrated SDN into the IIoT. For instance, Wan et al. [12] implemented a software-defined industrial network to achieve the dynamic management of manufacturing resources. Meanwhile, Moness et al. [13] proposed a hybrid software-defined approach in IIoT. Chaudhary et al. [14] designed an SDN-enabled secure communication mechanism for the smart grid in the IIoT environment. Besides, Bedhief et al. [15] offered the integration of SDN and fog computing into IIoT to provide a flexible solution granting low delays for IIoT applications.

However, the increase in the number of industrial devices has resulted in numerous data and flows in SDIIoT, requiring multiple SDN controllers for management. One of the essential issues in SDIIoT with multiple SDN controllers (also known as distributed SDIIoT) is reaching consensus among these controllers. Therefore, we next present some related works on reaching consensus in distributed SDN architectures.

B. Distributed SDN Architectures

As a matter of fact, a number of works have been engaged in the design of distributed SDN architectures [16]. Tootoonchian and Ganjali [17] proposed HyperFlow, the first distributed SDN architecture for OpenFlow, in which SDN controllers keep a consistent global view synchronized via a publish/subscribe messaging paradigm. Koponen et al. [18] considered the Onix architecture, where the network information base (NIB) is used to aggregate and share a global view. Moreover, Berde et al. [19] designed ONOS, an experimental distributed SDN framework maintaining a global view via the consensus protocol in ZooKeeper [20]. On the other hand, the works in [21]–[23] proposed a two-level hierarchical control plane for improving scalability. Basically, they utilized a logically centralized root controller at the upper layer of the control hierarchy to handle the consensus issue at the lower layer. Nevertheless, the above works have neglected the potential influence inflicted by malicious attacks. To provide dependable services to distributed architectures, Qiu et al. [24] and Liu et al. [25] integrated blockchain into SDIIoT and IIoT, respectively, and then deployed DRL to optimize the transactional throughput of the blockchain.

C. Blockchain

Typically, the mainstream blockchain is classified into permissionless blockchain and permissioned blockchain. To be specific, permissionless blockchain allows arbitrary entities
to join the network and participate in the process of reaching consensus. The best-known examples are the Bitcoin [4] and Ethereum [26] networks, where the underlying consensus protocol is Proof of Work (PoW). However, permissionless blockchain suffers from low throughput due to its consensus protocol. On the contrary, permissioned blockchain restricts the entities that can contribute to the consensus and adopts a set of consensus protocols, such as practical Byzantine fault tolerance (PBFT) [27], Aardvark [28], redundant Byzantine fault tolerance (RBFT) [29], and Zyzzyva [30], whose origin is the concept of Byzantine fault tolerance (BFT). Compared to PoW, these consensus protocols result in higher throughput.

TABLE I
NOTATIONS

E. Discussions

Although DRL has been utilized to handle the optimization problem in blockchain-enabled SDIIoT, a major issue in existing works is that they assume an agent can collect the states of multiple entities in a centralized way so as to learn the policy. However, this is actually opposed to the decentralization principle of blockchain. As a workaround, we use a POMDP to design a novel mechanism that optimizes the system in a decentralized manner with modified DRL. We utilize permissioned blockchain to reach consensus in SDIIoT due to its advantage in performance and the nonpublic property of SDIIoT.

III. SYSTEM MODEL

In this section, we introduce the system model adopted in this article. We first present the network architecture, followed by the overview of blockchain, the description of the consensus protocol, and the energy efficiency model. The main notations used in the remainder of this article are summarized in Table I.

A. Network Architecture

Fig. 1. Network architecture of blockchain-enabled SDIIoT.

As shown in Fig. 1, due to the introduction of SDN, the control plane and the data plane are separated into two independent planes. The data plane mainly incorporates multiple switch nodes, which collect the devices' data from the infrastructure plane and forward them to the control plane. According to these collected data, the control plane can thus manage and optimize the network. Specifically, the control plane is composed of single or multiple SDN controllers. We consider a distributed SDIIoT with N physical machines equipped with SDN controller modules, denoted by N = {1, 2, . . . , N}. Each of them manages a certain domain of the network, and it is equipped with a local MEC which has a computational resource pool with ξ_tot CPU cycles per second.

The SDN controller can obtain the information of its domain, which we define as the local view. Specifically, the local view mainly consists of the following two kinds of information.

1) Noncontrol Plane Information: Noncontrol plane information is composed of the infrastructure plane information (e.g., transportation information in the logistics network and temperature information in the environmental monitoring network) and the data plane information (e.g., topology information of switch nodes). The former needs to
Fig. 2. Relationship between blockchain and SDIIoT.

The MAC authenticator is the MAC computed with the key shared by the sender (node i) and the intended recipient corresponding to the element.

We consider an uncivil execution during which a fraction g_n(t) of the transactions sent by SDN controller n are correct and F blockchain nodes are faulty; g_n(t) is the trust feature of SDN controller n. Moreover, at time slot t, SDN controller n sends b_n(t) transactions to be verified. Therefore, the batch size of the block (i.e., the total number of valid transactions) at time slot t is calculated as

    b_{tot}(t) = \sum_{n=1}^{N} b_n(t) g_n(t).    (1)

In general, BFT protocols can be classified into fast BFT protocols and robust BFT protocols. Protocols such as PBFT and Zyzzyva are fast BFT protocols, designed to provide the best possible performance in the absence of faults. When malicious attacks occur, a fast BFT protocol can eventually recover, but the transactional throughput may drop to zero during a possibly long time interval corresponding to the duration of the attack. On the other hand, robust BFT protocols maintain good performance when malicious attacks occur. Therefore, to improve the system's robustness, we use RBFT and Aardvark, two typical robust BFT protocols, to implement the consensus process. The corresponding analysis is as follows.

1) Aardvark: Aardvark is a specific realization of BFT. BFT is the ability of a network to unmistakably reach a consensus despite malicious nodes' attempts to corrupt the process. Blockchain nodes running Aardvark are termed replicas, and only one chosen node is also called the primary. In order to improve the system's robustness, Aardvark uses a hybrid MAC-signature construct to authenticate messages. Specifically, in this article, the digital signature is used to sign transactions and blocks since they will be added to the system's digital ledger for possible inspection in the future. In this way, the system is able to provide the property of nonrepudiation, that is, the ability to prove to a third party that a message is authentic. It is noteworthy that verifying a digital signature is more costly than verifying a MAC, thus Aardvark also uses MACs to guard against denial-of-service attacks in which the system receives a large number of requests with signatures that need to be verified.

Fig. 3. Protocol communication pattern of Aardvark.

For Aardvark, the relationship between blockchain and SDIIoT is shown in Fig. 2. SDIIoT interacts with the blockchain via the Request phase and the Reply phase. Replica communication is the communication within the blockchain system. We take the communication between a specific SDN controller and the blockchain system as an example. As shown in Fig. 3, the whole consensus process mainly comprises five phases: 1) Request; 2) Pre-prepare; 3) Prepare; 4) Commit; and 5) Reply. The detailed operations are as follows.

a) Request: At the beginning of time slot t, SDN controller n sends a REQUEST message to the replica p it believes to be the primary. The message is in the form of ⟨REQUEST, ⟨TXN_y⟩_{σ_n}, n⟩_{μ_{n,p}}, where TXN_y and n denote the transaction with ID y and the SDN controller ID, respectively. At time slot t, SDN controller n sends b_n(t) REQUEST messages to the primary. The local view of SDN controller n is incorporated in the transactions to be synchronized. Each transaction is signed with the private key of SDN controller n, and the REQUEST message is authenticated with a MAC appropriate for verification by primary p. Upon receiving a REQUEST message, primary p verifies the MAC. If it is valid, it proceeds to verify the signature of the transaction. Any failure in verification leads to the discarding of the corresponding message or transaction. This implementation ensures that the primary verifies only the signatures of REQUEST messages whose MACs check out, thus filtering out the possibly expensive verifications of malicious REQUEST messages.

In this article, the control plane information of all domains is shared every time slot to assist the distributed algorithm. As a consequence, each SDN controller proposes its local view in the REQUEST message to the primary at every time slot. Therefore, SDN controller n needs to generate b_n(t) MACs and signatures, and the primary needs to verify the MACs of the REQUEST messages delivered by the N SDN controllers and the signatures of the corresponding transactions. The required CPU cycles for signing one transaction/block, verifying one signature, generating one MAC, and verifying one MAC are denoted by δ, θ, α, and α, respectively. Hence, in this phase, the cost (unit: CPU cycles) at the primary is

    \sum_{n'=1}^{N} b_{n'}(t)\,\alpha + b_{tot}(t)\,\theta.    (2)

The cost at SDN controller n is

    b_n(t)(\alpha + \delta)    (3)

and there is no cost at the nonprimary replicas.
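To make the bookkeeping in (1)–(3) concrete, the following minimal Python sketch computes the batch size of a block and the Request-phase CPU costs at the primary and at each SDN controller. The function names and the numeric per-operation costs are illustrative assumptions, not values or code taken from the article.

```python
import numpy as np

# Per-operation CPU-cycle costs (illustrative placeholders, not the
# simulation parameters used in the article).
ALPHA = 1e4   # generate or verify one MAC
DELTA = 2e5   # sign one transaction/block
THETA = 2e5   # verify one signature

def batch_size(b, g):
    """Eq. (1): total number of valid transactions in the block,
    given per-controller submissions b_n(t) and trust features g_n(t)."""
    b = np.asarray(b, dtype=float)
    g = np.asarray(g, dtype=float)
    return float(np.sum(b * g))

def request_phase_costs(b, g):
    """Eqs. (2)-(3): Request-phase CPU cost (cycles) at the primary and
    at each SDN controller; nonprimary replicas incur no cost."""
    b = np.asarray(b, dtype=float)
    b_tot = batch_size(b, g)
    cost_primary = np.sum(b) * ALPHA + b_tot * THETA   # Eq. (2)
    cost_controllers = b * (ALPHA + DELTA)             # Eq. (3), one entry per controller
    return cost_primary, cost_controllers

# Example: N = 3 controllers with different trust features.
b_n = [120, 80, 100]      # b_n(t): transactions sent by each controller
g_n = [1.0, 0.9, 0.75]    # g_n(t): fraction of correct transactions
print(batch_size(b_n, g_n))
print(request_phase_costs(b_n, g_n))
```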
Upon receiving F + 1 matching PROPAGATE messages, each blockchain node forwards all the transactions to the replicas of the F + 1 protocol instances running locally and enters the next phases, which are the same as those of Aardvark. RBFT adopts the same cryptographic techniques as Aardvark. Similarly, we can calculate the cost for one transaction per protocol instance at primary n as

    \zeta_n(t) = \frac{(b_{tot}(t) + 2N + 4F - 2)\alpha}{b_{tot}(t)} + \frac{(N - 2)\alpha + \theta}{F + 1} + \frac{\left(\sum_{n'=1}^{N} b_{n'}(t) + N\right)\alpha}{b_{tot}(t)(F + 1)}.    (15)

The cost for one transaction per protocol instance at a nonprimary replica n is

    \psi_n(t) = \frac{(b_{tot}(t) + 2N + 4F)\alpha}{b_{tot}(t)} + \frac{(N - 2)\alpha + \theta}{F + 1} + \frac{\left(\sum_{n'=1}^{N} b_{n'}(t) + N\right)\alpha}{b_{tot}(t)(F + 1)}.    (16)

The cost for one transaction at SDN controller n is

    \kappa_n(t) = \frac{b_n(t)\delta + (b_n(t)N + F + 1)\alpha}{b_{tot}(t)}.    (17)

Therefore, the cost for one transaction at physical machine n is

    d_n^{crypto}(t) = \kappa_n(t) + \sum_{f=1}^{F+1} \omega_{f,n}(t)\,\zeta_n(t) + \sum_{f=1}^{F+1} \left(1 - \omega_{f,n}(t)\right)\psi_n(t)    (18)

where ω_{f,n}(t) = 1 means that the primary replica of protocol instance f runs on blockchain node n; otherwise, ω_{f,n}(t) = 0.

Suppose the blockchain system is saturated by REQUEST messages. According to the above analysis, the transactional throughput (unit: transactions per second, TPS) of the blockchain is calculated as

    TPS(t) = \min_{n \in \mathcal{N}} \frac{\xi_n^{crypto}(t)}{d_n^{crypto}(t) - \kappa_n(t)}    (19)

where ξ_n^{crypto}(t) represents the computational resources for cryptographic operations in the MEC containing blockchain node n at time slot t. Besides, since the SDN controller is not part of the blockchain, κ_n(t) is deducted in the denominator. The calculation of d_n^{crypto}(t) depends on the specific implementation of the consensus protocol.

D. Energy-Efficiency Model

Aiming to improve the throughput of the blockchain, each SDN controller utilizes its local MEC to execute the computational tasks pertaining to the blockchain system. In the meantime, it is noteworthy that the cryptographic operations are not the only kind of computational task using the computational resources of the MEC; some noncryptographic operations occupy part of the computational resources as well. Among all the noncryptographic tasks in a domain, assume that L of them are selected to be executed in one time slot. The lth (l ∈ {1, 2, . . . , L}) noncryptographic task in domain n at time slot t is characterized by the tuple (d_{n,l}^{normal}(t), T_{n,l}^{max}(t)), where d_{n,l}^{normal}(t) specifies the required number of CPU cycles to complete the task and T_{n,l}^{max}(t) is the maximum delay (unit: s). Let ξ_{n,l}^{normal}(t) denote the volume of computational resources allocated to the lth noncryptographic task in domain n at time slot t; thus, the execution time of task (d_{n,l}^{normal}(t), T_{n,l}^{max}(t)) can be calculated as

    T_{n,l}(t) = \frac{d_{n,l}^{normal}(t)}{\xi_{n,l}^{normal}(t)}    (20)

where T_{n,l}(t) should satisfy the following constraint:

    T_{n,l}(t) \le T_{n,l}^{max}(t).    (21)

Assume that one CPU cycle comprises ρ clock cycles; thus, ρ ξ_{n,l}^{normal}(t) is the frequency corresponding to the allocated computational resources. The power consumption of the CPU is proportional to e^2 ρ ξ_{n,l}^{normal}(t), where e is the operating voltage. Furthermore, the operating voltage is proportional to the frequency of the CPU; hence, we have the following power model for the lth noncryptographic task in domain n at time slot t:

    q + \eta\rho^3 \left(\xi_{n,l}^{normal}(t)\right)^3    (22)

where q is a fixed value representing the power consumption of the other components (except the CPU), which is independent of the frequency, and η is a proportional factor. Thus, the energy consumption for executing the task (d_{n,l}^{normal}(t), T_{n,l}^{max}(t)) is

    E_l = \frac{d_{n,l}^{normal}(t)}{\xi_{n,l}^{normal}(t)} \left[ q + \rho^3 \eta \left(\xi_{n,l}^{normal}(t)\right)^3 \right].    (23)

It is worth noting that the noncryptographic tasks may be tasks offloaded by industrial devices from the infrastructure plane. However, this article mainly considers the computational resource allocation of the MEC to optimize the energy efficiency in the proposed scenario; therefore, the energy consumption for the corresponding data transmission is not taken into account in the system model. Consequently, we can calculate the total energy consumption of the N MECs as

    E(t) = \sum_{n=1}^{N} \left\{ \frac{d_n^{crypto}(t)\, b_{tot}(t)}{\xi_n^{crypto}(t)} \left[ q + \rho^3 \eta \left(\xi_n^{crypto}(t)\right)^3 \right] + \sum_{l=1}^{L} \frac{d_{n,l}^{normal}(t)}{\xi_{n,l}^{normal}(t)} \left[ q + \rho^3 \eta \left(\xi_{n,l}^{normal}(t)\right)^3 \right] \right\}.    (24)

Aiming to utilize energy efficiently, we consider the system energy efficiency calculated as

    G(t) = \frac{TPS(t)}{E(t)}.    (25)
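As a minimal sketch of how the throughput and energy model fit together, the following Python fragment evaluates (18), (19), (24), and (25) for given per-node costs and resource allocations. The function names are ours and the inputs mirror the symbols above; it is an illustration of the stated model, not the authors' implementation.

```python
import numpy as np

def d_crypto_node(kappa, zeta, psi, omega):
    """Eq. (18): per-transaction cryptographic cost at one physical machine;
    omega is an (F+1)-vector with omega[f] = 1 if the primary of instance f runs here."""
    omega = np.asarray(omega, dtype=float)
    return kappa + float(np.sum(omega * zeta + (1.0 - omega) * psi))

def throughput(xi_crypto, d_crypto, kappa):
    """Eq. (19): TPS is limited by the slowest node; the SDN controller's own
    cost kappa_n(t) is excluded from the denominator."""
    xi_crypto, d_crypto, kappa = map(np.asarray, (xi_crypto, d_crypto, kappa))
    return float(np.min(xi_crypto / (d_crypto - kappa)))

def total_energy(d_crypto, b_tot, xi_crypto, d_normal, xi_normal, q, rho, eta):
    """Eq. (24): energy of the cryptographic workload plus the L noncryptographic
    tasks over the N MECs; d_normal and xi_normal are N x L arrays."""
    xi_crypto = np.asarray(xi_crypto, dtype=float)
    d_crypto = np.asarray(d_crypto, dtype=float)
    d_normal = np.asarray(d_normal, dtype=float)
    xi_normal = np.asarray(xi_normal, dtype=float)
    e_crypto = (d_crypto * b_tot / xi_crypto) * (q + rho**3 * eta * xi_crypto**3)
    e_normal = (d_normal / xi_normal) * (q + rho**3 * eta * xi_normal**3)
    return float(np.sum(e_crypto) + np.sum(e_normal))

def energy_efficiency(tps_value, energy_value):
    """Eq. (25): system energy efficiency G(t) = TPS(t) / E(t)."""
    return tps_value / energy_value
```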
IV. PROBLEM FORMULATION

According to (1), the trust feature of the SDN controller influences the batch size of the block and thus affects the energy efficiency; therefore, we consider the trust feature as one of the states in our problem. We utilize a Markov chain to describe the variation of the trust feature g_n(t). Hence, the corresponding transition probability is denoted by Pr(g_n(t+1) | g_n(t)). In addition, we assume that the variation of d_{n,l}^{normal}(t) also has the Markov property, and ξ_{n,l}^{normal}(t) can affect the value of d_{n,l}^{normal}(t). Furthermore, the computational resources of each MEC are limited by ξ_tot; thus, we have the following constraint:

    \xi_n^{crypto}(t) + \sum_{l=1}^{L} \xi_{n,l}^{normal}(t) \le \xi_{tot}, \quad \forall n.    (26)

Therefore, the corresponding transition probability is denoted by Pr(d_{n,l}^{normal}(t+1) | d_{n,l}^{normal}(t), ξ_{n,l}^{normal}(t), ξ_n^{crypto}(t)). In order to improve the system energy efficiency, we consider the trust features of the SDN controllers and the resource requirements of the noncryptographic tasks as the states of our problem and make sequential decisions according to their Markov property. To do this, a component called the agent is utilized to interact with the environment. Moreover, the agent is equipped with the DRL algorithm to handle the problem. Therefore, at each time slot t, an agent observes state s_t and chooses action a_t, which determines the immediate reward r_t and the next state s_{t+1}. In our case, the agent is located within the MEC, thus each MEC contains an agent. However, it is noteworthy that agent n can only collect the up-to-date states regarding physical machine n at the beginning of each time slot. In order to make decisions that require the whole state of the system, we model our problem as a POMDP.

A. Observation

The state s_t at time slot t is

    s_t = \begin{bmatrix} g_1(t) & g_2(t) & \cdots & g_N(t) \\ d_1^{normal}(t) & d_2^{normal}(t) & \cdots & d_N^{normal}(t) \end{bmatrix}    (27)

where d_n^{normal}(t) is a vector with L elements representing the resource requirements of the noncryptographic tasks in domain n at time slot t.

A specific agent n at time slot t can get the current states of its own domain, whereas d_{m≠n}^{normal}(t) and g_{m≠n}(t) are beyond its reach. Hence, the system state is partially observable, which is the case of a POMDP. We take the observations of domain n to be the corresponding states of the current time slot. In addition, agent n can get the previous states of other domains through the consensus process, as the state information is the control plane information included in the local view; thus, the observations of other domains are represented by the corresponding states of the last time slot. Although these observations are not the up-to-date states that we really need, as training time increases, the agent acquires more information about the model and behaves more and more similarly to the condition in which all current states are known. Therefore, the observation o_t for agent n at time slot t is

    o_t = \begin{bmatrix} g_1(t-1) & \cdots & g_n(t) & \cdots & g_N(t-1) \\ d_1^{normal}(t-1) & \cdots & d_n^{normal}(t) & \cdots & d_N^{normal}(t-1) \end{bmatrix}.    (28)

It is worth noting that transactions regarding the above state information may not pass the corresponding verification due to the trust feature of the SDN controller. As a consequence, these transactions will be discarded and the corresponding information will not be shared through the consensus process. In this case, earlier state information can be used as an alternative. Furthermore, the SDN controller can be considered the central authority of its own domain, thus it should not be as fragile as a general smart device. On this basis, we assume that the SDN controller is well designed in terms of safety, so that only a portion of its transactions is incorrect in the worst case (i.e., g(t) ∈ (0, 1]) and the correctness of the transactions pertaining to the state information can be guaranteed.

B. Action

The agent needs to decide the batch size of the block b_tot(t), the computational resource allocation ξ^{crypto}(t) for cryptographic operations, and the computational resource allocation ξ^{normal}(t) for noncryptographic operations. Therefore, the action a_t for agent n at time slot t is denoted by

    a_t = \begin{bmatrix} b_1(t) & b_2(t) & \cdots & b_N(t) \\ \xi_1^{crypto}(t) & \xi_2^{crypto}(t) & \cdots & \xi_N^{crypto}(t) \\ \xi_1^{normal}(t) & \xi_2^{normal}(t) & \cdots & \xi_N^{normal}(t) \end{bmatrix}    (29)

where ξ_n^{normal}(t) is a vector with L elements representing the computational resources allocated to the noncryptographic tasks in domain n at time slot t.

C. Transition Probability

According to the above discussion, the transition probability that action a_t in state s_t at time slot t leads to state s_{t+1} at time slot t+1 can be denoted by Pr(s_{t+1} | s_t, a_t). The goal of the agent is to choose actions at each time slot that maximize its long-term expected reward, which is determined by the corresponding actions and states. Furthermore, the variation of the states depends on Pr(s_{t+1} | s_t, a_t); therefore, for MDP solutions with fully observable states, the variation of the states is utilized to implicitly estimate Pr(s_{t+1} | s_t, a_t) and thus obtain the optimal actions. However, for the POMDP considered in this article, the states are partially observable and the next state is determined by Pr(s_{t+1} | s_t, a_t) Pr(s_t | o_t). Hence, using the observation alone will not work correctly. We will propose our solution to handle this issue in Section V.

D. Reward

In order to improve the long-term expected system energy efficiency, G(t) should be included in the immediate reward. Additionally, constraints (21) and (26) should be satisfied. We utilize the ReLU function f(x) = max(x, 0) to calculate the following variables with regard to the constraints:

    \aleph_n(t) = \max\left( \xi_n^{crypto}(t) + \sum_{l=1}^{L} \xi_{n,l}^{normal}(t) - \xi_{tot},\; 0 \right), \quad \forall n    (30)

    \beta_{n,l}(t) = \max\left( T_{n,l}(t) - T_{n,l}^{max}(t),\; 0 \right), \quad \forall n, l.    (31)

Then, we form the penalty function P(t) as

    P(t) = \lVert \aleph(t) \rVert_1 + \lVert \beta(t) \rVert_1    (32)

where ℵ(t) and β(t) are the vectors collecting ℵ_n(t) and β_{n,l}(t), respectively, and ∥·∥_1 denotes the L1 norm. Therefore, we calculate the immediate reward as

    r_t = G(t) - \lambda P(t)    (33)

where λ (λ > 0) is the penalty coefficient. P(t) is a measure of the violation of the constraints: it is nonzero when the constraints are violated and zero in the region where the constraints are satisfied. By introducing P(t) into the immediate reward, the agent gradually learns to satisfy the constraints while optimizing the system energy efficiency.
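A compact sketch of the observation assembly in (28) and of the penalties and reward in (30)–(33); the array shapes follow the notation above (N controllers, L noncryptographic tasks per domain), and the code is illustrative rather than the article's implementation.

```python
import numpy as np

def observation(agent, g_now, g_prev, d_now, d_prev):
    """Eq. (28): agent n sees its own domain's current state and the other
    domains' states from the previous time slot (shared via consensus)."""
    g_obs = np.array(g_prev, dtype=float)
    g_obs[agent] = g_now[agent]
    d_obs = np.array(d_prev, dtype=float)   # N x L
    d_obs[agent] = d_now[agent]
    return g_obs, d_obs

def penalty(xi_crypto, xi_normal, xi_tot, T, T_max):
    """Eqs. (30)-(32): L1 penalty for violating the per-MEC resource budget (26)
    and the per-task delay constraint (21). xi_normal, T, T_max are N x L arrays."""
    xi_crypto = np.asarray(xi_crypto, dtype=float)
    xi_normal = np.asarray(xi_normal, dtype=float)
    over_budget = np.maximum(xi_crypto + xi_normal.sum(axis=1) - xi_tot, 0.0)              # (30)
    over_delay = np.maximum(np.asarray(T, dtype=float) - np.asarray(T_max, dtype=float), 0.0)  # (31)
    return float(over_budget.sum() + over_delay.sum())                                     # (32)

def reward(G, P, lam):
    """Eq. (33): immediate reward r_t = G(t) - lambda * P(t)."""
    return G - lam * P
```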
It is noteworthy that each agent makes decisions about the whole system, but only the actions for its own domain are actually executed. In other words, each agent actually makes decisions for its own domain based on the global states. Moreover, the different domains' contributions to the immediate reward are shared through the consensus process, thus all agents obtain the same immediate reward, which pushes them in the same direction of optimization. As a consequence, the distributed agents work cooperatively like a single entity when the DRL algorithm converges.

V. DRL-BASED OPTIMIZATION FRAMEWORK

Several methods can approximate the optimal POMDP policy on the condition that the model of the environment is known, which may not be an appropriate assumption in reality [36]. To be more practical, we utilize DRL to solve the POMDP problem without knowing the explicit model of the environment.

A. Q-Learning

Reinforcement learning (RL) is a branch of machine learning that predominantly seeks to solve MDPs whose system dynamics Pr(s_{t+1} | s_t, a_t) are unknown. The goal of the agent is to select actions that maximize long-term rewards. Q-learning is one of the traditional RL algorithms and adopts a model-free learning method for estimating the long-term expected reward known as the Q-value. The Q-value is also called the action–state value function. A specific Q-value Q_π(s_t, a_t) corresponding to the policy π is defined as the long-term expected reward from state s_t after executing action a_t and following the policy π thereafter:

    Q_\pi(s_t, a_t) = E_\pi\left[ \sum_{\tau=0}^{\infty} \gamma^{\tau} r_{t+\tau} \,\middle|\, s_t, a_t \right]    (34)

where γ (0 ≤ γ ≤ 1) is the discount factor representing the relative importance of future rewards versus the immediate reward. Q-values are learned iteratively by updating the current Q-value estimate as follows:

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]    (35)

where α ≥ 0 is the learning rate. Generally, Q-learning uses a Q-table to store Q(s, a). At a certain time slot, the Q-value corresponding to the state and action of that time slot is updated according to (35).

B. Deep Q Network

For problems featuring either a high-dimensional state space S or a high-dimensional action space A, which is often the case in reality, it is impractical to maintain a separate estimate for each pair in S × A. As an alternative, a model (also known as an approximator), usually a nonlinear one, is used to approximate the Q-values. In the case of DQN, the approximator is a neural network (NN) parameterized by weights and biases collectively denoted by ν. Consequently, Q-values are expressed as Q(s, a | ν). Due to the adoption of an NN, DQN is able to handle large-scale problems, which largely broadens the application of RL [31]. Instead of updating individual Q-values, updates are now made to the parameters of the NN to minimize a differentiable loss function

    L(\nu_t) = \left( r_t + \gamma \max_{a'} Q(s_{t+1}, a' | \nu_t) - Q(s_t, a_t | \nu_t) \right)^2    (36)

    \nu_{t+1} = \nu_t - \alpha \nabla_{\nu} L(\nu_t).    (37)

In (36), it is notable that Q(s_t, a_t | ν_t) represents the current estimate of the following cumulative reward from state s_t after executing action a_t and following the optimal policy thereafter:

    R_t = r_t + \sum_{\tau=1}^{\infty} \gamma^{\tau} r_{t+\tau}.    (38)

DQN assumes that the action corresponding to the maximal Q-value will be executed at the next state s_{t+1}; thus, the following expression can be considered an alternative representation of the above cumulative reward:

    R'_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a' | \nu_t).    (39)

Due to the use of the current Q network, R'_t is also an estimate of R_t; however, compared to Q(s_t, a_t | ν_t), R'_t is closer to the target R_t since it includes the actual immediate reward r_t. Therefore, DQN uses the mean-square error (MSE) between R'_t and Q(s_t, a_t | ν_t) depicted in (36) as the loss function and applies (37) to update the Q network.

Since the same NN generates the next-state target Q-values Q(s_{t+1}, a' | ν_t) used in updating the current Q-values, the learning process with such an update can be unstable or even diverge [37]. DQN ingeniously utilizes two mechanisms to restore learning stability. First, transitions (s_t, a_t, r_t, s_{t+1}) are recorded in an experience replay memory and then sampled uniformly at training time. Second, a separate target network Q̂ generates the target Q-values for the update, decoupling the feedback resulting from the network generating its own targets. Q̂ is identical to the evaluation network Q except that its parameters ν− are updated less frequently than those of the evaluation network to match ν. As a result, the loss function in (36) is changed into

    L(\nu_t) = \left( y_t - Q(s_t, a_t | \nu_t) \right)^2    (40)

where y_t = r_t + γ max_{a'} Q̂(s_{t+1}, a' | ν_t^-) is the stable update target given by the target network Q̂. Learning processes with such an update have been empirically shown to be tractable and stable [31].

Despite its wide application in many fields, DQN cannot be directly used to handle our problem due to its partial observability.
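For illustration only, the following sketch shows the target-network update of (36)–(40) with a toy linear Q-approximator standing in for the deep network; the article's actual solution combines DRQN with NAFs rather than this plain DQN, and all names and numbers below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 4, 2

def phi(s, a):
    """Feature vector of a toy linear approximator Q(s, a | nu) = phi(s, a) . nu."""
    return np.concatenate([s, a, [1.0]])

def q_value(nu, s, a):
    return float(phi(s, a) @ nu)

def dqn_step(nu, nu_target, batch, candidate_actions, gamma=0.99, lr=1e-3):
    """One gradient step on the loss (40): the target y_t is built from the slowly
    updated parameters nu_target, while the evaluation parameters nu are trained."""
    grad = np.zeros_like(nu)
    for s, a, r, s_next in batch:
        y = r + gamma * max(q_value(nu_target, s_next, a2) for a2 in candidate_actions)
        td_error = q_value(nu, s, a) - y
        grad += 2.0 * td_error * phi(s, a)        # gradient of (Q(s,a|nu) - y)^2
    return nu - lr * grad / len(batch)

# Replay memory of (s_t, a_t, r_t, s_{t+1}) transitions, sampled uniformly.
actions = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
memory = [(rng.normal(size=STATE_DIM), actions[rng.integers(3)], rng.normal(),
           rng.normal(size=STATE_DIM)) for _ in range(100)]

nu = np.zeros(STATE_DIM + ACTION_DIM + 1)
nu_target = nu.copy()
for step in range(500):
    idx = rng.integers(len(memory), size=32)      # uniform mini-batch
    nu = dqn_step(nu, nu_target, [memory[i] for i in idx], actions)
    if step % 50 == 0:
        nu_target = nu.copy()                     # infrequent target-network sync
```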
TABLE IV
SIMULATION PARAMETERS
    \begin{pmatrix}
    0.2   & 0.14  & 0.11  & 0.1   & 0.1   & 0.04  & 0.06  & 0.02  & 0.08  & 0.15  \\
    0.16  & 0.18  & 0.14  & 0.13  & 0.08  & 0.03  & 0.04  & 0.03  & 0.1   & 0.11  \\
    0.11  & 0.16  & 0.16  & 0.14  & 0.13  & 0.1   & 0.06  & 0.039 & 0.031 & 0.07  \\
    0.1   & 0.11  & 0.16  & 0.2   & 0.11  & 0.12  & 0.05  & 0.08  & 0.026 & 0.044 \\
    0.06  & 0.07  & 0.095 & 0.145 & 0.24  & 0.14  & 0.09  & 0.07  & 0.04  & 0.05  \\
    0.08  & 0.032 & 0.062 & 0.14  & 0.12  & 0.19  & 0.156 & 0.1   & 0.1   & 0.02  \\
    0.02  & 0.05  & 0.06  & 0.1   & 0.13  & 0.14  & 0.16  & 0.16  & 0.11  & 0.07  \\
    0.018 & 0.03  & 0.044 & 0.049 & 0.095 & 0.146 & 0.29  & 0.137 & 0.12  & 0.071 \\
    0.027 & 0.03  & 0.035 & 0.037 & 0.101 & 0.137 & 0.31  & 0.148 & 0.114 & 0.061 \\
    0.025 & 0.08  & 0.031 & 0.026 & 0.112 & 0.143 & 0.262 & 0.146 & 0.121 & 0.054
    \end{pmatrix}    (46)
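The matrix in (46) is row stochastic (each row sums to one), so it can be read as a ten-state Markov transition matrix. Assuming, purely for illustration, that its ten states index discretized levels of a quantity such as the trust feature g_n(t), a chain governed by it can be simulated as in the following sketch.

```python
import numpy as np

# Row-stochastic transition matrix of (46); each row gives Pr(next state | current state).
P = np.array([
    [0.2,   0.14,  0.11,  0.1,   0.1,   0.04,  0.06,  0.02,  0.08,  0.15],
    [0.16,  0.18,  0.14,  0.13,  0.08,  0.03,  0.04,  0.03,  0.1,   0.11],
    [0.11,  0.16,  0.16,  0.14,  0.13,  0.1,   0.06,  0.039, 0.031, 0.07],
    [0.1,   0.11,  0.16,  0.2,   0.11,  0.12,  0.05,  0.08,  0.026, 0.044],
    [0.06,  0.07,  0.095, 0.145, 0.24,  0.14,  0.09,  0.07,  0.04,  0.05],
    [0.08,  0.032, 0.062, 0.14,  0.12,  0.19,  0.156, 0.1,   0.1,   0.02],
    [0.02,  0.05,  0.06,  0.1,   0.13,  0.14,  0.16,  0.16,  0.11,  0.07],
    [0.018, 0.03,  0.044, 0.049, 0.095, 0.146, 0.29,  0.137, 0.12,  0.071],
    [0.027, 0.03,  0.035, 0.037, 0.101, 0.137, 0.31,  0.148, 0.114, 0.061],
    [0.025, 0.08,  0.031, 0.026, 0.112, 0.143, 0.262, 0.146, 0.121, 0.054],
])

rng = np.random.default_rng(1)

def step(state):
    """Draw the next state index from the row of the current state."""
    p = P[state] / P[state].sum()      # guard against floating-point drift
    return int(rng.choice(len(P), p=p))

def simulate(state0, horizon):
    """Simulate a short trajectory of the chain (state indices 0..9)."""
    traj = [state0]
    for _ in range(horizon):
        traj.append(step(traj[-1]))
    return traj

print(simulate(0, 10))
```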
Fig. 11. Total computational resources of MEC in Aardvark versus time slot.
Fig. 12. Delay of the noncryptographic task in Aardvark versus time slot.
Fig. 13. Energy efficiency of PBFT, Aardvark, and RBFT versus the total number of transactions sent by N SDN controllers.
Fig. 14. Energy efficiency of PBFT, Aardvark, and RBFT versus computational resources for cryptographic operations in each MEC.

As another important hyperparameter, the mini-batch size controls the accuracy of the estimate of the loss-function gradient when performing DRL. A larger mini-batch size yields a more accurate gradient estimate, thus stabilizing and accelerating the learning process. Specifically, for PBFT, Aardvark, and RBFT in this article, the best mini-batch sizes are 64, 64, and 128, respectively. Higher values bring no improvement in convergence speed but require more memory space.

To show the influence of the penalty function on the computational resources, we take Aardvark with ten physical machines (i.e., ten MECs) as an example and depict the relationship between the total computational resources of a specific MEC and the time slots in Fig. 11. In the beginning, the value of the total computational resources varies over a wide range to explore the action space. The agent learns from the penalty [nonzero P(t)] it receives through the reward and tries to avoid the region of the action space that leads to the violation of constraint (26). Besides, the agent learns to achieve higher G(t) as well. As the learning algorithm progresses, the value of the total computational resources varies in a small range to achieve the best reward, resulting in the convergence of the algorithm and the satisfaction of the corresponding constraint. In addition, Fig. 12 illustrates the delay performance of a noncryptographic task with a maximum delay of 10 ms. As we can see, with the convergence of the learning algorithm, the actual delay of this kind of task is kept below the maximum delay. Moreover, the violation of constraint (21) results in a relatively wider exploration range of the delay, hence the agent takes more time slots to satisfy the delay constraint. As for the other MECs and protocols, the performance curves are similar to Figs. 11 and 12. Due to space limitations, we only give the above two figures as a typical example.

Aiming to show the effect of the computational resources for cryptographic operations and the number of transactions sent by the SDN controllers, we set N = 10, fix b = [b_1, b_2, . . . , b_N] and ξ^crypto = [ξ_1^crypto, ξ_2^crypto, . . . , ξ_N^crypto], respectively, and compare the corresponding performance with our proposed DRL in Figs. 13 and 14. As is shown, different ranges of b and ξ^crypto result in distinct trends of the energy efficiency. Hence, a one-size-fits-all parameter setting does not exist. Besides, by tracking the variation of the environment, our proposed DRL achieves higher energy efficiency than the two fixed schemes, and the fixed ξ^crypto scheme suffers more severe performance degradation.

As with the general performance displayed in Fig. 8, Figs. 15–17 illustrate the relationship between the energy efficiency and the number of blockchain nodes for PBFT,
Aardvark, and RBFT, respectively. According to the corresponding consensus procedures, more blockchain nodes beget more message exchanges, so more computational resources are used for the same amount of transactions. Therefore, the energy efficiency decreases as the number of blockchain nodes increases. As shown in the above figures, our proposed DRL achieves performance similar to the ideal situation (i.e., centralized DRL) as the number of blockchain nodes varies. In addition to the proposed DRL

Fig. 15. Energy efficiency of PBFT versus the number of blockchain nodes.
Fig. 16. Energy efficiency of Aardvark versus the number of blockchain nodes.
Fig. 17. Energy efficiency of RBFT versus the number of blockchain nodes.

VII. CONCLUSION

In this article, we integrated permissioned blockchain into distributed SDIIoT to reach consensus on the global view and proposed an optimization framework to optimize the system energy efficiency. In order to fulfill the decentralization feature of blockchain, we described our problem as a POMDP, which can be implemented in a truly distributed manner. Then, we proposed a novel DRL approach to solve the POMDP. The simulation results showed that the proposed algorithm could achieve the goal of improving the energy efficiency of Aardvark, RBFT, and PBFT with limited performance reduction compared to the ideal situation. For future work, we will consider more consensus protocols and apply similar schemes in specific edge computing cases such as video transcoding.

REFERENCES

[1] J. Li, F. R. Yu, G. Deng, C. Luo, Z. Ming, and Q. Yan, "Industrial Internet: A survey on the enabling technologies, applications, and challenges," IEEE Commun. Surveys Tuts., vol. 19, no. 3, pp. 1504–1526, 3rd Quart., 2017.
[2] X. Li, D. Li, J. Wan, C. Liu, and M. Imran, "Adaptive transmission optimization in SDN-based industrial Internet of Things with edge computing," IEEE Internet Things J., vol. 5, no. 3, pp. 1351–1360, Jun. 2018.
[3] D. Ongaro and J. Ousterhout, "In search of an understandable consensus algorithm," in Proc. USENIX Annu. Tech. Conf., Philadelphia, PA, USA, 2014, pp. 305–320.
[4] S. Nakamoto, "Bitcoin: A peer-to-peer electronic cash system," Rep., 2008. [Online]. Available: https://bitcoin.org/bitcoin.pdf
[5] Z. Zheng, S. Xie, H. Dai, X. Chen, and H. Wang, "An overview of blockchain technology: Architecture, consensus, and future trends," in Proc. IEEE Int. Congr. Big Data (BigData Congress), Honolulu, HI, USA, Jun. 2017, pp. 557–564.
[6] F. R. Yu, J. Liu, Y. He, P. Si, and Y. Zhang, "Virtualization for distributed ledger technology (vDLT)," IEEE Access, vol. 6, pp. 25019–25028, 2018.
[7] C. Wang, Y. He, F. R. Yu, Q. Chen, and L. Tang, "Integration of networking, caching, and computing in wireless systems: A survey, some research issues, and challenges," IEEE Commun. Surveys Tuts., vol. 20, no. 1, pp. 7–38, 1st Quart., 2018.
[8] S. Song and J. Chung, "Sliced NFV service chaining in mobile edge clouds," in Proc. 19th Asia–Pac. Netw. Oper. Manag. Symp. (APNOMS), Seoul, South Korea, Sep. 2017, pp. 292–294.
[9] H. Truong and M. Karan, "Analytics of performance and data quality for mobile edge cloud applications," in Proc. IEEE 11th Int. Conf. Cloud Comput. (CLOUD), San Francisco, CA, USA, Jul. 2018, pp. 660–667.
[10] H. Liu, F. Eldarrat, H. Alqahtani, A. Reznik, X. de Foy, and Y. Zhang, "Mobile edge cloud system: Architectures, challenges, and approaches," IEEE Syst. J., vol. 12, no. 3, pp. 2495–2508, Sep. 2018.
[11] J. Luo, F. R. Yu, Q. Chen, and L. Tan, "Adaptive video streaming with edge caching and video transcoding over software-defined mobile networks: A deep reinforcement learning approach," IEEE Trans. Wireless Commun., early access, doi: 10.1109/TWC.2019.2955129.
[12] J. Wan et al., "Toward dynamic resources management for IoT-based manufacturing," IEEE Commun. Mag., vol. 56, no. 2, pp. 52–59, Feb. 2018.
[13] M. Moness, A. M. Moustafa, A. H. Muhammad, and A. A. Younis, "Hybrid controller for a software-defined architecture of industrial Internet lab-scale process," in Proc. 12th Int. Conf. Comput. Eng. Syst. (ICCES), Cairo, Egypt, Dec. 2017, pp. 266–271.
number of blockchain nodes. In addition to the proposed DRL (ICCES), Cairo, Egypt, Dec. 2017, pp. 266–271.
Authorized licensed use limited to: UNIVERSITY OF BIRMINGHAM. Downloaded on June 14,2020 at 16:05:49 UTC from IEEE Xplore. Restrictions apply.
5480 IEEE INTERNET OF THINGS JOURNAL, VOL. 7, NO. 6, JUNE 2020
[14] R. Chaudhary, G. S. Aujla, S. Garg, N. Kumar, and J. J. P. C. Rodrigues, "SDN-enabled multi-attribute-based secure communication for smart grid in IIoT environment," IEEE Trans. Ind. Informat., vol. 14, no. 6, pp. 2629–2640, Jun. 2018.
[15] I. Bedhief, L. Foschini, P. Bellavista, M. Kassar, and T. Aguili, "Toward self-adaptive software defined fog networking architecture for IIoT and industry 4.0," in Proc. IEEE 24th Int. Workshop Comput. Aided Model. Design Commun. Links Netw. (CAMAD), Limassol, Cyprus, Sep. 2019, pp. 1–5.
[16] F. Bannour, S. Souihi, and A. Mellouk, "Distributed SDN control: Survey, taxonomy, and challenges," IEEE Commun. Surveys Tuts., vol. 20, no. 1, pp. 333–354, 1st Quart., 2018.
[17] A. Tootoonchian and Y. Ganjali, "HyperFlow: A distributed control plane for OpenFlow," in Proc. Internet Netw. Manag. Conf. Res. Enterprise Netw., Apr. 2010, pp. 1–6.
[18] T. Koponen et al., "Onix: A distributed control platform for large-scale production networks," in Proc. USENIX Symp. Operating Syst. Design Implement., Vancouver, BC, Canada, Oct. 2010, pp. 351–364.
[19] P. Berde et al., "ONOS: Towards an open, distributed SDN OS," in Proc. ACM SIGCOMM Workshop Hot Topics Softw. Defined Netw., Aug. 2014, pp. 1–6.
[20] P. Hunt et al., "ZooKeeper: Wait-free coordination for Internet-scale systems," in Proc. USENIX Annu. Tech. Conf., 2010, p. 11.
[21] S. Yeganeh and Y. Ganjali, "Kandoo: A framework for efficient and scalable offloading of control applications," in Proc. 1st ACM Int. Workshop Hot Topics Softw. Defined Netw., Helsinki, Finland, Aug. 2012, pp. 19–24.
[22] S. Jain et al., "B4: Experience with a globally-deployed software defined WAN," in Proc. ACM SIGCOMM Conf., Hong Kong, China, Aug. 2013, pp. 3–14.
[23] A. Kumar et al., "BwE: Flexible, hierarchical bandwidth allocation for WAN distributed computing," in Proc. ACM SIGCOMM Special Interest Group Data Commun., London, U.K., Aug. 2015, pp. 1–14.
[24] C. Qiu, F. R. Yu, H. Yao, C. Jiang, F. Xu, and C. Zhao, "Blockchain-based software-defined industrial Internet of Things: A dueling deep Q-learning approach," IEEE Internet Things J., vol. 6, no. 3, pp. 4627–4639, Jun. 2019.
[25] M. Liu, F. R. Yu, Y. Teng, V. Leung, and M. Song, "Performance optimization for blockchain-enabled Industrial Internet of Things (IIoT) systems: A deep reinforcement learning approach," IEEE Trans. Ind. Informat., vol. 15, no. 6, pp. 3559–3570, Jun. 2019.
[26] G. Wood, "Ethereum: A secure decentralised generalised transaction ledger," Ethereum Project Yellow Paper, vol. 151, pp. 1–32, Jan. 2014.
[27] M. Castro and B. Liskov, "Practical Byzantine fault tolerance and proactive recovery," ACM Trans. Comput. Syst., vol. 20, no. 4, pp. 398–461, Nov. 2002.
[28] A. Clement, E. L. Wong, L. Alvisi, M. Dahlin, and M. Marchetti, "Making Byzantine fault tolerant systems tolerate Byzantine faults," in Proc. 6th USENIX Symp. Netw. Syst. Design Implement., Berkeley, CA, USA, 2009, pp. 153–168.
[29] P. Aublin, S. B. Mokhtar, and V. Quéma, "RBFT: Redundant Byzantine fault tolerance," in Proc. IEEE 33rd Int. Conf. Distrib. Comput. Syst., Philadelphia, PA, USA, Jul. 2013, pp. 297–306.
[30] R. Kotla, L. Alvisi, M. Dahlin, A. Clement, and E. Wong, "Zyzzyva: Speculative Byzantine fault tolerance," SIGOPS Oper. Syst. Rev., vol. 41, no. 6, pp. 45–58, Oct. 2007.
[31] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, Feb. 2015.
[32] M. Hausknecht and P. Stone, "Deep recurrent Q-learning for partially observable MDPs," Jan. 2017. [Online]. Available: arXiv:1507.06527.
[33] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, "Continuous deep Q-learning with model-based acceleration," Mar. 2016. [Online]. Available: arXiv:1603.00748.
[34] J. Wang, L. Zhao, J. Liu, and N. Kato, "Smart resource allocation for mobile edge computing: A deep reinforcement learning approach," IEEE Trans. Emerg. Topics Comput., early access, doi: 10.1109/TETC.2019.2902661.
[35] Z. Md. Fadlullah, F. Tang, B. Mao, J. Liu, and N. Kato, "On intelligent traffic control for large-scale heterogeneous networks: A value matrix-based deep learning approach," IEEE Commun. Lett., vol. 22, no. 12, pp. 2479–2482, Dec. 2018.
[36] N. A. Vien, T. P. Le, and T. Chung, "Deep hierarchical reinforcement learning algorithm in partially observable Markov decision processes," IEEE Access, vol. 6, pp. 49089–49102, 2018.
[37] J. N. Tsitsiklis and B. V. Roy, "An analysis of temporal-difference learning with function approximation," IEEE Trans. Autom. Control, vol. 42, no. 5, pp. 674–690, May 1997.
[38] D. Bernstein, "The Poly1305-AES message-authentication code," in Proc. 12th Int. Conf. Fast Softw. Encryption, Paris, France, Feb. 2005, pp. 32–49.
[39] D. J. Bernstein, N. Duif, T. Lange, P. Schwabe, and B.-Y. Yang, "High-speed high-security signatures," J. Cryptograph. Eng., vol. 2, no. 2, pp. 77–89, Sep. 2012.

Jia Luo received the M.S. degree from Chongqing University of Posts and Telecommunications, Chongqing, China, in 2014, where he is currently pursuing the Ph.D. degree with the School of Communication and Information Engineering.
From April 2018 to April 2019, he was a visiting Ph.D. student with Carleton University, Ottawa, ON, Canada. His current research interests include blockchain, SDN, mobile edge computing, and deep reinforcement learning.

Qianbin Chen (Senior Member, IEEE) received the Ph.D. degree in communication and information system from the University of Electronic Science and Technology of China, Chengdu, China, in 2002.
He is currently a Professor with the School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing, China, where he is the Director of the Chongqing Key Laboratory of Mobile Communication Technology. He has authored or coauthored over 100 papers in journals and peer-reviewed conference proceedings, and has coauthored seven books. He holds 47 granted national patents.

F. Richard Yu (Fellow, IEEE) received the Ph.D. degree in electrical engineering from the University of British Columbia, Vancouver, BC, Canada, in 2003.
From 2002 to 2006, he was with Ericsson, Lund, Sweden, and a startup in California, USA. He joined Carleton University, Ottawa, ON, Canada, in 2007, where he is currently a Professor. His research interests include wireless cyber–physical systems, connected/autonomous vehicles, security, distributed ledger technology, and deep learning.
Prof. Yu received the IEEE Outstanding Service Award in 2016; the IEEE Outstanding Leadership Award in 2013; the Carleton Research Achievement Award in 2012; the Ontario Early Researcher Award (formerly, Premier's Research Excellence Award) in 2011; the Excellent Contribution Award at IEEE/IFIP TrustCom 2010; the Leadership Opportunity Fund Award from the Canada Foundation for Innovation in 2009; and the Best Paper Awards at IEEE ICNC 2018, VTC 2017 Spring, ICC 2014, Globecom 2012, IEEE/IFIP TrustCom 2009, and the International Conference on Networking 2005. He serves on the editorial boards of several journals, including as the Co-Editor-in-Chief for Ad Hoc and Sensor Wireless Networks and the Lead Series Editor for the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, the IEEE TRANSACTIONS ON GREEN COMMUNICATIONS AND NETWORKING, and IEEE COMMUNICATIONS SURVEYS & TUTORIALS. He has served as the Technical Program Committee Co-Chair of numerous conferences. He is a Registered Professional Engineer in the Province of Ontario, Canada, and a Fellow of the Institution of Engineering and Technology. He is a Distinguished Lecturer, the Vice President (Membership), and an Elected Member of the Board of Governors of the IEEE Vehicular Technology Society.

Lun Tang received the Ph.D. degree in communication and information system from Chongqing University, Chongqing, China.
He is currently a Professor with the School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing. His current research interests include 5G cellular networks, network slicing, and machine learning.