
Hindawi

Wireless Communications and Mobile Computing


Volume 2020, Article ID 6684293, 16 pages
https://doi.org/10.1155/2020/6684293

Research Article
Deep Reinforcement Learning-Based Collaborative Video Caching
and Transcoding in Clustered and Intelligent Edge B5G Networks

Zheng Wan¹ and Yan Li¹,²

¹School of Information Management, Jiangxi University of Finance and Economics, No. 665, West Yuping Road, Nanchang, Jiangxi 330032, China
²School of Information Engineering, Nanchang Institute of Technology, No. 289, Tianxiang Road, Nanchang, Jiangxi 330099, China

Correspondence should be addressed to Yan Li; yli@nit.edu.cn

Received 14 October 2020; Revised 7 November 2020; Accepted 25 November 2020; Published 12 December 2020

Academic Editor: Lisheng Fan

Copyright © 2020 Zheng Wan and Yan Li. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.

In the next-generation wireless communication systems of Beyond 5G (B5G) networks, video streaming services account for a large proportion of total network traffic. Furthermore, user preference and demand for a specific video may differ because of the heterogeneity of users' processing capabilities and the variation of network conditions. Choosing appropriate-quality videos according to users' actual network conditions is therefore a complicated decision problem with high-dimensional state spaces. To address this issue, this paper proposes a Content Distribution Network and Cluster-based Mobile Edge Computing framework that enhances caching and computing capacity and promotes collaboration among edge servers. We then develop a novel deep reinforcement learning-based framework that automatically obtains intracluster collaborative caching and transcoding decisions, which are executed based on video popularity, user requirement prediction, and the abilities of edge servers. Simulation results demonstrate that the designed deep reinforcement learning-based algorithm significantly improves the quality of video streaming service while reducing backhaul consumption and processing costs.

1. Introduction

Beyond fifth-generation (B5G) networks are the next generation of wireless communication systems. They are expected to provide highly reliable services with very high transmission rates, ultralow latency, very low energy consumption, excellent quality of experience (QoE), and much enhanced security [1]. By providing mobile edge computing and edge caching capabilities together with machine learning, edge intelligence is emerging as a new concept with great potential for addressing the new challenges in B5G networks [2, 3]. In wireless communication networks, video streaming services account for a large proportion of total network traffic. In particular, the 2019-nCoV epidemic has increased dependence on and demand for online video streaming services such as online meetings, online teaching, and online shopping.

In recent years, the number of smart devices has grown explosively, leading to an unprecedented increase in the demand for video streaming services, which generally require higher data rates and larger system capacity. Overall mobile data traffic experienced 17-fold growth from 2012 to 2017, as summarized in the Cisco Visual Networking Index [4]. Mobile video accounts for more than half of this traffic and is predicted to grow further by 2022, reaching 79% of total data traffic. Given these immense demands, mobile network operators alone cannot satisfy users' demands for high-quality video streaming services.

To address this issue, edge video caching has been recognized as a promising solution for reducing data traffic, because it brings videos closer to the users, which reduces the traffic going through the backhaul links and the time required for video delivery [5]. Motivated by serving users better, different edge caching strategies have been studied recently. Moreover, good video QoE is very important to users. To serve the full range of user mobile devices, source video streams need to be transcoded into multiple representations, but video transcoding is an extremely computation-intensive and time-consuming task [6].

Recently, mobile edge computing (MEC) has been introduced as an emerging paradigm at the edge of the cellular Radio Access Network (C-RAN) [7–12]. MEC servers are deployed at the base stations of mobile edge computing platforms, enabling video streaming services in close proximity to mobile users. This position gives MEC a unique opportunity not only to perform edge caching but also to implement edge processing.

Due to the heterogeneity of users' processing capabilities and the variation of network conditions, user preference and demand for a specific video may differ. For example, users with better network conditions usually prefer high-resolution videos, while users with poor network conditions may desire videos of a quality appropriate to their actual network conditions. Based on this phenomenon, adaptive bitrate (ABR) streaming [13, 16] has been widely used to improve the quality of delivered video. In ABR streaming, the bitrate of the streaming video is chosen according to the user's specific request and actual network condition. A video is encoded into multiple layers with different bitrates, satisfying different users' requirements. Each video layer is further segmented into many small video chunks, each containing several seconds of video content. Thus, users can dynamically adjust the video layer for different video chunks depending on their actual network conditions. Choosing appropriate-quality videos according to users' actual network conditions is therefore a complicated decision problem with high-dimensional state spaces. There are obvious advantages in deploying ABR streaming locally at multi-MEC servers in the RAN, such as avoiding long latency and reducing the prestorage pressure at the RAN [14–18]. The required video layer can then be transcoded for mobile users in an on-demand fashion, which improves ABR streaming performance over mobile edge computing networks when the video is served directly from a local MEC server.

Deep learning has a strong perception ability. It is mainly used to solve classification and regression problems by capturing and analyzing data features [19–22], but it cannot make decisions. Reinforcement learning [23] can make decisions, but it is poor at perception and cannot handle high-dimensional data. Reinforcement learning is essentially an agent that learns the best decision sequence through interaction with the environment. To deal with complicated control and decision problems with high-dimensional state spaces, a promising solution has emerged from the recent development of deep reinforcement learning (DRL) [24]. DRL consists of two modules: deep learning and reinforcement learning. It uses deep learning to extract features from complex high-dimensional data and transform them into a low-dimensional feature space. The low-dimensional feature state is then fed into reinforcement learning to make decisions that seek greater rewards. The goal of DRL is to enable an agent to take the best action in the current state to maximize long-term gains in the environment [25, 26], where the interaction between the agent's actions and states is learned by a deep neural network (DNN). Due to these characteristics, DRL has become a powerful tool in robotics, wireless communication, and other fields [27–29]. Since the advent of the deep Q network (DQN) [30–32] in 2013, a large number of algorithms and papers applying DRL to practical problems have appeared. The basic idea behind many reinforcement learning algorithms is to estimate the Q value function by using the Bellman equation as an iterative update; such value iteration algorithms converge to the optimal Q value function.

This paper proposes a video transmission model combining MEC and Content Distribution Network (CDN) technology, which interconnects the CDN network with the MEC network through the CDN tips. We focus on exploiting MEC storage and processing capabilities to improve the performance of high-quality streaming services, and we aim to solve collaborative caching and transcoding for multi-MEC servers by using a DRL algorithm in the mobile edge computing system. Specifically, the main contributions of this paper are as follows:

(i) A CDN and Cluster-based Mobile Edge Computing (2C-MEC) system model is proposed, which promotes cooperation among MEC servers and reduces unnecessary backhaul consumption and processing costs. We design MEC-enabled collaborative caching and transcoding for multi-MEC servers in the 2C-MEC system by performing video caching and transcoding in the vicinity of the RAN at multi-MEC servers

(ii) The optimization problem of collaborative caching and transcoding for multi-MEC servers is formulated as a stochastic Markov decision process that maximizes the time-averaged Deep Q-Network (DQN) reward. The reward is defined as the weighted sum of the cache hit rate, user-perceived QoE, and the cost of performing transcoding and transmission at multi-MEC servers. We then develop a DRL-based algorithm to automatically obtain the intracluster collaborative caching and transcoding decisions, which are executed based on video popularity, user requirement prediction, and the abilities of MEC servers

(iii) Simulation results demonstrate that video streaming service can be significantly improved by the proposed DRL-based algorithm compared with a scheme in which video transcoding is not implemented at the MEC servers, with less backhaul consumption and lower processing costs

The remainder of this paper is organized as follows. Section 2 presents related work. Section 3 describes the framework design of the system, and Section 4 formulates the optimization problem. The DRL-based algorithm is presented in Section 5. Section 6 presents the simulation results and analysis, followed by conclusions in Section 7.
2. Related Work

The application of DRL to wireless network transmission optimization in the MEC environment has been studied extensively in recent years. Research in this area mainly began in 2018 and has increased quickly year by year since then; however, applications of DRL to video transmission optimization in the MEC environment remain relatively rare. Current research in this area falls into the following categories: DRL-based caching strategies, DRL-based real-time transcoding scheduling decisions, DRL-based wireless communication resource allocation [33–37], and DRL-based offloading and service migration of computing tasks [38–43]. In this paper, we mainly focus on the first two topics, aiming to satisfy users' quality requirements for streaming services.

2.1. DRL-Based Caching Strategy. For edge video caching at MEC servers, the caching policy is driven by video popularity, so knowing the video popularity is key to solving the video caching problem. To avoid the drawbacks of popularity estimation, DRL methods have been introduced to implement video caching strategies, which is an important research direction [44–47]. In order to reduce the backhaul traffic load and transmission latency, Wei et al. [48] proposed a Q-learning-based collaborative cache algorithm to solve the intelligent baseband unit pool cache problem. Yang et al. [49] considered the task offloading decision, cache allocation, and computation allocation problems in a single MEC server and proposed a DRL algorithm to solve this optimization problem with low complexity. Zhong et al. [50, 51] presented a DRL-based framework with Wolpertinger architecture for content caching at a single MEC. They proposed deep actor-critic reinforcement learning-based policies for both centralized and decentralized content caching, aiming to maximize the cache hit rate in centralized edge caching and using the cache hit rate and transmission delay as performance metrics in decentralized edge caching. Gursoy et al. [52] designed a deep actor-critic RL-based multiagent framework for the edge caching problem in both a multicell network and a single-cell network with D2D communication.

Applying DRL to caching mainly addresses cache content placement decisions, cache update strategies, and cache content delivery. It implements resource allocation and cache scheduling by using deep learning to analyze and learn network information; the corresponding video content and bitrate versions are then cached to improve the cache hit ratio and the utilization of cache resources. However, the lack of transcoding at the network edge reduces the video cache hit rate.

2.2. DRL-Based Transcoding Scheduling Strategy. Users' demand for a specific video may differ because of the heterogeneity of their actual network conditions. To address this issue, transcoding at the network edge has been widely used to improve the quality of video delivered over wireless networks. To achieve accurate QoE, Liu et al. [53] and Zhang et al. [54] presented deep learning-based QoE prediction called DeepQoE. In [53], the authors designed a content-aware bitrate adaptation policy with the objective of prefetching a higher-resolution version for video clips that are in line with viewers' interests. Zhang et al. [54] also developed a DeepQoE-based ABR system to verify that their framework can be easily applied to multimedia communication services. To address the challenge of how to allocate bitrate budgets across parts of the video attracting different user interest, Gao et al. [55] proposed a content-of-interest-based rate adaptation scheme for ABR; they designed a deep learning approach for recognizing the interestingness of the video content and a DQN approach for rate adaptation that incorporates the video interestingness information. Considering joint computation and communication for ABR streaming, Guo et al. [56] presented a joint video transcoding and quality adaptation framework for ABR streaming. Inspired by recent advances in blockchain technology, Liu et al. [57] proposed a novel DRL-based transcoder selection framework for blockchain-enabled D2D transcoding systems, where video transcoding has been widely adopted in live streaming services to bridge the resolution and format gap between content producers and consumers. To accommodate personalized QoE with minimized system cost, Wang et al. [58] proposed DeepCast, an edge-assisted crowdcast framework that makes intelligent decisions at edges based on the massive amount of real-time information from the network and viewers. In [59], Pang et al. used DRL to train a neural network model for resource provisioning and designed a joint resource provisioning and task scheduling approach for transcoding live streams in the cloud.

The application of DRL to transcoding scheduling decisions mainly focuses on making intelligent real-time transcoding decisions at the network edge based on a large amount of real-time information from the network and customers. To meet the high-quality video service requirements of different users, DRL-based transcoding scheduling strategies aim to achieve personalized QoE with minimized system cost.

2.3. Our Vision and Motivation. Inspired by the success of DRL in solving complicated control problems, DRL-based methods are commonly used in caching and transcoding strategies for MEC systems, but several issues still need to be resolved. (i) Most existing work studies a single MEC server; however, a single MEC server does not have enough storage and computing capacity to satisfy the needs of different users. (ii) There is little research on the cooperation mode and efficiency of multi-MEC servers, yet the completion of intensive tasks requires efficient collaboration among them. (iii) In multi-MEC server systems, load balance among MEC servers and the resource utilization of each MEC server are largely ignored. (iv) Adaptively collaborative caching and transcoding methods in ABR streaming that respond to users' network conditions need further exploration.

[Figure 1: A CDN and Cluster-based Mobile Edge Computing (2C-MEC) system. The diagram shows edge areas containing CDN tips, cluster heads/elements, and mobile terminals, connected to the Internet CDN via backbone, backhaul, RAN, and D2D links.]

To address these issues, this paper presents a CDN and Cluster-based Mobile Edge Computing (2C-MEC) system model, which promotes cooperation among MEC servers and reduces unnecessary backhaul consumption and processing costs. Then, aiming to exploit MEC storage and processing capabilities to improve the performance of high-quality streaming services, we focus on solving collaborative caching and transcoding for multi-MEC servers by using the DRL algorithm in the 2C-MEC system model.

3. Framework Design of System

3.1. 2C-MEC System Model. In order to meet the transmission requirements of real video services on the internet, a video transmission strategy based on mobile edge computing must account for the heterogeneous wireless access network environment and popular video transmission technology. As shown in Figure 1, this paper proposes a video transmission model combining Cluster-based MEC and CDN technology, called the CDN and Cluster-based Mobile Edge Computing system.

This video transmission model based on mobile edge computing connects seamlessly with the currently popular CDN video transmission technology. In this model, the edge area consists of the CDN tips (that is, the "edge nodes" of the CDN, called "CDN tips" in this paper to distinguish them from edge computing nodes) and many edge computing nodes in the local area (which may be deployed at small base stations, macro base stations, and locations above the macro base stations). The computing, storage, and communication capabilities of the edge computing nodes are thereby used to assist the deployment of sparse CDN tips to optimize wireless video transmission across the entire network.

Due to the large number of edge nodes and the large differences in capability among them, a hierarchical management model is proposed to cluster the edge nodes. The communication protocols within and among clusters can draw on related research in sensor networks and P2P networks. The influencing factors of the edge node clustering strategy include edge node capabilities, geographic location distribution, and the number and activity of users. The 2C-MEC system promotes mutual cooperation among MEC servers and reduces unnecessary backhaul consumption and processing costs.

Based on the proposed Cluster-based Mobile Edge Computing framework, on the one hand, the storage and computing capabilities of the MEC servers are improved: the 2C-MEC system enables the MEC servers' collaboration within the cluster to provide sufficient storage and computing power to meet users' needs. On the other hand, collaboration among MEC servers is promoted. Under this framework, it is possible to pursue a multi-MEC collaboration method within the cluster, which focuses on exploring effective ways for multi-MEC servers to collaborate on caching and transcoding. By contrast, existing studies have focused on "cloud-edge" or "edge-edge" collaboration.

In this paper, we design the edge node clustering algorithm based on the following ideas: (i) firstly, cluster division is based on the principle of geographic proximity. (ii) Secondly, the overall service capabilities of the nodes in a cluster should match their users' needs, and the edge service capabilities of different clusters should be balanced to a certain extent. (iii) Thirdly, if an edge node is located in the intersection area of two clusters, the appropriate cluster is selected based on the similarity between the video access preferences of the users managed by this node and those of the users managed by the other nodes in each cluster. (iv) Finally, after clustering is completed, we comprehensively consider the computing, storage, and communication capabilities of each edge node, along with its communication delay to the other nodes in the cluster, to elect the cluster head.

In short, the 2C-MEC system model proposed in this paper is compatible with popular CDN technology, so existing CDN research results on cache replacement, content prefetching, and load balancing can be conveniently reused. Furthermore, the ability of MEC to utilize heterogeneous edge nodes with different capabilities and deployments further improves the quality of video transmission.

3.2. Rebuffer Model. To keep playback continuous in a video streaming service, a playback buffer is usually deployed at the user device, into which video chunks are downloaded. The rebuffer model used in this paper comes from Reference [60]. Let $B(t)$ denote the bitrate of the chunk at time stage $t$ for the user, and let $W(t)$ denote the wireless transmission rate (bits/second) the user experiences during time stage $t$. The buffer occupancy rate $L(t)$ is defined as follows:

$$L(t) = \frac{\text{Buffer occupancy}}{\text{Buffer size}}. \quad (1)$$

When $B(t)/W(t) < 1$, the new video chunk is put into the buffer at a rate of less than 1, and the buffer drains. Conversely, if more than one chunk is played before the next chunk arrives, the buffer is depleted and rebuffering happens. So, in the rebuffer model, the terms rebuffering time and buffered video time are usually introduced, as in Reference [56]. A video consists of chunks, each containing a fixed duration of video, such as $D$ seconds. Let $T(t)$ denote the buffered video time in the playback buffer at the beginning of time stage $t$. In the rebuffer model, we assume that one chunk is downloaded into the buffer at a time. The total downloading time of one chunk during time stage $t$, denoted by $d(t)$, can be expressed as

$$d(t) = \frac{B(t) \cdot D}{W(t)}. \quad (2)$$

Furthermore, the video rebuffering time of the playback buffer during time stage $t$ is denoted as $R(t)$. Then we have

$$R(t) = \max(d(t) - T(t), 0), \qquad T(t+1) = D + \max(T(t) - d(t), 0). \quad (3)$$
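To make the chunk-level dynamics concrete, the following is a minimal sketch in Python of Equations (2) and (3), simulating the buffered video time and rebuffering across time stages; the bitrate and bandwidth trace are illustrative values, not taken from the paper's experiments.

```python
# Minimal sketch of the rebuffer model in Eqs. (2)-(3).
# The bitrate and bandwidth trace below are illustrative values.

D = 10.0  # chunk duration in seconds (the paper sets D = 10 s in Section 6)

def step(T, B, W):
    """One time stage: download one chunk of bitrate B (bits/s)
    over a link of rate W (bits/s), given buffered video time T."""
    d = B * D / W                       # Eq. (2): download time of one chunk
    R = max(d - T, 0.0)                 # Eq. (3): rebuffering time
    T_next = D + max(T - d, 0.0)        # Eq. (3): buffered time at next stage
    return T_next, R

# Example: a fixed 4 Mbps layer over a fluctuating link.
T = D                                    # start with one chunk buffered
for W in [6e6, 5e6, 2e6, 3e6, 8e6]:      # hypothetical bandwidth trace (bits/s)
    T, R = step(T, B=4e6, W=W)
    print(f"W={W/1e6:.0f} Mbps  buffered={T:.1f}s  rebuffer={R:.1f}s")
```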
3.3. Video Quality Rate Model. In video processing, the Peak Signal-to-Noise Ratio (PSNR) metric is the de facto standard criterion for objectively evaluating the quality of a compressed frame against the original. In video quality evaluation, the video quality rate $q(t)$ of a video coded at rate $B(t)$ can be approximated by a logarithmic function [61] as follows:

$$q(t) = \beta \log(B(t)), \quad (4)$$

where the value of $\beta$ can be obtained from the video encoder during encoding at the video source. The quality rate $q(t)$ is a nondecreasing function, meaning that a higher bitrate may correspond to a high-definition video while a lower bitrate may correspond to a standard-definition video.

Let $\{B_1, B_2, \cdots, B_{\max}\}$ be the set of all video layers after video transcoding, with $B_{\max}$ the highest video level at the MEC servers, and let $B_u(i, t) \in \{B_1, B_2, \cdots, B_{\max}\}$ denote the bitrate assigned to user $i$ at timeslot $t$.
3.4. Cache Hit Rate Model. In our setting, the requests of all users are served by the MEC servers, all videos have the same size, and there are no priorities among users, while different videos have different popularities. The video popularity distribution is always the key to solving the video caching problem. Considering the changing popularities, the probability that video $v$ is requested is defined as $Z_v$, which follows the Zipf distribution [16]:

$$Z_v = \frac{v^{-\alpha}}{\sum_{v=1}^{V} v^{-\alpha}}, \quad (5)$$

where $\alpha > 0$ is the Zipf distribution parameter indicating the degree of skewness. In our setting, the video streaming service quality of content caching can be evaluated in terms of the cache hit rate. The cache hit rate $\mathrm{CRH}(t)$ over $T$ requests during time stage $t$ is defined [40] as

$$\mathrm{CRH}(t) = \frac{\sum_{i=1}^{T} l(H_i)}{T}, \quad (6)$$

where the indicator function $l(H_i)$ is defined as

$$l(H_i) = \begin{cases} 1, & H_i \in C_T, \\ 0, & H_i \notin C_T, \end{cases} \quad (7)$$

where $C_T$ represents the cache state during this period; if the video is cached, the request $H_i$ hits in the cache.
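The following is a small sketch of Equations (5)–(7): it draws requests from a Zipf popularity distribution and measures the hit rate of a cache holding the most popular videos. The cache size and request count are illustrative assumptions; the catalog size of 50 videos and the Zipf parameter α = 1.3 match the paper's experimental setting.

```python
import random

def zipf_probs(V: int, alpha: float):
    """Eq. (5): request probability Z_v for videos v = 1..V."""
    weights = [v ** -alpha for v in range(1, V + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative setting: 50 videos (as in Section 6), alpha = 1.3.
V, alpha = 50, 1.3
probs = zipf_probs(V, alpha)
requests = random.choices(range(1, V + 1), weights=probs, k=1000)

cached = set(range(1, 11))  # cache the 10 most popular videos (assumption)

# Eqs. (6)-(7): hit rate = fraction of requests found in the cache.
hits = sum(1 for h in requests if h in cached)
print(f"cache hit rate: {hits / len(requests):.3f}")
```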
Sðt Þ = fCb ðt Þ, Lðt Þ, Rðt Þ, qðt Þg, ð11Þ
3.5. System Cost Model. In the system cost model, most of the operational cost of a video streaming service consists of bandwidth cost and transcoding cost; the fraction of other service costs is negligible compared with these two. The bandwidth cost $C_b(t)$ [62] of all MEC servers in the cluster is obtained by the following formula:

$$C_b(t) = \sum_{n=1}^{M} P(n, t) \cdot W(n, t), \qquad W(n, t) = \sum_{i \in U_t} B_u(i, t) \cdot I_t(i, n), \quad n \in \{0, \cdots, M-1\}, \quad (8)$$

where $U_t$ and $M$ are the user group and the number of servers in the cluster at time stage $t$, $I_t(i, n)$ is an indicator of whether user $i$ is connected to MEC server $n$ at time stage $t$, and $P(n, t)$ and $W(n, t)$ are the unit bandwidth price and the amount of bandwidth usage at MEC server $n$, respectively.

Besides the bandwidth cost, the video streaming service also needs to consider the transcoding cost. Based on the definition and description of video transcoding in [56, 62], the transcoding cost is closely related to the input bitrate, the target bitrate, the video length, and the number of CPU cores needed for transcoding according to the video pricing model. We define the transcoding cost incurred at time stage $t$ as

$$O(t) = \sigma \cdot (L_{\max} - l) \cdot T_v \cdot N_{\mathrm{cpu}}, \quad l \in \{L_1, L_2, \cdots, L_{\max}\}, \quad (9)$$

where $\sigma$ is an adjustable parameter and $l$, $T_v$, and $N_{\mathrm{cpu}}$ represent the level of the input video, the video length, and the number of CPU cores required for transcoding, respectively.

To simplify the problem formulation, the operational cost in our system cost model consists mainly of bandwidth cost and transcoding cost. These two costs have different measurement units: bandwidth cost reflects network transmission capacity, while transcoding cost reflects the computing power of the MEC node, so it is not easy to unify their dimensional units. However, the simulation experiments only compare costs across different environments. Therefore, as in the design of Reference [62], the bandwidth cost and transcoding cost can be regarded as dimensionless values, and the details of the units of measurement need not be considered. The operational cost can be expressed as

$$C(t) = C_b(t) + O(t). \quad (10)$$
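To show how Equations (8)–(10) combine, below is a minimal sketch computing the per-stage operational cost for a toy cluster; the prices, user-to-server assignment, and video length are invented for illustration, with σ = 1.2 taken from the paper's parameter setting.

```python
# Minimal sketch of the system cost model, Eqs. (8)-(10).
# Prices, assignments, and video lengths below are illustrative.

SIGMA = 1.2  # adjustable transcoding-cost parameter (Section 6 uses 1.2)

def bandwidth_cost(price, usage):
    """Eq. (8): C_b(t) = sum_n P(n,t) * W(n,t)."""
    return sum(p * w for p, w in zip(price, usage))

def transcoding_cost(l, l_max, video_len, n_cpu):
    """Eq. (9): O(t) = sigma * (L_max - l) * T_v * N_cpu."""
    return SIGMA * (l_max - l) * video_len * n_cpu

# Toy cluster of 3 MEC servers: unit prices and bandwidth usage (Mbps),
# where usage per server sums the bitrates B_u(i,t) of its connected users.
price = [1.0, 1.2, 0.8]
usage = [4 + 2, 1, 10]

c_b = bandwidth_cost(price, usage)
o = transcoding_cost(l=2, l_max=4, video_len=10, n_cpu=4)  # layer 4 -> layer 2
print(f"C(t) = C_b(t) + O(t) = {c_b:.1f} + {o:.1f} = {c_b + o:.1f}")  # Eq. (10)
```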
the short and long-term cache hit rate, which is defined as
4. Optimization Problem Formulation

Using the DRL algorithm for resource optimization in the 2C-MEC system, we describe the three basic elements of reinforcement learning: the state, action, and reward of the collaborative video caching and transcoding optimization problem.

4.1. State Space. The state at time stage $t$ is jointly determined by a four-tuple: the current bandwidth cost $C_b(t)$, the current buffer occupancy rate $L(t)$, the current rebuffering time $R(t)$, and the current video quality $q(t)$. The state $S(t)$ at time stage $t$ is defined as follows:

$$S(t) = \{C_b(t), L(t), R(t), q(t)\}, \quad (11)$$

where the state space is denoted as $S$.

4.2. Action Space. The control action of the agent is to select the video caching strategy and video transcoding strategy for the next requested video chunk according to the current system state. In this network, the action at each time stage $t$ is the joint video cache update, $\mathrm{cache}(M(t), U(t))$, and the video transcoding layer adaptation decision, $B_u(i, t)$.

The action is selected from the action set $A(t)$, in which $M(t)$, $U(t)$, and $B_u(i, t)$ represent the MEC server selected in the cluster, the video cache updating decision, and the target video layer, respectively. The action space can then be described as

$$A(t) = \{M(t), U(t), B_u(i, t)\}, \quad (12)$$

where the action space is denoted as $A$.

In practice, the number of MEC servers in a cluster and the set of all video layers are not large, and the video cache updating decision is simply yes or no, so the number of possible actions in the action set for the collaborative video caching and transcoding problem is not very large.

4.3. Reward. The reward should reflect the objective of the framework, which in our case is to reduce the operational cost and obtain the best QoE for users by solving collaborative caching and transcoding for multi-MEC servers. We define the reward function during time stage $t$, denoted by $r(t)$, as follows:

$$r(t) = \omega_1 \mathrm{CRH}_{sl}(t) + \lambda q(t) - \omega_2 \|q(t) - q(t-1)\| - \omega_3 R(t) - \omega_4 C_b(t) - \omega_5 O(t). \quad (13)$$

The first term on the right-hand side of (13) is the weighted sum of the short- and long-term cache hit rates. Considering the number of requests for local video in the next epoch, the short-term cache hit rate $\mathrm{CRH}_s(t)$ can be either 0 or 1. The long-term cache hit rate $\mathrm{CRH}_l(t) \in [0, 1]$ is the total normalized number of requests for local video within the last 20 requests. The total cache hit rate $\mathrm{CRH}_{sl}(t)$ for each step is the weighted sum of the short- and long-term cache hit rates:

$$\mathrm{CRH}_{sl}(t) = \mathrm{CRH}_s(t) + \mu \cdot \mathrm{CRH}_l(t), \quad (14)$$

where $\mu$ is the weight balancing the short- and long-term cache hit rates.
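The reward in Equations (13) and (14) can be written directly as a function of the state quantities defined above. The sketch below uses the weight values reported in Section 6 (ω₁ = 1, λ = 0.9, ω₂ = 0.9, ω₃ = 0.1, ω₄ = 0.1, ω₅ = 0.1, μ = 0.6); the sample inputs are invented for illustration.

```python
# Sketch of the reward function, Eqs. (13)-(14), with the
# weights from Section 6; the sample inputs are illustrative.
W1, LAM, W2, W3, W4, W5, MU = 1.0, 0.9, 0.9, 0.1, 0.1, 0.1, 0.6

def total_hit_rate(crh_s: int, crh_l: float) -> float:
    """Eq. (14): weighted sum of short-term (0/1) and long-term hit rates."""
    return crh_s + MU * crh_l

def reward(crh_s, crh_l, q, q_prev, rebuf, c_b, o_cost):
    """Eq. (13): hit rate and quality are rewarded; quality switching,
    rebuffering, bandwidth cost, and transcoding cost are penalized."""
    return (W1 * total_hit_rate(crh_s, crh_l)
            + LAM * q
            - W2 * abs(q - q_prev)
            - W3 * rebuf
            - W4 * c_b
            - W5 * o_cost)

print(reward(crh_s=1, crh_l=0.8, q=9.0, q_prev=7.5,
             rebuf=0.0, c_b=14.0, o_cost=9.6))
```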

[Figure 2: The design of the DRL-based intracluster collaborative caching and transcoding framework. The agent (evaluate network, target network, and replay memory) observes the state of the environment (MEC servers in the cluster and mobile terminals), takes actions (caching update, transcoding decision, and MEC server selection), and receives a QoE-based reward.]

The second, third, and fourth terms on the right-hand side of (13) are the video quality, the video quality variation, and the video rebuffering time, respectively. The fifth and sixth terms are penalty terms for the bandwidth cost and transcoding cost in each step. The total cache hit rate, video quality, video quality variation, and video playback rebuffering time are directly associated with user-perceived QoE in the video streaming service, and $\omega_1, \lambda, \omega_2, \omega_3, \omega_4, \omega_5$ are the weighting parameters.

4.4. Problem Formulation. Our objective is to derive the jointly optimal video caching and video transcoding policy that maximizes the rewards of the video streaming service. Future rewards and present rewards have different importance and weights because of the uncertainty of the system dynamics. The objective of the joint video caching and transcoding policy is to maximize the expected average reward. We can therefore formulate the dynamic optimization problem as a Markov decision process (MDP) as follows:

$$\max_{M(t), U(t), B_u(i,t)} J(t) = \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t r(t)\right]$$
$$\text{s.t.} \quad C1: M(t) \in \{0, 1, \cdots, M\}, \ \forall t,$$
$$\qquad\ \ C2: U(t) \in \{0, 1\}, \ \forall t,$$
$$\qquad\ \ C3: B_u(i, t) \in \{B_1, B_2, \cdots, B_{\max}\}, \ \forall t, \quad (15)$$

where $\gamma \in (0, 1]$ is the discount factor.
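As a small numerical illustration of the objective in (15), the sketch below accumulates per-stage rewards into a discounted return for one trajectory; both the discount value 0.9 and the reward sequence are assumptions made for illustration.

```python
GAMMA = 0.9  # discount factor in (0, 1]; the paper does not state its value

def discounted_return(rewards, gamma=GAMMA):
    """Objective of Eq. (15): J = sum_t gamma^t * r(t) for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([5.9, 6.2, 4.8, 7.1]))  # illustrative reward sequence
```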
Solving this optimization problem directly is impractical because of the large number of states in the state space. However, the DRL algorithm has proved to be a useful mathematical tool for large-scale optimization problems, as it does not need any prior knowledge of the state transition probabilities. Based on this, we propose a DRL-based algorithm to solve the optimization problem in formulation (15). The design of the DRL-based intracluster collaborative caching and transcoding framework is shown in Figure 2.

5. DRL-Based Intracluster Collaborative Caching and Transcoding Algorithm

5.1. Deep Reinforcement Learning-Based Collaborative Video Caching and Transcoding for Multi-MEC Servers. Given DQN's excellent performance on discrete state and action spaces, we adopt DQN for learning the intracluster collaborative caching and transcoding policy. Specifically, as illustrated in Figure 2, the inputs of the deep neural network are the video service system states listed in Equation (11), and the outputs of the network are the Q value function, $Q(s, a; \theta)$, for each action listed in Equation (12). The details of the DRL-based learning algorithm for collaborative caching and transcoding for multi-MEC servers are given in Algorithm 1.

6. Simulation Results and Analysis

In this section, we first describe the experiment settings. Then, computer simulations are carried out to demonstrate the performance of the proposed DRL algorithm for collaborative caching and transcoding for multi-MEC servers in mobile edge computing wireless networks.

6.1. Experimental Settings

6.1.1. Data Generation. In our experiments, the user request data is generated randomly, while the video data of users' requests is generated according to the Zipf distribution. We collected different numbers of requests in one episode as the testing data, such as 30, 40, and 50. To make the experiment more comprehensive, we generated two types of data sets. First, the video data set with different numbers of user requests was generated with an unchanged popularity distribution, with the Zipf parameter set to 1.3. Second, the video data set with the same number of user requests was generated with a varying Zipf parameter.

1: Initialization:
2: Initialize replay memory D to capacity N
3: Initialize the Q network and the target Q network with random weights
4: Initialize the MEC service matrix V of requests
5: for episode = 1, M do
6:   Generate the user request data
7:   Observe the initial state s(1) as illustrated in Eq. (11)
8:   for t = 1, T do
9:     Draw a random probability ς ∈ [0, 1]
10:    Choose action A(t) as listed in Eq. (12): A(t) = argmax_a Q(s, a; θ) if ς > ε; otherwise, select a(t) randomly
11:    Based on the action A(t), execute the transcoding policy and the caching update
12:    Observe the reward r(t) and the next state s(t+1)
13:    Store the transition (s(t), A(t), r(t), s(t+1)) in D
14:    Update the MEC service matrix V of requests
15:    Sample a random minibatch of transitions (s(j), A(j), r(j), s(j+1)) from D
16:    Set y_j = r_j for terminal s′, and y_j = r_j + γ max_{a′} Q(s′, a′; θ_{i−1}) for nonterminal s′
17:    Perform a gradient descent step on the loss L_i(θ_i) = E_{s,a∼ρ(·)}[(y_i − Q(s, a; θ_i))²]
18:    Update the parameters in the Q network
19:    Reset the parameters in the target Q network every G time stages
20:  end for
21: end for

Algorithm 1: Deep reinforcement learning algorithm for collaborative video caching and transcoding (DRL-CCT).
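For concreteness, the following is a compact sketch of the DQN machinery in Algorithm 1 using TensorFlow/Keras, mirroring the settings stated later in Section 6 (two hidden layers of 256 and 512 units, MSE loss, learning rate 0.01, replay capacity 2000, batch size 32, 56 actions, discount 0.9 assumed). The environment interaction is stubbed out: `env_step` is a hypothetical placeholder for the 2C-MEC simulator, which the paper does not publish, and ε is fixed here rather than annealed.

```python
import random
from collections import deque

import numpy as np
import tensorflow as tf

STATE_DIM, N_ACTIONS = 4, 56        # state of Eq. (11); 2 x 4 x 7 actions
GAMMA, EPSILON, BATCH = 0.9, 0.1, 32   # epsilon fixed for brevity (assumption)
REPLAY_CAPACITY, TARGET_SYNC = 2000, 100

def build_q_net():
    """Fully connected Q network: 2 hidden layers of 256 and 512 units."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(STATE_DIM,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(N_ACTIONS),          # Q(s, a; theta) per action
    ])

q_net, target_net = build_q_net(), build_q_net()
target_net.set_weights(q_net.get_weights())
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
replay = deque(maxlen=REPLAY_CAPACITY)

def choose_action(state):
    """Naive epsilon-greedy exploration (Algorithm 1, lines 9-10)."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    q = q_net(state[None, :], training=False)[0]
    return int(tf.argmax(q))

def train_step():
    """One minibatch update (Algorithm 1, lines 15-18)."""
    batch = random.sample(list(replay), BATCH)
    s, a, r, s2, done = map(np.array, zip(*batch))
    # y = r for terminal states, r + gamma * max_a' Q_target(s', a') otherwise
    q_next = tf.reduce_max(target_net(s2.astype(np.float32)), axis=1)
    y = (r + GAMMA * q_next.numpy() * (1.0 - done)).astype(np.float32)
    with tf.GradientTape() as tape:
        q = q_net(s.astype(np.float32))
        q_sa = tf.gather(q, a, batch_dims=1)
        loss = tf.reduce_mean(tf.square(y - q_sa))   # MSE loss
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))

def env_step(state, action):
    """Hypothetical stand-in for the 2C-MEC environment: returns the next
    state, the reward of Eq. (13), and a terminal flag."""
    return np.random.rand(STATE_DIM).astype(np.float32), np.random.randn(), False

state, step = np.random.rand(STATE_DIM).astype(np.float32), 0
for _ in range(500):                      # training loop (lines 5-21)
    action = choose_action(state)
    next_state, reward, done = env_step(state, action)
    replay.append((state, action, reward, next_state, float(done)))
    state, step = next_state, step + 1
    if len(replay) >= BATCH:
        train_step()
    if step % TARGET_SYNC == 0:           # reset target network every G stages
        target_net.set_weights(q_net.get_weights())
```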

6.1.2. Parameter Setting. In our experiments, we set 7 MEC servers in one cluster, which serve 30 users in the region and provide about 50 videos for users' requests. We set $D = 10\,\mathrm{s}$, $\beta = 6.5$, $\alpha = 1.3$, $\mu = 0.6$, and $\sigma = 1.2$; the weights associated with the cache hit rate and QoE in the reward function are set as $\omega_1 = 1$, $\lambda = 0.9$, $\omega_2 = 0.9$, $\omega_3 = 0.1$, and the weights associated with the cost penalties are set as $\omega_4 = 0.1$, $\omega_5 = 0.1$.

In the experiment, there are four video layers, with $B_{\max} = 10$ Mbps as the highest layer at the MEC server. The bitrates of the three transcoded layers are $B_1 = 1$ Mbps, $B_2 = 2$ Mbps, and $B_3 = 4$ Mbps, and the set of available CPU cores at the MEC is $\{2, 4, 6, 8\}$. Video transcoding from $B_{\max}$ to $B_1$, $B_2$, and $B_3$ needs 2, 4, and 6 CPU cycles, respectively. With 2 caching decisions (yes or no), 4 video bitrates, and 7 MEC servers in one cluster, the number of actions in the action set $A$ is $2 \times 4 \times 7 = 56$.
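The 56-action set of Equation (12) can be enumerated directly from these settings. A minimal sketch follows, assuming actions are flattened into integer indices for the DQN output layer (the paper does not specify the encoding):

```python
from itertools import product

# Action components from Section 6.1.2: 7 MEC servers in the cluster,
# a yes/no cache-update decision, and 4 video layers (Mbps).
MEC_SERVERS = range(7)
CACHE_UPDATE = (False, True)
LAYERS_MBPS = (1, 2, 4, 10)

# Flatten A(t) = {M(t), U(t), B_u(i,t)} into integer action indices;
# the ordering here is an assumed convention, not taken from the paper.
ACTIONS = list(product(MEC_SERVERS, CACHE_UPDATE, LAYERS_MBPS))
assert len(ACTIONS) == 2 * 4 * 7 == 56

def decode(index: int):
    """Map a DQN output index back to (server, cache update, target layer)."""
    return ACTIONS[index]

print(decode(0))   # (0, False, 1)
print(decode(55))  # (6, True, 10)
```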
6.1.3. Deep Neural Network for DQN. We use a fully connected neural network with 2 hidden layers of sizes 256 and 512. The loss function is the mean square error. The naive ε-greedy strategy is used for exploration, with ε the probability of randomly choosing an action during training; as learning progresses, the degree of exploration shrinks. The learning rate is 0.01, the size of the experience replay in DQN is 2000, the attenuation parameter used to update the target Q network is 0.9, and the batch size in stochastic batch gradient descent is 32. The experiments are implemented using Python and TensorFlow.
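The paper states that exploration shrinks over training but does not give the schedule; below is one common choice, a multiplicative ε decay with a floor, shown purely as an assumed illustration.

```python
# Hypothetical epsilon-annealing schedule for the naive epsilon-greedy
# exploration; the decay rate and floor are assumptions, since the paper
# only states that exploration "continues to shrink" during training.
EPS_START, EPS_MIN, EPS_DECAY = 1.0, 0.01, 0.995

eps = EPS_START
for episode in range(1, 41):
    # ... run one training episode with exploration probability eps ...
    eps = max(EPS_MIN, eps * EPS_DECAY)
    if episode % 10 == 0:
        print(f"episode {episode:2d}: epsilon = {eps:.3f}")
```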
6.2. Simulation Results. In this section, we compare the proposed DRL algorithm (called DRL-CCT) with the latest baseline methods: the method of Reference [51] (called caching only at network edge) and the method of Reference [56] (called transcoding only at network edge). In our experimental framework, we simulated the above methods according to the form of the reward function in the respective literature. We also compare the proposed DRL algorithm with DRL-CCT without the transcoding policy. Owing to the stochastic nature of deep reinforcement learning, all reported results for our proposed algorithm were obtained by averaging 20 algorithm executions.

Figure 3 shows the convergence performance of the DRL-CCT algorithm under the full-weight setting at different learning rates. With continuous learning, the average reward gradually stabilizes. Compared with the balanced method of the algorithm in Reference [56], the average reward of our proposed algorithm converges faster, though its subsequent fluctuations are slightly larger; in return, the deep network used in our DRL-CCT algorithm is more concise and efficient. The convergence performance is influenced by the learning rate: a learning rate of 0.01 performs better than learning rates of 0.1, 0.001, and 0.0001.

[Figure 3: The convergence performance of the DRL-CCT algorithm at different learning rates (0.1, 0.01, 0.001, 0.0001); average reward vs. episode.]

[Figure 4: Cache hit rate vs. cache ratio for DRL-CCT, transcoding only at network edge, caching only at network edge, and DRL-CCT without transcoding policy.]

Convergence becomes worse at a learning rate of 0.1, owing to the large update step, which makes the average reward converge to a local optimum. In fact, an appropriate learning rate depends on the state of the environment in the current optimization process.

Figure 4 compares the cache hit rates of the different algorithms at the same cache ratios. Compared with the other algorithms, the DRL-CCT algorithm achieves a higher cache hit rate.

[Figure 5: Cache hit rate vs. Zipf exponent for the DRL-CCT algorithm.]

Since the 2C-MEC system model has been proposed, the cluster-based video cache hit rate is definitely better than a cache hit rate based on a single MEC server, especially when the cache ratio is relatively small. In addition, the cache hit rate of the DRL-CCT without transcoding policy algorithm is the worst, because only the highest version of each video is cached at the MEC. Owing to the absence of a transcoding function at the network edge, the MEC server has to go back to the source server whenever a user requests another version of the video, which results in a low cache hit rate.

In Figure 5, we study the cache hit rate as a function of the Zipf exponent. As the Zipf exponent increases, the cache hit rate achieved by the caching policy first increases and then decreases. This is because, with a larger Zipf exponent, the video popularity distribution is more concentrated and the popularity of the files is more skewed; consequently, caching these more popular videos initially increases the cache hit rate. The cache hit rate then falls: the DRL-CCT algorithm stores the most popular files first while the number of popular files is small, but it eventually experiences diminishing returns as the Zipf exponent is further increased, and the larger the Zipf exponent, the smaller the influence of the less popular files.

As for the average QoE performance in Figure 6, DRL-CCT is much better than "transcoding only" and the other two algorithms. Due to the long rebuffering time, the average QoE values of the DRL-CCT without transcoding algorithm and the "caching only" algorithm are below zero all the time. Compared with these methods, which have no joint caching and transcoding at the edge, DRL-CCT has the highest QoE, meaning users get a much better experience in video streaming services. It can be seen from Figure 7 that when there is no transcoding function at the network edge, the bandwidth cost is greater than that of the DRL-CCT algorithm, because uncached videos have to be fetched from the source server, which consumes a great deal of bandwidth. The difference in bandwidth cost between the "transcoding only" algorithm and the DRL-CCT algorithm is slight in the later stage.

The average bandwidth cost and QoE performance of the DRL-CCT algorithm under different experimental settings are shown in Figures 8–11. Figures 8 and 9 show the performance for different numbers of requests in an episode. It can be seen from Figure 8 that as the number of user requests in a time slot increases, the average bandwidth cost of each MEC continues to increase. This is because the number of MEC servers is fixed: when the number of user requests increases, the number of requests served by each MEC must increase, which directly increases the average bandwidth cost of each MEC. Figure 9 shows directly that changing the number of user requests in a time slot does not greatly affect users' average QoE, and the QoE value of the video streaming service remains stable in a good range.

Figures 10 and 11 show the performance for different numbers of MECs within a cluster at the network edge. According to Figure 10, with the number of user requests in a time slot fixed, when the number of MEC nodes in the edge cluster decreases, the average bandwidth cost of each MEC increases at the beginning. However, as the deep reinforcement learning process progresses, the average bandwidth cost of each MEC tends to stabilize.

[Figure 6: The average QoE performance of the different algorithms.]

[Figure 7: The bandwidth cost performance of the different algorithms.]

This is due to the adaptive decision-making function of deep reinforcement learning, which continuously optimizes the MEC load distribution within an edge cluster. In Figure 11, as in Figure 9, the average QoE performance of the system remains stable, indicating that the proposed method is robust to environmental changes.

[Figure 8: The average bandwidth cost per MEC in the DRL-CCT algorithm at different request numbers (40, 50, 60) in an episode.]

[Figure 9: The average QoE performance of the DRL-CCT algorithm at different request numbers (40, 50, 60) in an episode.]

[Figure 10: The average bandwidth cost per MEC in the DRL-CCT algorithm with different numbers of MECs (5, 6, 7) within a cluster.]

[Figure 11: The average QoE performance of the DRL-CCT algorithm with different numbers of MECs (5, 6, 7) within a cluster.]

7. Conclusions

In this paper, we first propose a CDN and Cluster-based Mobile Edge Computing system that enhances caching and computing capacity and promotes collaboration among the MEC servers in a cluster. In addition, we formulate a novel deep reinforcement learning-based framework that automatically obtains the intracluster collaborative caching and transcoding decisions, which are executed based on video popularity, user requirement prediction, and the abilities of MEC servers. Numerical results are presented to validate the effectiveness of the proposed method.

Under the framework of the 2C-MEC system, this paper mainly researches promoting collaboration among the MEC servers in a cluster. In future work, intercluster collaboration needs to be considered when intracluster computing and storage capabilities are insufficient. If the terminal is assumed to have caching and computing capabilities, it is also possible to consider "edge-end" collaboration, "end-end" collaboration, and other collaboration modes to implement a multidimensional "cloud-edge-end" collaboration model among different agents. At the same time, load balancing among MEC servers in the mobile edge cluster still needs further research to explore efficient ways of resolving the contradiction between balancing the MEC servers and improving user QoE.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The work was supported by the National Natural Science Foundation of China (No. 61961021), the Science and Technology Project of Jiangxi Education Department (No. GJJ180251 and No. GJJ171011), and the Innovation Special Fund for Individual Graduate Student of Jiangxi University of Finance and Economics (2020 Annual, No. 24).

References

[1] N. Kato, B. Mao, F. Tang, Y. Kawamoto, and J. Liu, "Ten challenges in advancing machine learning technologies toward 6G," IEEE Wireless Communications, vol. 27, no. 3, pp. 96–103, 2020.
[2] K. Zhang, Y. Zhu, S. Maharjan, and Y. Zhang, "Edge intelligence and blockchain empowered 5G beyond for the industrial Internet of things," IEEE Network Magazine, vol. 33, no. 5, pp. 12–19, 2019.
[3] Y. Dai, D. Xu, S. Maharjan, G. Qiao, and Y. Zhang, "Artificial intelligence empowered edge computing and caching for internet of vehicles," IEEE Wireless Communications Magazine, vol. 26, no. 3, pp. 12–18, 2019.
[4] Cisco, Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2017–2022 White Paper, 2019, https://www.cisco.com/c/dam/m/en_in/innovation/enterprise/assets/mobile-white-paper-c11-520862.pdf.
[5] S. M. Azimi, O. Simeone, A. Sengupta, and R. Tandon, "Online edge caching and wireless delivery in fog-aided networks with dynamic content popularity," IEEE Journal on Selected Areas in Communications, vol. 36, no. 6, pp. 1189–1202, 2018.
[6] G. Gao, Y. Wen, and J. Cai, "vCache: supporting cost-efficient adaptive bitrate streaming," IEEE Multimedia, vol. 24, no. 3, pp. 19–27, 2017.
[7] Y. C. Hu, M. Patel, D. Sabella, N. Sprecher, and V. Young, "Mobile edge computing: a key technology towards 5G," ETSI White Paper, vol. 11, 2015.
[8] K. Zhang, Y. Mao, S. Leng et al., "Energy-efficient offloading for mobile edge computing in 5G heterogeneous networks," IEEE Access, vol. 4, pp. 5896–5907, 2016.
[9] A. Ahmed and E. Ahmed, "A survey on mobile edge computing," in Proc. IEEE Int. Conf. on Intelligent Systems and Control (ISCO), Nanjing, China, 2016.
[10] J. Liu, Y. Mao, J. Zhang, and K. B. Letaief, "Delay-optimal computation task scheduling for mobile-edge computing systems," in 2016 IEEE International Symposium on Information Theory (ISIT), pp. 1451–1455, Barcelona, 2016.
[11] Y. Mao, J. Zhang, and K. B. Letaief, "Dynamic computation offloading for mobile-edge computing with energy harvesting devices," IEEE Journal on Selected Areas in Communications, vol. 34, no. 12, pp. 3590–3605, 2016.
[12] T. X. Tran, A. Hajisami, P. Pandey, and D. Pompili, "Collaborative mobile edge computing in 5G networks: new paradigms, scenarios, and challenges," IEEE Communications Magazine, vol. 55, no. 4, pp. 54–61, 2017.
[13] Y. Sánchez de la Fuente, T. Schierl, C. Hellge et al., "iDASH: improved dynamic adaptive streaming over HTTP using scalable video coding," in Proceedings of ACM Multimedia Systems, pp. 23–25, 2011.
[14] A. Mehrabi, M. Siekkinen, and A. Ylä-Jääski, "Edge computing assisted adaptive mobile video streaming," IEEE Transactions on Mobile Computing, vol. 18, no. 4, pp. 787–800, 2019.
[15] D. Wang, Y. Peng, X. Ma et al., "Adaptive wireless video streaming based on edge computing: opportunities and approaches," IEEE Transactions on Services Computing, vol. 12, no. 5, pp. 685–697, 2019.
[16] T. X. Tran and D. Pompili, "Adaptive bitrate video caching and processing in mobile-edge computing networks," IEEE Transactions on Mobile Computing, vol. 18, no. 9, pp. 1965–1978, 2019.
[17] J. Yao, T. Han, and N. Ansari, "On mobile edge caching," IEEE Communications Surveys & Tutorials, vol. 21, 2019.
[18] S. Safavat, N. N. Sapavath, and D. B. Rawat, "Recent advances in mobile edge computing and content caching," Digital Communications and Networks, vol. 6, 2020.
[19] K. He, Z. Wang, W. Huang, D. Deng, J. Xia, and L. Fan, "Generic deep learning-based linear detectors for MIMO systems over correlated noise environments," IEEE Access, vol. 8, pp. 29922–29929, 2020.
[20] J. Xia, L. Fan, W. Xu et al., "Secure cache-aided multi-relay networks in the presence of multiple eavesdroppers," IEEE Transactions on Communications, vol. 67, no. 11, pp. 7672–7685, 2019.
[21] H. Liu, C. Lin, J. Cui, L. Fan, X. Xie, and B. F. Spencer, "Detection and localization of rebar in concrete by deep learning using ground penetrating radar," Automation in Construction, vol. 118, 2020.
[22] K. He, Z. Wang, D. Li, F. Zhu, and L. Fan, "Ultra-reliable MU-MIMO detector based on deep learning for 5G/B5G-enabled IoT," Physical Communication, vol. 43, p. 101181, 2020.
[23] S. S. Mousavi, M. Schukat, and E. Howley, "Deep reinforcement learning: an overview," in Proceedings of SAI Intelligent Systems Conference, pp. 426–440, Cham, 2016.
[24] S. S. Mousavi, M. Schukat, and E. Howley, "Deep reinforcement learning: an overview," in Intelligent Systems Conference 2018 (IntelliSys 2018), London, United Kingdom, 2018.
[25] Y. He, Z. Zhang, F. R. Yu et al., "Deep-reinforcement-learning-based optimization for cache-enabled opportunistic interference alignment wireless networks," IEEE Transactions on Vehicular Technology, vol. 66, no. 11, pp. 10433–10445, 2017.
[26] X. Wang, Y. Han, V. C. M. Leung, D. Niyato, X. Yan, and X. Chen, "Convergence of edge computing and deep learning: a comprehensive survey," IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 869–904, 2020.
[27] D. Guo, L. Tang, X. Zhang, and Y. Liang, "Joint optimization of handover control and power allocation based on multi-agent deep reinforcement learning," IEEE Transactions on Vehicular Technology, vol. 69, 2020.
[28] S. Lai, "Intelligent secure mobile edge computing for beyond 5G wireless networks," Physical Communication, vol. 99, pp. 1–8, 2020.
[29] R. Zhao, "Deep reinforcement learning based mobile edge computing for intelligent Internet of things," Physical Communication, vol. 43, article 101184, 2020.
[30] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 2018.
[31] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Playing Atari with deep reinforcement learning," NIPS Deep Learning Workshop, 2013.
[32] V. Mnih, K. Kavukcuoglu, D. Silver et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[33] B. Guo, X. Zhang, Y. Wang, and H. Yang, "Deep-Q-network-based multimedia multi-service QoS optimization for mobile edge computing systems," IEEE Access, vol. 7, pp. 160961–160972, 2019.
[34] I. AlQerm and B. Shihada, "Energy efficient power allocation in multi-tier 5G networks using enhanced online learning," IEEE Transactions on Vehicular Technology, vol. 66, 2017.
[35] X. He, K. Wang, H. Huang, T. Miyazaki, Y. Wang, and S. Guo, "Green resource allocation based on deep reinforcement learning in content-centric IoT," IEEE Transactions on Emerging Topics in Computing, vol. 8, 2020.
[36] Y. S. Nasir and D. Guo, "Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks," IEEE Journal on Selected Areas in Communications, vol. 37, 2019.
[37] Y. He, N. Zhao, and H. Yin, "Integrated networking, caching and computing for connected vehicles: a deep reinforcement learning approach," IEEE Transactions on Vehicular Technology, vol. 99, no. 10, 2017.
[38] J. Li, H. Gao, T. Lv, and Y. Lu, "Deep reinforcement learning based computation offloading and resource allocation for MEC," in IEEE Wireless Communications and Networking Conference (WCNC), pp. 1–6, Barcelona, Spain, 2018.
[39] L. Huang, S. Bi, and Y. J. Zhang, "Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks," IEEE Transactions on Mobile Computing, vol. 99, 2018.
[40] X. Chen, H. Zhang, C. Wu, S. Mao, Y. Ji, and M. Bennis, "Optimized computation offloading performance in virtual edge computing systems via deep reinforcement learning," IEEE Internet of Things Journal, vol. 6, 2018.
[41] S. Park, J. Kim, D. Kwon, M. Shin, and J. Kim, "Joint offloading and streaming in mobile edges: a deep reinforcement learning approach," in 2019 IEEE VTS Asia Pacific Wireless Communications Symposium (APWCS), Singapore, 2019.
[42] H. Zhang, W. Wu, C. Wang, M. Li, and R. Yang, "Deep reinforcement learning-based offloading decision optimization in mobile edge computing," in 2019 IEEE Wireless Communications and Networking Conference (WCNC), Marrakesh, Morocco, 2019.
[43] Z. Cheng and Z. Zheng, "Task migration for mobile edge computing using deep reinforcement learning," Future Generation Computer Systems, vol. 96, pp. 111–118, 2019.
[44] S. Wang, X. Zhang, Y. Zhang, L. Wang, J. Yang, and W. Wang, "A survey on mobile edge networks: convergence of computing, caching and communications," IEEE Access, vol. 5, no. 3, pp. 6757–6779, 2017.
[45] Z. Zhang, Y. Zheng, C. Li, Y. Huang, and L. Yang, "Cache-enabled adaptive bit rate streaming via deep self-transfer reinforcement learning," in 2018 10th International Conference on Wireless Communications and Signal Processing (WCSP), Hangzhou, China, 2018.
[46] L. Lei, L. You, G. Dai, T. X. Vu, D. Yuan, and S. Chatzinotas, "A deep learning approach for optimizing content delivering in cache-enabled HetNet," in International Symposium on Wireless Communication Systems (ISWCS), pp. 449–453, Bologna, Italy, 2017.
[47] L. Lei, X. Xiong, H. Lu, and K. Zheng, "Collaborative edge caching through service function chaining: architecture and challenges," IEEE Wireless Communications, vol. 25, no. 3, pp. 94–102, 2018.
[48] C. H. Wei, Y. W. Hung, and F. L. Chin, "Q-learning based collaborative cache allocation in mobile edge computing," Future Generation Computer Systems, vol. 102, pp. 603–610, 2020.
[49] Z. Yang, Y. Liu, Y. Chen, and G. Tyson, "Deep reinforcement learning in cache-aided MEC networks," in ICC 2019 - 2019 IEEE International Conference on Communications (ICC), Shanghai, China, 2019.
[50] Z. Chen, M. C. Gursoy, and S. Velipasalar, "A deep reinforcement learning-based framework for content caching," in 2018 52nd Annual Conference on Information Sciences and Systems (CISS), Princeton, NJ, USA, 2018.
[51] C. Zhong, M. C. Gursoy, and S. Velipasalar, "Deep reinforcement learning based edge caching in wireless networks," IEEE Transactions on Cognitive Communications and Networking, vol. 6, no. 1, pp. 48–61, 2020.
[52] M. C. Gursoy, C. Zhong, and S. Velipasalar, "Deep multi-agent reinforcement learning for cooperative edge caching," in Machine Learning for Future Wireless Communications, pp. 439–457, 2020.
[53] L. Liu, H. Hu, Y. Luo, and Y. Wen, "When wireless video streaming meets AI: a deep learning approach," IEEE Wireless Communications, vol. 27, 2019.
[54] H. Zhang, L. Dong, G. Gao, H. Hu, Y. Wen, and K. Guan, "DeepQoE: a multimodal learning framework for video quality of experience (QoE) prediction," IEEE Transactions on Multimedia, vol. 22, 2020.
[55] G. Gao, L. Dong, H. Zhang, Y. Wen, and W. Zeng, "Content-aware personalised rate adaptation for adaptive streaming via deep video analysis," in ICC 2019 - 2019 IEEE International Conference on Communications (ICC), Shanghai, China, 2019.
[56] Y. Guo, F. R. Yu, J. An, K. Yang, C. Yu, and V. C. M. Leung, "Adaptive bitrate streaming in wireless networks with transcoding at network edge using deep reinforcement learning," IEEE Transactions on Vehicular Technology, vol. 69, no. 4, pp. 3879–3892, 2020.
[57] M. Liu, Y. Teng, F. R. Yu, V. C. M. Leung, and M. Song, "A deep reinforcement learning-based transcoder selection framework for blockchain-enabled wireless D2D transcoding," IEEE Transactions on Communications, vol. 68, no. 6, pp. 3426–3439, 2020.
[58] F. Wang, C. Zhang, F. Wang et al., "Intelligent edge-assisted crowdcast with deep reinforcement learning for personalized QoE," in IEEE INFOCOM 2019, Paris, France, 2019.
[59] Z. Pang, L. Sun, T. Huang, Z. Wang, and S. Yang, "Towards QoS-aware cloud live transcoding: a deep reinforcement learning approach," in 2019 IEEE International Conference on Multimedia and Expo (ICME), pp. 670–675, Shanghai, China, 2019.
[60] T. Y. Huang, R. Johari, N. McKeown, M. Trunnell, and M. Watson, "A buffer-based approach to rate adaptation: evidence from a large video streaming service," ACM SIGCOMM Computer Communication Review, vol. 44, no. 4, pp. 187–198, 2014.
[61] M. Chen, M. Ponec, S. Sengupta, J. Li, and P. A. Chou, "Utility maximization in peer-to-peer systems with applications to video conferencing," IEEE/ACM Transactions on Networking, vol. 20, no. 6, pp. 1681–1694, 2012.
[62] Y. Zheng, D. Wu, Y. Ke, C. Yang, M. Chen, and G. Zhang, "Online cloud transcoding and distribution for crowdsourced live game video streaming," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 8, pp. 1777–1789, 2017.
