
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Transformer learning for traffic prediction in mobile networks

DANIEL WASS

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Author
Daniel Wass (dwass@kth.se)
Department of Intelligent Systems
KTH Royal Institute of Technology

Host Company
Ericsson AB
Stockholm, Sweden

Company Supervisor
Lackis Eleftheriadis
Senior Specialist Sustainable AI
Ericsson AB

Examiner
Olov Engwall
Division of Speech, Music, and Hearing
KTH Royal Institute of Technology

Supervisor
Atsuto Maki
Division of Robotics, Perception, and Learning
KTH Royal Institute of Technology

Abstract

The resources of mobile networks are expensive and limited, and as demand for mobile
data continues to grow, improved resource utilisation is a prioritised issue. Traffic
demand at base stations (BSs) varies throughout the day and week, but the capacity
remains constant and utilisation could be significantly improved based on precise,
robust, and efficient forecasting.

This degree project proposes a fully attention­based Transformer model for traffic
prediction at mobile network BSs. Similar approaches have proven extremely
successful in other domains, but there seems to be no previous work where a model
fully based on the Transformer is applied to predict mobile traffic. The proposed
model is evaluated in terms of prediction performance and required time for training
by comparison to a recurrent long short­term memory (LSTM) network.

The implemented attention-based approach consists of stacked layers of multi-head
attention combined with simple feed-forward neural network layers. It thus lacks
recurrence and was expected to train faster than the LSTM network. Results show
that the Transformer model is outperformed by the LSTM in terms of prediction error
in all performed experiments when compared after training for an equal number of
epochs. The results also show that the Transformer trains roughly twice as fast as
the LSTM, and when compared on equal premises in terms of training time, the
Transformer predicts with a lower error rate than the LSTM in three out of four
evaluated cases.

Keywords

Transformer, Attention, LSTM, Mobile traffic prediction.

Sammanfattning
Transformerinlärning för prediktion av mobil
nätverkstrafik
The demand for mobile data is constantly increasing and the resources of mobile networks
are both expensive and limited. At the same time, the capacity of base stations is
determined by the peak demand for their services, which leads to low utilisation of the
base stations' resources when demand is low. Through robust, accurate, and efficient
prediction of mobile traffic, a solution where the capacity instead follows demand can be
enabled, which would reduce excess resource consumption at low demand without
compromising the need for high capacity at high demand.

This study proposes a Transformer method, fully based on the attention mechanism, for
predicting traffic at base stations in mobile networks. Similar methods have proven
extremely successful in other domains, but Transformers without support from other
complex structures appear to be untried for mobile traffic prediction. For evaluation,
the method is compared to a neural network comprising nodes of the long short-term
memory (LSTM) type. The comparison is carried out with respect to training time and
prediction error.

The Transformer model consists of several stacked attention layers in combination with
ordinary feed-forward layers, and it was expected to train faster than the LSTM model.
The results of the study show that the Transformer model predicts the mobile traffic
with a higher error rate than the LSTM network when the models are compared after an
equal number of training epochs. The Transformer model, however, trains nearly twice as
fast, and when the models are compared on equal premises in terms of training time, the
Transformer model performs better than the LSTM model in three out of four evaluated
cases.

Keywords

Transformer, Attention, LSTM, Mobile network traffic prediction.

Acknowledgements

I would like to express my deepest appreciation to my supervisors Lackis Eleftheriadis
at Ericsson and Atsuto Maki at KTH for their invaluable guidance, patience, and
continuous advice throughout the entire work of this degree project. I also want to
thank my examiner Olov Engwall for providing accurate, important, and rapid input
in the final stages of the report.

Furthermore, I would like to extend my sincere thanks to Athanasios Karapantelakis
and Maxim Teslenko at the Ericsson team for offering their reasoning and valuable
discussions on the technical aspects of the work. The support I received from
Athanasios for setting up the environment for data collection and cloud computing
at Ericsson has been completely indispensable and, for this, I cannot express enough
thanks. Being welcomed onto their and Lackis’ team for the semester has been a
privilege and I have been granted many great opportunities for learning.

Acronyms

ADAM adaptive moment estimation
ANN artificial neural network
ARIMA autoregressive integrated moving average
BS base station
CNN convolutional neural network
FFNN feed­forward neural network
GPU graphics processing unit
GRU gated recurrent unit
ICT Information and Communication Technology
LSTM long short­term memory
LTE Long­Term Evolution
MA moving average
ML machine learning
MLP multi­layer perceptron
MSE mean squared error
QoS quality of service
ReLU rectified linear unit
RMSE root mean squared error
RNN recurrent neural network
UE user equipment

Contents

1 Introduction
  1.1 Problem
  1.2 Purpose and goal
  1.3 Delimitations
  1.4 Ethics and sustainability
  1.5 Outline

2 Background
  2.1 Time series prediction
  2.2 Machine learning for time series prediction
    2.2.1 Recurrent neural networks
    2.2.2 Vanishing and exploding gradients for RNNs
    2.2.3 An RNN without learning issues: LSTM
  2.3 Related work
    2.3.1 Mobile network traffic prediction
    2.3.2 Prediction without recurrence: Transformers
    2.3.3 Representation of time: Time2Vec

3 Method
  3.1 Model architecture of the Transformer model
  3.2 Model architecture of the recurrent LSTM
  3.3 Regularization
  3.4 Wilcoxon signed-rank for statistical testing
  3.5 Implementation details

4 Experimental work
  4.1 Data and pre-processing of data
    4.1.1 Data description
    4.1.2 Pre-processing
    4.1.3 Evaluation and splitting of data
  4.2 Model selection
    4.2.1 Parameter grid search for the LSTM
    4.2.2 Parameter grid search for the Transformer
  4.3 Experiments
    4.3.1 Four scenarios for training and testing
    4.3.2 Experiment 1
    4.3.3 Experiment 2

5 Discussion
  5.1 Interpretation of the experimental findings
    5.1.1 Comparing the Transformer to the LSTM
    5.1.2 Additional findings
  5.2 Limitations

6 Conclusion
  6.1 Future work

References

Chapter 1

Introduction

The resources of a mobile network are expensive and limited, and with the
development towards an internet of everything and everyone, the demand seems to be
ever-growing. Data packets are constantly requested in higher volumes and at lower
latency, whereas the industry struggles with goals of improved energy efficiency to
reduce its climate footprint [24]. The traditional approach to meet increases in demand
has been to match base station (BS) resources to peak traffic scenarios, which leads
to excessive capacity at off­peak hours [6]. Thus, alternative solutions of proactive
resource management can generate significant opportunities for improvement in
resource utilisation.

Proactive resource management relies on robust and precise traffic prediction to
not compromise quality of service (QoS), and forecasting in mobile networks is a
well­researched area. Older works commonly include relatively simple methods of
statistical analysis, such as the autoregressive integrated moving average (ARIMA) [19,
51]. However, recent leaps in technology development have allowed for more complex
computations to be run on large scale [4], paving the way for machine learning (ML)
approaches such as artificial neural networks (ANNs).

An ANN method which has reached enormous success in time series prediction tasks
is the recurrent neural network (RNN) implemented with long short­term memory
(LSTM) units [23]. The LSTM units allow the RNN to capture long­term dependencies,
where it otherwise suffers [3, 22]. Mobile traffic forecasting is no exception and
LSTM­based networks have often been the top­performing method for mobile traffic
forecasting [1, 25, 42, 45].


The attention­based Transformer model, introduced by Vaswani et al. in [44], has been
applied to a range of problem domains with unprecedented results, such as machine
translation [43], language understanding [11], speech recognition [12] and image
generation [35]. Lately, the Transformer has been applied to time series prediction
problems with results competitive to the state­of­the­art [32, 47, 48, 52], and the
attention mechanism of the Transformer has been included in LSTM, convolutional
neural network (CNN), and RNN­based work for mobile traffic prediction [13, 17,
21, 30]. However, to the best of the author’s knowledge, there is currently no work
applying a model fully based on the Transformer and its attention mechanism to traffic
prediction in mobile networks.

This degree project suggests a fully attention­based Transformer approach for mobile
traffic forecasting at a BS. The model is trained and tested on multi­variate real­world
data from a Long­Term Evolution (LTE) BS. The model performance is compared to
the domain­dominant approach of a recurrent LSTM network.

1.1 Problem
There are several problem areas related to the limited resources of mobile networks
that can benefit from proactive resource management. One of them is the industry
prioritised issue of improving energy efficiency while retaining QoS [24]. In the area
covered by a BS, each user equipment (UE) is connected to one of the BS’s radio units.
As the number of units is determined by peak traffic scenarios, and since the power
consumption of the units varies little with the number of connected UE, the power
consumption at off-peak traffic hours is excessive in relation to demand
[14, 15].

An approach to improve energy efficiency could be to, at off­peak traffic hours, off­load
connected UE into a smaller number of radio units which would enable deactivation of
redundant radio capacity to reduce power consumption. However, such an approach
would require robust and accurate traffic prediction to not compromise QoS due to
reduced capacity by radio deactivation. To achieve significant effect, the approach
must be implemented on large scale and thus, in addition to being robust and precise,
an underlying prediction model would need to be cost­efficient. The latter requirement
sheds light on the importance of low computational complexity and small required data
flows, both for energy efficiency and operational cost of continuously running models
and data streams.

Recurrent LSTM networks have proven successful at capturing temporal dependencies
for mobile traffic forecasting, but findings of previous work in other domains of
time series prediction indicate that a Transformer-based model could provide
improvements in terms of required time for training while performing similar or even
better in terms of error rates. Thus, this degree project considers the following research
question:

When predicting traffic in a mobile network BS, how does a fully attention­based
Transformer model compare to a recurrent LSTM network in terms of prediction
performance and training time?

1.2 Purpose and goal


By applying a fully Transformer­based model to a new application domain, the purpose
of this degree project is twofold. Firstly, the study aims at contributing to the current
knowledge base of completely attention-based time series prediction, and secondly
to contribute to the present state of research on traffic prediction in mobile networks.
Moreover, by fulfilling the latter, the project aspires to add valuable insights on
Transformers for mobile traffic prediction and inspire further studies on efficient
prediction architectures which in the long run can contribute to energy efficiency and
sustainability, not only in mobile networks.

The goal of this degree project is to answer the research question formulated in section
1.1 by proposing a fully Transformer­based model for traffic prediction at a mobile
network BS and evaluate it on prediction performance and required training time in
comparison to the domain­dominant recurrent LSTM network.

1.3 Delimitations
This project intends to propose and evaluate a fully Transformer­based model for
traffic prediction at a mobile network BS limited to LTE radio units. The study is thus
limited to LTE traffic data and does not concern any other mobile traffic. The aim is
not necessarily to propose the best possible model, but to evaluate an approach that is
novel to the specific problem domain.


1.4 Ethics and sustainability


The Information and Communication Technology (ICT) industry has become a major
contributor to greenhouse gas emission [31] and network energy efficiency at mobile
networks is a high­priority question [24]. By aiming at extending the present
knowledge base of machine learning models for forecasting in mobile networks, this
thesis project is conducted with the hope of contributing along the path towards more
energy­efficient and sustainable mobile networks.

Another aspect of energy efficiency is related to the complexity of the underlying
predictive model. Such a model applied at large scale becomes a source of power
consumption itself. By also evaluating the models on training time, the aspect of
computational power is accounted for, and thus also indirectly the energy efficiency of
the models. In the long run, the best contributor to sustainability in mobile networks
is not necessarily the best performing model in terms of error rate, but rather the
most efficient model in the trade­off between prediction performance and power
consumption.

The privacy of data is assured throughout the project. No information regarding the
geographical position of the BS is provided, nor is any information of what the mobile
data flows include. Due to legal restrictions and customer privacy policies of the host
company Ericsson, any sensitive data values, including dates and timestamps of the
recorded data, have been encoded or censured to prevent prohibited disclosure.

1.5 Outline
A literature review describing the current state of research on mobile traffic prediction
as well as an introduction to the recurrent LSTM network and the Transformer
architecture are presented in chapter 2. Chapter 3 provides a detailed description
of the implemented models and the hardware setup for the experimental work. The
applied data setup and the experiments are presented in chapter 4. The findings
of the experiments are then discussed in chapter 5, followed by the conclusions and
suggestions for future work in chapter 6.

Chapter 2

Background

A mobile communications network consists of multiple interconnected BSs which
each allows UE to send (through the uplink) and receive (through the downlink) data
packets when they are connected to the BS. The area covered by a BS is typically
divided into three sectors. The demand in data traffic generally follows our human
patterns, and it varies with different BSs and different sectors of the BSs. Each sector
has its own setup of radio units to meet the demand of its area, i.e. the radio capacity of
each sector is deployed according to peak traffic scenarios in the covered area. As the
demand for mobile data grows, capacity needs to be increased, and since increasing
capacity generally entails increased power consumption, the industry struggles with
an expanding climate footprint [31].

Indeed, improving energy efficiency at BSs is a prioritised question for all stakeholders.
It decreases the total climate footprint of the mobile network, and thus also of the entire
ICT industry which with its seemingly ever­increasing market has become a major
contributor to greenhouse gas emission [31]. Furthermore, energy efficiency at BSs has
a large impact on the network’s overall operational expenditure; up to 80 per cent of
total network power consumption is attributed to the BSs [37, 50]. In the IMT-2020¹,
network energy efficiency is listed as one of thirteen performance requirements for 5G
networks [24].

¹IMT-2020 are requirements stated by the International Telecommunication Union's sector of
radiocommunication for 5G networks, services and devices.

The power consumption at BSs shows weak correlation to the activity of their
connected UE, and energy efficiency is one of the problem areas where proactive
resource management could contribute to significant improvements. It has even been
shown that the power consumption of an LTE network can be accurately estimated
based on its number and types of BSs [14, 15], and in [15], the actual user traffic
impact on a BS's power consumption was estimated at 1.4 per cent. By more dynamic
scheduling of capacity, the correlation between UE activity and BS power consumption
could be strengthened, i.e. so that differences in UE activity are more directly reflected in the
power consumption. Any proactive capacity management for mobile network BSs
would rely on robust, precise, and preferably computationally efficient predictions of
the UE activity for some subsequent time.

2.1 Time series prediction

When real­world data are collected, they often include some notion of time. Such
measurements, concatenated together, compose a time series [16]. There are
univariate and multivariate time series. The first refers to a sequence of a single
observation, i.e. a one­dimensional time series, and the latter refers to a group
of mutually involved time series where their interactions are considered, i.e. a
multidimensional time series. Analysis of time series either concerns capturing the
underlying pattern and structure of the measured data, or training a model to enable
future predictions [9]. Three typical characteristics for time series data are 1) not
all data are available at once, 2) the order of events matters, and 3) there can be
dependencies not only in space but also in time. This study addresses a multivariate­
to­multivariate time series prediction problem where the input is multidimensional
and the output is two­dimensional.

The most basic and general time series prediction methods are the ones based on
moving average (MA), such as ARIMA models. These are built upon the assumption of
the time series being stationary, resulting in the models simply being linear equations
[9]. The ARIMA equation can be formulated as

y_t = c + \frac{1}{n}\sum_{i=1}^{n} y_{t-i} + \frac{1}{n}\sum_{i=1}^{n} e_{t-i},    (2.1)

where yt and et are the output and error at time t, c is some constant and n is the window
size determining how far in the past the model sees. ARIMA models have been applied
for predicting mobile traffic load, for example in [38] and [54], but they generally tend
to overly reproduce the mean values of past data [45].

Figure 2.2.1: A multi-layer perceptron with one hidden layer. The first layer consists of
a data sequence x which is fed to the network. x is multiplied by the first weight matrix
W, which creates the hidden layer h. The output o is generated by multiplying h by the
second weight matrix U.
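To make the moving-average formulation in equation 2.1 concrete, the following is a minimal Python sketch of such a windowed forecaster. It is not the thesis implementation; the window size, the constant c, and the synthetic series are illustrative assumptions.

import numpy as np

def ma_forecast(series, n, c=0.0):
    # One-step-ahead forecast in the spirit of equation 2.1:
    # y_t = c + mean of the n previous observations + mean of the n previous errors.
    preds, errors = [], []
    for t in range(n, len(series)):
        past_values = series[t - n:t]
        past_errors = errors[-n:] if errors else []
        err_term = np.mean(past_errors) if past_errors else 0.0
        y_hat = c + np.mean(past_values) + err_term
        preds.append(y_hat)
        errors.append(series[t] - y_hat)   # realised one-step error
    return np.array(preds)

# Illustrative usage on a synthetic daily-periodic series.
t = np.arange(500)
series = 10 + 3 * np.sin(2 * np.pi * t / 24) + np.random.normal(0, 0.5, t.size)
print(ma_forecast(series, n=24)[:5])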

Significant improvements in graphics processing units (GPUs) over the years have
allowed for relatively complex computations to be run on large scale datasets, without
requiring a supercomputer [4]. This has facilitated the evaluation of certain ML genres,
namely approaches involving large amounts of operations which can be computed
in parallel, most commonly matrix multiplications [33]. The genre of ANNs is perhaps
the most favoured one.

2.2 Machine learning for time series prediction

2.2.1 Recurrent neural networks

Initially, although vaguely, inspired by the human brain, ANNs are networks of nodes
connected to each other through weights. Each node applies some non-linear function
to the weighted combination of its inputs, and its value is passed on through the network.
Typically, the node output is binary and the node is said to activate if some threshold
value is breached [29].

The multi­layer perceptron (MLP) is the archetypal ANN model and many of its basic
principles can be transferred to other ANN models. It consists of an input layer, i.e.
the input datapoints, which is also called the visible layer, an output layer, i.e. the

predictions, and in between them a number of hidden layers each of which consists of
a set of nodes. The data flows forward through the network, from the input layer to the
output, transformed between the different stages by matrices. The MLP is therefore a
so­called feed­forward neural network (FFNN).

For the input to be understood by the machine, it is broken down into simpler
representations, progressively further for each layer. The first hidden layer is a representation
of the visible layer, the second hidden layer is a representation of the first, and so
on. The further the representations are from the visible layer, the more abstract
they become [18]. That is, more abstract to us, but simpler to the model. Figure
2.2.1, showing an MLP with one hidden layer, serves to illustrate the fundamental
computations of an FFNN. The weights are updated through backpropagation, i.e. by
computing the gradient of the loss function with respect to each parameter, one layer
at a time, starting from the final layer. There are several different loss functions, and
which to use depends on the application. For example, mean squared error (MSE)
is common for regression tasks whereas cross-entropy is often applied for classifiers.

An FFNN architecture considered more successful than the MLP is the CNN [18]. After the
breakthrough in [28] from 2012, CNNs have been seen winning many competitions
and they are successfully applied to various domains [18, 29]. They achieved human­
level performance in certain tasks of image recognition in 2015 [40] and have for long
dominated computer vision.

Although FFNNs have delivered highly competitive results for time series, e.g. in [25],
they naturally cannot capture dependencies in time further than the length of their
input sequence, i.e. the width of the input layer. This is referred to as the network’s
memory depth, and increasing it to capture long-term dependencies would incur an
extreme number of multiplications [18], see figure 2.2.1.

RNNs, as illustrated in figure 2.2.2, are designed for processing sequences and are
scalable to handle much longer time series than FFNNs. By passing information
of past timesteps through an extra weight matrix, an RNN can recognise long term
dependencies without requiring the input layer size to match the length of the
dependencies. RNNs are trained by backpropagation through time, which, after
unrolling the network as in figure 2.2.2 equals the backpropagation training process of
an FFNN [18]. However, as independently discovered by Hochreiter [22] and Bengio
et al. [3] in the ’90s, even RNNs are limited in learning long term dependencies.


Figure 2.2.2: The recurrent connection of an RNN is typically denoted by a loop (left­
hand­side). A represents the hidden layer of the network, U, V, W denotes the weight
matrices, L is the loss function, and x, o, y are input, output and truth respectively. The
loop illustrates how the hidden layer not only passes its representation of the input to
the output but also recurrently to the hidden layer at the next input timestep. The
RNN is also illustrated unrolled (right­hand­side) to clarify how past representations
are passed on in time through W . The RNN is trained by applying the backpropagation
algorithm onto the computational graph of the unrolled network to minimise the loss
function, exactly like with an FFNN [18].

2.2.2 Vanishing and exploding gradients for RNNs

The problem of vanishing and exploding gradients arises when the computational
graph onto which the backpropagation algorithm is applied becomes too deep.
Repeated multiplication of the same values causes the gradient to either vanish or
in rare cases explode [18]. For example, repeated multiplication by the matrix W
means W^t at timestep t. If W has the eigendecomposition W = V diag(λ)V^{-1}, then
W^t = V diag(λ)^t V^{-1}, which means that any eigenvalues λ whose magnitude is not very
close to 1 will vanish or explode for some t. Vanishing gradients stop the algorithm from
learning as it complicates updating the parameters in the correct direction. Exploding
gradients make optimisation difficult as learning becomes unstable [18].

2.2.3 An RNN without learning issues: LSTM

There are many approaches developed to tackle the problem of vanishing or exploding
gradients, but the perhaps most popular one for practical applications is to include
gated units in some recurrent architecture [18]. In this study we consider the gated
LSTM unit. It was introduced by Hochreiter and Schmidhuber in [23] and is commonly
used in gated RNNs.


Figure 2.2.3: The LSTM unit keeps track of past information by passing a cell state ct
through time. What information should be stored, and what should be forgotten, is for
each timestep t, determined by modulation of input data xt through the input gate it ,
the forget gate ft , and the output gate ot .

Instead of single operation cells as in regular RNNs, an LSTM network has cells that
are like a miniature network themselves, as displayed in figure 2.2.3. The basic idea is
to add a path through time without fostering the learning problem of vanishing (nor
exploding) gradients. This memory­like ability is achieved through an inner recurrence
(in addition to the outer recurrence of the RNN) of the cell state ct which has a so­called
self­loop that allows for previously processed information to be passed on through time
[18]. What information should be stored and forgotten are decided by modulation
through a series of gates, specifically the input gate it , output gate ot , and forget gate
ft . For each timestep t they are defined as

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),    (2.2)

o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),    (2.3)

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),    (2.4)

where W and U are weight matrices for the respective gate. h_t is the hidden state

h_t = o_t \odot \tanh(c_t),    (2.5)

where \odot is the element-wise (Hadamard) product, \tanh is the hyperbolic tan activation
function and c_t is the cell state

c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c).    (2.6)

For each timestep, the hidden state is passed on as output to the next layer, but it
is also passed on in time to the next recursion together with the cell state. When
included in an RNN, the learnable parameters of the LSTM cell W , U and b are trained
by backpropagation through time essentially like the regular RNN. The difference,
however, is that the gradient of c_t lacks any intrinsic factor which would drive it to vanish
or explode [2]. Through this path, an RNN incorporating LSTM cells is thus trainable
even on large time lags.
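To make equations 2.2-2.6 concrete, here is a minimal NumPy sketch of a single LSTM cell step, separate from the Keras implementation used later in the thesis; the dimensions and random weights are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # One timestep of an LSTM cell following equations 2.2-2.6. W, U and b hold the
    # weights of the input (i), output (o), forget (f) and cell candidate (c) transforms.
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate, eq 2.2
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate, eq 2.3
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate, eq 2.4
    c_t = f_t * c_prev + i_t * np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # eq 2.6
    h_t = o_t * np.tanh(c_t)                                  # hidden state, eq 2.5
    return h_t, c_t

# Illustrative dimensions: 15 input features, 8 hidden units, random initial weights.
d_in, d_hid = 15, 8
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_hid, d_in)) for k in "iofc"}
U = {k: rng.normal(size=(d_hid, d_hid)) for k in "iofc"}
b = {k: np.zeros(d_hid) for k in "iofc"}
h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)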

2.3 Related work

2.3.1 Mobile network traffic prediction

Traffic prediction in cellular networks has been extensively researched for various
purposes, more or less similar to the use case in this study. RNN based architectures
including gated LSTM units are often amongst the top performers.

In [53], a multi­view ensemble learning approach to predict mobile traffic load was
proposed to enable putting BSs to sleep for reduced energy consumption. In addition
to temporal trends, the models attend to capturing spatial influence and influence of
external events, such as weather and holidays. On top of the ensemble of models, an
optimisation algorithm was applied, accounting for operation cost, energy efficiency
and quality of service. A deep reinforcement learning approach for scheduling the
new demand of data traffic with high volume and high time flexibility was presented
in [7]. This new segment of traffic, often originating from data uploads of Internet
of things applications such as smart homes, needs to find gaps in between traffic
peaks to not intervene with real­time, delay­sensitive services such as voice calls and
video streaming. Underlying the reinforcement learning agent is an LSTM predictor
which successfully forecasts the network’s throughput congestion. Similarly, in [49],
unprecedented energy efficiency improvements by activation and deactivation of
entire BSs in a heterogeneous mobile network through a deep reinforcement learning
approach based on traffic predictions by a deep ANN model were presented.


In [45], a hybrid model for spatial and temporal prediction in mobile networks was
presented, where the temporal part is based on LSTM cells. The model significantly
outperformed the selected baseline methods of an ARIMA and a support vector
regression model. In [42], a recurrent LSTM network was compared to an ARIMA
model and an FFNN, where both ANN approaches showed superior to the ARIMA
results with the LSTM slightly outperforming the FFNN. The models were applied
to real network data consisting of the number of connected users to an LTE node per
ms for K previous timesteps T , with K being tested from 1­10, and the duration of
T ∈ {10, 30, 60, 120} in ms. For the LSTM network, increasing T decreases accuracy
and increasing K increases accuracy. In [1], a comparative evaluation of an LSTM
network and an ARIMA model for mobile traffic prediction was performed. Although
the recurrent LSTM network showed superior results over the ARIMA, and particularly
for long­ranging time series, the study reveals that there can be scenarios in which the
ARIMA performs close to the LSTM with lower complexity. In [25], an LSTM network
for mobile traffic prediction which outperformed their baseline models was presented,
i.e. it performed more accurate predictions than an ARIMA model and with similar
error rates at significantly shorter training time than an FFNN.

2.3.2 Prediction without recurrence: Transformers

In 2017, Vaswani et al. presented the Transformer model as the first sequence to
sequence model based entirely on attention and thus without any recurrent layers
[44]. With their model, proposed for natural language translation tasks, they achieved
new state­of­the­art results when training significantly faster than models applying
convolutional or recurrent layers.

The Transformer model essentially consists of N identical stacked encoder layers and
N identical stacked decoder layers. Both types of layers are fully connected and are
illustrated in the left and right half of figure 2.3.1 respectively.

Each encoder layer serves to map the input sequence of symbol representations, e.g.
a sentence for a natural language translation task, and its positional encoding to a
sequence of continuous representations. Each layer has two sub­layers each of which
includes a residual connection and layer normalisation, i.e. the output of each sub­
layer is LayerNorm(x + SubLayer(x)), where x denotes the input of the sub­layer. The
bottom sub-layer is a multi-head self-attention mechanism, and the top one is a fully
connected FFNN.

Figure 2.3.1: The model architecture of the Transformer [44].

Each decoder layer takes the output of the encoder stack and the right­shifted decoder
output as input to generate an output sequence, e.g. the same sentence in another
language for the translation task. The decoder layers are similar to the encoders, but
in between the self­attention and the FFNN sub­layers is another multi­head attention
mechanism that attends to the output of the encoder stack. Furthermore, the self­
attention sub­layer is masked to ensure that the shifted prediction at a position cannot
depend on subsequent outputs.

The purpose of the attention mechanism is to, when processing some input, let the
model attend only to previous knowledge which is in some way relevant to that input.
The multi­head attention is a stack of multiple attention mechanisms which allows
the model to learn different relationships to attend to. In practice, the attention is
achieved by mapping a matrix of queries Q to a set of key and value matrix pairs (K, V ).
Vaswani et al. proposed a scaled version of the attention mechanism, which they call
the scaled dot­product attention [44]. For keys and queries of dimension dk and values

of dimension d_v, the attention output is computed as

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V.    (2.7)

Figure 2.3.2: Scaled dot-product attention (left) and multi-head attention (right) [44].
Q, K and V represent the queries, keys and values respectively, and h denotes the
number of heads in the multi-head attention layer.

The multi-headed attention mechanism consists of h parallel attention mechanisms
which attend to h different versions of Q, K, and V, linearly projected into
their respective dimensions by learned projections. The attention mechanisms
are concatenated and again projected into the output values. The structure and
information flow of the scaled dot­product attention mechanism and the multi­head
attention are illustrated in figure 2.3.2. The multi­head attention mechanism can be
formulated as

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, ..., \mathrm{head}_h)W^O,    (2.8)

where    (2.9)

\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V),    (2.10)

where W_i^Q \in \mathbb{R}^{d_{model} \times d_k}, W_i^K \in \mathbb{R}^{d_{model} \times d_k},
W_i^V \in \mathbb{R}^{d_{model} \times d_v} and W^O \in \mathbb{R}^{hd_v \times d_{model}} are
the parameter matrices of projection. Vaswani et al. employed a dimension structure
of d_k = d_v = d_{model}/h, with h = 8 and d_k = 64 [44].
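A minimal NumPy sketch of equations 2.7-2.10 may help make the attention computations concrete; the sequence length, d_model = 512, h = 8 and the random matrices standing in for learned projections are illustrative assumptions.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Equation 2.7: softmax(QK^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    # Equations 2.8-2.10: h parallel heads, concatenated and projected by Wo.
    heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

# Illustrative setup: 10 tokens, d_model = 512, h = 8, d_k = d_v = 64.
rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8
d_k = d_model // h
X = rng.normal(size=(n, d_model))
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, X, X, Wq, Wk, Wv, Wo)   # self-attention
print(out.shape)   # (10, 512)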

The encoder-decoder structure of the Transformer makes it suitable for generative
tasks. In addition to new state-of-the-art performance in the machine translation
domain [43, 44], Transformer models have been applied to a range of generative tasks
with extremely successful results, such as image generation [35], speech recognition
[12], and language understanding [11].

Fully Transformer­based approaches have also been applied to time series prediction
tasks. In [47], a full encoder­decoder Transformer for forecasting the spread of yearly
influenza was presented with results similar to the state of the art, and favourable in
comparison to the latter. The Transformer model was, moreover, superior to the LSTM. In [32],
a model including the decoder part of the Transformer for time series forecasting
was proposed, with results competitive to the state­of­the­art on benchmark datasets
and implications that the Transformer can capture long­term dependencies that
are difficult to the LSTM. Temporal and spatial transformers were combined in
stacked blocks to capture the spatio­temporal dependencies of vehicle traffic flow in
[25] with state­of­the­art competitive results. In [52], an incredibly successful pre­
trainable unsupervised model for representation learning of time series based on the
Transformer encoder was presented. The model clearly outperformed the state­of­
the­art, even for supervised learning, and it is currently the best performing model for
multivariate time series regression and classification, the authors claim [52].

The attention mechanism of the Transformer has been included in works based on
LSTM, CNN, and RNN architectures for mobile traffic prediction, but mainly for
capturing spatial dependencies. Novel approaches including stacking an attention
mechanism on some ANN architecture were proposed in [21] and [30] for predicting
the total mobile traffic volume in a specific region by not only capturing temporal
dependencies but also spatial through neighbouring regions. In [13], a deep LSTM
model stacked onto an attention mechanism that extracts temporal and spatial features
from mobile traffic data was proposed for large­scale traffic prediction of a cellular
network. Compared to the chosen baseline of an RNN with gated recurrent units
(GRUs)², the proposed model was superior for predictions at three out of six BSs
[13]. A spatial­temporal attention­based LSTM network for mobile traffic prediction
was presented in [17]. In their experiments, the proposed model scored higher than
competing models.

However, fully attention-based approaches, which have gained such acknowledgement
in other domains, seem to remain unexplored for mobile traffic prediction. The
machine translation Transformer presented by Vaswani et al. in [44] is based on a
positional embedding of the input words, and some similar representation of time is
needed for such an approach to apply to time series prediction tasks.

²The GRU, introduced by Cho et al. in [8], is, after the LSTM, the most commonly used gated
unit for RNNs.

2.3.3 Representation of time: Time2Vec

In [26], Kazemi et al. presented the alternative representation of time Time2Vec. It
is similar to the periodic positional embedding of the Transformer [44], but instead
of being fixed, Time2Vec includes learnable parameters in its representation of time.
The authors claim that by including their representation of time in the input fed
to the model, performance can be improved for most models in handling problems
including temporal dependencies [26]. The scalar notion of time τ is transformed into
its Time2Vec representation t2v(τ ) by



t2v(\tau)[i] = \begin{cases} \omega_i \tau + \phi_i, & \text{if } i = 0, \\ F(\omega_i \tau + \phi_i), & \text{if } 1 \leq i \leq k, \end{cases}    (2.11)

where t2v(τ )[i] is the ith element of a vector of size k + 1, and ωi and ϕi are learnable
parameters. In this study, F is the sine function F = sin(ωi τ + ϕi ), but other similar
periodic activation functions, such as cosine, produce equivalent representations
[26].
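A minimal NumPy sketch of equation 2.11 is given below; the choice of k and the random initial values of the learnable parameters ω and φ are illustrative assumptions.

import numpy as np

def time2vec(tau, omega, phi):
    # Equation 2.11: element 0 is linear in time, elements 1..k are periodic (sine).
    # tau is a scalar time; omega and phi are learnable vectors of length k + 1.
    linear = omega[0] * tau + phi[0]
    periodic = np.sin(omega[1:] * tau + phi[1:])
    return np.concatenate(([linear], periodic))

# Illustrative usage with k = 4 and randomly initialised parameters
# (in a model, omega and phi would be learned by backpropagation).
rng = np.random.default_rng(0)
k = 4
omega, phi = rng.normal(size=k + 1), rng.normal(size=k + 1)
print(time2vec(3600.0, omega, phi))   # vector of size k + 1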

Chapter 3

Method

The prediction problem is formulated as a multi-variate many-to-many problem. The
input matrix X consists of 15 feature vectors and the target matrix Y is a subset of the
one-step left-shifted input LeftShift(X). This subset consists of the two feature vectors
which limit the capacity of each radio unit, which are what we want to predict. In other
words, the problem for the models to learn is, for the input row x_t of timestep t, to
predict the subsequent values x^a_{t+1} and x^b_{t+1}, where a and b denote the two features
that, left-shifted, constitute Y.
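As an illustration of this problem formulation, the following sketch builds input windows X and one-step left-shifted targets Y; the window length, the random data and the target column indices are assumptions for illustration only, not the thesis code.

import numpy as np

def make_supervised(data, target_cols, window):
    # Build many-to-many training pairs: for every window of `window` rows,
    # the target is the one-step left-shifted values of the two target columns.
    X, Y = [], []
    for start in range(len(data) - window - 1):
        X.append(data[start:start + window, :])                       # all features
        Y.append(data[start + 1:start + window + 1, target_cols])     # shifted targets
    return np.array(X), np.array(Y)

# Illustrative usage: 1000 samples, 15 features, targets in columns 0 and 1.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 15))
X, Y = make_supervised(data, target_cols=[0, 1], window=32)
print(X.shape, Y.shape)   # (967, 32, 15) (967, 32, 2)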

3.1 Model architecture of the Transformer model


The original Transformer architecture, as introduced by Vaswani et al. [44], includes
an encoder­decoder structure where the decoder part makes it suitable for sequence­
to­sequence tasks, and in particular when the output sequence length is not pre­
specified.

In our case, the output sequence length is fixed and the Transformer model was
implemented with the encoder part but without the decoder part. In addition to the
decoder layer being redundant, discarding it roughly cuts the number of parameters
in half, which leads to benefits regarding computational complexity and learning
feasibility.

The implemented Transformer model essentially consists of a number of stacked
identical attention layers each of which is equal to the encoder of [44]. That is, the
multi-head attention sub-layers and the feed-forward sub-layers are implemented with
a residual connection, which adds their input to their output before being normalised
by layer-normalisation so that the output of each sub-layer becomes LayerNorm(x +
SubLayer(x)). The feed-forward sub-layers are fully connected and their activation
functions are the rectified linear unit (ReLU) [20], i.e. max(0, x).

Figure 3.1.1: The Transformer model consists of stacked identical attention layers,
each of which is equal to the original encoder layer of [44]. Instead of positional encoding
added to the input, a Time2Vec representation of time is concatenated to the input
data. The decoder layer from [44] is completely discarded, and in its place is a linear
layer transforming to the correct output shape.

The structure and information flow of the model are displayed in figure 3.1.1 and can
be described as follows. The input X ∈ R^{n×m} of n samples and m feature vectors is
transformed to a Time2Vec representation matrix T2V ∈ R^{n×m}. The input and the
Time2Vec representation are then concatenated into X_{T2V} ∈ R^{n×2m} before being fed
to the first attention layer. The output of each attention layer is shaped identically to its
input, and can simply be passed on as input to the next attention layer. The output
of the final attention layer is projected to the correct output shape through a linear
layer.

The dimensions of the model, the number of attention layers, the number of heads
h per layer, the size of each FFNN sub­layer, the dropout rate and the batch size of
input data were selected based on the parameter grid search described in section 4.2.
The dimensions of queries dq , keys dk , values dv and the total multi­head attention
mechanism dmodel follow the same relationship as in the original paper dq = dk = dv =
dmodel /h [44].

The model is trained through backpropagation and the MSE was chosen as loss
function. For the number of predictions n, the measured values y_i and the predicted
values ŷ_i, the MSE is

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2.    (3.1)

The adaptive moment estimation (ADAM) [27] algorithm was applied for optimisation
(minimisation) of the loss function. Unlike the classical stochastic gradient descent
method with a fixed learning rate, ADAM assigns individual adaptive learning rates
to the learnable parameters based on estimates of the first and second moments of
the gradients. The parameters of the ADAM optimiser were set to α = 0.001, β1 = 0.9
and β2 = 0.999.
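The following Keras sketch outlines a model in the spirit of this section: a Time2Vec representation concatenated to the input, a stack of encoder layers built around the built-in multi-head attention layer, and a final linear projection, compiled with the stated MSE loss and ADAM settings. It is not the thesis code; the Time2Vec parameterisation, the sequence length and the layer sizes shown are assumptions (the selected hyperparameters are reported in section 4.2.2).

import tensorflow as tf
from tensorflow.keras import layers

class Time2Vec(layers.Layer):
    # Rough Time2Vec layer in the spirit of equation 2.11: one linear channel plus
    # periodic (sine) channels with learnable weights, shaped like the input so it
    # can be concatenated to it (the exact parameterisation here is an assumption).
    def build(self, input_shape):
        d = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(d, d), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(d,), initializer="zeros")
    def call(self, x):
        z = tf.einsum("...i,ij->...j", x, self.w) + self.b
        return tf.concat([z[..., :1], tf.sin(z[..., 1:])], axis=-1)

def encoder_block(x, num_heads, key_dim, ff_dim, dropout=0.0):
    # Multi-head self-attention sub-layer with residual connection and layer norm.
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim,
                                     dropout=dropout)(x, x)
    x = layers.LayerNormalization()(layers.Add()([x, attn]))
    # Fully connected feed-forward sub-layer (ReLU), also residual + layer norm.
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    return layers.LayerNormalization()(layers.Add()([x, ff]))

def build_transformer(seq_len, n_features, n_targets=2, n_layers=2,
                      num_heads=8, key_dim=64, ff_dim=128):
    inp = layers.Input(shape=(seq_len, n_features))
    x = layers.Concatenate(axis=-1)([inp, Time2Vec()(inp)])   # (batch, seq, 2m)
    for _ in range(n_layers):
        x = encoder_block(x, num_heads, key_dim, ff_dim)
    out = layers.Dense(n_targets)(x)            # linear projection to the output shape
    model = tf.keras.Model(inp, out)
    model.compile(loss="mse",
                  optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                                     beta_1=0.9, beta_2=0.999))
    return model

model = build_transformer(seq_len=32, n_features=15)   # window length is illustrative
model.summary()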

3.2 Model architecture of the recurrent LSTM


The implemented LSTM network consists of one or more recurrent layers of LSTM
units. On top of the recurrent layers is a linear layer, projecting the output to the correct
shape. The parameters are learned by backpropagation through time, as described in
section 2.2.1. The loss function is the MSE and optimisation is performed through
ADAM, i.e. the same optimiser as for the implemented Transformer model. The
number of hidden layers, the number of nodes, the batch size, and the dropout rate
are determined by the parameter grid search, as described in section 4.2.

Like the Transformer, the LSTM model was implemented with the MSE loss function
of equation 3.1 and the ADAM optimiser for learning the weights. The parameters of
the ADAM optimiser of the LSTM model are set identically to the parameters of the
Transformer model.
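A corresponding minimal Keras sketch of the recurrent LSTM baseline is given below; it is not the thesis code, the window length is an illustrative assumption, and the layer and unit counts follow the grid-search choice reported in section 4.2.1.

import tensorflow as tf
from tensorflow.keras import layers

def build_lstm(seq_len, n_features, n_targets=2, n_layers=1, units=256, dropout=0.0):
    # One or more recurrent LSTM layers followed by a linear projection, compiled
    # with the same MSE loss and ADAM settings as the Transformer model.
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(seq_len, n_features)))
    for _ in range(n_layers):
        # return_sequences=True keeps one output per timestep (many-to-many setting).
        model.add(layers.LSTM(units, return_sequences=True, dropout=dropout))
    model.add(layers.Dense(n_targets))          # linear layer to the output shape
    model.compile(loss="mse",
                  optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,
                                                     beta_1=0.9, beta_2=0.999))
    return model

model = build_lstm(seq_len=32, n_features=15)   # 1 layer of 256 units per section 4.2.1
model.summary()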

3.3 Regularization
A common problem for complex ANNs is overfitting to training data, i.e. when the
model learns the patterns of the training set so closely that it struggles with fitting
to data from outside the training set, such as to a test set. Methods aiming towards
reducing error on data other than the training data are referred to as regularization
strategies [18].


To prevent overfitting to training data, the regularization strategy of dropout [39] was
applied to both models with rates determined by the parameter grid search, described
in section 4.2. It is powerful but computationally cheap [18]. Dropout was applied to
the original Transformer at a rate of 0.1 [44] and it has often been applied to LSTM
networks. The dropout rates determined through the grid search were both zero,
meaning no dropout was used in the selected models.

3.4 Wilcoxon signed­rank for statistical testing


The Wilcoxon signed­rank test [46] was applied to statistically confirm the differences
in the models' performances. It does not rely on the assumptions that samples follow a
Gaussian distribution or that results are independent, which makes it suitable for
comparing the performance of algorithms [10]. The data in the use case example
cannot be assumed normally distributed.

For each of the N sliding window iterations (as described in section 4.1.3), the
difference between the RMSE scores of the two models is computed, denoted as d_i
for the i-th test. The N test setups are then ranked from 1 to N in ascending order of
|d_i|, where any lacking difference d_i = 0 is discarded. The ranks of all positive differences
are summed into R^+ = \sum_{d_i > 0} \mathrm{rank}(d_i) and the ranks of all negative
differences are summed into R^- = \sum_{d_i < 0} \mathrm{rank}(d_i). The smaller of the
two sums, W = \min(R^+, R^-), can then be compared to the critical value W_{crit}
specific to N and the p-value. If W < W_{crit} for the current N, the score
differences are unlikely to have occurred by chance at the corresponding p-value.
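In practice, the test can be run with SciPy, which computes W = min(R+, R-) and the associated p-value directly; the score arrays below are random placeholders, not measured results.

import numpy as np
from scipy.stats import wilcoxon

# Random placeholder RMSE scores for the eight sliding-window iterations;
# they only illustrate the call and are not results from the experiments.
rng = np.random.default_rng(0)
rmse_transformer = 0.185 + 0.002 * rng.standard_normal(8)
rmse_lstm = 0.184 + 0.002 * rng.standard_normal(8)

# SciPy discards zero differences by default ("wilcox" zero method) and returns
# W = min(R+, R-) together with the corresponding p-value.
statistic, p_value = wilcoxon(rmse_transformer, rmse_lstm)
print(f"W = {statistic:.1f}, p = {p_value:.3f}")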

3.5 Implementation details


All practical implementations were built in Python 3.6. The Transformer model applies
the built-in multi-head attention layer from TensorFlow's Keras API [41]. Likewise, the
LSTM network consists of the built-in recurrent LSTM layers from the Keras API.

The hardware setup consisted of one machine with one NVIDIA GeForce 1080 GPU,
four CPUs, and 16 GB of memory.

Chapter 4

Experimental work

This chapter firstly presents the data applied for the experiments and how data were
processed to facilitate learning. Then, the selection of model parameters is motivated,
and finally, the experiments conducted to evaluate the two models and their respective
results are presented.

4.1 Data and pre­processing of data


The data used to perform the experiments consist of eleven feature vectors sampled at
even intervals during a period of ten months at a real BS with three sectors and three
LTE radio units in each sector. In addition to these, four synthetic feature vectors
extracted from the timestamps are used. Some periods completely lack measurements
which leads to gaps in the data with lengths ranging from a couple of hours to several
weeks.

4.1.1 Data description

The three sectors exhibit patterns with similar weekly and daily periods, but of different
magnitude. This is displayed in a snapshot of three weeks of one of the label columns in
figure 4.1.1. At each measured datapoint, the curves represent the maximum measured
number of connected users, averaged over the three units of each sector.

This example of differences between the sectors’ patterns emphasises the effect of
different geographical areas on mobile traffic. The data from the same sector, i.e. radio
traffic from units covering the same geographical area, correspond more closely to each
other than to data across sectors. This implies that model generalisability across
different areas will suffer, not only between sectors but also between BSs.

Figure 4.1.1: Maximum connected users per sector averaged over all radio units of each
sector for three weeks of measurements. Weekly and daily patterns can be identified.

The periodic patterns of the data can be highlighted by transformation to the frequency
domain (by Fourier transformation), as displayed in figure 4.1.2. Strong trends can be
identified at occurrences of once and twice a day, but there is also a clear periodicity at
once a week. Representing each sample’s time by periodic functions of some desired
period provides a simple but effective signal of where in that period the sample was
measured, which is further described in section 4.1.2.

4.1.2 Pre­processing

The raw data from the BS were processed in several ways to facilitate learning. All pre­
processing strategies were applied to all data, i.e. to both training and test sets.

First of all, any rows with missing values were deleted. This results in some additional
gaps in the time series which might affect temporal dependencies. This problem is
handled by transforming the timestamps to periodical representations of time, which
also handles the longer gaps in the measurements.

Figure 4.1.2: Number of connected users transformed to the frequency domain. Peaks
can be identified at occurrences of once per week, once per day and twice per day.

The datasets were extended by sine and cosine representations of time, providing the
models with information of when in the week and when in the day each sample was
measured. For the time τ in seconds, the time-of-the-day signal consists of the two
new feature columns sin(2πτ/d) and cos(2πτ/d), where d denotes the total number of
seconds per day. Likewise, the time-of-the-week signal consists of sin(2πτ/w) and
cos(2πτ/w), where w represents the total number of seconds per week. These four
additional features emphasise that 12 pm is close to 01 am and Sundays are close to
Mondays. In addition to these manually extracted periods, the learnable periodic
parameters of the Time2Vec representation, as part of the implemented Transformer
model, are capable of capturing other periodicities.
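A minimal sketch of how such periodic time features can be appended with pandas and NumPy follows; the column name and the hourly example index are illustrative assumptions.

import numpy as np
import pandas as pd

def add_periodic_time_features(df, timestamp_col="timestamp"):
    # Append time-of-day and time-of-week sine/cosine signals computed from the
    # timestamps, as described above (the column name is an illustrative assumption).
    seconds = df[timestamp_col].map(pd.Timestamp.timestamp)   # seconds since the epoch
    day, week = 24 * 60 * 60, 7 * 24 * 60 * 60
    df["day_sin"] = np.sin(seconds * (2 * np.pi / day))
    df["day_cos"] = np.cos(seconds * (2 * np.pi / day))
    df["week_sin"] = np.sin(seconds * (2 * np.pi / week))
    df["week_cos"] = np.cos(seconds * (2 * np.pi / week))
    return df

# Illustrative usage on an hourly range of timestamps.
df = pd.DataFrame({"timestamp": pd.date_range("2021-01-01", periods=48, freq="H")})
print(add_periodic_time_features(df).head())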

The inconsistency of the longer gaps in the measurements constitutes a problem to the
sequential models. When aspiring to learn the patterns of time series, the sequence of
data is important. For example, after processing a sample from a Monday at 8 am, it
would make no sense to then process a sample from a Sunday at 10 pm. In order to
retain the manually extracted periods, the gaps of the dataset are handled such that any
subsequent sample has the correct subsequent time­of­the­day and time­of­the­week
signal.

The different feature vectors of the input data have different scales. To inhibit
bias towards the larger scaled features, all feature vectors were scaled into range
[0, 1] by min-max normalisation. For each feature column vector x_i, its normalised
representation x_i^* is computed by

x_i^* = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}.    (4.1)

For each iteration of train­test splits, all data were normalised on the min­max values
of the current training set. That is, no information from the test dataset was ever
used in the normalisation process. For each target column vector yi , its normalised
representation y_i^* is computed by

y_i^* = \frac{y_i - \min(x_i)}{\max(x_i) - \min(x_i)}.    (4.2)
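A minimal sketch of this normalisation scheme, fitting the min-max statistics on the training split only and reusing them for the test split; the array shapes and random data are illustrative placeholders.

import numpy as np

def fit_minmax(train):
    # Compute per-feature min and max on the training split only (equations 4.1-4.2).
    return train.min(axis=0), train.max(axis=0)

def apply_minmax(data, col_min, col_max):
    # Scale columns into [0, 1] using statistics fitted on the training split, so that
    # no information from the test split leaks into the normalisation. Targets are
    # scaled with the min/max of their corresponding feature columns, per equation 4.2.
    return (data - col_min) / (col_max - col_min)

# Illustrative usage with random data standing in for one train-test iteration.
rng = np.random.default_rng(0)
train, test = rng.normal(size=(800, 15)), rng.normal(size=(200, 15))
col_min, col_max = fit_minmax(train)
train_scaled = apply_minmax(train, col_min, col_max)
test_scaled = apply_minmax(test, col_min, col_max)   # may fall slightly outside [0, 1]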

4.1.3 Evaluation and splitting of data

For ML models applied to other tasks than time series prediction, a popular, simple,
and robust approach to validate results is the k­fold cross­validation. It essentially
consists of splitting the complete dataset into k splits of equal size, shuffling the order
of the splits, and then iterating over the possible combinations of splits for training and
testing according to some ratio. This typically generates k results to average over for a
more robust score.

However, the k­fold cross­validation approach is not as trivial for time series
prediction. It makes no sense to change the order of occurrence of data and train the
model on data chunks subsequent to the test chunks, which would be the case of k­
fold cross­validation and any other method involving shuffling of data. It can instead
be beneficial to adopt methods that account for the problem’s temporal aspects, as is
recommended by [34].

To avoid scenarios where the models are trained on future data and tested on past, but
still promote unbiased scores, a prequential block evaluation with a sliding window is
applied for splitting and evaluating data. Although it has some downsides compared
to cross­validation and holdout approaches [5], it has the important advantage of
providing adequate error estimates while keeping the sequence of data intact [34].
The prequential block approach was implemented with a sliding window size of five,
where the models were trained on four splits of data and tested on the next subsequent
split. The data were split into 12 equally sized chunks {D1 , D2 , ..., D12 }, and for each

24
CHAPTER 4. EXPERIMENTAL WORK

Figure 4.1.3: The data were split into 12 and iterated over by a prequential block
method with a sliding window consuming five splits per test iteration. This leads to
a total of eight scores to average over.

sliding window iteration i, the models were trained on {Di , Di+1 , Di+2 , Di+3 } and tested
on Di+4 . This process, as illustrated in figure 4.1.3, provides eight iterations and a
corresponding number of error rates to average over for a more robust error estimate.
When more than one radio unit dataset are used, the process was repeated for each
dataset.

One effect of the selected method for data splitting is that each training iteration utilises
less data than alternative methods, for example, compared to a similar approach with
a growing window that utilises all data in the final iteration. However, the growing
window approach includes different amounts of data for each iteration, which makes
the models somewhat trained on different premises for each iteration. The lower data
utilisation generally leads to a higher average error rate, but as other methods tend to
provide overly optimistic scores, it is not necessarily a negative effect [34]. It can also
be expected to result in a lower standard deviation between iterations than alternative
approaches such as a similar approach with a growing window.
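A minimal sketch of the prequential block procedure with a sliding window of five over 12 chunks (yielding eight train/test iterations) is given below; the data shown are random placeholders.

import numpy as np

def prequential_splits(data, n_chunks=12, window=5):
    # Prequential block evaluation with a sliding window: train on the first
    # window - 1 chunks of each window position, test on the last one. With 12
    # chunks and a window of five this yields eight train/test iterations.
    chunks = np.array_split(data, n_chunks)
    for i in range(n_chunks - window + 1):
        train = np.concatenate(chunks[i:i + window - 1])
        test = chunks[i + window - 1]
        yield train, test

# Illustrative usage with random placeholder data (1200 rows, 15 features).
rng = np.random.default_rng(0)
data = rng.normal(size=(1200, 15))
for k, (train, test) in enumerate(prequential_splits(data), start=1):
    print(f"iteration {k}: train {train.shape}, test {test.shape}")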

The models were evaluated on the root mean squared error (RMSE). It is simply
the square root of the MSE of equation 3.1. In our use case scenario, large errors
are significantly more harmful than small errors and we thus want to penalise large
errors more than small errors. The loss function MSE and the evaluation metric
RMSE were chosen as they both include squaring the error. For very small error rates,
which was the case for the better performing versions of the models, the RMSE simply
leads to larger values than the MSE which makes the error rates more convenient to
compare.

Additionally, and secondarily, the models were evaluated on the time required for
training. This was measured by Python's built-in time module [36]. The hardware
setup is described in section 3.5.

4.2 Model selection


For each of the two models, a parameter grid search was performed to determine
tunable parameters. These were performed on data from one of the sectors with the
data from each radio unit treated separately, by training one separate model for each
of the sector’s three datasets.

In order to prevent data leakage from test scenarios, the parameter searches were
performed on the four data splits which were never used for testing: D1, D2, D3 and
D4 . For each radio unit of the sector, the models were trained and evaluated in two
iterations, resulting in a total of six scores and timings to average over for each model.
The first iteration consisted of training on D1 and D2 and testing on D3 , and the second
iteration of training on D2 and D3 followed by testing on D4 .

For the LSTM model, the search was performed over four parameter spaces, namely the
number of layers, the number of units per layer, the dropout rate, and the batch size. For the
Transformer model, it was performed over six spaces: the dimensions of the
model dk = dq = dv, the number of attention layers, the number of heads per layer h,
the size of each FFNN sub-layer, the dropout rate, and the batch size of the input data. All
other parameters were fixed during the grid search, including the number of epochs,
which was set to 25.

4.2.1 Parameter grid search for the LSTM

For the LSTM network, the parameter grid search was performed over the
number of hidden layers ∈ {1, 2, 3}, the number of units per hidden layer ∈
{4, 8, 16, 32, 64, 128, 256, 512, 1024}, batch size ∈ {4, 8, 12, 16} and dropout rate ∈
{0.0, 0.1, 0.2}. Regarding the latter, a dropout rate of 0.0 means no dropout. The top ten
scoring parameter setups, as displayed in table 4.2.1, performed very similarly in terms
of RMSE.
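The sketch below illustrates how such a search can be enumerated and each configuration built with Keras; the input shape, optimiser, and training loop are assumptions rather than the project's exact implementation:

```python
import itertools
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(n_layers, units, dropout, n_timesteps, n_features, n_targets):
    """Stacked LSTM regressor; all but the last LSTM layer return full sequences."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_timesteps, n_features)))
    for i in range(n_layers):
        model.add(layers.LSTM(units, dropout=dropout, return_sequences=(i < n_layers - 1)))
    model.add(layers.Dense(n_targets))
    model.compile(optimizer="adam", loss="mse")
    return model

search_grid = itertools.product(
    [1, 2, 3],                                 # number of hidden layers
    [4, 8, 16, 32, 64, 128, 256, 512, 1024],   # units per hidden layer
    [0.0, 0.1, 0.2],                           # dropout rate
    [4, 8, 12, 16],                            # batch size
)
# for n_layers, units, dropout, batch_size in search_grid:
#     model = build_lstm(n_layers, units, dropout, n_timesteps=..., n_features=15, n_targets=2)
#     ...train for 25 epochs on D1-D2 / D2-D3, evaluate RMSE on D3 / D4, and record the time...
```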

The best option for the dropout rate is clearly to discard dropout altogether. Regarding the
other parameters, the batch size is chosen as 12, the number of layers as 1, and the number
of units as 256. This is the best performing model in terms of error rate and the fastest
learning one amongst the top 10 options. It is noteworthy that the worst performing
model in the entire grid search predicted at an error rate merely 1.5 per cent higher
than that of the best performing one.

4.2.2 Parameter grid search for the Transformer

The parameter grid search for the Transformer model included six parameter grids.
These were the number of layers ∈ {1, 2, 4, 8}, the number of heads in each multi­head
attention sub­layer h ∈ {2, 4, 8, 12, 16}, the key and value dimensions of each head
dk = dv ∈ {32, 64, 128}, the dimension of the FFNN sub­layer ∈ {8, 32, 64, 128, 256},
the dropout rate ∈ {0.0, 0.1, 0.2} and the size of each batch fed into the model ∈
{32, 64, 128, 256}.
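For reference, a minimal Keras sketch of one such attention layer (multi-head self-attention followed by an FFNN sub-layer) and of stacking several of them is given below; the residual connections, layer normalisation, and pooling head follow the standard Transformer encoder layout and are assumptions about details not restated here:

```python
from tensorflow import keras
from tensorflow.keras import layers

def attention_layer(x, num_heads, key_dim, ffnn_units, dropout):
    """One attention layer: multi-head self-attention plus a small FFNN sub-layer,
    each followed by a residual connection and layer normalisation."""
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim, dropout=dropout)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ffnn = layers.Dense(ffnn_units, activation="relu")(x)
    ffnn = layers.Dense(x.shape[-1])(ffnn)          # project back to the input width
    ffnn = layers.Dropout(dropout)(ffnn)
    return layers.LayerNormalization()(x + ffnn)

def build_transformer(n_layers, num_heads, key_dim, ffnn_units, dropout,
                      n_timesteps, n_features, n_targets):
    inputs = keras.Input(shape=(n_timesteps, n_features))
    x = inputs
    for _ in range(n_layers):
        x = attention_layer(x, num_heads, key_dim, ffnn_units, dropout)
    x = layers.GlobalAveragePooling1D()(x)          # collapse the time dimension
    outputs = layers.Dense(n_targets)(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```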

The results are again similar, as can be seen in table 4.2.2. The model setup with the
second lowest error rate is selected for further experiments, as the parameter search
implies that it requires significantly less time for training than the setup with the lowest
prediction error. That is, the selected model is the one with 6 layers, 8 heads per layer,
a key-value dimension of 128, a batch size of 256, a dimension of 128 for the FFNN layer,
and no dropout. The error rate of the worst performing model was nine per cent higher
than that of the best performing one.

4.3 Experiments

Four scenarios for training and testing were used to evaluate the models.
These are denoted S1, S2, S3, and S4, and are described in the first part of this section.

layers  units  dropout  batch size  RMSE      time (s)

1       256    0        12          0.184489  9.890437  *
1       512    0        12          0.184502  10.33306
1       128    0        12          0.184611  10.16801
1       64     0        8           0.184625  14.03329
2       256    0        12          0.184727  14.56715
2       128    0        12          0.184737  15.03414
1       1024   0        12          0.184755  10.55461
1       128    0        8           0.184769  14.29351
1       32     0        8           0.18479   13.97841
2       64     0        8           0.184817  20.44822

Table 4.2.1: The 10 best results of the parameter grid search for the LSTM model. The
setup marked with * is the chosen model.


layers  h   dk   FFNN units  batch size  dropout  RMSE      time (s)

8       8   64   128         64          0        0.185346  22.77483
2       4   64   128         64          0        0.185693  7.116714  *
8       8   64   32          64          0        0.185704  22.19315
8       4   32   8           64          0.1      0.185724  22.2351
4       16  32   64          64          0        0.185735  12.13569
4       4   32   128         64          0        0.185756  11.8884
8       16  32   128         64          0        0.185791  22.37578
8       16  128  128         64          0        0.185826  22.39942
8       8   32   64          64          0        0.185882  22.75767
4       8   32   256         256         0.1      0.18589   7.389337

Table 4.2.2: The 10 best results of the parameter grid search for the Transformer
model. The setup marked with * is the chosen model (the one with the second lowest
error rate, selected for its shorter training time).

                 S1  S2  S3  S4
Training splits  4   4   36  90
Test splits      1   1   9   2
Test iterations  24  24  8   9

Table 4.3.1: The number of data splits used for training and testing in each scenario.

Notable is that, in the first two scenarios, the models are trained on less data than
in the third, and that in the third scenario, in turn, the models are trained on less data
than in the fourth. The amount of training data applied, and the number of train-test
iterations, is shown for each scenario in table 4.3.1.

The scenarios were used in two experiments, referred to as Experiment 1 and
Experiment 2. The first includes training each model for 25, 50, 75, 100, and 150
epochs on all 15 input features to predict both target variables, and the second includes
variations in the number of input columns and target columns. After the description of
the scenarios, the two experiments are presented together with their respective results
in the remaining part of the section. Unless stated otherwise, the experiments were
run according to the prequential block evaluation in section 4.1.3.

4.3.1 Four scenarios for training and testing

S1: Separate treatment of radio units

For one sector, three individual but identically configured models were trained and
evaluated, each on data from one of the sector's three radio units. They were trained
for 50 epochs and evaluated according to section 4.1.3.

S2: Generalising within the sector

The training process was performed as in S1, but the models were evaluated on a
dataset from a different unit than the one applied for training. However, both the
training data unit and the test data unit were, for all iterations, from the same sector.
Again, the models were trained for 50 epochs. Although the test dataset originated
from a different radio unit, the sequential order of data splits was unaltered, i.e. the
test dataset consisted of the subsequent split following the last split of the training
dataset.

S3: Generalising across sectors

To investigate the models' ability to generalise across sectors, the models were trained
on data from all sectors and all radio units, and then tested on data from the same set
of units. The sequential order of the data was preserved by applying the prequential
approach presented in section 4.1.3. Specifically, the models were trained on four data
splits from every radio unit of the BS and tested on the subsequent split from every
unit, with a sliding window resulting in eight iterations. This means that each model
was trained on a total of 36 splits and tested eight times, with each test set consisting
of nine splits.
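A sketch of one such sliding-window iteration is given below, assuming the 12 chronological chunks of every radio unit are kept in a dictionary; the name unit_chunks and the exact data layout are illustrative assumptions:

```python
import pandas as pd

def s3_iteration(unit_chunks: dict, i: int):
    """One S3 train-test iteration: chunks i..i+3 from every radio unit form the
    training data (4 chunks x 9 units = 36 splits) and chunk i+4 from every unit
    forms the test data (9 splits)."""
    train = pd.concat([pd.concat(chunks[i:i + 4]) for chunks in unit_chunks.values()])
    test = {unit: chunks[i + 4] for unit, chunks in unit_chunks.items()}
    return train, test
```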

S4: Utilising all data

In this experiment scenario, almost all available data are utilised for training each
model. One Transformer model and one LSTM model were trained on the first ten
data splits of each radio unit, i.e. each of the models was trained on 90 data splits.
They were then evaluated on the last two splits of each unit separately, resulting in
nine scores and timings to average over. Accordingly, the prequential approach of the
previously presented experiments was not implemented for S4.


test  epochs  p-value   model    RMSE   σe     ∆e (%)  time (s)  στ   ∆τ (%)

S1    25      p = 0.01  LSTM     0.197  0.027  -2      18.0      0.5  +40
                        Transf.  0.201  0.028  +2      10.8      0.3  -67
      50      p = 0.02  LSTM     0.179  0.029  -2      34.8      0.4  +46
                        Transf.  0.183  0.030  +2      18.8      0.3  -85
      75      p = 0.01  LSTM     0.199  0.028  -2      51.9      0.5  +46
                        Transf.  0.203  0.031  +2      27.8      0.4  -87
      100     p = 0.01  LSTM     0.198  0.028  -3      65.7      0.4  +47
                        Transf.  0.203  0.027  +3      35.1      0.7  -87
      150     p = 0.03  LSTM     0.197  0.027  -1      98.2      0.6  +47
                        Transf.  0.199  0.029  +1      52.0      0.7  -89
S2    25      p = 0.07  LSTM     0.271  0.096  -2      17.8      0.2  +39
                        Transf.  0.276  0.088  +2      10.8      0.3  -65
      50      p = 0.30  LSTM     0.248  0.085  -1      34.7      0.3  +45
                        Transf.  0.251  0.082  +1      19.0      0.2  -83
      75      p = 0.01  LSTM     0.273  0.092  -3      51.9      0.3  +46
                        Transf.  0.282  0.088  +3      27.8      0.4  -87
      100     p = 0.03  LSTM     0.271  0.097  -3      65.7      0.4  +46
                        Transf.  0.279  0.101  +3      35.3      0.5  -86
      150     p = 0.01  LSTM     0.273  0.090  -4      98.2      0.3  +47
                        Transf.  0.284  0.080  +4      52.1      0.8  -89
S3    25      p = 0.01  LSTM     0.231  0.035  -5      40.2      0.8  +47
                        Transf.  0.244  0.035  +5      21.3      1.5  -89
      50      p = 0.01  LSTM     0.239  0.049  -6      79.0      1.1  +48
                        Transf.  0.254  0.049  +6      40.9      1.3  -93
      75      p = 0.01  LSTM     0.235  0.035  -4      123.6     0.8  +49
                        Transf.  0.244  0.036  +4      63.4      1.3  -95
      100     p = 0.01  LSTM     0.238  0.034  -3      158.0     0.9  +48
                        Transf.  0.245  0.035  +3      82.3      1.6  -92
      150     p = 0.01  LSTM     0.236  0.036  -3      238.4     0.9  +48
                        Transf.  0.243  0.037  +3      123.6     1.5  -93
S4    25      p = 0.57  LSTM     0.153  0.032  -2      295.6     -    +49
                        Transf.  0.156  0.035  +2      150.0     -    -97
      50      p = 0.01  LSTM     0.128  0.032  -3      587.1     -    +49
                        Transf.  0.133  0.034  +3      300.1     -    -96
      75      p = 0.01  LSTM     0.153  0.032  -2      912.5     -    +49
                        Transf.  0.156  0.033  +2      464.6     -    -96
      100     p = 0.01  LSTM     0.152  0.031  -2      1176.6    -    +49
                        Transf.  0.156  0.033  +2      600.9     -    -96
      150     p = 0.01  LSTM     0.152  0.031  -2      1773.7    -    +48
                        Transf.  0.155  0.033  +2      919.7     -    -93

Table 4.3.2: Results of the models when trained on input data consisting of all 15
feature columns for predicting both target variables at once. The ∆e column shows
the difference in prediction error between the two models as a percentage of the
average error of the respective model. ∆τ displays the corresponding difference in
consumed training time.
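Consistent with the tabulated values, the relative differences appear to be computed against each model's own average score; written out (an interpretation, not stated explicitly in the table):

\[
\Delta_e = 100 \cdot \frac{\mathrm{RMSE}_{\text{model}} - \mathrm{RMSE}_{\text{other}}}{\mathrm{RMSE}_{\text{model}}}, \qquad
\Delta_\tau = 100 \cdot \frac{\tau_{\text{model}} - \tau_{\text{other}}}{\tau_{\text{model}}}
\]

For example, at 25 epochs in S1 the LSTM row gives 100 · (18.0 − 10.8)/18.0 ≈ +40 for ∆τ, matching the table.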


4.3.2 Experiment 1

In the first experiment, the two models were evaluated after being trained on all 15
input feature columns. The results are presented in table 4.3.2 for comparison at equal
numbers of epochs. For all four scenarios, the recurrent LSTM model predicted the
targets with a lower error rate than the Transformer when compared after training for
an equal number of epochs. The advantage of the LSTM in RMSE ranges from 1 to 6
per cent. According to the Wilcoxon signed-rank test, the differences are statistically
significant at a p-value of 0.03 or lower in all cases but three.
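A sketch of how such a paired significance test can be computed with SciPy, assuming the per-iteration RMSE scores of the two models are collected in equally long lists (the numbers below are made up for illustration):

```python
from scipy.stats import wilcoxon

# paired RMSE scores from the same test iterations (illustrative values only)
lstm_rmse =   [0.195, 0.201, 0.188, 0.203, 0.199, 0.192, 0.197, 0.200]
transf_rmse = [0.199, 0.205, 0.193, 0.206, 0.204, 0.196, 0.201, 0.203]

statistic, p_value = wilcoxon(lstm_rmse, transf_rmse)  # two-sided by default
print(f"Wilcoxon signed-rank test: W={statistic:.1f}, p={p_value:.3f}")
```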

However, the Transformer model required a significantly shorter training time than
the LSTM, again, for all four scenarios. The prediction error of the Transformer can
also be compared to the LSTM on equal premises in terms of consumed training time.
For comparison after (roughly) equal training time, the results are presented in table
4.3.3. In three out of the four tests evaluated on equal training time, the Transformer
model outperformed the LSTM network in average error rate.

test  p-value   epochs  model    RMSE   σe     ∆e (%)  time (s)

S1    p = 0.01  25      LSTM     0.197  0.027  +7      18.0
                50      Transf.  0.183  0.030  -8      18.8
S2    p = 0.05  25      LSTM     0.271  0.096  +7      17.8
                50      Transf.  0.251  0.082  -8      19.0
S3    p = 0.01  25      LSTM     0.231  0.035  -10     40.2
                50      Transf.  0.254  0.049  +9      40.9
S4    p = 0.01  25      LSTM     0.153  0.032  +13     295.6
                50      Transf.  0.133  0.034  -15     300.1

Table 4.3.3: Results of the models when trained on input data consisting of all 15
feature columns for predicting both target variables at once, arranged for comparison
at roughly equal training time. The ∆e column displays the difference in prediction
error between the two models as a percentage of the average error of the respective
model.

4.3.3 Experiment 2

In this experiment, the models were trained on reduced sets of input columns.
Furthermore, a setup with one separate model per target feature was tested, i.e.
training one pair of Transformer models and one pair of LSTM networks for predicting
the two target variables. All models were trained for 50 epochs.

The results of training the models on reduced input width are presented in table 4.3.4.
Firstly, the models were evaluated after being exclusively trained on a subset of six of
the input feature columns, namely the two that when left­shifted constitute the target
variables and the four periodical representations of the time­of­the­week and time­
of­the­day. Secondly, they were evaluated after being exclusively trained on the two
feature columns that when left­shifted constitute the target columns. All models were
trained for 50 epochs, and all differences can be considered statistically significant
according to the Wilcoxon signed­rank test. The results show that the training time is
not reduced for any of the models when training on a reduced set of input data. In S1,
S2, and S4, the RMSE scores are significantly higher compared to the scores of the first
experiment, whereas the scores of S3 are similar.
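As a hedged sketch of how such a reduced input setup could be constructed, assuming the two target-corresponding features are traffic counters with hypothetical column names, that the targets are their left-shifted (next-step) values, and that the four periodical representations of section 4.1.2 are sine/cosine pairs over the day and the week:

```python
import numpy as np
import pandas as pd

def make_reduced_inputs(df: pd.DataFrame, target_cols=("traffic_ul", "traffic_dl")):
    """Build the six-column input setup: the two target-corresponding features plus four
    periodical time representations, together with the left-shifted targets."""
    out = df[list(target_cols)].copy()
    t = df.index                                       # assumes a DatetimeIndex
    hours = t.hour + t.minute / 60.0
    out["day_sin"] = np.sin(2 * np.pi * hours / 24.0)
    out["day_cos"] = np.cos(2 * np.pi * hours / 24.0)
    week_hours = t.dayofweek * 24.0 + hours
    out["week_sin"] = np.sin(2 * np.pi * week_hours / (7 * 24.0))
    out["week_cos"] = np.cos(2 * np.pi * week_hours / (7 * 24.0))
    targets = df[list(target_cols)].shift(-1)          # left-shift: the next step is the target
    return out[:-1], targets[:-1]                      # drop the final row, which has no target
```

Dropping the four sine/cosine columns gives the two-column setup evaluated above.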

In table 4.3.5, the results of training one separate model for each target variable
are shown. Training each model pair is roughly twice as time-consuming as training
a single model for the same task (Experiment 1). In S1, the combined error rates are
lower than in Experiment 1 for the same number of epochs. However, for S2, S3, and
S4, the models are outperformed by their corresponding versions from the first
experiment.


test  input cols  p-value   model    RMSE   σe     time (s)  στ   ∆e (%)

S1    6           p = 0.01  LSTM     0.199  0.028  33.8      0.5  -2
                            Transf.  0.203  0.028  18.5      0.3  +2
      2           p = 0.01  LSTM     0.198  0.028  33.6      0.3  0
                            Transf.  0.199  0.026  18.1      0.3  0
S2    6           p = 0.01  LSTM     0.273  0.095  33.8      0.3  -3
                            Transf.  0.280  0.103  18.8      0.4  +3
      2           p = 0.01  LSTM     0.272  0.095  33.6      0.2  -4
                            Transf.  0.282  0.098  18.1      0.3  +4
S3    6           p = 0.02  LSTM     0.243  0.036  79.4      1.3  -4
                            Transf.  0.253  0.034  40.5      1.5  +4
      2           p = 0.01  LSTM     0.242  0.036  79.1      0.7  -5
                            Transf.  0.253  0.041  39.6      1.4  +4
S4    6           p = 0.02  LSTM     0.153  0.031  582.2     -    -2
                            Transf.  0.157  0.033  294.0     -    +2
      2           p = 0.02  LSTM     0.153  0.030  586.9     -    -3
                            Transf.  0.156  0.033  291.4     -    +2

Table 4.3.4: Results of the models when trained on reduced sets of input features for
50 epochs. All models were trained on the two feature columns that constitute the targets
when left-shifted, and the models trained on six feature columns were additionally trained
on the four synthetically generated time representations presented in section 4.1.2.
The ∆e column displays the difference in prediction error between the two models as a
percentage of the average error of the respective model.

      p-value
test  y1    y2    model    RMSE1  σ1     RMSE2  σ2     RMSE   time (s)

S1    0.6   0.01  LSTM     0.068  0.017  0.099  0.012  0.168  68.2
                  Transf.  0.070  0.019  0.104  0.015  0.175  37.9
S2    0.01  0.00  LSTM     0.115  0.054  0.146  0.080  0.261  68.2
                  Transf.  0.131  0.066  0.168  0.094  0.299  37.8
S3    0.55  0.38  LSTM     0.133  0.026  0.133  0.026  0.266  162.2
                  Transf.  0.135  0.025  0.136  0.026  0.271  83.2
S4    0.10  0.05  LSTM     0.087  0.012  0.087  0.012  0.174  1205.2
                  Transf.  0.088  0.011  0.088  0.011  0.176  608.9

Table 4.3.5: Results of training one LSTM model and one Transformer model for each
target variable. Each model was trained on all 15 input features to predict one of the
targets. The columns RMSE1, RMSE2, σ1, and σ2 show the error rates and standard
deviations when predicting target variables y1 and y2 respectively. The combined error
of each model pair is presented in the RMSE column, and the time column displays their
respective total training time.
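Consistent with the tabulated values, the combined error of a model pair appears to be the sum of the two per-target errors (an interpretation, not stated explicitly):

\[
\mathrm{RMSE} = \mathrm{RMSE}_1 + \mathrm{RMSE}_2
\]

For example, for the LSTM pair in S2: 0.115 + 0.146 = 0.261.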

Chapter 5

Discussion

This chapter mainly serves to explain and discuss the key findings of the experiments
in relation to the research question of the project, i.e. how a fully attention-based
Transformer model compares to a recurrent LSTM network in terms of prediction
performance and training time when predicting traffic at a mobile network BS. Beyond
this, a few additional findings of the experimental work are discussed. First, the
experiments are interpreted and discussed, followed by a section that acknowledges
the limitations of the work.

5.1 Interpretation of the experimental findings

5.1.1 Comparing the Transformer to the LSTM

First of all, when evaluating the models at a fixed number of epochs, the results of
the conducted experiments are unambiguous concerning the prediction performance
part of the research question. In terms of prediction error, the implemented
Transformer performs worse than the LSTM in all performed experiments and all
scenarios for training and testing. The differences are significant at a p-value of 0.05
or lower in 25 of the total of 32 tests. In other words, the risk that a reproduction
of the experiments would contradict the finding that the LSTM is superior to
the implemented Transformer model when trained for an equal number of epochs is
low.

However, the results are equally clear concerning the part of the research question
attending to the time consumed for model training. In terms of required training
time, the Transformer completely outperforms the LSTM model in all experiments and
all scenarios. As stated in section 1.1, in a practical implementation such as the use
case example of this project, the trade-off between the model's direct effect on energy
consumption and its error rate is more essential than the error rate itself. In some
cases, it might thus be reasonable to select the Transformer model over the LSTM at
the cost of a few per cent increase in prediction error.

When the models are compared after equal training time instead of after a fixed
number of epochs, the results lead to other findings. For all scenarios in the first
experiment, the models' training times are similar enough for comparison when the
LSTM has been trained for half the number of epochs of the Transformer. After
being trained for 25 and 50 epochs respectively, the Transformer model outperforms
the LSTM in terms of prediction error in S1, S2, and S4 with a p-value of 0.05 or less,
but the LSTM outperforms the Transformer in S3, as can be seen in table 4.3.3.
The models have also been trained for equal time after 50 and 100 epochs
respectively, as well as after 75 and 150 epochs respectively. However, from the results
presented in table 4.3.2, it is clear that both models tend towards overfitting as the
number of epochs passes 50. Since no regularization is implemented, the comparison
after equal training time becomes unfair towards the faster model after some number
of epochs.

The key findings of the study are threefold. Firstly, in line with the hypothesis,
the Transformer model requires a significantly shorter time for training than the
LSTM at an equal number of epochs. Secondly, when compared on such premises,
the Transformer is clearly outperformed by the domain-dominant LSTM network.
Lastly, however, when instead comparing the models on similar premises in terms of
training time, the results show that the implemented Transformer model outperforms
the recurrent LSTM network in three out of the four evaluated cases. Although the
number of comparisons on this end is low, it still indicates that, when trained for equal
training time, a fully attention-based Transformer model can compare well, and even
favourably, to the domain-dominant and often state-of-the-art recurrent LSTM network
in terms of prediction error. However, more evidence is needed to draw any stronger
conclusions on the matter.


5.1.2 Additional findings

For comparison between the different scenarios for training and testing, it is noteworthy
that the parameter search for the respective models was performed on the training and
testing setup of S1. Any findings from such a comparison must thus be interpreted
with caution, particularly when the result is favourable to S1. However, as
can be seen in sections 4.2.1 and 4.2.2, the effect of changing parameters within the
defined search grids is relatively small. Additionally, as can be seen in table 4.3.1, the
different scenarios include differently distributed train-test proportions, which affects
both training times and error rates.

With this in mind, we can carefully acknowledge that the results presented in table
4.3.2 indicate that both models suffer when generalising across different radio units of
the sector as well as across different sectors of the BS, i.e. the error rates of S2 and S3
are significantly higher than those of S1. As discussed in section 4.1, it was expected
that the models would struggle in such scenarios, but not that the models would suffer
roughly equally when generalising across different sectors as when generalising across
units of the same sector. Another interesting observation is that the standard deviation
of the error rate is higher in S2 than in S3. This behaviour would be unexpected if the
scenarios could be compared on an even basis, since the target variables differ more
between sectors than within sectors, as discussed in section 4.1. However, as can be
seen in table 4.3.1, each model is trained on four data splits in S2, whereas the models
in S3 are trained on 36 data splits. This can be a contributing factor to the similar error
rates, as well as to the lower standard deviation in S3.

The results from S1 and S4 of the tests presented in table 4.3.2 indicate that the
reduced prediction performance from generalising across sectors can be compensated
for by adding more training data. The models of S1 were trained on merely four data splits,
whereas the models of S4 were trained on 90, with the latter resulting in significantly
lower prediction errors. This indicates that the practical advantages of applying a
single model that generalises across sectors might be attainable without compromising
prediction performance: fewer models need to be implemented, which might simplify
the data streams, and increases in prediction error can be compensated for by the
increase in available data. It is possible that the same effect of adding data would be
seen for generalisation attempts across BSs, which could lead to even further
advantages in a practical implementation.


By comparing the results from training the models on reduced input widths in table
4.3.4 to the results from applying all 15 input features in table 4.3.2, we can see that
reducing the input data has little or no effect on the consumed training time. However,
reducing the input width appears to have a significant negative effect on prediction
error for both of the implemented models. Furthermore, these results indicate that the
added periodical time representations (see section 4.1.2) add little or no value to the
models compared to training on the two target-corresponding features alone. In
practice, there is thus no computational-complexity reason to reduce the input data
width; however, there might be other advantages to reducing the size of the data flows,
depending on the details of the practical application.

Training separate models requires roughly twice as much training time with little
or no overall gain in prediction error, as can be seen in table 4.3.5. More specifically,
in comparison to applying a single model for the two target variables, both the
Transformer and the LSTM model pairs show significant increases in average RMSE for
S2, S3, and S4, and significant decreases of 4 and 6 per cent respectively in S1. The
results thus indicate that there is no point in training separate model pairs for the
prediction task of this project.

5.2 Limitations

Although the goal of this project does not include finding the optimal models, the model
selection of section 4.2 is limited to selecting the models for S1. This affects the
validity of any findings from direct comparison between different scenarios, especially
when S1 is involved. A more thorough approach, although perhaps unnecessarily
exhaustive given that optimal model selection was out of scope, would have been to tune
the parameters for each individual test. Furthermore, the grid search could have been
more detailed and it could have ranged over more dimensions. That is, the investigated
parameter options could have been more narrowly spaced, and not all dimensions were
investigated, e.g. different optimisation and regularization strategies. However, even
these possible extensions of the grid search were deemed unnecessarily exhaustive
in relation to the project scope. Similarly, further implementation options for the
Transformer architecture remain unexplored, but then again, the study does not
aim at finding the optimal models, but rather at comparing them on common
ground.


Another limitation, also related to the grid search, concerns the comparison on equal
premises in terms of training time. The grid search determined that completely
disregarding dropout regularization was the best alternative for both the Transformer
and the LSTM implementations. But as the grid search was performed over 25 epochs,
training behaviour suffered when training for more epochs and longer time. Both models
show clear tendencies towards overfitting to the training data when trained for more than
50 epochs, as can be seen in table 4.3.2. It would have been desirable to compare
the models for more cases than the four on equal training time that ended up being
evaluated. For fair comparisons on such a basis, a grid search, at least covering
regularization strategies, would be needed for each investigated number of epochs or
amount of training time. The comparison where the implemented LSTM network is trained
for 25 epochs and the Transformer for 50 is, however, still a valid comparison, as the
difference ended up in favour of the Transformer and both models were parameter-tuned
at 25 epochs. Its statistical significance is nevertheless weak due to the small number
of samples.

Another limitation of the study is the evaluation of training time. There are no
experiments which directly investigate the behaviour of training time in different
settings. An example of such an experiment could have been how training time scales
with the number of epochs, the input dimensionality, and different batch sizes. The focus
of this project has been on the prediction performance evaluation, and training time
has merely been accounted for. However, as claimed in section 1.1, training time and
computational complexity are important aspects of the problem, especially concerning
energy efficiency.

Chapter 6

Conclusion

By proposing a previously unexplored approach of fully attention-based learning for
mobile traffic prediction, this degree project has aimed to contribute to the current
knowledge base of Transformer learning for time series, as well as to that of traffic
prediction in mobile networks.

The findings of the project indicate that, when predicting mobile traffic at a mobile
network BS, a fully attention-based Transformer model can compare to the domain-
dominant LSTM network when trained for an equally long time. The results
of this study point in favour of the Transformer, but further investigation would be
required for any conclusion on the matter.

When the models instead are trained for a set number of epochs, it can be
concluded that the implemented Transformer model trains significantly faster than the
implemented LSTM network, but suffers in a comparison of prediction performance.
However, when implementing a prediction model for proactive resource management
at BSs, it would make little sense to judge the error rate at a set number of epochs
regardless of the required training time. For any practical implementation,
the essential factor should instead be the trade-off between prediction error and
required training time, and on such terms, the implemented Transformer is a realistic
alternative to the LSTM.

Therefore, we can conclude that the fully attention-based Transformer architecture
can, as in many other domains, be a qualified competitor to the currently dominating
methods for time series prediction in mobile networks. The findings of this degree
project are thus in line with previous research in other problem domains of time series
prediction, where fully attention-based models have shown state-of-the-art
performance.

6.1 Future work


A direct continuation of this degree project would be to produce a more thorough
investigation of how the Transformer can compare to the LSTM on equal training
time when applying regularization strategies optimised for that amount of time. As
discussed in section 5.2, the work of this project includes limitations on that end.

Furthermore, it would certainly be interesting to investigate how good a fully attention-
based time series prediction model could become in terms of both prediction error and
training time. The model could for example be implemented with some other
positional embedding, or the structure of the attention layers could be
modified. A study directed towards developing the best possible attention-based model
for mobile traffic prediction is therefore a promising research topic. An initial guess,
based on the review of previous work, would be that some attention-based structure
topped by a sophisticated ANN could be a potential top performer.

Several interesting research topics related to the practicalities of proactive resource
management have been identified during the work of this project. One of these is an
investigation of the models' ability to generalise across different BSs, as the results of
this project indicate that both the Transformer and the LSTM are able to generalise
across radio units which cover different geographical areas. This would be particularly
relevant for the design of any large-scale implementation based on mobile traffic
prediction. Another topic is a thorough investigation of how different prediction
models behave on different hardware setups, which would be relevant given the
type of hardware available at the different potential implementation points of the
network. When the CPU capacity was limited by mistake during the experimental
work, the training time increased significantly for both models in all scenarios. This
is a reminder that the differences in training times observed in the experiments are
limited to the specific hardware setup applied in this degree project.

Bibliography

[1] Azari, Amin, Papapetrou, Panagiotis, Denic, Stojan, and Peters, Gunnar.
“Cellular traffic prediction and classification: A comparative evaluation of
LSTM and ARIMA”. In: International Conference on Discovery Science.
Springer. 2019, pp. 129–144.

[2] Bayer, Justin Simon. “Learning sequence representations”. PhD thesis.
Technische Universität München, 2015.

[3] Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. “Learning long­term
dependencies with gradient descent is difficult”. In: IEEE transactions on
neural networks 5.2 (1994), pp. 157–166.

[4] Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal,
Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph,
Warde­Farley, David, and Bengio, Yoshua. “Theano: A CPU and GPU math
compiler in Python”. In: Proc. 9th Python in Science Conf. Vol. 1. 2010,
pp. 3–10.

[5] Cerqueira, Vitor, Torgo, Luis, and Mozetič, Igor. “Evaluating time series
forecasting models: An empirical study on performance estimation methods”.
In: Machine Learning 109.11 (2020), pp. 1997–2028.

[6] Chen, Mingzhe, Challita, Ursula, Saad, Walid, Yin, Changchuan, and
Debbah, Mérouane. “Artificial Neural Networks­Based Machine Learning for
Wireless Networks: A Tutorial”. In: IEEE Communications Surveys Tutorials
21.4 (2019), pp. 3039–3071. DOI: .

[7] Chinchali, Sandeep, Hu, Pan, Chu, Tianshu, Sharma, Manu, Bansal, Manu,
Misra, Rakesh, Pavone, Marco, and Katti, Sachin. “Cellular network traffic
scheduling with deep reinforcement learning”. In: Proceedings of the AAAI
Conference on Artificial Intelligence. Vol. 32. 1. 2018.


[8] Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar,
Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua.
“Learning Phrase Representations using RNN Encoder–Decoder for Statistical
Machine Translation”. In: Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association
for Computational Linguistics, Oct. 2014, pp. 1724–1734. DOI:
. URL: .

[9] Deb, Chirag, Zhang, Fan, Yang, Junjing, Lee, Siew Eang, and Shah, Kwok Wei.
“A review on time series forecasting techniques for building energy
consumption”. In: Renewable and Sustainable Energy Reviews 74 (2017),
pp. 902–924.

[10] Demšar, Janez. “Statistical comparisons of classifiers over multiple data sets”.
In: The Journal of Machine Learning Research 7 (2006), pp. 1–30.

[11] Devlin, Jacob, Chang, Ming­Wei, Lee, Kenton, and Toutanova, Kristina.
“BERT: Pre­training of Deep Bidirectional Transformers for Language
Understanding”. In: NAACL­HLT. 2019.

[12] Dong, Linhao, Xu, Shuang, and Xu, Bo. “Speech­transformer: a no­recurrence
sequence­to­sequence model for speech recognition”. In: 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE. 2018, pp. 5884–5888.

[13] Feng, Jie, Chen, Xinlei, Gao, Rundong, Zeng, Ming, and Li, Yong. “Deeptp: An
end­to­end neural network for mobile cellular traffic prediction”. In: IEEE
Network 32.6 (2018), pp. 108–115.

[14] Frenger, Pal and Ericson, Marten. “Assessment of alternatives for reducing
energy consumption in multi­RAT scenarios”. In: 2014 IEEE 79th Vehicular
Technology Conference (VTC Spring). IEEE. 2014, pp. 1–5.

[15] Frenger, Pål, Jading, Ylva, and Turk, John. “A case study on estimating future
radio network energy consumption and CO 2 emissions”. In: 2013 IEEE 24th
International Symposium on Personal, Indoor and Mobile Radio
Communications (PIMRC Workshops). IEEE. 2013, pp. 1–5.

[16] Gamboa, John Cristian Borges. “Deep learning for time­series analysis”. In:
arXiv preprint arXiv:1701.01887 (2017).


[17] Gao, Yun, Wei, Xin, Zhou, Liang, and Lv, Haibing. “A deep learning framework
with spatial­temporal attention mechanism for cellular traffic prediction”. In:
2019 IEEE Globecom Workshops (GC Wkshps). IEEE. 2019, pp. 1–6.

[18] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, and Bengio, Yoshua. Deep
learning. Vol. 1. 2. MIT press Cambridge, 2016.

[19] Guo, Jia, Peng, Yu, Peng, Xiyuan, Chen, Qiang, Yu, Jiang, and Dai, Yufeng.
“Traffic forecasting for mobile networks with multiplicative seasonal arima
models”. In: 2009 9th International Conference on Electronic Measurement &
Instruments. IEEE. 2009, pp. 3–377.

[20] Hahnloser, Richard HR, Sarpeshkar, Rahul, Mahowald, Misha A,
Douglas, Rodney J, and Seung, H Sebastian. “Digital selection and analogue
amplification coexist in a cortex­inspired silicon circuit”. In: Nature 405.6789
(2000), pp. 947–951.

[21] He, Kaiwen, Huang, Yufen, Chen, Xu, Zhou, Zhi, and Yu, Shuai. “Graph
attention spatial­temporal network for deep learning based mobile traffic
prediction”. In: 2019 IEEE Global Communications Conference
(GLOBECOM). IEEE. 2019, pp. 1–6.

[22] Hochreiter, Sepp. “Untersuchungen zu dynamischen neuronalen Netzen”.
Diploma. Technische Universität München, 1991.

[23] Hochreiter, Sepp and Schmidhuber, Jürgen. “Long short­term memory”. In:
Neural computation 9.8 (1997), pp. 1735–1780.

[24] International Telecommunication Union. Minimum requirements related to
technical performance for IMT-2020 radio interface(s). 2017. URL:
.

[25] Jaffry, Shan and Hasan, Syed Faraz. “Cellular Traffic Prediction using
Recurrent Neural Networks”. In: 2020 IEEE 5th International Symposium on
Telecommunication Technologies (ISTT). 2020, pp. 94–98. DOI:
.

[26] Kazemi, Seyed Mehran, Goel, Rishab, Eghbali, Sepehr, Ramanan, Janahan,
Sahota, Jaspreet, Thakur, Sanjay, Wu, Stella, Smyth, Cathal, Poupart, Pascal,
and Brubaker, Marcus. “Time2vec: Learning a vector representation of time”.
In: arXiv preprint arXiv:1907.05321 (2019).


[27] Kingma, Diederik P. and Ba, Jimmy Lei. “Adam: A method for stochastic
gradient descent”. In: ICLR: International Conference on Learning
Representations. 2015, pp. 1–15.

[28] Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. “Imagenet
classification with deep convolutional neural networks”. In: Advances in
neural information processing systems 25 (2012), pp. 1097–1105.

[29] LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. “Deep learning”. In:
Nature 521.7553 (2015), pp. 436–444.

[30] Li, Ming, Wang, Yuewen, Wang, Zhaowen, and Zheng, Huiying. “A deep
learning method based on an attention mechanism for wireless network traffic
prediction”. In: Ad Hoc Networks 107 (2020), p. 102258.

[31] Li, Rongpeng, Zhao, Zhifeng, Chen, Xianfu, Palicot, Jacques, and
Zhang, Honggang. “TACT: A transfer actor­critic learning framework for
energy saving in cellular radio access networks”. In: IEEE transactions on
wireless communications 13.4 (2014), pp. 2000–2011.

[32] Li, Shiyang, Jin, Xiaoyong, Xuan, Yao, Zhou, Xiyou, Chen, Wenhu,
Wang, Yu­Xiang, and Yan, Xifeng. “Enhancing the Locality and Breaking the
Memory Bottleneck of Transformer on Time Series Forecasting”. In: Advances
in Neural Information Processing Systems. Ed. by H. Wallach, H. Larochelle,
A. Beygelzimer, F. d’Alché­Buc, E. Fox, and R. Garnett. Vol. 32. Curran
Associates, Inc., 2019. URL:
.

[33] Madiajagan, M. and Raj, S. Sridhar. “Chapter 1 - Parallel Computing, Graphics
Processing Unit (GPU) and New Hardware for Deep Learning in
Computational Intelligence Research”. In: Deep Learning and Parallel
Computing Environment for Bioengineering Systems. Ed. by
Arun Kumar Sangaiah. Academic Press, 2019, pp. 1–15. ISBN:
978­0­12­816718­2. DOI:
. URL:
.

[34] Oliveira, Mariana, Torgo, Luis, and Santos Costa, Vitor. “Evaluation
procedures for forecasting with spatiotemporal data”. In: Mathematics 9.6
(2021), p. 691.


[35] Parmar, Niki, Vaswani, Ashish, Uszkoreit, Jakob, Kaiser, Lukasz,
Shazeer, Noam, Ku, Alexander, and Tran, Dustin. “Image transformer”. In:
International Conference on Machine Learning. PMLR. 2018, pp. 4055–4064.

[36] Python. Python Time module: Time access and conversions. Accessed: 2021-05-23.

[37] Richter, Fred, Fehske, Albrecht J, and Fettweis, Gerhard P. “Energy efficiency
aspects of base station deployment strategies for cellular networks”. In: 2009
IEEE 70th Vehicular Technology Conference Fall. IEEE. 2009, pp. 1–5.

[38] Shu, Yantai, Yu, Minfang, Yang, Oliver, Liu, Jiakun, and Feng, Huifang.
“Wireless traffic modeling and prediction using seasonal ARIMA models”. In:
IEICE transactions on communications 88.10 (2005), pp. 3992–3999.

[39] Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and
Salakhutdinov, Ruslan. “Dropout: a simple way to prevent neural networks
from overfitting”. In: The journal of machine learning research 15.1 (2014),
pp. 1929–1958.

[40] Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott,
Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and
Rabinovich, Andrew. “Going deeper with convolutions”. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9.

[41] Tensorflow. Tensorflow Keras. Accessed: 2021-05-23.

[42] Trinh, Hoang Duy, Giupponi, Lorenza, and Dini, Paolo. “Mobile traffic
prediction from raw data using LSTM networks”. In: IEEE 29th Annual
International Symposium on Personal, Indoor and Mobile Radio
Communications (PIMRC). IEEE. 2018, pp. 1827–1832.

[43] Vaswani, Ashish, Bengio, Samy, Brevdo, Eugene, Chollet, Francois,
Gomez, Aidan, Gouws, Stephan, Jones, Llion, Kaiser, Łukasz,
Kalchbrenner, Nal, Parmar, Niki, Sepassi, Ryan, Shazeer, Noam, and
Uszkoreit, Jakob. “Tensor2Tensor for Neural Machine Translation”. In:
Proceedings of the 13th Conference of the Association for Machine
Translation in the Americas (Volume 1: Research Track). Boston, MA:


Association for Machine Translation in the Americas, Mar. 2018, pp. 193–199.
URL: .

[44] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion,
Gomez, Aidan N, Kaiser, Łukasz, and Polosukhin, Illia. “Attention is All you
Need”. In: Advances in Neural Information Processing Systems. Ed. by
I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,
and R. Garnett. Vol. 30. Curran Associates, Inc., 2017. URL:

[45] Wang, Jing, Tang, Jian, Xu, Zhiyuan, Wang, Yanzhi, Xue, Guoliang,
Zhang, Xing, and Yang, Dejun. “Spatiotemporal modeling and prediction in
cellular networks: A big data enabled deep learning approach”. In: IEEE
INFOCOM 2017­IEEE Conference on Computer Communications. IEEE.
2017, pp. 1–9.

[46] Wilcoxon, Frank. “Individual Comparisons by Ranking Methods”. In:
Biometrics Bulletin 1.6 (1945), pp. 80–83. ISSN: 00994987. URL:
.

[47] Wu, Neo, Green, Bradley, Ben, Xue, and O’Banion, Shawn. “Deep transformer
models for time series forecasting: The influenza prevalence case”. In: arXiv
preprint arXiv:2001.08317 (2020).

[48] Xu, Mingxing, Dai, Wenrui, Liu, Chunmiao, Gao, Xing, Lin, Weiyao,
Qi, Guo­Jun, and Xiong, Hongkai. “Spatial­temporal transformer networks for
traffic flow forecasting”. In: arXiv preprint arXiv:2001.02908 (2020).

[49] Ye, Junhong and Zhang, Ying­Jun Angela. “DRAG: Deep reinforcement
learning based base station activation in heterogeneous networks”. In: IEEE
Transactions on Mobile Computing 19.9 (2019), pp. 2076–2087.

[50] Yu, Nuo, Miao, Yuting, Mu, Lan, Du, Hongwei, Huang, Hejiao, and
Jia, Xiaohua. “Minimizing energy cost by dynamic switching on/off base
stations in cellular networks”. In: IEEE Transactions on Wireless
Communications 15.11 (2016), pp. 7457–7469.


[51] Yu, Yanhua, Wang, Jun, Song, Meina, and Song, Junde. “Network traffic
prediction and result analysis based on seasonal ARIMA and correlation
coefficient”. In: 2010 International Conference on Intelligent System Design
and Engineering Application. Vol. 1. IEEE. 2010, pp. 980–983.

[52] Zerveas, George, Jayaraman, Srideepika, Patel, Dhaval,
Bhamidipaty, Anuradha, and Eickhoff, Carsten. “A Transformer-based
Framework for Multivariate Time Series Representation Learning”. In: arXiv
preprint arXiv:2010.02803 (2020).

[53] Zhang, Sheng, Zhao, Shenglin, Yuan, Mingxuan, Zeng, Jia, Yao, Jianguo,
Lyu, Michael R, and King, Irwin. “Traffic prediction based power saving in
cellular networks: A machine learning method”. In: Proceedings of the 25th
ACM SIGSPATIAL international conference on advances in geographic
information systems. 2017, pp. 1–10.

[54] Zhou, Bo, He, Dan, Sun, Zhili, and Ng, Wee Hock. “Network traffic modeling
and prediction with ARIMA/GARCH”. In: Proc. of HET­NETs Conference.
2005, pp. 1–10.

TRITA -EECS-EX-2021:644

www.kth.se
