Full Text 01

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,
SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2021
Transformer learning for traffic

prediction in mobile networks
DANIEL WASS
KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Author
Daniel Wass (dwass@kth.se)
Department of Intelligent Systems
KTH Royal Institute of Technology
Host Company
Ericsson AB
Stockholm, Sweden
Company Supervisor
Lackis Eleftheriadis
Senior Specialist Sustainable AI
Ericsson AB
Examiner
Olov Engwall
Division of Speech, Music, and Hearing
Supervisor
Atsuto Maki
Division of Robotics, Perception, and Learning
ii
Abstract
The resources of mobile networks are expensive and limited, and as demand for mobile
data continues to grow, improved resource utilisation is a prioritised issue. Traffic
demand at base stations (BSs) vary throughout the day and week, but the capacity
remains constant and utilisation could be significantly improved based on precise,
robust, and efficient forecasting.
This degree project proposes a fully attentionbased Transformer model for traffic
prediction at mobile network BSs. Similar approaches have shown to be extremely
successful in other domains but there seems to be no previous work where a model
fully based on the Transformer is applied to predict mobile traffic. The proposed
model is evaluated in terms of prediction performance and required time for training
by comparison to a recurrent long shortterm memory (LSTM) network.
The implemented attentionbased approach consists of stacked layers of multihead

attention combined with simple feedforward neural network layers. It thus lacks
recurrence and was expected to train faster than the LSTM network. Results show
that the Transformer model is outperformed by the LSTM in terms of prediction error
in all performed experiments when compared after training for an equal number of
epochs. The results also show that the Transformer trains roughly twice as fast as
the LSTM, and when compared on equal premises in terms of training time, the
Transformer predicts with a lower error rate than the LSTM in three out of four
evaluated cases.
Keywords
Transformer, Attention, LSTM, Mobile traffic prediction.
iii
Sammanfattning
Transformerinlärning för prediktion av mobil
nätverkstrafik
Efterfrågan av mobildata ökar ständigt och resurserna vid mobila nätverk är både dyra
och begränsade. Samtidigt bestäms basstationers kapacitet utifrån hur hög efterfrågan
av deras tjänster är när den är som högst, vilket leder till låg utnyttjandegrad av
basstationernas resurser när efterfrågan är låg. Genom robust, träffsäker och effektiv
prediktion av mobiltrafik kan en lösning där kapaciteten istället följer efterfrågan
möjliggöras, vilket skulle minska överflödig resursförbrukning vid låg efterfrågan utan
att kompromissa med behovet av hög kapacitet vid hög efterfrågan.
Den här studien föreslår en transformermetod, helt baserad på attentionmekanismen,

för att prediktera trafik vid basstationer i mobila nätverk. Liknande metoder har
visat sig extremt framgångsrika inom andra områden men transformers utan stöd från
andra komplexa strukturer tycks vara obeprövade för prediktion av mobiltrafik. För att
utvärderas jämförs metoden med ett neuralt nätverk, innefattande noder av typen long
shortterm memory (LSTM). Jämförelsen genomförs med avseende på träningstid och
felprocent vid prediktioner.
Transformermodellen består av flera attentionlager staplade i kombination med

vanliga feedforwardlager och den förväntades träna snabbare än LSTMmodellen.
Studiens resultat visar att transformermodellen förutspår mobiltrafiken med högre
felprocent än LSTMnätverket när de jämförs efter lika många epoker av träning.
Transformermodellen tränas dock knappt dubbelt så snabbt och när modellerna
jämförs på lika grunder vad gäller träningstid presterar transformermodellen bättre
än LSTMmodellen i tre av fyra utvärderade fall.
Nyckelord
Transformer, Attention, LSTM, Prediktering av mobil nätverkstrafik.
iv
Acknowledgements
I would like to express my deepest appreciation to my supervisors Lackis Eleftheriadis

at Ericsson and Atsuto Maki at KTH for their invaluable guidance, patience, and
continuous advice throughout the entire work of this degree project. I also want to
thank my examiner Olov Engwall for providing accurate, important, and rapid input
in the final stages of the report.
Furthermore, I would like to extend my sincere thanks to Athanasios Karapantelakis

and Maxim Teslenko at the Ericsson team for offering their reasoning and valuable
discussions on the technical aspects of the work. The support I received from
Athanasios for setting up the environment for data collection and cloud computing
at Ericsson has been completely indispensable and, for this, I cannot express enough
thanks. Being welcomed onto their and Lackis’ team for the semester has been a
privilege and I have been granted many great opportunities for learning.
v
Acronyms
ADAM adaptive moment estimation

ANN artificial neural network
ARIMA autoregressive integrated moving average
BS base station
CNN convolutional neural network
FFNN feedforward neural network
GPU graphics processing unit
GRU gated recurrent unit
ICT Information and Communication Technology
LSTM long shortterm memory
LTE LongTerm Evolution
MA moving average
ML machine learning
MLP multilayer perceptron
MSE mean squared error
QoS quality of service
ReLU rectified linear unit
RMSE root mean squared error
RNN recurrent neural network
UE user equipment
vi
Contents
1 Introduction 1
1.1 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Purpose and goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Ethics and sustainability . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 Time series prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Machine learning for time series prediction . . . . . . . . . . . . . . . . 7
2.2.1 Recurrent neural networks . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Vanishing and exploding gradients for RNNs . . . . . . . . . . . 9
2.2.3 An RNN without learning issues: LSTM . . . . . . . . . . . . . . 9
2.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Mobile network traffic prediction . . . . . . . . . . . . . . . . . . 11
2.3.2 Prediction without recurrence: Transformers . . . . . . . . . . . 12
2.3.3 Representation of time: Time2Vec . . . . . . . . . . . . . . . . . 16
3 Method 17
3.1 Model architecture of the Transformer model . . . . . . . . . . . . . . . 17
3.2 Model architecture of the recurrent LSTM . . . . . . . . . . . . . . . . . 19
3.3 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Wilcoxon signedrank for statistical testing . . . . . . . . . . . . . . . . 20
3.5 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4 Experimental work 21
4.1 Data and preprocessing of data . . . . . . . . . . . . . . . . . . . . . . 21
vii
CONTENTS
4.1.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.1.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1.3 Evaluation and splitting of data . . . . . . . . . . . . . . . . . . . 24
4.2 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2.1 Parameter grid search for the LSTM . . . . . . . . . . . . . . . . 26
4.2.2 Parameter grid search for the Transformer . . . . . . . . . . . . 27
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3.1 Four scenarios for training and testing . . . . . . . . . . . . . . . 28
4.3.2 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3.3 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5 Discussion 34
5.1 Interpretation of the experimental findings . . . . . . . . . . . . . . . . . 34
5.1.1 Comparing the Transformer to the LSTM . . . . . . . . . . . . . 34
5.1.2 Additional findings . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
6 Conclusion 39
6.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
References 41
viii
Chapter 1
Introduction
The resources of a mobile network are expensive and limited, and with the
development towards an internet of everything and everyone, the demand seems to be
evergrowing. Data packages are constantly requested in higher volume and at lower
latency, whereas the industry struggles with goals of improved energy efficiency to
reduce its climate footprint [24]. The traditional approach to meet increases in demand
has been to match base station (BS) resources to peak traffic scenarios, which leads
to excessive capacity at offpeak hours [6]. Thus, alternative solutions of proactive
resource management can generate significant opportunities for improvement in
resource utilisation.
Proactive resource management relies on robust and precise traffic prediction to

not compromise quality of service (QoS), and forecasting in mobile networks is a
wellresearched area. Older works commonly include relatively simple methods of
statistical analysis, such as the autoregressive integrated moving average (ARIMA) [19,
51]. However, recent leaps in technology development have allowed for more complex
computations to be run on large scale [4], paving the way for machine learning (ML)
approaches such as artificial neural networks (ANNs).
An ANN method which has reached enormous success in time series prediction tasks
is the recurrent neural network (RNN) implemented with long shortterm memory
(LSTM) units [23]. The LSTM units allow the RNN to capture longterm dependencies,
where it otherwise suffers [3, 22]. Mobile traffic forecasting is no exception and
LSTMbased networks have often been the topperforming method for mobile traffic
forecasting [1, 25, 42, 45].
1
CHAPTER 1. INTRODUCTION
The attentionbased Transformer model, introduced by Vaswani et al. in [44], has been
applied to a range of problem domains with unprecedented results, such as machine
translation [43], language understanding [11], speech recognition [12] and image
generation [35]. Lately, the Transformer has been applied to time series prediction
problems with results competitive to the stateoftheart [32, 47, 48, 52], and the
attention mechanism of the Transformer has been included in LSTM, convolutional
neural network (CNN), and RNNbased work for mobile traffic prediction [13, 17,
21, 30]. However, to the best of the author’s knowledge, there is currently no work
applying a model fully based on the Transformer and its attention mechanism to traffic
prediction in mobile networks.
This degree project suggests a fully attentionbased Transformer approach for mobile
traffic forecasting at a BS. The model is trained and tested on multivariate realworld
data from a LongTerm Evolution (LTE) BS. The model performance is compared to
the domaindominant approach of a recurrent LSTM network.
1.1 Problem
There are several problem areas related to the limited resources of mobile networks
that can benefit from proactive resource management. One of them is the industry
prioritised issue of improving energy efficiency while retaining QoS [24]. In the area
covered by a BS, each user equipment (UE) is connected to one of the BS’s radio units.
As the number of units is determined by peak traffic scenarios, and since the power
consumption of the units shows little variation compared to the number of connected
UE, the power consumption at offpeak traffic hours is excessive in relation to demand
[14, 15].
An approach to improve energy efficiency could be to, at offpeak traffic hours, offload
connected UE into a smaller number of radio units which would enable deactivation of
redundant radio capacity to reduce power consumption. However, such an approach
would require robust and accurate traffic prediction to not compromise QoS due to
reduced capacity by radio deactivation. To achieve significant effect, the approach
must be implemented on large scale and thus, in addition to being robust and precise,
an underlying prediction model would need to be costefficient. The latter requirement
sheds light on the importance of low computational complexity and small required data
flows, both for energy efficiency and operational cost of continuously running models
2
and data streams.
Recurrent LSTM networks have proven successful at capturing temporal dependencies

for mobile traffic forecasting, but findings of previous work in other domains of
time series prediction implicate that a Transformerbased model could provide
improvements in terms of required time for training while performing similar or even
better in terms of error rates. Thus, this degree project considers the following research
question:
When predicting traffic in a mobile network BS, how does a fully attentionbased
Transformer model compare to a recurrent LSTM network in terms of prediction
performance and training time?
1.2 Purpose and goal

By applying a fully Transformerbased model to a new application domain, the purpose
of this degree project is twofold. Firstly, the study aims at contributing to the current
knowledge base of completely attentionbased time series prediction, and secondly
to contribute to the present research mode of traffic prediction at mobile networks.
Moreover, by fulfilling the latter, the project aspires to add valuable insights on
Transformers for mobile traffic prediction and inspire further studies on efficient
prediction architectures which in the long run can contribute to energy efficiency and
sustainability, not only in mobile networks.
The goal of this degree project is to answer the research question formulated in section
1.1 by proposing a fully Transformerbased model for traffic prediction at a mobile
network BS and evaluate it on prediction performance and required training time in
comparison to the domaindominant recurrent LSTM network.
1.3 Delimitations
This project intends to propose and evaluate a fully Transformerbased model for
traffic prediction at a mobile network BS limited to LTE radio units. The study is thus
limited to LTE traffic data and does not concern any other mobile traffic. The aim is
not necessarily to propose the best possible model, but to evaluate an approach that is
novel to the specific problem domain.
3
1.4 Ethics and sustainability

The Information and Communication Technology (ICT) industry has become a major
contributor to greenhouse gas emission [31] and network energy efficiency at mobile
networks is a highpriority question [24]. By aiming at extending the present
knowledge base of machine learning models for forecasting in mobile networks, this
thesis project is conducted with the hope of contributing along the path towards more
energyefficient and sustainable mobile networks.
Another aspect of energy efficiency is related to the complexity of the underlying

predictive model. Such a model applied at large scale becomes a source of power
consumption itself. By also evaluating the models on training time, the aspect of
computational power is accounted for, and thus also indirectly the energy efficiency of
the models. In the long run, the best contributor to sustainability in mobile networks
is not necessarily the best performing model in terms of error rate, but rather the
most efficient model in the tradeoff between prediction performance and power
consumption.
The privacy of data is assured throughout the project. No information regarding the
geographical position of the BS is provided, nor is any information of what the mobile
data flows include. Due to legal restrictions and customer privacy policies of the host
company Ericsson, any sensitive data values, including dates and timestamps of the
recorded data, have been encoded or censured to prevent prohibited disclosure.
1.5 Outline
A literature review describing the current research mode of mobile traffic prediction
as well as an introduction to the recurrent LSTM network and the Transformer
architecture are presented in chapter 2. Chapter 3 provides a detailed description
of the implemented models and the hardware setup for the experimental work. The
applied data setup and the experiments are presented in chapter 4. The findings
of the experiments are then discussed in chapter 5, followed by the conclusions and
suggestions for future work in chapter 6.
4
Chapter 2
Background
A mobile communications network consists of multiple interconnected BSs which

each allows UE to send (through the uplink) and receive (through the downlink) data
packages when they are connected to the BS. The area covered by a BS is typically
divided into three sectors. The demand in data traffic generally follows our human
patterns, and it varies with different BSs and different sectors of the BSs. Each sector
has its own setup of radio units to meet the demand of its area, i.e. the radio capacity of
each sector is deployed according to peak traffic scenarios in the covered area. As the
demand for mobile data grows, capacity needs to be increased, and since increasing
capacity generally include increased power consumption, the industry struggles with
an expanding climate footprint [31].
Indeed, improving energy efficiency at BSs is a prioritised question for all stakeholders.
It decreases the total climate footprint of the mobile network, and thus also of the entire
ICT industry which with its seemingly everincreasing market has become a major
contributor to greenhouse gas emission [31]. Furthermore, energy efficiency at BSs has
a large impact on the network’s overall operational expenditure; up to 80 per cent of
total network power consumption is annotated to the BSs [37, 50]. In the IMT20201 ,
network energy efficiency is listed as one of thirteen performance requirements for 5G
networks [24].
The power consumption at BSs shows weak correlation to the activity of their
connected UE, and energy efficiency is one of the problem areas where proactive
1
IMT2020 are requirements stated by the International Telecommunication Union’s sector of
radiocommunication for 5G networks, services and devices.
5
CHAPTER 2. BACKGROUND
resource management could contribute to significant improvements. It has even been

showed that the power consumption of an LTE network can be accurately estimated
based on its number and types of BSs [14, 15], and in [15], the actual user traffic
impact on a BS’s power consumption was estimated to 1.4 per cent. By more dynamic
scheduling of capacity, the correlation between UE activity and BS power consumption
could be strengthened, i.e. so that differences in UE activity is more reflected in the
power consumption. Any proactive capacity management for mobile network BSs
would rely on robust, precise, and preferably computationally efficient predictions of
the UE activity for some subsequent time.
2.1 Time series prediction
When realworld data are collected, they often include some notion of time. Such
measurements, concatenated together, compose a time series [16]. There are
univariate and multivariate time series. The first refers to a sequence of a single
observation, i.e. a onedimensional time series, and the latter refers to a group
of mutually involved time series where their interactions are considered, i.e. a
multidimensional time series. Analysis of time series either concerns capturing the
underlying pattern and structure of the measured data, or training a model to enable
future predictions [9]. Three typical characteristics for time series data are 1) not
all data are available at once, 2) the order of events matters, and 3) there can be
dependencies not only in space but also in time. This study addresses a multivariate
tomultivariate time series prediction problem where the input is multidimensional
and the output is twodimensional.
The most basic and general time series prediction methods are the ones based on
moving average (MA), such as ARIMA models. These are built upon the assumption of
the time series being stationary, resulting in the models simply being linear equations
[9]. The ARIMA equation can be formulated as
n n
1! 1!
yt = c + yt−i + et−i , (2.1)
n i n i
where yt and et are the output and error at time t, c is some constant and n is the window
size determining how far in the past the model sees. ARIMA models have been applied
6
Figure 2.2.1: A multilayer perceptron with one hidden layer. The first layer consists of
a data sequence x which is fed to the network. x is multiplied to the first weight matrix
W , which creates the hidden layer h. The output o is generated by multiplying h to the
second weight matrix U .
for predicting mobile traffic load, for example in [38] and [54], but they generally tend
to over reproduce the mean values of past data [45].
Significant improvements in graphics processing units (GPUs) over the years have
allowed for relatively complex computations to be run on large scale datasets, without
requiring a supercomputer [4]. This has facilitated the evaluation of certain ML genres,
namely approaches involving large amounts of operations which can be computed
in parallel, most commonly matrix multiplications [33]. The genre of ANNs is the
perhaps most favoured one.
2.2 Machine learning for time series prediction
2.2.1 Recurrent neural networks
Initially, although vaguely, inspired by the human brain, ANNs are networks of nodes
connected to each other through weights. Each node applies some nonlinear function
to the combination of its input and its value is passed on through the network.
Typically, the node output is binary and the node is said to activate if some threshold
value is breached [29].
The multilayer perceptron (MLP) is the archetypal ANN model and many of its basic
principles can be transferred to other ANN models. It consists of an input layer, i.e.
the input datapoints, which is also called the visible layer, an output layer, i.e. the
7
predictions, and in between them a number of hidden layers each of which consists of
a set of nodes. The data flows forward through the network, from the input layer to the
output, transformed between the different stages by matrices. The MLP is therefore a
socalled feedforward neural network (FFNN).
For the input to be understood by the machine, it is broken down into simpler
representations, further for each layer. The first hidden layer is a representation
of the visible layer, the second hidden layer is a representation of the first, and so
on. The further the representations are from the visible layer, the more abstract
they become [18]. That is, more abstract to us, but simpler to the model. Figure
2.2.1, showing an MLP with one hidden layer, serves to illustrate the fundamental
computations of an FFNN. The weights are updated through backpropagation, i.e. by
computing the gradient of the loss function with respect to each parameter, one layer
at a time, starting from the final layer. There are several different loss functions, and
which to use depends on the application. For example, mean squared error (MSE)
is common for regression tasks whereas crossentropy often is applied for classifiers.
An FFNN considered as more successful than the MLP is the CNN [18]. After the
breakthrough in [28] from 2012, CNNs have been seen winning many competitions
and they are successfully applied to various domains [18, 29]. They achieved human
level performance in certain tasks of image recognition in 2015 [40] and have for long
dominated computer vision.
Although FFNNs have delivered highly competitive results for time series, e.g. in [25],
they naturally cannot capture dependencies in time further than the length of their
input sequence, i.e. the width of the input layer. This is referred to as the network’s
memory depth, and increasing it to capture long term dependencies would incur an
extreme amount of multiplications [18], see figure 2.2.1.
RNNs, as illustrated in figure 2.2.2, are designed for processing sequences and are
scalable to handle much longer time series than FFNNs. By passing information
of past timesteps through an extra weight matrix, an RNN can recognise long term
dependencies without requiring the input layer size to match the length of the
dependencies. RNNs are trained by backpropagation through time, which, after
unrolling the network as in figure 2.2.2 equals the backpropagation training process of
an FFNN [18]. However, as independently discovered by Hochreiter [22] and Bengio
et al. [3] in the ’90s, even RNNs are limited in learning long term dependencies.
8
Figure 2.2.2: The recurrent connection of an RNN is typically denoted by a loop (left
handside). A represents the hidden layer of the network, U, V, W denotes the weight
matrices, L is the loss function, and x, o, y are input, output and truth respectively. The
loop illustrates how the hidden layer not only passes its representation of the input to
the output but also recurrently to the hidden layer at the next input timestep. The
RNN is also illustrated unrolled (righthandside) to clarify how past representations
are passed on in time through W . The RNN is trained by applying the backpropagation
algorithm onto the computational graph of the unrolled network to minimise the loss
function, exactly like with an FFNN [18].
2.2.2 Vanishing and exploding gradients for RNNs
The problem of vanishing and exploding gradients arises when the computational
graph onto which the backpropagation algorithm is applied becomes too deep.
Repeated multiplication of the same values causes the gradient to either vanish or
in rare cases explode [18]. For example, repeated multiplication by the matrix W
means W t at timestep t. If W has the eigendecomposition W = V diag(λ)V −1 , then
W t = V diag(λ)t V −1 , which means that any eigenvalues λ that are not very close to 1
or −1 will vanish or explode for some t. Vanishing gradients stop the algorithm from
learning as it complicates updating the parameters in the correct direction. Exploding
gradients make optimisation difficult as learning becomes unstable [18].
2.2.3 An RNN without learning issues: LSTM
There are many approaches developed to tackle the problem of vanishing or exploding
gradients, but the perhaps most popular one for practical applications is to include
gated units in some recurrent architecture [18]. In this study we consider the gated
LSTM unit. It was introduced by Hochreiter and Schmidhuber in [23] and is commonly
used in gated RNNs.
9
Figure 2.2.3: The LSTM unit keeps track of past information by passing a cell state ct
through time. What information should be stored, and what should be forgotten, is for
each timestep t, determined by modulation of input data xt through the input gate it ,
the forget gate ft , and the output gate ot .
Instead of single operation cells as in regular RNNs, an LSTM network has cells that
are like a miniature network themselves, as displayed in figure 2.2.3. The basic idea is
to add a path through time without fostering the learning problem of vanishing (nor
exploding) gradients. This memorylike ability is achieved through an inner recurrence
(in addition to the outer recurrence of the RNN) of the cell state ct which has a socalled
selfloop that allows for previously processed information to be passed on through time
[18]. What information should be stored and forgotten are decided by modulation
through a series of gates, specifically the input gate it , output gate ot , and forget gate
ft . For each timestep t they are defined as
it = σ(Wi xt + Ui ht−1 + bi ), (2.2)
ot = σ(Wo xt + Uo ht−1 + bo ), (2.3)
ft = σ(Wf xt + Uf ht−1 + bf ), (2.4)
where W and U are weight matrices for the respective gate. ht is the hidden state
ht = ot " tanh(ct ), (2.5)
where " is the elementwise (Hadamard) product, tanh is the hyperbolic tan activation
10
function and ct is the cell state
ct = ft " ct−1 + it " tanh(Wc xt + Uc ht−1 + bc ). (2.6)
For each timestep, the hidden state is passed on as output to the next layer, but it
is also passed on in time to the next recursion together with the cell state. When
included in an RNN, the learnable parameters of the LSTM cell W , U and b are trained
by backpropagation through time essentially like the regular RNN. The difference,
however, is that the gradient of ct lack any intrinsic factor which would drive it to vanish
or explode [2]. Through this path, an RNN incorporating LSTM cells is thus trainable
even on large time lags.
2.3 Related work
2.3.1 Mobile network traffic prediction
Traffic prediction in cellular networks has been extensively researched for various
purposes, more or less similar to the use case in this study. RNN based architectures
including gated LSTM units are often amongst the top performers.
In [53], a multiview ensemble learning approach to predict mobile traffic load was
proposed to enable putting BSs to sleep for reduced energy consumption. In addition
to temporal trends, the models attend to capturing spatial influence and influence of
external events, such as weather and holidays. On top of the ensemble of models, an
optimisation algorithm was applied, accounting for operation cost, energy efficiency
and quality of service. A deep reinforcement learning approach for scheduling the
new demand of data traffic with high volume and high time flexibility was presented
in [7]. This new segment of traffic, often originating from data uploads of Internet
of things applications such as smart homes, needs to find gaps in between traffic
peaks to not intervene with realtime, delaysensitive services such as voice calls and
video streaming. Underlying the reinforcement learning agent is an LSTM predictor
which successfully forecasts the network’s throughput congestion. Similarly, in [49],
unprecedented energy efficiency improvements by activation and deactivation of
entire BSs in a heterogeneous mobile network through a deep reinforcement learning
approach based on traffic predictions by a deep ANN model was presented.
11
In [45], a hybrid model for spatial and temporal prediction in mobile networks was
presented, where the temporal part is based on LSTM cells. The model significantly
outperformed the selected baseline methods of an ARIMA and a support vector
regression model. In [42], a recurrent LSTM network was compared to an ARIMA
model and an FFNN, where both ANN approaches showed superior to the ARIMA
results with the LSTM slightly outperforming the FFNN. The models were applied
to real network data consisting of the number of connected users to an LTE node per
ms for K previous timesteps T , with K being tested from 110, and the duration of
T ∈ {10, 30, 60, 120} in ms. For the LSTM network, increasing T decreases accuracy
and increasing K increases accuracy. In [1], a comparative evaluation of an LSTM
network and an ARIMA model for mobile traffic prediction was performed. Although
the recurrent LSTM network showed superior results over the ARIMA, and particularly
for longranging time series, the study reveals that there can be scenarios in which the
ARIMA performs close to the LSTM with lower complexity. In [25], an LSTM network
for mobile traffic prediction which outperformed their baseline models was presented,
i.e. it performed more accurate predictions than an ARIMA model and with similar
error rates at significantly shorter training time than an FFNN.
2.3.2 Prediction without recurrence: Transformers
In 2017, Vaswani et al. presented the Transformer model as the first sequence to
sequence model based entirely on attention and thus without any recurrent layers
[44]. With their model, proposed for natural language translation tasks, they achieved
new stateoftheart results when training significantly faster than models applying
convolutional or recurrent layers.
The Transformer model essentially consists of N identical stacked encoder layers and
N identical stacked decoder layers. Both types of layers are fully connected and are
illustrated in the left and right half of figure 2.3.1 respectively.
Each encoder layer serves to map the input sequence of symbol representations, e.g.
a sentence for a natural language translation task, and its positional encoding to a
sequence of continuous representations. Each layer has two sublayers each of which
includes a residual connection and layer normalisation, i.e. the output of each sub
layer is LayerNorm(x + SubLayer(x)), where x denotes the input of the sublayer. The
bottom sublayer is a multihead selfattention mechanism, and the top one is a fully
12
Figure 2.3.1: The model architecture of the Transformer [44].
connected FFNN.
Each decoder layer takes the output of the encoder stack and the rightshifted decoder
output as input to generate an output sequence, e.g. the same sentence in another
language for the translation task. The decoder layers are similar to the encoders, but
in between the selfattention and the FFNN sublayers is another multihead attention
mechanism that attends to the output of the encoder stack. Furthermore, the self
attention sublayer is masked to ensure that the shifted prediction at a position cannot
depend on subsequent outputs.
The purpose of the attention mechanism is to, when processing some input, let the
model attend only to previous knowledge which is in some way relevant to that input.
The multihead attention is a stack of multiple attention mechanisms which allows
the model to learn different relationships to attend to. In practice, the attention is
achieved by mapping a matrix of queries Q to a set of key and value matrix pairs (K, V ).
Vaswani et al. proposed a scaled version of the attention mechanism, which they call
the scaled dotproduct attention [44]. For keys and queries of dimension dk and values
13
Figure 2.3.2: Scaled dotproduct attention (left) and multihead attention (right) [44].
Q, K and V represents the queries, keys and values respectively, and h denotes the
number of heads in the multihead attention layer.
of dimension dv , the attention output is computed as
QK T
Attention(Q, K, V ) = softmax( √ )V. (2.7)
dk
The multiheaded attention mechanism consists of h parallel attention mechanisms

which attend to h different versions of Q, K, and V , linearly projected into
their respective dimensions by learned projections. The attention mechanisms
are concatenated and again projected into the output values. The structure and
information flow of the scaled dotproduct attention mechanism and the multihead
attention are illustrated in figure 2.3.2. The multihead attention mechanism can be
formulated as
MultiHead(Q, K, V ) = Concat(head1 , ..., headh )W O , (2.8)

where (2.9)
headi = Attention(QWiQ , KWiK , V WiV ), (2.10)
where WiQ ∈ Rdmodel ×dk , WiK ∈ Rdmodel ×dk , WiV ∈ Rdmodel ×dv and WiO ∈ Rhdv ×dmodel are
the parameter matrices of projection. Vaswani et al. employed a dimension structure
of dk = dv = dmodel /h, with h = 8 and dk = 64 [44].
The encoderdecoder structure of the Transformer makes it suitable for generative

tasks. In addition to new stateoftheart performance in the machine translation
14
domain [43, 44], Transformer models have been applied to a range of generative tasks
with extremely successful results, such as image generation [35], speech recognition
[12], and language understanding [11].
Fully Transformerbased approaches have also been applied to time series prediction
tasks. In [47], a full encoderdecoder Transformer for forecasting the spread of yearly
influenza was presented with results similar to the stateoftheart, but favourable
to the latter. The Transformer model was however superior to the LSTM. In [32],
a model including the decoder part of the Transformer for time series forecasting
was proposed, with results competitive to the stateoftheart on benchmark datasets
and implications that the Transformer can capture longterm dependencies that
are difficult to the LSTM. Temporal and spatial transformers were combined in
stacked blocks to capture the spatiotemporal dependencies of vehicle traffic flow in
[25] with stateoftheart competitive results. In [52], an incredibly successful pre
trainable unsupervised model for representation learning of time series based on the
Transformer encoder was presented. The model clearly outperformed the stateof
theart, even for supervised learning, and it is currently the best performing model for
multivariate time series regression and classification, the authors claim [52].
The attention mechanism of the Transformer has been included in works based on
LSTM, CNN, and RNN architectures for mobile traffic prediction, but mainly for
capturing spatial dependencies. Novel approaches including stacking an attention
mechanism on some ANN architecture were proposed in [21] and [30] for predicting
the total mobile traffic volume in a specific region by not only capturing temporal
dependencies but also spatial through neighbouring regions. In [13], a deep LSTM
model stacked onto an attention mechanism that extracts temporal and spatial features
from mobile traffic data was proposed for largescale traffic prediction of a cellular
network. Compared to the chosen baseline of an RNN with gated recurrent units
(GRUs)2 , the proposed model was superior for predictions at three out of six BSs
[13]. A spatialtemporal attentionbased LSTM network for mobile traffic prediction
was presented in [17]. In their experiments, the proposed model scored higher than
competing models.
However, fully attentionbased approaches, which have gained such acknowledgement

in other domains, seem to remain unexplored for mobile traffic prediction. The
2
The GRU, introduced by Cho et al. in [8], is second to the LSTM the most commonly used gated
unit for RNNs.
15
machine translation Transformer presented by Vaswani et al. in [44] is based on a

positional embedding of the input words, and some similar representation of time is
needed for such an approach to apply to time series prediction tasks.
2.3.3 Representation of time: Time2Vec
In [26], Kazemi et al. presented the alternative representation of time Time2Vec. It

is similar to the periodic positional embedding of the Transformer [44], but instead
of being fixed, Time2Vec includes learnable parameters in its representation of time.
The authors claim that by including their representation of time in the input fed
to the model, performance can be improved for most models in handling problems
including temporal dependencies [26]. The scalar notion of time τ is transformed into
its Time2Vec representation t2v(τ ) by


ωi τ + ϕi , if i = 0.
t2v(τ )[i] = (2.11)

F(ωi τ + ϕi ), if 1 ≤ i ≤ k,
where t2v(τ )[i] is the ith element of a vector of size k + 1, and ωi and ϕi are learnable
parameters. In this study, F is the sine function F = sin(ωi τ + ϕi ), but other similar
periodic activation functions, such as cosine, produce equivalent representations
[26].
16
Chapter 3
Method
The prediction problem is formed as a multivariate manytomany problem. The

input matrix X consists of 15 feature vectors and the target matrix Y is a subset of the
onestep leftshifted input LeftShift(X). This subset consists of the two feature vectors
which limit the capacity of each radio unit, which are what we want to predict. In other
words, the problem for the models to learn is to, for the input row xt of timestep t,
predict the the subsequent values xat+1 and xbt+1 , where a and b denote the two features
that leftshifted constitutes Y .
3.1 Model architecture of the Transformer model

The original Transformer architecture, as introduced by Vaswani et al. [44], includes
an encoderdecoder structure where the decoder part makes it suitable for sequence
tosequence tasks, and in particular when the output sequence length is not pre
specified.
In our case, the output sequence length is fixed and the Transformer model was
implemented with the encoder part but without the decoder part. In addition to the
decoder layer being redundant, discarding it roughly cuts the number of parameters
in half, which leads to benefits regarding computational complexity and learning
feasibility.
The implemented Transformer model essentially consists of a number of stacked

identical attention layers each of which is equal to the encoder of [44]. That is, the
multihead attention sublayers and the feedforward sublayers are implemented with
17
CHAPTER 3. METHOD
Figure 3.1.1: The Transformer model consists of stacked identical attention layers,
which each is equal to the original encoder layer of [44]. Instead of positional encoding
added to the input, a Time2Vec representation of time is concatenated to the input
data. The decoder layer from [44] is completely discarded, and in its place is a linear
layer transforming to correct output shape.
a residual connection, which adds their input to their output before being normalised
by layernormalisation so that the output of each sublayer becomes LayerNorm(x +
SubLayer(x)). The feedforward sublayers are fully connected and their activation
functions are the rectified linear unit (ReLU) [20], i.e. max(0, x).
The structure and information flow of the model are displayed in figure 3.1.1 and can
be described as follows. The input X ∈ Rn×m of n samples and m feature vectors is
transformed to a Time2Vec representation matrix T 2V ∈ Rn×m . The input and the
Time2Vec representation are then concatenated into XT 2V ∈ Rn×2m before being fed
to the first attention layer. The output of each attention layer is shaped identical to its
input, and can simply be passed on as input to the next attention layer. The output
of the final attention layer is projected to the correct output shape through a linear
layer.
The dimensions of the model, the number of attention layers, the number of heads
h per layer, the size of each FFNN sublayer, the dropout rate and the batch size of
input data were selected based on the parameter grid search described in section 4.2.
The dimensions of queries dq , keys dk , values dv and the total multihead attention
mechanism dmodel follow the same relationship as in the original paper dq = dk = dv =
dmodel /h [44].
The model is trained through backpropagation and the MSE was chosen as loss
18
CHAPTER 3. METHOD
function. For the number of predictions n, the measured values yi and the predicted
value ŷi , the MSE is
n
1!
MSE = (ŷi − yi )2 . (3.1)
n i=1
The adaptive moment estimation (ADAM) [27] algorithm was applied for optimisation
(minimisation) of the loss function. Unlike the classical stochastic gradient descent
method with a fixed learning rate, ADAM assigns individual adaptive learning rates
to the learnable parameters based on estimates from first and second gradient
movements. The parameters of the ADAM optimiser were set to α = 0.001, β1 = 0.9
and β2 = 0.999.
3.2 Model architecture of the recurrent LSTM

The implemented LSTM network consists of one or more recurrent layers of LSTM
units. On top of the recurrent layers is a linear layer, projecting the output to the correct
shape. The parameters are learned by backpropagation through time, as described in
section 2.2.1. The loss function is the MSE and optimisation is performed through
ADAM, i.e. the same optimiser as for the implemented Transformer model. The
number of hidden layers, the number of nodes, the batch size, and the dropout rate
is determined by the parameter grid search, as described in section 4.2.
Like the Transformer, the LSTM model was implemented with the MSE loss function
of equation 3.1 and the ADAM optimiser for learning the weights. The parameters of
the ADAM optimiser of the LSTM model are set identically to the parameters of the
Transformer model.
3.3 Regularization
A common problem for complex ANNs is overfitting to training data, i.e. when the
model learns the patterns of the training set such closely that it struggles with fitting
to data from outside the training set, such as to a test set. Methods aiming towards
reducing error on data other than the training data are referred to as regularization
strategies [18].
19
CHAPTER 3. METHOD
To prohibit overfitting to training data, the regularization strategy of dropout [39] was
applied to both models with rates determined by the parameter grid search, described
in section 4.2. It is powerful but computationally cheap [18]. Dropout was applied to
the original Transformer at a rate of 0.1 [44] and it has often been applied to LSTM
networks. The dropout rates determined through the grid search were both zero,
meaning no dropout rate was implemented in the selected models.
3.4 Wilcoxon signedrank for statistical testing

The Wilcoxon signedrank test [46] was applied to statistically confirm the differences
in the models’ performances. It excludes the assumptions that samples follow a
Gaussian distribution and that results are independent, which makes it suitable for
comparing the performance of algorithms [10]. The data in the use case example
cannot be assumed normally distributed.
For each of the number of sliding window iterations (as described in section 4.1.3) N ,
the difference between the RMSEscore of the two models is computed, denoted as di
for the ith test. The N test setups are then ranked from 1 to N in ascending order
where any lacking difference di = 0 is discarded. The ranks of all positive differences
&
are summed into R+ = di >0 and the ranks of all negative differences are summed into
&
R− = di <0 . The smaller of the two sums W = min(R+ , R− ) can then be compared to
the N and pvalue specific critical value Wcrit . If W < Wcrit for the current N , the score
differences are unlikely to have occurred by chance to the corresponding pvalue.
3.5 Implementation details

All practical implementations were built in Python3.6. The Transformer model applies
the builtin multihead attention layer from Tensorflow’s Keras API [41]. Likewise, the
LSTM network consists of the builtin recurrent LSTM layers from the Keras API.
The hardware setup consisted of one machine with one NVIDIA GeForce 1080 GPUs,
four CPUs, and a memory of 16 Gb.
20
Chapter 4
Experimental work
This chapter firstly presents the data applied for the experiments and how data were
processed to facilitate learning. Then, the selection of model parameters is motivated,
and finally, the experiments conducted to evaluate the two models and their respective
results are presented.
4.1 Data and preprocessing of data

The data used to perform the experiments consist of eleven feature vectors sampled at
even intervals during a period of ten months at a real BS with three sectors and three
LTE radio units in each sector. In addition to these, four synthetic feature vectors
extracted from the timestamps are used. Some periods completely lack measurements
that leads to gaps in the data with lengths ranging from a couple of hours to several
weeks.
4.1.1 Data description
The three sectors omit patterns with similar weekly and daily periods, but of different
magnitude. This is displayed in a snapshot of three weeks of one of the label columns in
figure 4.1.1. At each measured datapoint, the curves represent the maximum measured
number of connected users, averaged over the three units of each sector.
This example of differences between the sectors’ patterns emphasises the effect of
different geographical areas on mobile traffic. The data from the same sector, i.e. radio
traffic from units covering the same geographical area, correspond to each other more
21
CHAPTER 4. EXPERIMENTAL WORK
Figure 4.1.1: Maximum connected users per sector averaged over all radio units of each
sector for three weeks of measurements. Weekly and daily patterns can be identified.
than across sectors. This implies that model generalisability across different areas will
suffer, not only between sectors but also between BSs.
The periodic patterns of the data can be highlighted by transformation to the frequency
domain (by Fourier transformation), as displayed in figure 4.1.2. Strong trends can be
identified at occurrences of once and twice a day, but there is also a clear periodicity at
once a week. Representing each sample’s time by periodic functions of some desired
period provides a simple but effective signal of where in that period the sample was
measured, which is further described in section 4.1.2.
4.1.2 Preprocessing
The raw data from the BS were processed in several ways to facilitate learning. All pre
processing strategies were applied to all data, i.e. to both training and test sets.
First of all, any rows with missing values were deleted. This results in some additional
gaps in the time series which might affect temporal dependencies. This problem is
handled by transforming the timestamps to periodical representations of time, which
also handles the longer gaps in the measurements.
The datasets were extended by sine and cosine representations of time, providing the
22
Figure 4.1.2: Number of connected users transformed to the frequency domain. Peaks
can be identified at occurrences of once per week, once per day and twice per day.
models with information of when in the week and when in the day each sample was
measured. For the time τ in seconds, the timeoftheday signal consists of the two
new feature columns sin(τ 2π
d
) and cos(τ 2π
d
), where d denotes total number of seconds
per day. Likewise, the timeoftheweek signal consists of sin(τ 2π
w
) and cos(τ 2π
w
),
where w represents total seconds per week. These four additional features emphasise
that 12 pm is close to 01 am and Sundays are close to Mondays. In addition to
these manually extracted periods, the learnable periodic parameters of the Time2Vec
representation, as part of the implemented Transformer model, is capable of capturing
other periodicities.
The inconsistency of the longer gaps in the measurements constitutes a problem to the
sequential models. When aspiring to learn the patterns of time series, the sequence of
data is important. For example, after processing a sample from a Monday at 8 am, it
would make no sense to then process a sample from a Sunday at 10 pm. In order to
retain the manually extracted periods, the gaps of the dataset are handled such that any
subsequent sample has the correct subsequent timeoftheday and timeoftheweek
signal.
The different feature vectors of the input data have different scales. To inhibit
bias towards the larger scaled features, all feature vectors were scaled into range
23
[0, 1] by minmax normalisation. For each feature column vector xi , its normalised
representation x∗i is computed by
xi − min(xi )
x∗i = . (4.1)
max(xi ) − min(xi )
For each iteration of traintest splits, all data were normalised on the minmax values
of the current training set. That is, no information from the test dataset was ever
used in the normalisation process. For each target column vector yi , its normalised
representation y∗i is computed by
yi − min(xi )
y∗i = . (4.2)
max(xi ) − min(xi )
4.1.3 Evaluation and splitting of data
For ML models applied to other tasks than time series prediction, a popular, simple,
and robust approach to validate results is the kfold crossvalidation. It essentially
consists of splitting the complete dataset into k splits of equal size, shuffle the order
of the splits, and then iterate over the possible combinations of splits for training and
testing according to some ratio. This typically generates k results to average over for a
more robust score.
However, the kfold crossvalidation approach is not as trivial for time series
prediction. It makes no sense to change the order of occurrence of data and train the
model on data chunks subsequent to the test chunks, which would be the case of k
fold crossvalidation and any other method involving shuffling of data. It can instead
be beneficial to adopt methods that account for the problem’s temporal aspects, as is
recommended by [34].
To avoid scenarios where the models are trained on future data and tested on past, but
still promote unbiased scores, a prequential block evaluation with a sliding window is
applied for splitting and evaluating data. Although it has some downsides compared
to crossvalidation and holdout approaches [5], it has the important advantage of
providing adequate error estimates while keeping the sequence of data intact [34].
The prequential block approach was implemented with a sliding window size of five,
where the models were trained on four splits of data and tested on the next subsequent
split. The data were split into 12 equally sized chunks {D1 , D2 , ..., D12 }, and for each
24
Figure 4.1.3: The data were split into 12 and iterated over by a prequential block
method with a sliding window consuming five splits per test iteration. This leads to
a total of eight scores to average over.
sliding window iteration i, the models were trained on {Di , Di+1 , Di+2 , Di+3 } and tested
on Di+4 . This process, as illustrated in figure 4.1.3, provides eight iterations and a
corresponding number of error rates to average over for a more robust error estimate.
When more than one radio unit dataset are used, the process was repeated for each
dataset.
One effect of the selected method for data splitting is that each training iteration utilises
less data than alternative methods, for example, compared to a similar approach with
a growing window that utilises all data in the final iteration. However, the growing
window approach includes different amounts of data for each iteration, which makes
the models somewhat trained on different premises for each iteration. The lower data
utilisation generally leads to a lower average error rate, but as other methods tend to
provide overly optimistic scores, it is not necessarily a negative effect [34]. It can also
be expected to result in a lower standard deviation between iterations than alternative
approaches such as a similar approach with a growing window.
The models were evaluated on the root mean squared error (RMSE). It is simply
the square root of the MSE of equation 3.1. In our use case scenario, large errors
are significantly more harmful than small errors and we thus want to penalise large
errors more than small errors. The loss function MSE and the evaluation metric
RMSE were chosen as they both include squaring the error. For very small error rates,
which was the case for the better performing versions of the models, the RMSE simply
leads to larger values than the MSE which makes the error rates more convenient to
compare.
Additionally, however secondly, the models were evaluated on the time required for
training. This was measured by Python’s builtin time module [36]. The hardware
25
setup is described in section 3.5.
4.2 Model selection

For each of the two models, a parameter grid search was performed to determine
tunable parameters. These were performed on data from one of the sectors with the
data from each radio unit treated separately, by training one separate model for each
of the sector’s three datasets.
In order to prevent data leakage from test scenarios, the parameter searches were
performed on the four data splits which were never used for testing D1 , D2 , D3 and
D4 . For each radio unit of the sector, the models were trained and evaluated in two
iterations, resulting in a total of six scores and timings to average over for each model.
The first iteration consisted of training on D1 and D2 and testing on D3 , and the second
iteration of training on D2 and D3 followed by testing on D4 .
For the LSTM model, the search was performed in four parameter spaces, namely the
number of layers, the number of units per layer, dropout rate and batch size. For the
Transformer model, it was performed in six spaces. These were the dimensions of the
model dk = dq = dv , the number of attention layers, the number of heads per layer h,
the size of each FFNN sublayer, the dropout rate and the batch size of input data. All
other parameters were fixed during the grid search, including the number of epochs
which was set to 25.
4.2.1 Parameter grid search for the LSTM
For the LSTM network, the parameter grid search was performed over the
number of hidden layers ∈ {1, 2, 3}, the number of units per hidden layer ∈
{4, 8, 16, 32, 64, 128, 256, 512, 1024}, batch size ∈ {4, 8, 12, 16} and dropout rate ∈
{0.0, 0.1, 0.2}. Regarding the latter, a dropout rate of 0.0 means no dropout. The top ten
scoring parameter setups, as displayed in table 4.2.1, performed very similar in terms
of RMSE.
The best option for dropout rate is clearly to discard any dropout. Regarding the other
parameters, batch size is chosen as 12, the number of layers as 1 and the number of
units as 256. This is the best performing model regarding error rate and the fastest
learning one amongst the top 10 options. It is noteworthy that the worst performing
26
model in the entire grid search predicted at a reduced error rate of merely 1.5 per cent
compared to the best performing one.
4.2.2 Parameter grid search for the Transformer
The parameter grid search for the Transformer model included six parameter grids.
These were the number of layers ∈ {1, 2, 4, 8}, the number of heads in each multihead
attention sublayer h ∈ {2, 4, 8, 12, 16}, the key and value dimensions of each head
dk = dv ∈ {32, 64, 128}, the dimension of the FFNN sublayer ∈ {8, 32, 64, 128, 256},
the dropout rate ∈ {0.0, 0.1, 0.2} and the size of each batch fed into the model ∈
{32, 64, 128, 256}.
The results are again similar, as can be seen in figure 4.2.2. The model setup with the
second lowest error rate is selected for further experiments as the parameter search
implies that it requires significantly shorter time for training than the best performing
model setup in terms of prediction error. That is, the selected model is the one with 6
layers, 8 heads per layer, a keyvalue dimension of 128, a batch size 256, a dimension of
128 for the FFNNlayer and no dropout. The error rate of the worst performing model
was nine per cent lower than of the best performing one.
4.3 Experiments
Four scenarios of how to train and test the data were used for evaluation of the models.
These are denoted as S1, S2, S3, and S4, and are described in the first part of this
layers units dropout batch size RMSE time (s)

1 256 0 12 0.184489 9.890437
1 512 0 12 0.184502 10.33306
1 128 0 12 0.184611 10.16801
1 64 0 8 0.184625 14.03329
2 256 0 12 0.184727 14.56715
2 128 0 12 0.184737 15.03414
1 1024 0 12 0.184755 10.55461
1 128 0 8 0.184769 14.29351
1 32 0 8 0.18479 13.97841
2 64 0 8 0.184817 20.44822
Table 4.2.1: The 10 best results of the parameter grid search for the LSTM model. The
bold setup marks the chosen model.
27
layers h dk FFNN units batch size dropout RMSE time(s)

8 8 64 128 64 0 0.185346 22.77483
2 4 64 128 64 0 0.185693 7.116714
8 8 64 32 64 0 0.185704 22.19315
8 4 32 8 64 0.1 0.185724 22.2351
4 16 32 64 64 0 0.185735 12.13569
4 4 32 128 64 0 0.185756 11.8884
8 16 32 128 64 0 0.185791 22.37578
8 16 128 128 64 0 0.185826 22.39942
8 8 32 64 64 0 0.185882 22.75767
4 8 32 256 256 0.1 0.18589 7.389337
Table 4.2.2: The 10 best results of the parameter grid search for the Transformer
model. The bold setup marks the chosen model.
S1 S2 S3 S4
Training splits 4 4 36 90
Test splits 1 1 9 2
Test iterations 24 24 8 9
Table 4.3.1: The number of data splits used for training and testing for each scenario.
section. Notable is that, in the first two scenarios, the models are trained on a less
amount of data than in the third, and that in the third scenario, in turn, the models
are trained on a less amount of data than in the fourth. The amount of training data
applied, and for how many traintest iterations, is illustrated for each scenario in table
4.3.1.
The scenarios were exposed to two experiments, referred to as Experiment 1 and

Experiment 2. The first includes training each model for 25, 50, 75, 100, and 150
epochs on all 15 input features to predict both target variables, and the second includes
variations to the number of input columns and target columns. After the description of
the scenarios, the two experiments are presented together with their respective results
in the remaining part of the section. Unless stated otherwise, the experiments were
run according to the prequential block evaluation in section 4.1.3.
4.3.1 Four scenarios for training and testing
S1: Separate treatment of radio units
For one sector, three individual but identical models were trained and evaluated on
data from one of the three units of the sector respectively. They were trained for 50
28
epochs and evaluated according to section 4.1.3.
S2: Generalising within the sector
The training process was performed as in S1, but the models were evaluated on a
dataset from a different unit than the one applied for training. However, both the
training data unit and the test data unit were, for all iterations, from the same sector.
Again, the models were trained for 50 epochs. Although the test dataset originated
from a different radio unit, the sequential order of data splits was unaltered, i.e. the
test dataset consisted of the subsequent split following the last split of the training
dataset.
S3: Generalising across sectors
To investigate the models’ ability to generalise across sectors, the models were trained
on data from all sectors and all radios, to then be tested on data from ditto. The
sequential order of data was preserved by applying the prequential approach presented
in section 4.1.3. In fact, the models were trained on four data splits from every radio
unit of the BS and tested on the subsequent split from every unit of the BS, with a rolling
window resulting in eight iterations. This means that each model was trained on a total
of 36 splits and tested eight times, with each test set consisting of nine splits.
S4: Utilising all data
In this experiment scenario, almost all available data are utilised for training each
model. One Transformer model and one LSTM model were trained on the first ten
data splits of each radio unit, i.e. each of the models was trained on 90 data splits.
They were then evaluated on the last two splits of each unit separately, resulting in
nine scores and timings to average over. Accordingly, the prequential approach of the
previously presented experiments was not implemented for S4.
29
epochs
test pvalue model RMSE σe ∆e (%) time (s) στ ∆τ (%)
S1 25 LSTM 0.197 0.027 2 18.0 0.5 +40
p = 0.01 Transf. 0.201 0.028 +2 10.8 0.3 67
50 LSTM 0.179 0.029 2 34.8 0.4 +46
p = 0.02 Transf. 0.183 0.030 +2 18.8 0.3 85
75 LSTM 0.199 0.028 2 51.9 0.5 +46
p = 0.01 Transf. 0.203 0.031 +2 27.8 0.4 87
100 LSTM 0,198 0,028 3 65,7 0,4 +47
p = 0.01 Transf. 0,203 0,027 +3 35,1 0,7 87
150 LSTM 0.197 0.027 1 98.2 0.6 +47
p = 0.03 Transf. 0.199 0.029 +1 52.0 0.7 89
S2 25 LSTM 0.271 0.096 2 17.8 0.2 +39
p = 0.07 Transf. 0.276 0.088 +2 10.8 0.3 65
50 LSTM 0.248 0.085 1 34.7 0.3 +45
p = 0.30 Transf. 0.251 0.082 +1 19.0 0.2 83
75 LSTM 0.273 0.092 3 51.9 0.3 +46
p = 0.01 Transf. 0.282 0.088 +3 27.8 0.4 87
100 LSTM 0,271 0,097 3 65,7 0,4 +46
p = 0.03 Transf. 0,279 0,101 +3 35,3 0,5 86
150 LSTM 0.273 0.090 4 98.2 0.3 +47
p = 0.01 Transf. 0.284 0.080 +4 52.1 0.8 89
S3 25 LSTM 0.231 0.035 5 40.2 0.8 +47
p = 0.01 Transf. 0.244 0.035 +5 21.3 1.5 89
50 LSTM 0.239 0.049 6 79.0 1.1 +48
p = 0.01 Transf. 0.254 0.049 +6 40.9 1.3 93
75 LSTM 0.235 0.035 4 123.6 0.8 +49
p = 0.01 Transf. 0.244 0.036 +4 63.4 1.3 95
100 LSTM 0,238 0,034 3 158,0 0,9 +48
p = 0.01 Transf. 0,245 0,035 +3 82,3 1,6 92
150 LSTM 0.236 0.036 3 238.4 0.9 +48
p = 0.01 TRANS 0.243 0.037 +3 123.6 1.5 93
S4 25 LSTM 0.153 0.032 2 295.6 +49
p = 0.57 Transf. 0.156 0.035 +2 150.0 97
50 LSTM 0.128 0.032 3 587.1 +49
p = 0.01 Transf. 0.133 0.034 +3 300.1 96
75 LSTM 0.153 0.032 2 912.5 +49
p = 0.01 Transf. 0.156 0.033 +2 464.6 96
100 LSTM 0,152 0,031 2 1176,6 +49
p = 0.01 Transf. 0,156 0,033 +2 600,9 96
150 LSTM 0.152 0.031 2 1773.7 +48
p = 0.01 Transf. 0.155 0.033 +2 919.7 93
Table 4.3.2: Results of the models when trained on input data consisting of all 15
feature columns for predicting both target variables at once. The ∆e column showcases
the differences in prediction error between the two models in percentage based on the
average error of the respective model. ∆τ displays the differences in consumed time
for training.
30
4.3.2 Experiment 1
In the first experiment, the two models were evaluated after being trained on all 15
input feature vectors. The results are presented in table 4.3.2 for comparison on equal
numbers of epochs. For all four scenarios, the recurrent LSTM model predicted the
targets with a lower error rate than the Transformer when compared after training for
an equal number of epochs. The superiority of the LSTM in RMSE ranges from scoring
1 to 6 per cent better than the Transformer. The differences are statistically significant
to a pvalue of 0.03 or lower in all cases but three, according to the Wilcoxon signed
rank test.
However, the Transformer model required a significantly shorter training time than
the LSTM, again, for all four scenarios. The prediction error of the Transformer can
also be compared to the LSTM on equal premises in terms of consumed training time.
For comparison after (roughly) equal training time, the results are presented in table
4.3.3. In three out of the four tests evaluated on equal training time, the Transformer
model outperformed the LSTM network in average error rate.
pvalue epochs model RMSE σe ∆e (%) time (s)

S1 p = 0.01 25 LSTM 0.197 0.027 +7 18.0
50 Transf. 0.183 0.030 8 18.8
S2 p = 0.05 25 LSTM 0.271 0.096 +7 17.8
50 Transf. 0.251 0.082 8 19.0
S3 p = 0.01 25 LSTM 0.231 0.035 10 40.2
50 Transf. 0.254 0.049 +9 40.9
S4 p = 0.01 25 LSTM 0.153 0.032 +13 295.6
50 Transf. 0.133 0.034 15 300.1
Table 4.3.3: Results of the models when trained on input data consisting of all 15
feature columns for predicting both target variables at once, illustrated for comparison
on roughly equal training time. The ∆e column displays the differences in prediction
error between the two models in percentage based on the average error of the respective
model.
4.3.3 Experiment 2
In this experiment, the models were exposed to training on a reduced setup of input
columns. Furthermore, a setup of training one separate model for each target feature
was tested, i.e. training one modelpair of Transformers and one modelpair of LSTM
31
networks for predicting the two target variables. All models were trained for 50
epochs.
The results of training the models on reduced input width are presented in table 4.3.4.
Firstly, the models were evaluated after being exclusively trained on a subset of six of
the input feature columns, namely the two that when leftshifted constitute the target
variables and the four periodical representations of the timeoftheweek and time
oftheday. Secondly, they were evaluated after being exclusively trained on the two
feature columns that when leftshifted constitute the target columns. All models were
trained for 50 epochs and all differences can be considered as statistically significant
according to the Wilcoxon signedrank test. The results show that the training time is
not reduced for any of the models when training on a reduced set of input data. In S1,
S2, and S4, the RMSE scores are significantly higher compared to the scores of the first
experiment, whereas the scores of S3 are similar.
In table 4.3.5, the results of training one separate model for each target variable
are showcased. We can see that training each modelpair is roughly twice as time
consuming compared to training a single model for the same task (Experiment 1). In
S1, the combined error rates are lower than they are in Experiment 1 for the same
number of epochs. However, for S2, S3, and S4, the models are outperformed by their
corresponding versions of the first experiment.
32
input cols RMSE σe time (s) στ ∆e (%)

S1 6 LSTM 0.199 0.028 33.8 0.5 2
p = 0.01 Transf. 0.203 0.028 18.5 0.3 +2
2 LSTM 0.198 0.028 33.6 0.3 0
p = 0.01 Transf. 0.199 0.026 18.1 0.3 0
S2 6 LSTM 0.273 0.095 33.8 0.3 3
p = 0.01 Transf. 0.280 0.103 18.8 0.4 +3
2 LSTM 0.272 0.095 33.6 0.2 4
p = 0.01 Transf. 0.282 0.098 18.1 0.3 +4
S3 6 LSTM 0.243 0.036 79.4 1.3 4
p = 0.02 Transf. 0.253 0.034 40.5 1.5 +4
2 LSTM 0.242 0.036 79.1 0.7 5
p = 0.01 Transf. 0.253 0.041 39.6 1.4 +4
S4 6 LSTM 0.153 0.031 582.2 2
p = 0.02 Transf. 0.157 0.033 294.0 +2
2 LSTM 0.153 0.030 586.9 3
p = 0.02 Transf. 0.156 0.033 291.4 2
Table 4.3.4: Results of the models when trained in reduced setups of input features for
50 epochs. All models were trained on the two feature columns constituting the targets
when leftshifted, and the models trained on six feature columns were also trained
on the four synthetically generated time representations presented in section 4.1.2.
The ∆e column displays the differences in prediction error between the two models in
percentage based on the average error of the respective model.
pvalue
test y1 y2 model RMSE1 σ1 RMSE2 σ2 RMSE time (s)
S1 0.6 0.01 LSTM 0.068 0.017 0.099 0.012 0.168 68.2
Transf. 0.070 0.019 0.104 0.015 0.175 37.9
S2 0.01 0.00 LSTM 0.115 0.054 0.146 0.080 0.261 68.2
Transf. 0.131 0.066 0.168 0.094 0.299 37.8
S3 0.55 0.38 LSTM 0.133 0.026 0.133 0.026 0.266 162.2
Transf. 0.135 0.025 0.136 0.026 0.271 83.2
S4 0.10 0.05 LSTM 0.087 0.012 0.087 0.012 0.174 1205.2
Transf. 0.088 0.011 0.088 0.011 0.176 608.9
Table 4.3.5: Results of training one LSTM model and one Transformer model for each
target variable. Each model was trained on all 15 input features to predict one of the
targets. The columns RMSE1 , RMSE2 , σ1 , and σ2 show the error rates and standard
deviations when predicting target variable y1 and y2 respectively. The combined error
of each modelpair is presented in column RMSE and the timecolumn displays their
respective total training time.
33
Chapter 5
Discussion
This chapter mainly serves to explain and discuss the key findings of the experiments
in relation to the research question of the project, i.e. to how a fully attentionbased
Transformer model compares to a recurrent LSTM network in terms of prediction
performance and training time when predicting traffic in a mobile network BS. Beyond
this, a few additional findings of the experimental work are discussed. Initially, the
experiments are interpreted and discussed, followed by a section that acknowledges
the limitations of the work.
5.1 Interpretation of the experimental findings
5.1.1 Comparing the Transformer to the LSTM
First of all, when evaluating the models on a fixed number of epochs, the results of
the conducted experiments are unambiguous concerning the prediction performance
related part of the research question. In terms of prediction error, the implemented
Transformer performs worse than the LSTM in all performed experiments and all
scenarios for training and testing. The differences are significant to a pvalue of 0.05
or lower in 25 out of the total of 32 tests. In other words, the risk of any reproduction
of the experiments leading to a different finding than the LSTM being superior to
the implemented Transformer model when trained on equal numbers of epochs is
low.
However, the results are equally clear concerning the part of the research question
attending to the time consumed for model training. In terms of required training
34
CHAPTER 5. DISCUSSION
time, the Transformer completely outperforms the LSTM model in all experiments and
all scenarios. As stated in section 1.1, in a practical implementation such as the use
case example of this project, the tradeoff between the model’s direct effect on energy
consumption and error rate is most essential, rather than the error rate itself. In some
cases, it might thus be reasonable to select the Transformer model over the LSTM for
prediction to the cost of a few per cent increase in prediction error.
When the models are compared after equal training time instead of after a fixed
number of epochs, the results lead to other findings. For all scenarios in the first
experiment, the models’ training times are similar enough for comparison when the
LSTM has been trained for half the number of epochs as the Transformer. After
being trained for 25 and 50 epochs respectively, with a pvalue of 0.05 or less, the
Transformer model outperforms the LSTM in terms of prediction error in S1, S2, and
S4, but on the contrary, the LSTM outperforms the Transformer in S3, as can be seen in
table 4.3.3. The models have also been trained on equal time after 50 and 100 epochs
respectively, as well as after 75 and 150 epochs respectively. However, from the results
presented in table 4.3.2, it is clear that both models tend towards overfitting as the
number of epochs passes 50. Since no regularization is implemented, the comparison
after equal training time becomes unfair towards the faster model after some number
of epochs.
The key findings of the study are threefold. Firstly, in line with the hypothesis,
the Transformer model requires a significantly shorter time for training than the
LSTM at an equal number of epochs. Secondly, by comparison at such premises,
the Transformer is clearly outperformed by the domaindominant LSTM network.
Lastly, however, when instead comparing the models on similar premises in terms of
training time, the results show that the implemented Transformer model outperforms
the recurrent LSTM network in three out of the four evaluated cases. Although the
number of comparisons on this end is low, it still indicates that, when trained on equal
training time, a fully attentionbased Transformer model can compare well, and even
favourable, to the domaindominant and often stateoftheart performing model of
a recurrent LSTM network in terms of prediction error. However, more evidence is
needed to draw any stronger conclusions on the matter.
35
5.1.2 Additional findings
For comparison of the different scenarios for training and testing, it is noteworthy that
the parameter search for the respective models was performed on the training and
testing setup of S1. Any findings from such a comparison must thus be interpreted
with caution, and particularly when the result is favourable to S1. However, as
can be seen in section 4.2.1 and 4.2.2, the effect of changing parameters within the
defined search grids is relatively little. Additionally, as can be seen in table 4.3.1, the
different scenarios include differently distributed traintest proportions which affect
both training times and error rates.
With this in mind, we can carefully acknowledge that the results presented in table
4.3.2 indicate that both models suffer when generalising across different radio units of
the sector as well as across different sectors of the BS, i.e. the error rates of S2 and S3
are significantly higher than the ones of S1. As discussed in section 4.1, it was expected
that the models would struggle in such scenarios, but not that the models would suffer
roughly equally when generalising across different sectors as when generalising across
units of the same sector. Another interesting notation is that the standard deviation
of the error rate is higher in S2 than in S3. This behaviour would be unexpected if the
scenarios could be compared on an even basis, since the target variables differ more
between sectors than within sectors, as discussed in section 4.1. However, as can be
seen in table 4.3.1, each model is trained on four data splits in S2, whereas they are
trained on 36 data splits in S3. This can be a contributing factor to the similar error
rates, as well as to the lower standard deviation in S3.
The results from S1 and S4 of the tests presented in table 4.3.2 indicate that the
reduced prediction performance from generalising across sectors can be compensated
by adding more training data. The models of S1 were trained on merely four data splits,
whereas the models of S4 were trained on 90, with the latter resulting in significantly
lower prediction errors. This indicates that any possible practical advantages of
applying a single model that can generalise across sectors might be utilisable without
compromising prediction performance. Fewer models need to be implemented, which
might facilitate data streams, and increases in prediction error can be compensated by
the increases in available data. It is possible that the same behaviour of increasing data
would be seen for generalisation attempts across BSs, which could lead to even further
advantages in a practical implementation.
36
By comparing the results from training the models on reduced setups of input width of
table 4.3.4 to the results of applying all 15 input features of table 4.3.2, we can see that
reducing input data has little or no effect on the consumed training time. However,
reducing the input width appears to have a significant negative effect on prediction
error for both the implemented models. Furthermore, these results indicate that the
effect of the added periodical time representations, see section 4.1.2, add little or no
value to the models in comparison to being trained on the two targetcorresponding
features. In practice, there is thus no point from a model computational complexity
aspect to reduce the input data width, however, there might be other advantages from
reducing the size of data flows depending on the practical application details.
Training separate models requires roughly twice as much time for training with little
or no effect on prediction error, as can be seen in table 4.3.5. More specifically,
in comparison to applying a single model for the two target variables, both the
Transformer and LSTM model pairs show significant increases in average RMSE for
S2, S3, and S4, and significant decreases of 4 and 6 per cent respectively in S1. The
results thus indicate that there is no point in training separate model pairs for the
prediction task of this project.
5.2 Limitations
Although the goal of this project does not include finding the optimal models, the model
selection of section 4.2 is limited to selecting the models for S1. This has effect on
the validity of any findings from direct comparison between different scenarios, and
especially when S1 is involved. A more thorough approach, but perhaps unnecessarily
exhaustive as optimal model selection was out of scope, would have been to tune
parameters for each individual test. Furthermore, the grid search could have been
more detailed and it could have ranged over more dimensions. That is, the investigated
parameter options could have been more narrow, and not all dimensions were
investigated, e.g. different optimisation and regularization strategies. However, even
these possible extensions of the grid search were deemed as unnecessarily exhaustive
in relation to the project scope. Similarly, further implementation options for the
Transformer architecture remain unexplored, but then again, the study does not
aim towards finding the optimal models, but rather to compare them on common
ground.
37
Another limitation, also related to the grid search, is the comparison on equal premises
in terms of training time. The grid search determined that completely disregarding
any dropout regularization was the best alternative for both the Transformer and the
LSTM implementations. But as the grid search was performed over 25 epochs, training
behaviour suffered when training for more epochs and longer time. Both models show
clear tendencies to overfitting to training data when being trained for more than 50
epochs, which can be seen in table 4.3.2. It would have been desirable to compare
the models for more cases than the four on equal training time, which became the
current case. For fair comparisons on such basis, a grid search, at least covering
regularization strategies, would be needed for each investigated number of epochs or
time of training. The comparison where the implemented LSTM network is trained for
25 epochs, and the Transformer is trained for 50, is however still a valid comparison as
the difference ended up in favour to the Transformer and both models were parameter
tuned at 25 epochs. Although, its statistical significance is indeed weak due to the few
samples.
Another limitation to the study is the evaluation of training time. There are no
experiments which directly investigate the behaviour of training time in different
settings. An example of such an experiment could have been how training time scales
with number of epochs, input dimensionality and different batch sizes. The focus
of this project has been on the prediction performance evaluation, and training time
has merely been accounted for. However, as claimed in section 1.1, training time and
computational complexity is an important aspect to the problem, especially concerning
energy efficiency.
38
Chapter 6
Conclusion
By proposing a previously unexplored approach of fully attentionbased learning for

mobile traffic prediction, this degree project has aimed to contribute to the current
knowledge base of Transformer learning for time series as well as to ditto of traffic
prediction at mobile networks.
The findings of the project indicate that, when predicting mobile traffic at a mobile
network BS, a fully attentionbased Transformer model can compare to the domain
dominant LSTM network when being trained for an equally long time. The results
of this study point in favour to the Transformer, but further investigation would be
required for any conclusion on the matter.
When the models instead are trained for a set number of epochs, it can be
concluded that the implemented Transformer model trains significantly faster than the
implemented LSTM network, but suffers in a comparison on prediction performance.
However, when implementing a prediction model for proactive resource management
at BSs, it would make no sense to account for the error rate based on a set number
of epochs, regardless of required training time. For any practical implementation,
the essential factor should instead be a tradeoff between prediction error and
required training time, and on such terms, the implemented Transformer is a realistic
alternative to the LSTM.
Therefore, we can conclude that the fully attentionbased Transformer architecture

can, as in many other domains, be a qualified competitor to the currently dominating
methods of time series prediction in mobile networks. The findings of this degree
project are thus in line with previous research in other problem domains of time series
39
CHAPTER 6. CONCLUSION
prediction where fully attentionbased models have shown performances at stateof

theart levels.
6.1 Future work

A direct continuation of this degree project would be to produce a more thorough
investigation of how the Transformer can compare to the LSTM on equal training
time when applying regularization strategies optimised for that amount of time. As
discussed in section 5.2, the work of this project includes limitations on that end.
Furthermore, it would certainly be interesting to investigate how good a fully attention

based time series prediction model could be in terms of both prediction error and
training time. The model could for example be implemented with some other
positional embedding, or perhaps the structure of the attention layers could be
modified. In this manner, a study directed towards developing the best possible
attentionbased model for mobile traffic prediction is a terrific research topic. An
initial guess, based on the performed review of previous work, would be that some
attentionbased structure topped by some sophisticated ANN could be a potential top
performer.
Several interesting research topics related to the practicalities of proactive resource

management have been identified during the work of this project. One of these is an
investigation of the models’ ability to generalise across different BSs, as the results of
this project indicate that both the Transformer and the LSTM are able to generalise
across radio units which cover different geographical areas. This would be particularly
relevant for how to design any large scale implementation based on mobile traffic
prediction. Another topic is a thorough investigation of how different prediction
models behave on different hardware setups, which would be relevant based on what
type of hardware is available at the different potential implementation points of the
network. When the CPU capacity was limited by mistake during the experimental
work, the training time increased significantly for both models, on all scenarios. This
is a reminder that the differences in training times observed in the experiments are
limited to the specific hardware setup applied in this degree project.
40
Bibliography
[1] Azari, Amin, Papapetrou, Panagiotis, Denic, Stojan, and Peters, Gunnar.
“Cellular traffic prediction and classification: A comparative evaluation of
LSTM and ARIMA”. In: International Conference on Discovery Science.
Springer. 2019, pp. 129–144.
[2] Bayer, Justin Simon. “Learning sequence representations”. PhD thesis.

Technische Universität München, 2015.
[3] Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. “Learning longterm
dependencies with gradient descent is difficult”. In: IEEE transactions on
neural networks 5.2 (1994), pp. 157–166.
[4] Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal,

Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph,
WardeFarley, David, and Bengio, Yoshua. “Theano: A CPU and GPU math
compiler in Python”. In: Proc. 9th Python in Science Conf. Vol. 1. 2010,
pp. 3–10.
[5] Cerqueira, Vitor, Torgo, Luis, and Mozetič, Igor. “Evaluating time series
forecasting models: An empirical study on performance estimation methods”.
In: Machine Learning 109.11 (2020), pp. 1997–2028.
[6] Chen, Mingzhe, Challita, Ursula, Saad, Walid, Yin, Changchuan, and
Debbah, Mérouane. “Artificial Neural NetworksBased Machine Learning for
Wireless Networks: A Tutorial”. In: IEEE Communications Surveys Tutorials
21.4 (2019), pp. 3039–3071. DOI: .
[7] Chinchali, Sandeep, Hu, Pan, Chu, Tianshu, Sharma, Manu, Bansal, Manu,
Misra, Rakesh, Pavone, Marco, and Katti, Sachin. “Cellular network traffic
scheduling with deep reinforcement learning”. In: Proceedings of the AAAI
Conference on Artificial Intelligence. Vol. 32. 1. 2018.
41
BIBLIOGRAPHY
[8] Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar,

Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua.
“Learning Phrase Representations using RNN Encoder–Decoder for Statistical
Machine Translation”. In: Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association
for Computational Linguistics, Oct. 2014, pp. 1724–1734. DOI:
. URL: .
[9] Deb, Chirag, Zhang, Fan, Yang, Junjing, Lee, Siew Eang, and Shah, Kwok Wei.
“A review on time series forecasting techniques for building energy
consumption”. In: Renewable and Sustainable Energy Reviews 74 (2017),
pp. 902–924.
[10] Demšar, Janez. “Statistical comparisons of classifiers over multiple data sets”.
In: The Journal of Machine Learning Research 7 (2006), pp. 1–30.
[11] Devlin, Jacob, Chang, MingWei, Lee, Kenton, and Toutanova, Kristina.
“BERT: Pretraining of Deep Bidirectional Transformers for Language
Understanding”. In: NAACLHLT. 2019.
[12] Dong, Linhao, Xu, Shuang, and Xu, Bo. “Speechtransformer: a norecurrence
sequencetosequence model for speech recognition”. In: 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE. 2018, pp. 5884–5888.
[13] Feng, Jie, Chen, Xinlei, Gao, Rundong, Zeng, Ming, and Li, Yong. “Deeptp: An
endtoend neural network for mobile cellular traffic prediction”. In: IEEE
Network 32.6 (2018), pp. 108–115.
[14] Frenger, Pal and Ericson, Marten. “Assessment of alternatives for reducing
energy consumption in multiRAT scenarios”. In: 2014 IEEE 79th Vehicular
Technology Conference (VTC Spring). IEEE. 2014, pp. 1–5.
[15] Frenger, Pål, Jading, Ylva, and Turk, John. “A case study on estimating future
radio network energy consumption and CO 2 emissions”. In: 2013 IEEE 24th
International Symposium on Personal, Indoor and Mobile Radio
Communications (PIMRC Workshops). IEEE. 2013, pp. 1–5.
[16] Gamboa, John Cristian Borges. “Deep learning for timeseries analysis”. In:
arXiv preprint arXiv:1701.01887 (2017).
42
BIBLIOGRAPHY
[17] Gao, Yun, Wei, Xin, Zhou, Liang, and Lv, Haibing. “A deep learning framework
with spatialtemporal attention mechanism for cellular traffic prediction”. In:
2019 IEEE Globecom Workshops (GC Wkshps). IEEE. 2019, pp. 1–6.
[18] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, and Bengio, Yoshua. Deep
learning. Vol. 1. 2. MIT press Cambridge, 2016.
[19] Guo, Jia, Peng, Yu, Peng, Xiyuan, Chen, Qiang, Yu, Jiang, and Dai, Yufeng.
“Traffic forecasting for mobile networks with multiplicative seasonal arima
models”. In: 2009 9th International Conference on Electronic Measurement &
Instruments. IEEE. 2009, pp. 3–377.
[20] Hahnloser, Richard HR, Sarpeshkar, Rahul, Mahowald, Misha A,

Douglas, Rodney J, and Seung, H Sebastian. “Digital selection and analogue
amplification coexist in a cortexinspired silicon circuit”. In: Nature 405.6789
(2000), pp. 947–951.
[21] He, Kaiwen, Huang, Yufen, Chen, Xu, Zhou, Zhi, and Yu, Shuai. “Graph
attention spatialtemporal network for deep learning based mobile traffic
prediction”. In: 2019 IEEE Global Communications Conference
(GLOBECOM). IEEE. 2019, pp. 1–6.
[22] Hochreiter, Sepp. “Untersuchungen zu dynamischen neuronalen Netzen”.

Diploma. Technische Universität München, 1991.
[23] Hochreiter, Sepp and Schmidhuber, Jürgen. “Long shortterm memory”. In:
Neural computation 9.8 (1997), pp. 1735–1780.
[24] International Telecommunication Union. Minimum requirements related to

technical performance for IMT2020 radio interface(s). 2017. URL:
.
[25] Jaffry, Shan and Hasan, Syed Faraz. “Cellular Traffic Prediction using
Recurrent Neural Networks”. In: 2020 IEEE 5th International Symposium on
Telecommunication Technologies (ISTT). 2020, pp. 94–98. DOI:
.
[26] Kazemi, Seyed Mehran, Goel, Rishab, Eghbali, Sepehr, Ramanan, Janahan,
Sahota, Jaspreet, Thakur, Sanjay, Wu, Stella, Smyth, Cathal, Poupart, Pascal,
and Brubaker, Marcus. “Time2vec: Learning a vector representation of time”.
In: arXiv preprint arXiv:1907.05321 (2019).
43
BIBLIOGRAPHY
[27] Kingma, Diederik P. and Ba, Jimmy Lei. “Adam: A method for stochastic
gradient descent”. In: ICLR: International Conference on Learning
Representations. 2015, pp. 1–15.
[28] Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. “Imagenet

classification with deep convolutional neural networks”. In: Advances in
neural information processing systems 25 (2012), pp. 1097–1105.
[29] LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. “Deep learning”. In:
Nature 521.7553 (2015), pp. 436–444.
[30] Li, Ming, Wang, Yuewen, Wang, Zhaowen, and Zheng, Huiying. “A deep
learning method based on an attention mechanism for wireless network traffic
prediction”. In: Ad Hoc Networks 107 (2020), p. 102258.
[31] Li, Rongpeng, Zhao, Zhifeng, Chen, Xianfu, Palicot, Jacques, and
Zhang, Honggang. “TACT: A transfer actorcritic learning framework for
energy saving in cellular radio access networks”. In: IEEE transactions on
wireless communications 13.4 (2014), pp. 2000–2011.
[32] Li, Shiyang, Jin, Xiaoyong, Xuan, Yao, Zhou, Xiyou, Chen, Wenhu,
Wang, YuXiang, and Yan, Xifeng. “Enhancing the Locality and Breaking the
Memory Bottleneck of Transformer on Time Series Forecasting”. In: Advances
in Neural Information Processing Systems. Ed. by H. Wallach, H. Larochelle,
A. Beygelzimer, F. d’AlchéBuc, E. Fox, and R. Garnett. Vol. 32. Curran
Associates, Inc., 2019. URL:
.
[33] Madiajagan, M. and Raj, S. Sridhar. “Chapter 1 Parallel Computing, Graphics

Processing Unit (GPU) and New Hardware for Deep Learning in
Computational Intelligence Research”. In: Deep Learning and Parallel
Computing Environment for Bioengineering Systems. Ed. by
Arun Kumar Sangaiah. Academic Press, 2019, pp. 1–15. ISBN:
9780128167182. DOI:
. URL:
.
[34] Oliveira, Mariana, Torgo, Luis, and Santos Costa, Vitor. “Evaluation
procedures for forecasting with spatiotemporal data”. In: Mathematics 9.6
(2021), p. 691.
44
BIBLIOGRAPHY
[35] Parmar, Niki, Vaswani, Ashish, Uszkoreit, Jakob, Kaiser, Lukasz,

Shazeer, Noam, Ku, Alexander, and Tran, Dustin. “Image transformer”. In:
International Conference on Machine Learning. PMLR. 2018, pp. 4055–4064.
[36] Python. Python Time module: Time access and conversions.

. Accessed: 20210523.
[37] Richter, Fred, Fehske, Albrecht J, and Fettweis, Gerhard P. “Energy efficiency
aspects of base station deployment strategies for cellular networks”. In: 2009
IEEE 70th Vehicular Technology Conference Fall. IEEE. 2009, pp. 1–5.
[38] Shu, Yantai, Yu, Minfang, Yang, Oliver, Liu, Jiakun, and Feng, Huifang.
“Wireless traffic modeling and prediction using seasonal ARIMA models”. In:
IEICE transactions on communications 88.10 (2005), pp. 3992–3999.
[39] Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and
Salakhutdinov, Ruslan. “Dropout: a simple way to prevent neural networks
from overfitting”. In: The journal of machine learning research 15.1 (2014),
pp. 1929–1958.
[40] Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott,
Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and
Rabinovich, Andrew. “Going deeper with convolutions”. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9.
[41] Tensorflow. Tensorflow Keras.

. Accessed:
20210523.
[42] Trinh, Hoang Duy, Giupponi, Lorenza, and Dini, Paolo. “Mobile traffic
prediction from raw data using LSTM networks”. In: IEEE 29th Annual
International Symposium on Personal, Indoor and Mobile Radio
Communications (PIMRC). IEEE. 2018, pp. 1827–1832.
[43] Vaswani, Ashish, Bengio, Samy, Brevdo, Eugene, Chollet, Francois,

Gomez, Aidan, Gouws, Stephan, Jones, Llion, Kaiser, Łukasz,
Kalchbrenner, Nal, Parmar, Niki, Sepassi, Ryan, Shazeer, Noam, and
Uszkoreit, Jakob. “Tensor2Tensor for Neural Machine Translation”. In:
Proceedings of the 13th Conference of the Association for Machine
Translation in the Americas (Volume 1: Research Track). Boston, MA:
45
BIBLIOGRAPHY
Association for Machine Translation in the Americas, Mar. 2018, pp. 193–199.
URL: .
[44] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion,
Gomez, Aidan N, Kaiser, Łukasz, and Polosukhin, Illia. “Attention is All you
Need”. In: Advances in Neural Information Processing Systems. Ed. by
I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,
and R. Garnett. Vol. 30. Curran Associates, Inc., 2017. URL:
[45] Wang, Jing, Tang, Jian, Xu, Zhiyuan, Wang, Yanzhi, Xue, Guoliang,
Zhang, Xing, and Yang, Dejun. “Spatiotemporal modeling and prediction in
cellular networks: A big data enabled deep learning approach”. In: IEEE
INFOCOM 2017IEEE Conference on Computer Communications. IEEE.
2017, pp. 1–9.
[46] Wilcoxon, Frank. “Individual Comparisons by Ranking Methods”. In:

Biometrics Bulletin 1.6 (1945), pp. 80–83. ISSN: 00994987. URL:
.
[47] Wu, Neo, Green, Bradley, Ben, Xue, and O’Banion, Shawn. “Deep transformer
models for time series forecasting: The influenza prevalence case”. In: arXiv
preprint arXiv:2001.08317 (2020).
[48] Xu, Mingxing, Dai, Wenrui, Liu, Chunmiao, Gao, Xing, Lin, Weiyao,
Qi, GuoJun, and Xiong, Hongkai. “Spatialtemporal transformer networks for
traffic flow forecasting”. In: arXiv preprint arXiv:2001.02908 (2020).
[49] Ye, Junhong and Zhang, YingJun Angela. “DRAG: Deep reinforcement
learning based base station activation in heterogeneous networks”. In: IEEE
Transactions on Mobile Computing 19.9 (2019), pp. 2076–2087.
[50] Yu, Nuo, Miao, Yuting, Mu, Lan, Du, Hongwei, Huang, Hejiao, and
Jia, Xiaohua. “Minimizing energy cost by dynamic switching on/off base
stations in cellular networks”. In: IEEE Transactions on Wireless
Communications 15.11 (2016), pp. 7457–7469.
46
BIBLIOGRAPHY
[51] Yu, Yanhua, Wang, Jun, Song, Meina, and Song, Junde. “Network traffic
prediction and result analysis based on seasonal ARIMA and correlation
coefficient”. In: 2010 International Conference on Intelligent System Design
and Engineering Application. Vol. 1. IEEE. 2010, pp. 980–983.
[52] Zerveas, George, Jayaraman, Srideepika, Patel, Dhaval,

Bhamidipaty, Anuradha, and Eickhoff, Carsten. “A Transformerbased
Framework for Multivariate Time Series Representation Learning”. In: arXiv
preprint arXiv:2010.02803 (2020).
[53] Zhang, Sheng, Zhao, Shenglin, Yuan, Mingxuan, Zeng, Jia, Yao, Jianguo,
Lyu, Michael R, and King, Irwin. “Traffic prediction based power saving in
cellular networks: A machine learning method”. In: Proceedings of the 25th
ACM SIGSPATIAL international conference on advances in geographic
information systems. 2017, pp. 1–10.
[54] Zhou, Bo, He, Dan, Sun, Zhili, and Ng, Wee Hock. “Network traffic modeling
and prediction with ARIMA/GARCH”. In: Proc. of HETNETs Conference.
2005, pp. 1–10.
47
TRITA -EECS-EX-2021:644
www.kth.se

Full Text 01

Uploaded by

Copyright:

Available Formats

Full Text 01

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Full Text 01

Uploaded by

Copyright:

Available Formats

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,

SECOND CYCLE, 30 CREDITS

Transformer learning for traffic

KTH ROYAL INSTITUTE OF TECHNOLOGY

The implemented attention­based approach consists of stacked layers of multi­head

Transformer, Attention, LSTM, Mobile traffic prediction.

Den här studien föreslår en transformermetod, helt baserad på attention­mekanismen,

Transformermodellen består av flera attention­lager staplade i kombination med

Transformer, Attention, LSTM, Prediktering av mobil nätverkstrafik.

I would like to express my deepest appreciation to my supervisors Lackis Eleftheriadis

Furthermore, I would like to extend my sincere thanks to Athanasios Karapantelakis

ADAM adaptive moment estimation

4.1.1 Data description . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

Proactive resource management relies on robust and precise traffic prediction to

and data streams.

Recurrent LSTM networks have proven successful at capturing temporal dependencies

1.2 Purpose and goal

1.4 Ethics and sustainability

Another aspect of energy efficiency is related to the complexity of the underlying

A mobile communications network consists of multiple interconnected BSs which

resource management could contribute to significant improvements. It has even been

2.1 Time series prediction

2.2 Machine learning for time series prediction

2.2.1 Recurrent neural networks

2.2.2 Vanishing and exploding gradients for RNNs

2.2.3 An RNN without learning issues: LSTM

it = σ(Wi xt + Ui ht−1 + bi ), (2.2)

ot = σ(Wo xt + Uo ht−1 + bo ), (2.3)

ft = σ(Wf xt + Uf ht−1 + bf ), (2.4)

ht = ot " tanh(ct ), (2.5)

function and ct is the cell state

ct = ft " ct−1 + it " tanh(Wc xt + Uc ht−1 + bc ). (2.6)

2.3 Related work

2.3.1 Mobile network traffic prediction

2.3.2 Prediction without recurrence: Transformers

Figure 2.3.1: The model architecture of the Transformer [44].

of dimension dv , the attention output is computed as

The multi­headed attention mechanism consists of h parallel attention mechanisms

MultiHead(Q, K, V ) = Concat(head1 , ..., headh )W O , (2.8)

The encoder­decoder structure of the Transformer makes it suitable for generative

However, fully attention­based approaches, which have gained such acknowledgement

machine translation Transformer presented by Vaswani et al. in [44] is based on a

2.3.3 Representation of time: Time2Vec

In [26], Kazemi et al. presented the alternative representation of time Time2Vec. It

The prediction problem is formed as a multi­variate many­to­many problem. The

3.1 Model architecture of the Transformer model

The implemented Transformer model essentially consists of a number of stacked

3.2 Model architecture of the recurrent LSTM

3.4 Wilcoxon signed­rank for statistical testing

3.5 Implementation details

4.1 Data and pre­processing of data

4.1.1 Data description

4.1.3 Evaluation and splitting of data

setup is described in section 3.5.

4.2 Model selection

4.2.1 Parameter grid search for the LSTM

4.2.2 Parameter grid search for the Transformer

layers units dropout batch size RMSE time (s)

layers h dk FFNN units batch size dropout RMSE time(s)

The scenarios were exposed to two experiments, referred to as Experiment 1 and

The implemented attentionbased approach consists of stacked layers of multihead

Den här studien föreslår en transformermetod, helt baserad på attentionmekanismen,

Transformermodellen består av flera attentionlager staplade i kombination med

The multiheaded attention mechanism consists of h parallel attention mechanisms

The encoderdecoder structure of the Transformer makes it suitable for generative

However, fully attentionbased approaches, which have gained such acknowledgement

The prediction problem is formed as a multivariate manytomany problem. The

3.4 Wilcoxon signedrank for statistical testing

4.1 Data and preprocessing of data

pvalue epochs model RMSE σe ∆e (%) time (s)

By proposing a previously unexplored approach of fully attentionbased learning for

Therefore, we can conclude that the fully attentionbased Transformer architecture

prediction where fully attentionbased models have shown performances at stateof

Furthermore, it would certainly be interesting to investigate how good a fully attention

[33] Madiajagan, M. and Raj, S. Sridhar. “Chapter 1 Parallel Computing, Graphics