ARIMA Project Report
In the electrical power grid, the power load is not constant but continuously changing. This depends on many different factors, among them the habits of the consumers, the yearly seasons and the hour of the day. The continuous change in energy consumption requires the power grid to be flexible. If the energy provided by generators is lower than the demand, this is usually compensated by using renewable power sources or stored energy until the power generators have adapted to the new demand. However, if buffers are depleted the output may not meet the demanded power, which could cause power outages. The currently adopted practice in the industry is based on configuring the grid depending on some expected power draw. This analysis is usually performed at a high level and provides only a basic load aggregate as an output. In this thesis, we aim at investigating techniques that are able to predict the behaviour of loads with fine-grained precision. These techniques could be used as predictors to dynamically adapt the grid at run time. We have investigated the field of time series forecasting and evaluated and compared different techniques using a real data set of the load of the Swedish power grid, recorded hourly over several years. In particular, we have compared traditional ARIMA models to a neural network and a long short-term memory (LSTM) model to see which of these techniques had the lowest forecasting error in our scenario. Our results show that the LSTM model outperformed the other tested models, with an average error of 6.1%.
I want to thank my supervisor Mirko D’Angelo for guiding me through all of this, and I want to thank my fiancée Frida for all the love and support. Without you, this would not have been possible.
Contents
1 Introduction
1.1 Problem formulation and objectives
1.2 Scope, limitation and target group
1.3 Outline
2 Method
2.1 Scientific approach
2.2 Method description
2.3 Reliability and validity
3 Related work
3.1 Machine learning techniques
3.2 ARIMA (Auto-Regressive Integrated Moving Average) models and hybrid implementations
3.3 Other techniques
3.4 Discussion
4 Data set analysis
5 Forecasting
5.1 Long-term forecasting vs short-term forecasting
5.2 Forecasting error metrics
5.3 ARIMA models
5.3.1 Training the model
5.3.2 Model selection and fitting of the model
5.3.3 Short-term forecasting
5.3.4 Long-term forecasting
5.3.5 Evaluation of the forecasting results
5.4 Neural network and machine learning
5.4.1 Training the neural networks
5.4.2 Forecasting with the neural network
5.4.3 LSTM (Long Short-Term Memory) model
5.4.4 Forecasting with the LSTM model
5.4.5 Evaluation of forecasting results
5.5 Discussion
5.6 Implementation
6 Conclusion and future work
References
1 Introduction
The Swedish power grid consists of over 150 000 kilometers of power lines, roughly 160 substations and 16 stations that connect to other countries [1]. To handle the load demand there are two separate power grids: the transmission grid and the distribution grid. The transmission grid handles the transport of power between the power generators and the substations, while the distribution grid handles all the power distribution from the substations to the consumers. All of this is required to power our houses, cellphones, computers, etc. Sweden has a wide variety of power generators, but most of the power comes from nuclear power plants, hydro power and renewable generators such as wind power. Renewable energy sources such as wind power, however, are not constant and can vary a lot. This means that the other power plants need to be adaptable and able to increase their power production if the wind power generators fail to produce enough power due to weather conditions. Thermal power plants such as coal or nuclear plants are able to increase their power production, but it often takes a while to reach the new requested power level.
In transmission and distribution electrical power grids, the loads on the power lines are not constant but continuously changing. This depends on various factors, including the season and the habits of the consumers. Moreover, power usage differs greatly depending on the time of day, and other unexpected factors can change the amount of power drawn from the grid. If loads increase, this is usually compensated in modern grids by using flexible power sources, which can sustain different demand levels up to a certain maximum [2]. However, if resource buffers deplete, the draw could exceed the produced output, leading to a power shortage.
One of the most prevalent solutions to this problem is forecasting. If we are able to accurately predict the required power load, we can adapt preemptively. To be able to forecast the required power load we need some kind of data to base our prediction on. Most often we use a time series, which is data recorded over a long time period.
A time series is a set of observations, each one recorded at a specific time [3]. A discrete time series is a set of observations recorded at a fixed interval, which might be daily, weekly, hourly, etc. These time series are often one-dimensional, just a time/date and a value; however, there are multiple factors to take into account when analyzing the data. A simple graphical representation of the data set can tell us a lot. Figure 1.1 shows the wine sales in Australia. By looking at figure 1.1 we can see that it has a clear increasing trend. We can also observe the increase in sales during the spring and the summer, which is referred to as seasonality. The unusual highs and lows of the series are attributed to white noise: random variation that is neither trend nor seasonality but still affects the time series.
Time series are commonly used in other areas, such as economics, to forecast stock prices, and it is not uncommon for companies to use forecasting techniques to predict the workload in order to increase or decrease the number of workers needed.
To correctly operate a transmission grid, load forecasting is of utmost importance [4, 5]. The currently adopted practice in the industry is based on configuring the grid depending on some expected power draw. However, this analysis is usually performed at a high level and provides only a basic load aggregate as an output [5]. We, on the other hand, aim at investigating and finding techniques that are able to predict the behaviour of loads with fine-grained precision. These techniques could be used as predictors to dynamically adapt the grid at run time. The techniques we investigate are the traditional ARIMA models and different machine-learning approaches.
Figure 1.1: Australian wine sales, Jan. 1980 – Oct. 1991
The field of time series forecasting has been used for multiple different tasks that require planning, often due to restrictions in adaptability [3]. One of the most popular algorithms for time series forecasting is ARIMA. ARIMA models were first introduced in the 1950s but were made popular by George E. P. Box and Gwilym Jenkins in their book "Time Series Analysis: Forecasting and Control" [6]. To get a clear overview of what ARIMA is, we must first break it down into smaller pieces. ARIMA is a combination of multiple forecasting principles. The Autoregressive (AR) model is a representation of a random process. The autoregressive model specifies that the output depends on the previous values and a stochastic term (an imperfectly predictable term), therefore making it a stochastic difference equation [6]. Basically, autoregressive models take the previous steps into account when predicting and calculating the next step. The issue with the AR model is that temporary or single shocks affect the whole output indefinitely. To limit this, the AR process has a lag order, which determines how many of the previous steps contribute to the output. The AR model can be non-stationary, as it can contain a unit root [6].
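For reference, an AR(p) model is commonly written as follows (standard notation, added here for clarity):

$X_t = c + \varphi_1 X_{t-1} + \dots + \varphi_p X_{t-p} + \varepsilon_t$

where the $\varphi_i$ are the autoregressive coefficients and $\varepsilon_t$ is the stochastic (white noise) term.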
The MA (Moving Average) model, in contrast to the AR model, is always stationary. The moving average model is a linear regression of the current value against the white noise or random shocks in the series, in contrast to the AR model, which is a linear regression against previous (non-shock) values [7].
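In the same standard notation, an MA(q) model regresses against the past shock terms:

$X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$

where the $\theta_i$ weight the $q$ most recent shocks $\varepsilon_{t-i}$.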
There is no reason why these models cannot be combined, and that is where the ARMA process comes in. The ARMA model combines the previous two models to make an even more accurate prediction: it compares the results of both models and makes a prediction based on them [3]. The issue with the ARMA process is that it assumes the time series is stationary, which means that the series does not take trend or seasonality into account. That is where ARIMA comes in handy. The I in ARIMA stands for integrated, which refers to the differencing of the previous observations in the series (subtracting an observation from a previous observation in the series to make the series stationary). A non-seasonal ARIMA model is often described as ARIMA(p, d, q), where p is the number of lags in the AR model, d is the degree of differencing (the number of times the data has had previous values subtracted) and q is the order of the MA model [8].
As the ARIMA models are a combination of models, ARIMA(1, 0, 0) is the same as an AR(1) model, since the integrated part and the moving average are not used. Lag is the delay in time steps between two points in a data set that are compared to each other. The order of the MA model is how many of the previous shocks the model takes into account when predicting; a shock is an external influence that changes the value to an extreme point, either high or low. In the same way, any ARIMA(p, d, q) where d is 0 is equal to an ARMA(p, q) model.
Lastly, we aim to investigate different machine learning techniques, as machine learning has become more and more prevalent in many areas in the last years. The idea of machine learning is not to instruct the program on how to explicitly solve a given task, but rather to give it a problem and let the computer solve it by using different patterns. One of the most prevalent solutions with regard to machine learning and time series forecasting is the usage of neural networks. A neural network is a network of nodes that are connected and communicate with each other to solve a task. The network has one or more hidden layers that contain "neurons". These are the information nodes that help calculate the result or function for the given problem. The result is not definitive, but rather an estimate, and its accuracy depends on the number of hidden "neurons" and layers in the network [9]. The more neurons and layers, the more calculations/operations and the more accurate the output. Machine learning is growing as the computational power of today's technology is steadily increasing, which in turn means that we can do even more complex and bigger computations at run time.
O1 Investigate the data set
O1.1 Analyze the properties and characteristics
O1.2 Verify what kind of predictions it is possible to do
O2 Investigate previous research and state of the art techniques
O3 Apply the theory and models to our use case (real data set)
O3.1 Either developing or customizing existing tools and algorithms
O4 Validation of adopted algorithm(s)
O4.1 Validate prediction accuracy
For objectives 1, 1.1 and 1.2, we have to investigate the data set and extract its properties and characteristics. This is crucial, as we need to know the properties of the data set in order to know which forecasting algorithms can be used for it. We need to know whether the data set is stationary or non-stationary, whether it follows a trend or a seasonality, etc.
Objective 2 requires us to investigate previous research to see what was done and how. We also gather data to see which types of algorithms are used and how they are applied to the investigated data sets. Depending on our findings, there might be limitations to our data set that make us favor some forecasting techniques over others. In this step, we exclusively explore the literature, as we need a lot of data to figure out which of these methods and algorithms are applicable to our data set.
Objectives 3 and 3.1: after performing the literature study, we will apply the identified techniques to our data set.
Objectives 4 and 4.1: lastly, we will validate our models by measuring the prediction accuracy of the chosen algorithms and methods, based on widely used metrics in this area, for both short-term and long-term forecasting.
1.3 Outline
The coming chapter will include the methodology and the scientific approach for this project. Chapter 3 will contain the related work and information gathered from the literature study. Chapter 4 will contain the information and analysis of the data set for the time series, such as its characteristics. Chapter 5 will contain a small overview of the implementation, the libraries used, the forecasting models, the forecasts and the results of our forecasts. Chapter 6 will contain the conclusion, discussion and future work for this project.
2 Method
In this chapter, we describe our scientific approach, how we intend to answer the
research questions, and we outline reliability and validity for the study.
attached libraries. We cannot really detail how this would affect our validity, but we can assume that the effect would be negligible.
The external validity is the validity of applying the conclusions of our scientific study outside the adopted context. This should be very high, as we are using commonly used methodology and procedures for this field. If we find a similar data set with similar properties, the same methods could be used for that data set. However, we did not explore this in this thesis.
3 Related work
We performed a literature review on related work. We investigated two of the most complete databases, the IEEE database and the ACM database. The search string we used was "time series forecasting", and we found the same recurring forecasting methods in a majority of the articles and reports. We did not specify "time series forecasting for a smart grid", as we wished to do a more general search and get a broader view of the time series forecasting field. We will use these data sources as the base for our methodology and implementation. We have decided to separate the methodology into three subgroups: machine learning, ARIMA, and other techniques.
findings suggest that ARIMA is heavily dependent on fitting the correct model. To get a good fit, we need good knowledge of the ACF and PACF values of the series. Their results were promising, with a very low prediction error, but the report stresses the importance of a correct evaluation of the Auto Correlation Function (ACF) and Partial Auto Correlation Function (PACF) values to fit a good model for the prediction. The ACF and PACF are methods to measure how the values in the data set are affected by previous values in the data set. The R software package was also used for the calculations and plotting.
With regard to our subject, similar tests and research were done in a hospital, where the authors did forecasting in their medium voltage power grid [16]. The authors discovered that their tests were slightly off due to the lack of information in their time series. Their data set only had 45 days of input, which means that the models only had a small selection of data to use for the prediction. The paper does, however, demonstrate the efficiency and validity of the Box-Jenkins method as a good way to fit a Seasonal ARIMA (SARIMA) model. A report from 2014 details the usage of a combination of ARIMA and Kalman filtering, which is a machine learning technique [17]. This hybrid was used to perform short-term forecasting of wind speed. The Kalman filter is a recursive algorithm that estimates the correct value of a prediction from incorrect or noisy observations. Since the weather can be random, they found the traditional time series methodology to be very inaccurate. The ARIMA model had a MAPE of 15.442%; however, the maximum average error on the fluctuating wind speed data was 50.227%. The Kalman filtering alone had an error margin of 5.80% and a maximum error of 11.26%. The combination of these two methods was surprisingly effective, with an error margin of 2.70% and a maximum error margin of 4.94%. The paper demonstrates the validity of combining ARIMA and the Kalman algorithm for wind speed forecasting; however, it still has some error due to the uncertainty of fluctuating winds.
Jianguang Deng and Panida Jirutitijaroen [18] published a study comparing load forecasting in Singapore between the ARIMA models and a multiplicative decomposition solution. Their series is similar to ours in terms of stationarity. Their tests showed that the multiplicative decomposition slightly outperformed the SARIMA in this case, due to the multiplicative decomposition model not being as affected by random shocks as the ARIMA models. Their main testing was done using Matlab.
A study from 2005 [19] suggests a hybrid between machine learning and traditional ARIMA modeling for financial time series forecasting. The authors combined ARMA with a Generalized Regression Neural Network (GRNN) to further increase the forecasting accuracy. Their data set was non-stationary, as it showed a clear falling trend. The models tested are the ARMA, the GRNN and their suggested ARMA-GRNN hybrid, which outperformed the other two individual forecasting models in terms of MAPE, MAE, and Root Mean Squared Error (RMSE).
Heping Liu and Jing Shi [20] compared ARMA models combined with GARCH models for forecasting the electricity price in New England. The GARCH models were introduced due to the volatility of the electricity price. Their results showed that the ARMA-GARCH-M method slightly outperformed the other five ARMA-GARCH models; however, their time series was somewhat limited. It only had two months of recorded data, which might be too small a sample to get a definitive answer.
A study from 2014 [21] compared the traditional ARIMA model, with and without intervention, for forecasting the cases of campylobacteriosis in New Zealand. An intervention is an external event or influence that drastically changes the data set. Their results showed that ARIMA, even with intervention, gave poor forecasts due to the intervention in their time series. The Holt-Winters method was the far better solution with regard to MAPE and MSE, and their results demonstrate the strength of the Holt-Winters method in predicting the coming steps even with a structural change in the time series. The Holt-Winters method is an exponential smoothing method, similar to MA models in that it puts weight on previously regressed values; however, the weights decrease exponentially. Holt-Winters is also known as triple exponential smoothing.
3.4 Discussion
From our literature review, we find that the most commonly used techniques for time series forecasting are different versions of the ARIMA models, such as SARIMA, machine learning techniques such as neural networks, and hybrid models. Many of the hybrid models combine ARIMA with machine learning techniques [10, 20]. These will be the starting models we consider for the forecasting part of our study. The common ground for the ARIMA models and the hybrid models using an ARIMA model is the methods used to fit the model. Multiple studies use the Box-Jenkins method both to check for stationarity and to fit the ARIMA parameters [8].
There seems to be no difference in model selection between long-term and short-term forecasting. The same models can be used for both [3], as the ARIMA model does not have an inherent issue that prevents long-term forecasting. There are also no issues with neural networks that prevent long-term forecasting; a neural network requires no modifications for it.
One of the major concerns we extract from the related work is that the selection of ARIMA model depends heavily on whether the data set is stationary or not [6]. A stationary data set means that the time series does not have a trend and is therefore not time-dependent. This affects the selection of applicable models, as the AR, MA and ARMA models are not applicable to a non-stationary data set. This means that we need to analyze the data set thoroughly before we can decide which type of ARIMA model is applicable in our study. In the next section, we will further explore this facet.
4 Data set analysis
In this section, we will further analyze the data set and its properties. This is important for selecting the appropriate models and methods for the forecasting part. Section 4.1 will cover the analysis of the data set, and section 4.2 will cover a stationarity analysis, investigating the autocorrelation function of the data set and the unit root test.
Date 00 01 02 03 04 05 06 07 08 09
2010-01-01 18754 18478 18177 18002 17897 18042 18441 18870 19061 19190
2010-01-02 18119 17786 17688 17762 17831 18049 18601 19210 19785 20398
2010-01-03 19388 19059 18920 18928 19020 19278 19722 20214 20574 21003
2010-01-04 18833 18488 18398 18407 18605 19367 20941 22571 23276 23362
2010-01-05 19642 19390 19375 19562 19925 20627 22336 23898 24664 24757
2010-01-06 21285 20884 20689 20636 20759 21010 21707 22455 22826 23291
2010-01-07 20817 20548 20491 20521 20792 21564 23112 24798 25428 25335
2010-01-08 21249 20911 20881 21082 21297 22009 23654 25048 25678 25697
2010-01-09 21731 21372 21218 21211 21291 21513 21999 22661 23174 23832
Table 4.1 shows a small sample of the data set. The values are the hourly power load in MW, and the column headers are the hour of recording. There are in total 2192 rows and 24 columns in the time series. From this sample, we can see lower values in the early hours, slowly increasing until 05:00 and then increasing faster. This is very likely the effect of industries, shops, and other infrastructure starting up their services. The load remains at roughly these values until 16:00, where it starts declining again. It is very likely that the reduction of power usage after 16:00 is due to industry and other power-demanding consumers closing for the day. The values then continue to drop and reach their lowest points at about 02:00-03:00, which is reasonable as most of the Swedish population is sleeping by that hour, and then the cycle repeats itself throughout the year. By further analyzing the data set, we can see that there is also a weekly seasonality, where Mondays have a higher power load than the other workdays. We then have a lower power load from Tuesday to Thursday, with little to no difference in the power load between the days in this interval. On Friday and through the weekend we have a decreasing power load, reaching the lowest point on Sunday, which is reasonable as we reach the end of the week and transition into the new week. The load then rapidly increases and reaches its highest point on Monday. As we have these stable cycles, we have a seasonality where the duration of the season is one week; hence we have a weekly seasonality as well as a yearly one.
In order for us to select appropriate algorithms or methods for forecasting the data set, we must first analyze the data set to examine its properties. From what we can tell from the overview, we have a clear seasonality, but we also see a minor declining trend in the peaks of the curve. The summer seasons seem to be roughly the same, but the winter seasons seem to decrease in their maximum power usage. This could be affected by many different external factors, such as milder winters or more effective/alternative heating sources. (The data set is available from ENTSO-E: https://www.entsoe.eu/data/power-stats/hourly_load/.)
$y_t = \rho y_{t-1} + u_t$ (1)

In equation 1, $y_t$ is the variable of interest, $t$ is the time index, $\rho$ is a coefficient, and $u_t$ is the error term. A unit root is present if $\rho = 1$; the model would be non-stationary in that case. After some rewriting of the formula, we get the regression model in equation 2:

$\Delta y_t = (\rho - 1) y_{t-1} + u_t = \delta y_{t-1} + u_t$ (2)

where $\Delta$ is the first difference operator. This model can be estimated and tested for a unit root, where $\delta = 0$ (with $\delta \equiv \rho - 1$). Since these tests are done on residual data instead of raw data, they do not follow the standard distribution for calculating the critical values, but instead follow a specific distribution. These are the critical values shown in table 4.2.
The Augmented Dickey-Fuller test [25] includes lagged terms in order to remove the autocorrelation from the results. There are three different versions of the Dickey-Fuller test: the normal unit root test, the unit root test with drift, and lastly a version for a unit root with a drift and a deterministic trend. We will use the normal one, as it is safer to use even if we should have a drift or a deterministic trend; wrongly including the drift and deterministic trend reduces the power of the root test [25]. The formula for the Dickey-Fuller test for a normal unit root is shown in the formula below.
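In its usual form, the augmented regression extends equation 2 with $k$ lagged difference terms:

$\Delta y_t = \delta y_{t-1} + \sum_{i=1}^{k} \beta_i \Delta y_{t-i} + u_t$ (3)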
the data set [16]. Figure 4.3 shows the ACF and PACF values for our data set.

Sample size        1%       5%
T = 25           -4.38    -3.60
T = 50           -4.15    -3.50
T = 100          -4.04    -3.45
T = 250          -3.99    -3.43
T = 500          -3.98    -3.42
T = 501+         -3.96    -3.41
Calculated value       -3.085

Table 4.2: The critical values for the Augmented Dickey-Fuller test for a series with trend, and our result from the root test.
We can see that there is a linear downward trend and also a seasonality, as the ACF values are cyclical. This shows that the data set has a trend, which means that the time series is non-stationary. This trend will have to be accounted for when fitting the forecasting models. To correctly analyze a root test, we have to compare our calculated value to the critical values of the Augmented Dickey-Fuller test. The critical values are nothing we calculate; rather, they are constants that have already been established. If our calculated value is lower than the 5% value, we can reject the null hypothesis with more than 95% certainty, and if the value is below the 1% value, we can reject the null hypothesis with more than 99% certainty. Looking at the critical values in table 4.2, we can see that our calculated value (-3.085) is not lower than the critical values, which means that we cannot reject the null hypothesis: the time series is non-stationary. For example, if our value were -3.50 and our data set had more than 500 inputs, we would be able to reject the null hypothesis with 95% certainty, as our calculated value would be lower than the 5% critical value; we would not be able to reject it with 99% certainty, as the value is not lower than the 1% critical value. If the calculated value is higher than the critical values, we cannot reject the null hypothesis at all, which means that we cannot prove stationarity.
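To illustrate, a minimal sketch of this test with statsmodels (the file layout and variable names are our assumptions, not from the report):

```python
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Hypothetical file layout: one row per date, one column per hour (as in table 4.1)
df = pd.read_csv("hourly_load.csv", index_col="Date")
load = df.stack()  # flatten to one observation per hour

# regression="ct" includes a constant and a trend, matching the
# "series with trend" critical values in table 4.2
stat, pvalue, usedlag, nobs, critical, icbest = adfuller(load, regression="ct")
print(f"ADF statistic: {stat:.3f}")  # compare against the critical values
print("Critical values:", critical)
```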
Figure 4.3: The ACF and PACF values for the series.
5 Forecasting
In this section, we will delve deeper into the actual forecasting of the data set and the fitting of existing models used in the literature. We will look at the difference between long-term and short-term forecasting. We will explain the different forecasting metrics adopted to evaluate our solution. Finally, we will explain how we trained the models for the time series forecasting.
$\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} |X_t - Y_t|$ (4)

$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} (X_t - Y_t)^2}$ (5)
Here $X_t$ is the predicted value and $Y_t$ is the actual value of the series.
We also calculate the percentage error of the forecasts, as a raw error value can be harder to evaluate and eventually compare with other results. Referring to equation 6, the commonly used metric for the percentage error is MAPE: the average of the absolute errors divided by the actual values.
$\mathrm{MAPE} = \frac{1}{N} \sum_{t=1}^{N} \left| \frac{X_t - Y_t}{Y_t} \right|$ (6)
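As a small illustration, the three metrics can be computed as follows (a sketch assuming NumPy arrays `predicted` and `actual` of equal length; the names are ours):

```python
import numpy as np

def mae(predicted, actual):
    return np.mean(np.abs(predicted - actual))          # equation (4)

def rmse(predicted, actual):
    return np.sqrt(np.mean((predicted - actual) ** 2))  # equation (5)

def mape(predicted, actual):
    return np.mean(np.abs((predicted - actual) / actual)) * 100  # equation (6), in %
```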
A seasonal ARIMA model is written as SARIMA(p, d, q)(P, D, Q, m), where p is the autoregressive order, d is the differencing order and q is the moving average order. The seasonal elements also have an autoregressive, difference and moving average order (P, D, Q), and m is the number of time steps for each season.
In order for us to get the trend orders, we first have to fit an ARIMA model to the series. According to George E. P. Box et al. [6], one of the most used methods to fit an ARMA/ARIMA model to a series is the Box-Jenkins method. The Box-Jenkins method requires us to calculate and plot the ACF (autocorrelation function) and PACF (partial autocorrelation function) of the series. The ACF is a measurement of how related the current value is to the previous values, including trend and seasonality. The PACF, unlike the ACF, finds the correlation between the residual values in the series, hence the name partial autocorrelation function. The ACF and PACF are commonly used to get a good overview of the time series, and they are required for the Box-Jenkins method, as we use them to find the best parameters for the SARIMA model fitting our data set.
Figure 4.3 shows the ACF and the PACF values for the time series. According to the Box-Jenkins method, we use the behaviour of the ACF and PACF to select the parameters for the ARIMA model. The method can be summarized by reading the ACF plot and matching its behaviour against the rules below. The bold text is the behaviour of the ACF plot and the following text is the suggested model/action according to the Box-Jenkins method.
• ACF has an exponential decay to zero : Autoregressive model (use the partial
autocorrelation plot to identify the order p). The p is the last notable value in the
PACF plot before they all become similar.
• ACF has one or more spikes, the rest is essentially zero : Moving average
model (order q identified by where autocorrelation plot becomes zero).
• ACF has an exponential decay starting after a few lags : Mixed autoregressive
and moving average model.
• ACF has high values at fixed intervals : Include seasonal autoregressive terms
Referring to figure 4.3, we can see two of the behaviours mentioned in the Box-Jenkins method. We have cycles with even intervals, which means that we have seasonality, and we have a very slow decline, which means that we have a trend; this further confirms that the time series is non-stationary. The number of lags in each cycle is 7, which means that we have a weekly seasonality, which is what we discovered earlier. To be able to find the parameters for the seasonal order, we have to difference the series. If the trend is linear, we can remove it with a first-order differencing and keep only the seasonal values. After differencing, we plot the new ACF values and apply the Box-Jenkins method again in order to fit the seasonal values for the SARIMA model. Differencing is performed by subtracting the previous observation from the current observation, as shown in equation 7:

$y'_t = y_t - y_{t-1}$ (7)
Taking the difference between consecutive steps is called lag-1 differencing; for data sets with a seasonal component, the lag may be expected to be the period of the season. Some temporal structure might remain after differencing, for example if the series contains a non-linear trend. Differencing may then be applied as many times as needed to force stationarity, and the number of applications is called the difference order.
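As a small illustration of this step (the file name is hypothetical, and the lag of 7 assumes a daily series; on the hourly series the weekly lag would be 168):

```python
import pandas as pd

load = pd.read_csv("daily_load.csv", index_col=0).squeeze()  # hypothetical daily series
first_diff = load.diff(1).dropna()   # equation (7): subtract the previous observation
weekly_diff = load.diff(7).dropna()  # lag-7 differencing for the weekly seasonality
```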
Figure 5.4: The ACF and PACF values for the differenced series.
Figure 5.4 shows the ACF and PACF values for the differenced series. We difference the series to force stationarity in the data set. If we get a conclusive ACF and PACF for the differenced series, we can then use the Box-Jenkins method to find the values for the seasonal parameters of the SARIMA model. However, the differenced plot is not conclusive either; we cannot really read the parameters off this plot. We tried multiple levels of differencing, but no model gave a good result. There was no plot where we could fit the model with the Box-Jenkins method, as they did not follow the required patterns, so we decided to try a different approach. Another way to fit the parameters of the SARIMA model is to do a grid search. A grid search is a trial-and-error method that tries out the different possible models and compares their error results. Instead of using the Box-Jenkins method, we went with a grid search to get the best-fitting parameters for the SARIMA forecasting models. The plan is to define a set of model combinations in order to test which forecast model has the lowest error with regard to our data set. To evaluate which model is the best fit, we make multiple one-step forecasts, compare them with the actual values, and calculate the RMSE. Figure 5.3 shows the procedure of the grid search. We start by splitting the data into test and training data. After the split, we take model parameters from our selection and fit the model to our training data with these parameters. We then make multiple one-step forecasts and compare the predicted values with the actual values. Once we have done this for a number of steps, we calculate the RMSE of the predictions and use this as the metric for evaluating the model. As there are many combinations to try, we speed up this process by having multiple threads doing these evaluations, and lastly we present the top 3 models with the lowest error. Since the top three models have the same autoregressive component as well as the same seasonal component, it is safe to say that the model seems like a good fit; the trend is the only part that changes between the models.
The grid search found the best model to be the SARIMA(2, 1, 1)(2, 0, 2, 7):
Model                  Trend                  RMSE
(2, 1, 1)(2, 0, 2, 7)  Constant and linear    834.30
(2, 1, 1)(2, 0, 2, 7)  Linear                 836.97
(2, 1, 1)(2, 0, 2, 7)  None                   837.12
The three variants differ only in the trend component: constant and linear, linear only, and no trend. The error difference between the three models is so low that we will have to test them all. We will hereafter refer to them as model 1 (constant and linear trend), model 2 (linear) and model 3 (no trend). The last seasonal parameter being 7 is not surprising, as we saw in the data set that there was a weekly seasonality.
Table 5.4: Results from the short-term forecasting with the SARIMA models
Figure 5.5: Graphs for the 5-day forecasts
Figure 5.6: Graphs for the 10-day forecasts
Figure 5.7: Graphs for the 1-month forecasts
Figure 5.8: The two-month forecast with model 1
We do run into problems when attempting long-term forecasting with these models. Figure 5.8 shows the two-month forecast with the first model. We can see that it does not follow the seasonality but rather keeps on increasing, and all three models follow the same pattern. This is because the last seasonal parameter of the SARIMA model defines the number of steps in the seasonality. Since this seasonality is yearly, the last parameter would be 365; however, we used a normal laptop for the experimentation, and with 8 GB of RAM it was not able to hold the amount of data required for these calculations. For longer forecasts, the common practice is to use monthly or quarterly data.
Table 5.5: Results from the mid to long-term forecasting with the SARIMA models
One interesting aspect of the results in table 5.5 appears when comparing them with the short-term results in table 5.4 for the 1-month forecast: the short-term model has higher accuracy than the long-term model. This is a clear indication that for our data set, mid-term forecasting seems to work best, as both the short-term and the long-term forecasts have lower accuracy than the mid-term forecasts. However, from 2 months and onward, the long-term model is far more accurate with regard to all three forecasting accuracy metrics. Since the long-term model can calculate the whole seasonality period, it follows the curve rather than having a steadily increasing trend.
an overall forecast error of 3.4% for their short-term forecasting. Gordon Reikard [14] showed an error of 0.4% in his forecasts with the SARIMA model. The data was stored hourly, and the focus of the report was a comparison of different forecasting models for short-term forecasting.
Our models performed with a higher error than expected: the overall error for our short-term forecasts was 7.1% for the first model, which had the best average error among our short-term forecasts. Our long-term forecasts seem more promising, as the average error for those forecasts is 6.3%. This is due to the model only having to account for one seasonality, instead of both the weekly and the yearly seasonality.
Figure 5.10: Example of the neural network model and its layers
Figure 5.11: 10 step forecast with the Neural Network
the network will analyze the data set, together with a weight method that tells the network how far off the target value it was. The weights are then measured, and the network uses this data to adapt in order to get a more accurate result.
• Reinforcement learning. This strategy includes having methods and algorithms that reinforce good results and punish bad results, which forces the network to learn over time.
With regard to our data set, we can use the supervised learning strategy, since our data set is already labeled. This means that we do not have to create weights or other artificial measurements to teach the network what to do; instead, we can just feed the network the entire training data.
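A small sketch of how the series can be turned into such labeled training pairs (the window length and names are illustrative, not from the report):

```python
import numpy as np

def make_supervised(series, window=24):
    """Turn a 1-D array into (input window, next value) training pairs."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i : i + window])  # the previous `window` observations
        y.append(series[i + window])      # the next observation, used as the label
    return np.array(X), np.array(y)
```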
Figure 5.12: 2 Months Neural Network forecast
5.4.3 LSTM (Long Short-Term Memory) Model
As the second method for testing machine learning solutions, we chose to set up an LSTM (Long Short-Term Memory) model, which is a recurrent neural network (RNN) [28]. Like many of the neural network ideas, it has been around since the '80s but did not really become popular until a few years ago. This model is a good candidate for forecasting and a good choice for testing the capabilities of recurrent neural networks. The benefit of an RNN is that it has an internal memory, which means that it can remember properties of the input it just received, making it easier to predict what is coming. The difference between a recurrent neural network and a normal "feed-forward" network is that the RNN, instead of only sending the data to the next layer, can keep the data and process it multiple times. Since it also has a memory, it keeps the output and sends it through the network again for further evaluation and analysis. The recurrent neural network is also trained with backward propagation, which means that the error signal travels backward through the layers, so that the derivatives that affected the output can be analyzed and the weights adjusted accordingly. In short, it means that the RNN tweaks the weights of the neurons while training. Figure 5.15 shows a basic example of a recurrent neural network.
The LSTM is an extension of the traditional recurrent neural network. It uses the same methodology, except that it has its own special layer. One major difference between a traditional neural network and an LSTM is that the LSTM has an extended memory and is well suited for processes that have long lags or where the learning is slow. The LSTM can be seen as an additional layer on the old recurrent neural network that allows the network to further store its memory, with access to read, write and delete the stored information. The LSTM will not store everything; instead, there is a "gatekeeper" that measures whether the data is worth storing or not. This is done by assigning a level of importance to the information in the form of weights, which the algorithm then learns. This means that the network learns over time which information is important.
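As an illustration only, a minimal LSTM forecaster in Keras could look as follows (layer sizes, window length and optimizer settings are our assumptions, not the exact configuration used in the report):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Inputs are supervised windows shaped (samples, timesteps, features);
# here 24 lag steps with one feature (the load value) per step.
model = Sequential([
    LSTM(100, input_shape=(24, 1)),  # recurrent layer with internal memory
    Dense(1),                        # one-step-ahead load prediction
])
model.compile(optimizer="sgd", loss="mse")
# model.fit(X_train.reshape(-1, 24, 1), y_train, epochs=50, batch_size=32)
```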
Figure 5.16: 2 Months LSTM model forecast
Model Metric 1 Day 5 Days 10 Days 1 Month 2 Months 6 Months 1 Year
NN MAE 1233.550 1193.600 1150.640 1080.518 1083.199 1033.472 1233.102
NN RMSE 1233.550 1424.965 1308.409 1439.743 1511.533 1432.712 1547.269
NN MAPE 7.505% 6.433% 5.919% 5.338% 5.401% 5.873% 7.991%
LSTM MAE 1309.884 1139.250 1040.869 1068.268 1067.785 1014.515 1092.685
LSTM RMSE 1309.884 1356.199 1237.330 1362.212 1422.452 1342.837 1405.800
LSTM MAPE 7.970% 6.149% 5.345% 5.206% 5.250% 5.825% 6.978%
5.5 Discussion
In the previous sections, we analyzed our data set for the properties needed to decide which models were eligible for forecasting. We implemented both the traditional SARIMA method and two of the state-of-the-art machine learning techniques, a neural network and an LSTM. After the analysis, we divided the training and testing data according to best practice [6] and used the training data for fitting the SARIMA model and for training the neural network and the LSTM model. The SARIMA models were fitted using a grid search, which we implemented after the Box-Jenkins method proved to be inapplicable to our data set: even after multiple levels of differencing, our ACF plots did not match a behaviour that fits the Box-Jenkins method. After the training and fitting of the models, we set certain time intervals for our models to forecast. We decided to have 1, 5, 10 and 30 days as our short-term forecasts and 2 months, 6 months and one year as our long-term forecasts. We predicted these intervals and measured the accuracy by comparing the predicted values to the real values.
For both our SARIMA models and the machine learning models, we found our results to be not as good as anticipated. The reports mentioned previously found greater success in their forecasting accuracy, often having a 0.5-3% error, compared to our LSTM implementation with 6.1%, our neural network with 6.3% and our SARIMA models with 7.1% for the short-term forecasts and 6.3% for the long-term forecasts.
5.6 Implementation
We programmed the experimentation and testing suite in Python, as Python has good documentation and multiple good libraries for math and statistical data handling. The libraries we used for the implementation are:
• NumPy. One of the most common math libraries; it is also really good for array handling and collections.
• Statsmodels. A great library with many forecasting models and other tools for statistical data handling.
• Pandas. This library has a lot of functionality for series, reading external csv files and general data handling; it is one of the more commonly used data analysis libraries.
• Matplotlib. It handles the plotting of data and has been used for all the plotted figures in the implementation.
• Keras. Keras is the base library used for the machine and deep learning methods.
To determine what kind of data set or series we are dealing with, we programmed a few tools in Python to plot the series, check the ACF and PACF, and run the root test on the series. This tests whether the series is stationary and helps find a fitting ARMA/ARIMA model for it. For the SARIMA model forecasting, we used the Statsmodels library, as it has an already implemented SARIMA method, and we fitted the model to our data set. NumPy and Pandas handled the series and converted it to a fitting format for the SARIMA method. We also used Matplotlib to plot the forecasted values and the actual values to get a graphical overview of the prediction.
For our machine learning implementations, we used the Python library Keras, which is a high-level neural network and deep learning library. The library was the base for the neural network, and by using it we set up a neural network and an LSTM model, which is a recurrent neural network implementation. Both of these models are created using the Keras "sequential" model, with one input layer of 100 neurons, then 4 hidden layers of neurons, and lastly a 1-node output layer. We used an optimizer called SGD, as the implementation required an optimizer, and the measurement for evaluating the data is the RMSE, as the optimizer could only use one error metric for optimization. We tried different error metrics such as the MAE and MAPE, but we found RMSE to be the best one with regard to our data set. The layers are so-called "dense" layers.
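For illustration, a sketch of a model matching this description (the hidden-layer widths, activation functions and input window are our assumptions):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(100, activation="relu", input_shape=(24,)),  # input layer, 100 neurons
    Dense(64, activation="relu"),                      # four hidden dense layers
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(32, activation="relu"),
    Dense(1),                                          # single output node
])
model.compile(optimizer="sgd", loss="mse")  # SGD optimizer; we evaluate with RMSE
```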
The implementation can be found at:
https://github.com/elaktomte/Degree-Project-Final.
6 Conclusion and future work
We analyzed our data set and experimented with the traditional SARIMA model, comparing it to a neural network and a long short-term memory model for time series forecasting. From the results gathered, in our case and for this data set, the LSTM model had the best prediction accuracy of the tested methods and models. Our results were not as good as anticipated, but from the results we gathered, we could see that the models we implemented worked best for mid-term forecasting (10-30 steps) on this data set. For future work, we should attempt to refine our models and explore other techniques for the fitting of models. We tried two techniques for fitting our models, but there are other methods worth exploring, such as the Ljung-Box test [6] or the Akaike information criterion (AIC) [6].
Due to the limited time for this degree project, we did not have time to test more than these three methods. There could have been greater accuracy in using different methods or extensions of the SARIMA model, or in finding a better method for fitting the model. The Holt-Winters method [21], also known as triple exponential smoothing, was not included in this report but is worth testing in future work. Also due to time constraints, we did not experiment as much with the machine learning models as we wanted; it could be possible to obtain better accuracy using different layers or models. Another future direction is combining the SARIMA and the machine learning models, as many of the reports combine these two methodologies to gain better accuracy than the individual models [23, 10, 19]. There is therefore a clear argument for combining models and creating hybrids, which is left for future work.
For future work, we shall also consider using models or extensions that enable fitting ARIMA models with multiple seasonalities. With regard to this, there is a method based on Fourier terms [29] that can be added to ARIMA models to handle multiple seasonalities, which should be tested on this data set.
References
[1] “Swedish power statistic,” https://www.svk.se/en/national-grid/the-control-room/, accessed: 2019-04-08.
[3] P. J. Brockwell and R. A. Davis, Introduction to Time Series and Forecasting, 3rd ed. Springer, 2016.
[4] J. W. Taylor, “Triple seasonal methods for short-term electricity demand forecasting,” European Journal of Operational Research, vol. 204, pp. 139–152, 2010.
[6] G. E. P. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time Series Analysis: Forecasting and Control, 5th ed. John Wiley & Sons, Incorporated, 2016.
[8] D. Asteriou and S. G. Hall, ARIMA Models and the Box–Jenkins Methodology, 2nd ed. Palgrave Macmillan, 2011.
[10] M. Khashei and M. Bijari, “A novel hybridization of artificial neural networks and ARIMA models for time series forecasting,” Expert Systems with Applications, vol. 39, pp. 4344–4357, 2012.
[11] B. Zhu and Y. Wei, “Carbon price forecasting with a novel hybrid ARIMA and least squares support vector machines methodology,” Omega, vol. 41, pp. 517–524, 2013.
[12] M. Khashei, M. Bijari, and S. R. Hejazi, “Combining seasonal ARIMA models with computational intelligence techniques for time series forecasting,” Soft Computing, vol. 16, pp. 1091–1105, 2012.
[13] F. Marandi and S. F. Ghaderi, “Time series forecasting and analysis of municipal solid waste generation in Tehran city,” in 2016 12th International Conference on Industrial Engineering (ICIE). IEEE, 2016.
[15] D. Sena and N. K. Nagwani, “Application of time series based prediction model to forecast per capita disposable income,” in IEEE International Advance Computing Conference (IACC). IEEE, 2015.
[16] H. Matsila and P. Bokoro, “Load forecasting using statistical time series model in a medium voltage distribution network,” in IECON 2018 – 44th Annual Conference of the IEEE Industrial Electronics Society. IEEE, 2018.
[17] Y. Tian, Q. Liu, Z. H., and Y. Liao, “Wind speed forecasting based on time series – adaptive Kalman filtering algorithm,” in 2014 IEEE Far East Forum on Nondestructive Evaluation/Testing. IEEE, 2014.
[18] J. Deng and P. Jirutitijaroen, “Short-term load forecasting using time series analysis: A case study for Singapore,” in 2010 IEEE Conference on Cybernetics and Intelligent Systems. IEEE, 2010.
[19] W.-M. Li, J.-W. Liu, J.-J. L., and X.-R. Wang, “The financial time series forecasting based on proposed ARMA-GRNN model,” in 2005 International Conference on Machine Learning and Cybernetics. IEEE, 2005.
[20] H. Liu and J. Shi, “Applying ARMA–GARCH approaches to forecasting short-term electricity prices,” Energy Economics, vol. 37, pp. 152–166, 2013.
[21] A. Al-Sakkaf and G. Jones, “Comparison of time series models for predicting campylobacteriosis risk in New Zealand,” Zoonoses and Public Health, vol. 61, pp. 167–174, 2014.
[23] G. P. Zhang, “Time series forecasting using a hybrid ARIMA and neural network model,” Neurocomputing, vol. 50, pp. 159–175, 2003.
[24] Z. Dong, D. Yang, T. Reindl, and W. M. Walsh, “Short-term solar irradiance forecasting using exponential smoothing state space model,” Energy, vol. 55, pp. 1104–1113, 2013.
[25] D. A. Dickey and W. A. Fuller, “Distribution of the estimators for autoregressive time series with a unit root,” Journal of the American Statistical Association, vol. 74, pp. 427–431, 1979.
[27] D. M. Pelt and J. A. Sethian, “A mixed-scale dense convolutional neural network for image analysis,” Proceedings of the National Academy of Sciences, vol. 115, no. 2, pp. 254–259, 2018.
[28] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, pp. 489–501, 2006.