ssrn-4170455
Series Prediction
Taha Buğra ÇELİK (a), Özgür İCAN (b), Elif BULUT (c)
(a) Research Assistant, Faculty of Economics and Administrative Sciences, Department of Business
Administration, Ondokuz Mayıs University, Samsun, Turkey. E-Mail: tahabugra.celik@omu.edu.tr
(b) Assistant Professor, Faculty of Economics and Administrative Sciences, Department of International
Trade and Logistics, Ondokuz Mayıs University, Samsun, Turkey. E-Mail: ozgur.ican@omu.edu.tr
(c) Assistant Professor, Faculty of Economics and Administrative Sciences, Department of Business
Administration, Ondokuz Mayıs University, Samsun, Turkey. E-Mail: elif@omu.edu.tr
Corresponding author: Taha Buğra ÇELİK. Phone number: (+90) 543 245 62 91.
Conflicts of Interest
Data Availability
The datasets used to support the findings of this study are available from the corresponding author
upon request.
Abstract
High accuracy is vital for stock market prediction. Recently, a considerable number of machine
learning techniques have been proposed that successfully predict stock market price direction. No
matter how successful the proposed prediction model is, it can be argued that two major drawbacks
hinder further increases in prediction accuracy. The first is that, because machine learning methods
have a black-box nature, the source of inference for the predictions cannot be explained. The second
is that, due to the complex characteristics of the predicted time series, no matter how sophisticated
the techniques employed, it is very difficult to achieve a marginal increase in accuracy that would
meaningfully offset the additional computational burden it brings. For these two reasons, instead of
chasing incremental accuracy increases, we propose utilizing an eXplainable Artificial Intelligence
(XAI) approach that can be employed to assess the reliability of the predictions, hence allowing the
decision maker to abstain from the poor decisions that are responsible for the decrease in overall
prediction performance. If there were a measure of how sure the prediction model is of any
prediction, the predictions with relatively higher reliability could be used to make a decision while
lower-quality predictions could be avoided. In this study, a novel two-stage stacking ensemble model
for stock market direction prediction based on machine learning (ML), empirical mode decomposition
(EMD) and XAI is proposed. Our experiments show that the proposed prediction model supported with
local interpretable model-agnostic explanations (LIME) achieved its highest accuracy of 0.9913 with
trusted predictions on the KOSPI dataset.
Keywords: Stock market prediction, machine learning, deep learning, empirical mode decomposition,
explainable machine learning, local interpretable model-agnostic explanations.
1. Introduction
Financial markets, in particular stock markets, allow investors and traders (practitioners who aim to
earn returns from short-term price movements) to earn capital gains if they make the right
decisions. However, price movements in stock markets are highly nonlinear, and it is difficult
to make the right decisions consistently. As a result of the rapid developments in machine learning and
deep learning in recent years, great progress has been made in the field of stock market prediction.
Since many new methods and techniques are applied simultaneously in the studies of the stock market
prediction literature, it is very difficult to classify these studies in terms of applied methods and
Another feature-focused approach is dimensionality reduction, using techniques such as variational
autoencoders and principal components to improve the computational efficiency of prediction models by
reducing the complexity of the feature set [9,8]. In addition, the use of technical analysis
indicators as inputs to the prediction model is quite common in the literature. Yang et al. [40]
combine technical analysis with group penalized logistic regressions to predict up and down trends of
stock prices. Patel et al. [28] represent ten technical indicators as trend deterministic data.
Nabipour et al. [23] use ten technical indicators as continuous data, then convert these indicators to
binary data and compare the results. Lee et al. [18] make predictions with an LSTM fed by technical
analysis indicators.
Besides dimensionality reduction, decomposing a complex time series into more manageable
sub-components simplifies the work of the prediction model. One of the most commonly used
decomposition approaches is empirical mode decomposition (EMD) [13]. EMD has been applied by
Xu and Tan [38] to decompose the stock price, with the sub-components predicted by a temporal
attention LSTM. Zhou et al. [45] introduce an EMD and factorization machine based neural network to
predict the stock market trend. Jin et al. [17] proposed sentiment analysis combined with LSTM and
EMD. A more advanced decomposition method developed as an alternative to EMD is variational mode
decomposition (VMD) [7]. The use of VMD along with deep learning and machine learning models
for stock market prediction has been shown to provide successful results [24,2,42]. Other
decomposition methods, such as singular spectrum analysis, empirical wavelet transform and
ensemble EMD, have also been employed along with prediction models [37,19,41].
In the studies mentioned above, in order to increase the prediction performance of a single prediction
model, additional methods and techniques are hybridized. Utilizing multiple prediction models
emerges as an alternative approach. Ensemble learning is a meta approach that combines multiple
prediction models in order to produce a better composite predictive model. Ensemble methods are
divided into three main categories called bagging, stacking, and boosting. Bagging, or bootstrap
aggregation, involves a diverse group of prediction models that are trained on different training
subsets generated by random sampling. The predictions made by the ensemble members are then
passed to a combination scheme (such as voting, averaging or any set of rules) to produce a final
prediction value [6]. The stacking approach, on the other hand, uses the outputs of a group of
prediction models as inputs to another prediction model in order to achieve higher prediction
accuracy [6]. Although multiple layers of models can be utilized, a two-level hierarchy is more
common. In boosting
ensemble models, training data is iteratively changed to focus on the misclassified instances in the
The number of proposed methods and techniques to increase the success of stock market prediction
models is plentiful and new ones are constantly being proposed. The findings of the mentioned studies
reveal that the prediction success of the proposed models is significantly high. Regardless of the
method used, it becomes more and more difficult, or even impossible, to exceed the prediction
accuracy rates claimed in these studies [43]. The high pattern recognition capacities of machine
learning methods can be used to discover patterns in time series. Reporting of close prediction
successes with different combinations of new or existing techniques indicates that it is increasingly
difficult to make further progress. In this case, rather than adopting new techniques that increase
the computational burden and complexity of interpretation, increasing the reliability of the
predictions made by existing techniques comes to the fore as an alternative. For example, an
approach that allows a prediction model to avoid the 20% of predictions in which it fails, rather
than striving to increase its accuracy from 80% to 85%, will indirectly raise overall prediction
accuracy to significantly higher levels by making only reliable or trustworthy predictions.
The only drawback of this approach is that it prevents predictions from being made for each period, as
unreliable predictions will be avoided. This is the price to be paid to be able to make predictions with a
very high accuracy rate. In the contemporary literature, we see that such efforts are examined under
the name of explainable artificial intelligence (XAI). Recently, two different approaches have come to
the fore within the scope of XAI. One of them is local interpretable model-agnostic explanations
(LIME) [31], which allows any model to be interpreted by describing each prediction on an instance-
by-instance basis. Another method is SHapley Additive Explanations (SHAP) [21], which assigns the
importance level of features to each prediction.
In this study, a two-stage stacking ensemble prediction model is first developed to predict the daily
stock market closing price direction. Instead of feature engineering, we prefer a decomposition
approach employing the EMD technique. The reason for preferring data decomposition is explained in
later sections through experimental results, which show that EMD clearly facilitates the operation of
the prediction model. The decomposed series, known as intrinsic mode functions (IMFs), are simply
sub-components of the original time series. In the first stage, each IMF is predicted with two
distinct ANN models. The first predicts each IMF's real value, that is, its quantitative value
(regression prediction), so it is called ANN regression (ANNR) for short. The second classifies
direction (upward or downward); since this ANN model is designed as a classifier, it is named ANNC.
For all IMFs, these two ANN models have been trained separately and the model objects have been
saved for predictions. The reason for employing two distinct ANN models (regression prediction and
classification) is that preliminary experiments showed ANNC to be superior for predicting certain
IMFs and ANNR for the rest. For this reason, the predictions of these two models are combined in the
first stage in order to exploit their relative strengths, and their combined predictions are fed to a
third prediction model in the second stage. The third model has been selected based on the
comparative performance of two algorithms, namely random forest (RF) and extreme gradient boosting
(XGBoost). This third prediction model is used as a classifier (upward and downward direction
prediction) to predict the direction of the original time series. To sum up, the overall architecture
is one of many possible stacking ensemble configurations, and our proposed ensemble model is named
EMD-ANN-RF.
In the second stage of the proposed prediction procedure, which constitutes the most important part of
the study, the LIME algorithm is integrated into the RF model. By giving the outputs of ANNR and
ANNC as inputs to the RF model, the upward or downward movement direction of six major stock
market indices is predicted. The main motivation of this study emerges at this stage. During
exploratory research, it was discovered that the RF model makes more successful predictions
for certain values of the inputs. For instance, the outputs of the ANNC model take values in the range
[0,1]. In fact, it has been observed that this prediction is quite successful if the ANNC output has a
In the XAI literature, there is an approach called LIME which, independent of the prediction model
being used, tries to explain each prediction instance by instance. LIME explains the predictions of
any classifier in an interpretable and faithful manner. It offers explanations by learning an
interpretable model locally around the prediction [31]. Thus, the probabilities of each class
prediction made by the RF model can be calculated with LIME. In the usual process, when any input
instance is given to the RF model, it predicts either 0 or 1. However, after the LIME algorithm is
implemented, a probability is assigned to each of the class labels 0 and 1. For example, suppose that
for some input instance a probability of 0.10 is obtained for class 0 and 0.90 for class 1. Such class
probabilities are obtained for each prediction in the test set. Here, the class probabilities are
utilized as the reliability level of the predictions. If one of the class probabilities is high enough
for the decision maker, then he/she trusts that prediction and makes a decision based on the outcome.
In the previous example, if 0.80 is high enough for the decision maker to trust a prediction, then
he/she decides in favor of class label 1, since 0.90 ≥ 0.80. Here, 0.80 is the reliability level of
the decision maker. On the other hand, if the decision maker set the reliability level to, say, 0.91,
then he/she would refrain from making any decision since 0.90 < 0.91; in other words, the reliability
condition is not satisfied. If the reliability level is set to 0.50, we simply obtain the predictions
made by the RF algorithm alone, without LIME. However, as the reliability level is increased, some
predictions will be avoided because they do not meet the reliability condition, but a higher
accuracy is expected for the trusted predictions. In other words, not all of the RF model's
predictions in the test set can be acted on, since those with low reliability levels are avoided. To
test this idea, the experiments report the resulting increase in the accuracy rate and the
accompanying decrease in the number of predictions (the trade-off between accuracy and number of
trusted predictions) for all reliability levels ranging from 50% to 100%. To the best of our
knowledge, applying the LIME algorithm to a prediction model in the way we propose here has not
previously been proposed in the relevant literature. The remainder of the paper is organized as
follows. Sections 2.1 to 2.5 describe the base methods and techniques used in the forecasting model.
In Section 2.5, the details of the proposed model are given. The experimental results and
evaluations are presented in Section 3, and the conclusion and future directions are given in
Section 4.
Here 𝜖 is a threshold value, usually close to 0.2. The IMFs admit well-behaved Hilbert
transforms. This decomposition method is adaptive, which means that it does not rely on a
predetermined, well-defined mathematical basis; the data itself dictates the decomposition,
which makes the method highly efficient. Since the decomposition is based on the local
characteristic time scale of the data, it is applicable to non-linear and non-stationary processes.
With the Hilbert transform, the IMFs yield instantaneous frequencies as functions of time that give
sharp identifications of embedded structures. The final presentation of the results is an
energy-frequency-time distribution, designated as the Hilbert spectrum. Classical non-linear system
models are used to illustrate the roles played by the non-linear and non-stationary effects in the
energy-frequency-time distribution. Examples including the Duffing equation, the Rössler equation,
and non-linear wind wave data illustrate this Hilbert view of non-linear and non-stationary systems.
The origins of artificial neural networks (ANN) go back to the computational model for neural
networks, called threshold logic, developed in [22]. However, the algorithm that forms the basis of
multi-layer neural networks and enables the training of artificial neural networks in their
current sense is the backpropagation algorithm. An ANN is a supervised machine learning algorithm
that learns the mapping between a set of inputs and a set of outputs. An ANN has three types of
layers, and each layer consists of nodes. The first layer is the input layer, and each of its nodes
corresponds to an input. The second is the hidden layer, which may itself comprise more than one
layer. The last layer is the output layer, which produces the output of the model for each input
instance. Depending on the design, nodes are connected to each other, but not all nodes are
necessarily connected. The model is then trained on a sample of data to capture the relationship
between inputs and outputs. In Figure 1, a representative ANN model is depicted.
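[Figure 1. A representative ANN model: successive layers with nodes indexed 1, 2, …, 𝑚; 1, 2, …, 𝑙;
1, 2, …, 𝑘; and 1, 2, …, 𝑗, ending in the output layer.]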
Both regression prediction (prediction of continuous numeric values) and classification (prediction of
class labels) can be performed with an ANN. In this study, both capabilities of artificial neural
networks are used. The regression version of the ANN is named ANNR and the classifier version ANNC.
Both the ANNR and ANNC models have two hidden layers with 100 nodes and use the ReLU activation
function. Two
Ribeiro et al. [31] proposed the LIME algorithm to provide explanations for individual
predictions, allowing some degree of reliability to be attached to the predictions of any classifier
or regressor. LIME provides interpretations for predictions locally. SHAP, on the other hand,
calculates for a given prediction the marginal contribution of a feature to the model by means of the
feature's Shapley value. The main approach of LIME thus differs from that of SHAP. LIME assesses
whether a model is locally faithful, regardless of the type of model, and examines how the model
behaves around a prediction. This property of the LIME algorithm is known as local fidelity [32]. The
explanation produced by LIME is obtained as follows [31]:
\mathrm{explanation}(x) = \underset{g \in G}{\arg\min} \; \mathcal{L}(f, g, \Pi_x(z)) + \Omega(g) \qquad (4)
Here 𝑥 represents an instance, and the interpretable representation of an instance is a binary vector
𝑥′ ∈ {0,1}^{𝑑′}. Let 𝐺 be a set of potentially interpretable models and 𝑔 ∈ 𝐺, where 𝑔 represents a
machine learning model. The domain of 𝑔 is {0,1}^{𝑑′}. The complexity of an interpretation of a model
is Ω(𝑔). For classification, 𝑓(𝑥) is the probability that 𝑥 belongs to a class. Π𝑥(𝑧) is a proximity
measure between an instance 𝑧 and 𝑥, which defines the locality around 𝑥.
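As a rough illustration of the first term in Eq. (4), the sketch below computes a locality-weighted squared loss of the kind used in the original LIME paper [31], in which perturbed samples closer to 𝑥 receive larger weights. The exponential kernel, the kernel width of 0.75 and all function names are assumptions made for illustration; they are not taken from this study.

```python
import math


def exponential_kernel(distance, width=0.75):
    """Proximity weight: 1 at distance 0, decaying as a sample moves away from x."""
    return math.exp(-(distance ** 2) / (width ** 2))


def weighted_local_loss(f_values, g_values, distances, width=0.75):
    """Sum of squared gaps between black-box outputs f(z) and surrogate
    outputs g(z'), each weighted by the proximity of z to x."""
    return sum(
        exponential_kernel(d, width) * (f - g) ** 2
        for f, g, d in zip(f_values, g_values, distances)
    )


# The same disagreement costs less when the perturbed sample is far from x.
near = weighted_local_loss([1.0], [0.0], [0.0])
far = weighted_local_loss([1.0], [0.0], [2.0])
print(near > far)  # True
```

A surrogate 𝑔 that minimizes this loss is forced to agree with 𝑓 mainly in the neighbourhood of 𝑥, which is what makes the explanation local.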
Two major prediction approaches exist in stock market price prediction literature. The first
approach is to predict actual price levels of time series (also known as regression prediction) and
compare the results with observed real values with respect to known evaluation metrics such as root
mean squared error (RMSE), mean squared error (MSE), mean absolute error (MAE) and mean
absolute percentage error (MAPE). Recently, directional prediction accuracy (also known as hit-rate)
for evaluating prediction performance is also becoming more common since making accurate
movement direction prediction is vital for successful stock market predictions. Accuracy is measured as
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5)
where 𝑇𝑃, 𝑇𝑁, 𝐹𝑃 and 𝐹𝑁 denote the numbers of true positive, true negative, false positive and
false negative predictions, respectively. Other performance evaluation metrics that are commonly
employed in the literature and in this paper are precision, recall, F1-score and area under the
receiver operating characteristic curve (ROC-AUC). These evaluation metrics are defined as follows:
\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (6)
\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (7)
F1\text{-score} = 2 \, \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (8)
\mathrm{ROC\ AUC} = \int_0^1 \frac{TP}{TP + FN} \, d\!\left(\frac{FP}{FP + TN}\right) \qquad (9)
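The count-based metrics in Eqs. (5) to (8) can be computed directly from the four confusion-matrix counts, as in the minimal sketch below. The function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (no positive predictions or no positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 40 correct upward, 45 correct downward,
# 5 false upward and 10 missed upward predictions.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics["accuracy"])  # 0.85
```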
The second approach in predicting stock market data is based on time series directional
prediction, also known as time series classification. In order to conduct stock market direction
classification, target data is commonly labeled as 1 and 0 standing for upward and downward
movements as respectively. Then the prediction procedure turns into a classification problem and a
Prediction results show that the accuracy of classification predictions for the first IMF, ℎ0(𝑡), is
significantly better than that of regression predictions. Since ℎ0(𝑡) is the hardest component to
predict by regression, an improvement in prediction accuracy for ℎ0(𝑡) may contribute to the
overall success of the prediction model for the original price series. Moreover, the regression
prediction results for the other IMFs are slightly better than the classification predictions.
Therefore, predicting all IMFs except ℎ0(𝑡) with a regression-based prediction model and predicting
ℎ0(𝑡) with a classification-based prediction model is considered appropriate for increasing overall
prediction accuracy. However, a problem arises at this stage. Prediction with the EMD-ANN model is
traditionally made by summing the predicted continuous-valued IMFs, so that the aggregated IMF
predictions represent the prediction of the original price series. The classification prediction
results for ℎ0(𝑡), on the other hand, are outputs of the sigmoid function 𝜎(𝑥) (continuous values
ranging between 0 and 1). The ANN classification model labels a prediction as 1 if 𝜎(𝑥) > 0.5 and as
0 if 𝜎(𝑥) ≤ 0.5. In this regard, summing the classification predictions of ℎ0(𝑡) with the regression
predictions of the remaining IMFs is inappropriate and meaningless. In order to overcome this
obstacle, a novel prediction experiment design is proposed.
D_t(h_0(t)) = \begin{cases} 0, & \text{if } h_0(t) - h_0(t-1) \le 0 \\ 1, & \text{if } h_0(t) - h_0(t-1) > 0 \end{cases} \qquad (10)
as output. Since ANNC is a binary classifier, its predictions, 𝐶𝑦,𝑡, are class labels, that is,
𝐶𝑦,𝑡 ∈ {0,1}, as implied by the classifier design of the artificial neural network. The activation
function of ANNC is a sigmoid function, also called a squashing function in machine learning
terminology and denoted by 𝜎𝑡(𝑥), which is a special form of the logistic function and ranges
between 0 and 1. ANNC classifies inputs based on the predicted values of 𝜎𝑡(𝑥) such that
C_{y,t} = \begin{cases} 0, & \text{if } \sigma_t(x) \le 0.5 \\ 1, & \text{if } \sigma_t(x) > 0.5 \end{cases} \qquad (11)
Exploratory experiments for the final prediction step have shown that predictions with a sigmoid
function output close to one or zero predict upward and downward movements of the original time
series, respectively, more successfully. On this basis, 𝜎𝑡(𝑥) is kept as the output value of
ANNC; in other words, no class label conversion is made. In addition, all IMFs are also
predicted by the ANNR model and summed in the conventional way to predict the original price
series for each time period 𝑡. Let 𝑦𝑡 be the prediction at time step 𝑡; the difference series,
denoted 𝑑𝑡(𝑦), where
d_t(y) = y_t - y_{t-1} \qquad (12)
is obtained for each day. Here 𝑑𝑡(𝑦) indicates the direction and magnitude of the ANNR prediction,
that is, the severity of the increase or decrease between successive predictions. The motivation
behind obtaining 𝑑𝑡(𝑦) is the same as for 𝜎𝑡(𝑥): as 𝑑𝑡(𝑦) diverges from zero, the accuracies of upward
and downward movement direction predictions are expected to increase. Stage 1 of the proposed model
ends here, and the outputs of this stage, 𝑑𝑡(𝑦) and 𝜎𝑡(𝑥), initiate stage 2. It is helpful to
summarize the procedure and operations performed in stage 1 in terms of prediction models, datasets
and input-output pairings. The ANNR and ANNC models are trained on the training set, and using these
two models, predictions are made on the validation set to obtain 𝑑𝑡(𝑦) and 𝜎𝑡(𝑥). At this point, stage
1 is complete and two models (ANNR and ANNC) are available that produce the inputs feeding the
prediction model of stage 2. Stage 2 starts on the validation set, where another machine learning
classification model (such as random forest (RF) or extreme gradient boosting (XGBoost)) is trained
with 𝑑𝑡(𝑦) and 𝜎𝑡(𝑥) as inputs, and the output is the movement direction of the original price series,
𝐷𝑡(𝑦), at time step 𝑡, where
D_t(y_t) = \begin{cases} 0, & \text{if } y_t - y_{t-1} \le 0 \\ 1, & \text{if } y_t - y_{t-1} > 0 \end{cases} \qquad (12)
Thus, all models needed to predict the movement direction of the original price series in the test
set are trained. In the final prediction step of stage 2, in order to test the aforementioned
hypotheses, 𝑑𝑡(𝑦) and 𝜎𝑡(𝑥) are obtained from the predictions of ANNR and ANNC, respectively, in the
test set. Finally, 𝑑𝑡(𝑦) and 𝜎𝑡(𝑥) are given to the random forest or XGBoost classifier to predict the
movement direction of the original price series for each time step 𝑡, and 𝐶𝑦,𝑡 ∈ {0,1} (the predicted
movement direction of the original price series at time step 𝑡) is obtained. The results are then
compared with the single EMD-ANNR prediction model in terms of accuracy, precision, recall, F1-score
and ROC-AUC. The entire prediction procedure is depicted in Figure 2. In a nutshell, in the final
version of the prediction model, the entire dataset is decomposed with EMD; then, using ANNR and ANNC
(ANNRC for short), the first prediction procedure of the ensemble model is completed. Random forest
or XGBoost is then employed for the second prediction procedure of the ensemble model, which is
called EMD-ANNRC-RF (since the random forest model is preferred for the final experiments).
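The hand-off from stage 1 to stage 2 can be sketched as follows: the summed ANNR predictions are differenced to give 𝑑𝑡(𝑦), paired with the ANNC sigmoid output for the same time step, and matched with a 0/1 direction target built from the observed series. This is a minimal sketch with hypothetical function names and made-up numbers; in particular, dropping the first sigmoid output so that both inputs cover the same time steps is an assumption about alignment, not a detail given in the paper.

```python
def difference_series(values):
    """d_t = y_t - y_{t-1} for t = 1, ..., n-1 (one observation is lost)."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


def direction_labels(values):
    """D_t = 1 if the series rose from t-1 to t, otherwise 0 (ties count as 0)."""
    return [1 if d > 0 else 0 for d in difference_series(values)]


def stage2_inputs(annr_predictions, annc_sigmoid_outputs):
    """Pair each d_t(y) with the sigmoid output for the same time step."""
    d = difference_series(annr_predictions)
    sigma = annc_sigmoid_outputs[1:]  # drop t = 0, which has no d_t
    if len(d) != len(sigma):
        raise ValueError("inputs must cover the same time steps")
    return list(zip(d, sigma))


# Made-up stage 1 outputs for three consecutive days.
rows = stage2_inputs([0.50, 0.53, 0.51], [0.60, 0.91, 0.12])
targets = direction_labels([10, 11, 9])
print(len(rows), targets)  # 2 [1, 0]
```

Each row in `rows` would then serve as one training or test instance for the stage 2 classifier, with the matching entry of `targets` as its label.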
[Figure 2. The proposed prediction procedure. Stage 1: ANNR1, ANNR2, …, ANNRK predict the values of
IMF1, IMF2, …, IMFK, which are summed (Σ); ANNC predicts the direction class of IMF0, 𝐶ℎ0(𝑡),𝑡.
Stage 2: RF predicts 𝐷𝑡(𝑦), the direction class of the scaled original series, giving 𝐶𝑦,𝑡; the LIME
algorithm is integrated into RF.]
High accuracy is vital for stock market prediction, and a prediction model with a high accuracy
rate allows more successful results to be obtained. However, in cases where the accuracy of the
prediction model cannot be increased further, another approach can be suggested to increase the
success of the predictions: testing the reliability of the results suggested by the prediction
model. For instance, if there were a measure of how sure the prediction model is about each
prediction, the predictions with a relatively higher probability of occurrence could be used to make
decisions, giving the decision maker a basis for trust. Conversely, the decision maker could refrain
from acting on unreliable predictions (i.e. those with a lower probability of occurrence). Thus, an
ultimate increase in the success of the predictions can be expected, since only the predictions with
a high level of reliability are used in decision-making. In the machine learning literature, this
approach is generally referred to as model explainability. Model explainability increases trust in a
machine learning model because it allows the model to be interpreted. There are two ways to interpret
a model: global and local [12]. Global interpretation explains the complete behavior of the model,
while local interpretation helps in understanding how the model makes a decision for a single
instance and explains individual predictions. LIME and SHAP are common algorithms for local
interpretation. In this study, the LIME algorithm is used for this purpose. Since the aim is to
explain each prediction made by the prediction model and to consult the associated measurements in
decision making, the local interpretability approach is employed for model explainability in this
study.
A two-stage ensemble prediction model is designed to increase the accuracy of the base model
(EMD-ANN), and the outputs of the first stage, 𝜎𝑡(𝑥) and 𝑑𝑡(𝑦), are given as inputs to another
prediction model in the second stage. This prediction procedure has two purposes. One is to increase
the prediction success of the base model, as can be seen from the results in Section 3.2, and the
other is to add another aspect to the prediction procedure using the LIME algorithm. Accordingly, as
the second contribution of this study, another novel prediction procedure is proposed using the LIME
algorithm. The importance of using 𝜎𝑡(𝑥), rather than the class labels, as the predictive output of
the ANNC model to predict 𝐷𝑡(𝑦𝑡), and similarly of using 𝑑𝑡(𝑦) rather than 𝑦𝑡 as the output of ANNR,
emerges at this stage. As mentioned in the previous section, a partial parallelism is observed
between the values of 𝜎𝑡(𝑥) and the values of 𝐷𝑡(𝑦𝑡) in the preliminary experiments. As the values of
𝜎𝑡(𝑥) approach 1 and 0, the accuracy of the predictions for the values 1 and 0 of 𝐷𝑡(𝑦𝑡) increases,
respectively. A similar increase in hit rates is observed for 𝑑𝑡(𝑦): as the absolute size of the
increase or decrease in 𝑑𝑡(𝑦) grows, the accuracy rates of the predictions made for the values 1 and
0 of 𝐷𝑡(𝑦𝑡) increase, respectively. The LIME algorithm is utilized to make these observed rules more
systematic and useful. Thus, the reliability of the results of the prediction model can be measured
for each value of 𝜎𝑡(𝑥) and 𝑑𝑡(𝑦), and only the predictions that meet the reliability condition at a
predetermined reliability level are used.
[Figure 3. LIME prediction probabilities for seven random prediction samples (Random Sample 1 to
Random Sample 7).]
By integrating the LIME algorithm into the prediction model, each prediction of the RF model in the
test set is explained. The LIME technique produces these explanations based on the behavior of the RF
model in the validation set. As a result, for each prediction of the RF model in the test set,
prediction probabilities, or explanation prediction probabilities (EPP), are calculated for each
class label. When any input instance is given to RF, since the prediction model is a binary
classifier, it produces either 0 or 1 as its prediction output. When the LIME algorithm is
integrated, however, a probability of occurrence can be calculated for each class label. For example,
assume that an input instance from the test set is given to the RF model. LIME then calculates
prediction probabilities such as those in Figure 3, which shows seven different random prediction
samples. For random sample 1, LIME calculates a prediction probability of 0.97 for class 0 (downward
prediction) and 0.03 for class 1 (upward prediction); the corresponding input values are given on the
right-hand side of the figure. This indicates that a downward movement will occur with probability
0.97 and an upward movement with probability 0.03. For random sample 2, the prediction probabilities
of classes 0 and 1 are 0.39 and 0.61, respectively. In this case, if the decision maker has set 0.70
as the reliability level beforehand, then he/she would trust the prediction in random sample 1, since
the downward prediction probability (0.97) is greater than or equal to the reliability level. In
random sample 2, on the other hand, since the prediction probabilities of both classes are less than
0.70, the decision maker would refrain from making a decision.
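The decision rule described above can be written as a short function: act on the most probable class only if its probability reaches the reliability level, and abstain otherwise. The sketch below is illustrative; the function name is hypothetical, and the class probabilities are passed in directly rather than computed with any LIME implementation.

```python
def trusted_decision(class_probabilities, reliability_level):
    """Return the class label whose probability meets the reliability level,
    or None to signal that the decision maker abstains."""
    best_label = max(
        range(len(class_probabilities)),
        key=lambda label: class_probabilities[label],
    )
    if class_probabilities[best_label] >= reliability_level:
        return best_label
    return None


# Random sample 1 and random sample 2 at a reliability level of 0.70.
print(trusted_decision([0.97, 0.03], 0.70))  # 0 (trusted downward prediction)
print(trusted_decision([0.39, 0.61], 0.70))  # None (abstain)
```

Returning `None` rather than a default class keeps abstentions distinguishable from predictions, so they can be excluded when accuracy is computed over trusted predictions only.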
where 𝑠(𝑦𝑡): scaled data set, 𝑚𝑖𝑛( 𝑦𝑡): minimum value of 𝑦𝑡 and, 𝑚𝑎𝑥( 𝑦𝑡): maximum value of 𝑦𝑡.
Data normalization is common for machine learning and deep learning tasks since prediction models
can work more efficiently with scaled data.
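The scaling equation itself is not shown in this excerpt, but the definitions of 𝑚𝑖𝑛(𝑦𝑡) and 𝑚𝑎𝑥(𝑦𝑡) suggest the standard min-max form, 𝑠(𝑦𝑡) = (𝑦𝑡 − 𝑚𝑖𝑛(𝑦𝑡)) / (𝑚𝑎𝑥(𝑦𝑡) − 𝑚𝑖𝑛(𝑦𝑡)). The sketch below implements that form under this assumption; the function name is hypothetical, and the paper may compute the minimum and maximum over a different subset of the data than the list passed here.

```python
def min_max_scale(values):
    """Map values linearly onto [0, 1] using their own minimum and maximum."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("cannot scale a constant series")
    return [(v - lowest) / (highest - lowest) for v in values]


print(min_max_scale([10, 15, 20]))  # [0.0, 0.5, 1.0]
```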
Table 2. Date ranges and sizes of the total data, training set, validation set and test set of the experiment data
Index     Dates                      Total data   Training set   Validation set   Test set
SP500     2012-01-03 ~ 2021-12-31    2517         1261           625              626
NI225     2012-01-04 ~ 2021-12-30    2445         1224           608              608
XU100     2012-01-02 ~ 2021-12-31    2512         1258           624              625
KOSPI     2012-01-02 ~ 2021-12-30    2460         1232           611              612
DAX       2012-01-02 ~ 2021-12-30    2530         1267           629              629
FTSE100   2012-01-04 ~ 2021-12-31    2529         1267           628              629
Experiments are carried out on six different stock market indices: SP500, NI225, XU100, KOSPI, DAX
and FTSE100. In order to train the models and predict the upward or downward movement direction of
each index on the next day, the entire data set is divided into training, validation and test sets of
50%, 25% and 25%, respectively. Since four-day lagged values of each IMF are used as inputs for
prediction in Stage 1, four days of data are lost. In addition, since the difference series 𝑑𝑡(𝑦) is
obtained from the predictions of ANNR, one more day of data is lost, so the training, validation and
test sets together contain five fewer days than the total data. The date ranges and the sizes of the
total data, training set, validation set and test set are given in Table 2.
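The five-day data loss follows directly from how the inputs are built, as the sketch below illustrates: four lagged values are needed before the first sample can be formed, and differencing removes one more observation. The function names are hypothetical, and the sketch does not attempt to reproduce the exact split boundaries behind Table 2.

```python
def lagged_samples(series, n_lags=4):
    """Pair each value with the n_lags values that immediately precede it."""
    return [
        (series[t - n_lags:t], series[t])
        for t in range(n_lags, len(series))
    ]


def usable_observations(total, n_lags=4, n_differences=1):
    """Observations left after losing n_lags days to lagging and one per differencing."""
    return total - n_lags - n_differences


print(lagged_samples([1, 2, 3, 4, 5, 6]))  # [([1, 2, 3, 4], 5), ([2, 3, 4, 5], 6)]
print(usable_observations(2517))           # 2512
```

For the SP500 row of Table 2, the 2517 total observations leave 2512 usable days, which matches the sum of the training, validation and test set sizes (1261 + 625 + 626).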
Since the number of market holidays differs from country to country, negligible differences occur
between the total lengths of the data sets and between their start and end dates. All of the
experimental data sets were downloaded from the tradingview website (https://tr.tradingview.com/).
As a final analysis, the findings obtained by applying the LIME algorithm to the EMD-ANN-RF
model are discussed below; the resulting model is referred to as the two-stage EMD-ANN-RF-LIME
from now on. Integrating the LIME algorithm into the prediction model allows the decision maker to
rely only on predictions that satisfy the reliability condition. Therefore, at this stage, the value of the
reliability level should be determined. Then, for the chosen reliability level, predictions are made over
the test period, and it is calculated for which days predictions are made and how accurate these
predictions are. It is also necessary to test whether the accuracy rate increases as the reliability level
increases. In Figure 5, the relation between reliability level (horizontal axis) and accuracy rate
(vertical axis) is plotted for the test sets of the six market indices. The reliability level varies between
0.5 and 1; a reliability level of 0.5 simply means working with the EMD-ANN-RF model, in other
words not using the LIME technique at all. By gradually increasing the reliability level towards 1, the
decision maker demands more trustworthy predictions from the model. As previously mentioned,
because of the trade-off between the number of trusted predictions and reliability, one cannot expect
to obtain trustworthy predictions for the whole prediction horizon. However, it is natural to expect
that increasing the reliability level also increases the accuracy rate, albeit with fewer predictions. For
all of the datasets, increasing the reliability level generally increases the hit rate, with small deviations.
The number of days for which trusted predictions can be made and the corresponding accuracy rates
are shown in Figure 6. It should be noted that only the days on which a prediction is made are
included in the calculation of hit rates. Therefore, if none of the EPP values calculated for the class
labels is above the determined reliability level, no prediction is made, and none of the TP, TN, FP
and FN counts is incremented for that day.
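The way abstained days drop out of the hit rate can be sketched as follows. This is a simplified illustration under stated assumptions: a single upward probability per day stands in for the EPP values of the two class labels, the function name `trusted_hit_rate` is hypothetical, and the six days of data are invented purely to show the mechanics.

```python
def trusted_hit_rate(probs_up, actual_up, reliability_level):
    """Hit rate over trusted predictions only, plus the number of trusted days.

    probs_up  : per-day probability of an upward move (stand-in for EPP)
    actual_up : per-day realised direction (True = up)
    """
    hits = 0
    trusted = 0
    for p_up, actual in zip(probs_up, actual_up):
        confidence = max(p_up, 1.0 - p_up)
        if confidence < reliability_level:
            continue  # abstain: this day adds nothing to TP, TN, FP or FN
        trusted += 1
        predicted_up = p_up >= 0.5
        hits += predicted_up == actual
    hit_rate = hits / trusted if trusted else None
    return hit_rate, trusted


# Hypothetical six-day test period.
probs_up = [0.97, 0.55, 0.10, 0.80, 0.40, 0.92]
actual_up = [True, False, False, False, True, True]

baseline = trusted_hit_rate(probs_up, actual_up, 0.50)  # every day is predicted
strict = trusted_hit_rate(probs_up, actual_up, 0.85)    # only confident days

print(baseline, strict)  # (0.5, 6) (1.0, 3)
```

In this toy example, raising the reliability level from 0.50 to 0.85 halves the number of predicted days while the hit rate on the remaining days rises, which is the trade-off discussed in the text.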
Figure 6. Number of trusted predictions (horizontal axis) and accuracy rates (vertical axis) trade-off in
test set.
Figure 7. Percentage of trusted predictions with respect to reliability level and accuracy rate.
For the six stock market indices, the proportion of trusted predictions in the test sets
(𝑡𝑟𝑢𝑠𝑡𝑒𝑑 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠 %) decreases at almost every step up in reliability level while the accuracy
rate increases, as can be seen in Figure 7. However, once the reliability level exceeds approximately
0.90, the roughly linear relationship starts to break down: 𝑡𝑟𝑢𝑠𝑡𝑒𝑑 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠 % decreases
drastically, while on average no comparable increase in the accuracy rate is observed.
The experimental results of the EMD-ANN-RF-LIME model are summarized in Table 4, which gives
the hit rates and the corresponding numbers of trusted predictions for reliability levels greater than or
equal to 50%, 85%, 95% and 100%. Notice that the number of trading days for Reliability ≥ 0.50
equals the total test period for all indices, because this setting is equivalent to not imposing any
reliability level; in other words, the entire test set is predicted, just as with the EMD-ANN-RF model.
The highest accuracy of EMD-ANN-RF-LIME, 0.9913, is obtained on the KOSPI data set, where 155
predictions are made for Reliability = 1. The lowest accuracy, 0.9079, is obtained on the FTSE100
index, with 76 predictions for Reliability = 1.
On another note, although the prediction accuracies of the proposed EMD-ANN-RF model on the six
datasets are close to each other, the accuracy rates begin to diverge as the reliability level increases
(see Figure 5 and Figure 6). Therefore, although the LIME algorithm fulfills its task as expected, the
EMD-ANN-RF-LIME model may need improvement in terms of robustness at higher reliability
levels.
Although EMD, as a decomposition approach, provides very good results in our experiments, there
are more recent data decomposition methods in the contemporary literature, such as ensemble
empirical mode decomposition (EEMD) and variational mode decomposition (VMD). Parallel
experiments were therefore also carried out with the VMD and EEMD techniques. According to our
findings, no significant difference in hit rates was observed between EMD and these techniques.
However, Niu et al. [24] have shown that VMD is more successful than EMD in stock market
direction prediction. We therefore conclude only that VMD does not make a difference in the
ensemble model proposed here. The reason may lie in the specific configuration of the proposed
ensemble model. Therefore, we think that there is still room for improving the obtained results by
employing VMD with different settings in future studies.
There is another important aspect to consider regarding data decomposition. In the literature,
including this paper, the entire dataset (training, validation and test) is decomposed beforehand. This
requires decomposing the validation and test sets along with the training set before employing any
machine learning technique. However, in practice, only the data observed up to the day to be
predicted can be decomposed before making the prediction. Therefore, the observed original price
data are decomposed first, and the next-day values of the resulting subcomponents are then predicted.
To be more precise, if the starting point of the original time series is denoted by 0, the values of
the subcomponents at time 𝑡 are predicted after the original time series consisting of the values of days
(0,1,…,𝑡 ‒ 2,𝑡 ‒ 1) is decomposed. Then, the time series consisting of the values of the days
(1,…,𝑡 ‒ 1,𝑡) is decomposed again, the values of the subcomponents at time 𝑡 + 1 are predicted, and
the procedure continues in this way. This prediction procedure is referred to simply as a sliding
window: decomposition and prediction are performed one after the other for each window. We
suggest that financial time series prediction models with data decomposition be developed with such
an experimental design, especially to guide practitioners.
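The sliding window procedure can be sketched as follows. To keep the example self-contained, the decomposition is a simple moving-average split into trend and residual, which is only a stand-in for EMD, and the per-component predictor is a naive "repeat the last value" rule standing in for the trained networks. All function names and the sample series are hypothetical.

```python
def moving_average_split(window, span=3):
    """Stand-in decomposition (NOT EMD): trailing moving-average trend
    plus the residual around it."""
    trend = []
    for i in range(len(window)):
        chunk = window[max(0, i - span + 1):i + 1]
        trend.append(sum(chunk) / len(chunk))
    residual = [y - t for y, t in zip(window, trend)]
    return [trend, residual]


def last_value(component):
    """Stand-in one-step predictor: repeat the last observed value."""
    return component[-1]


def sliding_window_forecast(series, window_size, decompose, predict_next):
    """For each day t, decompose only the window_size days observed before t
    and predict every subcomponent's value at t."""
    forecasts = {}
    for t in range(window_size, len(series) + 1):
        window = series[t - window_size:t]  # days t - window_size, ..., t - 1
        components = decompose(window)      # the window is re-decomposed each step
        forecasts[t] = [predict_next(c) for c in components]
    return forecasts


series = [10.0, 11.0, 10.5, 12.0, 12.5, 12.2, 13.0, 13.4]  # hypothetical prices
forecasts = sliding_window_forecast(series, 5, moving_average_split, last_value)
print(sorted(forecasts))  # [5, 6, 7, 8]
```

The key point of the design is that no observation from day 𝑡 or later enters the decomposition used to predict day 𝑡, so the test data cannot leak into the inputs.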
Although ANN performs sufficiently well in predicting time series, LSTM has been shown to be more
successful [36]. In our preliminary experiments, no significant difference in accuracy was observed
between ANN and LSTM. For future studies, we suggest that it might be useful to explore the LSTM
algorithm further with fine-tuned hyperparameters. The same applies to the XGBoost model. Studies
such as [43] have shown that the XGBoost model can produce more successful results than RF. RF
was adopted in the second stage of our ensemble model because it was observed to be slightly better
than XGBoost in terms of accuracy rates. Moreover, RF has fewer hyperparameters than XGBoost
and is hence more straightforward to use. On another
note, the possible reason for XGBoost’s inferior performance can be explained with the default
To the best of our knowledge, the LIME algorithm has not previously been used in the literature, in
the integrated manner suggested here, with machine learning techniques for direct daily direction
prediction. Therefore, integrating LIME in such a context is, in our opinion, an original contribution
to the literature. Because the LIME algorithm makes it possible to avoid the model's “unreliable”
predictions and to make relatively fewer but more reliable ones, our proposed EMD-ANNRC-RF-LIME
framework has proved to be distinctively successful.
A natural use of such a framework is as a decision-making tool for investors in capital markets,
where buy and sell decisions usually occur synchronously. For instance, in daily trading in a
multi-asset setting, the more time series are predicted in a given period, the fewer days there will be
on which every prediction is abstained from. This is because, when more than one time series is
predicted in the same period, predictions will be avoided for some assets while they are made for
others. Therefore, instead of predicting a single stock or stock market index, it is possible to predict
several stocks at the same time and rely on those that satisfy the reliability condition. Thus, in each
prediction period, directional predictions are made for the stocks that meet the reliability condition,
predictions that do not meet it are avoided, and the set of predicted stocks changes from period to
period. For further studies, we recommend designing an experiment in which the directions of a large
number of assets are predicted in the same period, as described above.
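The multi-asset argument can be illustrated with a toy calculation. The sketch below counts the days on which no asset meets the reliability condition, first for a single asset and then for three assets predicted together. The asset names "A", "B" and "C", the probabilities and the function names are all hypothetical; this is not an experiment from the study.

```python
def is_trusted(p_up, reliability_level):
    """A prediction is trusted if either class probability reaches the level."""
    return max(p_up, 1.0 - p_up) >= reliability_level


def abstained_days(daily_probs, reliability_level):
    """Count the days on which no asset yields a trusted prediction."""
    return sum(
        1
        for day in daily_probs
        if not any(is_trusted(p, reliability_level) for p in day.values())
    )


# Hypothetical upward probabilities for three assets over four days.
days = [
    {"A": 0.95, "B": 0.60, "C": 0.52},
    {"A": 0.55, "B": 0.08, "C": 0.70},
    {"A": 0.62, "B": 0.57, "C": 0.48},
    {"A": 0.51, "B": 0.66, "C": 0.97},
]

single_asset = abstained_days([{"A": d["A"]} for d in days], 0.90)
multi_asset = abstained_days(days, 0.90)

print(single_asset, multi_asset)  # 3 1
```

With asset "A" alone, three of the four days pass without a trusted prediction; with all three assets, only one day does.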
References
[1] Ampomah, E. K., Qin, Z., & Nyame, G. (2020). Evaluation of tree-based ensemble machine
learning models in predicting stock price direction of movement. Information, 11(6), 332.
[2] Bisoi, R., Dash, P. K., & Parida, A. K. (2019). Hybrid variational mode decomposition and
evolutionary robust kernel extreme learning machine for stock price and movement prediction on daily
basis. Applied Soft Computing, 74, 652-678.
[3] Börjesson, L., & Singull, M. (2020). Forecasting financial time series through causal and dilated
convolutional neural networks. Entropy, 22(10), 1094.
[4] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp.
785–794). New York, NY, USA: ACM. https://doi.org/10.1145/2939672.2939785
[5] Chollet, F., & others. (2015). Keras. GitHub. Retrieved from https://github.com/fchollet/keras
[6] Dash, R., Samal, S., Dash, R., & Rautray, R. (2019). An integrated TOPSIS crow search based
classifier ensemble: In application to stock index price movement prediction. Applied Soft Computing,
85, 105784.
[7] Dragomiretskiy, K., & Zosso, D. (2013). Variational mode decomposition. IEEE Transactions on
Signal Processing, 62(3), 531-544.
[8] Ghorbani, M., & Chong, E. K. (2020). Stock price prediction using principal components. PLoS
ONE, 15(3), e0230124.
[9] Gunduz, H. (2021). An efficient stock market prediction model using hybrid feature reduction
method based on variational autoencoders and recursive feature elimination. Financial Innovation,
7(1), 1-24.
[10] Hao, Y., & Gao, Q. (2020). Predicting the trend of stock market index using the hybrid neural
network based on multiple time scale feature learning. Applied Sciences, 10(11), 3961.
[11] Harris, C. R., Millman, K. J., van der Walt, S. J., et al. (2020). Array programming with NumPy.
Nature, 585, 357–362. https://doi.org/10.1038/s41586-020-2649-2
[12] Heuillet, A., Couthouis, F., & Díaz-Rodríguez, N. (2021). Explainability in deep reinforcement
learning. Knowledge-Based Systems, 214, 106685.
[13] Huang, N. E., Shen, Z., Long, S. R., Wu, M. C., Shih, H. H., Zheng, Q., ... & Liu, H. H. (1998).
The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time