Stock Prediction
https://doi.org/10.1007/s12559-023-10125-8
Received: 15 September 2022 / Accepted: 9 February 2023 / Published online: 9 March 2023
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023
Abstract
Stock trending prediction is a challenging task due to its dynamic and nonlinear characteristics. With the development of social platforms and artificial intelligence (AI), incorporating timely news and social media information into stock trending models has become possible. However, most existing works focus on classification or regression problems when predicting stock market trends without fully considering the effects of different influence factors in different phases. To address this gap, this research solves the stock trending prediction problem using both technical indicators and sentiments of social media text as influence factors in different situations. A 3-phase hybrid model is proposed in which daily sentiment values and technical indicators are considered when predicting the trends of the stocks. The proposed method leverages both traditional learning and deep learning methods as the core predictors in different phases. Accuracy and F1-score are used to evaluate the performance of the proposed method. Incorporating the technical indicators and social media sentiments, the performance of the proposed method with different learning-based methods as core predictors is analyzed and compared in different situations. Specifically, multi-layer perceptron (MLP), naïve Bayes (NB), decision tree (DT), logistic regression (LR), random forest (RF), extreme gradient boosting (XGBoost), long short-term memory (LSTM), and convolutional neural networks (CNN) are leveraged as the core learning predictor module, with different combinations of the degree of involvement of technical and sentiment information. The results demonstrate the effectiveness of the proposed method, with an accuracy of 73.41% and an F1-score of 84.19%. The results also show that various learning-based methods perform differently for the prediction of different stocks. This research not only demonstrates the merits of the proposed method but also shows that integrating social opinions with technical indicators is a promising direction for enhancing the performance of learning-based stock market trending analysis methods.
Keywords Stock market trending · Social media sentiment analysis · Machine learning · Deep learning · Technical
indicators
Cognitive Computation (2023) 15:1092–1102 1093
stock market predictions. More and more machine learning and deep learning methods, such as support vector machine (SVM), artificial neural network (ANN), long short-term memory networks (LSTM), and their fusion models, have been applied to stock market predictions [6–10].

Inspired by behavioral finance, researchers began to add information that can reflect investors' behavior to the stock forecasting model. Bollen et al. [1] used an emotion tracking tool to analyze the content of tweets and used the generated emotion time series to predict the change rate of the Dow Jones Industrial Average. After that, many researchers began to use tools that can reflect or influence the market to study the stock market based on the emotional and psychological information of participants. Furthermore, with the rise of social networks, a huge amount of data is being generated every day, and using these data to enhance prediction performance is growing in popularity [11–20].

Unlike most existing research focusing on simple classification or regression tasks for the stock trending problem, this paper not only solves the stock trending problem as a multi-class classification task to predict the trending of the stock price, but also utilizes both technical indicators and sentiment values from social platforms as influence factors. Related work in this direction mostly focuses on using only technical indicators [21–23] or sentiment values [24], while this paper leverages both of them for the stock trending prediction task. Furthermore, compared with the existing work [25], which also leveraged both technical indicators and sentiment values, our paper explores the effectiveness of various learning-based algorithms when applied to different stocks, which illustrates the differences between learning-based algorithms for real-world stock market trending.

In this research work, we propose a hybrid learning-based model to predict the stock's trend. The hybrid model is an integration of learning-based algorithms such as ANN with technical indicators and social media sentiment analysis. The results show that the performance can be improved when relevant technical indicators and social sentiment are considered.

The contributions of this paper are summarized as follows:

1. Unlike most existing research focusing on simple classification or regression tasks for the stock trending problem, this paper not only solves the stock trending problem as a multi-class classification task to predict the trending of the stock price, but also utilizes both technical indicators and sentiment values from social platforms as influence factors. It is more fine-grained and more suitable for real stock market analysis.
2. This paper proposes a hybrid learning-based model which utilizes a three-stage method to determine the final trend prediction based on two intermediate predictions. Different learning-based models are leveraged and compared, with different combinations of technical and sentiment information.
3. Extensive experiments are conducted on one stock index, the Dow Jones Industrial Average (DJIA), and five well-known stocks, Google (GOOG), Amazon (AMZN), Apple (AAPL), eBay (EBAY), and Citigroup (C), using eight learning-based algorithms. The proposed model outperforms the baseline models for predicting the stock's trend, which demonstrates the effectiveness of utilizing technical indicators and social sentiment. The results also illustrate the differences between learning-based algorithms for the prediction of real-world stock market trends.

The rest of the paper is organized as follows: the "Related Work" section discusses related work on stock prediction; the "The Proposed Methodology" section presents the proposed stock trending prediction methodology; the "Experiment Results and Discussion" section describes the parameter settings in the experiments and discusses the results of the experiments; finally, the "Conclusion, Limitations, and Future Works" section offers concluding remarks and outlines future work.

Related Work

There are many internal and external factors influencing the stock price in the stock market, and stock price volatility is affected not only by macro monetary policy but also by the macro-economic environment and emergencies. According to the different mechanisms of stock price prediction, the related work is reviewed under two aspects: learning-based methods for stock forecasting, and stock trending analysis with social media sentiment analysis.

Learning-Based Method for Stock Forecasting

Compared with traditional algorithms, a machine learning algorithm is capable of processing large amounts of data and multi-dimensional data. Due to the better prediction performance, more and more researchers have applied machine learning algorithms to stock market trending analysis and prediction.

Stock forecasting problems can be solved as regression problems to predict the values of specific stocks [15]. In recent years, with the development of deep learning technology, many stock forecasting models based on deep learning have been proposed, and promising results have been obtained [26–30]. However, more and more research works treat stock forecasting problems as trending classification tasks, as described below.
As learning-based methods, SVM, ANN, and Naïve Bayes (NB) are widely applied in the field of financial forecasting [31–33]. SVM is known for capacity control of the decision function, the use of kernel functions, and sparsity of solutions [34]. It has been applied to stock market analysis and has been verified to be effective when compared with other algorithms, such as the random walk model (RW), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and Elman backpropagation neural networks (EBNN) (Huang et al. [31]). It has been used for stock market daily price prediction [35, 36] and Producer Price Index (PPI) prediction [35]. Although the feasibility has been proven, the research also pointed out the limitations of solving such a problem as a regression task [35].

A neural network is known to have the capability for pattern recognition [37]. Naeini et al. [32] compared a feed-forward multi-layer perceptron (MLP) and an Elman recurrent network with linear regression. Their experiment showed that linear regression was comparatively better at predicting the direction of changes on the next day, whereas MLP displayed the lowest error in predicting the amount of value changed. This implied that neural networks adapted well to the dynamic nature of the stock market by providing the lowest error rate. From the perspective of the relationship between stock technical indicators and the stock market, Göçken et al. [38] used the harmony search algorithm and the genetic algorithm to select the most relevant technical indicators and applied them to ANN for stock price prediction. The experimental results showed that the mean absolute percentage error of the ANN model based on harmony search and genetic algorithm is 3.38% and 3.36%, respectively, which is better than the model using only the ANN algorithm.

The NB-based prediction method is a type of supervised learning method that learns from historic records or expert knowledge and utilizes probabilistic approaches to find an optimal solution [39]. Huang and Chang [33] utilized a set of independent data collected randomly from the Taiwan Stock Exchange Corporation (TSEC), and 9 attributes were used to build the NB predictor. Their result showed successful prediction, with a probability of 13.46% of making a loss. This suggests that an NB-based predictor can obtain good results for stock market prediction.

Besides the traditional machine learning methods mentioned above, ensemble learning (EL) methods have been used to forecast future trends of stock price movements [40, 41]. Random forest (RF) can overcome overfitting problems by training multiple decision trees on different subspaces of the features at the cost of slightly increased bias. Previous experiments indicated that RF resulted in a high accuracy rate for all periods, and the longer the trading period, the higher the accuracy rate [40].

Extreme gradient boosting (XGBoost) was proposed by Chen and Guestrin [41]. It has been shown that XGBoost has the characteristics of low computational complexity, fast running speed, and high accuracy. For the analysis of time series data, although the gradient boosting decision tree (GBDT) can effectively improve the stock prediction results, its relatively slow detection rate limits the method. In order to find a fast and highly accurate prediction method, an XGBoost model has been used for stock prediction, which can improve the prediction accuracy as well as the prediction speed.

With the success of learning-based methods for stock trending analysis, some researchers further enhanced the performance of the methods by considering the influence of social events (e.g., social media news sentiment analysis).

Stock Trending Analysis with Social Media Sentiment Analysis

Social media sentiment analysis is a popular research area in the natural language understanding (NLU) domain that identifies and categorizes opinions expressed in news, articles, tweets, or other text [42–44]. In the field of stock market prediction, it is often used as an indicator of the public sentiment towards events and scenarios. A very popular method is to use the sentiment value as an external factor and feed it as an input that will affect the final prediction [11–18, 25, 45].

Bharathi and Geetha [11] aimed to present the impact of really simple syndication (RSS) feeds on stock market values. Their approach utilizes the sentiment analysis result as an external factor that is used together with the Sensex moving average results to produce a final prediction of the trend. Ichinose and Shimada [12] proposed a system that utilized a bag of keywords from expert articles (BoK-E) to predict the trend of the next day. In the experiment conducted, it was reported that the average accuracy obtained using BoK-E was 61.8%, which is a 9.5% increase in accuracy compared to using the standard bag-of-words approach. Zhang et al. [13] utilized the correlation of events from web news, public sentiments from social media, and stock movement to determine the next-day trend. The proposed coupled stock correlation (CMT) method (accuracy of 62.50%) performs better than models without stock correlation information (accuracy of 60.25%).

In addition, Si et al. [14] proposed the use of a semantic stock network (SSN) to model the relationship between stocks. It showed that SSN has a higher capability than the correlation stock network (CSN) to predict the stock market. Xing et al. [46] designed a framework which captured the bi-directional interaction between movements of asset price and market sentiment for stock return fluctuation prediction. Picasso et al. [25] combined both
technical and fundamental analysis using machine learning techniques for the stock market prediction problem. A high-frequency trading simulation with over 80% of annualized return was conducted to exploit the prediction results. Merello et al. [47] presented a transfer learning method to estimate the amount of price change and the best performing assets, in which price fluctuations of different magnitude are treated differently through the application of different weights on samples.

Using Twitter data for sentiment analysis is also growing in popularity [17]. Li et al. [17] suggested that their approach of using Twitter data for stock market prediction achieved better performance when using the tweet sentiment values to predict the stock price 3 days later. Gupta and Chen [48] analyzed the StockTwits tweet contents and extracted financial sentiment using a set of text featurization and machine learning algorithms. The correlation between the aggregated daily sentiment and daily stock price movement was then studied, and the effectiveness of the proposed work on stock price prediction was demonstrated through experiments on five companies (Apple, Amazon, General Electric, Microsoft, and Target). In addition, Google Trends data was used to provide the search volume for keywords searched so that the model could determine the impact of events that might affect the stock market. Hu et al. [18] considered the use of Google Trends data in improving the performance of stock market prediction. According to the experimental results, Google Trends was capable of enhancing the accuracy in predicting the trend of the stock market.

Besides public sentiment, Khan et al. [49] also explored the effect of the political situation on stock prediction accuracy, and the experimental results showed that the sentiment feature improved the prediction accuracy of machine learning algorithms by 0–3%, while the political situation feature improved the prediction accuracy of the algorithms by about 20%.

Unlike the work mentioned above, this paper aims to solve the stock trending problem as a three-class classification task to predict the trending of the stock price, i.e., Buy or Rise (1), Sell or Drop (−1), and Hold (0). In addition, related work in this direction mostly focuses on using only technical indicators [21–23] or sentiment values [24], while this paper leverages both of them for the stock trending prediction task. Our paper explores the effectiveness of various learning-based algorithms when applied to different stocks, which illustrates the differences between learning-based algorithms for real-world stock market trending. Not only the stock index, the Dow Jones Industrial Average (DJIA), but also individual well-known stocks, Google (GOOG), Amazon (AMZN), Apple (AAPL), eBay (EBAY), and Citigroup (C), are analyzed using different enhancement technologies. Our research work demonstrates the merits of the proposed method and points out a direction for future work in this area.

The Proposed Methodology

This research aims to leverage stock market time series data to investigate the performance of different learning-based methods by incorporating technical indicators and social media sentiment analysis.

The proposed methodology consists of 3 phases, each with multiple steps. These are described in the following subsections.

Problem Formulation

In this research, the stock market forecasting problem is treated as a three-class classification task. The goal is to predict the stock market trend: Up, Hold, and Down (i.e., 1, 0, −1). Specifically, given the stock price sequences $X_T$ $(T = 1, 2, 3, \cdots, n)$, generated technical indicator features $TI_T$ $(T = 1, 2, 3, \cdots, n)$, and sentiment features $S_T$ $(T = 1, 2, 3, \cdots, n)$, the task is to predict the stock price trend of the next day, $Y_{T+1}$. It can be formulated as Eq. (1):

$$Y_{T+1} = F(X_T, X_{T-1}, X_{T-2}, \cdots, X_{T-K}, TI_T, TI_{T-1}, TI_{T-2}, \cdots, TI_{T-K}, S_T, S_{T-1}, S_{T-2}, \cdots, S_{T-K}) \tag{1}$$

where $F()$ represents the mapping function from input to output and $K$ represents the size of the sliding window.

Stock Data Pre-processing

The stock index and various stocks from the S&P 500 were identified and retrieved from Yahoo Finance (https://sg.finance.yahoo.com). The period for data extraction was between 1st Jan 2014 and 31st Dec 2018. Entries in the data include:

– Date: index of each record
– Open: price of stock at opening of trading (in USD)
– High: highest price of stock during trading day (in USD)
– Low: lowest price of stock during trading day (in USD)
– Close: price of stock at closing of trading (in USD)
– Volume: number of shares traded
– Adjusted Close: price of stock at closing adjusted with dividends (in USD)

Learning-based methods can be leveraged to analyze all the time series datasets, such as Open, High, Low, Close, and Adjusted Close for the stock market data. In this paper, we
illustrate the results of analyzing the Adjusted Close time series for the purpose of comparing different prediction methods.

All available stock market data, which were daily data, were downloaded for analysis. The trend is labeled Buy or Rise (1) when the percentage change is above +1% and Sell or Drop (−1) when the percentage change is below −1%; otherwise it is labeled Hold (0). The various learning-based methods were applied to different stocks and the results were compared.

Five stocks from the S&P 500 index were selected for the experiment, namely GOOG, AMZN, AAPL, EBAY, and C.

Stock Trending Analysis by Incorporating Technical Indicators and Social Media Sentiment Analysis

The proposed stock trending analysis incorporating technical indicators and social media sentiment is presented and explained in detail in this section. The proposed methodology consists of 3 phases, each with multiple steps. The details of the methodology are illustrated in Fig. 1.

Phase 1: In phase 1, the 1st intermediate prediction is obtained using learning-based algorithms with technical indicators. The steps in this phase are as follows:

– Step 1: The model first retrieves the data either manually or automatically by using a crawler written in Python.
– Step 2: The dataset then undergoes pre-processing to ensure it is ready to be fed into the learning-based algorithms. In addition, technical indicators are considered, and they can be added as part of the input dimensions. The technical indicators can be calculated using the Python library TA with the stock's Open, High, Low, Close, and Volume values as inputs.
– Step 3: The dataset is fed into the model as input and the learning-based algorithms are used to perform an intermediate prediction of the trend of the next day.

Phase 2: In phase 2, the model generates a 2nd intermediate prediction. This intermediate prediction is the daily sentiment value derived from news headlines. The steps in this phase are as follows:

– Step 1: News items related to the stocks are retrieved from online media sources such as the New York Times. Duplicate rows and redundant information within the news are removed.
– Step 2: In this pre-processing stage, duplicated news is first removed, and redundant punctuation, special characters, and short words (less than 2 characters long) are then removed. Next, the New York Times news undergoes tokenization and stemming, and lastly the stemmed tokens are joined back to form a stemmed sentence.
– Step 3: The pre-processed dataset then undergoes sentiment analysis to determine the daily sentiment value (polarity score). The compound scores are calculated using SenticNet [50], a cognitive-inspired framework for sentiment analysis. To derive the daily sentiment value for the news, the compound scores (normalized, weighted composite scores) of all news items within the same day are summed up and divided by the total number of news items generated on that day.
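The labelling rule from the pre-processing step and the daily averaging in Phase 2, Step 3 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `label_trend` and `daily_sentiment` are hypothetical, the percentage change is assumed to be the day-over-day change of the Adjusted Close, and the per-item compound scores are assumed to have been computed already (the SenticNet scoring itself is not shown).

```python
from collections import defaultdict


def label_trend(prev_close, close, threshold=0.01):
    """Map a day-over-day price change to a trend label.

    Returns 1 (Buy/Rise) if the change is above +1%, -1 (Sell/Drop) if it is
    below -1%, and 0 (Hold) otherwise. Using consecutive Adjusted Close values
    is an assumption; the text only says "percentage change".
    """
    change = (close - prev_close) / prev_close
    if change > threshold:
        return 1
    if change < -threshold:
        return -1
    return 0


def daily_sentiment(scored_items):
    """Average per-item compound scores for each date.

    scored_items: iterable of (date_string, compound_score) pairs, where the
    compound score is assumed to come from an external sentiment scorer.
    Returns a dict mapping each date to the mean score of its news items.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for date, score in scored_items:
        totals[date] += score
        counts[date] += 1
    return {date: totals[date] / counts[date] for date in totals}


if __name__ == "__main__":
    print(label_trend(100.0, 101.5))
    print(daily_sentiment([("2018-01-02", 0.5), ("2018-01-02", -0.1)]))
```

Days without any news items simply do not appear in the returned dictionary; how such days are filled in is left open here because the text does not specify it.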
Phase 3: In phase 3, the two intermediate predictions (trend intermediate predictions and daily sentiment values) are combined to determine the final trend prediction of the next day. The steps in this phase are as follows:

– Step 1: Once the two intermediate prediction values have been obtained, a sliding window of 3 days is applied to the two intermediate prediction datasets (trend intermediate prediction and daily sentiment value). The two datasets are then joined together to form a final dataset with their dates included. In addition, the daily sentiment value of each day in the sliding window is further pre-processed so that the impact of the daily sentiment value decreases as the days go by. The weighted daily sentiment value of $Day_{t-x}$ on $Day_t$ can be calculated using the following equation:

$$\text{Weighted Value}_{t-x} = \frac{w - x}{w} \times \text{Value}_{t-x} \tag{2}$$

where $w$ represents the window size.
– Step 2: After the final dataset has been generated, it is ready to be fed into the learning-based algorithm for prediction. The predicted trend is the final prediction result of the proposed hybrid learning-based model.

Experiment Results and Discussion

Stock Market Data Used

One stock index and five stocks were used, namely Dow Jones Industrial Average (DJIA), Google (GOOG), Amazon (AMZN), Apple (AAPL), eBay (EBAY), and Citigroup (C). Two types of datasets are required. The first is the historical values of the stocks, and the second is the relevant New York Times news headlines.

For the stocks' historical values dataset, the daily data were downloaded from Yahoo! Finance. The dataset contains 7 columns: Date, Open, High, Low, Close, Volume, and Adjusted Close. The interval taken was from 1st Jan 2014 to 31st Dec 2018 (5 years in total).

The New York Times news dataset was obtained using the New York Times Archive API. The API also allows the news to be filtered based on the stock's name, and the dataset retrieved covered 5 years, from 1st Jan 2014 to 31st Dec 2018.

Experiment Parameter and Evaluation Measure

A total of eight learning-based models are used for the comparison in this research, including MLP, decision tree, NB, RF, logistic regression, XGBoost, LSTM, and CNN. 80% of the dataset was selected as the training set to build the model for the learning-based methods, and the remaining 20% was used as the testing set to verify the performance of the learning-based methods.

For hyper-parameter tuning of the models, we used grid search and expert experience to select the parameters. To maintain fairness in the comparison of different learning-based methods, we made sure that each model used its best optimized parameters. For example, the neural network model is a 3-layer MLP model with hidden layer sizes of 30. For DT, the CART algorithm was used for feature selection. For RF and XGBoost, 100 submodels and 1000 submodels were used, respectively. For LR, L2 regularization was selected as the penalty term. For LSTM, the number of hidden layer nodes with ReLU activation function was 50. For CNN, 32 filters were used, and the kernel size was 5.

For technical analysis, we used the technical analysis library in Python to generate a total of 58 features from the original stock time series dataset, and then the recursive feature elimination (RFE) method was used for feature selection. Finally, the five most important features were selected.

Different window sizes, n, were used for the trending prediction, meaning that we use the values of the previous n days to predict the value of the (n + 1)th day. After experimental exploration, we chose 3 as the window size.

Under data preprocessing, empty or infinite values were replaced with the value 0. In addition, the independent variable (X) was normalized from the actual value to its percentage change to obtain a smaller range of values to reduce variability, as formulated by the following equation:

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \tag{3}$$

where $x_{norm}$ is the normalized data, $x$ is the original data, and $x_{min}$ and $x_{max}$ are the corresponding minimum and maximum of each data dimension.

The dependent variable (Y) is the trending label based on n days of prediction. Finally, the performances were evaluated based on accuracy rate and F1-score for the three-class classification task.

Performance Comparison for Stock Trending Prediction

In this section, results obtained by the proposed approach are briefly discussed and are evaluated against the accuracy and F1-score evaluation metrics for the stock tickers: DJIA, GOOG, AMZN, AAPL, EBAY, and C. For the comparison experiments, "with technical analysis (TA)" denotes only adding generated technical indicators to the
learning-based models, similar to the methods used in some research [21, 22] mentioned in the "Related Work" section. "with sentiment analysis (SA)" denotes only adding sentiment values (i.e., New York Times news polarity) to the learning-based models. "with TA&SA" denotes the proposed method, which utilizes both technical indicators and sentiment values for the stock prediction task. The results can be seen in Tables 1 and 2.

Table 1 Accuracy of different learning-based methods for individual stock

Stock  Model     Baseline  with TA  with SA  with TA & SA
DJIA   MLP       73.09     64.26    72.69    64.67
DJIA   DT        61.04     60.64    72.69    64.08
DJIA   NB        66.05     57.43    72.69    72.29
DJIA   RF        69.08     70.28    72.69    72.69
DJIA   LR        72.69     71.89    72.69    72.69
DJIA   XGBoost   67.07     65.86    72.69    73.09
DJIA   LSTM      72.69     73.09    72.69    72.69
DJIA   CNN       72.69     73.09    72.69    73.41
GOOG   MLP       47.01     44.62    51.60    38.89
GOOG   DT        43.82     37.85    50.80    38.62
GOOG   NB        34.97     43.82    52.40    47.20
GOOG   RF        49.40     45.02    50.80    46.80
GOOG   LR        49.40     47.41    50.80    50.80
GOOG   XGBoost   39.44     43.82    51.20    46.80
GOOG   LSTM      49.40     49.40    50.00    49.60
GOOG   CNN       49.40     50.60    50.40    50.00
AMZN   MLP       48.40     42.80    50.20    39.55
AMZN   DT        44.40     38.80    50.20    49.00
AMZN   NB        40.84     33.20    50.20    39.49
AMZN   RF        46.80     47.60    50.20    48.19
AMZN   LR        47.20     48.40    50.20    49.00
AMZN   XGBoost   46.40     47.20    50.20    48.19
AMZN   LSTM      46.80     46.80    50.20    48.59
AMZN   CNN       46.80     46.80    51.20    48.59
AAPL   MLP       53.01     43.78    54.84    40.78
AAPL   DT        44.98     34.14    53.63    51.06
AAPL   NB        45.11     38.15    54.84    54.44
AAPL   RF        49.40     53.41    55.24    55.24
AAPL   LR        55.02     56.63    54.03    54.84
AAPL   XGBoost   42.97     50.60    54.44    55.65
AAPL   LSTM      55.02     55.02    54.84    54.84
AAPL   CNN       55.02     55.02    54.84    55.24
EBAY   MLP       58.13     54.88    60.00    46.13
EBAY   DT        43.09     42.28    60.00    46.13
EBAY   NB        46.80     49.19    60.00    61.82
EBAY   RF        56.10     58.54    60.00    59.59
EBAY   LR        60.16     59.76    60.00    60.41
EBAY   XGBoost   48.78     50.81    60.00    60.41
EBAY   LSTM      60.16     60.16    60.00    60.82
EBAY   CNN       60.16     60.38    60.00    60.82
C      MLP       53.63     44.76    55.06    53.94
C      DT        37.90     39.52    55.06    53.94
C      NB        44.95     40.73    55.06    55.47
C      RF        49.60     54.03    55.06    54.66
C      LR        54.84     52.42    55.06    55.47
C      XGBoost   42.34     47.18    55.06    54.25
C      LSTM      54.84     54.84    55.06    53.85
C      CNN       54.84     53.63    55.06    54.25

Table 1 shows the accuracy obtained from analyzing the stock index and individual stocks by using eight different learning-based methods. It can be observed that the baseline model with TA&SA achieves the best results in 15 out of 48 cases for all stocks, while only 3 and 5 best results are achieved by baseline and baseline with TA, respectively. Among the one stock index and five stocks, the baseline model with TA&SA achieves the best results three times, for DJIA, EBAY, and C, followed by the baseline model with SA. In addition, the baseline model with TA&SA manages to achieve the highest accuracy of 61.82% for the stock EBAY in all cases of the five individual stocks. The results show that the proposed approach (baseline with TA&SA) outperforms the other strategies in most cases.

As for different learning-based methods, NB achieves the best result for the stocks GOOG and EBAY, while CNN achieves the best result for the stock index DJIA and the stock AMZN. LR achieves the best result for AAPL, while NB and XGBoost both achieve the best result for the stock C.

Table 2 shows the F1-score obtained from analyzing the stock index and individual stocks by using eight different learning-based methods. Consistent with the results above from the accuracy measure in Table 1, it can be observed that the baseline model with TA&SA achieves the highest F1-score three times, followed by the baseline model with SA. It is worth mentioning that, compared to the baseline model, the baseline model with SA can obtain competitive results for the stock index DJIA and the stock C. In addition, the baseline model with SA manages to achieve the highest F1-score of 84.19% for DJIA with five different machine learning algorithms, which shows the effectiveness of sentiment analysis.

As for different learning-based methods, LR achieves the best result for four individual stocks (GOOG, AAPL, EBAY, and C), which shows that LR can be a good choice for individual stocks.

In addition, to verify the robustness and stability of the proposed method, we run the models N times (N = 5 in this paper) and report the mean value of the performance as well as its variability (standard deviation) obtained from using different learning-based models. Table 3 shows the accuracies and F1-scores as well as standard deviations obtained
from analyzing the stock index, DJIA. It can be seen that the standard deviations are relatively small, which indicates that the proposed method is stable. The results also indicate that the performance of the proposed method implemented using different learning-based models is slightly different but is stable as shown in Table 3. This discovery is consistent with the results obtained from analyzing different stocks in the previous section.

Table 2 F1-score of different learning-based methods for individual stock

Stock  Model     Baseline  with TA  with SA  with TA & SA
DJIA   MLP       67.37     64.35    84.19    64.67
DJIA   DT        60.08     61.52    84.19    64.08
DJIA   NB        66.05     60.63    60.63    73.67
DJIA   RF        63.87     66.95    84.19    63.80
DJIA   LR        84.19     66.21    84.19    63.11
DJIA   XGBoost   63.90     63.60    84.32    64.08
DJIA   LSTM      61.20     62.97    61.20    62.06
DJIA   CNN       61.20     66.06    61.20    62.89
GOOG   MLP       38.54     42.41    41.33    38.89
GOOG   DT        43.20     37.87    40.09    38.62
GOOG   NB        34.97     42.82    42.82    41.36
GOOG   RF        43.51     37.73    40.13    39.06
GOOG   LR        38.13     39.09    39.77    38.68
GOOG   XGBoost   36.90     40.67    40.78    38.80
GOOG   LSTM      32.67     32.67    39.28    39.59
GOOG   CNN       32.67     39.66    39.18    44.27
AMZN   MLP       39.31     41.85    48.94    39.55
AMZN   DT        43.32     37.78    48.94    39.55
AMZN   NB        40.84     30.62    30.62    39.49
AMZN   RF        41.57     40.96    48.94    39.55
AMZN   LR        39.37     35.87    48.94    39.55
AMZN   XGBoost   45.18     41.34    48.94    52.38
AMZN   LSTM      29.84     29.84    48.94    39.78
AMZN   CNN       29.84     29.92    48.94    39.78
AAPL   MLP       42.50     43.00    44.91    40.78
AAPL   DT        43.29     35.67    40.79    51.06
AAPL   NB        45.11     41.06    41.06    47.74
AAPL   RF        41.21     47.10    45.60    51.06
AAPL   LR        41.37     43.18    54.57    55.98
AAPL   XGBoost   38.24     44.38    44.68    53.06
AAPL   LSTM      39.06     39.06    38.84    39.54
AAPL   CNN       39.06     39.06    39.72    39.74

Result Analysis and Discussions

From the results obtained, it is discovered that the performance varies from stock to stock.

Firstly, by comparing the baseline model and the baseline model with TA, it is observed that the accuracy and F1-score of prediction both drop in most cases when utilizing technical indicators. However, there are also some cases where the baseline model with TA improves the accuracy and manages to generate the best accuracy compared to the other 3 approaches. The result implies that utilization of technical indicators has the potential to increase the accuracy of prediction. However, such technical indicators must be carefully selected through an optimized feature selection algorithm to prevent them from causing the opposite effect of reducing the accuracy.

Secondly, looking at the results obtained using the baseline model and the baseline model with SA, it can be observed that the utilization of daily sentiment values from New York Times News as an external factor (phase 2 of the proposed model) to the predicted trend is largely capable of increasing the accuracy of stock prediction. However, there are cases where slight reductions of accuracy when utilizing SA are experienced. This can be caused by reasons
EBEY MLP 50.07 49.21 45.00 46.13
such as failing to capture negation in News, and an insuf-
DT 44.14 43.70 45.00 46.13
ficient number of news items considered in the sentiment
NB 46.80 49.88 49.88 48.84
RF 50.30 50.46 75.00 45.39
analysis phase.
LR 75.13 59.46 75.00 46.99 Thirdly, comparing the results obtained by machine learn-
XGBoost 46.17 46.76 75.00 47.05 ing models and deep learning models, it can be found that a
LSTM 45.20 45.20 45.00 47.07 deep learning algorithm has no obvious advantages for stock
CNN 45.20 48.67 45.00 47.16 trending prediction, which explains that for some relatively
C MLP 41.07 38.12 63.15 53.94 simple tasks, traditional machine learning models can also
DT 38.00 39.80 63.15 53.94 achieve competitive performances.
NB 44.95 50.72 50.72 43.23 Lastly, from the observation of the one stock index and
RF 40.54 40.58 63.15 40.56 five stocks, the proposed baseline model with TA&SA out-
LR 70.83 57.59 63.15 41.50 performs the other three approaches in most cases. Thus, this
XGBoost 37.70 38.94 64.34 40.39 implies that the utilization of technical indicators together
LSTM 38.84 38.84 39.10 38.85
with daily sentiment values of New York Times News might
CNN 38.84 41.96 39.10 38.75
have the effect of further increasing the accuracy of stock
Accuracy of different learning-based methods for individual stock. prediction.
Highest values for each stock in bold
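The comparisons in this section rest on two metrics, accuracy and F1-score, computed per model and per configuration (baseline, with TA, with SA, with TA & SA). As a minimal sketch of how these two numbers can be obtained for a binary up/down trend task, the following Python code computes both from lists of true and predicted labels. The function names, the 1 = up / 0 = down encoding, and the toy labels are illustrative assumptions and are not taken from the paper. This section does not state which F1 averaging was used, so the sketch shows the plain binary F1 for the "up" class.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of days whose predicted trend matches the true trend."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (assumed encoding: 1 = upward trend, 0 = downward trend).
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"F1-score = {f1_score(y_true, y_pred):.4f}")
```

Reporting both metrics is useful because they can diverge when the up and down classes are imbalanced, which is one reason the accuracy and F1-score tables do not always rank the models in the same order.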
1100 Cognitive Computation (2023) 15:1092–1102
Table 3 Accuracy and F1-score of different learning-based methods for DJIA

          Models    Baseline         with TA          with SA          with TA & SA
Accuracy  MLP       73.09 (± 0.40)   64.26 (± 0.21)   72.69 (± 0.34)   64.67 (± 0.53)
          DT        61.04 (± 0.21)   60.64 (± 0.23)   72.69 (± 0.34)   64.08 (± 0.32)
          NB        66.05 (± 0.00)   57.43 (± 0.00)   72.69 (± 0.00)   72.29 (± 0.00)
          RF        69.08 (± 0.41)   70.28 (± 0.43)   72.69 (± 0.34)   72.69 (± 0.54)
          LR        72.69 (± 0.00)   71.89 (± 0.00)   72.69 (± 0.00)   72.69 (± 0.00)
          XGBoost   67.07 (± 0.00)   65.86 (± 0.00)   72.69 (± 0.00)   73.09 (± 0.00)
          LSTM      72.69 (± 0.40)   73.09 (± 0.54)   72.69 (± 0.68)   72.69 (± 0.75)
          CNN       72.69 (± 0.35)   73.09 (± 0.42)   72.69 (± 0.67)   73.41 (± 0.54)
F1-score  MLP       67.37 (± 0.52)   64.35 (± 0.63)   84.19 (± 0.67)   64.67 (± 0.54)
          DT        60.08 (± 0.23)   61.52 (± 0.34)   84.19 (± 0.67)   64.08 (± 0.33)
          NB        66.05 (± 0.00)   60.63 (± 0.00)   60.63 (± 0.00)   73.67 (± 0.00)
          RF        63.87 (± 0.76)   66.95 (± 0.84)   84.19 (± 0.77)   63.80 (± 0.68)
          LR        84.19 (± 0.00)   66.21 (± 0.00)   84.19 (± 0.00)   63.11 (± 0.00)
          XGBoost   63.90 (± 0.00)   63.60 (± 0.00)   84.32 (± 0.00)   64.08 (± 0.00)
          LSTM      61.20 (± 0.65)   62.97 (± 0.53)   61.20 (± 0.64)   62.06 (± 0.72)
          CNN       61.20 (± 0.47)   66.06 (± 0.56)   61.20 (± 0.67)   62.89 (± 0.55)
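Each entry in Table 3 has the form mean (± standard deviation) over N = 5 runs. A minimal sketch of that aggregation in Python follows. The per-run accuracies are made-up placeholder values, the helper name `summarize_runs` is hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether the sample or the population form was used.

```python
import statistics
from typing import Sequence


def summarize_runs(scores: Sequence[float]) -> str:
    """Format repeated-run scores as 'mean (± std)' with two decimals."""
    if len(scores) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (assumed choice)
    return f"{mean:.2f} (± {std:.2f})"


# Placeholder accuracies (in percent) from N = 5 hypothetical runs of one model.
run_accuracies = [72.0, 73.0, 74.0, 73.0, 73.0]
print(summarize_runs(run_accuracies))  # 73.00 (± 0.71)
```

When every run returns the same score, the standard deviation is 0.00, which matches the pattern seen in the NB, LR, and XGBoost rows of Table 3.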
Conclusion, Limitations, and Future Works

In conclusion, in contrast to EMH and RWT, both of which emphasize the non-viability of stock market prediction, this research has demonstrated that it is possible to predict stock market trends by using the right methods.

The proposed method is a 3-phase hybrid prediction model where daily sentiment values and technical indicators are considered when predicting the trends of the stocks GOOG, AMZN, AAPL, EBAY, and C. The 3 phases in the approach are phase 1: intermediate prediction using learning-based algorithms to generate the first intermediate trend prediction, phase 2: sentiment analysis where daily sentiment values of New York Times News are calculated, and phase 3: final prediction where the final trend is predicted by considering sentiment analysis as an external factor to the first intermediate trend prediction.

The performance of the model is evaluated using accuracy and F1-score, and the results show that the proposed approach achieves the highest accuracy of 73.41% and the highest F1-score of 84.19% for DJIA. In addition, the effect of utilizing sentiment analysis and technical indicators was discussed in detail. Also, utilizing technical indicators together with sentiment analysis can be seen to further enhance the prediction performance.

It is observed that no learning-based method is capable of consistently achieving the best performance. This has been demonstrated across eight different learning-based models in this research. This suggests that the applicability of each learning-based method differs from stock to stock. In addition, this research demonstrates the merit of incorporating both technical indicators and sentiment information for stock trending analysis. Further parameter optimization and consideration of other influence factors are our ongoing work.

Author Contribution Zhaoxia Wang: conceptualization, methodology, supervision, data curation, software design, visualization, writing (original draft), writing (review and editing). Zhenda Hu: investigation, formal analysis, software testing, visualization, validation, writing (review and editing). Fang Li: investigation, data curation, software development, writing (original draft). Seng-Beng Ho: conceptualization, methodology, supervision, writing (original draft), writing (review and editing). Erik Cambria: data curation, investigation, writing (review and editing).

Data Availability All data generated or analyzed during this study are included in this published article.

Declarations

Conflict of Interest The authors declare no competing interests.
13
Cognitive Computation (2023) 15:1092–1102 1101
5. Varfis A, Versino C. Univariate economic time series forecast- 27. Nelson DM, Pereira AC, de Oliveira RA, Stock market’s
ing by connectionist methods. In: 1990 International Conference price movement prediction with LSTM neural networks. In,.
on Neural Networks (ICNN) IEEE. 1990:342–5. International joint conference on neural networks (IJCNN).
6. Rather AM, Agarwal A, Sastry V. Recurrent neural network and IEEE. 2017:1419–26.
a hybrid model for prediction of stock returns. Expert Syst Appl. 28. Stoean C, Paja W, Stoean R, Sandita A. Deep architectures for
2015;42(6):3234–41. long-term stock price prediction with a heuristic-based strategy
7. Hafezi R, Shahrabi J, Hadavandi E. A bat-neural network multi- for trading simulations. PloS one. 2019;14(10).
agent system (BNNMAS) for stock price prediction: case study 29. Kim T, Kim HY. Forecasting stock prices with a feature fusion
of DAX stock price. Appl Soft Comput. 2015;29:196–210. LSTM-CNN model using different representations of the same
8. Xiong L, Lu Y. Hybrid ARIMA-BPNN model for time series pre- data. PloS one. 2019;14(2).
diction of the Chinese stock market. In: 2017 3rd International Con- 30. Sezer OB, Ozbayoglu AM. Financial trading model with stock
ference on Information Management (ICIM). IEEE; 2017. p. 93-7. bar chart image time series with deep convolutional neural net-
9. Lee SW, Um JY. Stock fluctuation prediction method and server. works. Intell Autom Soft Comput. 2020;26(2):323–34.
Google Patents; 2019. US Patent 10,185,996. 31. Huang W, Nakamori Y, Wang SY. Forecasting stock market
10. Kim KJ. Financial time series forecasting using support vector movement direction with support vector machine. Comput Oper
machines. Neurocomputing. 2003;55(1–2):307–19. Res. 2005;32(10):2513–22.
11. Bharathi S, Geetha A. Sentiment analysis for effective stock 32. Naeini MP, Taremian H, Hashemi HB, Stock market value
market prediction. Int J Intell Eng Syst. 2017;10(3):146–54. prediction using neural networks. International conference on
12. Ichinose K, Shimada K. Stock market prediction using keywords computer information systems and industrial management appli-
from expert articles. In: International Conference on Soft Com- cations (CISIM). IEEE. 2010:132–6.
puting and Data Mining. Springer. 2018:409–17. 33. Huang TT, Chang CH. Intelligent stock selecting via Bayesian
13. Zhang X, Zhang Y, Wang S, Yao Y, Fang B, Philip SY. Improv- naive classifiers on the hybrid use of scientific and humane attrib-
ing stock market prediction via heterogeneous information utes. In: 2008 Eighth International Conference on Intelligent Sys-
fusion. Knowl Based Syst. 2018;143:236–47. tems Design and Applications. vol.1. IEEE. 2008:617–21.
14. Si J, Mukherjee A, Liu B, Pan SJ, Li Q, Li H. Exploiting social 34. Wang Z, Jiao R, Jiang H. Emotion recognition using WT-SVM in
relations and sentiment for stock prediction. In: Proceedings human-computer interaction. J New Media. 2020;2(3):121.
of the 2014 Conference on Empirical Methods in Natural Lan- 35. Henrique BM, Sobreiro VA, Kimura H. Stock price prediction
guage Processing (EMNLP). 2014:1139–45. using support vector regression on daily and up to the minute
15. Wang Z, Ho S-B, Lin Z. Stock market prediction analysis by prices. J Finance Data Sci. 2018;4(3):183–201.
incorporating social and news opinion and sentiment. In: 2018 36. Marković I, Stojanović M, Stanković J, Stanković M. Stock market
IEEE International Conference on Data Mining Workshops trend prediction using AHP and weighted kernel LS-SVM. Soft
(ICDMW) IEEE. 2018:1375–80. Comput. 2017;21(18):5387–98.
16. Nguyen TH, Shirai K, Velcin J. Sentiment analysis on social 37. Anitescu C, Atroshchenko E, Alajlan N, Rabczuk T. Artificial
media for stock movement prediction. Expert Syst Appl. 2015; neural network methods for the solution of second order boundary
42(24):9603–11. value problems. Comput Mater Continua. 2019;59(1):345–59.
17. Li B, Chan KC, Ou C, Ruifeng S. Discovering public sentiment 38. Göçken M, Özçalıcı M, Boru A, Dosdoğru AT. Integrating
in social media for predicting stock movement of publicly listed metaheuristics and artificial neural networks for improved stock
companies. Inform Syst. 2017;69:81–92. price prediction. Expert Syst Appl. 2016;44:320–31.
18. Hu H, Tang L, Zhang S, Wang H. Predicting the direction of 39. Zhu K, Zhang N, Ying S, Wang X. Within-project and cross-project
stock markets using optimized neural networks with Google software defect prediction based on improved transfer naive Bayes
Trends. Neurocomputing. 2018;285:188–95. algorithm. Comput Mater Continua. 2020;63(2):891–910.
19. Hu Z. Crude oil price prediction using CEEMDAN and LSTM- 40. Khaidem L, Saha S, Dey SR. Predicting the direction of stock
attention with news sentiment index. Oil & Gas Science and market prices using random forest. arXiv preprint http://a rxiv.o rg/
Technology-Revue d’IFP Energies nouvelles. 2021;76:28. abs/1605.00003. 2016.
20. Malandri L, Xing FZ, Orsenigo C, Vercellis C, Cambria E. Public 41. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In:
mood-driven asset allocation: the importance of financial sentiment Proceedings of the 22nd ACM Sigkdd International Conference
in portfolio management. Cognit Comput. 2018;10(6):1167–76. on Knowledge Discovery and Data Mining 2016:785–94.
21. Parray IR, Khurana SS, Kumar M, Altalbe AA. Time series 42. Wang Z, Chong CS, Lan L, Yang Y, Ho S-B, Tong JC, Fine-grained
data analysis of stock price movement using machine learning sentiment analysis of social media with emotion sensing. In Future
techniques. Soft Comput. 2020;24(21):16509–17. Technologies Conference (FTC). IEEE. 2016:1361–4.
22. Dey PP, Nahar N, Hossain B. Forecasting stock market trend 43. Wang Z, Ho S-B, Cambria E. A review of emotion sensing:
using machine learning algorithms with technical indicators. Int categorization models and algorithms. Multimed Tools Appl.
J Inform Technol Comput Sci. 2020;12(3):32–8. 2020;79:35553–82.
23. Agrawal M, Shukla PK, Nair R, Nayyar A, Masud M. Stock 44. Xing FZ, Cambria E, Welsch RE. Natural language based financial
prediction based on technical indicators using deep learning forecasting: a survey. Artif Intell Rev. 2018;50(1):49–73.
model. Comput Mater Continua. 2022;70(1):287–304. 45. Hu Z, Wang Z, Ho S-B, Tan A-H. Stock market trend forecasting
24. Li Y, Pan Y. A novel ensemble deep learning model for stock based on multiple textual features: a deep learning method. In:
prediction based on stock prices and news. Int J Data Sci Anal. 2021 IEEE 33rd International Conference on Tools with Artificial
2022;13(2):139–49. Intelligence (ICTAI). IEEE 2021:1002–7.
25. Picasso A, Merello S, Ma Y, Oneto L, Cambria E. Technical 46. Xing FZ, Cambria E, Zhang Y. Sentiment-aware volatility fore-
analysis and sentiment embeddings for market trend prediction. casting. Knowledge-Based Syst. 2019;176:68–76.
Expert Syst Appl. 2019;135:60–70. 47. Merello S, Ratto AP, Oneto L, Cambria E. Ensemble application
26. Fischer T, Krauss C. Deep learning with long short-term mem- of transfer learning and sample weighting for stock market predic-
ory networks for financial market predictions. Eur J Oper Res. tion. In: 2019 International Joint Conference on Neural Networks
2018;270(2):654–69. (IJCNN). IEEE. 2019:1–8.
13
1102 Cognitive Computation (2023) 15:1092–1102
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
Zhaoxia Wang1 · Zhenda Hu2 · Fang Li3 · Seng‑Beng Ho4 · Erik Cambria3
Zhenda Hu
huzhenda2020@gmail.com

Fang Li
asfli@ntu.edu.sg

Seng‑Beng Ho
hosb@ihpc.a-star.edu.sg

Erik Cambria
cambria@ntu.edu.sg

1 School of Computing and Information Systems, Singapore Management University, 80 Stamford Road, Singapore 178902, Singapore
2 School of Information Management and Engineering, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai 200433, China
3 School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
4 Social and Cognitive Computing Department, Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), 1 Fusionopolis Way, Singapore 138632, Singapore