PERFORM THE STOCK PREDICTION USING THE SENTIMENT ANALYSIS AND TIME SERIES
FORECASTING APPROACHES TO DETERMINE THE OPTIMAL ONE

Kavya Thimmaiah

Dissertation submitted in partial fulfilment of the requirements for the degree of

Master of Science in Financial Analytics

at Dublin Business School

Supervised by

Mr. Heikki Laiho

January 2024
DECLARATION

I hereby declare that the dissertation I have submitted, “Perform the
Stock Prediction Using the Sentiment Analysis and Time Series Forecasting
Approaches to Determine the Optimal One”, to Dublin Business School for the
award of Master of Science in Financial Analytics, under the supervision of Mr.
Heikki Laiho, is solely the result of my own work; collaborative contributions have
been acknowledged and are explicitly referenced in the text. This work has not
been submitted to any other university or college for the award of a degree.

Student Name: Kavya Thimmaiah

Student Number: 20000956

Date: 08th January 2024

ACKNOWLEDGEMENT

I would like to express my sincere appreciation to Dublin Business School and my
gratitude to my research supervisor, Mr. Heikki Laiho, for his consistent support,
advice, and enthusiastic encouragement during my study. He has always been a
great mentor to me in my search for knowledge, providing genuine support,
inspiration, and suggestions.

Secondly, I would like to thank my parents and the members of my family for
their support throughout my career pursuit.

ABSTRACT

Stock returns are affected by a variety of factors, among which the social media remarks
of public figures are one of the more important influences on stock market trends. In
addition, the latest news about a company's products also matters. In this paper, we determine
the sentiment of public figures' social media remarks from the perspective of textual
sentiment and compare it with the stock chart of the same day to analyse the connection
between the two. Specifically, we first construct a dataset of public figures' social media
remarks and label their sentiment, and then train a BERT model so that it can
judge the sentiment of a new remark when it is input, which serves as a
basis for stock prediction. The experiment shows that public figures' remarks and
news have a strong impact on stock trading on the same day but only a small impact
over a longer period; moreover, the more influential the public figure, the more pronounced
the impact on the stock.

The development and wealth of countries depend heavily on the
stock market. Data mining and artificial intelligence methods are required to analyse stock
market data. The financial success of particular businesses is one of the important factors
that has a significant impact on stock price volatility. However, news reports also have a
significant impact on how the stock market moves. In this research, we apply sentiment
classification to non-quantitative data, such as financial news articles, to forecast a
company's future stock trend. We seek to shed light on the effect of news reports on the
stock market by analysing the connection between news and stock movement. Our study
seeks to advance knowledge of the role of news sentiment in forecasting stock market
trends. The dataset used in this study consists of news headlines from the financial news
website Financial Times, and the prediction task is to classify the direction of stock price
changes as either positive or negative. The purpose of this study is to evaluate the
effectiveness of sentiment analysis for stock prediction and to compare the performance of
different algorithms.

Table of Contents
DECLARATION

ABSTRACT

1. THESIS INTRODUCTION

1.1 Stakeholder:

1.2 Research Hypothesis:

1.3 Aim & Objective:

1.4 Significance of the Approach:

1.5 Limitations of the Solution:

1.6 Components of the Solution:

2. RELATED WORK

2.1 Traditional Machine Learning Methods:

2.2 Deep Learning Methods

3. APPROACH AND METHODOLOGY

3.1 Implemented Components

3.2 Methodology

3.3 Data Analysis

4. IMPLEMENTATION

4.1 Dataset:

4.1 Time Series Analysis:

4.3 BERT-Based Text Sentiment Classification

4.4 Model Training

4.3 Sentiment Analysis Approach:

4.3 Time Series Forecasting Approach:

4.4 Application Interface & Obtained Results:

5. CONCLUSION

6. FUTURE WORK

Table of Figures
Figure 1: LSTM Model
Figure 2: BERT Algorithm
Figure 3: Prediction Using the Time-Series Approach
Figure 4: Prediction using the Sentiment Analysis Approach

1. THESIS INTRODUCTION
Stocks are credit instruments on the financial market that can be used for
transfer and trading, offering the opportunity to make a profit on the one hand,
and the risk of loss on the other. Yield and liquidity are the two most important
factors of stocks, which reflect the laws of economic functioning and are also one
of the most important indicators of the securities market. Standard financial
theory, based on the efficient market assumption, assumes that markets are fully
transparent and fairly competitive. However, this assumption is far from reality.
Many studies have shown that investors' information acquisition and cognitive
biases in interpreting information affect their decision-making behavior in the stock
market, which in turn affects stock returns and liquidity.

Stock returns and liquidity are affected by various factors. Most individual
investors lack trading experience, their judgment of the market is highly
subjective, and they are easily influenced by the public media and the people around
them, producing a very obvious "herd effect". Public figures in particular express their
views through various social media such as Twitter, Snowball and Collingwood,
which can easily cause emotional fluctuations among investors and affect their
judgment of the stock market trend. The social media remarks of public
figures therefore attract a high degree of attention, and it is of great significance to
study the impact of their remarks on the stock market and to predict the trend of the
stock market.

Considering that the comments scattered across Internet media are a large
amount of unstructured real-time text information, it is natural to think of using
computers to help process them. Text mining techniques can uncover meaningful
and previously unknown knowledge in large amounts of textual information, and can be
used to analyze the emotions and sentiments of public figures' social media comments.
The use of computers to characterize and process textual information on the Web
has advanced the field of finance. The rise of the internet and social media
platforms has brought about an explosion of user-generated content on the web,
such as blogs, social media posts, forums, and product reviews [1]. As a result,
businesses and organizations face the daunting task of analyzing this unstructured
data to understand their customers' needs, preferences, and sentiments towards
their products or services. Sentiment analysis, also known as opinion mining, is a
technique used to extract subjective information from text data [2].

Sentiment analysis is a crucial field of study within natural language
processing and machine learning because of its numerous practical uses, such as
analyzing customer feedback, social media monitoring, and product
recommendations [3]. It involves extracting subjective information from textual
data and categorizing it as positive, negative, or neutral through the application of
natural language processing techniques. In recent years, machine learning
algorithms have emerged as a popular approach for sentiment analysis. These
algorithms can learn from labeled training data to predict the sentiment of text
data with high accuracy [4]. In this project, we will explore the effectiveness of
different machine learning techniques, including the Count Vectorizer and TF-IDF
Vectorizer (Term Frequency-Inverse Document Frequency) feature extractors and the
Naive Bayes classifier, in sentiment analysis of product reviews. The goal is to compare
the performance of these techniques and identify the best one for the task at hand.

We will use a dataset of product reviews and news about the Apple company
as the source of our data. The dataset contains Twitter posts and news content
gathered from different sources. We will preprocess the data to remove noise,
transform the text into numerical features using Count Vectorizer and Tfidf
Vectorizer, and apply classification models to predict the sentiment of
the reviews. Using measures like accuracy, precision, recall, and F1-score, we will
assess the effectiveness of the machine learning algorithms. Furthermore, we will
visualize the results using confusion matrices to gain insights into the performance
of the classifiers. Finally, we will compare our results with existing studies in the
field and draw conclusions on the effectiveness of these algorithms for sentiment
analysis.

1.1 Stakeholder:

The stakeholders of the research application are stock market investors who
would like to predict stock price trends as reliably as possible. At present, the
application performs the prediction only for Apple stock. However, the dataset
could be replaced with data for a different stock and the prediction performed in the
same way as for Apple stock. Any customization required has to be flagged by the
stakeholders so that the solution can be made more generic.

1.2 Research Hypothesis:

The approach that uses the sentiment analysis model predicts stock prices more
accurately than the model that considers only the historical data.

1.3 Aim & Objective:

Predicting the price trend of a specific stock means forecasting whether its
price will rise or fall over some future horizon (a day, a week, a month, etc.). In other
words, taking t as the current day, the objective is to predict whether its price will
increase or decrease on day t + n. This scenario can be seen as a
binary classification problem, where the goal is to predict one of two classes based
on the input data. The binary classification output is then used to predict the stock
price trend.
In this work, we propose an approach for predicting stock price movements
using traditional classifiers, treating each stock as a single dataset, and compare it
with a sentiment analysis approach. More specifically, we tune hyperparameters
and train a specialized model for each stock. Our methodology follows
a technical analysis approach that considers the percentage change of prices
(closing, opening, high, low) and volume relative to their previous day values as
features. Additionally, we incorporate the percentage change of these indicators
relative to their 5-day, 10-day, and 15-day moving averages. We claim that
specialized models for each stock, combined with feature extraction and data
preprocessing, play crucial roles in classifier performance.
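
The feature construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: it uses only closing prices, the moving average is assumed to include day t itself, and the function names (`pct_change`, `moving_average`, `build_features`, `label`) and the toy price series are hypothetical.

```python
def pct_change(curr, prev):
    """Relative change of curr with respect to prev, as a fraction."""
    return (curr - prev) / prev


def moving_average(values, end, window):
    """Mean of the `window` values ending at index `end` (inclusive)."""
    return sum(values[end - window + 1 : end + 1]) / window


def build_features(closes, t, windows=(5, 10, 15)):
    """Features for day t from a list of closing prices (assumed layout)."""
    feats = {"chg_prev_day": pct_change(closes[t], closes[t - 1])}
    for w in windows:
        # Change of today's close relative to its w-day moving average.
        feats[f"chg_vs_ma{w}"] = pct_change(closes[t], moving_average(closes, t, w))
    return feats


def label(closes, t, n=1):
    """Binary target: 1 if the close on day t + n is above the close on day t."""
    return 1 if closes[t + n] > closes[t] else 0


closes = [100.0 + i for i in range(20)]  # toy, steadily rising series
features = build_features(closes, t=15)
target = label(closes, t=15, n=1)
print(features, target)
```

The same two helpers would apply unchanged to the opening, high, low, and volume series; each (features, target) pair then becomes one training example for the binary classifier.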

1.4 Significance of the Approach:

Sentiment analysis has emerged as a popular technique for predicting stock
prices, given the increasing availability of text data, which includes news, social
media posts and articles. The premise is that the sentiment expressed in these texts
can reveal information about market sentiment and expectations, which can
be used to make predictions about the direction or magnitude of stock price
changes. Stock market prediction is an important area in finance that is always in
demand. Stock prices are influenced by a wide range of factors such as economic
indicators, company performance and news. One of the key factors that can
influence stock prices is investor sentiment. Sentiment analysis
can be used to analyze the emotions and opinions of investors
towards a particular stock, which can help predict future stock prices. In this report,
we explore the use of sentiment analysis in stock prediction.

Sentiment analysis with the LSTM and BERT algorithms can be a valuable tool
for stock prediction. By analyzing investor sentiment, we can gain insights into
market trends and make informed investment decisions. The LSTM is a
powerful machine learning algorithm that can combine sentiment data with
other factors to predict future stock prices. As the fields of NLP
and machine learning continue to advance, we can expect sentiment analysis with
the LSTM algorithm to become an even more powerful tool for stock prediction in
the future. Stock prediction is a challenging task that is of great interest to investors
and financial analysts.

1.5 Limitations of the Solution:

The developed application has many limitations; the major ones are highlighted
here. First, model training and prediction are done with a single
algorithm, without any substantial comparison with other deep learning or
machine learning algorithms. As a result, when the solution moves to the next
phase for further enhancement, more complex work will be required to include an
algorithm comparison phase as well. In addition, we cannot ensure that all stock
data will be in the same format, so the data preprocessing techniques may need to
change according to the dataset to be processed for a different stock.

1.6 Components of the Solution:

o A component to pre-process the dataset, which is in unstructured text
format.
o A BERT model to perform sentiment analysis on the text dataset
(news, tweets) to find whether the trend is positive or negative.
o A code block that consumes the BERT model and prepares a final dataset
containing the average value calculated from the predictions.
o An LSTM model whose predicted Apple stock prices are visualized in
plots, so that the actual value can be compared with the value predicted by
the developed model.

2. RELATED WORK

2.1 Traditional Machine Learning Methods:

The text sentiment classification problem can use any text classification
algorithm, such as the K Nearest Neighbor (KNN) algorithm, Naïve Bayes, Fisher's
Discriminant Criterion, Support Vector Machines and so on. Literature [5] proposed an
improved k Nearest Neighbor classification method, comparing three classification
algorithms and the improved algorithm on experimental data. The experiments
show that the improved method performs the best among the KNN classification
methods, with an accuracy of 11.5% and a precision of 20.3%.

Naïve Bayes is a supervised learning algorithm that assigns to a given
document $d$ the class $c^* = \arg\max_c P(c \mid d)$. The Naïve Bayes classifier is
derived by first observing that, by Bayes' rule, $P(c \mid d) = P(c)\,P(d \mid c) / P(d)$,
where $P(d)$ plays no role in selecting $c^*$. To estimate the term $P(d \mid c)$, Naïve
Bayes decomposes it by assuming that the features $f_i$ are conditionally independent
given the class of $d$. The training method consists of relative-frequency estimation of
$P(c)$ and $P(f_i \mid c)$ using add-one smoothing. Despite its simplicity and the fact
that the assumption of conditional independence clearly does not hold in real-world
situations, text categorization based on Naïve Bayes still performs surprisingly well.
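
As an illustration of the training and decision rule just described, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing. The function names (`train_nb`, `predict_nb`) and the four toy headlines are hypothetical and are not taken from the dataset used in this work; scores are computed in log space to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (tokens, label). Returns priors, per-class counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    # Relative-frequency estimate of P(c).
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, word_counts, vocab


def predict_nb(tokens, priors, word_counts, vocab):
    """Return argmax over c of log P(c) + sum_i log P(f_i | c), add-one smoothed."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


train = [
    ("shares surge on strong earnings".split(), "positive"),
    ("record profit lifts stock".split(), "positive"),
    ("shares plunge after weak earnings".split(), "negative"),
    ("lawsuit drags stock lower".split(), "negative"),
]
model = train_nb(train)
prediction = predict_nb("strong profit".split(), *model)
print(prediction)
```

The denominator `total + len(vocab)` is what makes the estimate "add-one": every vocabulary word receives one pseudo-count in every class, so no conditional probability is ever zero.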

Fisher's linear discriminant is one of the effective methods for
dimensionality reduction in statistical pattern recognition. Its main idea can be
briefly described as follows: assuming that there are two classes of sample points in
the d-dimensional data space, we seek a line in the original space such that
the projections of the sample points onto the line can be separated as well as
possible by a point on the line. In other words, the larger the squared difference
between the means of the two projected classes, and the smaller the
within-class scatter, the better the line. Literature [6] conducted two
experiments based on Fisher's discriminant ratio by combining different feature
selection methods with two candidate feature sets. On 2739 subjective
documents and 1006 car-related subjective documents from COAE2008, the
experimental results show that the accuracy of Fisher's discriminant ratio based on
word frequency estimation is 86.61% and 82.80% on the two corpora,
respectively.
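
For two classes, the line that maximizes this criterion has the well-known closed form w ∝ S_W⁻¹(m₁ − m₀), where m₀ and m₁ are the class means and S_W is the summed within-class scatter matrix. The sketch below computes it for two-dimensional points in plain Python; the helper names and the toy points are hypothetical, and it is not the feature-selection procedure of literature [6].

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter(points, m):
    """2x2 within-class scatter matrix: sum of (x - m)(x - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x, y in points:
        dx, dy = x - m[0], y - m[1]
        s[0][0] += dx * dx
        s[0][1] += dx * dy
        s[1][0] += dy * dx
        s[1][1] += dy * dy
    return s


def fisher_direction(class0, class1):
    """Direction w = S_W^{-1} (m1 - m0) for two classes of 2-D points."""
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det], [-sw[1][0] / det, sw[0][0] / det]]
    d = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * d[0] + inv[0][1] * d[1], inv[1][0] * d[0] + inv[1][1] * d[1]]


class0 = [(1, 2), (2, 3), (3, 3), (2, 1)]
class1 = [(6, 5), (7, 8), (8, 7), (7, 6)]
w = fisher_direction(class0, class1)
proj0 = [w[0] * x + w[1] * y for x, y in class0]
proj1 = [w[0] * x + w[1] * y for x, y in class1]
print(w, max(proj0), min(proj1))
```

Any threshold between the largest projection of the first class and the smallest projection of the second then separates the two toy classes on the line.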

Support Vector Machines (SVMs) have been shown to be very effective in
traditional text categorization, often outperforming Naïve Bayes. In the two-class
case, training consists of finding a hyperplane, represented by a vector, that not
only separates the document vectors in one class from the document vectors in the
other, but for which the separation, or margin, is as large as possible.
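
For reference, the maximum-margin idea is usually written as the following optimization problem. This is the standard hard-margin formulation for linearly separable data, given here as a sketch; the symbols ($\mathbf{w}$, $b$, document vectors $\mathbf{x}_i$ with labels $y_i \in \{-1, +1\}$) are notation introduced for illustration and do not appear in the cited works.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, N .
```

The distance between the two supporting hyperplanes $\mathbf{w} \cdot \mathbf{x} + b = \pm 1$ is $2 / \lVert \mathbf{w} \rVert$, so minimizing $\lVert \mathbf{w} \rVert$ maximizes the margin.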

Integrating various machine learning methods can make the classification
task more effective. For example, the ensemble method proposed in literature [7]
is based on static classifier selection. It combines Logistic Regression, Naïve Bayes,
Fisher Linear Discriminant, and Support Vector Machines as the base learners,
whose precision and recall values determine the weight
adjustment. The experimental analysis of classification tasks (including sentiment
analysis, software defect prediction, credit risk modeling, spam filtering, and
semantic mapping) in that paper shows that the proposed ensemble classification
scheme yields better prediction results than the
traditional single-learner approach. The laptop dataset showed the best
classification accuracy (98.86%) among all examined datasets.
2.2 Deep Learning Methods:

With the improvement of computing power and the successful
application of deep learning methods in image categorization, researchers have also
begun to apply deep learning to the problem of semantic categorization of text.
The effects of word embedding and long short-term memory (LSTM) on sentiment
categorization in social media have been investigated in literature [8]. The word
embedding model converts the words in a post into vectors, and the sequence of
words in a sentence is fed into the LSTM. A deep learning based approach to
categorizing user opinions expressed in comments (called RNSA) is proposed in
literature [9]. RNSA overcomes the obstacles of word order and information.
Literature [10] proposes a two-stage data balancing scheme for text sentiment
classification, which not only makes the data boundaries clear, but also balances
the class distribution of the training dataset. Literature [11] proposed a sentiment
feature-enhanced deep neural network (SDNN), which
integrates sentiment linguistic knowledge into a deep neural network through an
affective attention mechanism. This new emotion attention mechanism uses an
emotion lexicon to help select the context words related to key emotion words.
Literature [12] proposes a computationally
efficient deep learning model for binary sentiment categorization aimed at
determining the sentiment polarity of opinions, attitudes, and emotions expressed
by people in written texts. The performance metrics of the method proposed in
that paper are competitive with recently published models with relatively complex
architectures. Furthermore, it can be inferred that the proposed architecture, based
on a single-layer BiLSTM, is computationally efficient and can be recommended for
real-time applications in the field of sentiment analysis. Literature [13] uses a
transfer capsule network model which is capable of transferring knowledge
obtained at the document level to the aspect level for classification based on the
sentiment detected in the text. The routing approach is extended to group
semantic capsules for use in a transfer learning framework.

Figure 1: LSTM Model
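
For readers unfamiliar with the architecture, a standard LSTM cell is conventionally defined by the gate equations below. This is the textbook formulation, offered as a sketch rather than a description of the exact variant in the figure; the symbols ($x_t$ for the input at step $t$, $h_t$ for the hidden state, $c_t$ for the cell state, $W$, $U$, $b$ for learned parameters, $\sigma$ for the logistic sigmoid, $\odot$ for the element-wise product) are notation introduced here.

```latex
\begin{aligned}
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\!\left(c_t\right)
\end{aligned}
```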

Literature [14] proposes a capsule-based hybrid network model with the
advantages of short training time and simple network structure for better
performance. Literature [15] proposes a BERT model incorporating an attention
mechanism; the model uses the attention mechanism to
analyze the weights of short texts and computes their weighted sum to finally obtain
the category output of the short texts. Targeting the characteristics of microblog
comments, such as few words and short sentences, that study achieved good
results by improving the lexical model. Experimental results show that the
algorithm proposed in that study outperforms other algorithms in terms of
classification accuracy and recall in the aspect-level sentiment categorization
task. From the above related work, it can be seen that deep learning is considerably
superior to traditional machine learning methods in terms of accuracy in text
categorization, so we choose the deep learning method as the classification model
in this study.

Many studies have concentrated on
the use of sentiment analysis with the Random Forest algorithm for stock prediction.
A study by Chen et al. [16] used the Random Forest algorithm with sentiment data
to predict the stock prices of Chinese companies. They found that the Random
Forest algorithm with sentiment data could improve the accuracy of stock price
predictions by up to 2.6% compared to using only financial data. Similarly,
Bollen et al. [17] found that Twitter sentiment
can help to predict stock market movements with an accuracy of up to
87.6%.

Another study by Naik et al. [18] also concentrates on the use of sentiment
analysis to help predict the stock prices of Chinese companies. They found that
sentiment analysis could improve the accuracy of stock price predictions by up to
6%. Another study, by Angelika, used the Random Forest algorithm with
sentiment data to predict the stock prices of American companies and found that
the Random Forest algorithm with sentiment data could improve the accuracy of
stock price predictions by up to 3.3% compared to using only financial data. A
number of studies have explored how effectively sentiment analysis can help in
the prediction of stocks, using a variety of methods and data sources. In another
study, Xiao Ding [19] used a machine learning algorithm to predict the direction of
stock price changes based on news articles, while other work used Twitter data
to predict the DJIA index. Both studies found that sentiment analysis is
an effective tool for stock price prediction, with varying degrees of
accuracy depending on the data sources and methods used. Other studies have
focused on specific industries or events, such as the financial crisis of 2008
(Xiaodong Li 2014) or the announcement of mergers and acquisitions (Shahi, T.B.,
2019). These studies found that sentiment analysis can capture the market
reactions to these events and can provide valuable insights into the future
performance of the affected companies or sectors. In terms of methods, many
machine learning algorithms, such as Random Forest and Multinomial Naive Bayes,
have been used for stock prediction with sentiment analysis
(Mittermayer, 2004). Some studies have also combined sentiment analysis with
other techniques such as technical analysis or network analysis to improve the
prediction accuracy (Wang et al., 2017). However, there are also challenges and
limitations to using sentiment analysis for stock prediction. One issue is the quality
and reliability of the text data, which can be influenced by factors such as bias,
noise, and ambiguity. Another issue is the changing nature of sentiment over
time, as market conditions and expectations can shift quickly. Additionally, there
may be other factors that influence stock prices that are not captured by
sentiment analysis.

The importance of sentiment analysis in predicting stock prices has grown.
Recent developments in natural language processing have facilitated the creation of
several algorithms for predicting stock prices based on sentiment
analysis of news stories and social media posts. In this study, we test the
effectiveness of three techniques for sentiment analysis on the daily newspaper
headlines of a particular company: Count Vectorizer, Tfidf Vectorizer, and Naive
Bayes. Previous research has looked into the use of sentiment analysis to forecast
stock values. Li and Li (2011) developed a regression model that beat a benchmark
model using sentiment analysis on data from stock message boards. Zhang et al.
(2011) found that sentiment analysis increased the precision of prediction models
when applied to financial news stories. Machine learning algorithms like Naive
Bayes, Support Vector Machines (SVM), Random Forest, and XGBoost have been
extensively used in recent years to predict stock prices using sentiment analysis.
According to the tests of Akita and Kitagawa (2016) on financial news articles, the
most successful machine learning algorithms were Naive Bayes, SVM, and Random
Forest [20].

Common natural language processing methods for feature extraction from text
include Count Vectorizer and Tfidf Vectorizer. The Count Vectorizer turns
text documents into a feature matrix of token counts, whereas the Tfidf Vectorizer
transforms text documents into a feature matrix of term frequency-inverse
document frequency (TF-IDF) values. Overall, sentiment analysis has shown
promise for predicting stock prices, and earlier studies have shown the
effectiveness of machine learning techniques like Count Vectorizer, Tfidf Vectorizer,
and Naive Bayes. By applying these techniques to daily news headlines and offering
insights into their efficacy in sentiment analysis for stock price prediction, this
project seeks to advance existing research [21].

In summary, sentiment analysis has shown promise as a technique for
predicting stock prices, with a growing body of research exploring the methods and
applications of this approach. However, more work is needed to address the
challenges and limitations and to refine the methods which will help in improving
the accuracy and dependability of the predictions.

3. APPROACH AND METHODOLOGY
Natural language processing (NLP) techniques are used in this project's
methodology and strategy to evaluate sentiment in texts from a variety of sources,
including news articles, social media posts, and financial reports. The research
compares various sentiment analysis algorithms, such as lexicon-based methods
and machine learning models. The effect of various sentiment characteristics,
including polarity and subjectivity, on the accuracy of stock price prediction is also
examined. The methodology of the research includes data collection
and preprocessing, model training and testing, and evaluation of performance
measures such as recall, accuracy, and F1 score. The results of this research can have
significant implications for financial decision-making and offer information on how
well various sentiment analysis techniques can forecast stock prices.
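
The performance measures named above can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of that computation; the function name `binary_metrics` and the toy label vectors are hypothetical, with 1 assumed to mean a price rise and 0 a fall.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # toy ground truth
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # toy predictions
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision and recall are reported alongside accuracy because, when rising and falling days are unevenly represented, accuracy alone can look good for a classifier that always predicts the majority class.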

To perform sentiment analysis, we need to collect data from various sources,
including social media platforms, news websites, and financial blogs. We can then
use natural language processing (NLP) techniques to analyze the text and
determine the sentiment of the content. NLP techniques can help identify positive,
negative, and neutral sentiments from the text. Once we have collected the
sentiment data, we can use the LSTM algorithm to predict future stock
prices. We can use the sentiment data, together with other factors that
influence stock prices, such as market trends and company
performance, as input features for the LSTM algorithm. We can then train the
algorithm on historical data to predict future stock prices.

3.1 Implemented Components:

• Vectorizer

The vectorization procedure in natural language processing involves transforming
text data into a numerical format that machine learning algorithms can
comprehend. A vectorizer is a tool that does this conversion, usually by creating a
matrix of word counts or frequency statistics.

• CountVectorizer

CountVectorizer is a tool for transforming text data into a table of word frequency
counts. The table has a column for each unique word in the vocabulary and a row
for each document. Each cell holds the number of times a particular word appears
in a given document. The vocabulary is built from the training data, and each
document is represented as a vector whose length equals the number of terms in
the vocabulary and whose values are the counts of those terms in the document.
The resulting vectors are then used as input to the Random Forest algorithm for
making predictions.
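
The counting idea described above can be sketched in plain Python. This is only
an illustration of the technique, not scikit-learn's CountVectorizer itself; the
function names and the two sample headlines are assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each unique lowercase token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vectorize(documents, vocabulary):
    """Return one row per document with one count per vocabulary term."""
    matrix = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # Terms absent from the document get a count of zero.
        matrix.append([counts.get(term, 0) for term in vocabulary])
    return matrix


# Hypothetical headlines, used only to show the shape of the result.
headlines = [
    "shares rise after strong earnings",
    "shares fall after weak earnings",
]
vocab = build_vocabulary(headlines)
X = count_vectorize(headlines, vocab)

print(list(vocab))
for row in X:
    print(row)
```

Each row has as many entries as there are vocabulary terms, which is the
document-by-term table the paragraph describes.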

• TF-IDF Vectorizer

TfidfVectorizer is another type of vectorizer that converts text data into a matrix
of term frequency-inverse document frequency (TF-IDF) weights. This method is
similar to CountVectorizer, but it takes into account the importance of each word
in the document and in the corpus as a whole. A word receives a high weight when
it occurs frequently within a document but rarely across the corpus of documents,
so common words that appear everywhere are down-weighted. The resulting
vectors are then used as input to the Random Forest algorithm for making
predictions.
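
A minimal sketch of the weighting follows, assuming the textbook formula
tf × log(N / df). scikit-learn's TfidfVectorizer uses a smoothed IDF and
normalizes each row by default, so its numbers will differ; the function name and
sample headlines here are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical headlines, used only for illustration.
headlines = [
    "shares rise after strong earnings",
    "shares fall after weak earnings",
    "regulator opens probe",
]
scores = tf_idf(headlines)

# "rise" appears in one document, "shares" in two, so "rise" weighs more.
print(scores[0]["rise"], scores[0]["shares"])
```

With this formula a term that appears in every document receives a weight of
zero, which is the down-weighting of common words in its most extreme form.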

• BERT

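BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained
language model that is often used in natural language processing tasks such as text
classification. It is built on the Transformer encoder and learns contextual
representations of words from the surrounding text, so it takes tokenized text as
input rather than Count Vector or TF-IDF Vector features. For comparison, the
Naive Bayes classifier is based on Bayes' theorem: it estimates the likelihood that a
document belongs to a particular class from the likelihood that its words occur in
that class. Naive Bayes performs well when the features are independent of each
other, and it is compatible with feature extraction methods such as Count Vector
and TF-IDF Vector.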

Figure 2: BERT Algorithm

In summary, there are two feature extraction methods, Count Vector and TF-IDF
Vector, that can be utilized with the LSTM algorithm. This algorithm is powerful and
can be applied to both classification and regression tasks. Additionally, the Naive
Bayes algorithm is a simple probabilistic classifier that can work with various
feature extraction methods, including Count Vector and TF-IDF Vector. This
algorithm applies Bayes' theorem while making strong independence assumptions.

3.2 Methodology:

Data Collection: Data collection is the first stage of this project. In this
instance, the information was gathered from a daily newspaper's headlines about
a specific business. In this step, historical stock prices and relevant news articles or
tweets are collected. The news articles or tweets are preprocessed by cleaning the
text, removing stop words, and stemming or lemmatizing the words. The historical
stock prices are adjusted for splits and dividends, and any missing values are
imputed.

Data Cleaning: The collected data is then cleaned to remove any unwanted
characters, punctuation marks, and stop words.[8] The data is then converted to
lowercase, and any other pre-processing techniques are applied, such as stemming
or lemmatization.
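
The cleaning steps can be sketched as follows. This is a simplified illustration:
the stop-word list is a small hypothetical one, the function name is an
assumption, and stemming or lemmatization is left out because it would normally
come from an NLP library.

```python
import re

# A small illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "is", "are", "for"}


def clean_headline(text):
    """Lowercase the text, strip punctuation and digits, and drop stop words."""
    text = text.lower()
    # Replace everything except letters and whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


cleaned = clean_headline("Apple shares JUMP 5% on the launch of a new iPhone!")
print(cleaned)
```

Keeping the output as a list of tokens makes it easy to pass to a vectorizer or to
a stemming step afterwards.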

Vectorization: After preprocessing, vectorization techniques are used to transform
the data into numerical form. Two vectorization methods, CountVectorizer and
TfidfVectorizer, have been applied in this project. CountVectorizer counts the
occurrences of each word in the document, whereas TfidfVectorizer calculates the
term frequency-inverse document frequency of each word.

Model Evaluation: In this step, machine learning models, including LSTM, are
trained on the input features to predict the magnitude or depth of stock price
changes. Following training, the models are evaluated using appropriate metrics
such as accuracy, precision, recall, F1-score, and AUC-ROC, and the
best-performing model is chosen for further examination. Models can also be
backtested against historical data to assess their performance in a simulated
trading environment.
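
For reference, the classification metrics named above can be computed directly
from the true and predicted labels. The sketch below covers the binary case only;
the function name and the label vectors are made up for illustration and are not
results from this study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels: 1 = price went up, 0 = price went down.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision penalizes false "up" calls and recall penalizes missed "up" days; F1 is
their harmonic mean.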

Results and Interpretation: In this step, the results of the experiments are
presented and interpreted. The performance of the machine learning models is
compared to a baseline such as a random walk or a buy-and-hold strategy. The
contribution of each input feature to the prediction performance is also analyzed
using techniques such as feature importance or partial dependence plots. The
limitations and potential future directions of the approach are also discussed.

Hyperparameter Tuning: In this step, the hyperparameters of the machine
learning models are tuned using techniques such as grid search or random search.
Hyperparameters control the complexity of the models, which can have a large
impact on their performance. The tuning process is typically performed using a
separate validation set to prevent overfitting.

Sentiment Analysis: The selected model is then used to predict the sentiment of
each headline in the dataset. The sentiment of a headline is determined by the
probability of the headline belonging to a positive, negative, or neutral class. In this
step, sentiment scores are calculated for each news article or tweet using
techniques such as VADER or TextBlob. The sentiment scores can be positive,
negative, or neutral, or they can be on a continuous scale. The sentiment scores
are then combined with other features, such as the stock price changes, trading
volumes, and technical indicators, to create a set of input features for the machine
learning models.

Stock Price Prediction: Finally, the sentiment of the headlines is used to predict the
stock price of the company. This is done by analyzing the correlation between the
sentiment of the headlines and the stock price movement of the company.

3.3 Data Analysis:

Data Analysis and Exploration is an important step in any data science project
as it allows for a thorough understanding of the data and identification of any
potential issues or trends. In this project, the data analysis and exploration will
focus on the stock prices and the sentiment of news articles.

To comprehend the general trend and any fluctuations, the stock prices will
first be visualised using a variety of plots, including line plots, bar plots, and
histograms. The computation of summary statistics like mean, median, and
standard deviation will also help us better understand the distribution of stock
values.

The sentiment of the news stories will then be examined using a variety of
methods, including sentiment analysis, text mining, and word clouds. This will make
it possible to comprehend the general tone of the news stories and any possible
relationships to the stock prices.

Additionally, the correlation between the stock prices and the sentiment of
the news articles will be analyzed. This will be done by calculating the correlation
coefficient between the two variables and visualizing the results using scatter plots.
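
The correlation coefficient mentioned here is typically the Pearson coefficient. A
minimal sketch of its calculation is given below; the function name and the
sentiment and return values are invented purely for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily sentiment scores and same-day stock returns.
sentiment = [0.2, -0.1, 0.4, 0.0, -0.3]
returns = [0.010, -0.004, 0.015, 0.001, -0.012]
r = pearson(sentiment, returns)
print(round(r, 3))
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0
indicates little linear relationship; the coefficient does not by itself show that
sentiment causes the price movement.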

In this project, the data will be analyzed using different algorithms, namely
Random Forest with Count Vector features, LSTM with TF-IDF Vector features, and
BERT. It is therefore important to understand how each algorithm performs on the
data by comparing the accuracy, precision, recall and F1-score of each algorithm.

The preprocessed data will then be split into training and test sets. The
feature extraction methods will be applied to the training data to produce the
feature vectors that serve as the input for the algorithms.

The scikit-learn Python library will then be used to build the algorithm. The
training data will be used to train the algorithm, and the test data will be
used to make forecasts. Performance will be assessed using the algorithm's
accuracy, precision, recall, and F1-score. Similarly, the LSTM algorithm will also be
implemented and evaluated using the same metrics.

The Count Vector and TF-IDF Vector feature extraction methods will be used
in this project to build the Random Forest algorithm. This will make it possible
to compare how the algorithm performs with the two techniques and determine
which provides greater accuracy. Finally, the results of the Random Forest
algorithm with Count Vector and TF-IDF Vector will be compared with the results
of the LSTM algorithm to see which algorithm gives better accuracy.

In conclusion, this project's data analysis and exploration efforts will be
directed towards identifying the general pattern and swings in stock prices, the
tone of the news stories, and the relationship between the two variables.
Additionally, the effectiveness of the various methods will be assessed.

4. IMPLEMENTATION
First, a stock is selected for which the models are developed using the
sentiment analysis-based prediction and historical data-based predictive
analytics approaches. The selection of the stock is based on the News and
Tweets datasets available online for that stock.

4.1 Dataset:

1. For the sentiment analysis approach, external factors such as the news and
people's opinions from social media sites like Twitter/X are required. On top of
that, a historical stock price dataset is required for model training & validation.

2. For performing the historical-based analysis, just the stock historical price
dataset is sufficient.

3. The stock price dataset can be downloaded from the Yahoo Finance portal.
News and Tweets can be downloaded from sources such as Kaggle.

4.2 Time Series Analysis:

Time series analysis is a statistical tool suited to an important class of
longitudinal study designs. These designs typically involve single subjects or
research units that are measured repeatedly at regular intervals over a large
number of observations. Time series analysis can be viewed as the exemplar of a
longitudinal design. It can be used to evaluate the effects of either a planned or
an unplanned intervention, as well as to understand the underlying dynamic
process and the pattern of change over time. Modern statistical time series
methods and the associated research techniques have significantly advanced the
ability to examine longitudinal data collected on single subjects or units. Early
time series work, particularly in psychology, depended mainly on graphical
inspection to describe and interpret results. Graphical tools are still helpful and
continue to give additional insight into a time series process, but the ability to
apply rigorous mathematical methods to this kind of data has transformed
single-subject research.

4.3 BERT-Based Text Sentiment Classification

The research in this paper mainly uses text mining technology to capture the
social media remarks of public figures in related industries. After the tweets are
crawled, the word vector space is constructed and quantified, and the stock price
information for the corresponding time is extracted to form a visualized
comparison chart. In the next step, a network model is constructed between the
historical sensitive word vectors and the stock price curves, and the existing
sensitive words are assessed against the corresponding stock price fluctuations to
obtain the best model results. In order to extract the features accurately, the text
pre-training model BERT proposed by Google is used. The prediction results of the
optimal model, i.e., the proportional impact of sensitive and quasi-sensitive words
on stock price fluctuations, are then used to produce recommendation or warning
information for stock trading.

BERT is a pre-trained model open-sourced by Google in 2018. It is pre-trained
with a masked language model (MLM) objective that produces deep bi-directional
semantic representations. The paper that introduced it reported new
state-of-the-art (SOTA) results on several natural language processing tasks.

The network architecture of BERT is based on the Transformer structure. BERT
takes the encoder part of the Transformer. The bi-directional coding structure of
BERT is shown in Fig. 1. It consists of multiple Transformer Encoder blocks stacked
together. Because of its bidirectional encoding approach, the network is not
restricted to reading a sentence in one direction; it can use the context on both
sides of each word and build an in-depth bi-directional linguistic representation of
the text.

4.4 Dataset Construction

In order to ensure the validity and accuracy of the data, we acquire data from
websites with large traffic and recognition, such as Twitter, Sina Weibo, China
Economy Network and other well-known domestic and international websites. The
selected time interval is 2010.1.1 to 2021.1.1, and the captured public figures are
Elon Musk, Zuckerberg, Yu Chengdong, Wang Jianlin, Wang Shi, Pan Shiyi, Buffett,
and Shen Nanpeng. The K-Means algorithm is used to cluster the quantified
phrases, dividing the text into 3 categories: science and innovation, real estate,
and finance. Accordingly, we use a Python crawler to collect the price charts of the
major stocks in these three sectors during this time interval. The clustering result
is then used to classify the text by feature; the non-critical information in the text
is removed, the important information is extracted, and the sentiment of the text
is classified as positive (1), negative (-1), or neutral (0).

4.5 Model Training

BERT is pre-trained on two tasks. The idea of the Masked Language Model (MLM)
is that 15% of the word fragment elements (tokens) are selected for masking, and
the model learns word representations by predicting each selected token from its
context. A selected token is not always replaced by the mask symbol: in most cases
it is, but sometimes it is replaced by a random token or left unchanged, because
the mask symbol never appears in the downstream tasks. Because only 15% of
the tokens are predicted at a time, the model converges slowly. The second task,
Next Sentence Prediction, is designed to serve tasks such as question answering,
inference, and sentence relations. To generate its training data from the corpus,
in 50% of the cases two consecutive sentences are taken and labeled as 'IsNext';
in the other 50% of the cases, two arbitrary sentences are taken at random and
labeled as 'NotNext'.
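
The masking procedure can be sketched as below. The 80% / 10% / 10% split between
the mask symbol, a random token, and the unchanged token follows the original BERT
paper. Selecting each token independently with probability 0.15 is a
simplification, and the function name, toy vocabulary, and sample sentence are
assumptions for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels); labels are None where nothing is predicted."""
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # token not selected for prediction
        labels[i] = tok                   # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK              # 80%: replace with the mask symbol
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Hypothetical sentence and toy vocabulary.
sentence = "the new model boosted investor confidence today".split()
toy_vocab = ["stock", "price", "market", "sales"]
masked, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(42))
print(masked)
print(labels)
```

The loss is computed only at positions where the label is not None, which is why
each pass over the data trains on roughly 15% of the tokens.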

Finally, many downstream tasks can be handled by simply plugging task-specific
inputs and outputs into BERT and utilizing the Transformer's powerful attention
mechanism. In order to apply the BERT model to sentiment analysis tasks,
the model needs to be fine-tuned.

4.6 Sentiment Analysis Approach:

1. First, we clean the textual data and prepare the target variable. This ensures
that only the details relevant to model training are extracted. The steps are:
a. Tokenization.
b. Stop word removal.
c. Normalization.
d. Punctuation removal.
e. Lemmatization.
f. Apply any other pre-processing steps if required.

2. Split the pre-processed dataset based on the sentiment score which is already
extracted from the dataset, to understand whether the given text is on a positive
or negative note.

3. Prepare a text classification (sentiment analysis) model with the prepared
dataset by applying the LSTM-RNN algorithm to the TF-IDF representation of the
text.

4. Finally, create an LSTM model for stock price prediction that accepts the
sentiment analysis results from the model developed in the previous step and
combines them with the historical stock price data to predict the future stock
price.

5. This approach depends on the hypothesis "The value of the stock varies based
on the public opinion and any major news about the product". For example, if
Apple launches a new model that attracts a lot of users, there is a high chance
that Apple stock will go up.

4.7 Time Series Forecasting Approach:

1. Time series forecasting is the task of predicting future values based on
historical data. Here again, we take the LSTM model and split the dataset into
model training and validation sets.

2. The model doesn't use external data like news/tweets to look for correlations
with the price pattern. This approach is purely based on the historical data.
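
Preparing a price series for a sequence model such as an LSTM usually involves
cutting it into fixed-length windows and splitting them in time order. The sketch
below shows one common way to do this; the window length, split fraction, function
names, and prices are illustrative assumptions rather than the settings used in
this study.

```python
def make_windows(series, window):
    """Turn a series into (inputs, targets): each input holds `window`
    consecutive values and the target is the value that follows them."""
    inputs, targets = [], []
    for i in range(len(series) - window):
        inputs.append(series[i:i + window])
        targets.append(series[i + window])
    return inputs, targets


def chronological_split(inputs, targets, train_fraction=0.8):
    """Split without shuffling so the validation data is strictly later in time."""
    cut = int(len(inputs) * train_fraction)
    return inputs[:cut], targets[:cut], inputs[cut:], targets[cut:]


# Hypothetical closing prices.
prices = [100.0, 101.5, 102.0, 101.0, 103.5, 104.0,
          105.5, 104.5, 106.0, 107.0, 108.5, 109.0]
X, y = make_windows(prices, window=2)
X_train, y_train, X_val, y_val = chronological_split(X, y)

print(len(X_train), len(X_val))
print(X[0], y[0])
```

Keeping the split chronological avoids training on prices that come after the
validation period, which would make the validation score look better than it
should.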

4.8 Application Interface & Obtained Results:

Figure 3: Prediction Using the Time-Series Approach

Figure 4: Prediction using the Sentiment Analysis Approach

Results And Analysis:

Pre-training BERT is costly in terms of time and hardware, so we use
open-source pre-trained models and code and fine-tune them on our sample
data. Before the training process of the network, each day is taken as a unit,
and the outputs of all samples of the day are averaged as the sentiment index of
the day. The output of the network ranges from -1 to 1. The sentiment indices of
all days are collected and converted into a "Sentiment Index - Time" graph. The
maximum length of the text was 64. Finally, the model was trained using the mean
squared error as the loss function and Adam as the optimizer, with an input batch
size of 16 and a learning rate of 2×10⁻⁴; 300 epochs were run, with the network
saved every 10 epochs.
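
The daily sentiment index described above is an average of per-sample scores
grouped by day. A minimal sketch follows, assuming each sample is a (date, score)
pair with scores in the range -1 to 1; the function name and the sample values are
invented for illustration.

```python
from collections import defaultdict


def daily_sentiment_index(samples):
    """Average the sentiment scores of all samples that share a date.

    `samples` is an iterable of (date_string, score) pairs; the result maps
    each date to its mean score, ordered by date.
    """
    by_day = defaultdict(list)
    for day, score in samples:
        by_day[day].append(score)
    return {day: sum(scores) / len(scores)
            for day, scores in sorted(by_day.items())}


# Hypothetical model outputs for individual posts.
samples = [
    ("2020-05-01", 0.8),
    ("2020-05-01", -0.2),
    ("2020-05-04", -0.5),
    ("2020-05-01", 0.6),
    ("2020-05-04", -0.1),
]
index = daily_sentiment_index(samples)
for day, value in index.items():
    print(day, round(value, 2))
```

Plotting these values against their dates gives the "Sentiment Index - Time" graph
that is compared with the stock price trend.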

5. CONCLUSION
This paper proposed an approach for predicting stock price movements using
technical indicators based on the percentage change of prices, volume, and related
moving averages. It considers each stock as a distinct dataset and trains a
specialized classifier for each one. We compare the proposed procedure mainly
with state-of-the-art deep learning techniques.

The results demonstrate that a specialized model per stock, employing
simple and small feature sets, can outperform generalized state-of-the-art deep
learning models in both accuracy and MCC. When training a model, it is essential
to consider the specific characteristics of each stock: combining all shares and
training a single classifier may ignore certain specificities. We argue that the model
can effectively capture these particularities using specialized training. The results
suggest that a comprehensive understanding of the data, feature extraction, and
the use of specific models for each stock contribute significantly to improving
model performance. Even simpler models can achieve good performance.

In this paper, we numerically represent the social media statements of public
figures, use the BERT network model to classify the sentiment of these texts, and
analyze them in comparison with the stock price trend of the day. The results show
that public figures' social media statements have a significant impact on stock
market fluctuations, and positive and negative statements have a close relationship
with stock price fluctuations. Specifically, public figures' comments have a
greater impact on the stocks of the corresponding industries over a short period of
time, and the comments of more influential figures have a greater impact on the
stocks, but over a long period of time the cumulative return stabilizes.
Therefore, individual investors can make short-term investments based on the
information or advice of public figures and can obtain benefits to a certain extent.
However, for long-term investments, it is far from enough to rely solely on the
remarks of public figures; it is necessary to make comprehensive
considerations based on policy and regulations, the industry cycle, the
company's operations, and other factors.

6. FUTURE WORK
As future work, we propose to explore a hybrid approach that leverages an
ensemble of classifiers. Such an approach could combine specialized models for
individual stocks with generic models trained on all stock data. In the next phase
of the implementation, the model would be developed with different algorithms,
and its performance compared with that of the model developed so far. This is
required to ensure that the most suitable algorithm is used for stock prediction.
In addition, the sentiment analysis is currently based on a model trained with
limited data. In a future phase, the sentiment analysis model could be developed
as a hybrid that also makes use of other pre-trained models and takes their
collective result. The sentiment analysis layer is important for the stock prediction
because the data is plotted based on the result obtained from the sentiment
analysis model.

7. REFERENCES
o Shuchi He, Zhongyue Chen and Xiaoping Chen, "A Position-Sensitive Regression Network for Multi-
Oriented Scene Text Detection", 2021 IEEE 4th International Conference on Computer and
Communication Engineering Technology (CCET), 2021.

o D. Arora, A. Singh, V. Sharma, H. S. Bhaduria and R. B. Patel, "HgsDb: Haplogroups Database to
understand migration and molecular risk assessment", Bioinformation, vol. 11, no. 6, pp. 272, Jun.
2015.

o Zhenxuan Zhang, Yuanyuan Li, Sang Won Yoon and Daehan Won, "Chapter 6 Reflow Thermal Recipe
Segment Optimization Model Based on Artificial Neural Network Approach", Springer Science and
Business Media LLC, 2023.

o R. Singh and S. Avikal, "COVID-19: A decision-making approach for prioritization of preventive
activities", Int. J. Healthc. Manag., vol. 13, no. 3, pp. 257-262, 2020.

o G Wang and S Y Shin, "An improved text classification method for sentiment classification[J]", Journal
of information and communication convergence engineering, vol. 17, no. 1, pp. 41-48, 2019.

o S Wang, D Li, X Song et al., "A feature selection method based on improved fisher’s discriminant ratio
for text sentiment classification[J]", Expert Systems with Applications, vol. 38, no. 7, pp. 8696-8702,
2011.

o A Onan, S Korukoğlu and H Bulut, "A multiobjective weighted voting ensemble classifier based on
differential evolution algorithm for text sentiment classification[J]", Expert Systems with Applications,
vol. 62, pp. 1-16, 2016.

o A Abdi, S M Shamsuddin, S Hasan et al., "Deep learning-based sentiment classification of evaluative
text based on Multi-feature fusion[J]", Information Processing & Management, vol. 56, no. 4, pp. 1245-
1259, 2019.

o Y Li, J Wang, S Wang et al., "Local dense mixed region cutting+ global rebalancing: a method for
imbalanced text sentiment classification[J]", International journal of machine learning and
cybernetics, vol. 10, no. 7, pp. 1805-1820, 2019.

o W Li, P Liu, Q Zhang et al., "An improved approach for text sentiment classification based on a deep
neural network via a sentiment attention mechanism[J]", Future Internet, vol. 11, no. 4, pp. 96, 2019.

o Z Hameed and B Garcia-Zapirain, "Sentiment classification using a single-layered BiLSTM model[J]",
IEEE Access, vol. 8, pp. 73992-74001, 2020.

o A Sungheetha and R Sharma, "TransCapsule model for sentiment classification[J]", Journal of
Artificial Intelligence, vol. 2, no. 03, pp. 163-169, 2020.

o Y Du, X Zhao, M He et al., "A novel capsule-based hybrid neural network for sentiment
classification[J]", IEEE Access, vol. 7, pp. 39321-39328, 2019.

o W Li, F Qi, M Tang et al., "Bidirectional LSTM with self-attention mechanism and multi-channel
features for sentiment classification[J]", Neurocomputing, vol. 387, pp. 63-77, 2020.

o S. Chen and H. He, "Stock prediction using convolutional neural network", IOP Conference Series:
Materials Science and Engineering, vol. 435, no. 1, pp. 1-9, 2018.

o J. Bollen, H. Mao and X. Zeng, "Twitter mood predicts the stock market", Journal of Computational
Science, vol. 2, no. 1, pp. 1-8, 2011.

o N. Naik and B. R. Mohan, "Novel Stock Crisis Prediction Technique—A Study on Indian Stock Market",
IEEE Access, vol. 9, pp. 86230-86242, 2021.

o X. Ding et al., "Using Structured Events to Predict Stock Price Movement: An Empirical Investigation",
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP),
pp. 1415-1425, 2014.

o Xiaodong Li, Pangjing Wu and Wenpeng Wang, "Incorporating stock prices and news sentiments for
stock market prediction: A case of Hong Kong", Information Processing & Management, vol. 57, no.
5, 2020, [online] Available: https://doi.org/10.1016/j.ipm.2020.102212.

o O. Gupta, "Sentiment Analysis of News Headlines For Stock Trend Prediction", pp. 13, 2020.

o Image References: https://production-media.paperswithcode.com/methods/new_BERT_Overall.jpg

o Image References: https://miro.medium.com/v2/resize:fit:1400/1*7cMfenu76BZCzdKWCfBABA.png
