
Moral judgement with r/AITA: how could NLP methods be used to assess real-life scenarios?
Darius Dragnea

A dissertation submitted in partial fulfilment of the requirements for the degree of
Bachelor of Science of the University of Aberdeen.

Department of Computing Science

2023
Declaration

No portion of the work contained in this document has been submitted in support of an application
for a degree or qualification of this or any other university or other institution of learning. All
verbatim extracts have been distinguished by quotation marks, and all sources of information have
been specifically acknowledged.

Signed:

Date: 2023
Abstract

People often seek advice for their problems on internet forums because they believe strangers will judge their situations more objectively than relatives or friends. Thanks to the most recent developments in the field of Natural Language Processing (NLP), AI systems can now be used for moral judgement and may offer a lower level of bias than human judges. This paper explores different NLP methods used in text classification tasks, while attempting to provide a novel solution for developing such a tool based on posts collected from Reddit's r/AmITheAsshole (AITA) community. The first research question focuses on exploring and summarising patterns in the newly created data-set of posts scraped from the famous subreddit. Topic modelling and sentiment analysis have been performed, showing that the majority of the posts are labelled as positive and identifying the main topics discussed on the forum: relationships, family, and housing. The second research question involves assessing the performance of various machine learning methods on a text classification task. Four different subsets have been created to address the class imbalance in the data-set, and results show that the best models (Bi-LSTM and DistilBERT) achieve around 62% and 61% accuracy, respectively, with the latter also achieving more than 40% recall for each of the four classes: Not The Asshole (NTA), You're The Asshole (YTA), Everyone Sucks Here (ESH), and No Assholes Here (NAH). The third research question provides a solution for a web platform that allows users to write their stories and receive a verdict from the best-performing model trained previously, along with an explanation for the given verdict and a recommendation for the situation, both generated with the aid of ChatGPT (GPT-3.5). The platform has been further tested through a human experiment in which 20 participants were asked to fill in a questionnaire based on their experience. The results are moderately satisfactory: the model predicted the verdicts correctly for 50% of the stories provided by the participants, and in some cases the verdict and its explanation even managed to change participants' opinions on the scenario.
Acknowledgements

First of all, I would like to thank my family for giving me the opportunity and the funds to study abroad at such a prestigious university as the University of Aberdeen. This whole experience happened because they had the vision and the commitment to encourage me to study abroad. I have acquired important skills in computing science throughout these four years, and I would also like to thank all the professors who have guided us this whole time and who did their best to help us as much as possible, even during the pandemic. The professor who has influenced me the most is my supervisor, Dr. Bruno Yun, who introduced me to the world of Machine Learning and then accepted me to work on this project, which he proposed. He has been a role model for me and I wish him all the best in his future career as a lecturer and researcher.
Contents

1 Introduction
1.1 Motivation
1.2 Research questions
1.3 Related work
1.4 Paper overview

2 Background
2.1 Natural Language Processing (NLP) methods
2.1.1 NLP background
2.1.2 Sentiment analysis
2.1.3 Topic modelling
2.2 Traditional Machine Learning (ML) methods
2.2.1 Logistic regression
2.2.2 Naive Bayes
2.3 Deep Learning methods and Neural Networks (NNs)
2.3.1 An overview
2.3.2 Multi-layer Perceptron (MLP)
2.3.3 Long short-term memory networks (LSTMs)
2.4 Feature engineering techniques
2.4.1 Bag of words (BoW)
2.4.2 Term frequency–inverse document frequency (Tf-idf)
2.4.3 Word embeddings & Word2Vec
2.5 Transformers & Large Language Models (LLMs)
2.5.1 Bidirectional Encoder Representations from Transformers (BERT)
2.5.2 DistilBERT
2.5.3 GPT-3.5

3 Methodology
3.1 Data-set
3.2 RQ1 - sentiment analysis and topic modelling on AITA data-set
3.2.1 Data-cleaning
3.2.2 Sentiment analysis
3.2.3 Topic modelling
3.3 RQ2 - ML methods employed for text classification
3.3.1 Subsets
3.3.2 BoW & Tf-idf with traditional ML
3.3.3 Word2Vec with Neural Networks
3.3.4 DistilBERT
3.4 RQ3 - integrate best model and other features on web platform

4 Results & discussion
4.1 RQ1
4.2 RQ2
4.2.1 aita_2500
4.2.2 aita_NTA_1000
4.2.3 aita_3_balanced
4.2.4 aita_4_balanced
4.2.5 Overall remarks about the four experiments
4.2.6 Further analysis of best model
4.3 RQ3
4.3.1 Platform & experiment details
4.3.2 Participants' agreement with the results returned by the AI
4.3.3 Qualitative analysis

5 Conclusions
5.1 Achievements
5.2 Future work & limitations
5.2.1 Expand the data-set
5.2.2 Use explainable tools to understand the models better
5.2.3 Extend the 4-class classification
5.3 Self-evaluation

A Maintenance manual

B User manual
Chapter 1

Introduction

1.1 Motivation
In recent years, the modern tech world has witnessed significant progress in the field of artificial
intelligence, especially what is known as Natural Language Processing (NLP) (Kalyanathaya et al.,
2019). Its applications have become prevalent in many industries and have definitely improved
human-to-machine interactions, as well as customers’ experience. For instance, AI conversational
systems such as chatbots and virtual assistants are widely used nowadays by certain organisations
that deal with plenty of customers, with a view to boosting their business and services. This process is called automation and relies directly on how well automated systems can replace, replicate, or even surpass human capabilities.

But what if we were to expand these systems further by making them more involved in peo-
ple’s lives? What if automated systems became capable of aiding us with our day-to-day personal
problems? Recent advances in NLP have provided us with an answer to these questions: text
analytics (Dale, 2017). This topic covers a broad range of technologies such as text classifica-
tion, entity recognition, or sentiment analysis, that manipulate text written by humans and convert
it into a format understandable to AIs. One task of great current interest that can be performed with these tools is moral judgement, which consists of evaluating whether certain human behaviours are morally acceptable. Many researchers have struggled to produce satisfactory results in this respect, some even considering AI systems more suitable than humans for assessing real-life scenarios due to their lower level of bias (Mujumdar et al., 2020). In contrast, others who have tried developing such systems argue that teaching moral sense to a machine poses a seemingly impossible challenge (Jiang et al., 2021).
When attempting to design and build this kind of powerful AI system, large amounts of data are required to make the model as accurate as possible. Fortunately, social me-
dia platforms such as Reddit, Facebook, and Twitter have increased in popularity among people,
with many users revealing some of their personal stories. This trend has its roots in the biggest
privilege a user benefits from when surfing online - anonymity. Therefore, loads of threads, posts,
and comments have accumulated throughout the years and have indirectly provided ideal data-sets
that satisfy the data requirements mentioned above. They have eventually become an extremely
attractive database for data scientists, data analysts, and researchers (Chen et al., 2022). Perhaps
one of the most popular social media platforms among Internet users, Reddit offers an unlimited
range of data-sets through its communities and subreddits, which multiply daily and cover many aspects of real life (Proferes et al., 2021). A well-known Reddit community, active since 2013, is r/AmItheAsshole (AITA) 1, where people are invited to share their moral dilemmas with other members, who in turn give their opinions on the situations; a final verdict is generated once all votes are counted. All of the posts in this forum can be transformed, using the text analytics methods listed above, into input data that a machine can process. A model can then be trained on the verdicts associated with each post to classify new scenarios (posts).
Overall, the main motivation for this project comes from the potential that AI systems hold in the field of moral judgement, which could be harnessed in humans' favour. Despite being ethically and socially aware to an extent, we still tend to judge subjectively when assessing the real-life scenarios we run into every day. Endless debates can arise among people in any group concerning what is and is not ethical in a given situation, on any possible topic. No matter the location, time, or context, we are constantly required to wrestle with moral issues of our own or in our proximity, and we often seek external help when things degenerate. For example, a simple marriage misunderstanding can lead to serious future issues that are sometimes extremely complicated and almost impossible to resolve conveniently for both sides. At this point relatives or friends might interfere in the matter, and their advice can negatively influence the outcome. In some cases the two partners resort to psychological help or even court judgement, options which ensure impartiality and fairness; nevertheless, accessing these services can carry significant financial costs, in which case other solutions have to be considered. This is where social media platforms come in handy: in communities such as Reddit's r/AmITheAsshole, moral dilemmas of all kinds are discussed and debated every day, objectively and anonymously. Each user of the platform can therefore get thousands of replies to their issue, along with a final verdict. Instead of having to read that many individual opinions, why not use the recent innovations in the AI industry described above and create a smart platform for people who seek lower-biased advice? This is this paper's own "moral dilemma", from which three main research questions have been extracted (see next section).

1.2 Research questions


RQ1: How could NLP methods be used to extract and summarise patterns given the
AITA data-sets?
As stated online 2, statistics show that more and more people opt for the r/AITA subreddit, which became Reddit's most-viewed community in 2022. Its popularity has increased significantly due to the freedom of speech it provides - anyone can genuinely write about any of their problems with nothing holding them back. As a result, this community features various posts and comments depicting moral judgement in real-life scenarios, building the ideal environment in which preliminary data analysis techniques can be applied. This analysis represents an essential milestone that must be reached before proceeding with further research.
1 https://www.reddit.com/r/AmItheAsshole/
2 https://www.engadget.com/reddit-recap-stats-2022-130015151.html
The main goal of the first research question is to integrate some well-known NLP methods used in text analysis into the data mining process. AITA posts will be collected and merged into a large data-set, along with their respective verdicts. After pre-processing the collected posts, a thorough investigation will be performed to extract and summarise different patterns. Among the selected NLP tools, sentiment analysis and topic modelling stand out, owing to their potential significance later in the classification tasks. Combined, these two techniques can answer queries such as how people feel when writing stories about certain topics. The results obtained at the end of this milestone will be compared to those existing in the literature and extended further.

RQ2: Which classification algorithm performs best when it comes to predicting an AITA verdict?
The second research question revolves around the data-sets generated beforehand, whose quality
will be essential for the next steps of this analysis. Depending on the content of each post and the
verdict given by Reddit users, different Machine Learning models will be trained and then tested,
to perform predictions as accurately as possible. The main goal here is to extend the approaches
which have been chosen by other authors in their papers, as well as investigate other alternatives
that have not been explored in detail yet.
Before delving into building the classifiers themselves, additional changes might need to be applied to the main data-set. One of the most important issues to deal with is the distribution of the four classes across the data: if any of the four classes predominates over the others, the data will require balancing before being passed to the models. Moreover, machine learning techniques have to be chosen carefully, since not all of them are guaranteed to return satisfactory results. Several different ML approaches, ranging from traditional procedures to newly developed state-of-the-art transformers, will be assessed and compared using different metrics. Given that a random classifier would achieve 25% accuracy when classifying four classes, the models developed in this paper are expected to provide much better results, both overall and for each specific label.
The whole process described above, along with all the mentioned subsequent steps, has to
be treated rigorously. A high-quality ML model should be able to recognise and understand the
differences between the classes and their nuances, and at the same time, any ambiguities or incon-
sistencies observed in the final results must be taken into account. Otherwise, if the predictions are
incorrectly analysed and presented, the outcome of the experiment could be negatively affected.

RQ3: How could a web platform be created to help people with moral judgement?
The plethora of tools available for assessing the performance of ML models should be more than enough when working with existing test sets: after all, the classifiers will be trained, validated, and tested on posts from the same data-set, split accordingly. However, this paper also intends to transpose the AI workspace built during the previous two stages to the real world, by creating an interactive tool meant to be used by the general public. Therefore, the third research question proposed as part of this project specifically investigates how a successfully developed AI system could be integrated into a web platform. The main ideas surrounding this goal are to expand on the class returned by the system and to make the AI also
provide a clear reason for its "decision", followed by a recommendation for the user, i.e., what
they could do in that situation, no matter to what extent they are guilty.
These two sub-tasks require extremely powerful tools for their completion, such as Large Language Models (LLMs) like ChatGPT, BLOOM, or OPT. Additionally, they rely strongly on how well the underlying classification models perform, which is why they were initially set as optional goals for this research.
Apart from the components described above, users will also be able to access certain statistical graphs, figures, or tables that could help them better understand how the system works, based on the collected data: for instance, visualising the topics most prevalent in the r/AITA community for each class, the sentiments associated with each topic, or the distribution of topics over the decade. Those with ML expertise will be able to explore the results returned by the models through explainability tools. In this way, the verdict, explanation, or recommendation given by the system will not seem ambiguous.

1.3 Related work


There are several papers which tackle the issues discussed above, some of them even using or
exploring similar approaches and techniques.
Overall, the main inspiration for this paper comes from Jiang et al. (2021), whose Delphi experiment has proven extremely innovative in the field of AI ethical judgement based on deep neural networks. Achieving up to 92.8% correct judgements when tested on an immense 1.7M crowd-sourced data-set, Delphi managed to surpass many state-of-the-art models such as GPT-3 3, emphasising the need to teach AI systems from moral textbooks. Moreover, experiments on five different ethics tasks, conducted in parallel with two other powerful frameworks, again generated favourable results for Delphi. One of these tasks is commonsense morality (Frederick, 2009),
which denotes a system of moral rules that people use in everyday life to make judgments about
the character and actions of other people. Interestingly enough, posts from Reddit’s AmITheAss-
hole (AITA) forum feature among the scenarios provided as training data for this specific section,
commonsense morality being measured in this case with binary classification accuracy, which has
reached 81%. This important detail has fortified the motivation to pursue this research, which still required further scientific grounding. Limitations noted in that study are Delphi's lack of cultural awareness and language understanding, as well as the inconsistent predictions characteristic of any transparent moral reasoning model. Therefore, as the authors claim, machine moral reasoning and machine language understanding should be investigated concurrently, to their mutual benefit.
Another approach that had to be taken into account has been adopted by Mujumdar et al. in
their paper published in 2020. Their main goal was to explore and potentially deal with the bias
levels generated by Deep Learning models when used for tasks such as sentiment analysis or text
classification. Their target data-set has also been the r/AITA subreddit since it aligns perfectly with
their initial intention to investigate whether NLP could be applied in judging ethical dilemmas.
Their focus has been on harvesting the potential that the pre-trained transformer model BERT has
3 https://openai.com/api/
shown in moral reasoning tasks. Even though they achieved decent results overall, including a general accuracy of 73.55% (thanks to a successful parameter-optimisation strategy), their model was not capable of predicting many minority-class instances, as can be seen in the confusion matrix they provide. As they mention in their paper, class imbalance is one of the issues they did not try to fix, although they suggested possible solutions for it. Additionally, they did not perform a complete topic modelling analysis, providing only the most relevant terms for each document using the Term frequency-inverse document frequency (Tf-idf) technique. A principal goal of this paper is to extend the data analysis of Mujumdar et al. by using different topic modelling methods, as well as exploring in more depth ways to adjust the apparent class imbalance, with the aim of developing a more objective classifier.

The same state-of-the-art pre-trained models were analysed by Alhassan et al., evaluated on posts collected from the AITA community. The authors claim that the complex nature of the publicly available stories in this subreddit led them to investigate how NLP technologies could be used to interpret these posts. Their approach differed slightly from Mujumdar et al. (2020) in the subsets they used for training and testing in order to balance the data. Not surprisingly, better results were obtained on the two balanced sets, the best being on the set with longer posts, which reached 88% general accuracy. However, they acknowledged that their minority-class performance could be significantly improved if other data balancing techniques, such as SMOTE, were applied.

Another study that has brought contributions to this research field is that of Wang (2017), whose purpose was to investigate whether deep learning methods could be used to predict moralistic judgements. According to the author, neural network approaches have been rare in tackling this kind of problem, a fact strengthened by the results provided later in the paper. The features adopted here were word embeddings, created from the numerical vectors generated from the tokenised posts. These embeddings were then combined with support vector machines (SVMs) and simple deep neural networks. To deal with the imbalanced data, previously reported by Mujumdar et al. (2020) and Alhassan et al. (2022) as well, they opted for an under-sampling technique, which slightly improved the two models. Finally, they acknowledge the limitations of Reddit's API when scraping AITA posts, as well as the need for larger amounts of data to improve performance over the minority classes.

With regards to topic modelling, the paper that has conducted a detailed analysis of AITA posts is authored by Nguyen et al. (2022). It features around 100,000 moral dilemmas collected through the API. Their final findings identify a total of 47 high-quality topics covering most of the data space, detected using topic modelling techniques such as Latent Dirichlet Allocation (LDA) and probabilistic clustering. These methods will be explained later in this paper, with extra emphasis on posts involving different kinds of relationships, which Nguyen et al. specifically highlighted.
Furthermore, according to Sivek, who performed sentiment analysis and topic modelling on an AITA data-set, several correlations exist between the different topics and the sentiments expressed in the analysed posts. The author also mentions several statistical tests, such as Pearson or Spearman correlation, that can be used to identify these patterns.
Reasonability in social media is another focus that several papers, such as Haworth et al. (2021), have tried to analyse in depth through the AITA subreddit. Their approach was based on separating the collected posts into two classes - NTA (Not The Asshole) and YTA (You're The Asshole) - after which a re-sampling process was used to balance them, due to the dominance of the former. Their final models were tested on various features, with the best classifier achieving a general accuracy of 77%. They also made use of the comments written for each post, concluding that Reddit subscribers active on AITA maintain a cognitive bias toward YTA posts and that many AITA users seek validation regardless of its necessity. This behaviour is described in detail by Cunn in his online blog, where the author concludes that most articles whose headline is a question can be answered with the word 'no', i.e., in most of those posts the original poster (OP) will not be labelled as YTA.
Moral judgement on Reddit has also been analysed by Botzer et al. (2021), with the main focus on determining whether users' unlabelled comments assign a positive or negative moral valence to the OP. Their best-performing model was by far judge-Bert, with around 89% accuracy. Their main results also show that users prefer positively labelled posts to negatively labelled ones. At the same time, work by Efstathiadis et al. on a similar data-set achieved an accuracy of 86% (with the same BERT) when classifying comments. That paper, however, also attempted to classify posts, obtaining an accuracy of 62% for both test sets used in the process, as well as higher precision and F1 scores for the re-balanced test set. This confirms the need to address the imbalanced data-set, which causes the BERT model to be biased towards "sweethearts".

1.4 Paper overview


This chapter highlighted the research questions that will be discussed and analysed further in this paper, and reviewed the relevant work from the literature. Chapter 2 provides the background knowledge necessary to understand the experiments presented in the paper, i.e., the main Natural Language Processing and Machine Learning techniques used throughout the study. Chapter 3 makes use of the notions explained in Chapter 2 and details the procedure adopted for each research question. The results obtained with the tools described in the methodology are shown in Chapter 4, where they are also compared with those existing in the literature. Finally, Chapter 5 contains the most important closing remarks of the study, the steps required to potentially extend it in the future, and a self-evaluation by the author.
Chapter 2

Background

2.1 Natural Language Processing (NLP) methods


2.1.1 NLP background
Our world contains a vast range of natural language text in a plethora of languages and, as a consequence, it has become harder and harder for humans to fully comprehend it, both knowledge-wise and time-wise. A computer, however, can learn at a much faster rate and process far bigger chunks of data in a much shorter time. Moreover, these AI systems have become extremely useful in our day-to-day lives in various settings, ranging from a simple Google, Siri, or Alexa assistant that can help you surf the web more easily or order groceries for you, to powerful systems that contribute significantly to the growth of big companies.
The aforementioned applications, along with many more, are now possible due to the recent
innovations in the subfield Natural Language Processing (NLP), described by Chowdhary as a
collection of computational techniques able to perform automatic analysis and representation of
human languages. Since its appearance in 1950 (Nadkarni et al., 2011), it has increased in popular-
ity and prevalence over the years, being regarded as the intersection between artificial intelligence
and linguistics. Nowadays, NLP researchers are required to constantly broaden their knowledge, since most NLP methods borrow from fields beyond linguistics, such as mathematics, machine learning, and deep learning. Two methods that have been prevalent in the NLP world are sentiment analysis and topic modelling; they are explained separately in the following subsections.
2.1.2 Sentiment analysis
Sentiment analysis (Agarwal et al., 2015), also known as opinion mining, is an NLP approach that analyses people's opinions and feelings towards different services or products. Used across various fields, ranging from marketing and customer support to social media monitoring, this technique relies directly on automated text processing and has rapidly increased in usage across many industries.
A common strategy in sentiment analysis is to classify people's comments, reviews, or stories as positive, negative, or neutral (Habimana et al., 2020). Two main approaches have been proposed for this task (Anees et al., 2020): the lexicon-based approach, which applies tokenisation methods to the given chunks of textual data, and the machine learning (ML) approach, which classifies the sentiments with trained models. The former determines the polarity of the textual input with the aid of a large collection of words stored in a dictionary (dictionary-based), also called a lexicon. In this respect, Python offers a free open-source package called the Natural Language Toolkit (NLTK) 1, which provides the corpora and lexical resources necessary for lexicon-based sentiment analysis.
One of the most popular lexicon-based tools available in this library is the Valence Aware Dictionary and sEntiment Reasoner (VADER) (Hutto and Gilbert, 2014). It was developed specifically for social media contexts, featuring more sensitivity to sentiment expressions than traditional sentiment lexicons such as Linguistic Inquiry and Word Count (LIWC), whose benefits are still preserved in VADER. When applied to text, VADER calculates the final sentiment scores through the following simple algorithm:

Algorithm 1 VADER text classification


Require: text: the processed input, lexicon: VADER’s lexicon
Ensure: polarity_scores: dictionary containing polarity scores
1: Apply tokenisation on text
2: Sum all the valence scores of each word from the text existent in the lexicon
3: Calculate the compound_score of the whole input by normalising the sum to be a number
between −1 and 1
4: Determine the class text belongs to depending on compound_score:
5: if compound_score > 0.05 then
6: Classify the text as positive.
7: else if compound_score < −0.05 then
8: Classify the text as negative.
9: else
10: Classify the text as neutral.
11: end if
12: The final polarity_scores dictionary will contain the compound_score, along with three other
variables pos, neg and neu, which depict the probability text classifies as positive, negative or
neutral
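
To make the algorithm concrete, the following is a minimal sketch of lexicon-based scoring with NLTK's SentimentIntensityAnalyzer; the example post is a made-up stand-in for a real AITA submission, and the 0.05 thresholds mirror steps 5-11 above.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# The VADER lexicon has to be downloaded once before first use.
nltk.download("vader_lexicon", quiet=True)

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("AITA for refusing to lend my brother money?")

# "compound" is the normalised score in [-1, 1]; pos/neg/neu are the class proportions.
if scores["compound"] > 0.05:
    label = "positive"
elif scores["compound"] < -0.05:
    label = "negative"
else:
    label = "neutral"
print(label, scores)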

2.1.3 Topic modelling


Another important NLP task is text summarisation, ubiquitous nowadays due to the constant need to summarise huge text corpora that would take incredibly long to process manually. Immeasurable collections of documents can now be explained and reduced to a readable space with the aid of what is known as (probabilistic) topic modelling (Chauhan and Shah, 2022). This contributes enormously to the field of information retrieval, which faces high demand nowadays due to the sharp increase in sources of information on the internet.
Briefly, topic modelling describes an unsupervised technique2 that processes a set of docu-
ments by extracting words and grouping them into different clusters based on different identified
word patterns, such as word frequency or distance between words. In essence, a topic model will
"return" three different "outputs": the topics that appear in the given documents (pieces of text),
the frequency of words in each topic and the proportion of topics in each document. Therefore,
1 https://www.nltk.org/
2 It does not require a predefined list of tags or training data previously classified by humans. https://monkeylearn.com/blog/introduction-to-topic-modeling/
it can be concluded that the three main actors that take part in this process are the following:
documents, topics and words.
Latent Dirichlet Allocation (LDA) is a topic modelling technique that directly deals with topics that are "invisible" (the English word "latent" essentially refers to something that exists but is not yet developed). The main principle underpinning LDA is the relationship between the three elements mentioned above: documents are considered a mixture of topics, which are themselves regarded as a mixture of words. In statistical terms, the topics are generated from the probability distribution of the documents, whereas the words are drawn from the probability distribution of the topics. The probability distribution used by LDA comes from a family of continuous multivariate probability distributions called the Dirichlet distribution3, parameterised by a vector of positive reals α (a draw from this family is denoted Dir(α)).
Before delving into the algorithm itself, several notations have to be explained in more detail. First, two hyper-parameters α and β are defined as the parameters of the prior probability distributions 4 of the topics over documents and of the words over topics, respectively. The two Dirichlet distributions Dir(α) and Dir(β) are then computed, modelling the relationship between documents and topics and the relationship between topics and words, respectively. Having established this, the LDA generative algorithm can be summarised through the following piece of pseudo-code:

Algorithm 2 LDA generation of topics

Require: D: set of all documents; T: pre-defined number of topics; α and β: the two hyper-parameters
Ensure: β_T: topic-word distributions for all topics {1, . . . , T}; α_D: document-topic distributions for all documents in D; T_D: the set of topics found in all documents from D
1: for all topic indices t ∈ {1, . . . , T} do
2: Create the probability distribution over words for topic t: β_t ∼ Dir(β)
3: end for
4: for all documents d ∈ D do
5: Sample the proportions of topics for document d: α_d ∼ Dir(α)
6: Select a document length from the distribution of document lengths: L_d ∼ Dist_D
7: for all word indices w ∈ {1, . . . , L_d} do
8: Sample a topic and assign it to the word: T_{d,w} ∼ Mult(α_d) (multinomial distribution 5)
9: Sample the word from that topic's distribution over words: W_{d,w} ∼ Mult(β_{T_{d,w}})
10: end for
11: end for

While LDA is a powerful and innovative tool for performing topic modelling, appropriate metrics are required to evaluate the quality of the discovered topics. Topic coherence (Newman et al., 2010) can be evaluated by rating a set of words for how coherent and interpretable they are from a human perspective. Using coherence, topic quality can therefore be estimated through a single quantitative value that goes by the name of coherence score, which follows the principle that words with similar meanings tend to occur in similar contexts (Syed and Spruit, 2017). This evaluation method comes in extremely handy when deciding on the initial number of desired topics T. If the value chosen for T is too small, the resulting topics might be too general, with few distinctions between them, whereas too large a value leads to overly specific topics that are hard to interpret.
3 https://en.wikipedia.org/wiki/Dirichlet_distribution
4 https://en.wikipedia.org/wiki/Prior_probability
The main strategy adopted later in this paper relies on the coherence measure called CV, which has reportedly achieved the highest correlation with human ranking data, according to Röder et al. (2015). This coherence measure is computed through four main steps: segmentation of the documents into word pairs, probability estimation for each word and word pair, computation of a confirmation measure that defines how strongly segments support each other, and aggregation of all confirmation measures by taking their mean. The four parts are explained more formally below:

1. Segmentation: Let N denote the number of top words specific to each discovered topic t ∈ {1, . . . , T}, and let W_N^{t,d} denote the set of t's top-N words, assuming that W_N^{t,d} represents the segmentation of an arbitrary document d. A set S_t is constructed such that its subsets contain word pairs of the form (W_i, W_j), where W_i, W_j ∈ W_N^{t,d}. Hence, each subset S ∈ S_t describes a mapping between some of the top-N words from W_N^{t,d} and the others.

2. Probability estimation: For each word w_i and word pair (w_i, w_j), the probability P(w_i) and the joint probability P(w_i, w_j), respectively, are calculated by dividing the number of virtual documents in which the word (or both words) occurs (|D_{w_i}| and |D_{w_i} ∩ D_{w_j}|) by the total number of virtual documents |D′|. The set of virtual documents D′ is computed using a Boolean sliding window method: for a document d and a window of size s, D′_d represents a set of virtual documents d′_1, d′_2, ..., d′_{N−s+1}, where a virtual document d′_i = {w_i, w_{i+1}, ..., w_{i+s−1}}.

3. Confirmation measure: For each subset (W_i, W_j) ∈ S_t, let θ_ij be defined as the confirmation measure of how strongly W_i supports W_j, based on the similarity between the words in W_i and W_j and all the words from W_N^{t,d}. To calculate θ_ij, W_i and W_j are transcribed as context vectors v⃗_i and v⃗_j, such that all words in W_i and W_j are paired with words in W_N^{t,d} through what is called normalised pointwise mutual information (NPMI)6, which calculates the agreement between two individual words w_i and w_j as follows:

NPMI(w_i, w_j)^γ = ( log( (P(w_i, w_j) + ε) / (P(w_i) · P(w_j)) ) / ( −log P(w_i, w_j) ) )^γ

where ε is a constant (close to zero) added to avoid taking the logarithm of zero, and γ is a value that places more weight on high NPMI values. Hence, a vector v⃗_i contains the sums of all NPMI values calculated between each word from W_i and all words from W_N^{t,d}:

v⃗_i = { Σ_{w ∈ W_i} NPMI(w, w′)^γ }_{w′ ∈ W_N^{t,d}}

The confirmation measure is then calculated by computing the cosine similarity of the context vectors v⃗_i and v⃗_j:

θ(v⃗_i, v⃗_j) = ( Σ_{l=1}^{N} v_{i,l} · v_{j,l} ) / ( ‖v⃗_i‖_2 · ‖v⃗_j‖_2 )

6 https://en.wikipedia.org/wiki/Pointwise_mutual_information

4. Aggregation: the confirmation measures of all subsets S ∈ S_t are averaged, and this mean represents the final coherence score.
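
As an illustration of how the number of topics T can be chosen with the CV score, the sketch below trains candidate LDA models with Gensim and keeps the most coherent one. The tiny tokenised_posts list and the candidate values of T are hypothetical stand-ins for the cleaned AITA corpus used later.

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

tokenised_posts = [
    ["wedding", "sister", "family", "invite"],
    ["family", "wedding", "argument", "sister"],
    ["rent", "flatmate", "bills", "money"],
    ["money", "rent", "flatmate", "deposit"],
]

dictionary = Dictionary(tokenised_posts)
corpus = [dictionary.doc2bow(doc) for doc in tokenised_posts]

best_T, best_score = None, float("-inf")
for T in (2, 3, 4):  # candidate numbers of topics
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=T, random_state=0)
    score = CoherenceModel(model=lda, texts=tokenised_posts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    if score > best_score:
        best_T, best_score = T, score
print(best_T, best_score)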

2.2 Traditional Machine Learning (ML) methods


2.2.1 Logistic regression
Logistic regression (Geron, 2017) is a well-known regression algorithm which can also be used for classification purposes. An input is mapped to a class by estimating the probability that the instance belongs to a certain class, computed through the logistic function. The logistic function returns a value between 0 and 1 and is applied to the vectorised form of the computed predictions:

σ(t) = 1 / (1 + e^(−t))

where t represents the vector containing the predictions. Since it returns only two possible outcomes, this procedure suits binary classification tasks, but it can be extended to support multiple classes by using what is known as Multinomial Logistic Regression (MLR). This method has been successfully used for different NLP tasks, such as sentiment analysis on Twitter posts (Ramadhan et al., 2017) or text classification on different data-sets (Kamath et al., 2018), where it has reportedly performed better than other traditional machine learning algorithms such as LinearSVC, Random Forest, or Multinomial Naive Bayes. Scikit-Learn's version of the multinomial algorithm7 generates the probability that the target variable y_i of an observation i belongs to each class k from the set of classes K, given feature variables X_i, as follows:

P(y_i = k | X_i) = e^(X_i · W_k + W_{0,k}) / Σ_{j=0}^{|K|−1} e^(X_i · W_j + W_{0,j})

where W is the matrix of weights whose rows are the vectors W_k corresponding to each class k.
7 https://scikit-learn.org/stable/modules/linear_model.html
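
A minimal sketch of this classifier with Scikit-Learn follows; the four posts and verdicts are made-up stand-ins for the AITA data, and with the default lbfgs solver LogisticRegression optimises the multinomial loss shown above when there are more than two classes.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = ["AITA for skipping my sister's wedding?",
         "AITA for keeping my flatmate's deposit?",
         "AITA for shouting back during the argument?",
         "AITA for asking my parents to split the bill?"]
verdicts = ["NTA", "YTA", "ESH", "NAH"]

# Bag-of-words counts (Section 2.4) feed the multinomial logistic regression.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(posts, verdicts)
print(clf.predict(["AITA for refusing to apologise?"]))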
2.2.2 Naive Bayes
Naive Bayes is a supervised learning algorithm based on Bayes' theorem, which calculates the posterior probability P(A|B) defined below:

P(A|B) = P(B|A) · P(A) / P(B)

where P(A|B) is the posterior probability of the target class A given predictor class B, P(A) is
the target prior probability, P(B) is the predictor prior probability and P(B|A) is the posterior
probability of predictor class B given target A. This algorithm is usually applied as part of a
Bayesian network which connects different nodes representing random variables through edges
labelled with the posterior probabilities of the random variables. The main task of the network is to compute the respective posterior probabilities and map a given target (or document, in the case of text classification) to the class reporting the highest probability.
One of the popular variants of the Naive Bayes algorithm is Multinomial Naive Bayes (NB), used for multinomially distributed data, which has been shown to perform better than other Naive Bayes models, such as Bernoulli, when it comes to text categorisation (classification) (Kibriya et al., 2005). Scikit-Learn employs a version of this algorithm8 which is widely used for information retrieval purposes. To predict the probability that feature variable X_i of observation i belongs to class y, the formula for relative frequency counting is employed:

P(X_i | y) = (N_{y,i} + 1) / (N_y + n)
where N_{y,i} represents the number of occurrences of feature i in the samples (documents) of class y, N_y is the total count of all features labelled with class y, 1 is the smoothing term used to prevent zero probabilities, and n denotes the number of features in the data space. It can be observed that this algorithm is easily adapted to text classification tasks, where each observation represents a document and the probability that a document belongs to a class is computed from the count of each word in the document, as detailed in Kibriya et al. (2005).
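
The sketch below shows this classifier in Scikit-Learn; alpha=1.0 gives the add-one smoothing of the formula above, and the two toy documents stand in for real AITA posts.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["AITA for eating my flatmate's food", "AITA for reporting my coworker"]
labels = ["YTA", "NTA"]

vec = CountVectorizer()
X = vec.fit_transform(docs)                  # word-count features per document
model = MultinomialNB(alpha=1.0).fit(X, labels)
print(model.predict(vec.transform(["AITA for eating the last slice"])))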

2.3 Deep Learning methods and Neural Networks (NNs)


2.3.1 An overview
The term neural networks (Mehlig, 2021) originates from the vast networks formed by the neurons a human brain contains, all of them interconnected to process any piece of data. The algorithms underpinning neural networks describe similar architectures and dynamics, based on similar principles but executed by a computer instead of a real brain. These networks are often called Artificial Neural Networks (ANNs) and learn by changing the connections between their artificial neurons. A typical training process for an NN relies on a training set containing large lists of input data and their corresponding labels or target values, through which the NN learns to classify similar data by adjusting the connection strengths between its neurons. A basic representation of an ANN's structure can be seen in Figure 2.1.

Figure 2.1: A simple ANN architecture

8 https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
Briefly, an ANN contains an Input layer holding all the input neurons required to train the model, a Hidden layer which processes and computes the variables further, and an Output layer representing the final result computed from the neurons of the previous steps. The Hidden layer is by far the most interesting part of an ANN, since it can comprise multiple hidden layers, making the ANN deeper in complexity, which is where the term Deep Learning comes from. The deep learning architectures of the more complex ANNs used in this research are explained next through concrete examples.
2.3.2 Multi-layer Perceptron (MLP)
The Multi-layer Perceptron (MLP) (Kamath et al., 2018) is a type of ANN whose architecture is based on two or more fully-connected layers followed by a final output layer, so it could represent the ANN drawn in Figure 2.1. Consider that the leftmost layer in Figure 2.1 represents a set of neurons x1, x2, x3 depicting the MLP's input features, with a set of corresponding weights w1, w2, w3, where each weight is a vector of length 4 whose elements are sent to the hidden layer of neurons a1, a2, a3, a4. Each of these neurons transforms its input in the following way:

1. Compute weighted linear summation: for each ai , the summation will be

si = x1 w1,i + x2 w2,i + x3 w3,i

2. Add the bias assigned to the hidden layer to each summation:

si = si + bi

3. Apply the activation function to the result: this function is usually one of:

• The logistic sigmoid:

σ(s_i) = 1 / (1 + e^(−s_i))

• The Rectified Linear Unit (ReLU):

relu(s_i) = max(s_i, 0)

The obtained value is assigned to each a_i and then sent to the output layer, completing the forward pass of the model. For classification purposes, in case there are more than two classes, an additional softmax layer is required, which creates a vector containing the probabilities that a sample belongs to each class, defined by the following function:

softmax(a_i) = e^(a_i) / Σ_{j=1}^{4} e^(a_j)

Scikit-Learn provides a solid implementation of an MLP classifier9, which will be employed later in this paper.
9 https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
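
The sketch below shows how such a classifier could be instantiated; the two hidden ReLU layers mirror Figure 2.1, while the toy random features and four classes merely stand in for real document vectors and AITA verdicts.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((20, 3))                 # 20 samples with 3 input features each
y = rng.integers(0, 4, 20)              # 4 classes, as in the AITA task

# Two hidden layers of 4 neurons; a softmax output is applied automatically
# for multi-class problems.
mlp = MLPClassifier(hidden_layer_sizes=(4, 4), activation="relu",
                    max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X[:2]))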

2.3.3 Long short-term memory networks (LSTMs)


Long short-term memory networks (LSTMs) belong to the family of Recurrent Neural Networks (RNNs) (Salehinejad et al., 2018), which are designed on the principle that neurons communicate with each other through feedback signals sent back and forth recurrently through the network's recurrent layers. RNNs belong to the class of supervised ML models, so they require training data of input-target pairs. A simple RNN has a basic structure featuring three main layers, but it differs from a simple ANN through its hidden layer, which is called recurrent. An example of this kind of RNN can be seen in Figure 2.2.

Figure 2.2: A simple RNN architecture for different time units

Input features are represented not by a single vector of units x_t = (x_1, x_2, ..., x_n), but by sequences of vectors, depending on the time t at which they are processed. They are fully connected to the neurons of the hidden layer h_1, h_2, ..., h_m through different weight matrices {..., W_i, W_{i+1}, ...} for each time unit {..., t, t+1, ...}. The hidden layer has recurrent connections between neurons from different time units, which pass the weight matrices {..., W_{h−1}, W_h, ...} from one time unit to the next. Each hidden unit vector h_t = (h_1, h_2, ..., h_m) corresponding to time t is computed through the activation function f_h used by the hidden layer as follows:

h_t = f_h(W_i · x_t + W_{h−1} · h_{t−1} + b_{h−1})

where b_{h−1} represents the bias with which unit h_t is initialised. The hidden neurons of each time unit are in turn connected to the output neurons through weight matrices {..., W_o, W_{o+1}, ...}, and the output units y_t = (y_1, y_2, ..., y_p) are computed using the activation function f_o of the output layer:

y_t = f_o(W_o · h_t + b_o)

where b_o represents the bias with which unit y_t is initialised. Therefore, at each time step the hidden units return a prediction for the output layer, calculated from the input unit parsed at the beginning.
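
The recurrence above can be made concrete in a few lines of NumPy; this is a single illustrative forward step with toy dimensions and tanh as f_h, not the configuration used later in this thesis.

import numpy as np

n, m = 5, 3                        # input size, hidden size
rng = np.random.default_rng(0)

x_t = rng.standard_normal(n)       # input vector at time t
h_prev = np.zeros(m)               # hidden state h_{t-1}

W_i = rng.standard_normal((m, n))  # input-to-hidden weights
W_h = rng.standard_normal((m, m))  # recurrent hidden-to-hidden weights
b_h = np.zeros(m)                  # hidden bias

h_t = np.tanh(W_i @ x_t + W_h @ h_prev + b_h)   # h_t = f_h(W_i x_t + W_{h-1} h_{t-1} + b)
print(h_t)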
In general, simple RNNs can store information in their "memory" by using the recurrent connections that operate in a loop, but they become extremely difficult to train when information must be stored and learned over long time spans, because the gradients of their loss function vanish through exponential decay. To solve these memory compression issues, gates have been introduced as part of the activation functions, defining a new category of RNNs: the LSTMs (Chung et al., 2014). The basic structure of an LSTM unit can be seen in Figure 2.3.

Figure 2.3: LSTM unit

An LSTM unit uses a set of gates consisting of an input gate I_t, a forget gate F_t, and an output gate O_t, which all serve to process the values kept in memory slots labelled c_t. The activation function used for the gates is the sigmoid function (labelled sig or σ), whereas the hyperbolic tangent (tanh) is the activation function for the hidden unit h_t. Each c_t represents a memory cell of the LSTM unit and gets updated with a new candidate cell C_t computed through an additional memory gate from previous time units. The easiest way to understand this whole process is to start from the output values h_t and c_t and work backwards:

• h_t = O_t ⊙ tanh(c_t)

– O_t = σ(W_O · h_{t−1} + U_O · x_t + b_O)

– c_t = F_t ⊙ c_{t−1} + I_t ⊙ C_t

* F_t = σ(W_F · h_{t−1} + U_F · x_t + b_F)

* C_t = tanh(W_C · h_{t−1} + U_C · x_t + b_C)

* I_t = σ(W_I · h_{t−1} + U_I · x_t + b_I)

where W_I, W_F, W_C, W_O are the weight matrices associated with each gate for the input x_t; U_I, U_F, U_C, U_O are the weight matrices associated with each gate for the hidden unit h_t; b_I, b_F, b_C, b_O are their respective bias vectors; and ⊙ denotes element-wise multiplication.
One of the most popular LSTM variants used for text classification is the Bidirectional LSTM (Bi-LSTM) (Cui et al., 2019), whose architecture employs two different LSTM hidden layers that operate backwards and forwards respectively, both connected to the input and output layers. The two hidden layers contain their respective hidden units h_t and h′_t, with the difference that the h_t are processed backwards, with t decreasing towards {t−1, t−2, ...}, whereas the h′_t are processed forwards, for the next time units t+1, t+2, .... The equations detailed above for a simple LSTM unit apply here following the same pattern. A Bi-LSTM architecture can be visualised in Figure 2.4:

Figure 2.4: Bi-LSTM architecture for different time units
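
A classifier of this shape can be sketched in a few lines of Keras (an assumed framework choice for illustration only; the models actually trained for this study are detailed in Chapter 3). Vocabulary size, layer widths, and the toy batch below are placeholder values.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=20000, output_dim=100),   # token ids -> word vectors
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),      # forward + backward LSTM layers
    tf.keras.layers.Dense(4, activation="softmax"),               # 4 AITA verdict classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Toy batch: 8 padded sequences of 50 token ids, each with one of 4 labels.
X = np.random.randint(0, 20000, size=(8, 50))
y = np.random.randint(0, 4, size=(8,))
model.fit(X, y, epochs=1, verbose=0)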

2.4 Feature engineering techniques


2.4.1 Bag of words (BoW)
Bag of Words (BoW) is an NLP technique used for feature extraction, representing the given documents in the form of numerical vectors. Considered one of the most efficient ways to represent textual data (V M and Kumar R, 2019), its simple core principle is to extract all the words/terms from all documents d in a set D and record each term's frequency per document in a matrix M of shape |D| × |W|, where W represents the set of all terms found across the documents. Depending on the task, a BoW model can be generalised to a bag of n-grams, where an n-gram is a sequence of tokens of size n. Each token represents a column in M, and each tokenised (cleaned) document describes a row in M.
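
The count matrix M can be built directly with Scikit-Learn's CountVectorizer, as in the sketch below; setting ngram_range extends plain BoW to a bag of n-grams, and the two sentences are toy documents.

from sklearn.feature_extraction.text import CountVectorizer

D = ["he ate the cake", "she baked the cake and ate it"]
vec = CountVectorizer(ngram_range=(1, 1))   # (1, 2) would also add bigrams
M = vec.fit_transform(D)                    # sparse matrix of shape |D| x |W|
print(vec.get_feature_names_out())
print(M.toarray())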

2.4.2 Term frequency–inverse document frequency (Tf-idf)


Term frequency–inverse document frequency (Tf-idf) is a popular NLP method used in information retrieval and machine learning, whose main idea is to determine how relevant a term or word is in a certain document with respect to the topic of that document. In a way, Tf-idf is a more advanced form of Bag of Words, which deals only with term frequency. According to Zhang et al., the words a document comprises can be separated into two categories: words with eliteness and words without eliteness, where eliteness is a measure of the importance of a term in a collection of documents. The classic formula for the Tf-idf weight w(t, d) attributed to a term t from a document d (belonging to the collection of documents D) is the following:
w(t, d) = tf(t, d) · log(|D| / df(t, D))
where tf(t, d) is the term frequency of t in document d and df(t, D) is the document frequency of t in the whole collection D. The log term in the formula is also written idf(t, D), the inverse document frequency of term t in the collection D. To handle the case where a term does not occur in the document collection at all, which would lead to a division-by-zero error in idf(t, D), some libraries such as Scikit-Learn10 use a different formula for the inverse document frequency, adding 1 to the numerator and denominator of idf(t, D), as if an extra document containing every term in the collection exactly once had been seen:

idf(t, D) = log((1 + |D|) / (1 + df(t, D))) + 1
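
Scikit-Learn's TfidfVectorizer applies this smoothed formula by default (smooth_idf=True) and additionally L2-normalises each document row; a minimal sketch on toy documents:

from sklearn.feature_extraction.text import TfidfVectorizer

D = ["he ate the cake", "she baked the cake and ate it"]
tfidf = TfidfVectorizer(smooth_idf=True)    # the default setting, shown explicitly
X = tfidf.fit_transform(D)                  # |D| x |W| matrix of tf-idf weights
print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(2))))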
2.4.3 Word embeddings & Word2Vec
Word embeddings are fixed-length, dense, distributed representations of words (Almeida and Xexéo, 2019) which have proven extremely useful in most NLP tasks that deal with processing textual data. The models producing them are frequently classified into prediction-based language models, strongly linked to neural language models, which predict the next word from its context, and count-based models, which have a matrix structure and rely on corpus statistics such as word frequencies and counts.
Word2Vec (Mikolov et al., 2013a) is a model with its roots in the family of prediction-based language models, which has become widely used in the NLP world due to its impressive performance and quality when computing vector representations of words from large data-sets. The quality of these vectors is measured in terms of syntactic and semantic word similarity, on which Word2Vec can provide state-of-the-art performance on certain data-sets, surpassing some well-performing techniques originating in neural networks, according to Mikolov et al.. Among the models employed by Word2Vec, a worth-mentioning family is that of log-linear models, which, as the name suggests, offer low computational complexity for learning distributed representations of words.
The two log-linear model architectures (Adewumi et al., 2022) are Continuous BoW and Con-
tinuous Skip-gram, and both of them are supported as open-source projects by Gensim 11. The
main difference between them is how they attempt to predict: the skip-gram algorithm takes a
centre word as an input and predicts the words before and after it within a given range known as the
window, while Continuous BoW does the opposite: from a sequence of words projected within the same
window, it tries to classify the target word in the middle. It has been shown in the literature that
Continuous Skip-gram provides a much faster training time than other neural network techniques,
since it does not require dense matrix multiplications (Mikolov et al., 2013b). It achieves better
accuracy for both semantic and syntactic tasks than its homologue when tested on a corpus of 783M
words, featuring a higher vector size (600), according to Table 5 from Mikolov et al. (2013a).
Additionally, İrsoy et al. have shown that the Gensim implementation of Continuous BoW lacks in
quality, and, consequently, the Word2Vec experiments conducted throughout this study have preferred
Continuous Skip-gram.
A continuous skip-gram model uses a classic neural network architecture, consisting of an
input, a hidden and an output layer, and its training objective is to discover word representations
(vectors) that help in detecting the other surrounding words in a document. Mathematically,
10 https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
11 https://radimrehurek.com/gensim/models/word2vec.html

let w1, w2, ..., wT define a training sequence of words of length T (Mikolov et al., 2013b). The idea
behind the skip-gram algorithm is to maximise the average log probability of detecting the context
words in a window of size c around each target word:

\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le i \le c,\; i \ne 0} \log P(w_{t+i} \mid w_t)

where each probability is calculated by using the softmax function.
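
Concretely, Mikolov et al. (2013b) define this probability through the softmax over the input and output vector representations v_w and v'_w of each word, with V denoting the vocabulary size:

P(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{V} \exp({v'_w}^{\top} v_{w_I})}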

2.5 Transformers & Large Language Models (LLMs)


2.5.1 Bidirectional Encoder Representations from Transformers (BERT)
Bidirectional Encoder Representations from Transformers (BERT) is a transformer model in-
troduced by Devlin et al. and designed to train deep bidirectional representations from unlabeled
text by jointly conditioning on both left and right context in all layers. Due to its remarkable
potential and outstanding performance, BERT has rapidly become the state-of-the-art model
used for many NLP tasks such as machine translation, question answering, language modeling
or language inference. The biggest advantage that researchers see in BERT is its relatively
straightforward operational process: fine-tuning it for a downstream task requires only one additional
output layer compared to other large language models, so it does not slow down the process significantly.
The architecture of a BERT model is a multi-layer bidirectional transformer encoder (Vaswani
et al., 2017) that stands out through its so-called attention mechanisms, which have been shown to
surpass the quality provided by recurrent or convolutional networks. A usual transformer model
follows the classic encoder-decoder architecture, which works in the following manner: the encoder
maps an input sequence of symbols x = (x1, x2, ..., xn) to a sequence of representations
z = (z1, z2, ..., zn) of dimension n, which is processed by the decoder and transformed into a
sequence of output symbols y = (y1, y2, ..., ym) of dimension m. Additionally, it features several
extra layers, such as the stacked self-attention and fully-connected layers, for both components.
Vaswani et al. have provided a suggestive diagram in their paper which illustrates the whole
process; it can be seen in Figure 2.5. There are 6 identical layers composing the encoder, and each
one of them contains 2 sublayers: one multi-head attention mechanism and one fully-connected
network, which are both normalised. All sublayers produce outputs of dimension 512, this size
being chosen to facilitate the residual connections between the sublayers. Likewise, the decoder
comprises 6 identical layers featuring the same multi-head attention and fully-connected sublayers,
with the difference that it also provides an additional masked multi-head attention sublayer, to
ensure that predictions made for a certain position depend only on the outputs returned from
lower positions. An attention layer is built from an attention function which maps a query and a
set of key-value pairs to an output vector, computed as a weighted sum of the values. A multi-head
attention layer instead projects the queries, keys and values linearly and processes them in parallel
through the same attention function, after which it concatenates and projects all the output values
again. A BERT model 12 performs these encoding steps bidirectionally and has been pre-trained
with two different objectives:
12 Provided at https://huggingface.co/bert-base-uncased

Figure 2.5: The architecture of Transformer encoder and decoder (taken from Vaswani et al.
(2017))

• Masked Language Modelling (MLM): given a sentence, the model masks 15% of the
words in the sentence, then runs the entire masked sentence through the model, attempting
to predict the masked words

• Next sentence prediction (NSP): during pre-training, the model concatenates two masked
sentences and attempts to predict whether the sentences followed each other in the original text or not

2.5.2 DistilBERT
DistilBERT(Sanh et al., 2020) represents a lighter version of BERT which has been pre-trained as
a smaller general purposes transformer model. By using what is known as knowledge distillation13 ,
Sanh et al. have achieved to shrink BERT’s size by 40% and keep the most essential language
understanding properties, while also performing 60% faster than its "parent" model. DistilBERT
can be found online on Hugging Face14 as well.
13 https://en.wikipedia.org/wiki/Knowledge_distillation
14 https://huggingface.co/distilbert-base-uncased

2.5.3 GPT-3.5
ChatGPT (Bang et al., 2023) represents a state-of-the-art multilingual tool that has revolu-
tionised the AI world in recent years, due to its remarkable performance in a plethora of tasks,
such as machine translation, sentiment analysis or question answering, all comprised by the field of
Natural Language Generation (NLG). GPT3.5-turbo is one of the pre-trained
GPT models that are available to use online and to integrate into personal websites for a
certain cost. The 3.5-turbo model, however, has proved to be extremely affordable, since it
charges only $0.002 per 1,000 tokens15. Therefore, it has been preferred for some tasks in
this study.
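
As a minimal sketch (assuming the openai Python package available in early 2023 and a hypothetical API key), a request to the gpt-3.5-turbo model can be issued as follows:

    import openai

    openai.api_key = "YOUR_API_KEY"     # hypothetical placeholder key

    # one chat-completion request, charged per processed tokens
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Give a verdict explanation for this AITA story: ..."}],
    )
    print(response["choices"][0]["message"]["content"])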

15 https://openai.com/blog/introducing-chatgpt-and-whisper-apis
Chapter 3

Methodology

3.1 Data-set
Reddit1 is one of the most popular social media platforms nowadays, featuring over 100k ac-
tive communities that count over 57 million daily active users writing over 13 billion posts and
comments. One of its most popular communities (also known as a subreddit) is AmITheAsshole
(AITA), which provides plenty of text data that can be used further for NLP experiments.
Each user is allowed to write as many posts as they want, on the condition that they follow some
general guidelines. Each post must contain a title starting with the word AITA (or WIBTA - Would
I Be The Asshole?), as well as a story describing a moral judgement situation to which other users
can react and vote with one of the following labels: NTA (Not The Asshole), YTA (You The
Asshole), ESH (Everyone Sucks Here) and NAH (No Assholes Here). The flowchart displayed
in Figure 3.1 illustrates the general process of how a verdict is generated for a post:

Figure 3.1: Flowchart representing a user story on the r/AITA forum

1 https://www.reddit.com/

In order to create a proper AITA data-set, the posts have to be extracted with the aid of several
online scraping tools:

• Pushshift's API2 for pulling the ids of the posts, along with the score associated with
each of them (which represents the difference between the number of upvotes and the number of
downvotes of each post)

• Reddit’s official API called Praw3 for extracting the post content and other relevant infor-
mation

O'Brien has already provided a fully scraped data-set with posts collected from 2013
up until 2020, part of which has been used in this paper as well. This data-set has been further
updated with new content from late 2022 until the beginning of 2023 by using the same scraping
methods described above. The final data-set contains the most important features presented in
Table 3.1 and can be found here4.

Feature     Description
id          The unique identifier of each post
timestamp   The time (written in encoded characters5) when the post has been published
title       Short description of the story that starts with AITA/WIBTA
body        The content of the story
verdict     The final verdict of the story, which can be one of NTA, YTA, ESH, or NAH
score       The number of upvotes minus the number of downvotes for the post

Table 3.1: Description of each feature in the created AITA dataset

To make sure that the quality of the data-set is preserved, only the posts with a score higher
than 3 have been considered (an explanation of this quality-control choice can be found here
6 ). In total, 108,557 posts have been retrieved, ranging from 24 February 2014 (timestamp
1393278651) until 1 January 2023 (timestamp 1672545102), featuring several time gaps be-
cause the Reddit API no longer serves data for the period between March 2020 and December 2021.

3.2 RQ1 - sentiment analysis and topic modelling on AITA data-set


The first research question discussed in this paper revolves around applying two of the NLP methods
presented in the previous section to the newly created data-set of posts collected from r/AITA: sen-
timent analysis and topic modelling. However, Reddit's API does not guarantee a completely
clean data retrieval, and many posts have a flawed textual structure. Therefore, before
delving into the natural language processing methodology, several data-cleaning steps need to be
followed and applied.
2 https://pushshift.io/api-parameters/
3 https://praw.readthedocs.io/en/latest/
4 The aita_dataset.csv file located at https://tinyurl.com/34fse4vv
6 https://openai.com/research/better-language-models

3.2.1 Data-cleaning
Python’s Regex7 library represents the ideal tool that can be used for text cleaning since it makes
use of classic regex expressions through which unnecessary characters can be easily removed. The
four main regex operations that have been applied to the data-set are the following (in order)

1. Remove links and hyperlinks

2. Remove digits

3. Remove extra-spaces

4. Remove punctuation

Operation 1 is required first because of the potential overlap with other operations such as 4,
which could modify a link so that Regex would no longer detect it. The need to
delete any links or hyperlinks comes from the fact that some posts might not contain the full story,
but only a link to the actual story instead. Since it would be almost impossible to get rid of all
the posts that are incomplete, the whole analysis has also included the title of each post, which
was concatenated to each post's body. Apart from the four operations explained above, all existent
posts are converted to lowercase, since there is no difference between a capitalised
and a lowercase word in terms of context. A minimal sketch of these cleaning steps is shown below.
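
The sketch below applies the four operations in the order listed above; the helper name clean_post is a hypothetical choice:

    import re

    def clean_post(text: str) -> str:
        text = re.sub(r"http\S+|www\.\S+", " ", text)   # 1. remove links and hyperlinks
        text = re.sub(r"\d+", " ", text)                # 2. remove digits
        text = re.sub(r"\s+", " ", text)                # 3. remove extra spaces
        text = re.sub(r"[^\w\s]", "", text)             # 4. remove punctuation
        return text.lower().strip()                     # lowercase everything

    print(clean_post("AITA for this? See https://example.com (posted 2023)"))
    # -> 'aita for this see posted'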
Other necessary operations used for pre-processing the data are lemmatisation and stop-word
removal. Lemmatisation8 is a technique used in computational linguistics for determining the
lemma (dictionary form) of a word based on its intended meaning. It is widely used in plenty of
NLP tasks such as information retrieval or sentiment analysis, because it has the potential to reduce
the dimension of the subspace used for analysis, as well as to improve the quality of the results
given by the respective NLP methods. It has been shown that lemmatisation can even
improve the general accuracy of sentiment analysis (Symeonidis et al., 2018), and it
features several approaches that can be adopted through Python's libraries. Spacy's Lemmatizer9
pipeline has been preferred for this study due to the numerous functionalities that it provides.
Here are several rules that illustrate how Spacy's Lemmatizer works in practice:

• Verbs: Any forms such as was, were, being or been will get converted to be

• Pronouns: Any possessive pronouns such as my, mine, your or yours will either be re-
placed with the tag '-PRON-' or will not be modified, since determining the lemma in this
context is more challenging

• Nouns: Any plural forms that modify the singular noun such as houses, dictionaries or mice
will be converted to their singular forms house, dictionary or mouse

The other popular pre-processing function, stop-word removal, consists of removing words
that do not carry any meaning and that would significantly slow down the whole data analy-
sis process, especially when it comes to sentiment analysis and topic modelling, where the context
7 https://docs.python.org/3/library/re.html
8 https://en.wikipedia.org/wiki/Lemmatisation
9 https://spacy.io/api/lemmatizer

of the words matters the most (Elbagir and Yang, 2019). NLTK provides an entire corpus of stop-
words (172 in total), which has been used to clean the data one more time. For instance, consid-
ering the following sentence: I have gone to the library today to finish my
dissertation, after removing the stopwords (words that belong to the downloaded NLTK
corpus), the new sentence will be I gone library today finish dissertation. It
can be observed that the second sentence conveys the same meaning even though it has been
significantly shortened. A minimal sketch combining both pre-processing operations is shown below.
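
A minimal sketch of both operations, assuming the en_core_web_sm Spacy model and a downloaded NLTK stopwords corpus:

    import spacy
    from nltk.corpus import stopwords          # requires nltk.download("stopwords")

    nlp = spacy.load("en_core_web_sm")
    stop_words = set(stopwords.words("english"))   # the 172-word NLTK corpus

    doc = nlp("I have gone to the library today to finish my dissertation")
    lemmas = [tok.lemma_ for tok in doc if tok.text.lower() not in stop_words]
    print(lemmas)   # e.g. ['go', 'library', 'today', 'finish', 'dissertation']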
3.2.2 Sentiment analysis
In order to perform sentiment analysis on the Reddit posts, the VADER algorithm 1 has to be
applied to each story. The lexicon required beforehand can be downloaded and imported from
NLTK's library, which also provides an extremely useful sentiment intensity analyzer 10. This tool
generates the dictionary containing the compound metric, through which the sentiment of each post
is inferred accordingly, as sketched below.
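
A minimal sketch of this step, with the +/-0.05 compound thresholds conventionally suggested in the VADER documentation:

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon")             # the lexicon required beforehand
    sia = SentimentIntensityAnalyzer()

    scores = sia.polarity_scores("AITA for refusing to share my lovely food?")
    compound = scores["compound"]
    label = ("positive" if compound >= 0.05
             else "negative" if compound <= -0.05 else "neutral")
    print(scores, label)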
3.2.3 Topic modelling
As far as topic modelling through LDA is concerned, Python's open-source library Gensim
11 represents the most convenient tool to use. It is extremely fast and easy to use, able to process
massive chunks of data and large corpora, while also providing pre-trained models, which is why
many companies prefer it as well. Topic modelling experiments have been conducted in the past
by using the same state-of-the-art tool (Jelodar et al. (2020), Gonda (2019), Sivek (2021)).
Several preparation steps have been required before commencing the topic discovery process.
At this point, all the posts that have been collected have been cleaned such that they could be
processed more efficiently by NLP tools, but they represent pieces of plain-text labeled as strings.
The following three steps have been applied to the big set of posts (the documents processed in
algorithm 2):

1. Convert all posts to lists of tokens: All posts have been tokenised by using Gensim’s
simple_preprocess tokeniser, which discards tokens of length less than 2 or bigger than 15.
The resulting output will represent a list of lists of tokens, defined as texts.

2. Create a dictionary from the lists of tokens: All lists of tokens existent in texts will
be mapped to a dictionary that will store each token (word) that exists in each list, in an
alphabetical order. The resulting output will represent a dictionary object of type gen-
sim.corpora.Dictionary called id2word, which maps an id to a word.

3. Create a bag-of-words list from each list of tokens and dictionary: Given texts and
id2word, the newly-created list called corpus will store pairs of tuples (token_id, token_count),
where token_id represents the rank of the token in id2word, and token_count
defines the number of occurrences of the token in texts.

An example applied on a single cleaned post is the following (a minimal code sketch follows the list):


10 https://www.nltk.org/howto/sentiment.html
11 https://radimrehurek.com/gensim/index.html

• Cleaned post: 'i write explanation come asshole pretty immature rude little possible come little'

• Tokens: ['write', 'explanation', 'come', 'asshole', 'pretty', 'immature', 'rude', 'little', 'possible', 'come', 'little'] (the token 'i' is discarded because its length is less than 2)

• Dictionary: {0:'asshole', 1:'come', 2:'explanation', 3:'immature', 4:'little', 5:'possible', 6:'pretty', 7:'rude', 8:'write'}

• Corpus: [(0, 1), (1, 2), (2, 1), (3, 1), (4, 2), (5, 1), (6, 1), (7, 1), (8, 1)]
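
A minimal sketch of the three preparation steps with Gensim (the single-post corpus is the example above):

    from gensim.corpora import Dictionary
    from gensim.utils import simple_preprocess

    posts = ["i write explanation come asshole pretty immature rude "
             "little possible come little"]

    texts = [simple_preprocess(p) for p in posts]     # 1. lists of tokens (length 2-15)
    id2word = Dictionary(texts)                       # 2. map each token to an id
    corpus = [id2word.doc2bow(t) for t in texts]      # 3. (token_id, token_count) pairs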

With texts, id2word and corpus computed, the next thing that has to be clarified is the
number of topics, which has to be agreed on beforehand. As explained in the Background section,
one way to determine the optimal number of topics is by computing coherence
scores for different numbers of topics. Comparing the different coherence scores can be done by
plotting them against the number of topics. The expected result is that the coherence
score increases as the number of topics increases, but this would lead to a higher specificity
rate among the topics. One idea that is frequently adopted is to apply the elbow method 12 on the
resulting graph of coherence values versus the number of topics, and pick the number after which the
graph's curve stops increasing at the same rate. However, it might not always be the case that the
increase in topic coherence is proportional to the increase in the number of topics. In this respect,
the number of topics has to be selected from a desired range manually chosen beforehand, such
that the coherence value is the highest for that range.
Gensim's LdaModel 13 initialises the LDA model and applies the LDA generation algorithm
2 following the appropriate steps, but for this experiment LdaMulticore14 has been preferred,
due to the fact that the latter adopts a multiprocessing, parallelised technique which makes the
model training much faster, while still applying the same algorithm. Additionally, another tech-
nique recommended by some studies and experiments (Yao et al. (2009), Gonda (2019), Chauhan
and Shah (2022)) is the MAchine Learning for LanguagE Toolkit (MALLET), a topic modelling
package implemented in Java, but also provided through Gensim. The main difference between
LdaMallet 15 and LdaMulticore is that the former uses a Gibbs sampling method (explained in more
detail in Chauhan and Shah (2022)), which provides a more qualitative topic inference than
the Variational Bayes-based sampling used by the standard LDA implementation. However, LdaMul-
ticore is more suitable in terms of memory performance, requiring O(1) space, whereas training
an LdaMallet model requires O(|id2word|) space, where id2word contains all the words existent in
the main corpus.
After experimenting with both of these approaches (a sketch of the coherence comparison is given
below), the optimal number of topics can be selected and used to create a new LDA model,
which will return the output mentioned in Algorithm 2. This output will serve further for the
extraction and summarisation of different patterns related to the inferred topics, as well as their
distribution among posts.
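
A minimal sketch of this comparison for LdaMulticore (the candidate topic numbers and training parameters are assumptions; texts, id2word and corpus come from the preparation steps above):

    from gensim.models import CoherenceModel, LdaMulticore

    for num_topics in [5, 10, 15, 20, 25]:
        lda = LdaMulticore(corpus=corpus, id2word=id2word, num_topics=num_topics,
                           passes=2, workers=4, random_state=42)
        cm = CoherenceModel(model=lda, texts=texts, dictionary=id2word,
                            coherence="c_v")
        print(num_topics, round(cm.get_coherence(), 3))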
12 https://www.baeldung.com/cs/topic-modeling-coherence-score
13 https://radimrehurek.com/gensim/models/ldamodel.html
14 https://radimrehurek.com/gensim/models/ldamulticore.html
15 https://radimrehurek.com/gensim_3.8.3/models/wrappers/ldamallet.html

3.3 RQ2 - ML methods employed for text classification


3.3.1 Subsets
The second research question underpinning this study explores different ML approaches used
for text classification, which are applied to the newly-created AITA data-set described at the
beginning of this section. Data-cleaning steps have already been performed and, therefore, the
data is almost ready for classification, but several pre-processing steps are still required.
First of all, it can be observed in Figure 3.2 that the majority of the posts (after performing
cleaning) contain a number of characters in the range [0, 2500], with only 1001 of them having
more than 2500 characters. The latter have therefore been left out of this study, a decision
motivated by the approach adopted by Alhassan et al., which consists of a similar separation
into smaller subsets based on the number of words, because of the computational limitations that some ML
methods feature (e.g. BERT models are only able to process 512 tokens at a time, and Word2Vec
models run into memory issues when big vector sizes are chosen).

Figure 3.2: Distribution plot of cleaned posts in terms of number of characters, per each class)

Due to the existent class imbalance, which is explained in Cunn (2019) and also noticed in this
research (Figure 3.3), the new data-set (which will be referred to as aita_2500) has been shrunk
further such that the dominant class, not the asshole, contains fewer posts,
thus making the distribution more balanced.

Figure 3.3: Distribution of verdicts along posts for aita_2500

Some papers in the literature have already discussed this issue to a certain extent and
have either reported poor results mainly affected by the class imbalance (Wang, 2017) or suggested
and provided ways to solve this matter (O'Brien, 2020; Alhassan et al., 2022). At this

Subset            Description              NTA    YTA    ESH    NAH
aita_NTA_1000     NTA < 1000 characters    47%    29%    8%     16%
aita_3_balanced   NTA < 600 characters,    32%    29%    13%    26%
                  YTA < 1000 characters
aita_4_balanced   NTA < 450 characters,    27%    26%    22%    25%
                  YTA < 600 characters,
                  ESH < 800 characters

Table 3.2: Subsets created after balancing classes accordingly (the last four columns give the class percentages)

point, aita_2500 has the following class distribution among verdicts: not the asshole (NTA)
- 62%, asshole (YTA) - 21%, everyone sucks (ESH) - 5% and no assholes here (NAH) - 12%.
Three other subsets have been created from aita_2500 and they are displayed in Table 3.2. The
technique here was to gradually reduce the size of the dominant classes by selecting only posts below
a certain length, while keeping the minority class (ESH) unchanged.
The main methods required for applying text classification on the 4 subsets have been divided
into three main sections, depicting three different approaches.
3.3.2 BoW & Tf-idf with traditional ML
The first approach illustrates the relevance of traditional ML methods (logistic regression and Multi-
nomial Naive Bayes (MNB)) in the field of text classification, with the aid of two NLP techniques
used for word representations: Bag-of-Words (BoW) and Term frequency-inverse document
frequency (Tf-idf). It has been shown in the literature (Kibriya et al., 2005) how important the con-
nection between MNB and Tf-idf is, in the sense that the former is computationally very efficient
and easy to implement, while its performance can be improved even more if Tf-idf
scores are used for recording word frequencies. Additionally, O'Brien has proved that logistic
regression can achieve good performance on binary classification (by merging
NAH posts into NTA and ESH posts into YTA), so it could manage to obtain similar results for
multi-class classification. Their logistic regression model has been used with 1-grams for storing
the term frequencies and, therefore, it is believed that using Tf-idf for a similar task could improve
the predictions even more.
The main methodology employed for this transformation is illustrated in Figure 3.4, where the
main components have been used through Sklearn's packages.
In order to understand how CountVectorizer and TfidfTransformer work together
and process the given corpus, consider a smaller corpus comprising 3 posts that had already been
cleaned (including lemmatisation and stop-word removal):

• ’write come asshole pretty immature rude ’

• ’come home write little possible pretty asshole small’

• ’little home party rude immature pretty cool come’

When passed into a CountVectorizer object, it will tokenise the sentences into words and
create the term-frequency matrix as shown in Figure 3.5. Afterwards, a TfidfTransformer

Figure 3.4: Pipeline representing how tf-idf is used with traditional ML

Figure 3.5: Term frequency matrix for 3 cleaned posts

will process the term frequency matrix and transform it into a tf-idf representation by applying the
inverse document frequency formula seen in the Background section. The final tf-idf representa-
tion is depicted in Figure 3.6.

Figure 3.6: Tf-idf representation for 3 cleaned posts

After applying this procedure on all existent posts (which obviously contain far more tokens than
the ones from the example) from all 4 main subsets, the tf-idf matrix containing the training data
has been fit, along with its corresponding verdicts, into a multinomial logistic classifier and a
multinomial naive bayes classifier, as sketched below.
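
A minimal sketch of this first approach, mirroring the pipeline in Figure 3.4 (the names of the train/test variables are assumptions):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    for clf in (LogisticRegression(max_iter=1000, multi_class="multinomial"),
                MultinomialNB()):
        model = Pipeline([("bow", CountVectorizer()),       # term-frequency matrix
                          ("tfidf", TfidfTransformer()),    # tf-idf representation
                          ("clf", clf)])
        model.fit(train_posts, train_verdicts)   # cleaned posts + NTA/YTA/ESH/NAH labels
        print(clf.__class__.__name__, model.score(test_posts, test_verdicts))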
Another interesting experiment which has been considered is using a sampling method on
the main data-set in order to balance the existent classes equally. In this respect, O'Brien has
used the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al., 2002), an ap-
proach in which the minority classes are over-sampled by creating "synthetic" data rather than
by over-sampling with replacement, as seen in previous attempts to deal with imbalanced data.
This method is usually applied on the tf-idf matrix, as seen in Figure 3.4.
An illustrative example in this sense can be shown for 100 random entries sampled from
aita_2500 with the same distribution. After converting the extracted posts to word repre-
sentations by using tf-idf, the latter can be re-sampled along with their corresponding verdicts
by using the Sklearn-compatible imblearn library16, which provides an over-sampling implementation
of SMOTE. The
16 https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html

effect of SMOTE, as well as the way in which it generates synthetic data, has been captured in the
two plots provided in Figure 3.7:

(a) Data before SMOTE (b) Data after SMOTE

Figure 3.7: SMOTE applied on a sample of 100 numerical vectors representing posts from
aita_2500, where the verdicts are encoded as it follows: NTA:0, YTA: 1, ESH: 2 and NAH: 3

It can be observed that the majority class (representing NTA) has not been modified at all,
while the other three have gained more samples, which align along the same range of values.
Eventually, the newly generated synthetic data balances the existent data-set, in this case bringing
all 4 classes to a 25% ratio each. A minimal sketch of this re-sampling step is shown below.
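
A minimal sketch of the re-sampling, where X_tfidf and y are assumed to hold the vectorised posts and their encoded verdicts:

    from imblearn.over_sampling import SMOTE

    # over-sample the three minority classes with synthetic neighbours
    X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_tfidf, y)
    # all four classes now appear with equal frequency (25% each)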
While SMOTE is a very popular method for handling imbalanced data, featuring impressive
results in some papers (Umer et al., 2021) when used with traditional machine learning methods
and tf-idf to perform text categorisation tasks, it has its own downsides. Since it uses a nearest-
neighbour algorithm to generate new samples, these might accidentally become too similar to the
existent ones, which might cause overfitting problems when training the models later on. In
addition, SMOTE significantly increases the data space, which could cause memory issues when
training bigger word representation structures such as Word2Vec or BERT encodings. As a result,
this method has not been considered for the two other approaches comprising the second
research question of this paper.
3.3.3 Word2Vec with Neural Networks
The second approach utilises the vector representations known as Word2Vec, an extremely
popular technique used in the NLP world to transform documents into word embeddings,
based on the similarity between the different words contained in each document. Gensim's library has
been preferred for this task since it provides an implementation of the Continuous Skip-gram ar-
chitecture of a Word2Vec model, which has been shown to outperform its homologue, Continuous
BoW, as described in the Background section. Despite all the existent pre-trained models distributed by
Gensim17, it has been chosen as part of this study to train the word vectors from scratch, based
on the posts given as training data.
This approach has been used before by Wang in their paper, where a similar experiment has
been performed, but for a binary classification task. They have also shown how the words generated
through a pre-trained model might not cover all the words existent in posts collected from r/AITA.
17 https://github.com/RaRe-Technologies/gensim-data

However, they have not accounted for how they chose the optimal vector size for the word
embeddings to be 30, a fact that has determined this study to investigate this aspect further. In
addition, the SVM classifier used for the classification part in Wang (2017) has been omitted in
this paper, due to the severe computational issues SVM can cause when used with Word2Vec, since
its execution time is very slow if Sklearn's version18 is used.
As far as the vector size chosen for this task is concerned, 2 dimensions have been
considered initially and compared: 30 (taken from Wang (2017)) and 300, the latter representing
the default size of most pre-trained vectors. After a short experiment on a sample from the big data-
set, which included an MLP classifier with a similar architecture to the one in Wang (2017), 300
has been determined as the optimal vector size: the MLP classifier trained with this vector and
hidden-layer size has managed to surpass, in terms of accuracy, the performance of the model with
vector and hidden-layer size 30. Therefore, the Word2Vec model is ready to be trained,
with vector size 300, window size 5 and a skip-gram architecture, as sketched below.
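
A minimal training sketch (assuming Gensim 4.x and a tokenised_posts list of lists of tokens):

    from gensim.models import Word2Vec

    w2v = Word2Vec(sentences=tokenised_posts,
                   vector_size=300,   # size of each word embedding
                   window=5,          # context window around the centre word
                   sg=1,              # 1 selects Skip-gram (0 would be CBOW)
                   min_count=1, workers=4)
    print(w2v.wv.most_similar("asshole", topn=5))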
The main methodology used for this Word2Vec approach is illustrated in Figure 3.8, where
the Bidirectional LSTM model has been initialised from Keras, while the Multi-layer Percepetron
classifier has been taken from Sklearn.

Figure 3.8: Pipeline representing how word2vec is used with deep learning methods

In order to understand the whole Word2Vec process, a simpler model has been created,
considering the following three cleaned sentences:

• ’write come asshole pretty immature rude’

• ’come home write little possible pretty’

• ’little home party rude immature pretty’

First of all, Word2Vec will create a dictionary including all words existent in the corpus
and their corresponding index: {'pretty':0, 'little':1, 'home':2, 'rude':3,
'immature':4, 'come':5, 'write':6, 'party':7, 'possible':8, 'asshole':9}.
The index is important to understand because, for each given word, its input representation is
passed into the model as a one-hot vector. For instance, for the word write, its corresponding
vector will have the following structure: [0, 0, 0, 0, 0, 0, 1, 0, 0, 0].
18 https://scikit-learn.org/stable/modules/svm.html
Suppose that this Word2Vec model has been initialised with a window of size 2, with a
skip-gram architecture. When first sentence is processed, the following steps will be performed:

1. ’write come asshole pretty immature rude’


Model will process the pairs found around come, at most 2 positions left and right.
Processed pairs: (come, write), (come, asshole), (come, pretty)

2. ’write come asshole pretty immature rude ’


Model will process the pairs found around asshole, at most 2 positions left and right.
Processed pairs: (asshole, write), (asshole, come), (asshole, pretty),
(asshole, immature)

3. ’write come asshole pretty immature rude ’


Model will process the pairs found around pretty, at most 2 positions left and right.
Processed pairs: (pretty, come), (pretty, asshole), (pretty, immature),
(pretty, rude)

4. ’write come asshole pretty immature rude ’


Model will process the pairs found around immature, at most 2 positions left and right.
Processed pairs: (immature, asshole), (immature, pretty), (immature,
rude)

It can be noticed that this procedure starts with the second and ends with the second-to-last
word, since the window size is 2 and the window is applied to both the left and right sides of the
word. The corresponding probabilities defined in the Background section will be computed for
these pairs, thus determining the similarity between different words.
When this algorithm is applied to a much larger corpus such as the one used for this research,
the Skip-gram model will have a structure similar to the one displayed in Figure 3.9.
Number V represents the size of the vocabulary that is constructed initially and that contains
all the words and their corresponding positions. The Input layer, represented by a one-hot vector
for a word wm, is projected through a V × E weight matrix into the Hidden layer, whose weights
are learned during training, where E denotes the size of the word embedding decided when training
the Word2Vec model. This layer is then fully connected to the Output layer, also called Softmax,
which calculates each probability pi that wi appears in the context of wm. In other words, it
generates all the probabilities that depict how similar two words are to each other.
Figure 3.9: Word2Vec main structure for Skip-gram architecture

The Multi-layer Perceptron (MLP) method used in Wang (2017) has been adopted in this
paper, with a different vector size, as well as a different hidden layer size for the MLP classifier
(300). Figure 3.10 shows how the entire process is conducted. The Word2Vec component
initially contains the numerical vectors corresponding to each post. Since each post has a different
length, the word vectors of each post are averaged into a single vector of length 300 before being
sent to the Input layer. The latter is fully-connected to three Hidden layers which are also
fully-connected to each other. The activation function for the Hidden layers is tanh, while
the activation function applied to the Output layer is softmax.

Figure 3.10: MLP model architecture

Likewise, the architecture of the Bidirectional LSTM (Bi-LSTM) used for the other approach can
be seen in Figure 3.11. This method has not been used before with the same data-set, but has been
adopted for different data-sets created from posts collected from other Reddit communities (Ting, 2015).

Figure 3.11: BiLSTM Keras model architecture

All the existent layers (Input, Embedding, Bi-LSTM, Dense and Output) have been initialised
by using Keras. The Sequences component represents a pre-processing method through which
all tokenised posts are transformed into numerical sequences through Keras's tokeniser19. These
sequences all have length 300, just like the Word2Vec vector size, so each word from the tokenised
post is mapped to an encoding. The layer connected to Input is an Embedding20, initialised
with the weights computed from Word2Vec. The embedding matrix is calculated based on the
weight matrix, which contains V (size of the vocabulary) word vectors, and on the values sent from
each input sequence, where each value corresponds to a word. If a word w with encoding
e from a sequence xi exists in the vocabulary, its corresponding vector from the weight matrix is
stored together with e in the embedding matrix. This process is repeated for each of the 300 words
of the input sequence, hence building an Embedding matrix of size 300 × 300.
The embedding matrix is itself connected to a Bidirectional LSTM layer which contains
a Bi-LSTM unit. All 300 vectors are processed in both directions, so 600 values are
connected further to a Dense (fully-connected) layer of 64 neurons. This extra hidden layer
(activated with relu) helps in calculating the values of the Output layer. By applying softmax again,
the final probabilities are obtained. A minimal sketch of this architecture is given below.
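
A minimal Keras sketch of this architecture, assuming V and an embedding_matrix of shape (V, 300) built from the trained Word2Vec weights:

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Embedding(input_dim=V, output_dim=300,
                         weights=[embedding_matrix], input_length=300),
        layers.Bidirectional(layers.LSTM(300)),     # 2 x 300 = 600 values
        layers.Dense(64, activation="relu"),        # the extra hidden layer
        layers.Dense(4, activation="softmax"),      # NTA / YTA / ESH / NAH
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])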
3.3.4 DistilBERT
As explained in the Background section, Bidirectional Encoder Representations from Trans-
formers (BERT) represents one of the state-of-the-art transformer models used in NLP and, im-
plicitly, in text classification tasks. Alhassan et al. (2022) made use of several
BERT-based models retrieved from Tensorflow Hub21 and finetuned them on a very pow-
erful graphics processor, an NVIDIA Tesla P100 GPU. This resource has allowed them to make use of
large language models of up to 160GB corpus size, with hundreds of millions of parameters. Since
Google Colab22 has been used for this study, the resources were more limited (16GB RAM),
and that is why the smaller version of BERT, DistilBERT, has been preferred for this experiment.
19 https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
20 https://keras.io/api/layers/core_layers/embedding/
21 https://www.tensorflow.org/hub
22 https://colab.research.google.com/

The machine learning method used for this task is called transfer learning, and relies on using a
pre-trained model and integrating it as a starting point for a completely new task. In this case, a
pre-trained TfDistilBertModel23 will be integrated into our Keras model in order to process
the posts. The latter have to be tokenised accordingly into a DistilBERT encoding, by using the
DistilBert tokeniser.
One thing to mention before delving into the tokenisation process is that BERT models are
able to distinguish between the contexts of different words very easily, so lemmatisation and
stop-word removal are not necessary here - the posts are just cleaned by using the other
aforementioned tools, such as removing punctuation or any other unnecessary symbols. After
cleaning the posts in this way, two different lists will be created: one containing the token ids
assigned to each word through a huge vocabulary dictionary (BERT uses WordPiece24), and the
other containing what is known as the attention mask, which keeps track of the occurrence of each
token and records where the tokens end. Both lists will have size 512, since this is the maximum
number of tokens any BERT model can process.
In order to understand this process better, consider the following sentence s: "I am
overreacting to what it was said to me!". The following steps will be applied
to compute the token ids and the attention mask corresponding to s (a code sketch follows the list):

1. Tokenisation - clean the sentence and tokenise into words


Tokenised sentence: "i am overreacting to what it was said to me"

2. Sentence-to-sequence - convert the sentence into a BERT-specific sequence by applying the


following rules:

• Add the term [CLS] at the beginning of the sentence


• Add the term [SEP] at the end of the sentence
• Split unknown words into sub-tokens of known words by using the separation term ##
• Add as many [PAD] terms as needed after [SEP], until end of list is reached (512)

Generated sequence: "[CLS] i am over ##rea ##cting to what it was


said to me [SEP] [PAD] [PAD] [PAD] ... [PAD]"

3. Sequence-to-ids - convert the sequence into ids by replacing each token with the id to which
it is mapped in the vocabulary (e.g [CLS] is mapped to 101, [SEP] to 102, [PAD] to 0)
Token ids: [101, 1045, 2572, 2058, 16416, 11873, 2000, 2054, 2009,
2001, 2056, 2000, 2033, 102, 0, 0, 0, ..., 0]

4. Sequence-to-mask - compute the attention mask corresponding to the sequence by assign-


ing each word to value 1 and each padding to value 0
Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,...,
0]
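
A minimal sketch of these four steps with the Hugging Face tokeniser:

    from transformers import DistilBertTokenizer

    tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    enc = tokenizer("I am overreacting to what it was said to me!",
                    padding="max_length", truncation=True, max_length=512)

    token_ids, mask = enc["input_ids"], enc["attention_mask"]
    print(token_ids[:14])   # the ids from step 3 above
    print(mask[:16])        # the attention mask from step 4 above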

These two lists [t1, t2, ..., tn] and [m1, m2, ..., mn] will represent the Input tensors of our Keras model,
whose architecture can be seen in Figure 3.12. These tensors will be processed by the pre-trained
23 https://huggingface.co/docs/transformers/model_doc/distilbert
24 https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt

Figure 3.12: DistilBERT Keras model architecture

model TfDistilBertModel from HuggingFace, which has 6 transformer layers and outputs a
feature matrix of size 512 × 768 for each input sentence. A Pooling layer then compresses each
feature matrix so as to obtain a single vector per input sentence containing all 768 encoded values.
These are passed as 768 neurons into the Dense layer, which has 64 neurons. Just like in the
Bi-LSTM case, relu is used as the activation function for this layer. Finally, the values calculated
from the weights of this layer are saved into the Output layer, to which the same softmax activation
used for the previous tasks is applied in order to obtain the probability of each class. A minimal
sketch of this model is given below.
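
A minimal sketch of this transfer-learning model; the average-pooling choice below is an assumption for the Pooling layer described above:

    import tensorflow as tf
    from transformers import TFDistilBertModel

    bert = TFDistilBertModel.from_pretrained("distilbert-base-uncased")

    ids = tf.keras.Input(shape=(512,), dtype=tf.int32, name="token_ids")
    mask = tf.keras.Input(shape=(512,), dtype=tf.int32, name="attention_mask")

    features = bert(ids, attention_mask=mask)[0]                  # 512 x 768 matrix
    pooled = tf.keras.layers.GlobalAveragePooling1D()(features)   # one 768-vector
    hidden = tf.keras.layers.Dense(64, activation="relu")(pooled)
    output = tf.keras.layers.Dense(4, activation="softmax")(hidden)

    model = tf.keras.Model(inputs=[ids, mask], outputs=output)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])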

3.4 RQ3 - integrate best model and other features on web platform
The third research question revolves around selecting the best performing model found during the
ML process detailed in the previous subsection and integrating it on a web platform, where users
can interact with it. The platform will have the following features:

• two text boxes where the user can input a moral judgement story along with a convenient
title

• the results returned by the AI, which comprise the following components:

1. the VERDICT returned by the model with regards to the given scenario, one of Not
the asshole (NTA), You the asshole (YTA), Everyone sucks here (ESH) and No
assholes here
2. an EXPLANATION returned by ChatGPT3.5, which attempts to justify the given ver-
dict
3. a RECOMMENDATION returned by ChatGPT3.5 meant to guide the user accordingly,
given the scenario they wrote

Similar web platforms have been created before, such as AskDelphi25 and the fairly similar
AreYouTheAsshole?26. The former has been trained on multiple data-sets and can perform various
tasks, one of which is moral reasoning, but it is more focused on returning a result such as it is
right, it is acceptable or it is wrong, so it cannot deal with more complex stories where other
parties are involved. At the same time, the other one, also called AYTA, is trained on several data-
sets collected from r/AITA, which include redditors' comments as well, but it lacks complexity
as it is able to predict only two classes.
25 https://delphi.allenai.org/
26 https://areyoutheasshole.com/
The platform created as part of this research will include all 4 respective classes, so it will
extend the approach followed by AYTA, but instead of using the comments from Reddit (due to
memory issues and time constraints), it will provide the explanation and recommendation to user,
computed by ChatGPT3.5. To make sure the platform would be usable, a human experiment
involving 20 participants has been conducted. The experiment consists of giving a scenario to a
participant and asking them to judge it with one of the 4 verdicts. After they answer, they would
be able to input the scenario on the platform and see what the AI’s opinion is. Furthermore, the
participant will also be asked to rewrite the scenario such that the verdict (according to THEM)
would be different, and then test it again with the AI.
The experiment has been conducted through a questionnaire which can be found at this link27 .
In total, 5 scenarios have been used, and for each scenario the participants have provided 4 more
new stories covering all 4 verdicts. The participants have also been asked to provide their feedback
with regards to the verdicts, explanations and recommendations generated by the AI. By using the
explanation generated by ChatGPT3.5, it can be determined whether the model has output a completely
wrong verdict, namely when the explanation fails to justify the verdict accordingly. Furthermore,
even though the recommendation feature is not related to the verdict, it would be interesting to
explore whether it could actually give good advice in those situations. Therefore, not only is
it desirable to test the usability of the AI model but also how well a state-of-the-art model like
ChatGPT3.5 would perform for this task.

27 https://forms.office.com/e/KCHpmEPrpw
Chapter 4

Results & discussion

4.1 RQ1
The distribution of all posts and the sentiment associated with them can be seen in Figure 4.1.
It can be observed that 72199 posts have been labeled with a positive label compared to only
35394 which tend to be more negative. In addition, a small sample of 1264 posts convey a neutral
sentiment, representing that less than 2% of total posts. In order to get a better idea of how

Figure 4.1: Sentiment distribution of posts

sentiments are distributed over the posts, they have been plotted across each verdict, as can be
seen in Figure 4.2. Here, even though the number of positive posts clearly surpasses the number of
negative posts for each class, it can be noticed that, for the everyone sucks class, the difference is
not that huge.

Figure 4.2: Sentiment distribution per each verdict

After computing the coherence scores for both LdaMulticore and LdaMallet approaches and

for five different values for the number of topics, two plots have been provided in Figure 4.3.

Figure 4.3: Coherence values for LdaMallet and LdaMulticore

It can be observed that adopting the MALLET approach increases the topic
coherence significantly, by almost 10%. The complexity of the MALLET approach also shows in its
execution time (LdaMallet took around 3 hours to finish completely, while LdaMulticore
terminated after 57 minutes). Looking at LdaMallet's graph, a large leap occurs after exceeding
the limit of 20 topics, which shows that choosing 20 as the optimal number of topics would work.
After reinitialising the LdaMallet model with T = 20, where T is the number of topics used
in Algorithm 2, its final topic coherence is 0.43. Each discovered topic has its own cluster of
words, each word being linked to a calculated weight. For example, Figure 4.4 describes one of
the 20 discovered topics through its 20 highest-weighted words.

Figure 4.4: First 20 words describing topic 0

From a human point of view, these words clearly suggest a major topic name, which is Edu-
cation. The same intuitive technique has been applied to all 20 detected topics, which have also
been merged, since some of the predominant words from certain topics overlap. In order to pre-
serve some accuracy when renaming and merging the topics, the whole process
has been performed with ChatGPT, which has come up with the topic-words mapping shown in
Table 4.1. These operations have provided a better understanding of what kinds of topics AITA
redditors prefer writing about: there are 13 main topics, some of which (such as Communication,
Family, TimeManagement or Relationship) cover more than one cluster of words computed by
LdaMallet, thus emphasising their prevalence and importance.

Topic            Keywords
Food             eat, food, make, aita, dinner, order, bring, cook, table, lunch
Events           party, wedding, birthday, christmas, gift, invite, family, year, wear, event
Entertainment    play, game, watch, aita, show, time, movie, make, video, people
Family           kid, wife, husband, child, daughter, son, year, baby, aita, law
                 mom, family, sister, parent, dad, brother, mother, year, father, house
Education        school, college, class, high, year, student, aita, make, study, teacher
Communication    call, phone, send, post, message, text, respond, picture, reply, find
                 start, call, talk, stop, hear, yell, apologize, cry, joke, angry
Work             work, job, week, coworker, company, day, time, boss, manager, business
Finance          pay, money, buy, give, month, back, rent, bill, save, card
Health           aita, bad, sick, smoke, hair, care, day, hospital, doctor, make
Housing          room, dog, house, move, roommate, live, clean, apartment, cat, door
TimeManagement   day, time, home, night, hour, leave, work, week, sleep, stay
                 move, time, plan, live, week, trip, stay, month, year, place
Relationship     life, thing, issue, feel, time, partner, support, problem, lot, mental
                 feel, make, boyfriend, thing, upset, time, bad, talk, bf, aita
                 guy, pretty, thing, stuff, drink, aita, bit, asshole, back, shit
                 friend, girl, good, people, group, talk, hang, close, guy, person
                 girlfriend, relationship, date, break, year, month, talk, start, meet, ago
Transportation   car, drive, walk, back, sit, aita, minute, front, wait, people

Table 4.1: Topics renamed and merged with ChatGPT

The distribution of newly merged and renamed topics over posts can be seen in Figure 4.5.
The number of posts represents in how many posts a certain topic has been chosen as dominant
over the other existent topics. Relationships tend to be discussed the most among redditors, with
the mention that all kinds of relationships are included in this cluster, according to Table 4.1.
Family and Housing are the next 2 topics on the list, while Education and Health find themselves
at the other end of the list. Another interesting pattern to notice is that Housing surpasses topics
like Communication or TimeManagement, even though the latter are the result of a merge between
clusters. In another example from the literature, Gonda's analysis performed on a similar
data-set of posts ranks Housing at the top of the list, followed by Family. However, the approach adopted
there did not include a model trained by using the MALLET implementation.

Figure 4.5: Topic distribution over posts after renaming and merging

Another analysis has been performed on the distribution of topics across posts in terms of
the verdict each post has been labelled with. Since Figure 4.2 confirms the uneven distribution of
posts on what concerns the verdict, in Figure 4.6, the percentage distribution has been illustrated
instead, from which more interesting patterns can be extracted.

Figure 4.6: Topic percentage distribution over posts in terms of verdict

For instance, for topics such as Entertainment, Education or Food, the percentage of posts
labeled as not the asshole drops below 60%, while the percentage of posts labeled as asshole is
slightly higher than for other topics. Moreover, posts whose dominant topic is Communication are
more likely to be labeled with everyone sucks than no assholes here, suggesting the problems that
occur in day-to-day communication with others. Another argument that confirms this belief is
that the Communication topic also appears to
be reporting an almost equal percentage of negative and positive posts, as seen in Figure 4.7.

Figure 4.7: Topic percentage distribution over posts in terms of sentiment

The same pattern applies to the Transportation topic, whereas posts related to Health are
more likely to be negative, which is expected due to some of the top 10 words such as
sick, bad or hospital (Table 4.1). In contrast, the Events topic is linked to positive posts with a
probability exceeding 80%, since the cluster of words that defines this topic contains words with a
positive connotation: party, birthday or wedding.

4.2 RQ2
The experiments described in the Methodology section have been conducted on Google Colab.
Each approach has been applied to each subset, and the quality of the predictions has been as-
sessed through both accuracy1 and recall per class2. The reason for using the latter as well comes
from how misleading the accuracy metric can be in some cases. Since it is desired
to build a model that is able to predict well for all classes, their corresponding recalls have been
compared.

1 https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
2 https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

4.2.1 aita_2500
aita_2500 represents the most imbalanced subset, with the NTA class dominating the others.
Consequently, as seen in Table 4.2, all the models employed for this task have been extremely
biased towards the majority class, with Tf-idf and logistic regression model managing to provide
the best performance as it can be seen in the Recall per class section. Nonetheless, the SMOTE
method applied with logistic regression and multinomial naive bayes proves the benefits of balanc-
ing the data mentioned in the literature studies(Alhassan et al., 2022; O’Brien, 2020; Wang, 2017).
The SMOTE approach has generated a more balanced recall per each class for both approaches,
featuring 40% overall accuracy for the logistic model and 37% for the MNB. While having an
accuracy better than the one of a random model in a 4-class classification task (25%), there is also
a better chance for the two models to predict each class more accurately that a random model.

Approach       Representation       ML classifier                     Accuracy   NTA    YTA    ESH    NAH
1st approach   Tf-idf               Logistic regression - NO SMOTE    62%        94%    14%    1%     5%
1st approach   Tf-idf               Logistic regression - SMOTE       40%        42%    33%    30%    42%
1st approach   Tf-idf               MNB - NO SMOTE                    62%        100%   0%     0%     0%
1st approach   Tf-idf               MNB - SMOTE                       37%        40%    24%    41%    43%
2nd approach   Word2Vec             MLP                               62%        98%    6%     0%     3%
2nd approach   Word2Vec             Bi-LSTM - 5 epochs                62%        100%   0%     0%     0%
3rd approach   DistilBERT encoder   DistilBERT - 5 epochs             62%        97%    4%     0%     7%

Table 4.2: Results of the experiments conducted on aita_2500 (the last four columns give the recall per class)

4.2.2 aita_NTA_1000
After reducing the majority class by almost 50%, even though the overall accuracy has decreased,
the recall per class has slightly improved for the other classes, as Table 4.3 depicts. Tf-idf with
logistic regression has managed to provide better results than MNB, featuring a decent recall
for the YTA class, but still lacking a bit in performance when predicting the other two minority
classes. In addition, this method has surpassed the performance of Word2Vec with MLP in terms
of recall per each of the three minority classes. However, the Bi-LSTM model has achieved the
highest accuracy overall (58%), along with the best recall for YTA, but DistilBERT has performed
similarly, with both having been trained on five epochs.

Approach       Representation       ML classifier           Accuracy   NTA    YTA    ESH    NAH
1st approach   Tf-idf               Logistic regression     51%        80%    37%    5%     15%
1st approach   Tf-idf               MNB                     47%        99%    2%     0%     0%
2nd approach   Word2Vec             MLP                     51%        83%    34%    2%     13%
2nd approach   Word2Vec             Bi-LSTM - 5 epochs      58%        94%    41%    2%     7%
3rd approach   DistilBERT encoder   DistilBERT - 5 epochs   57%        97%    37%    1%     7%

Table 4.3: Results of the experiments conducted on aita_NTA_1000 (the last four columns give the recall per class)

4.2.3 aita_3_balanced
Moving on to the third subset created by balancing the first three classes (NTA, YTA and ESH),
it can be observed in Table 4.4 that each recall per class has improved significantly for all existent
4.2. RQ2 49

models. Logistic regression has obtained the same results as Bi-LSTM and DistilBERT with
regards to the recall for all classes but NTA. The recall for the latter has only been 63%, being
surpassed by the two Word2Vec and transformer models. MLP is still quite comparable with
logistic regression in this case, both of them achieving 49% overall accuracy and similar recall
values. However, Bi-LSTM and DistilBERT top the table with 58% accuracy, with much better
recall for NTA than the others. Overall, it can be noticed that the best performing models still lack
in predicting some of the minority classes (i.e. Bi-LSTM with only 25% recall and DistilBERT
with 30% recall for ESH).

Approach       Representation       ML classifier           Accuracy   NTA    YTA    ESH    NAH
1st approach   Tf-idf               Logistic regression     49%        63%    38%    32%    54%
1st approach   Tf-idf               MNB                     42%        54%    36%    0%     52%
2nd approach   Word2Vec             MLP                     49%        68%    36%    32%    49%
2nd approach   Word2Vec             Bi-LSTM - 5 epochs      58%        97%    40%    25%    45%
3rd approach   DistilBERT encoder   DistilBERT - 5 epochs   58%        91%    37%    30%    54%

Table 4.4: Results of the experiments conducted on aita_3_balanced (the last four columns give the recall per class)

4.2.4 aita_4_balanced
This fourth subset, at the same time the most balanced one, has generated the most satisfactory
results overall. A general trend observed when comparing the results recorded in Table 4.5 with
those in Table 4.4 is that with four balanced classes, not only did the overall accuracy improve
for most models, but so did the recalls, with three out of four almost reaching 50%. Bi-LSTM has
achieved 62% accuracy, matching the figure obtained on the big subset, with the difference
that the recalls have increased. Bi-LSTM still struggled to predict instances labeled
as YTA, for which the recall has been 30%. Its counterpart DistilBERT, despite achieving a slightly
lower accuracy of 61%, has managed to surpass the former when predicting YTA instances,
with a recall of 42%. The trade-off has affected the NTA class, for which it has obtained 81% re-
call, lower than the 93% obtained by Bi-LSTM. With regards to the other models, logistic regression with
tf-idf has obtained 50% accuracy with decent recall values, since all exceeded the 25% threshold,
unlike MNB and MLP. Another observation here is that the models have managed to predict much
better the posts labeled as NTA and ESH, both verdicts implying that there is at least one asshole in
the respective scenarios.

                                                                     Recall per class
Approach       Representation       ML classifier           Accuracy  NTA    YTA    ESH    NAH
1st approach   Tf-idf               Logistic regression     50%       58%    28%    49%    46%
                                    MNB                     41%       54%    6%     72%    36%
2nd approach   Word2Vec             MLP                     49%       63%    23%    68%    41%
                                    Bi-LSTM - 5 epochs      62%       93%    30%    73%    49%
3rd approach   DistilBERT encoder   DistilBERT - 5 epochs   61%       81%    42%    73%    44%

Table 4.5: Results of the experiments conducted on aita_4_balanced



4.2.5 Overall remarks about the four experiments


The outcomes of the four experiments detailed above show the issues that this 4-class classification
task has faced, especially concerning the imbalanced data. Consequently, the initial big
data-set has required further pre-processing so that the newly obtained data-sets would be more
balanced. The results have proved much more satisfactory, but the number of posts has been
reduced significantly. Therefore, the main question that arises is whether the trained models are
overfitted. This will be further discussed in the next section, which details the results
obtained after applying the methods used to tackle the third research question.
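One straightforward way to obtain such balanced subsets is to undersample every class to the size of the rarest one; a minimal sketch, assuming the posts are stored in a CSV with a verdict column (file and column names are illustrative):

    # Sketch only: undersampling every verdict class to the same size.
    import pandas as pd

    df = pd.read_csv("aita_dataset.csv")          # assumed file name
    n = df["verdict"].value_counts().min()        # size of the rarest class
    balanced = (df.groupby("verdict", group_keys=False)
                  .apply(lambda g: g.sample(n=n, random_state=42)))
    print(balanced["verdict"].value_counts())     # all four classes now equal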
Compared to the results recorded in previous papers, the models developed here have suc-
cessfully extended the approaches used in Wang (2017), namely Word2Vec and MLP. Moreover,
Bi-LSTM, a different method built on Word2Vec, has outperformed MLP, so in this respect
the results provided here are definitely more satisfactory. Furthermore, a new method
involving tf-idf and multinomial logistic regression has also proved to at least match the per-
formance of the MLP classifier. Alhassan et al.'s approach with multiple transformer models has
shown that good results could be achieved for binary classification on a similar data-set, featuring
up to 88% accuracy for a Longformer model. However, they have not provided the code used for
their approach, and they have also used much bigger transformer models than DistilBERT.
4.2.6 Further analysis of the best model
The DistilBERT model trained on aita_4_balanced has been considered the best performing
model overall, since it has managed to achieve at least 40% recall for all classes and an overall
accuracy of 61%. The ideal model would have achieved at least 50% recall for each class, but this
one is not far from that result. To understand how the model actually performed, it is important to
visualise the confusion matrix3 of the predictions, which is displayed in Figure 4.8.
Here, it can be observed that the model has often misclassified posts labeled as YTA as NTA,
although almost half of them have still been predicted correctly. Likewise, many posts labeled as NAH
have been classified as NTA or YTA, but the model has managed to predict a lot of them correctly.
Digging further into the analysis, and at the same time linking the predictions with the topics
discovered previously, Figure 4.9 shows the number of posts classified correctly and incorrectly
as YTA by the model per topic. An interesting detail noticed here is that posts related to
Relationships feature more true than false predictions, in contrast to all the other topics. This could
potentially motivate a separate model that attempts to predict only posts about relationships,
given that it is the most prevalent topic of all while also achieving very good accuracy overall.
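The confusion matrix itself can be reproduced with scikit-learn; a minimal sketch, assuming y_true and y_pred hold the encoded labels for the test split:

    # Sketch only: confusion matrix with verdicts encoded as in Figure 4.8
    # (NTA=0, YTA=1, ESH=2, NAH=3).
    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

    cm = confusion_matrix(y_true, y_pred)
    ConfusionMatrixDisplay(cm, display_labels=["NTA", "YTA", "ESH", "NAH"]).plot()
    plt.show()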

4.3 RQ3
4.3.1 Platform & experiment details
The best performing model found in the previous section has been integrated into the web platform,
whose usability has been tested through the human experiment. In this way, it becomes clear
whether the performance shown on the AITA test set carries over to stories written by other people.
The design of the platform can be visualised in Figure 4.10. Each user has to provide a story of
maximum 1200 characters along with an adequate title that starts with the acronym AITA.
3 https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

Figure 4.8: Confusion matrix showing the predictions returned by DistilBERT when tested on
aita_4_balanced. Each verdict NTA, YTA, ESH and NAH has been encoded with 0, 1, 2
and 3, respectively.

Figure 4.9: Number of posts classified correctly and incorrectly as YTA by the model, per
topic

After pressing the Submit button, the verdict, explanation and recommendation will be generated
on the right side within a few seconds. In case users get confused by the process, they can always
press the Help button, which provides them with instructions. Additionally, for testing purposes,
an extra Random scenario button has been implemented, so that a user can try a pre-made
scenario to see how the platform works, without having to write their own. Other basic tools for
clearing the two text areas have also been added to the platform: Clear title, Clear story and Clear
both.
The verdict is generated after the program processes and transforms the content of the title
concatenated with the content of the story into a BERT encoding that is passed into the DistilBERT
model. The explanation and recommendation are generated by ChatGPT3.5 through two requests
sent to ChatGPT's API, each with a specific query. The first query asks ChatGPT
to explain the returned verdict for the given scenario, whereas the second asks ChatGPT to
give a recommendation for the situation described in the story.
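To make this pipeline concrete, a hedged sketch is given below; the model path, prompt wording and helper names are assumptions, and the OpenAI call follows the pre-1.0 openai package's chat-completions interface available at the time of writing:

    # Sketch only: verdict from DistilBERT, explanation/recommendation from ChatGPT.
    import numpy as np
    import openai
    from transformers import (DistilBertTokenizerFast,
                              TFDistilBertForSequenceClassification)

    VERDICTS = ["NTA", "YTA", "ESH", "NAH"]
    tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
    model = TFDistilBertForSequenceClassification.from_pretrained(
        "models/distilbert_model")  # assumed path to the saved fine-tuned model

    def predict_verdict(title, story):
        # The title is concatenated with the story before encoding.
        enc = tokenizer(title + " " + story, truncation=True, return_tensors="tf")
        logits = model(**enc).logits
        return VERDICTS[int(np.argmax(logits, axis=1)[0])]

    def ask_chatgpt(prompt):
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content

    verdict = predict_verdict(title, story)
    explanation = ask_chatgpt(f"Explain why this AITA story was judged {verdict}:\n{story}")
    recommendation = ask_chatgpt(f"Give a recommendation for the situation below:\n{story}")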

Figure 4.10: Design of the platform

After all 20 participants have finished completing the questionnaire form, the answers4 have
been collected and analysed by Microsoft Forms, which has created figures that summarise
the answers to the multiple-choice questions. As for the free-text answers, a qualitative
analysis has been performed and the most important remarks have been extracted.

4.3.2 Participants’ agreement with the results returned by the AI


All five scenarios can be found attached in the Appendix. For the first part of the
experiment, each scenario has been analysed separately for all three results returned by the AI.
The satisfaction of the participants with regards to the verdicts, explanations and recommendations
returned by the AI for each scenario can be visualised in Figure 4.11.

(a) Verdicts (b) Explanations (c) Recommendations

Figure 4.11: Results from the first part of the experiment

The verdicts returned by the AI have matched the participants' expectations only for Scenario
4, and half of them have been satisfied with the verdict generated for Scenario 2, while the other
4 These can be found in the Questionnaire_results.xlsx file located at https://tinyurl.com/34fse4vv

three depict disagreement with the verdicts. However, when it comes to the explanations, results are
better for these three scenarios, suggesting that the verdicts returned by the AI are not completely
wrong, but only slightly off (e.g. NTA instead of NAH). In fact, for Scenario 2 and Scenario 3 the
participants have mostly agreed with the explanation, even though not at all with the verdicts. The
recommendation feature has proved to work really well, which shows that ChatGPT3.5 is quite
good at giving advice in certain situations.
The second part of the experiment has been the more challenging task for both the partici-
pants and the AI: while the former were supposed to come up with modified versions of the given
scenario that would change the verdict, the latter has been properly tested on whether it is able to
adapt and spot certain details. The results from this task have been aggregated, with no distinction
between the scenarios, and can be seen in Figure 4.12.
Figure 4.12: Results from the second part of the experiment

Overall, the AI has managed to fulfill the expectations of the participants in 50% of the cases,
whereas the explanation has managed to justify the verdicts in 55% of all cases. This suggests an
overall improvement over the first experiment, where most of the feedback has been negative
towards the verdicts, as well as decent potential on the part of the DistilBERT model when it
comes to differentiating between similar scenarios. The recommendation feature has worked
remarkably well once again, with only a few complaints in that regard.
4.3.3 Qualitative analysis
The participants have been asked to provide feedback after each task, and this section summarises
some of the most interesting suggestions/opinions that have been collected.
Among the main strengths of the web platform, the following have been considered worth
mentioning:

• three of the participants have actually changed their opinion on the verdict of the story
after reading the explanation

• the experiment has proved to be really interesting for the participants and they see them-
selves using such a tool in the future

• most participants have been impressed with the explanation and recommendation features
and how on point these are

• some participants think that the verdict, explanation and recommendation returned by the
AI are human-like

• the explanations have been reasonable in some cases where the verdict was wrong

In contrast, the main weaknesses of the AI have been listed below:

• some recommendations were just explaining the scenario and not giving actual advice

• the classification could be improved overall, since the model did not always manage to spot
the changes made to the scenarios

• the model has not found enough "fault" in some scenarios even though, from a human per-
spective, the people involved were morally in the wrong

• the explanation and recommendation could have been more detailed and provided more
information
Chapter 5

Conclusions

5.1 Achievements
Overall, this paper has focused on providing a possible solution for implementing an AI that could
help people with moral judgement, based on online posts collected from one of Reddit's biggest
communities, r/AmITheAsshole (r/AITA). Among the sub-goals achieved are the detailed analysis
of the collected data-set, which has provided interesting results about the topics discussed in
the subreddit, and the comparison between different ML models on a text classification task.
This paper has managed to extend some of the previous attempts to implement such a model
(Alhassan et al., 2022; Wang, 2017), since it has provided a solution for 4-class classification,
which had not been tried before. The issues caused by the imbalanced data-set, due to an uneven
distribution of the verdicts, have eventually been addressed by creating different, more balanced
subsets. Adopting this approach has improved the recall of each class and implicitly
the overall performance, with the two best models reaching 61% and 62% accuracy.
Moreover, the third research question discussed in this study provides a novel solution for an
AI system that could help people with moral dilemmas, one that differs from other AIs such as
AskDelphi or AreYouTheAsshole (AYTA) because of the explanation and recommendation features
given by ChatGPT3.5. The reason for opting for the latter comes from the intention to include
the most popular AI tool in the study, given how widely it is used. This decision has proved
successful in spite of recent concerns that ChatGPT might sometimes provide inconsistent
advice (Krügel et al., 2023). The participants that took part in the human experiment involving
the newly-created tool have shown a degree of satisfaction with the whole idea, suggesting
that they would be willing to use such a system in the future.

5.2 Future work & limitations


Some potential future work related to this paper will be discussed in this section, along with several
limitations and issues that have been encountered at different points in the study.

5.2.1 Expand the data-set


When the Reddit data has been scraped, only the posts with their verdicts have been considered,
thus leaving out all the comments on each post. This approach has been adopted by the people
that created the AYTA tool, but has been neglected here due to the huge amount of storage space
that would have been required. The comments could provide extra insight for the training data and
maybe also improve the performance of the AI, as well as making it able to return explanations
based on those comments. However, this process might require plenty of time, since it would
be challenging to filter only the comments considered to be morally sound, although r/AITA might
label the top comments for each post, which could make it easier. A suggestion could, therefore, be
to make the model able to generate its own explanation and then compare it to the explana-
tion generated through ChatGPT. Not only would this compare the performance of a model trained
specifically for moral judgement tasks against a state-of-the-art AI trained on varied data, but it would
also explore whether ChatGPT is actually usable for these sorts of tasks.
5.2.2 Use explainable tools to understand the models better
Unfortunately, time and memory constraints have not allowed this study to provide further expla-
nation of the trained models' behaviour. For instance, Keras provides a feature called At-
tention layer1 that could be attached to any Keras model and used to help explain the predictions
returned by the AI. This could lead to a further investigation into improving the overall perfor-
mance, as well as a better idea of how the AI processes the text. Additionally, Python's
SHAP library2 provides great visualisation tools that could be used to see how the model
evaluates the input and returns a probability as an output. This method has been tried, but
library errors have prevented it from working, so this could also be an improvement for the future.
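As an illustration, SHAP's documented usage for transformer text models is short; a minimal sketch, with the model path being an assumption:

    # Sketch only: token-level SHAP values for a text-classification pipeline.
    import shap
    from transformers import pipeline

    clf = pipeline("text-classification", model="models/distilbert_model", top_k=None)
    explainer = shap.Explainer(clf)            # auto-detects a text masker for pipelines
    shap_values = explainer(["AITA for refusing to split the bill with my friends?"])
    shap.plots.text(shap_values)               # highlights each token's contribution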
5.2.3 Extend the 4-class classification
The current model is only able to perform 4-class classification based on the four classes (verdicts)
on which it has been trained. However, the AI implemented on the platform processes
any type of input, which means that it will return a result no matter what. There
are definitely cases which either require more information in order to establish the final
verdict for sure, or simply have no meaning. Both of these situations, along with others, could
be explored further, and a solution could be to filter posts from r/AITA that are labeled
as INFO, which means that not enough information has been provided. However, it is not
guaranteed that there are enough posts classified with this verdict.
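If such posts were collected, isolating them would be trivial; a minimal sketch, assuming the flair is stored in a verdict column of the scraped CSV (names are illustrative):

    # Sketch only: isolating posts whose flair asks for more information.
    import pandas as pd

    df = pd.read_csv("aita_dataset.csv")            # assumed file name
    info_posts = df[df["verdict"] == "INFO"]
    print(len(info_posts), "posts were flaired INFO")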

5.3 Self-evaluation
Personally, I have found this project both exciting and challenging for plenty of reasons. Dealing
with huge data-sets and with text classification tasks requires plenty of research time and a proper
choice of resources. I think that I have struggled a bit with the latter, since I had not initially
expected that I would need a powerful GPU to train my models if I wanted better time
performance. Memory management represented the main issue I encountered throughout the
project, but acquiring Colab Pro has helped me reduce the execution time considerably. Overall,
I have greatly enjoyed exploring the world of NLP and text classification, since it has greatly
improved the skills acquired in the ML-based courses I had attended before at university.

1 https://keras.io/api/layers/attention_layers/attention/
2 https://shap.readthedocs.io/en/latest/
Appendix A

Maintenance manual

In order to install and use the system, you have to follow the steps mentioned below:

1. Python 3.9.6 or above has to be installed; you can download it from https://www.python.org/downloads/

2. The main system is written in Python’s framework Flask, which can be installed from the
official website https://flask.palletsprojects.com/en/2.3.x/

3. Unzip the zip file aita_system.zip and extract all files. The main files of the project
are the following:
/
  main.py
  requirements.txt
  templates/
    index.html
  static/
    layout_utils.js
    scenarios.json
  models/
    distilbert_model.h5
  notebooks/
    datasets/
      aita_dataset.csv
      aita_preprocessed.csv
    RQ1.ipynb
    RQ2.ipynb

4. requirements.txt contains all dependencies that need to be installed in order to make the
program work. Open a command line and run the following command in order to install the
dependencies: pip install -r requirements.txt

5. main.py represents the main file of the program that will run the entire software. This can be
done using the following command: flask --app main run.

6. The notebooks folder contains the two Jupyter notebooks where the data analysis has been
performed, along with a datasets folder. It is recommended that you open the notebooks in
the .ipynb format, since the cells contain output which is relevant to the experiments (i.e. fig-
ures, charts, model training per epoch etc.). Additionally, in order to avoid extra dependency
conflicts, it is highly recommended to open the notebooks in Google Colab, where the experi-
ments themselves have been performed. Each cell can be run separately, but keep in mind that
certain cells will take a long time to run, due to the nature of this task, which has involved long
hours of training certain ML models. A pre-trained model is provided in the models folder, and
that one is also used by the AI tool integrated in main.py. The datasets subfolder contains
the necessary information scraped from Reddit's r/AITA, as well as new information added dur-
ing the data analysis process (i.e. topics per document, cleaned posts etc.). Both RQ1.ipynb
and RQ2.ipynb have their own subsections, one of which, called Utils, contains the main
functions that need to be run before running the further experiments. All the cells contained in this
section can be run at once, thanks to Jupyter Notebook's format.
Appendix B

User manual

In order to run and interact with the system, you have to follow the steps mentioned below:

1. Open a command line, go to the directory where you have saved the system and run the
following command, which will start the Flask app and open a local session of the project:
flask --app main run.

2. Go to the link provided in the command line, which should output something like this:
Running on http://127.0.0.1:5000. The web platform should have the following
interface:

3. Write your scenario on the left side by giving it a title and an appropriate story. The limit
of the story is 1200 characters, so you will not be allowed to write more. When you are
ready, press the Submit button and wait a few seconds until the results (verdict, explanation
and recommendation) are generated on the right side. Alternatively, in case you do not have
any inspiration for a scenario, press the Random scenario button, and the platform will
automatically provide a custom scenario for you, instead.

4. The new state of the platform will look like this:

5. Whenever you encounter any difficulties, press the HELP grey button, which will display a
modal containing specific instructions with regards to the platform:
Bibliography

Adewumi, T., Liwicki, F., and Liwicki, M. (2022). Word2vec: Optimal hyperparameters and their
impact on natural language processing downstream tasks. Open Computer Science, 12(1):134–
141.
Agarwal, B., Mittal, N., Bansal, P., and Garg, S. (2015). Sentiment analysis using common-sense
and context information. Intell. Neuroscience, 2015.
Alhassan, A., Zhang, J., and Schlegel, V. (2022). ’am i the bad one’? predicting the moral judge-
ment of the crowd using pre-trained language models. pages 267–276. European Language
Resources Association.
Almeida, F. and Xexéo, G. (2019). Word embeddings: A survey.
Anees, A. F., Shaikh, A., Shaikh, A., and Shaikh, S. (EasyChair, 2020). Survey paper on sentiment
analysis: Techniques and challenges.
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung,
W., Do, Q. V., Xu, Y., and Fung, P. (2023). A multitask, multilingual, multimodal evaluation of
chatgpt on reasoning, hallucination, and interactivity.
Botzer, N., Gu, S., and Weninger, T. (2021). Analysis of moral judgement on reddit. CoRR,
abs/2101.07664.
Chauhan, U. and Shah, A. (2022). Topic modeling using latent dirichlet allocation: A survey.
ACM Comput. Surv., 54:145:1–145:35.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic
minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.
Chen, J. Q., Qi, K., Zhang, A., Shalaginov, M. Y., and Zeng, T. H. (2022). Covid-19 impact on
mental health analysis based on reddit comments. pages 2253–2258. IEEE.
Chowdhary, K. R. (2020). Natural language processing. pages 603–649.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent
neural networks on sequence modeling.
Cui, Z., Ke, R., Pu, Z., and Wang, Y. (2019). Deep bidirectional and unidirectional lstm recurrent
neural network for network-wide traffic speed prediction.
Cunn, N. (2019). Am i the asshole? http://www.nathancunn.com/2019-04-04-am-i-the-asshole/.
Dale, R. (2017). The commercial nlp landscape in 2017. Natural Language Engineering, 23:641–
647.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirec-
tional transformers for language understanding.
Efstathiadis, I. S., Paulino-Passos, G., and Toni, F. (2022). Explainable patterns for distinction and
prediction of moral judgement on reddit. CoRR, abs/2201.11155.


Elbagir, S. and Yang, J. (2019). Analysis using natural language toolkit and vader sentiment.
Frederick, R. (2009). What is commonsense morality? Think, 8(23):7–20.
Geron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow : concepts,
tools, and techniques to build intelligent systems. O’Reilly Media, Sebastopol, CA.
Gonda, T. (2019). What does reddit argue about?
Habimana, O., Li, Y., Li, R., Gu, X., and Yu, G. (2020). Sentiment analysis using deep learning
approaches: an overview. Science China Information Sciences, 63.
Haworth, E., Grover, T., Langston, J., Patel, A., West, J., and Williams, A. C. (2021). Classifying
reasonability in retellings of personal events shared on social media: A preliminary case study
with /r/amitheasshole. pages 1075–1079. AAAI Press.
Hutto, C. and Gilbert, E. (2014). Vader: A parsimonious rule-based model for sentiment analysis
of social media text. Proceedings of the International AAAI Conference on Web and Social
Media, 8(1):216–225.
Jelodar, H., Wang, Y., Orji, R., and Huang, S. (2020). Deep sentiment classification and topic
discovery on novel coronavirus or covid-19 online discussions: Nlp using lstm recurrent neural
network approach. IEEE J. Biomed. Health Informatics, 24:2733–2742.
Jiang, L., Hwang, J. D., Bhagavatula, C., Bras, R. L., Forbes, M., Borchardt, J., Liang, J., Et-
zioni, O., Sap, M., and Choi, Y. (2021). Delphi: Towards machine ethics and norms. CoRR,
abs/2110.07574.
Kalyanathaya, K. P., Akila, D., and Rajesh, P. (2019). Advances in natural language processing–a
survey of current research trends, development tools and industry applications. International
Journal of Recent Technology and Engineering, 7(5C):199–202.
Kamath, C. N., Bukhari, S. S., and Dengel, A. (2018). Comparative study between traditional ma-
chine learning and deep learning approaches for text classification. In Proceedings of the ACM
Symposium on Document Engineering 2018, DocEng ’18, New York, NY, USA. Association
for Computing Machinery.
Kibriya, A. M., Frank, E., Pfahringer, B., and Holmes, G. (2005). Multinomial naive bayes for
text categorization revisited. In Webb, G. I. and Yu, X., editors, AI 2004: Advances in Artificial
Intelligence, pages 488–499, Berlin, Heidelberg. Springer Berlin Heidelberg.
Krügel, S., Ostermaier, A., and Uhl, M. (2023). The moral authority of chatgpt.
Mehlig, B. (2021). Machine Learning with Neural Networks. Cambridge University Press.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word represen-
tations in vector space.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013b). Distributed representations
of words and phrases and their compositionality.
Mujumdar, R., Sarat, P., Kaundinya, P., and Dambekodi, S. (2020). Can machines detect if you’re
a jerk? Deep Learning, Fall 2020, Georgia Tech.
Nadkarni, P. M., Ohno-Machado, L., and Chapman, W. W. (2011). Natural language processing:
an introduction. Journal of the American Medical Informatics Association, 18(5):544–551.

Newman, D., Lau, J. H., Grieser, K., and Baldwin, T. (2010). Automatic evaluation of topic co-
herence. In Human language technologies: The 2010 annual conference of the North American
chapter of the association for computational linguistics, pages 100–108.
Nguyen, T. D., Lyall, G., Tran, A., Shin, M., Carroll, N. G., Klein, C., and Xie, L. (2022). Mapping
topics in 100,000 real-life moral dilemmas. Proceedings of the International AAAI Conference
on Web and Social Media, 16(1):699–710.
O’Brien, E. (2020). Aita for making this? a public dataset of reddit posts about moral dilemmas.
Proferes, N., Jones, N., Gilbert, S., Fiesler, C., and Zimmer, M. (2021). Studying reddit: A
systematic overview of disciplines, approaches, methods, and ethics. Social Media+ Society,
7(2):20563051211019004.
Ramadhan, W., Astri Novianty, S., and Casi Setianingsih, S. (2017). Sentiment analysis using
multinomial logistic regression. In 2017 International Conference on Control, Electronics,
Renewable Energy and Communications (ICCREC), pages 46–49.
Röder, M., Both, A., and Hinneburg, A. (2015). Exploring the space of topic coherence measures.
In Proceedings of the eighth ACM international conference on Web search and data mining,
pages 399–408.
Salehinejad, H., Sankar, S., Barfett, J., Colak, E., and Valaee, S. (2018). Recent advances in
recurrent neural networks.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2020). Distilbert, a distilled version of bert:
smaller, faster, cheaper and lighter.
Sivek, S. C. (2021). Am i the... data geek who analyzed reddit aita posts? yes. https://towardsdatascience.com/am-i-the-data-geek-who-analyzed-reddit-aita-posts-yes-4954a8d37055.
Syed, S. and Spruit, M. (2017). Full-text or abstract? examining topic coherence scores using la-
tent dirichlet allocation. In 2017 IEEE International Conference on Data Science and Advanced
Analytics (DSAA), pages 165–174.
Symeonidis, S., Effrosynidis, D., and Arampatzis, A. (2018). A comparative evaluation of pre-
processing techniques and their interactions for twitter sentiment analysis. Expert Systems with
Applications, 110:298–310.
Ting, J. (2015). A look into the world of reddit with neural networks.
Umer, M., Sadiq, S., Missen, M. M. S., Hameed, Z., Aslam, Z., Siddique, M. A., and NAPPI,
M. (2021). Scientific papers citation analysis using textual features and smote resampling tech-
niques. Pattern Recognition Letters, 150:250–257.
V M, N. and Kumar R, D. A. (2019). Implementation on text classification using bag of words
model.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and
Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wal-
lach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information
Processing Systems, volume 30. Curran Associates, Inc.
Wang, I. (2017). “am i the asshole?”: A deep learning approach for evaluating moral scenarios.
CS230: Deep Learning, Spring 2020, Stanford University.
Yao, L., Mimno, D., and McCallum, A. (2009). Efficient methods for topic model inference on
streaming document collections. pages 937–946.


Zhang, W., Yoshida, T., and Tang, X. (2011). A comparative study of tf*idf, lsi and multi-words
for text classification. Expert Systems with Applications, 38(3):2758–2765.
İrsoy, O., Benton, A., and Stratos, K. (2021). Corrected cbow performs as well as skip-gram.
