Thesis Darius Dragnea
2023
Declaration
No portion of the work contained in this document has been submitted in support of an application
for a degree or qualification of this or any other university or other institution of learning. All
verbatim extracts have been distinguished by quotation marks, and all sources of information have
been specifically acknowledged.
Signed:
Date: 2023
Abstract
People often seek advice for their problems on internet forums because they believe strangers would judge their situations more objectively than relatives or friends would. Thanks to the most recent developments in the Natural Language Processing (NLP) field, AIs can now be used for moral judgement purposes, and they arguably exhibit a lower level of bias than humans. This paper focuses on exploring different NLP methods used in text classification tasks, while attempting to provide a novel solution for developing such a tool, based on posts collected from Reddit's r/AmITheAsshole (AITA) community. The first research question focuses on exploring and summarising different patterns that exist in the newly-created data-set containing posts scraped from the famous subreddit. Topic modelling and sentiment analysis have been performed, showing that the large majority of the posts are labelled as positive, and identifying the main topics discussed on the forum: relationships, family and housing. The second research question revolves around delving into various machine learning methods and assessing their performance on a text classification task. Four different subsets have been created to mitigate the class imbalance in the data-set, and results show that the best models (Bi-LSTM and DistilBERT) achieve around 62% and 61% accuracy, respectively, with the latter even achieving more than 40% recall for each of the four classes: Not The Asshole (NTA), You The Asshole (YTA), Everyone Sucks Here (ESH) and No Assholes Here (NAH). The third research question provides a solution for a web platform that allows users to write their stories and receive a verdict from the best-performing model trained previously, along with an explanation for the given verdict and a recommendation for the situation, both generated with the aid of ChatGPT (GPT-3.5). The platform has been further tested through a human experiment involving 20 participants who were asked to fill in a questionnaire based on their experience. The results are somewhat satisfactory, showing that the model managed to predict the verdicts correctly for 50% of the stories provided by the participants, and in some cases the verdict along with its explanation even managed to change people's opinion on the scenario.
Acknowledgements
First of all, I would like to thank my family for giving me the opportunity and the funds to study abroad at such a prestigious university as the University of Aberdeen. This whole experience happened because they had the vision and the commitment to encourage me to study abroad. I have acquired important skills in computing science throughout these four years, and I would also like to thank all the professors who have guided us this whole time and who did their best to help us as much as possible, even during the pandemic. The professor who has influenced me the most has been my supervisor, Dr Bruno Yun, who introduced me to the world of Machine Learning and then accepted me to work on this project, which he proposed. He has been a role model for me, and I wish him all the best in his future career as a lecturer and researcher.
Contents
1 Introduction
1.1 Motivation
1.2 Research questions
1.3 Related work
1.4 Paper overview
2 Background
2.1 Natural Language Processing (NLP) methods
2.1.1 NLP background
2.1.2 Sentiment analysis
2.1.3 Topic modelling
2.2 Traditional Machine Learning (ML) methods
2.2.1 Logistic regression
2.2.2 Naive Bayes
2.3 Deep Learning methods and Neural Networks (NNs)
2.3.1 An overview
2.3.2 Multi-layer Perceptron (MLP)
2.3.3 Long short-term memory networks (LSTMs)
2.4 Feature engineering techniques
2.4.1 Bag of words (BoW)
2.4.2 Term frequency–inverse document frequency (Tf-idf)
2.4.3 Word embeddings & Word2Vec
2.5 Transformers & Large Language Models (LLMs)
2.5.1 Bidirectional Encoder Representations from Transformers (BERT)
2.5.2 DistilBERT
2.5.3 GPT-3.5
3 Methodology
3.1 Data-set
3.2 RQ1 - sentiment analysis and topic modelling on AITA data-set
3.2.1 Data-cleaning
3.2.2 Sentiment analysis
3.2.3 Topic modelling
3.3 RQ2 - ML methods employed for text classification
3.3.1 Subsets
3.3.2 BoW & Tf-idf with traditional ML
3.3.3 Word2Vec with Neural Networks
3.3.4 DistilBERT
3.4 RQ3 - integrate best model and other features on web platform
5 Conclusions
5.1 Achievements
5.2 Future work & limitations
5.2.1 Expand the data-set
5.2.2 Use explainable tools to understand the models better
5.2.3 Extend the 4-class classification
5.3 Self-evaluation
A Maintenance manual
B User manual
Chapter 1
Introduction
1.1 Motivation
In recent years, the modern tech world has witnessed significant progress in the field of artificial intelligence, especially in what is known as Natural Language Processing (NLP) (Kalyanathaya et al., 2019). Its applications have become prevalent in many industries and have markedly improved human-to-machine interactions, as well as customers' experience. For instance, AI conversational systems such as chatbots and virtual assistants are widely used nowadays by organisations that deal with plenty of customers, with a view to boosting their business and services. This process is called automation, and it relies directly on how well automated systems can replace, replicate or even surpass humans' potential.
But what if we were to expand these systems further by making them more involved in peo-
ple’s lives? What if automated systems became capable of aiding us with our day-to-day personal
problems? Recent advances in NLP have provided us with an answer to these questions: text
analytics (Dale, 2017). This topic covers a broad range of technologies, such as text classification, entity recognition, or sentiment analysis, that manipulate text written by humans and convert it into a format understandable to AIs. One task of great interest these days that can be performed using these tools is moral judgement, which consists of evaluating whether certain human behaviours are morally acceptable or not. Many researchers have struggled to come up with satisfactory results in this respect, some even considering these AI systems more suitable than humans for assessing real-life scenarios, due to their lower level of bias (Mujumdar et al., 2020). In contrast, others who have already tried developing such systems consider that teaching moral sense to any machine poses a seemingly impossible challenge (Jiang et al., 2021).
When attempting to design and build this kind of ultra-powerful AI system, large amounts of data are required, in order to make the machine as accurate as possible. Fortunately, social media platforms such as Reddit, Facebook, and Twitter have grown in popularity, with many users revealing some of their personal stories. This trend has its roots in the biggest privilege a user benefits from when surfing online - anonymity. Therefore, loads of threads, posts, and comments have accumulated throughout the years and have indirectly provided ideal data-sets that satisfy the data requirements mentioned above. They have eventually become an extremely attractive source of data for data scientists, data analysts, and researchers (Chen et al., 2022). As perhaps one of the most popular social media platforms among Internet users, Reddit offers an unlimited
range of data-sets thanks to its communities (subreddits), which multiply daily and cover various aspects of real life (Proferes et al., 2021). A well-known Reddit community that has been active since 2013 is r/AmItheAsshole (AITA)¹, where people are invited to share their moral dilemmas with the other members, who are allowed to give their opinions on the situations; based on these, a final verdict is generated after all votes are counted. All of the posts contained in this forum could be transformed into input data processable by a machine, using the text analytics methods listed above. A machine could then be trained on the verdicts associated with each post in order to classify other scenarios (posts).
Overall, the main motivation for this project comes from the potential AI systems represent in the field of moral judgement, which could be harnessed in humans' favour. Despite being ethically and socially aware to an extent, we still tend to judge subjectively when it comes to assessing the real-life scenarios that we bump into every day. Endless debates could arise anytime, among people from any group, concerning what is ethical to do and what is not in a certain situation, on any possible topic. No matter the location, time, or context, we are always required to wrestle with moral issues of our own or in our proximity, and we often seek external help when things degenerate. For example, a simple marriage misunderstanding could lead to serious future issues, which are sometimes extremely complicated and almost impossible to solve conveniently for both sides. At this point, relatives or friends might interfere in the matter, and their advice could, in some cases, negatively influence its outcome. Sometimes the two partners resort to seeking psychological help or even appealing to court judgement, options which ensure impartiality and fairness. Nevertheless, accessing these services could mean significant financial costs, in which case other solutions have to be considered. This is where social media platforms come in handy: in communities such as Reddit's r/AmITheAsshole, any moral dilemma can be discussed and debated each day, objectively and anonymously. Therefore, each user of this platform can get thousands of replies to their issue, along with a final verdict. Instead of having to read that many opinions given by others, why can we not use the most recent innovations in the AI industry described above and create such a smart platform for people who seek less biased advice? This represents this paper's own "moral dilemma", from which three main research questions have been extracted (see next section).
The main goal of the first research question is to integrate some well-known NLP methods used in text analysis into the data mining process. AITA posts will be collected and merged into a large data-set, along with their respective verdicts. After modifying the existing posts accordingly, a thorough investigation will be performed with the aim of extracting and summarising different patterns. Among the selected NLP tools, sentiment analysis and topic modelling are worth mentioning, due to their potential significance later on in the classification tasks. These two techniques combined could generate answers to queries such as how people feel when writing stories about certain topics. The results obtained at the end of this milestone will be compared to the ones existing in the literature and extended further.
RQ3: How could a web platform be created to help people with moral judgement?
The plethora of tools available for assessing the performance of ML models should be more than enough when working with existing test sets. After all, the classifiers will be trained, validated, and tested on posts from the same data-set, split accordingly. However, this paper also intends to transpose the AI workspace built during the previous two stages to the real world, thus attempting to create an interactive tool meant to be used by the general public. Therefore, the third research question proposed as part of this project specifically investigates how a successfully-developed AI system could be integrated into a web platform. The main ideas surrounding this goal are to expand on the class returned by the system and make the AI also provide a clear reason for its "decision", followed by a recommendation for the user, i.e., what they could do in that situation, no matter to what extent they are guilty.
These two sub-tasks require extremely powerful tools for their completion, such as Large Language Models (LLMs) like ChatGPT, BLOOM, or OPT. Additionally, they strongly rely on how well the actual models perform, which is why they had initially been set as optional goals for this research.
Apart from the components described above, users will also be allowed to access certain statistical graphs, figures, or tables which could be useful for better understanding how the system works, based on the collected data. For instance, they could visualise the topics that are most prevalent in the r/AITA community for each class, the sentiments associated with each topic, or the distribution of the topics over the decade. Those with ML expertise will be allowed to explore different results returned by the models through some explainability tools. In this way, the verdict, explanation, or recommendation given by the system will not seem ambiguous.
shown in terms of moral reasoning tasks. Even though they achieved decent results overall, including a general accuracy of 73.55% (thanks to a successful parameter-optimisation strategy), their model was not capable of predicting many minority class instances, as can be seen in the confusion matrix they provided. As mentioned in their paper, the class imbalance was one of the issues they did not try to fix, although they suggested possible solutions for this matter. Additionally, they did not perform a complete analysis with respect to topic modelling, since they only provided the most relevant terms for each document, using the Term frequency-inverse document frequency (Tf-idf) technique. A principal goal of this paper is to extend the data analysis provided by Mujumdar et al. by using different topic modelling methods, as well as exploring in more depth ways to adjust the apparent class imbalance, with the aim of developing a more objective classifier.
The same state-of-the-art pre-trained models have been analysed in the paper written by Alhassan et al., where they were evaluated on posts collected from the AITA community. The authors claim that the complex nature of the publicly-available stories provided by this subreddit prompted them to investigate in what ways NLP technologies could be used to interpret these posts. Their approach differed slightly from Mujumdar et al. (2020) through the different subsets they used for training and testing in order to balance the data. Not surprisingly, better results were obtained on the two balanced sets, the absolute best being on the set with longer posts, which reached 88% overall accuracy. However, they acknowledged that their minority class performance could be significantly improved if other data balancing techniques, such as SMOTE, were applied.
Another study that has contributed to this research field is that of Wang (2017), whose purpose was to investigate whether deep learning methods could be used to predict moralistic judgements. According to the author, neural network approaches have been rare when it comes to tackling this kind of problem, a fact that is reinforced by the results provided later in the paper. The approach adopted there relied on word embeddings, created as numerical vectors generated from the tokenised posts. These embeddings were then combined with support vector machines (SVM) and simple Deep Neural Networks. In order to deal with the imbalanced data, previously reported by Mujumdar et al. (2020) and Alhassan et al. (2022) as well, they opted for an under-sampling technique, which slightly improved the two models. Finally, they acknowledge the limitations of Reddit's API when trying to scrape AITA posts, as well as the need for larger amounts of data to improve performance on the minority classes.
With regards to topic modelling, the paper that has conducted a detailed analysis of AITA posts is authored by Nguyen et al. (2022). It features around 100,000 moral dilemmas collected through the API. Their final findings depict a total of 47 high-quality topics covering most of the data space, detected using topic modelling techniques such as Latent Dirichlet Allocation (LDA) and probabilistic clustering. These methods will be explained later in this paper, with an extra emphasis on posts concerning different kinds of relationships, since Nguyen et al. have specifically highlighted those.
Furthermore, according to Sivek, who has performed sentiment analysis and topic modelling on an AITA data-set, several correlations exist between different topics and the sentiments expressed through the analysed posts. The author has also mentioned several statistical tests, such as Pearson and Spearman correlation, that could be used to identify these patterns.
Reasonability in social media represents another focus that several papers, such as Haworth et al. (2021), have tried to analyse in depth through the AITA subreddit. Their approach was based on separating the collected posts into two classes - NTA (Not The Asshole) and YTA (You The Asshole) - after which a re-sampling process was used to balance them, due to the dominance of the former. Their final models were tested on various features, and their best classifier achieved a general accuracy of 77%. They also made use of the comments written for each post, and concluded that Reddit subscribers who are active on AITA maintain a cognitive bias toward YTA posts, and that many AITA users seek validation regardless of its necessity. This behaviour is described in detail by Cunn in his online blog, where the author concludes that most articles whose headline is a question can be answered with the word 'no', i.e., in most of those posts the original poster (OP) will not be labelled as YTA.
Moral judgement on Reddit has also been analysed by Botzer et al. (2021), where the main focus was on determining whether users' unlabelled comments assign a positive or negative moral valence to the OP. Their best-performing model was by far judge-BERT, with around 89% accuracy. Their main results have also shown that users prefer positively-labelled posts to negatively-labelled posts. At the same time, work done by Efstathiadis et al. on a similar data-set generated an accuracy of 86% (through the same BERT) when classifying comments. That paper, however, also attempted to classify posts, obtaining an accuracy of 62% for both test sets used in the process, as well as higher precision and F1-scores on the re-balanced test set. This confirms the need to solve the issues concerning the imbalanced data-set, which causes the BERT model to be biased towards "sweethearts".
Background
textual input with the aid of a big collection of words stored in a dictionary (dictionary based), also
called lexicon. In this respect, Python offers a free open-source package called Natural Language
Toolkit (NLTK) 1 , which provides the corpora and lexical resources necessary for lexicon-based
sentiment analysis.
One of the most popular lexicon-based tools available in this library is Valence Aware Dictionary and sEntiment Reasoner (VADER) (Hutto and Gilbert, 2014). It has been developed specifically for use in social media contexts, featuring more sensitivity towards sentiment expressions than traditional sentiment lexicons such as Linguistic Inquiry and Word Count (LIWC), whose benefits are still preserved in VADER. When applied to text, VADER calculates the final sentiment scores through a simple procedure.
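Since the original algorithm listing is not reproduced here, the snippet below is a minimal sketch of lexicon-based scoring with VADER, assuming NLTK's standard interface; the ±0.05 compound thresholds follow the convention suggested by Hutto and Gilbert (2014), and the example post is illustrative.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # fetch the VADER lexicon once
analyzer = SentimentIntensityAnalyzer()

post = "AITA for refusing to lend my brother money again?"
scores = analyzer.polarity_scores(post)  # keys: 'neg', 'neu', 'pos', 'compound'

# Infer the final label from the normalised compound score.
if scores["compound"] >= 0.05:
    label = "positive"
elif scores["compound"] <= -0.05:
    label = "negative"
else:
    label = "neutral"
print(label, scores)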
https://monkeylearn.com/blog/introduction-to-topic-modeling/
it can be concluded that the three main actors that take part in this process are the following:
documents, topics and words.
Latent Dirichlet Allocation (LDA) represents a topic modelling technique that directly deals with topics that are "invisible" (the English word "latent" essentially refers to something that exists but is not directly observable). The main principle that underpins LDA is the correlation between the three elements mentioned above: documents are considered a mixture of topics, which are themselves regarded as a mixture of words. In statistical terms, the topics are generated from the probability distribution of the documents, whereas the words are computed from the probability distribution of the topics. The probability distribution used by LDA comes from a family of continuous multivariate probability distributions called the Dirichlet distribution³, which is parameterised by a vector of positive reals $\alpha$ (the family is thus denoted $\mathrm{Dir}(\alpha)$).
Before delving into the algorithm itself, several notations have to be explained in more detail. First, two hyper-parameters $\alpha$ and $\beta$ are defined as the parameters of the prior probability distributions⁴ of the topics over documents and of the words over topics, respectively. The two Dirichlet distributions $\mathrm{Dir}(\alpha)$ and $\mathrm{Dir}(\beta)$ are then computed, modelling the relationship between documents and topics, and between topics and words, respectively. With this in place, the LDA generation algorithm can be summarised by the following sketch.
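As the original pseudo-code listing is not reproduced here, the following Python sketch illustrates the generative process just described; the number of topics, vocabulary size, document lengths and hyper-parameter values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
T, V, doc_lengths = 3, 50, [20, 15, 30]   # topics, vocabulary size, |d| per document
alpha, beta = 0.1, 0.01                   # Dirichlet hyper-parameters

# One word distribution phi_t ~ Dir(beta) per topic.
phi = rng.dirichlet([beta] * V, size=T)

corpus = []
for N_d in doc_lengths:
    theta_d = rng.dirichlet([alpha] * T)  # topic mixture of document d
    doc = []
    for _ in range(N_d):
        z = rng.choice(T, p=theta_d)      # draw a topic for this word slot
        w = rng.choice(V, p=phi[z])       # draw a word from that topic
        doc.append(w)
    corpus.append(doc)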
While LDA is a powerful and innovative tool for performing topic modelling, appropriate metrics are required to evaluate the quality of the discovered topics. Topic coherence (Newman et al., 2010) can be evaluated by rating a set of words for how coherent and interpretable they are from a human perspective. Using coherence, the topic quality can therefore be estimated through a single quantitative value that goes by the name of coherence score, which follows the principle that words with similar meanings tend to occur in similar contexts (Syed and Spruit, 2017). This evaluation method comes in extremely handy when deciding on the initial number of desired topics $T$. If the value chosen for $T$ is not big enough, the resulting topics might be too general, with few distinctions between them, whereas a bigger value would lead to overly specific topics that are too difficult to interpret.

3 https://en.wikipedia.org/wiki/Dirichlet_distribution
4 https://en.wikipedia.org/wiki/Prior_probability
The main strategy adopted later in this paper relies on the coherence measure called $C_V$, which has reportedly achieved the highest correlation with human ranking data, according to Röder et al. (2015). This coherence measure is computed through four main steps: segmentation of the documents into word pairs, probability estimation for each word and word pair, computation of a confirmation measure which defines how strongly adjacent segments support each other, and aggregation of all confirmation measures by taking their mean. The four parts are explained more mathematically below:
1. Segmentation: let $N$ denote the number of top words kept for each discovered topic $t \in \{1, \ldots, T\}$, and let $W_N^{t,d}$ denote the set of topic $t$'s top-$N$ words, assuming that $W_N^{t,d}$ represents the segmentation of an arbitrary document $d$. A set $S_t$ is constructed such that each of its elements is a word pair of the form $(W_i, W_j)$, with $W_i, W_j \subseteq W_N^{t,d}$. Hence, each pair $S \in S_t$ describes a mapping between some of the top-$N$ words from $W_N^{t,d}$ and others.
2. Probability estimation: for each word $w_i$ and word pair $(w_i, w_j)$, the probability $P(w_i)$ and the joint probability $P(w_i, w_j)$, respectively, are calculated by dividing the number of virtual documents in which the words occur ($|D_{w_i}|$ and $|D_{w_i} \cap D_{w_j}|$) by the total number of virtual documents $|D'|$. The set of virtual documents $D'$ is computed using a Boolean sliding window method: for a document $d$ and a window of size $s$, $D'_d$ represents a set of virtual documents $d'_1, d'_2, \ldots, d'_{N-s+1}$, where a virtual document $d'_i = \{w_i, w_{i+1}, \ldots, w_{i+s-1}\}$.
3. Confirmation measure: for each pair $(W_i, W_j) \in S_t$, let $\theta_{ij}$ be defined as the confirmation measure of how strongly $W_i$ supports $W_j$, based on the similarity between the words in $W_i$ and $W_j$ and all the words from $W_N^{t,d}$. To calculate $\theta_{ij}$, $W_i$ and $W_j$ have to be transcribed as the context vectors $\vec{v}_i$ and $\vec{v}_j$, such that all words in $W_i$ and $W_j$ are paired to words in $W_N^{t,d}$ through what is called normalised pointwise mutual information (NPMI)⁶, which calculates the agreement between two individual words $w_i$ and $w_j$ as follows:

$$\mathrm{NPMI}(w_i, w_j)^{\gamma} = \left( \frac{\log \frac{P(w_i, w_j) + \varepsilon}{P(w_i) \cdot P(w_j)}}{-\log\left(P(w_i, w_j) + \varepsilon\right)} \right)^{\gamma}$$

where $\varepsilon$ is a constant value (close to zero) added to avoid taking the logarithm of zero, and $\gamma$ is a value that places more weight on high NPMI values. Hence, a vector $\vec{v}_i$ contains the sums of all NPMI values calculated between a word from $W_i$ and all words from $W_N^{t,d}$:

$$\vec{v}_i = \left\{ \sum_{w \in W_i} \mathrm{NPMI}(w, w_j)^{\gamma} \right\}_{j = 1, \ldots, N}$$
The confirmation measure is then calculated by computing the cosine similarity of the two context vectors:

$$\theta(\vec{v}_i, \vec{v}_j) = \frac{\vec{v}_i \cdot \vec{v}_j}{\|\vec{v}_i\|_2 \cdot \|\vec{v}_j\|_2}$$

6 https://en.wikipedia.org/wiki/Pointwise_mutual_information
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$

where $t$ represents the vector containing the predictions. It can be noticed that this procedure can be used for binary classification tasks, since its output can be thresholded into two possible outcomes, but it can be extended to support multiple classes by using what is known as Multinomial Logistic Regression (MLR). This method has been successfully used for different NLP tasks such as sentiment analysis performed on Twitter posts (Ramadhan et al., 2017) or text classification on different data-sets (Kamath et al., 2018), where it has reportedly performed better than other traditional machine learning algorithms such as LinearSVC, Random Forest or Multinomial Bayes. Scikit-Learn's version of the multinomial algorithm⁷ aims at generating the probability that the target variable $y_i$ of an observation $i$ belongs to each existing class $k$ from the set of classes $K$, given the feature variables $X_i$ and the per-class coefficients $\beta_k$, as follows:

$$P(y_i = k \mid X_i) = \frac{e^{X_i \beta_k}}{\sum_{l=1}^{|K|} e^{X_i \beta_l}}$$
$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
where $P(A \mid B)$ is the posterior probability of the target class $A$ given predictor $B$, $P(A)$ is the prior probability of the target, $P(B)$ is the prior probability of the predictor, and $P(B \mid A)$ is the likelihood of predictor $B$ given target $A$. This algorithm is usually applied as part of a Bayesian network, which connects nodes representing random variables through edges labelled with the conditional probabilities of those variables. The main task of the network is to compute the respective posterior probabilities and map a given target (or document, in the case of text classification) to the class reporting the highest probability.

7 https://scikit-learn.org/stable/modules/linear_model.html
One of the popular variants of the Naive Bayes algorithm is Multinomial Naive Bayes (MNB), used for multinomially distributed data, which has been shown to perform better than other Naive Bayes models such as Bernoulli when it comes to text categorisation (classification) (Kibriya et al., 2005). Scikit-Learn employs a version of this algorithm⁸ which is widely used for information retrieval purposes. To predict the probability that the feature variable $X_i$ of observation $i$ belongs to class $y$, the formula for relative frequency counting is employed:

$$P(X_i \mid y) = \frac{N_{y,i} + 1}{N_y + n}$$

where $N_{y,i}$ represents all occurrences of observation $i$ in samples (documents) of class $y$, $N_y$ is defined as the total sum of occurrences of all observations labelled with class $y$, the added 1 is a smoothing term used to prevent zero probabilities, and $n$ denotes the number of features in the data space. It can be observed that this algorithm can easily be adapted to text classification tasks, where each observation could represent a document and the probability that a document belongs to a class could be recomputed using the count of each word from the document, as detailed in Kibriya et al. (2005).
8 https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
Briefly, an ANN contains an Input layer, which holds all input neurons required to train the model; a Hidden layer, which processes and computes the variables further; and an Output layer, which represents the final result computed using the neurons from the previous steps. The Hidden layer is by far the most interesting part of an ANN, since it can comprise multiple hidden layers, making the ANN deeper in complexity, which is where the term Deep Learning comes from. The deep learning architectures of the more complex ANNs used in this research will be explained next through concrete examples.
2.3.2 Multi-layer Perceptron (MLP)
The Multi-layer Perceptron (MLP) (Kamath et al., 2018) is a type of ANN whose architecture is based on 2 or more fully-connected layers, followed by a final output layer, so it could be represented by the ANN drawn in Figure 2.1.
Consider that the leftmost layer from Figure 2.1 represents a set of input neurons $x_1, x_2, x_3$ depicting the MLP's input features, together with a set of corresponding weights $w_1, w_2, w_3$, where each weight represents a vector of length 4 whose elements are sent to the hidden layer of neurons $a_1, a_2, a_3, a_4$. Each of these hidden neurons transforms its inputs in the following way:

1. Compute the weighted sum of the inputs: $s_i = \sum_{j} w_{j,i} \cdot x_j$

2. Add the bias: $s_i = s_i + b_i$

3. Apply the activation function on the result: this function is usually the sigmoid $\sigma(s_i) = \frac{1}{1 + e^{-s_i}}$, the hyperbolic tangent $\tanh(s_i) = \frac{e^{s_i} - e^{-s_i}}{e^{s_i} + e^{-s_i}}$, or the rectified linear unit $relu(s_i) = \max(s_i, 0)$

The obtained value will be assigned to each $a_i$ and then sent to the output layer, completing the forward pass. For classification purposes, in case there are more than 2 classes, an additional softmax layer is required, which creates a vector containing the probabilities that a sample belongs to each class, defined by the following function:

$$softmax(a_i) = \frac{e^{a_i}}{\sum_{j=1}^{4} e^{a_j}}$$
Scikit-Learn provides a solid implementation of an MLP classifier⁹, which will be employed later in this paper.

9 https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
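As a minimal illustration, the sketch below fits this classifier on a toy tf-idf matrix; the four posts, their verdicts and the hidden-layer size are assumptions made purely for the example.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

posts = ["aita refuse lend money", "aita skip sister wedding",
         "aita keep found dog", "aita tell friend truth"]
verdicts = ["NTA", "YTA", "ESH", "NAH"]

X = TfidfVectorizer().fit_transform(posts)   # sparse tf-idf features
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X, verdicts)
print(clf.predict(X[:1]))                    # predicted verdict for the first post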
The input features do not represent only one vector of units $x_t = (x_1, x_2, \ldots, x_n)$, but sequences of vectors, depending on the time $t$ at which they are processed. They are fully connected to the neurons from the hidden layer $h_1, h_2, \ldots, h_m$ through different weight matrices $\{\ldots, W_i, W_{i+1}, \ldots\}$ for each time unit $\{\ldots, t, t+1, \ldots\}$. The hidden layer has recurrent connections between neurons from different time units, which pass the weight matrices $\{\ldots, W_{h-1}, W_h, \ldots\}$ from one time unit to another. Each hidden unit vector $h_t = (h_1, h_2, \ldots, h_m)$ corresponding to time $t$ is computed through the activation function $f_h$ used by the hidden layer as follows:

$$h_t = f_h(W_i \cdot x_t + W_{h-1} \cdot h_{t-1} + b_{h-1})$$

where $b_{h-1}$ represents the bias with which unit $h_t$ is initialised. The hidden neurons from each time unit are themselves connected to the output neurons through the weight matrices $\{\ldots, W_o, W_{o+1}, \ldots\}$, and the output units $y_t = (y_1, y_2, \ldots, y_p)$ are computed using the activation function $f_o$ of the output layer:

$$y_t = f_o(W_o \cdot h_t + b_o)$$

where $b_o$ represents the bias with which unit $y_t$ is initialised. Therefore, it can be observed that at each time step, the hidden units return a prediction for the output layer, calculated based on the input unit parsed at the beginning.
In general, simple RNNs can store information in their "memory" by using the recurrent connections that operate in a loop, but they become extremely difficult to train when it comes to storing and learning information over long time spans, because the gradients of their loss function vanish through exponential decay. To solve these memory compression issues, gates have been introduced as part of the activation functions, thus defining a new category of RNNs: the LSTMs (Chung et al., 2014). The basic structure of an LSTM unit can be seen in Figure 2.3.
An LSTM unit uses a set of gates consisting of an input gate $I_t$, a forget gate $F_t$ and an output gate $O_t$, which all serve to process the values kept in memory slots labelled $c_t$. The activation function used to initialise the gates is the sigmoid function (labelled sig), whereas the hyperbolic tangent (tanh) represents the activation function for the hidden unit $h_t$. Each $c_t$ represents a memory cell of the LSTM unit and gets updated with a new cell candidate $C_t$ that is computed from previous time units. The easiest way to understand this whole process is to start with the output values $h_t$ and $c_t$ and go backwards:

- $h_t = O_t \odot \tanh(c_t)$
- $c_t = F_t \odot c_{t-1} + I_t \odot C_t$
- $C_t = \tanh(W_{C_t} \cdot x_t + U_{C_t} \cdot h_{t-1} + b_{C_t})$
- $O_t = \mathrm{sig}(W_{O_t} \cdot x_t + U_{O_t} \cdot h_{t-1} + b_{O_t})$
- $F_t = \mathrm{sig}(W_{F_t} \cdot x_t + U_{F_t} \cdot h_{t-1} + b_{F_t})$
- $I_t = \mathrm{sig}(W_{I_t} \cdot x_t + U_{I_t} \cdot h_{t-1} + b_{I_t})$

where $W_{I_t}, W_{F_t}, W_{C_t}, W_{O_t}$ represent the weight matrices associated with each gate for the input $x_t$; $U_{I_t}, U_{F_t}, U_{C_t}, U_{O_t}$ represent the weight matrices associated with each gate for the previous hidden unit $h_{t-1}$; and $b_{I_t}, b_{F_t}, b_{C_t}, b_{O_t}$ are their respective bias vectors.
One of the most popular variants of LSTMs used for text classification is the Bidirectional LSTM (Bi-LSTM) (Cui et al., 2019), whose architecture employs two different LSTM hidden layers that operate both backwards and forwards, while being connected to the same input and output layers. Both hidden layers contain their respective hidden units $h_t$ and $h'_t$, with the difference that the $h_t$s are processed backwards, with $t$ decreasing through $\{t-1, t-2, \ldots\}$, whereas the $h'_t$s are processed forwards, for the next time units $t+1, t+2, \ldots$. The equations detailed above for a simple LSTM unit apply in this context by following the same pattern. A Bi-LSTM architecture can be visualised in Figure 2.4, and a code sketch of one is given below.
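The following is a minimal Keras sketch of such an architecture; the vocabulary size, sequence length and layer sizes are illustrative assumptions, not the exact configuration used later in this paper.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

vocab_size, seq_len = 20000, 500
model = Sequential([
    Embedding(vocab_size, 128, input_length=seq_len),  # token id -> word vector
    Bidirectional(LSTM(64)),          # forward and backward LSTM hidden layers
    Dense(4, activation="softmax"),   # NTA / YTA / ESH / NAH probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()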
$$w(t, d) = tf(t, d) \cdot \log \frac{|D|}{df(t, D)}$$

where $tf(t, d)$ and $df(t, D)$ represent the term frequency of $t$ in document $d$ and the document frequency of $t$ in the whole collection $D$, respectively. The log term in the formula is also denoted $idf(t, D)$, the inverse document frequency of term $t$ in the collection $D$. To handle the case when a term does not occur anywhere in the document collection, which would lead to a division-by-zero error in $idf(t, D)$, some libraries such as Scikit-Learn¹⁰ use a different formula for the inverse document frequency, adding 1 to the numerator and denominator of $idf(t, D)$, as if an extra document containing every term in the collection exactly once had been seen:

$$idf(t, D) = \log \frac{1 + |D|}{1 + df(t, D)} + 1$$
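The smoothed formula can be checked directly against Scikit-Learn; the three toy documents below are illustrative.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["dog park dog", "dog home", "park walk"]
vec = TfidfVectorizer(smooth_idf=True, norm=None)
vec.fit(docs)

# idf(t, D) = log((1 + |D|) / (1 + df(t, D))) + 1; for "dog", df = 2 and |D| = 3.
manual = np.log((1 + 3) / (1 + 2)) + 1
print(manual, vec.idf_[vec.vocabulary_["dog"]])  # the two values match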
2.4.3 Word embeddings & Word2Vec
Word embeddings are fixed-length, dense and distributed representations of words (Almeida and Xexéo, 2019), which have proven extremely useful in most NLP tasks that deal with the processing of textual data. They are frequently classified into prediction-based language models, which are strongly linked to neural language models as they predict the next word using the context, and count-based models, which have a matrix structure and take into account corpus-based statistics such as word frequencies and counts.

Word2Vec (Mikolov et al., 2013a) is a model that has its roots in the family of prediction-based language models, and that has become widely used in the NLP world due to its impressive performance and quality when it comes to computing vector representations of words belonging to large data-sets. The quality of these vectors is measured in terms of syntactic and semantic word similarity, on which Word2Vec can provide state-of-the-art performance for certain data-sets, surpassing some well-performing techniques originating in neural networks, according to Mikolov et al.. Among the most successful models employed by Word2Vec, a family worth mentioning is that of log-linear models, which have been shown to offer low computational complexity for learning distributed representations of words, as their name suggests.
The two log-linear model architectures (Adewumi et al., 2022) are Continuous BoW and Continuous Skip-gram, and both of them are supported as open-source projects by Gensim¹¹. The main difference between them is how they attempt to predict: the skip-gram algorithm takes a centre word as input and predicts the words before and after it within a given range known as the window, while BoW does the opposite: from a sequence of words projected within the same window, it tries to classify the target word in the middle. It has been shown in the literature that Continuous Skip-gram provides a much faster training time than other neural network techniques, since it does not require dense matrix multiplications (Mikolov et al., 2013b). It achieves better accuracy for both semantic and syntactic tasks than its counterpart when tested on a corpus of 783M words, featuring a higher vector size (600), according to Table 5 from Mikolov et al. (2013a). Additionally, İrsoy et al. have shown that the Gensim implementation of BoW lacks in quality; consequently, the Word2Vec experiments conducted throughout this study have preferred Continuous Skip-gram.
A Continuous Skip-gram model uses a classic neural network architecture, consisting of an input, hidden and output layer, and its training objective is to discover word representations (vectors) that help in detecting the other surrounding words in a document. Mathematically, let $W$ define the set of training words $w_t \in W$ of length $T$ (Mikolov et al., 2013b). The idea behind the skip-gram algorithm is to maximise the average log probability of detecting the context words in a window of size $c$ around a target word:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le i \le c,\, i \ne 0} \log P(w_{t+i} \mid w_t)$$

10 https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
11 https://radimrehurek.com/gensim/models/word2vec.html
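In Gensim, training Continuous Skip-gram amounts to setting sg=1; the two tokenised posts and the hyper-parameter values below are illustrative assumptions.

from gensim.models import Word2Vec

texts = [["aita", "refuse", "lend", "money"],
         ["aita", "skip", "sister", "wedding"]]
model = Word2Vec(sentences=texts, vector_size=100, window=5,
                 sg=1, min_count=1, workers=4)   # sg=1 selects Skip-gram
print(model.wv.most_similar("aita", topn=3))     # nearest words by cosine similarity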
Figure 2.5: The architecture of the Transformer encoder and decoder (taken from Vaswani et al. (2017))
• Masked Language Modelling (MLM): given a sentence, the model masks 15% of its words, then runs the entire masked sentence through the model, attempting to predict the masked words

• Next Sentence Prediction (NSP): during pre-training, the model concatenates two masked sentences and attempts to predict whether the two sentences followed each other in the original text or not
2.5.2 DistilBERT
DistilBERT (Sanh et al., 2020) represents a lighter version of BERT, pre-trained as a smaller general-purpose transformer model. By using what is known as knowledge distillation¹³, Sanh et al. have managed to shrink BERT's size by 40% and keep the most essential language understanding properties, while also making it 60% faster than its "parent" model. DistilBERT can be found online on Hugging Face¹⁴ as well; a minimal loading sketch is shown below.
13 https://en.wikipedia.org/wiki/Knowledge_distillation
14 https://huggingface.co/distilbert-base-uncased
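The sketch below loads the pre-trained checkpoint for 4-class sequence classification through the Hugging Face transformers library; the example sentence is illustrative, and the fine-tuning loop itself is omitted.

from transformers import (DistilBertTokenizerFast,
                          DistilBertForSequenceClassification)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4)   # NTA / YTA / ESH / NAH

inputs = tokenizer("AITA for skipping my sister's wedding?",
                   return_tensors="pt", truncation=True)
logits = model(**inputs).logits                # one score per class, before softmax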
2.5.3 GPT-3.5
ChatGPT (Bang et al., 2023) represents a state-of-the-art multilingual tool that has revolutionised the AI world in recent years, due to its remarkable performance on a plethora of tasks, such as machine translation, sentiment analysis or question answering, all belonging to the field of Natural Language Generation (NLG). GPT-3.5-turbo is one of the pre-trained GPT models available to use online and also to integrate into personal websites for a certain cost. The 3.5-turbo model, however, has proved to be extremely affordable, since it only charges $0.002 per 1,000 tokens sent through the API¹⁵. Therefore, it has been preferred for some tasks in this study; an example request is sketched below.
15 https://openai.com/blog/introducing-chatgpt-and-whisper-apis
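For illustration, a request to the model through OpenAI's Python client (the 0.x interface current at the time of writing) could look as follows; the prompt wording and the key placeholder are assumptions, not the exact prompt used on the platform.

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder credential
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You explain AITA verdicts."},
        {"role": "user", "content": "Verdict: YTA. Story: ... Explain why."},
    ],
)
print(response["choices"][0]["message"]["content"])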
Chapter 3
Methodology
3.1 Data-set
Reddit¹ is one of the most popular social media platforms nowadays, featuring over 100k active communities that count over 57 million daily active users and over 13 billion posts and comments. One of its most popular communities (also known as a subreddit) is AmITheAsshole (AITA), which provides plenty of text data that can be used further for NLP experiments.
Each user is allowed to write as many posts as they want, on condition that they follow some general guidelines. Each post must contain a title starting with the word AITA (or WIBTA - Would I Be The Asshole?), as well as a story describing a moral judgement situation, to which other users can react and vote with one of the following labels: NTA (Not The Asshole), YTA (You The Asshole), ESH (Everyone Sucks Here) and NAH (No Assholes Here). The flowchart displayed in Figure 3.1 illustrates the general process of how a verdict is generated for a post.
1 https://www.reddit.com/
In order to create a proper AITA data-set, the posts have to be extracted with the aid of several online scraping tools (a code sketch follows the list):

• Pushshift's API² for pulling the ids of the posts, along with the score associated with each of them (which represents the difference between the number of upvotes and the number of downvotes of the post)

• Praw³, the official Python wrapper for Reddit's API, for extracting the post content and other relevant information
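A sketch of the extraction step is given below, assuming valid Reddit API credentials; the post id is a placeholder that would come from Pushshift.

import praw

reddit = praw.Reddit(client_id="CLIENT_ID", client_secret="CLIENT_SECRET",
                     user_agent="aita-scraper")

submission = reddit.submission(id="abc123")    # placeholder id obtained via Pushshift
record = {
    "id": submission.id,
    "timestamp": submission.created_utc,
    "title": submission.title,
    "body": submission.selftext,
    "verdict": submission.link_flair_text,     # e.g. "Not the A-hole"
    "score": submission.score,
}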
O'Brien has already provided a completely scraped data-set with posts collected from 2013 up until 2020, part of which has been used in this paper as well. This data-set has been further updated with new content from late 2022 until the beginning of 2023, using the same scraping methods described above. The final data-set contains the most important features presented in Table 3.1 and can be found here⁴.
Feature     Description
id          The unique identifier of each post
timestamp   The time, written in encoded characters⁵, when the post was published
title       Short description of the story, starting with AITA/WIBTA
body        The content of the story
verdict     The final verdict of the story, which can be one of NTA, YTA, ESH, or NAH
score       The number of upvotes minus the number of downvotes for the post

Table 3.1: The main features of the collected data-set
To make sure that the quality of the data-set is preserved, only posts with a score higher than 3 have been considered (an explanation of this quality control can be found here⁶). In total, 108,557 posts have been retrieved, ranging from the 24th of February 2014 (timestamp 1393278651) until the 1st of January 2023 (timestamp 1672545102), featuring several time gaps because the Reddit API no longer returns data for the period between March 2020 and December 2021.
3.2.1 Data-cleaning
Python's re⁷ library represents the ideal tool for text cleaning, since it makes use of classic regex expressions through which unnecessary characters can easily be removed. The four main regex operations that have been applied to the data-set are the following (in order; a sketch of the corresponding cleaning function follows the list):

1. Remove links and hyperlinks

2. Remove digits

3. Remove extra spaces

4. Remove punctuation
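One possible implementation of the four operations (plus the lowercasing described below) is sketched here; the exact patterns are assumptions, since the original expressions are not reproduced in this section.

import re
import string

def clean_post(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", "", text)               # 1. links
    text = re.sub(r"\d+", "", text)                                  # 2. digits
    text = re.sub(r"\s+", " ", text).strip()                         # 3. extra spaces
    text = re.sub(rf"[{re.escape(string.punctuation)}]", "", text)   # 4. punctuation
    return text.lower()                                              # lowercasing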
Operation 1 is required first because of potential overlaps with other operations such as 4, which could modify a link so that the regex would no longer detect it. The need to delete any links or hyperlinks comes from the fact that some posts might not contain the full story, but only a link to the actual story instead. Since it would be almost impossible to get rid of all posts that are incomplete, the whole analysis has also included the title of each post, which was concatenated to the post's body. Apart from the four operations explained above, all posts are converted to lowercase, since there is no difference between a capitalised and a lowercase word in terms of context.
Other necessary operations used for pre-processing the data are lemmatisation and stop word removal. Lemmatisation⁸ is a technique used in computational linguistics for determining the lemma (dictionary form) of a word based on its intended meaning. It is widely used in plenty of NLP tasks such as information retrieval or sentiment analysis, because it has the potential to reduce the dimension of the subspace used for analysis, as well as to improve the quality of the results given by the respective NLP methods. It has been shown that lemmatisation can even improve general accuracy when integrated into sentiment analysis (Symeonidis et al., 2018), and it features several approaches that can be adopted through Python's libraries. Spacy's Lemmatizer⁹ pipeline has been preferred for this study due to the numerous functionalities it provides. Here are several rules that illustrate how Spacy's Lemmatizer works in practice (a short example follows the list):

• Verbs: any forms such as was, were, being or been will get converted to be

• Pronouns: any possessive pronouns such as my, mine, your or yours will either be replaced with the tag '-PRON-' or not be modified, since determining the lemma in this context is more challenging

• Nouns: any plural forms such as houses, dictionaries or mice will be converted to their singular forms house, dictionary or mouse
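A short example of these rules in action, assuming the small English pipeline en_core_web_sm is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The mice were hiding in the houses")
print([token.lemma_ for token in doc])
# expected along the lines of: ['the', 'mouse', 'be', 'hide', 'in', 'the', 'house']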
The other popular pre-processing function, stop word removal, consists of removing words that carry no meaning of their own and that would significantly slow down the whole data analysis process, especially when it comes to sentiment analysis and topic modelling, where the context of the words matters the most (Elbagir and Yang, 2019). NLTK provides an entire corpus of stopwords (172 in total) which has been used to clean the data one more time. For instance, considering the sentence I have gone to the library today to finish my dissertation, after removing the stopwords (words that belong to the downloaded NLTK corpus), the new sentence will be I gone library today finish dissertation. It can be observed that the second sentence suggests the same meaning even though it has been significantly truncated. This step is reproduced in code below.

7 https://docs.python.org/3/library/re.html
8 https://en.wikipedia.org/wiki/Lemmatisation
9 https://spacy.io/api/lemmatizer
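The example above can be reproduced with NLTK's corpus as follows; the membership test is kept case-sensitive so that the capitalised "I" survives, as in the sentence above.

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stops = set(stopwords.words("english"))

sentence = "I have gone to the library today to finish my dissertation"
kept = [w for w in sentence.split() if w not in stops]
print(" ".join(kept))  # "I gone library today finish dissertation"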
3.2.2 Sentiment analysis
In order to perform sentiment analysis on the Reddit posts, the VADER algorithm 1 has to be
applied on each story. The lexicon required beforehand can be downloaded and imported from
NLTK’s library, which also provides an extremely useful sentiment intensity analyzer 10 . This tool
generates the dictionary containing the compound metric through which the sentiment of each post
is inferred accordingly.
3.2.3 Topic modelling
On what concerns topic modelling applied through LDA, Python’s open source library Gensim
11 represents the most convenient tool to use. It features an extremely fast and easy-to-use library
that is able to process massive chunks of data and large corpora, while also providing pre-trained
models, reason for which many companies prefer it, as well. Topic modelling-based experiments
have been conducted in the past by using the same state-of-the art tool( Jelodar et al. (2020), Gonda
(2019), Sivek (2021)).
Several preparation steps have been required before commencing the topic discovery process. At this point, all the collected posts have been cleaned such that they can be processed more efficiently by NLP tools, but they still represent pieces of plain text stored as strings. The following three steps have been applied to the big set of posts (the documents processed in Algorithm 2), and are sketched in code after the example below:

1. Convert all posts to lists of tokens: all posts have been tokenised using Gensim's simple_preprocess tokeniser, which discards tokens shorter than 2 or longer than 15 characters. The resulting output is a list of lists of tokens, referred to as texts.

2. Create a dictionary from the lists of tokens: all lists of tokens in texts are mapped to a dictionary that stores each token (word) that exists in each list, in alphabetical order. The resulting output is a dictionary object of type gensim.corpora.Dictionary called id2word, which maps an id to a word.

3. Create a bag-of-words list from each list of tokens and the dictionary: given texts and id2word, the newly-created list called corpus stores pairs of tuples (token_id, token_count), where token_id represents the rank of the token in id2word, and token_count defines the number of occurrences of the token in texts.

For an example post, the resulting bag-of-words representation looks as follows:

• Corpus: [(0, 2), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2), (7, 2)]
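The three steps can be sketched as follows; the single example post is illustrative.

from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess

posts = ["AITA for refusing to lend my brother money again and again?"]

texts = [simple_preprocess(p) for p in posts]   # 1. tokenise each post
id2word = Dictionary(texts)                     # 2. map token ids to words
corpus = [id2word.doc2bow(t) for t in texts]    # 3. bag-of-words per post
print(corpus[0])   # pairs of (token_id, token_count); 'again' is counted twice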
With texts, id2word and corpus computed, the next thing to clarify is the number of topics, which has to be agreed on beforehand. As explained in the Background section, one way to determine the optimal number of topics is by computing coherence scores for different numbers of topics. Comparing the coherence scores can be done by plotting them against the number of topics. The expected result is that the coherence score will increase as the number of topics increases, but this would lead to greater specificity among the topics. One idea that is frequently adopted is to apply the elbow method¹² to the resulting graph of coherence values against the number of topics, and pick the number after which the graph's curve stops increasing at the same rate. However, it might not always be the case that the increase in topic coherence is proportional to the increase in the number of topics. In that case, the number of topics has to be selected from a desired range, manually chosen beforehand, such that the coherence value is the highest for that range.
Gensim's LdaModel¹³ initialises the LDA model and applies the LDA generation algorithm (Algorithm 2) following the appropriate steps, but for this experiment LdaMulticore¹⁴ has been preferred, due to the fact that the latter adopts a multiprocessing, parallelised technique which makes model training much faster, while still applying the same algorithm. Additionally, another technique recommended by some studies and experiments (Yao et al. (2009), Gonda (2019), Chauhan and Shah (2022)) is the MAchine Learning for LanguagE Toolkit (MALLET), a topic modelling package implemented in Java, but also provided through Gensim. The main difference between LdaMallet¹⁵ and LdaMulticore is that the former uses a Gibbs sampling method (explained in more detail in Chauhan and Shah (2022)), which provides a more qualitative topic inference than standard LDA, which relies on a sampling method based on the Variational Bayes algorithm. However, LdaMulticore is more suitable in terms of memory performance, requiring O(1) space, whereas training an LdaMallet model requires O(|id2word|) space, where id2word contains all words existing in the main corpus.

After experimenting with both of these approaches, the optimal number of topics can therefore be selected (as sketched below) and used to create a new LDA model, which will return the output mentioned in Algorithm 2. This output will serve further in the extraction and summarisation of different patterns related to the inferred topics, as well as their distribution among posts.
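A sketch of this selection procedure is given below, reusing the texts, id2word and corpus objects built earlier; the candidate range and the number of passes are illustrative assumptions.

from gensim.models import CoherenceModel, LdaMulticore

scores = {}
for num_topics in range(5, 41, 5):
    lda = LdaMulticore(corpus=corpus, id2word=id2word,
                       num_topics=num_topics, passes=2, random_state=0)
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=id2word, coherence="c_v")
    scores[num_topics] = cm.get_coherence()

best = max(scores, key=scores.get)   # or apply the elbow method on a plot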
12 https://www.baeldung.com/cs/topic-modeling-coherence-score
13 https://radimrehurek.com/gensim/models/ldamodel.html
14 https://radimrehurek.com/gensim/models/ldamulticore.html
15 https://radimrehurek.com/gensim_3.8.3/models/wrappers/ldamallet.html
Figure 3.2: Distribution plot of cleaned posts in terms of number of characters, per class
Due to the class imbalance explained in Cunn (2019) and also noticed in this research (Figure 3.3), the new data-set (which will be referred to as aita_2500) will be shrunk further, such that the dominant class, not the asshole, contains fewer posts, thus making the distribution more balanced.

Some papers in the literature have already discussed this issue to a certain extent, and have either reported bad results mainly affected by the class imbalance (Wang, 2017) or suggested and provided ways to solve this matter (O'Brien, 2020; Alhassan et al., 2022). At this point, aita_2500 has the following class distribution among verdicts: not the asshole (NTA) - 62%, asshole (YTA) - 21%, everyone sucks (ESH) - 5% and no assholes here (NAH) - 12%. Three other subsets have been created from aita_2500, and they are displayed in Table 3.2. The technique here was to gradually reduce the size of the dominant classes by keeping only posts below a certain length, while keeping the minority class (ESH) unchanged.

Subset           Description              NTA   YTA   ESH   NAH
aita_NTA_1000    NTA < 1000 characters    47%   29%   8%    16%
aita_3_balanced  NTA < 600 characters,    32%   29%   13%   26%
                 YTA < 1000 characters
aita_4_balanced  NTA < 450 characters,    27%   26%   22%   25%
                 YTA < 600 characters,
                 ESH < 800 characters

Table 3.2: Class percentages of each subset
The main methods required for applying text classification to the four subsets have been divided into three sections, depicting three different approaches.
3.3.2 BoW & Tf-idf with traditional ML
The first approach illustrates the relevance of traditional ML methods (Logistic regression and
Multinomial Naive Bayes (MNB)) in the field of text classification, with the aid of two NLP
techniques used for word representations: Bag-of-Words (BoW) and Term frequency-inverse
document frequency (Tf-idf). It has been shown in the literature (Kibriya et al., 2005) how
important the connection between MNB and Tf-idf is, in the sense that the former is
computationally very efficient and easy to implement, while its performance can be improved
even further if Tf-idf scores are used for recording word frequencies. Additionally, O'Brien
has shown that logistic regression can achieve a good performance for binary classification
(by merging NAH posts into NTA and ESH posts into YTA), so it could obtain similar results for
multi-class classification. Their logistic regression model has been used with 1-grams for
storing the term frequencies and, therefore, it is believed that using Tf-idf for a similar
task could improve the predictions even more.
The main methodology employed for this transformation is illustrated in Figure 3.4, where the
main components have been used through Sklearn's packages.
In order to understand how CountVectorizer and TfidfTransformer work together
and process the given corpus, consider a smaller corpus comprising 3 posts that have already
been cleaned (including lemmatisation and stop-word removal).
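The sketch below illustrates this on three hypothetical cleaned posts; the exact posts from the original example are not reproduced here.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # Hypothetical cleaned posts (lemmatised, stop-words removed)
    posts = ["friend borrow money never return",
             "sister move house parent",
             "refuse lend friend money"]

    count_vec = CountVectorizer()
    tf_matrix = count_vec.fit_transform(posts)  # term-frequency matrix (cf. Figure 3.5)
    tfidf_matrix = TfidfTransformer().fit_transform(tf_matrix)  # tf-idf matrix (cf. Figure 3.6)

    print(count_vec.get_feature_names_out())
    print(tfidf_matrix.toarray())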
When passed into a CountVectorizer object, it will tokenise the sentences into words and
create the term-frequency matrix as shown in Figure 3.5. Afterwards, a TfidfTransformer
will process the term-frequency matrix and transform it into a tf-idf representation by applying
the inverse document frequency formula seen in the Background section. The final tf-idf
representation is depicted in Figure 3.6. After applying this procedure to all the posts (which
obviously contain far more tokens than the ones from the example) from all 4 main subsets, the
tf-idf matrix containing the training data has been fit, along with its corresponding verdicts,
into a multinomial logistic classifier and a multinomial naive bayes classifier.
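A sketch of this fitting step, assuming X_train_tfidf and y_train hold the tf-idf matrix and the verdicts of one subset:

    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB

    # Multinomial logistic regression and multinomial naive bayes classifiers
    log_reg = LogisticRegression(multi_class="multinomial",
                                 max_iter=1000).fit(X_train_tfidf, y_train)
    mnb = MultinomialNB().fit(X_train_tfidf, y_train)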
Another interesting experiment which has been considered is using a sampling method on
the main data-set in order to balance the existing classes equally. In this respect, O'Brien has
used the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al., 2002), an
approach in which the minority classes are over-sampled by creating "synthetic" data rather
than by over-sampling with replacement, as seen in previous attempts to deal with imbalanced
data. This method is usually applied on the tf-idf matrix, as seen in Figure 3.4.
An illustrative example in this sense can be shown for 100 random entries sampled from
aita_2500 with the same distribution. After converting the extracted posts to word
representations by using tf-idf, the latter can be re-sampled along with their corresponding
verdicts by using the Sklearn-compatible imblearn library16, which provides an over-sampling
variant of SMOTE.
16 https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
The effect of SMOTE, as well as the way in which it generates synthetic data, has been captured
in the two plots provided in Figure 3.7:
Figure 3.7: SMOTE applied on a sample of 100 numerical vectors representing posts from
aita_2500, where the verdicts are encoded as follows: NTA: 0, YTA: 1, ESH: 2 and NAH: 3
It can be observed that the majority class (representing NTA) has not been modified at all,
while the other three have gained more samples, which lie within the same range of values as
the original ones. Eventually, the newly synthesised data will balance the data-set, in this
case bringing all 4 classes to a 25% ratio each.
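A sketch of this re-sampling step, assuming X_tfidf and y hold the tf-idf vectors and the encoded verdicts of the 100 sampled posts:

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    # k_neighbors may need to be lowered when a minority class has very few samples
    smote = SMOTE(random_state=42, k_neighbors=3)
    X_resampled, y_resampled = smote.fit_resample(X_tfidf, y)

    print(Counter(y))            # imbalanced counts before re-sampling
    print(Counter(y_resampled))  # all four classes equal afterwards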
While SMOTE is a very popular method for handling imbalanced data, featuring impressive
results in some papers (Umer et al., 2021) when used with traditional machine learning methods
and tf-idf to perform text categorisation tasks, it has its own downsides. Since it uses a
nearest-neighbour algorithm to generate new samples, these might accidentally become too
similar to the existing ones, which might cause overfitting problems when training the models
later on. In addition, SMOTE significantly increases the size of the data space, which could
cause memory issues when training bigger word representation structures such as Word2Vec or
BERT encodings. As a result, this method has not been considered for the following two sections
comprising the second research question of this paper.
3.3.3 Word2Vec with Neural Networks
The second approach utilises the vector representations known as Word2Vec, an extremely
popular technique used in the NLP world to transform documents into word embeddings,
based on the similarity between the different words contained in each document. Gensim's
library has been preferred for this task since it provides an implementation of the continuous
Skip-gram architecture of a Word2Vec model, which has been shown to outperform its homologue,
Continuous BoW, as described in the Background section. Despite the pre-trained models
distributed by Gensim17, it has been chosen as part of this study to create the word vectors
from scratch, based on the posts given as training data.
This approach has been used before by Wang in their paper, where a similar experiment has
been performed, but for a binary classification task. They have also shown how the vocabulary
of a pre-trained model might not cover all the words existent in posts collected from r/AITA.
17 https://github.com/RaRe-Technologies/gensim-data
However, they have not explained how they chose 30 as the optimal vector size for the word
embeddings, a fact that has determined this study to investigate this aspect further. In
addition, the SVM classifier used for the classification part in Wang (2017) has been omitted
in this paper, due to the severe computational issues SVM could cause when used with Word2Vec,
since its execution time is very slow if Sklearn's version18 is used.
Regarding the vector size chosen for this task, 2 dimensions have been considered initially
and compared: 30 (taken from Wang (2017)) and 300, the latter representing the default size of
most pre-trained vectors. After a short experiment on a sample from the big data-set, which
included an MLP classifier with a similar architecture to the one in Wang (2017), 300 has been
determined to be the optimal vector size: the MLP classifier trained with this vector and
hidden layer size has managed to surpass, in terms of accuracy, the performance of the model
with vector and hidden layer size 30. Therefore, the Word2Vec model is ready to be trained,
with a vector size of 300, a window size of 5 and a skip-gram architecture.
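A sketch of this final configuration with Gensim (the parameter name vector_size applies to Gensim 4; older releases call it size), assuming tokenised_posts is the list of tokenised training posts:

    from gensim.models import Word2Vec

    w2v = Word2Vec(sentences=tokenised_posts,
                   vector_size=300,  # preferred over 30 after the comparison above
                   window=5,
                   sg=1,             # skip-gram architecture
                   min_count=1)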
The main methodology used for this Word2Vec approach is illustrated in Figure 3.8, where the
Bidirectional LSTM model has been initialised from Keras, while the Multi-layer Perceptron
classifier has been taken from Sklearn.
Figure 3.8: Pipeline representing how word2vec is used with deep learning methods
In order to understand the whole Word2Vec process, a simpler model has been created,
considering three cleaned sentences. First of all, Word2Vec will create a dictionary including
all the words existent in the corpus and their corresponding indices: {'pretty': 0,
'little': 1, 'home': 2, 'rude': 3, ...}
18 https://scikit-learn.org/stable/modules/svm.html
It can be noticed that this procedure starts with the second word and ends with the
second-to-last one, since the window size of 2 is split between the left and the right side of
the centre word. The corresponding probabilities defined in the Background section will be
computed for these pairs, thus determining the similarity between different words.
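The pair-generation step can be sketched as below; the four-word sentence is hypothetical, and, as in the example above, only words with a full context window (one word on each side) produce pairs:

    def skipgram_pairs(tokens, half_window=1):
        """Enumerate (centre, context) pairs for words with a full window."""
        pairs = []
        for i in range(half_window, len(tokens) - half_window):
            for j in range(i - half_window, i + half_window + 1):
                if j != i:
                    pairs.append((tokens[i], tokens[j]))
        return pairs

    print(skipgram_pairs(["pretty", "little", "home", "rude"]))
    # [('little', 'pretty'), ('little', 'home'), ('home', 'little'), ('home', 'rude')]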
When this algorithm is applied to a much larger corpus such as the one used for this research,
the Skip-gram model will have a structure similar to the one displayed in Figure 3.9.
The number V represents the size of the vocabulary that is constructed initially and that
contains all the words and their corresponding positions. The Input layer, represented by a
one-hot vector for a word w_m, feeds a V × E Hidden layer whose weights are learned during
training, where E denotes the size of the word embedding decided when training the Word2Vec
model. This layer is then fully connected to the Output layer, also called Softmax, which
calculates each probability p_i that, if w_m were randomly chosen, the word w_i would appear
in its context. In other words, it generates all the probabilities that depict how similar two
words are to each other.
The Multi-layer Perceptron (MLP) method used in Wang (2017) has been adopted in this
paper, with a different vector size, as well as a different hidden layer size for the MLP
classifier (300). Figure 3.10 shows how the entire process is conducted. The Word2Vec component
initially contains the numerical vectors corresponding to each post. Since each post has a
different length, so will each vector; therefore, all the vectors need to be averaged to have
length 300 before being sent to the Input layer. The latter is fully connected to three Hidden
layers, which are also fully connected to each other. The activation function for the Hidden
layers is tanh, while the activation function applied to the Output layer is softmax.
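A sketch of this setup, assuming w2v is the trained Word2Vec model and tokenised_train/y_train hold the tokenised posts and their verdicts:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def average_vector(tokens, model, size=300):
        # Average all word vectors of a post so every input has length 300
        vectors = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vectors, axis=0) if vectors else np.zeros(size)

    X_train = np.array([average_vector(p, w2v) for p in tokenised_train])
    mlp = MLPClassifier(hidden_layer_sizes=(300, 300, 300),  # three hidden layers
                        activation="tanh").fit(X_train, y_train)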
Likewise, the architecture of the Bidirectional LSTM (Bi-LSTM) used for the other approach can
be seen in Figure 3.11. This method has not been used before with the same data-set, but it has
been adopted for data-sets created from posts collected from other Reddit communities (Ting, 2015).
All the existing layers (Input, Embedding, Bi-LSTM, Dense and Output) have been initialised
by using Keras. The Sequences component represents a pre-processing step through which
all tokenised posts are transformed into numerical sequences through Keras's tokeniser19. These
sequences will all have length 300, just like the Word2Vec vector size, so each word from the
tokenised post is mapped to an encoding. The layer connected to Input will be an Embedding20,
initialised with the weights computed from Word2Vec. The embedding matrix is calculated based
on the weight matrix, which holds V (the size of the vocabulary) word vectors, and on the values
sent from each input sequence, where each value corresponds to a word. If a word w with encoding
e from a sequence x_i exists in the vocabulary, its corresponding vector from the weight matrix
will be stored together with e in the embedding matrix. This process is repeated for each of the
300 words in the input sequence, hence building an Embedding matrix of size 300 × 300.
The embedding matrix will itself be connected to a Bidirectional LSTM layer which contains
a Bi-LSTM unit. All 300 vectors will be processed in both directions, so 600 values will
be passed further to a Dense (fully-connected) layer of 64 neurons. This extra hidden layer
(activated with relu) helps in calculating the values of the Output layer. By applying softmax
again, the final probabilities are obtained.
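One possible way to assemble this architecture in Keras is sketched below, assuming V is the vocabulary size and w2v_weights the V × 300 weight matrix extracted from the trained Word2Vec model:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(300,)),                 # numerical sequences of length 300
        layers.Embedding(input_dim=V, output_dim=300,
                         weights=[w2v_weights]),    # initialised from Word2Vec
        layers.Bidirectional(layers.LSTM(300)),     # 2 x 300 = 600 values
        layers.Dense(64, activation="relu"),        # extra hidden layer
        layers.Dense(4, activation="softmax"),      # one probability per verdict
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])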
3.3.4 DistilBERT
As explained in the Background section, Bidirectional Encoder Representations from Trans-
formers (BERT) represents one of the state-of-the-art transformer models used in NLP and,
implicitly, in text classification tasks. Alhassan et al. (2022) made use of several BERT-based
models retrieved from Tensorflow hub21 and fine-tuned them on a very powerful graphics
processor, an NVIDIA Tesla P100 GPU. This resource has allowed them to make use of large
language models of up to 160GB corpus size, with hundreds of millions of parameters. Since
Google Colab22 has been used for this study, the resources were more limited (16GB RAM),
which is why the smaller version of BERT, DistilBERT, has been preferred for this experiment.
19 https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/
Tokenizer
20 https://keras.io/api/layers/core_layers/embedding/
21 https://www.tensorflow.org/hub
22 https://colab.research.google.com/
The machine learning method used for this task is called transfer learning, and it relies on
using a pre-trained model as a starting point for a completely new task. In this case, a
pre-trained TFDistilBertModel23 will be integrated into our Keras model in order to process
the posts. The latter have to be tokenised accordingly into a DistilBERT encoding, by using the
DistilBERT tokeniser.
One thing to mention before delving into the tokenisation process is that BERT models are
able to distinguish between the contexts of different words very easily, so lemmatisation and
stop-word removal will not be necessary here; the posts will just be cleaned by using the other
aforementioned tools, such as removing punctuation or any other unnecessary symbols. After
cleaning the posts in this way, two different lists will be created: one containing the token
ids assigned to each word through a huge vocabulary dictionary (BERT uses WordPiece24), and the
other containing what is known as the attention mask, which marks with 1 the positions holding
real tokens and with 0 the padding positions that follow them. Both lists will have size 512,
since this is the maximum number of tokens any BERT model can process.
In order to understand this process better, consider the following sentence s: "I am
overreacting to what it was said to me!". The following steps will be applied
to compute the token ids and the attention mask corresponding to s:

1. Tokenisation - split s into WordPiece tokens, where rare words are broken into sub-tokens
(e.g. overreacting is split into three sub-tokens, visible as the ids 2058, 16416 and 11873 below)

2. Special tokens & padding - wrap the token sequence with the [CLS] and [SEP] markers and
append [PAD] tokens until the sequence reaches length 512

3. Sequence-to-ids - convert the sequence into ids by replacing each token with the id to which
it is mapped in the vocabulary (e.g. [CLS] is mapped to 101, [SEP] to 102, [PAD] to 0)

Token ids: [101, 1045, 2572, 2058, 16416, 11873, 2000, 2054, 2009, 2001, 2056, 2000, 2033, 102, 0, 0, 0, ..., 0]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ..., 0]
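This tokenisation can be reproduced with the HuggingFace tokeniser; a minimal sketch:

    from transformers import DistilBertTokenizerFast

    tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
    enc = tokenizer("I am overreacting to what it was said to me!",
                    padding="max_length", truncation=True, max_length=512)

    print(enc["input_ids"][:16])       # [101, 1045, 2572, 2058, ...]
    print(enc["attention_mask"][:16])  # 1 for real tokens, 0 for padding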
These two lists [t_1, t_2, ..., t_n] and [m_1, m_2, ..., m_n] will represent the Input tensors
of our Keras model, whose architecture can be seen in Figure 3.12.
23 https://huggingface.co/docs/transformers/model_doc/distilbert
24 https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt
These tensors will be processed by the pre-trained TFDistilBertModel from HuggingFace, which
has 6 transformer layers and outputs a feature matrix of size 512 × 768 for each input sentence.
A Pooling layer is then applied, compressing each feature matrix so as to obtain a single vector
of 768 values per input sentence. These 768 values are fed into the Dense layer, which has 64
neurons. Just like in the Bi-LSTM case, relu is used as the activation function for this layer.
Finally, the values computed from the weights of this layer are saved into the Output layer,
where the softmax function used in the previous tasks produces the probability for each class.
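A sketch of this architecture, assuming average pooling is used as the Pooling layer (the exact pooling choice is not specified above):

    import tensorflow as tf
    from transformers import TFDistilBertModel

    ids = tf.keras.layers.Input(shape=(512,), dtype=tf.int32, name="input_ids")
    mask = tf.keras.layers.Input(shape=(512,), dtype=tf.int32, name="attention_mask")

    bert = TFDistilBertModel.from_pretrained("distilbert-base-uncased")
    features = bert(input_ids=ids, attention_mask=mask).last_hidden_state  # 512 x 768
    pooled = tf.keras.layers.GlobalAveragePooling1D()(features)            # 768 values
    hidden = tf.keras.layers.Dense(64, activation="relu")(pooled)
    output = tf.keras.layers.Dense(4, activation="softmax")(hidden)

    model = tf.keras.Model(inputs=[ids, mask], outputs=output)
    model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])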
3.4 RQ3 - integrate best model and other features on web platform
The third research question revolves around selecting the best performing model found during the
ML process detailed in the previous subsection and integrating it on a web platform, where users
can interact with it. The platform will have the following features:
• two text boxes where the user can input a moral judgement story along with a convenient
title
• the results returned by the AI, which comprise the following components:
1. the VERDICT returned by the model with regard to the given scenario, one of Not
the asshole (NTA), You the asshole (YTA), Everyone sucks here (ESH) and No
assholes here (NAH)
2. an EXPLANATION returned by ChatGPT3.5, which attempts to justify the given verdict
3. a RECOMMENDATION returned by ChatGPT3.5, meant to guide the user accordingly,
given the scenario they wrote
Similar web platforms have been created before, such as AskDelphi25 and the quite similar
AreYouTheAsshole?26. The former has been trained on multiple data-sets and can perform various
25 https://delphi.allenai.org/
26 https://areyoutheasshole.com/
tasks, one of which is moral reasoning, but it is more focused on returning a result such as
it is right, it is acceptable or it is wrong, so it cannot deal with more complex stories where
other parties are involved. At the same time, the other one, also called AYTA, is trained on
several data-sets collected from r/AITA, which include redditors' comments as well, but it
lacks complexity, as it is able to predict only two classes.
The platform created as part of this research will include all 4 respective classes, so it will
extend the approach followed by AYTA; but instead of using the comments from Reddit (due to
memory issues and time constraints), it will provide the user with an explanation and a
recommendation computed by ChatGPT3.5. To make sure the platform would be usable, a human
experiment involving 20 participants has been conducted. The experiment consists of giving a
scenario to a participant and asking them to judge it with one of the 4 verdicts. After they
answer, they are able to input the scenario on the platform and see what the AI's opinion is.
Furthermore, the participant is also asked to rewrite the scenario such that the verdict
(according to THEM) would be different, and then test it again with the AI.
The experiment has been conducted through a questionnaire, which can be found at this link27.
In total, 5 scenarios have been used, and for each scenario the participants have provided 4
more new stories covering all 4 verdicts. The participants have also been asked to provide
their feedback with regard to the verdicts, explanations and recommendations generated by the
AI. By using the explanation generated by ChatGPT3.5, it can be inferred whether the model
outputs a completely wrong verdict, namely when the explanation does not manage to justify the
verdict accordingly. Furthermore, even though the recommendation feature is not related to the
verdict, it is interesting to explore whether it could actually give good advice in those
situations. Therefore, not only is it desirable to test the usability of the AI model, but also
how well a state-of-the-art model like ChatGPT3.5 performs on this task.
27 https://forms.office.com/e/KCHpmEPrpw
Chapter 4
4.1 RQ1
The distribution of all posts and the sentiment associated with them can be seen in Figure 4.1.
It can be observed that 72199 posts have been labeled as positive, compared to only 35394 which
tend to be more negative. In addition, a small sample of 1264 posts convey a neutral sentiment,
representing less than 2% of the total posts. In order to get a better idea of how the
sentiments are distributed over the posts, they have been plotted across each verdict, as can
be seen in Figure 4.2. Here, even though the number of positive posts clearly surpasses the
number of negative posts for each class, it can be noticed that, for the everyone sucks class,
the difference is not that huge.
After computing the coherence scores for both the LdaMulticore and LdaMallet approaches,
and for five different values of the number of topics, two plots have been provided in Figure 4.3.
It can be observed that adopting the MALLET approach has increased the topic coherence
significantly, by almost 10%. The cost of this approach shows in the execution time: LdaMallet
has taken around 3 hours to finish completely, while LdaMulticore terminated after 57 minutes.
By looking at LdaMallet's graph, a huge leap occurs after exceeding the limit of 20 topics,
which shows that choosing 20 as the optimal number of topics would work.
After reinitialising the LdaMallet model with T = 20, where T is the number of topics used
in Algorithm 2, its final topic coherence is 0.43. Each discovered topic has its own cluster of
words, each word being linked to a calculated weight. For example, Figure 4.4 describes one of
the 20 discovered topics through its 20 highest-weighted words.
From a human point of view, these words clearly suggest a major topic name, which is
Education. The same intuitive technique has been applied to all 20 detected topics, which have
also been merged, since some of the predominant words from certain topics overlap. In order to
preserve some sort of accuracy when it comes to renaming and merging the topics, the whole
process has been performed with ChatGPT, which has come up with the topic-words mapping shown
in Table 4.1. These operations have provided a better understanding of what kind of topics AITA
redditors prefer writing about: there are 13 main topics, some of which (such as Communication,
Family, TimeManagement or Relationships) cover more than one cluster of words computed by
LdaMallet, thus emphasising their prevalence and importance.
Table 4.1: The 13 main topics and their clusters of keywords, as renamed and merged with ChatGPT

Food: eat, food, make, aita, dinner, order, bring, cook, table, lunch
Events: party, wedding, birthday, christmas, gift, invite, family, year, wear, event
Entertainment: play, game, watch, aita, show, time, movie, make, video, people
Family: kid, wife, husband, child, daughter, son, year, baby, aita, law / mom, family, sister, parent, dad, brother, mother, year, father, house
Education: school, college, class, high, year, student, aita, make, study, teacher
Communication: call, phone, send, post, message, text, respond, picture, reply, find / start, call, talk, stop, hear, yell, apologize, cry, joke, angry
Work: work, job, week, coworker, company, day, time, boss, manager, business
Finance: pay, money, buy, give, month, back, rent, bill, save, card
Health: aita, bad, sick, smoke, hair, care, day, hospital, doctor, make
Housing: room, dog, house, move, roommate, live, clean, apartment, cat, door
TimeManagement: day, time, home, night, hour, leave, work, week, sleep, stay / move, time, plan, live, week, trip, stay, month, year, place
Relationship: life, thing, issue, feel, time, partner, support, problem, lot, mental / feel, make, boyfriend, thing, upset, time, bad, talk, bf, aita / guy, pretty, thing, stuff, drink, aita, bit, asshole, back, shit / friend, girl, good, people, group, talk, hang, close, guy, person / girlfriend, relationship, date, break, year, month, talk, start, meet, ago
Transportation: car, drive, walk, back, sit, aita, minute, front, wait, people
The distribution of the newly merged and renamed topics over the posts can be seen in Figure 4.5.
The number of posts represents how many posts have a certain topic as their dominant one over
the other existing topics. Relationships tend to be discussed the most among redditors, with
the mention that all kinds of relationships are included in this cluster, according to Table 4.1.
Family and Housing are the next 2 topics on the list, while Education and Health find themselves
at the other end of the list. Another interesting pattern to notice is that Housing surpasses
topics like Communication or TimeManagement, even though the latter are the result of a merge
between clusters. In another example from the literature, Gonda's analysis performed on a
similar data-set of posts placed Housing at the top of the list, followed by Family. However,
the approach adopted there did not include a model trained by using the MALLET implementation.
Figure 4.5: Topic distribution over posts after renaming and merging
Another analysis has been performed on the distribution of topics across posts in terms of
the verdict each post has been labelled with. Since Figure 4.2 confirms the uneven distribution
of posts with respect to the verdict, the percentage distribution has been illustrated instead
in Figure 4.6, from which more interesting patterns can be extracted.
For instance, for topics such as Entertainment, Education or Food, the percentage of posts
labeled as not the asshole drops below 60%, while the share of posts labeled as asshole is
slightly higher than for the other topics. Moreover,
posts whose dominant topic is Communication are more likely to be labeled with everyone sucks
than with no assholes here, suggesting the problems that occur in day-to-day communication
with others. Another argument that confirms this belief is that the Communication topic also
reports an almost equal percentage of negative and positive posts, as seen in Figure 4.7.
The same pattern applies to the Transportation topic, whereas posts related to Health are
more likely to be negative, which was expected due to some of the top 10 words, such as
sick, bad or hospital (Table 4.1). In contrast, the Events topic is linked to positive posts
with a probability exceeding 80%, since the cluster of words that defines this topic contains
words with a positive connotation: party, birthday or wedding.
4.2 RQ2
The experiments described in the Methodology section have been conducted on Google Colab.
Each approach has been applied to each subset, and the quality of the predictions has been
assessed through both accuracy1 and recall per class2. The reason for using the latter as well
comes from how misleading the accuracy metric can be in some cases: since it is desired to
build a model that is able to predict well for all classes, their corresponding recalls have
been compared.
1 https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
2 https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html
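A brief sketch of how both metrics can be obtained with Sklearn, given true verdicts y_test and predictions y_pred:

    from sklearn.metrics import accuracy_score, recall_score

    accuracy = accuracy_score(y_test, y_pred)
    recalls = recall_score(y_test, y_pred, average=None)  # one recall value per class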
4.2.1 aita_2500
aita_2500 represents the most imbalanced subset, with the NTA class dominating the others.
Consequently, as seen in Table 4.2, all the models employed for this task have been extremely
biased towards the majority class, with the Tf-idf and logistic regression model managing to
provide the best performance, as can be seen in the recall per class section. Nonetheless, the
SMOTE method applied with logistic regression and multinomial naive bayes proves the benefits
of balancing the data mentioned in the literature studies (Alhassan et al., 2022; O'Brien,
2020; Wang, 2017). The SMOTE approach has generated a more balanced recall per class for both
approaches, featuring 40% overall accuracy for the logistic model and 37% for the MNB. While
their accuracy is better than that of a random model in a 4-class classification task (25%),
the two models are also more likely to predict each class more accurately than a random model.
4.2.2 aita_NTA_1000
After reducing the majority class by almost 50%, even though the overall accuracy has decreased,
the recall has slightly improved for the other classes, as Table 4.3 depicts. Tf-idf with
logistic regression has managed to provide better results than MNB, featuring a decent recall
for the YTA class, but still lacking a bit in performance when predicting the other two minority
classes. In addition, this method has surpassed the performance of Word2Vec with MLP in terms
of recall for each of the three minority classes. However, the Bi-LSTM model has achieved the
highest overall accuracy (58%), along with the best recall for YTA, but DistilBERT has performed
similarly, with both having been trained for five epochs.
4.2.3 aita_3_balanced
Moving on to the third subset, created by balancing the first three classes (NTA, YTA and ESH),
it can be observed in Table 4.4 that the recall per class has improved significantly for all
models. Logistic regression has obtained the same results as Bi-LSTM and DistilBERT with
regard to the recall for all classes but NTA; the recall for the latter has only been 63%, being
surpassed by the Word2Vec and transformer models. MLP is still quite comparable with
logistic regression in this case, both of them achieving 49% overall accuracy and similar recall
values. However, Bi-LSTM and DistilBERT top the table with 58% accuracy, with a much better
recall for NTA than the others. Overall, it can be noticed that the best performing models still
lack in predicting some of the minority classes (i.e. Bi-LSTM with only 25% recall and
DistilBERT with 30% recall for ESH).
4.2.4 aita_4_balanced
This fourth subset, at the same time the most balanced one, has generated the most satisfactory
results overall. A general trend observed between the results recorded in Table 4.5 and those
in Table 4.4 is that, in the case of four balanced classes, not only did the general accuracy
improve for most models, so did the recalls, with three out of four almost reaching 50%.
Bi-LSTM has achieved 62% accuracy, matching the value obtained for the big subset, with the
difference that the recalls have increased in percentage. Bi-LSTM still struggled to predict
the instances of one of the minority classes, for which its recall has been 30%. Its homologue
DistilBERT, despite achieving a slightly lower accuracy of 61%, has managed to surpass the
former when predicting those instances, with a recall of 42%. The trade-off has affected the
NTA class, for which it has obtained 81% recall, lower than the 93% obtained by Bi-LSTM. With
regard to the others, logistic regression with tf-idf has obtained 50% accuracy with decent
recall values, since all of them exceeded the 25% threshold, unlike MNB and MLP. Another
observation here is that the models have managed to predict much better the posts labeled as
NTA and ESH, both verdicts implying that there is at least one asshole in the respective
scenarios.
4.3 RQ3
4.3.1 Platform & experiment details
The best performing model found in the previous section has been integrated into the web
platform, whose usability has been tested through the human experiment. In this way, it becomes
clear whether the performance shown on the AITA test set matches the one obtained on other
people's stories. The design of the platform can be visualised in Figure 4.10. Each user has to
provide a story of maximum 1200 characters, along with an adequate title that starts with the
acronym AITA.
3 https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
Figure 4.8: Confusion matrix showing the predictions returned by DistilBERT when tested on
aita_4_balanced. Each verdict NTA, YTA, ESH and NAH has been encoded with 0, 1, 2
and 3, respectively.
Figure 4.9: Number of posts classified correctly and incorrectly as YTA by the model per each
topic
After pressing the Submit button, the verdict, explanation and recommendation will be generated
on the right side within a few seconds. In case users get confused by the process, they can
always press the Help button, which provides them with some instructions. Additionally, for
testing purposes, an extra Random scenario button has been implemented, so that a user can try
a pre-made scenario to see how the platform works, without having to write their own. Other
basic tools for clearing the two text areas have also been added to the platform: Clear title,
Clear story and Clear both.
The verdict is generated after the program processes the content of the title concatenated with
the content of the story, transforming it into a BERT encoding that is passed into the
DistilBERT model. The explanation and recommendation are generated by ChatGPT3.5 through two
requests that are sent to ChatGPT's API along with two specific queries: the first query asks
ChatGPT to explain the returned verdict for the given scenario, whereas the second asks ChatGPT
to give a recommendation for the situation described in the story.
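The flow can be sketched as below; the route name, the predict_verdict helper and the prompt wording are illustrative placeholders rather than the exact contents of main.py, and the 2023-era openai client interface is assumed.

    import openai
    from flask import Flask, request, jsonify

    app = Flask(__name__)

    def ask_chatgpt(prompt):
        # One request to ChatGPT's API per query
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}])
        return response.choices[0].message.content

    @app.route("/judge", methods=["POST"])
    def judge():
        story = request.form["title"] + " " + request.form["story"]
        verdict = predict_verdict(story)  # hypothetical wrapper around the DistilBERT model
        explanation = ask_chatgpt(f"Explain why this story is {verdict}: {story}")
        recommendation = ask_chatgpt(f"Give a recommendation for this situation: {story}")
        return jsonify(verdict=verdict, explanation=explanation,
                       recommendation=recommendation)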
After all 20 participants finished completing the questionnaire form, the answers4 have
been collected and analysed accordingly with Microsoft Forms, which has created figures that
summarise the answers to the multiple-choice questions. As for the answers that included free
text, a qualitative analysis has been performed and the most important remarks have been
extracted.
The verdicts returned by the AI have matched the participants' expectations only for Scenario
4, and half of them have been satisfied with the verdict generated for Scenario 2, while the other
4 The answers can be found in the Questionnaire_results.xlsx file located at https://tinyurl.com/34fse4vv
three scenarios show disagreement with the verdicts. However, when it comes to the explanation,
the results are better for these three scenarios, suggesting that the verdicts returned by the
AI are not completely wrong, but only slightly (i.e. NTA instead of NAH). In fact, for Scenario
2 and Scenario 3 the participants have mostly agreed with the explanation, even though not at
all with the verdicts. The recommendation feature has proved to work really well, which shows
that ChatGPT3.5 is quite good when it comes to giving advice in certain situations.
The second part of the experiment has been the more challenging task for both the participants
and the AI: while the former were supposed to come up with modified versions of the given
scenario that would change the verdict, the latter has been properly tested on whether it is
able to spot certain details. The results from this task have been put altogether, with no
distinction between the scenarios, and can be seen in Figure 4.12. Overall, the AI has managed
to fulfil the expectations of the participants in 50% of the cases, whereas the explanation has
managed to justify the verdicts in 55% of all cases. This suggests an overall improvement from
the first experiment, where most of the feedback has been negative towards the verdicts, as
well as a decent potential that the DistilBERT model shows when it comes to differentiating
between similar scenarios. The recommendation feature has worked remarkably once again, with
only a few complaints with regard to it.
4.3.3 Qualitative analysis
The participants have been asked to provide feedback after each task, and this section
summarises some of the most interesting suggestions and opinions that have been collected.
Among the main strengths of the web platform, the following have been considered worth
mentioning:
• three of the participants have actually changed their opinion on the verdict of the story
after reading the explanation
• the experiment has proved to be really interesting for the participants and they see them-
selves using such a tool in the future
• most participants have been impressed with the explanation and recommendation features
and how on point these are
• some participants think that the verdict, explanation and recommendation returned by the
AI are human-like
• the explanations have been reasonable in some cases where the verdict was wrong
On the other hand, the participants have also pointed out the following weaknesses:
• some recommendations were just explaining the scenario and not giving actual advice
• the classification could be improved overall, since in some cases the model did not manage
to spot the changes in some scenarios
• the model has not found enough "fault" in some scenarios even though, from a human
perspective, the people involved were acting immorally
• the explanation and recommendation could have been more detailed and provided more
information
Chapter 5
Conclusions
5.1 Achievements
Overall, this paper has focused on providing a possible solution for implementing an AI that
could help people with moral judgement, based on online posts collected from one of Reddit's
biggest communities, r/AmITheAsshole (r/AITA). Among the sub-goals that have been achieved lie
the detailed analysis of the given data-set, which has provided interesting results related to
the topics discussed in the subreddit, and the comparison between different ML models on a
text classification task.
This paper has managed to extend some of the previous attempts to implement such a model
(Alhassan et al., 2022; Wang, 2017), since it has provided a solution for 4-class classification,
which has not been tried before. The issues concerning the imbalanced data-set, due to an uneven
distribution of the verdicts, have eventually been solved by creating different subsets which
are more balanced. Adopting this approach has led to improving the recall of each class and,
implicitly, the overall performance, with two models reaching 61% and 62% accuracy.
Moreover, the third research question discussed in this study provides a novel solution for an
AI system that could help people with moral dilemmas, one that differs from other AIs such as
AskDelphi or AreYouTheAsshole (AYTA) because of the explanation and recommendation features
given by ChatGPT3.5. The reason for opting for the latter to perform this task comes from the
willingness to include the most popular AI tool in the study, since people use it very often.
This decision has proved to be successful as well, in spite of recent concerns that ChatGPT
might sometimes provide inconsistent advice (Krügel et al., 2023). The participants that took
part in the human experiment involving the newly created tool have shown a degree of
satisfaction with the whole idea, suggesting that they would be willing to use such a system
in the future.
maybe also improve the performance of the AI, as well as make it able to return explanations
based on the same comments. However, this process might require plenty of time, since it would
be challenging to filter only the comments considered to be morally right, although r/AITA
labelling the top comments for each post might make it easier. A suggestion could, therefore,
be to make the model able to generate its own explanation and then compare it to the
explanation generated through ChatGPT. Not only would this compare the performance of a model
trained specifically for moral judgement tasks to that of a state-of-the-art AI trained on
various data, but it would also show whether ChatGPT could actually be usable for these sorts
of tasks.
5.2.2 Use explainable tools to understand the models better
Unfortunately, time and memory constraints have not allowed this study to provide further
explanations concerning the trained models. For instance, Keras provides a feature called an
Attention layer1 that can be attached to any Keras model and used to explain the predictions
returned by the AI. This could lead to a further investigation into improving the overall
performance, as well as a better idea of how the AI processes the text. Additionally, Python's
SHAP library2 provides great visualisation tools that can be used to see how the model
evaluates the input and returns a probability as an output. This method has been tried, but
some errors related to the library have made it unusable, so this could potentially be
an improvement for the future, too.
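For reference, a typical usage pattern for SHAP on top of a HuggingFace text classification pipeline is sketched below; the model name is an illustrative placeholder, and this is the kind of call that failed in this project rather than code that ran successfully here.

    import shap
    from transformers import pipeline

    # Hypothetical fine-tuned checkpoint; any text-classification model works
    clf = pipeline("text-classification",
                   model="distilbert-base-uncased-finetuned-sst-2-english",
                   top_k=None)

    explainer = shap.Explainer(clf)
    shap_values = explainer(["AITA for leaving the party early?"])
    shap.plots.text(shap_values)  # visualise per-token contributions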
5.2.3 Extend the 4-class classification
The current model is only able to perform 4-class classification based on the four classes
(verdicts) on which it has been trained. The AI that has been implemented on the platform
processes any type of input, which means that it will return a result no matter what. However,
there are definitely cases which either require more information in order to establish the
final verdict for sure, or which might simply have no meaning. Both of these situations, along
with other ones, could be explored further, and a solution could be to filter posts from
r/AITA that are labeled as INFO, which means that not enough information has been provided.
However, it is not guaranteed that there are enough posts classified with this verdict.
5.3 Self-evaluation
Personally, I have found this project both exciting and challenging for plenty of reasons.
Dealing with huge data-sets and with text classification tasks requires plenty of research
time and a proper choice of resources as well. I think that I have struggled a bit with the
latter, since I had not initially expected that I would need a powerful GPU to train my models
if I wanted better time performance. Memory management represented the main issue I have
encountered throughout the project, but acquiring Colab Pro has helped me reduce the execution
time considerably. Overall, I have greatly enjoyed exploring the world of NLP and text
classification, since it has helped me greatly improve the skills acquired in the ML-based
courses I had attended before at university.
1 https://keras.io/api/layers/attention_layers/attention/
2 https://shap.readthedocs.io/en/latest/
Appendix A
Maintenance manual
In order to install and use the system, you have to follow the steps mentioned below:
1. Python 3.9.6 or above has to be installed; you can do it from the official website
https://www.python.org/downloads/
2. The main system is written in Python's framework Flask, which can be installed as described
on the official website https://flask.palletsprojects.com/en/2.3.x/
3. Unzip the zip file aita_system.zip and extract all files. The main files of the project
are organised as follows:
/
├── main.py
├── requirements.txt
├── templates/
│   └── index.html
├── static/
│   ├── layout_utils.js
│   └── scenarios.json
├── models/
│   └── distilbert_model.h5
└── notebooks/
    ├── datasets/
    │   ├── aita_dataset.csv
    │   └── aita_preprocessed.csv
    ├── RQ1.ipynb
    └── RQ2.ipynb
4. requirements.txt contains all the dependencies that need to be installed in order to make
the program work. Open a command line and run the following command to install them:
pip install -r requirements.txt
5. main.py represents the main file of the program that runs the entire software. This can be
done by using the following command: flask --app main run
6. The notebooks folder contains the two Jupyter notebooks where the data analysis has been
performed, along with a datasets folder. It is recommended that you open the notebooks in
the .ipynb format, since the cells contain output which is relevant to the experiments (e.g.
figures, charts, model training across epochs). Additionally, in order to avoid extra
dependency conflicts, it is highly recommended to open the notebooks in Google Colab, where
the experiments themselves have been performed. Each cell can be run separately, but keep in
mind that certain cells will take a long time to run, due to the nature of this task, which
has involved long hours of training certain ML models. A pre-trained model is provided in the
models folder, and that one is also used for the AI tool integrated in main.py. The datasets
subfolder contains the necessary information scraped from Reddit's r/AITA, as well as new
information added during the data analysis process (e.g. topics per document, cleaned posts).
Both RQ1.ipynb and RQ2.ipynb have their own subsections, but one of them, called Utils,
contains the main functions that need to be run before running the further experiments. All
the cells contained in this section can be run at once, thanks to Jupyter Notebook's format.
Appendix B
User manual
In order to run and interact with the system, you have to follow the steps mentioned below:
1. Open a command line, go to the directory where you have saved the system and run the
following command, which will compile the Flask app and open a local session of the project:
flask --app main run
2. Go to the link provided in the command line, which should output something like this:
Running on http://127.0.0.1:5000. The web platform should have the following
interface:
3. Write your scenario on the left side by giving it a title and an appropriate story. The
limit of the story is 1200 characters, so you will not be allowed to write more. When you are
ready, press the Submit button and wait a few seconds until the results (verdict, explanation
and recommendation) are generated on the right side. Alternatively, in case you do not have
any inspiration for a scenario, press the Random scenario button and the platform will
automatically provide a custom scenario for you instead.
5. Whenever you encounter any difficulties, press the grey HELP button, which will display a
modal containing specific instructions with regard to the platform:
Bibliography
Adewumi, T., Liwicki, F., and Liwicki, M. (2022). Word2vec: Optimal hyperparameters and their
impact on natural language processing downstream tasks. Open Computer Science, 12(1):134–
141.
Agarwal, B., Mittal, N., Bansal, P., and Garg, S. (2015). Sentiment analysis using common-sense
and context information. Intell. Neuroscience, 2015.
Alhassan, A., Zhang, J., and Schlegel, V. (2022). ’am i the bad one’? predicting the moral judge-
ment of the crowd using pre-trained language models. pages 267–276. European Language
Resources Association.
Almeida, F. and Xexéo, G. (2019). Word embeddings: A survey.
Anees, A. F., Shaikh, A., Shaikh, A., and Shaikh, S. (2020). Survey paper on sentiment
analysis: Techniques and challenges. EasyChair.
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung,
W., Do, Q. V., Xu, Y., and Fung, P. (2023). A multitask, multilingual, multimodal evaluation of
chatgpt on reasoning, hallucination, and interactivity.
Botzer, N., Gu, S., and Weninger, T. (2021). Analysis of moral judgement on reddit. CoRR,
abs/2101.07664.
Chauhan, U. and Shah, A. (2022). Topic modeling using latent dirichlet allocation: A survey.
ACM Comput. Surv., 54:145:1–145:35.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic
minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.
Chen, J. Q., Qi, K., Zhang, A., Shalaginov, M. Y., and Zeng, T. H. (2022). Covid-19 impact on
mental health analysis based on reddit comments. pages 2253–2258. IEEE.
Chowdhary, K. R. (2020). Natural language processing. pages 603–649.
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent
neural networks on sequence modeling.
Cui, Z., Ke, R., Pu, Z., and Wang, Y. (2019). Deep bidirectional and unidirectional lstm recurrent
neural network for network-wide traffic speed prediction.
Cunn, N. (2019). Am I the asshole? http://www.nathancunn.com/2019-04-04-am-i-the-asshole/.
Dale, R. (2017). The commercial nlp landscape in 2017. Natural Language Engineering, 23:641–
647.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). Bert: Pre-training of deep bidirec-
tional transformers for language understanding.
Efstathiadis, I. S., Paulino-Passos, G., and Toni, F. (2022). Explainable patterns for distinction and prediction of moral judgement on reddit.
Newman, D., Lau, J. H., Grieser, K., and Baldwin, T. (2010). Automatic evaluation of topic co-
herence. In Human language technologies: The 2010 annual conference of the North American
chapter of the association for computational linguistics, pages 100–108.
Nguyen, T. D., Lyall, G., Tran, A., Shin, M., Carroll, N. G., Klein, C., and Xie, L. (2022). Mapping
topics in 100,000 real-life moral dilemmas. Proceedings of the International AAAI Conference
on Web and Social Media, 16(1):699–710.
O’Brien, E. (2020). Aita for making this? a public dataset of reddit posts about moral dilemmas.
Proferes, N., Jones, N., Gilbert, S., Fiesler, C., and Zimmer, M. (2021). Studying reddit: A
systematic overview of disciplines, approaches, methods, and ethics. Social Media+ Society,
7(2):20563051211019004.
Ramadhan, W., Astri Novianty, S., and Casi Setianingsih, S. (2017). Sentiment analysis using
multinomial logistic regression. In 2017 International Conference on Control, Electronics,
Renewable Energy and Communications (ICCREC), pages 46–49.
Röder, M., Both, A., and Hinneburg, A. (2015). Exploring the space of topic coherence measures.
In Proceedings of the eighth ACM international conference on Web search and data mining,
pages 399–408.
Salehinejad, H., Sankar, S., Barfett, J., Colak, E., and Valaee, S. (2018). Recent advances in
recurrent neural networks.
Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2020). Distilbert, a distilled version of bert:
smaller, faster, cheaper and lighter.
Sivek, S. C. (2021). Am I the... data geek who analyzed reddit AITA posts? Yes.
https://towardsdatascience.com/am-i-the-data-geek-who-analyzed-reddit-aita-posts-yes-4954a8d37055.
Syed, S. and Spruit, M. (2017). Full-text or abstract? examining topic coherence scores using la-
tent dirichlet allocation. In 2017 IEEE International Conference on Data Science and Advanced
Analytics (DSAA), pages 165–174.
Symeonidis, S., Effrosynidis, D., and Arampatzis, A. (2018). A comparative evaluation of pre-
processing techniques and their interactions for twitter sentiment analysis. Expert Systems with
Applications, 110:298–310.
Ting, J. (2015). A look into the world of reddit with neural networks.
Umer, M., Sadiq, S., Missen, M. M. S., Hameed, Z., Aslam, Z., Siddique, M. A., and NAPPI,
M. (2021). Scientific papers citation analysis using textual features and smote resampling tech-
niques. Pattern Recognition Letters, 150:250–257.
V M, N. and Kumar R, D. A. (2019). Implementation on text classification using bag of words
model.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and
Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wal-
lach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information
Processing Systems, volume 30. Curran Associates, Inc.
Wang, I. (2017). “am i the asshole?”: A deep learning approach for evaluating moral scenarios.
CS230: Deep Learning, Spring 2020, Stanford University.
Yao, L., Mimno, D., and Mccallum, A. (2009). Efficient methods for topic model inference on
streaming document collections. In Proceedings of the 15th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining.