Social Media Narratives


NAVIGATING SOCIAL MEDIA NARRATIVES

A Dissertation

Submitted to the Graduate School
of the University of Notre Dame
in Partial Fulfillment of the Requirements
for the Degree of

Doctor of Philosophy

by
Nicholas Botzer

Tim Weninger, Director

Graduate Program in Computer Science and Engineering


Notre Dame, Indiana
July 2023
NAVIGATING SOCIAL MEDIA NARRATIVES

Abstract
by
Nicholas Botzer

As social media continues to shape modern society, understanding how individuals
engage in these online spaces and form overarching narratives is crucial. Narratives
have a significant impact on people’s lives by influencing public opinion and shaping
society’s values. Although these narratives are often known to occur, it can be
challenging to completely understand them and what aspects of messaging resonate with
people. With the recent advances in deep learning and natural language processing,
methods can now be developed to help understand narratives on social media from
a variety of perspectives.
In this dissertation I investigate the development and impact of narratives on
social media by examining three key dimensions: moral judgements, conversational
flow, and user intent. Moral judgements have become a predominant way for individuals
to express their outrage and gather support for social and cultural issues.
These judgements are often formed into a narrative to highlight differences between
groups, but little has been done to understand broad patterns that may occur. Narratives
also emerge from the way conversations flow on social media, where like-minded
individuals form communities that often develop into echo chambers of discussion.
These groups discuss topics in a repetitive manner, and I seek to understand this by
modeling conversational flow with a graph-based model. Finally, I study narratives
from the perspective of user intent. Intent classification is a challenging problem
that often requires laborious data annotation to be applicable to targeted domains.
Understanding of user intent on social media is lacking due to this barrier. To alleviate
this issue, I approach the problem in a semi-supervised manner to make it easier to
achieve high performance while requiring minimal annotation effort. This will allow
studies of user intent on social media to be conducted for targeted issues with only
a small annotation effort. The findings provide valuable insights into the dynamics of online
discourse and the factors that drive the formation and propagation of narratives on
social media.
To my parents.

CONTENTS

Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Moral Judgement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Conversational Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Intent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

Chapter 2: Analysis of Moral Judgement on Reddit . . . . . . . . . . . . . . . 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2 Findings in Brief . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Data and Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Text Classifier for Moral Judgement . . . . . . . . . . . . . . . 15
2.2.2 Judgement Classification Results . . . . . . . . . . . . . . . . 17
2.2.3 Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Analysis of Moral Judgement . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Transferability . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Moral Valence and Popularity . . . . . . . . . . . . . . . . . . 23
2.4 Assigning Judgements to Users . . . . . . . . . . . . . . . . . . . . . 27
2.4.1 Three Types of Negative Users . . . . . . . . . . . . . . . . . 30
2.4.2 Gender and Age Analysis . . . . . . . . . . . . . . . . . . . . 32
2.4.3 Logistic Regression Analysis . . . . . . . . . . . . . . . . . . . 35
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Chapter 3: Entity Graphs for Online Discourse . . . . . . . . . . . . . . . . . 38
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.1 Online Discourse Dataset . . . . . . . . . . . . . . . . . . . . . 42
3.2.2 Entity Linking . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.3 Entity Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Conversation Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4 Conversation Traversals . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.1 Spreading Activation . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Comparative Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.6.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

Chapter 4: TK-KNN: A Balanced Distance Based Semi-supervised Learning
Method for Intent Classification . . . . . . . . . . . . . . . . . . . . . . . . 66
4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Top-K KNN Semi-Supervised Learning . . . . . . . . . . . . . . . . . 72
4.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.2 Method Overview . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.3 Top-K Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.4 KNN-Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2.5 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . 78
4.3.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.3 Implementation Details . . . . . . . . . . . . . . . . . . . . . . 79
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.4.1 Overconfidence in Pseudo-Labelling Regimes . . . . . . . . . . 82
4.4.2 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.4.3 Upper Bound Analysis . . . . . . . . . . . . . . . . . . . . . . 84
4.4.4 Parameter Search . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

Chapter 5: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

FIGURES

2.1 Example of (a) a post title and (b) a comment in the /r/AmItheAsshole
subreddit. The NTA prefix and comment-score (not shown) indicate
that the commenter judged the poster as “Not the Asshole”. . . . . . 10
2.2 A post (in blue) made by a user along with the top response comment
(white). The comment is then fed to our Judge-BERT classifier (green)
to determine the moral valence of the post. . . . . . . . . . . . . . . . 20
2.3 A screenshot of the human annotation system. . . . . . . . . . . . . . 21
2.4 Plots showing the prediction of Judge-BERT versus the annotation
agreement by humans for both positive and negative classes. X-axis
displays the annotator agreement against the prediction made by Judge-
BERT. 5/0 represents all annotators agreeing and it matching the
model while 0/5 shows all annotators agreeing for the opposite class,
meaning a wrong prediction by the model. . . . . . . . . . . . . . . . 22
2.5 Judge-BERT analysis at various levels of annotator agreement. . . . . 23
2.6 Posts judged to have positive valence as a function of post score.
Higher indicates more positive valence. Higher post scores are as-
sociated with more positive valence (Mann Whitney τ ∈ [0.40, 0.47],
p < 0.001 two-tailed, Bonferroni corrected) . . . . . . . . . . . . . . . 25
2.7 Lorenz curve depicting the judgement inequality among users; Gini
coefficient = 0.515 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.8 Number of comments (normalized) as the negativity threshold is
raised. As the negativity threshold is raised the fraction of comments
revealed tends towards 1. Higher lines indicate a higher concentration
of negative users and vice versa. . . . . . . . . . . . . . . . . . . . . . 29
2.9 A diagram showing the posting habits of a Returner. Posts are in the
light blue boxes with blue arrows representing the order of posts. An
example of a post response is shown in the white box with the red
arrow representing the post it came from. Each post is prefaced with
the overarching title, “Me and my partner are having a baby.” followed
by the current update on the situation. The response comments have
also been condensed from their full length. . . . . . . . . . . . . . . . 31
2.10 Example of a post title from /r/relationships. Subreddit rules require
the poster to indicate their age and gender, as well as any other individuals’
gender and age. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.1 Illustration of an entity graph created from threaded conversations
from /r/politics (blue-edges) and /r/Conservative (red-edges). The x-axis
represents the (threaded) depth at which each entity was mentioned
within conversations, extracted from Reddit, rooted at Joe Biden. The
y-axis represents the semantic space of each entity, i.e., similar entities
are closer than dissimilar entities on the y-axis. Edge colors denote
whether the transition from one entity set to another occurs more
often from one group’s conversations than another’s. Node colors represent
equivalent entity sets along the x-axis. In this visualization we
observe a pattern of affective polarization as comments coming from
/r/Conservative are more likely to drive the conversation towards top-
ics related to the opposing political party. . . . . . . . . . . . . . . . 40
3.2 (Left) Example comment thread with the post title as the root, two
immediate child comments, one of which has two additional child com-
ments. Entity mentions are highlighted in yellow. (Right) The result-
ing entity tree where each comment is replaced by their entity set.
Note the case where the mention-text Trump in the comment thread
is represented by the standardized entity-label Donald Trump in the
entity tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Paths extracted from the entity tree in Fig. 3.2(b) represented by di-
rected edges over entity sets. . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Entity graph constructed from a star-expansion of the entity tree in
Fig. 3.2(b) and the conversation paths in Fig. 3.3(c). This model represents
the entities, their frequent combinations, and the paths frequently
used in their invocation. . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5 Percent of the predictions made on the testing set that, on average,
exist in the training set for 5-folds. Higher is better. . . . . . . . . . . 50
3.6 Box plot of Word Mover’s Distance (WMD) as a function of the conversation
depth ℓ. Lower is better. Box plots represent WMD-error of
entity representations predicted by the narrative hypergraph over all
entities, over all depths, over five folds. . . . . . . . . . . . . . . . . 52
3.7 Entity graph showing the visual conversation traversals from /r/news.
This illustration shows the paths of conversations over entity sets. The
x-axis represents the depth of the conversation; entity sets are clustered
into a semantically meaningful space along the y-axis. Inset graph
highlights five example entity sets and their connecting conversation
paths. Node colors represent equivalent entity sets. In this example we
highlight how entity sets are placed in meaningful semantic positions
in relation to one another. . . . . . . . . . . . . . . . . . . . . . . . . 54
3.8 Entity graph example of spreading activation on /r/news when Barack Obama
is selected as the starting entity. The x-axis represents the (threaded)
depth at which each entity was mentioned within conversations rooted
at Barack Obama. The y-axis represents the semantic space of each
entity, i.e., similar entities are closer than dissimilar entities on the y-
axis. Node colors represent equivalent entity sets. In this example, we
observe that conversations starting from Barack Obama tend to center
around the United States, political figures such as Donald Trump, and
discussion around whether his religion is Islam. . . . . . . . . . . . . . 56
3.9 Illustration of an entity graph created from threaded conversations
from /r/news (red-edges) and /r/worldnews (blue-edges). The x-axis
represents the (threaded) depth at which each entity set was mentioned
within conversations rooted at White House. The y-axis represents the
semantic space of each entity, i.e., similar entities are closer than dissimilar
entity sets on the y-axis. Node colors represent equivalent entity
sets. Conversations in /r/news tend to coalesce to United States,
while conversations in /r/worldnews tend to scatter into various other
countries (unlabeled black nodes connected by thin blue lines) . . . . 61
3.10 Comparison between the first 6 months of /r/Coronavirus from 2020
to 2021. Illustration of an entity graph created from threaded conver-
sations from /r/Coronavirus in Jan–June of 2020 (red-edges) and from
Jan–June of 2021 (blue-edges). The x-axis represents the (threaded)
depth at which each entity set was mentioned within conversations
rooted at United States. The y-axis represents the semantic space of
each entity set, i.e., similar entity sets are closer than dissimilar entity
sets on the y-axis. Node colors represent equivalent entity sets. Con-
versations tended to focus on China and Italy early in the pandemic,
but turn towards a broader topic space later in the pandemic. . . . . 63

4.1 Example of pseudo label selection when using a threshold (top) ver-
sus the top-k sampling strategy (bottom). In this toy scenario, we
chose k = 2, where each class is represented by a unique shape. As
the threshold selection strategy pseudo-labels data elements (shown as
yellow) that exceed the confidence level, the model tends to become
biased towards classes that are easier to predict. This bias causes a
cascade of mis-labels that leads to even more bias towards the majority
class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 TK-KNN overview. The model is (1) trained on the small portion of
labeled data. Then, this model is used to predict (2) pseudo labels
on the unlabeled data. Then the cosine similarity (3) is calculated for
each unlabeled data point with respect to the labeled data points in
each class. Yellow shapes represent unlabeled data and green represent
labeled data. Similarities are computed and unlabeled examples are
ranked (4) based on a combination of their predicted probabilities and
cosine similarities. Then, the top-k (k = 2) examples are selected (5)
for each class. These examples are finally added (6) to the labeled
dataset to continue the iterative learning process. . . . . . . . . . . . 73
4.3 Convergence analysis of pseudo-labelling strategies on CLINC150 at
1% labeled data. TK-KNN clearly outperforms the other pseudo-
labelling strategies by balancing class pseudo labels after each training
cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4 Ablation results for each dataset using 1% labeled data. . . . . . . . 83
4.5 A comparison of TK-KNN on HWU64 with 1% labeled data as β varies. 85
4.6 A comparison of TK-KNN on HWU64 with 1% labeled data as k varies. 86

TABLES

2.1 Judgements breakdown in the /r/AmItheAsshole dataset. . . . . . . . 14
2.2 Classification results on the AITA dataset. . . . . . . . . . . . . . . . 17
2.3 Subreddits used for analysis of moral judgement. . . . . . . . . . . . . 19
2.4 Number of posts judged with positive and negative moral valence in
each subreddit (2018). . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Contingency Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6 Logistic Regression Analysis . . . . . . . . . . . . . . . . . . . . . . . 35

3.1 Reddit discourse dataset. . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.1 Breakdown of the intent classification datasets. . . . . . . . . . . . . 78
4.2 Results for CLINC150, BANKING77, and HWU64 . . . . . . . . . . 81

ACKNOWLEDGMENTS

Throughout graduate school I have had a plethora of people that have supported
me with my work and outside of it as well. I would first like to thank my advisor,
Tim Weninger, for tolerating me and guiding me throughout the process. I could not
have asked for a better mentor that gave me the flexibility to work on the problems
I found interesting. I would also like to thank my committee members Meng Jiang,
Kevin Bowyer, and Yonatan Bisk. They all provided fantastic feedback and advice
from my proposal and were great members for my committee. While I had numerous
people guide me through my research I would also like to acknowledge the people that
helped me outside of the lab. I want to thank my wife Xiaoqing, whom I was lucky
enough to meet during my PhD. You have supported me since we first met and have
made my life better since coming into it. To Joey Heyl, for helping me out anytime
I needed it with literally anything. I couldn’t ask for a more dependable friend and
you became an amazing gym partner for me when I moved back home. To all of
my family, for supporting and encouraging me the entire way through. Having soup
nights with you all throughout gave me a chance to relax and enjoy our time together.
All of you were always happy when I was making progress and there for me whenever
I would struggle or be stressed. To Zach Petrusch, I couldn’t have asked for a more
compassionate friend. You’ve always picked up the phone and talked to me whenever
I’m going through tough situations and brought so much fun to our D&D group. To
Scott Mulvihill, you’ve been my friend for so long and I can always count on you
to invite me out. Every time I come home I always know I’ll be able to hang out with
you and get out and enjoy the world. To Scott Ramey and Matt Wagenhoffer, for
being the best remote friends I could have had. The two of you helped keep me sane
by playing games with me throughout my entire time in grad school but especially
during Covid. Finally, I want to thank the rest of the Big Dawgs in House Alan. I
had conversations with all of you and received nothing but support and compassion
for my graduate school work. Without so many people supporting me I would not
have been able to succeed in graduate school and I am extremely thankful to have
all of you in my life.

CHAPTER 1

INTRODUCTION

Social media has become an integral part of modern society, providing a platform
for individuals to express their opinions and engage with others. However, the prolif-
eration of digital communication has turned online spaces into breeding grounds for
polarizing conversations [136], personal attacks [143], and moral outrage [37]. There-
fore, it is essential to understand how people engage with one another in these spaces
as they often create overarching narratives based on different groups. The way that
people share their experiences on social media to create these narratives holds impor-
tant implications for various fields such as psychology, sociology, and communication
studies.
The notion of a narrative, especially on social media, is poorly defined, but
previous works generally accept it as the passage of time or process [67, 130], often
referred to as a “change of state” [90]. However, these works often focus on understanding
narratives from stories or poems, not those that are shaped in everyday
life on social media. The narratives that come from these stories are often linear in
nature, with a clear beginning, middle, and end.
This naturally arises as the narrative is being told from the
perspective of one individual. In contrast, social media narratives are very fractured
and non-linear in how they develop. With input and opinions coming from a wide
variety of people, the trajectory of the narrative can quickly shift and be hard to
follow. Because so many different perspectives contribute to social media narratives,
a variety of different techniques are necessary to understand them.

A significant challenge in understanding these evolving narratives is the massive
volume of data being generated on social media. This information overload [52] makes
it difficult for individuals to sift through sources and comprehend the complex dy-
namics at play, often leading to confusion regarding the messages they receive [113].
Social media platforms also utilize recommendation systems to present users with
relevant information about current events and their interests. While these systems
are considered useful, particularly from a business standpoint, they frequently create
feedback loops and echo chambers on social media [73]. These feedback loops create
communities that often support different messaging and hold varying narratives
related to different topics or current events. With all these dynamics at play, automated
methods to aggregate and understand social media narratives
have become increasingly important.
The resurgence of deep learning and more powerful models has made new meth-
ods for exploring and understanding these trends available to researchers. Natural
language processing, in particular, has seen an explosion of use cases in recent years
due to the transformer architecture [138] and pre-trained large language models [41].
These pre-trained models can be fine-tuned and modified for a wide variety of prob-
lem domains and achieve good performance on these tasks. This has allowed for
larger scale and broader analysis to be conducted on social media.
In this dissertation, I explore user narratives from three distinct perspectives:
moral judgements, conversational flow, and user intent. These three perspectives
help shed light on users’ values, beliefs, and intentions on social media. Each approach
offers unique insights into how narratives shift over time and influence people’s
attitudes and beliefs.

1.1 Moral Judgement

As long as society has existed, cultures have formed their own moral norms [64]
that are accepted and followed. These moral norms help ensure that a group of
people can cooperate to ultimately help the group survive [15]. When an
individual within the group breaks one of these norms they will typically be punished
by others within the group. Other people coming out against a person’s actions
helps to inform the perpetrator, along with bystanders, that a particular action is
unacceptable. These moments help to define the overarching moral beliefs that are
inherent to any given society.
With the advent of the internet, moral judgements and norms have undergone
some interesting developments. The use of social media has opened up two main
avenues to alter moral norms: exposure to other cultures and a larger global community
to cast judgements. This is the first time in history that so many people
have been able to share a variety of cultures and viewpoints so easily, thus enabling
others to examine and judge these differences. The increased sharing of experiences
often leads to heated debates on social media covering a wide variety of topics [147].
On social media, taking a moral stance has quickly become a predominant method to
attack other individuals or support one’s own reasoning [16], as it often incites anger
in people. These moral stances are often used to form a narrative on social media
surrounding real-world events. They are often calls to action based on a perceived
wrongdoing in the world.
Recent findings emphasize that positive feedback on moral outrage acts as a reinforcement
mechanism [18]. The feedback will reinforce the original user’s stance and increase
the likelihood of further expressions of moral outrage. This often causes more moral
messaging that is used to construct narratives surrounding a given issue or topic.
To study morality on social media, researchers have often turned towards moral
foundations theory (MFT) [61] to guide their methods. MFT utilizes a lexicon of
words to classify text along five different axes of morality. While a generic dictionary
exists for MFT [117], more recent targeted dictionaries and corpora have been created
for specific social media platforms such as Twitter [68] and Reddit [135]. While
this method has led to a number of findings, others have found it falls short when
considering shifting social media topics and may not accurately capture realistic moral
dynamics. This makes MFT difficult to apply broadly to social media to capture
the moral messaging that is being used to build a narrative. More recent works look
to leverage large pre-trained models and a variety of psycholinguistic features, often
including MFT, to better understand moral judgements [144, 145, 63].
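As a minimal illustration of the lexicon-based approach described above, the sketch below counts how many words of a text fall under each moral foundation. The five foundation names follow the usual presentation of MFT; the tiny word lists and the function name `score_foundations` are hypothetical placeholders chosen for illustration, not entries from any published MFT dictionary.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: each foundation maps to a few illustrative words.
# A real MFT dictionary is far larger; these entries are assumptions.
TOY_LEXICON = {
    "care": {"hurt", "protect", "suffer"},
    "fairness": {"fair", "cheat", "equal"},
    "loyalty": {"betray", "loyal", "family"},
    "authority": {"obey", "respect", "law"},
    "sanctity": {"pure", "disgusting", "sacred"},
}

def score_foundations(text, lexicon=TOY_LEXICON):
    """Count exact-match lexicon hits per foundation in a lowercased text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for foundation, words in lexicon.items():
            if token in words:
                counts[foundation] += 1
    # Report every foundation, including those with zero hits.
    return {foundation: counts[foundation] for foundation in lexicon}

scores = score_foundations("It is not fair to cheat and betray your family.")
```

Because the counting relies on exact word matches, words outside the lexicon contribute nothing, which hints at why fixed dictionaries can struggle as social media topics shift.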
In Chapter 2, I present work analyzing moral judgements on the social media website Reddit.
Specifically, I extract data from the subreddit /r/AmITheAsshole and train a moral
judgement classifier based on the comments. This classifier is then applied to infer a
variety of social factors regarding moral judgements. Analyzing user patterns allows
us to understand posting habits and the types of messages used to support the moral
messaging that often appears in narratives.

1.2 Conversational Flow

Social media generates a plethora of conversational data as individuals banter
about common topics or fiercely debate controversial issues [156]. Narratives often
arise in these conversations as people with different beliefs and ideologies discuss
them in group conversations [102]. This raises the interesting question of
how these conversations will flow based on the community and subject at hand. For
instance, it is possible to analyze whether different communities are predisposed to
discuss certain topics more often. While this concept has been raised by linguists before [23],
attempts to analyze it in a structured manner are lacking. With advances
in machine learning it is now possible to extract information from a large number
of conversations at once and understand the flow of topics throughout the turns of
a conversation by using entity linking [12]. Looking at the entities of a conversation
is attractive for understanding the flow, as it focuses on the key subjects being referenced.
For my purposes, the actions these entities are taking and their sentiment are
not important. The focus instead is to understand the patterns by which groups
begin discussing one topic and end up at another. As repeated
messaging and echo chambers are a driving factor of narratives in online discourse,
my method will allow for the observation of these patterns in a broad manner.
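To make the idea of topic flow over entities concrete, the following sketch counts how often one entity set is followed by another along parent-to-child reply edges in a comment thread. It assumes entity linking has already been run; the nested-dictionary thread format, the example entities, and the function name `count_transitions` are hypothetical choices for illustration, not the data model developed later in this dissertation.

```python
from collections import Counter

# Hypothetical thread: each comment carries its already-linked entities
# and a list of replies in the same format.
thread = {
    "entities": ["Joe Biden"],
    "replies": [
        {"entities": ["Donald Trump", "United States"], "replies": [
            {"entities": ["United States"], "replies": []},
        ]},
        {"entities": ["United States"], "replies": []},
    ],
}

def count_transitions(node, counts=None):
    """Count parent-to-child transitions between entity sets in a reply tree."""
    if counts is None:
        counts = Counter()
    parent_set = frozenset(node["entities"])
    for child in node["replies"]:
        child_set = frozenset(child["entities"])
        # Skip edges where either comment has no linked entities.
        if parent_set and child_set:
            counts[(parent_set, child_set)] += 1
        count_transitions(child, counts)
    return counts

transitions = count_transitions(thread)
```

Aggregating such counts over many threads from one community gives a rough picture of which subjects tend to follow which, independent of the actions or sentiment attached to them.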
One could look at the rise of generative large language models (LLMs) as a sort
of proxy for this investigation. LLMs are trained to predict the next token based
on the previous tokens [106] and have therefore learned a sense of how
a conversation should flow given a specific prompt. Although I do not analyze the
generations of these LLMs, doing so could prove an interesting avenue for further analysis
of conversational flow. Specifically, one could give a prompt, generate
a variety of responses from the model, and then analyze their flow in a manner similar to
my method discussed later. I also anticipate that these models will see widespread
use in creating automatic posts to push forward narratives from groups that
wish to exert influence over others.

1.3 Intent

Understanding an individual’s intent is of great importance to dialogue systems
and narratives on social media. Intent can be framed as understanding what action
an individual wants themselves or others to take based on a textual utterance.
Statements that convey intent are often short in nature and convey a single meaning.
Longer messages or documents often convey a wide variety of intents and not just a
single one. Due to the nature of social media, messages broadcast to others are often
short and clear in the intent that they wish to convey.
In particular, users of social media often attempt to influence various communities
by showing their support or disdain for particular topics. With the ever-changing nature
of the world it can be very challenging to capture what intents are important and
understand the broader narrative happening. Applying machine learning methods to
understand these intents is challenging due to the cost of annotating large datasets
and the broad variety of intents that users will message about. I address this issue
by using a semi-supervised learning method to achieve high performance with very
limited labeled data. Semi-supervised learning is useful in problem domains where a
plethora of unlabeled data is available but only a small amount of data can be labeled.
As previously discussed, social media has a vast number of examples, so a method
that can leverage the unlabeled examples is important to improving performance.
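The general pseudo-labelling idea behind many semi-supervised methods can be sketched as follows: a model trained on the few labeled examples predicts labels for unlabeled ones, and the most confident predictions are added to the training set. The sketch below selects the k most confident predictions per class so that no single class dominates. The input format, the function name `select_top_k_per_class`, and the example scores are assumptions made for illustration; the sketch also leaves out the similarity-based ranking developed in Chapter 4.

```python
from collections import defaultdict

def select_top_k_per_class(predictions, k):
    """Pick the k most confident pseudo-labels for each predicted class.

    predictions: list of (example_id, predicted_label, confidence) tuples.
    Returns a dict mapping each label to its selected example ids.
    """
    by_label = defaultdict(list)
    for example_id, label, confidence in predictions:
        by_label[label].append((confidence, example_id))
    selected = {}
    for label, scored in by_label.items():
        # Sort by confidence, highest first, and keep at most k examples.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        selected[label] = [example_id for _, example_id in scored[:k]]
    return selected

# Hypothetical model outputs on unlabeled utterances.
predictions = [
    ("u1", "support", 0.95), ("u2", "support", 0.90), ("u3", "support", 0.85),
    ("u4", "oppose", 0.60), ("u5", "oppose", 0.40),
]
pseudo_labels = select_top_k_per_class(predictions, k=2)
```

With a fixed confidence threshold of, say, 0.8, only the "support" class would receive pseudo-labels in this toy example, whereas per-class selection keeps both classes represented.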
A semi-supervised method also allows for training on targeted domains as new
situations occur in the world. Without the need to annotate a large number of
examples, important classes can be quickly chosen and a few examples annotated.
This will allow for broader downstream analysis once the model has been trained.
By having an accurate classifier for intents, popular messages can be found and
examined for other attributes such as named entities, sentiment, and emotion. All of
these attributes can help drive insight into the social factors that influence popular
narratives and messaging on social media.
An interesting question when dealing with intent for social media is the domain
to which the model is applied. As intent can be considered the actions an individual
wants to take, a variety of issues can be framed within this domain. For instance, I
discuss how moral judgements are an important aspect of narrative analysis but can
also be viewed as a form of intent. Whether a user wishes to cast a moral judgement
in their messaging or not can be interpreted as a binary intent classification problem.
Similarly, it is possible to look at political elections throughout the world
and capture the intent of candidates, supporters, and detractors. Messages
could capture individuals asking for support of a candidate, taking a position for or against
a given topic, or expressing an emotional response. Intent allows for a
broad net to be cast that is dependent on the target demographic and domain to which it
is applied. As such, the semi-supervised method discussed allows an analyst to
quickly position a model into their domain of interest and narrow down the specific
intents they are interested in understanding.

1.4 Thesis Overview

Narratives are pervasive on social media yet poorly understood. To understand
narratives is to understand how different actors in a society wish to construct
and manipulate messages to the public. This motivates my thesis:

In order to understand the shifting social and cultural landscape
on social media, narrative analysis at scale is necessary. To facilitate
narrative understanding, machine learning methods are necessary to
handle the enormous volume and diversity of narratives.

This dissertation consists of the three main focuses previously discussed: moral
judgements, conversational flow, and user intent. Chapter 2 focuses on understanding
moral judgements on Reddit and examines different social aspects that influence
them. Chapter 3 proposes a graph-based model and visualization to interpret conversational
flow from different communities. Findings are presented comparing different
communities on Reddit and the key entities they discuss.
For the remainder of the dissertation I focus on classifying user intents. Chapter
4 proposes a new semi-supervised learning method to classify intents in low-data
scenarios. Results demonstrate strong performance, particularly when labeled data
is extremely scarce (1 to 2 examples per class). Finally, in Chapter 5 I summarize the main
findings of the work and propose some future directions.

CHAPTER 2

ANALYSIS OF MORAL JUDGEMENT ON REDDIT

The work presented in this chapter is a collaboration with Shawn Gu and Tim
Weninger and was published in the IEEE Journal on Computational Social Systems
in 2022 [13].

2.1 Introduction

How do people render moral judgements of others? This question has been pon-
dered for millennia. Aristotle [1], for example, considered morality in relation to
the end or purpose for which a thing exists. Kant [77] insisted that one’s duty was
paramount in determining what course of action might be good. Consequentialists
[128] argue that actions must be evaluated in relation to their effectiveness in bringing
about a perceived good. Regardless of the particular ethical frame to which one
subscribes, the common practice of evaluating others’ behavior in moral terms is widely
regarded as important for the well-being of a community. Indeed, ethnographers and
sociologists have documented how these kinds of moral judgements actually increase
cooperation within a community by punishing those who commit wrongdoings and
informing them of what they did wrong [14].
The process of rendering moral judgement has taken an interesting turn in the
current era, where online social systems enable people to encounter and consider
the lives and perspectives of others from around the world. At no other time in
history have so many people been able to examine (and judge) such a variety of
cultures and viewpoints so readily. This increased sharing and mixing of viewpoints
inevitably leads to fierce online debates about various topics [148], while the content
and outcomes of these debates provide researchers with the opportunity to ask specific
questions about argument, disagreement, moral evaluation, and judgement with
the aid of new computational tools.
To that end, recent work has resulted in the creation of statistical models that
can capture moral sentiment in text [117]. However, these models rely heavily on a
gazetteer of words and topics as well as their alignment on moral axes. The central
motivation for these works is grounded in moral foundations theory [61], where studies
also tend to investigate the use of morality as related to current events in the news.
Despite their usefulness in understanding the moral valence of specific current events,
the goal of the current work is to study moral judgements rendered on social media
that apply to more common personal situations.
We focus on Reddit in particular, where users can create posts and have discus-
sions in threaded comment-sections. Although the details are complicated, users also
perform curation of posts and comments through upvotes and downvotes based on
their preference [54, 56]. This assigns each post and comment a score reflecting how
others feel about the content. Within Reddit there are a large number of subreddits,
which are small communities typically dedicated to a particular topic. One subreddit
in particular is centered on questions of moral judgement. This subreddit is named
/r/AmItheAsshole. Posters to /r/AmItheAsshole are typically looking to hear from
other Reddit users about whether or not they handled their personal situation in an
ethically appropriate manner. The community works like this. First, users post a de-
scription of a situation in which they were involved, and they are also encouraged to
explain details of other people involved as well as the final outcome of the situation.
Next, other users respond to the initial post with a moral judgement as to whether
the original user was an asshole or not the asshole. Figure 2.1 shows an example of a
typical post and one of its top responses. One important rule of /r/AmItheAsshole

Figure 2.1. Example of (a) a post title and (b) a comment in the
/r/AmItheAsshole subreddit. The NTA prefix and comment-score (not
shown) indicate that the commenter judged the poster as “Not the
Asshole”.

is that top-level responses must categorize the behavior described in the original post
into one of four categories: Not the Asshole (NTA), You’re the Asshole (YTA), No assholes
here (NAH), or Everyone sucks here (ESH). In addition to providing a categorical
moral judgement, the responding user must also provide an explanation as to why
they selected that choice. Reddit’s integrated voting system then allows other users
to individually rate the judgements with which they most agree (upvote) or disagree
(downvote). After some time has passed the competition among different judgements
will settle, and one of the judgements will be rated highest. This top comment is
then accepted as the judgement of the community. This process of passing and rating
moral judgement provides a unique view into our original question about how people
make judgements of morality, ethics, and behavior.
Compared to other methodologies of computational evaluation of moral senti-
ments, collecting judgements from /r/AmItheAsshole (AITA) has some important
benefits. First, because posters and commenters are anonymous on Reddit, they
are more likely to share their sensitive stories and frank judgements without fear of
reprisal [99, 74]. Second, the voting mechanism of Reddit allows a large number of
users to engage in an aggregated judgement in response to the original post [57].
However, the breadth and variety of this data does pose additional challenges. For
instance, judgements are provided without an explicit moral-framing, and, similarly,
Reddit-votes are susceptible to path dependency effects [58].
The study of how social norms and morals are reasoned about on social media has
only recently become a topic of interest [50, 47]. In these works, large scale annotation
studies were performed using data collected from moral situations; one data source
used in both studies was our subreddit of interest /r/AmItheAsshole. The annotation
efforts from Forbes et al. resulted in a dataset that contains heuristics for various
actions and how acceptable they were found [50]. Likewise, Emelin et al. curated a
new dataset by asking people to create diverse narrative scenarios using the previous
heuristics as writing prompts [47]. Although these works investigated computational
models of social norms and actions, they did not consider how people cast moral
judgement upon others. A recent study looked at online shaming on Twitter [5] –
an increasingly common way to cast moral judgement. In this study, 1000 shaming
tweets were collected and placed into categories based on the type of shaming, which
are further analyzed to create an anti-shaming system. Yet the creation of a model
based on these datasets would be difficult due to the lack of positive moral examples.

2.1.1 Research Questions

In the present work we use data from AITA to investigate how users provide
moral judgements of others. Specifically, we extracted representative judgement-
labels from each comment and used these labels and comments to train a classifier.
Human annotators were then used to verify the performance of this classifier on
other subreddits. This classifier was then broadly applied to infer the moral valence of
comments from other communities in order to answer the following research questions:

RQ1: Is moral valence correlated with the score of a post?
Recent research into morality has found that immoral acts posted online trigger
stronger moral outrage responses than if the act was witnessed in person [37, 9].
These strong responses can be helpful in platform moderation as the viral nature
of these posts may drive user engagement. Others lament the rise of cancel culture
as a kind of piling-on effect that can have severe negative consequences for the one
judged to be immoral [5, 97]. Because content on Reddit is rated by the community,
we would expect that posts with immoral acts generate higher scores. With this in
mind we investigated various Reddit communities to observe the behavior in each.

RQ2: Do certain subreddit-communities attract users whose posts are
typically classified by more negative or positive moral judgements?
Previous work on morality and rational behavior has discussed how individuals
tend to act in a manner that promotes the most social utility [65]. This, in turn,
causes people to follow moral guidelines based on numerous factors. In essence, people
will make choices that help themselves in society but also take into account their own
moral identity. In the present work, we seek to understand the types of communities
and individuals that exhibit more negative or positive behavior on Reddit. We also
investigated the social media habits of these users to find underlying themes that
occur based on their behavior.

RQ3: Are self-reported gender and age descriptions associated with posi-
tive or negative moral judgements?
The role that gender plays on social media has been analyzed from a variety
of viewpoints. One angle that has been analyzed is how users receive support on
social media based on their gender [139]. The findings from three social platforms,
including Reddit, show that women receive higher rates of support and disparagement
in comparison to men. De Choudhury et al. also sought to understand how gender
plays a role in mental health disclosure on social media, finding that men desire social
support less often than women [38]. Differences in the topics of interest
between genders have also been found on social media, and on Reddit in particular [134].
Studies that analyze gender and age on social media have investigated how language
use aligns with various personality types along these dimensions [121]. These findings
can be useful when investigating gender inequalities that exist in the world. In the
present work, we chose to focus on gender and age as it relates to moral judgements
on social media.

2.1.2 Findings in Brief

To answer these research questions we first tried several modern text classification
systems and evaluated their ability to predict moral judgement on a held out test
set of AITA comments. We dubbed the best classifier Judge-BERT, because it was a
fine-tuned version of the BERT language model for text classification [41]. We then
applied Judge-BERT to comments from several other communities.
In summary, we found that posts that were judged to have positive moral valence
(i.e., NTA label) typically scored higher than posts with negative moral valence.
We also found that certain subreddit-communities where users confess to something
immoral (e.g., /r/confessions) tended to attract users whose posts were more
negative. Among these negative-users we found that their posting habits tended
towards three different types. Finally, we showed that self-described male users were
more likely to be judged with negative moral valence than female users.

2.2 Data and Methodology

We retrieved moral judgements by collecting posts and comments from the sub-
reddit /r/AmItheAsshole, taken from the Pushshift data repository [7].
We restricted our data collections to posts submitted between January 1, 2017,

TABLE 2.1

JUDGEMENTS BREAKDOWN IN THE /R/AMITHEASSHOLE DATASET.

Label   Meaning                # Comments
NTA     Not the Asshole        717,006
YTA     You’re the Asshole     372,850
NAH     No assholes here       91,903
ESH     Everyone sucks here    79,059

and August 31, 2019. In order to ensure that labels reflected the result of robust
discussion we excluded those posts containing fewer than 50 comments. Subreddit
rules required that top-level comments begin with one of four possible prefix-labels
indicated in Table 2.1. Because of this rule, we further restricted our data collection
to contain only top-level comments and their prefix-label. Comments with the INFO
prefix, which indicates a request for more information, and comments with no prefix
were also removed from consideration. This methodology resulted in a collection
of 7,500 posts and 1,260,818 comments with explicit moral judgements. Posters
and commenters appeared to put a lot of thought and effort into these discussions.
Each post contained 381 words on average and each comment contained 57 words
on average. For each of the comments we removed the prefix labels from the text.
This was done to ensure the models we trained could not simply learn based on the
labels that were extracted from the original text. We also truncated longer comments
down to a max-length of 128 tokens. This was determined based on the distribution of
comment sizes and performance across multiple tests of the model at varying lengths.
Given that all of the comments we have extracted come from one community, there
are most likely survey biases included in our dataset. This can, in part, be attributed
to the self-referential nature of Reddit [127]. Although we did not consider all sub-
reddits, we did consider all comments within a specific timeframe from each selected
subreddit. Differences between these subreddits provided the ability to compare and
contrast community-behavior and their users. Temporal and path dependency biases
certainly affected some of the measures in the present work (e.g., comment score);
however, we remain confident that our dependent-variable, i.e., moral valence, is not
correlated with temporal and ordinal effects. In other words, the timing of a comment
almost certainly did not affect its moral valence. Likewise, any ordinal effects
would necessarily occur after the posting, so the causality arrow, if it exists, can only
point in the opposite direction.
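
The collection steps described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline actually used: the regular expression, the function names, and the whitespace tokenization are assumptions (a BERT-style model would normally truncate on subword tokens), and access to the Pushshift repository is omitted.

```python
import re

# Prefix-labels from Table 2.1, plus INFO, which is dropped from the dataset.
# Assumption: the label is the first word of the comment, in any letter case.
PREFIX_RE = re.compile(r"^\s*(NTA|YTA|NAH|ESH|INFO)\b[\s.,:;!-]*", re.IGNORECASE)

MIN_COMMENTS = 50   # posts with fewer comments are excluded
MAX_TOKENS = 128    # longer comments are truncated


def keep_post(num_comments: int) -> bool:
    """Keep only posts with enough discussion."""
    return num_comments >= MIN_COMMENTS


def split_label(body: str):
    """Return (label, text without the prefix), or None if the comment is unusable."""
    match = PREFIX_RE.match(body)
    if match is None:
        return None                      # no prefix-label at the start
    label = match.group(1).upper()
    if label == "INFO":
        return None                      # request for more information
    return label, body[match.end():].strip()


def truncate(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Assumption: whitespace tokens stand in for the model's tokenizer."""
    return " ".join(text.split()[:max_tokens])
```

For example, `split_label("NTA. You did nothing wrong")` yields the label `NTA` together with the text `You did nothing wrong`, so a model trained on the stripped text never sees the label itself.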

2.2.1 Text Classifier for Moral Judgement

Given the dataset with textual posts and textual comments labeled with positive
or negative moral judgements, our goal is to predict whether an unlabeled comment
assigns a positive (NTA or NAH) or negative (YTA or ESH) moral judgement to the
user of the post. It is important to note that this classifier predicts the judgement of
the comment(er), not the morality of the poster.
We define our problem formally as follows.

Problem Definition Given a top level comment C with moral judgement A ∈ {+, −}
that responded to post P we aim to find a predictive function f such that

f : (C) → A (2.1)

Formally, this takes the form of a text classification task where class inference denotes
the valence of a moral judgement. To train such a classifier we utilized the comments
we have extracted with their respective class labels from /r/AmItheAsshole. We note
again that the text representing each class label (e.g., NTA, YTA) has been removed
from the comment.
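
As a small worked example of this setup, the four prefix-labels can be collapsed into the two valence classes and the class balance implied by Table 2.1 can be computed. The names below are illustrative, and the majority-class share is derived only from the counts in Table 2.1; it is not a result reported in this chapter.

```python
# Comment counts per prefix-label, copied from Table 2.1.
COUNTS = {"NTA": 717_006, "YTA": 372_850, "NAH": 91_903, "ESH": 79_059}

# Valence A in {+, -}: NTA and NAH are positive, YTA and ESH are negative.
VALENCE = {"NTA": "+", "NAH": "+", "YTA": "-", "ESH": "-"}


def valence_counts(counts):
    """Aggregate the per-label counts into the two valence classes."""
    totals = {"+": 0, "-": 0}
    for label, n in counts.items():
        totals[VALENCE[label]] += n
    return totals


totals = valence_counts(COUNTS)
positive_share = totals["+"] / sum(totals.values())
print(totals, round(positive_share, 3))
```

Under these counts roughly 64% of the comments carry a positive judgement, so a classifier that always predicts the positive class would reach about that accuracy on data with the same balance. This gives a rough reference point for reading accuracy figures.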
The choice of classification model f is important, and we aim to train a model
that performs well and generalizes to other datasets. We test a variety of models to
ensure we can find one that performs the best on this task. Our choices are based
on recent advances in NLP along with prior methods that have demonstrated strong
performance on short text classification. We selected four text classification models
for use in the current work along with one sentiment analysis method:

• VADER [70]: We utilized the VADER sentiment analysis system to classify
positive and negative sentiment along our positive and negative moral axis.
VADER gives a sentiment value in the range of [-1, 1], with -1 referring to negative
sentiment and 1 to positive sentiment. Neutral sentiment normally corresponds to
values in [-0.05, 0.05]. Since our classifier should only consider positive
and negative values, we adjusted the thresholds so that any value greater than 0
indicates positive sentiment and any value below 0 indicates negative
sentiment.

• Multinomial Naïve Bayes [81]: Uses word counts to learn a text classification
model and has shown success in a wide variety of text classification problems.

• Doc2Vec [87]: We created a document embedding for each comment-label pair in
our dataset. The document embeddings are then input into a logistic regression
classifier that calculates the class margin.

• BERT Embeddings [41]: We extracted word embeddings from BERT for each
token in the comments. These are averaged together and input into a logistic
regression classifier that calculates the class margin.

• Judge-BERT: We fine-tuned the BERT-base model using the class labels that
we extracted from each comment. Specifically, we added a single dropout layer
after BERT’s final layer, followed by a final output layer that consists of our two
classes. The model was trained using the Adam optimizer and a cross entropy
loss function over three epochs as recommended by Devlin et al. [41].

TABLE 2.2

CLASSIFICATION RESULTS ON THE AITA DATASET.

Method                     Accuracy        Precision       Recall          F1
VADER (Sentiment)          54.55 ± 0.00    38.63 ± 0.00    45.49 ± 0.00    41.78 ± 0.00
Doc2Vec Embeddings         65.92 ± 0.04    61.22 ± 0.15    13.5 ± 0.09     22.1 ± 0.09
BERT Embeddings            70.10 ± 0.07    64.28 ± 0.2     36.96 ± 0.14    46.96 ± 0.08
Multinomial Naïve Bayes    72.12 ± 0.17    62.58 ± 0.17    55.22 ± 0.07    58.66 ± 0.05
Judge-BERT                 89.03 ± 0.13    85.57 ± 0.18    83.48 ± 0.27    84.51 ± 0.17

Results are mean-averages and standard deviations over five-fold cross-validation.
Judge-BERT performed the best on within-distribution testing.

2.2.2 Judgement Classification Results

We evaluated our four classifiers using accuracy, precision, recall, and F1 metrics.
In this context a false positive is the instance when the classifier improperly assigns a
negative (i.e., asshole) label to a positive judgement. A false negative is the instance
when the classifier improperly assigns a positive (i.e., non-asshole) label to a negative
judgement. We performed 5-fold cross-validation and, for each metric, report the
mean-average and standard deviation over the 5 folds.
The results in Table 2.2 indicate that the Doc2Vec, BERT, and Multinomial Naïve
Bayes classifiers do not perform particularly well at this task. Fortunately, the fine-tuned
Judge-BERT classifier performs relatively well, with an accuracy near 90% and
with type 1 and type 2 errors that are relatively similar. Overall, these results indicate
that the Judge-BERT classifier is able to accurately classify moral judgements.
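
The metric definitions in this section can be made concrete with a short sketch. Following the convention stated above, the negative judgement is treated as the class being detected, so a false positive is a positive judgement mislabeled as negative. The function names are illustrative, and the use of the sample standard deviation across folds is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def scores(y_true, y_pred, target="-"):
    """Accuracy, precision, recall, and F1 with `target` as the detected class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def summarize(fold_scores):
    """Mean and sample standard deviation of each metric over the folds."""
    summary = {}
    for name in fold_scores[0]:
        values = [fold[name] for fold in fold_scores]
        summary[name] = (mean(values), stdev(values))
    return summary
```

In a five-fold setting, `scores` would be called once per held-out fold and the five resulting dictionaries passed to `summarize`.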

2.2.3 Sentiment Analysis

We used VADER to see how much similarity existed between sentiment analysis
of a person’s judgement versus the morality of such a judgement. Our assumption
was that sentiment would be correlated with positive and negative moral judgement.
However, results from VADER on our dataset showed a very different outcome.
VADER performed much worse in comparison to our text classification methods.
The interpretation of these findings is not that VADER is bad at sentiment anal-
ysis. Rather, the moral judgements being cast in /r/AmItheAsshole did not conform
to the same distribution of words that typical sentiment analysis tools, like VADER,
capture.

2.3 Analysis of Moral Judgement

With our trained Judge-BERT classifier, our goal was to better understand moral
judgement across a variety of online social contexts and to analyze various trends
in moral judgement. In order to minimize the transfer-error rate it was important
to select subreddit-communities that were similar to the training dataset. These
subreddits were all based on posts and comments that were purely textual in nature.
This highlights the conversational similarities that we found with /r/AmItheAsshole
and enabled smooth transference from one community to another. In total we chose
ten subreddits to explore in our initial analysis. These subreddits can be broken into
three main stylistic groups and are briefly described in Table 2.3. These subreddits are
some of the more popular subreddits that generally had users initiating conversations
about a situation in their life.
We applied the Judge-BERT classifier to the comments and posts of these ten
subreddits. Specifically, given a post and its comment tree we identified the top-level
comment with the highest score. This top-rated comment, which had received the

TABLE 2.3

SUBREDDITS USED FOR ANALYSIS OF MORAL JUDGEMENT.

Group            Subreddit                 Description
Advice           /r/relationship_advice    Users pose questions in a scenario like the AITA
                 /r/relationships          dataset and receive advice or feedback on their
                 /r/dating_advice          situation.
                 /r/legaladvice
                 /r/dating
Confessionals    /r/offmychest             Users confess to something that they have been
                 /r/TrueOffMyChest         keeping to themselves. Typically, confessions are about
                 /r/confessions            something immoral the poster has done.
Conversational   /r/CasualConversation     Users engage in conversations with others to have a
                 /r/changemyview           simple conversation or to hear other opinions in order
                                           to change their worldview.

most upvotes from the community, is considered to be the one passing judgement on
the original poster. This design follows how /r/AmItheAsshole passes judgements as
a community. As illustrated in Fig. 2.2, this top-rated comment was then fed to our
classifier and the resulting prediction was used to label the moral valence of the post
and poster. It is important to be clear here: we did not predict the moral valence
of the comment itself, but rather the top-rated comment was used to predict the
commenter’s judgement on the post.
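
The labeling procedure for a post can be sketched as follows. The `Comment` fields and the `classify` callable are hypothetical stand-ins (the latter for the trained Judge-BERT model), and the tie-breaking behavior is an assumption, since the text does not say how equal scores were handled.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Comment:
    body: str
    score: int
    is_top_level: bool  # True when the comment replies directly to the post


def top_rated_comment(comments: Iterable[Comment]) -> Optional[Comment]:
    """Highest-scoring top-level comment; ties resolve to the first one seen."""
    top_level = [c for c in comments if c.is_top_level]
    return max(top_level, key=lambda c: c.score) if top_level else None


def post_valence(comments: Iterable[Comment],
                 classify: Callable[[str], str]) -> Optional[str]:
    """Label a post with the valence predicted for its top-rated comment."""
    top = top_rated_comment(comments)
    return classify(top.body) if top is not None else None
```

Nested replies are ignored even when they outscore every top-level comment, which mirrors how /r/AmItheAsshole restricts judgements to top-level responses.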

2.3.1 Transferability

Before we analyzed the results of Judge-BERT on these other subreddits, it is
important to first understand how well the classifier generalizes. The generalizability
and transferability of Judge-BERT, in this case, is not easily determined because
the moral valence of other subreddits is not neatly identified with a clear YTA/NTA

[Figure 2.2: diagram. Post: “TL;DR: Married, slept with another man, and regretted it
immediately. Husband found out, I am not sure if he wants to leave me or not, but I am
willing to do anything to fix it. Need advice.” Top response: “If you were so unsatisfied
why not try and fix things before you destroy someones life. You don’t really deserve a
second chance. You’re actually terrible and I hope you learn your lesson.” The response
is passed to Judge-BERT, which outputs + or – (Moral Valence).]

Figure 2.2. A post (in blue) made by a user along with the top response
comment (white). The comment is then fed to our Judge-BERT classifier
(green) to determine the moral valence of the post.

label. Instead, we randomly selected 100 comments from each of the ten subreddits
for our analysis; 50 comments were labeled as NTA and 50 were labeled as YTA by
Judge-BERT. We displayed each comment to five different annotators on Mechanical
Turk. We told each annotator that comments came from /r/AmItheAsshole and
then asked them to label each comment as YTA or NTA. Each worker had to
complete a practice question before starting and correctly answer a clear,
randomly-inserted control question for their results to count. This task was reviewed
and approved by the University of Notre Dame’s Institutional Review Board (#20-01-5751).
An example screenshot of the questionnaire can be found in Fig. 2.3.
In addition to the 10 other subreddits, we also included /r/AmItheAsshole in
this experiment as a baseline to compare how human annotators performed on the
actual data in comparison to the subreddits of interest. Because the comments
from /r/AmItheAsshole actually have labels, which we analyzed earlier, we com-
pare the actual labels to the human annotations. Therefore, we can consider the
/r/AmItheAsshole annotations a kind of upper bound to identify the limit of human
performance and agreement on this task.
Annotators labeled 2,961 comments as NTA and 2,039 comments as YTA; about a

Figure 2.3. A screenshot of the human annotation system.

3:2 imbalance in favor of NTA. If we consider each human label to be the ground truth,
then we can calculate the transference performance of Judge-BERT. Because of the
class imbalance of the human labels, a random classifier achieves an F1-score of 45%.
Overall, we found that Judge-BERT obtained an F1 score of 53% with a precision
of 59% and recall of 48%. Recall that the /r/AmItheAsshole subreddit used the
actual labels, not Judge-BERT labels, yet still only obtained an F1-score of 64%. By
comparing these in-domain human results with the in-domain F1-scores from Table 2.2,
we found that Judge-BERT far outperformed humans on this task. Other subreddits
varied in performance: /r/relationships performed the best with an F1-score of 56%;
/r/CasualConversation performed the worst with an F1-score of 42%, which dipped
mostly because of a very low recall score (34%) indicating that humans were far less
likely than Judge-BERT to rate a /r/CasualConversation comment as YTA.
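
The transfer figures in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that YTA is the class being detected and that the random classifier guesses YTA half of the time. The text does not state how the baseline was computed, but under these assumptions the reported 45% is reproduced.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def random_baseline_f1(prevalence: float, guess_rate: float = 0.5) -> float:
    """Approximate F1 of a classifier that guesses the target class at random.

    Its precision equals the prevalence of the target class, and its recall
    equals the rate at which it guesses that class.
    """
    return f1_score(prevalence, guess_rate)


yta, nta = 2_039, 2_961                 # human labels reported in the text
prevalence = yta / (yta + nta)          # share of YTA labels, about 0.41

print(round(random_baseline_f1(prevalence), 2))   # random baseline
print(round(f1_score(0.59, 0.48), 2))             # from reported precision and recall
```

The two printed values round to 0.45 and 0.53, in line with the baseline and Judge-BERT F1-scores quoted above.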
Human annotators were not always in agreement with each other; and when
they agreed (fully or partially), their agreement did not always match the label
of Judge-BERT. There is much to unpack from these results, but Fig. 2.4 shows
the agreement rates for comments labeled YTA and NTA by Judge-BERT (except
in the case of /r/AmItheAsshole, which is labelled by the comment itself). If we

[Figure 2.4: two bar-chart panels, “Labelled YTA” and “Labelled NTA”. X-axis: Annotator
Agreement (5/0, 4/1, 3/2, 2/3, 1/4, 0/5); y-axis: Count of Comments (0 to 150). One series
per subreddit: /r/legaladvice, /r/relationships, /r/relationship_advice, /r/dating_advice,
/r/dating, /r/changemyview, /r/confessions, /r/offmychest, /r/TrueOffMyChest,
/r/CasualConversation, /r/AmItheAsshole.]

Figure 2.4. Plots showing the prediction of Judge-BERT versus the
annotation agreement by humans for both positive and negative classes.
X-axis displays the annotator agreement against the prediction made by
Judge-BERT. 5/0 represents all annotators agreeing and it matching the
model while 0/5 shows all annotators agreeing for the opposite class,
meaning a wrong prediction by the model.

vary the agreement rates from all-five-correct (5/0), four-correct-one-wrong (4/1),
etc. to none-are-correct (0/5), then we can draw an ROC curve with six sensitivity
points for each subreddit. This ROC curve is illustrated in Fig. 2.5. As expected,
the /r/AmItheAsshole results are highest (AUC=0.725), but still not particularly
high. This mediocre annotation performance indicates this is a difficult task for
human annotators: comment text does not always clearly denote the judgement of the
author. Given that the /r/AmItheAsshole task was difficult, we expect the task
in other subreddits to be even more difficult. The ROC curves corresponding to
the other subreddits are also displayed in Fig. 2.5. The mean-average AUC is 0.55,
with /r/offmychest scoring highest (AUC=0.625) and /r/confessions scoring lowest

[Figure 2.5: ROC curve. X-axis: False Positive Rate (0 to 1); y-axis: True Positive Rate
(0 to 1). One curve per subreddit: /r/AmItheAsshole, /r/relationships,
/r/relationship_advice, /r/legaladvice, /r/dating_advice, /r/dating, /r/changemyview,
/r/confessions, /r/offmychest, /r/TrueOffMyChest, /r/CasualConversation.]

Figure 2.5. Judge-BERT analysis at various levels of annotator agreement.

(AUC=0.484).
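
One way to obtain an ROC curve from graded annotator agreement is sketched below. It is only one possible reading of the procedure, which the text does not fully specify: the number of annotators voting YTA is treated as a score, the reference label is taken as binary, and each of the six vote levels becomes a threshold. All names are illustrative, and the area is computed with the trapezoidal rule.

```python
def roc_points(votes, reference, n_annotators=5):
    """ROC points from annotator vote counts.

    votes[i]     -- number of annotators (0..n_annotators) who chose YTA
    reference[i] -- True if the reference label for comment i is YTA
    Both classes must be present in `reference`.
    """
    positives = sum(reference)
    negatives = len(reference) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step: "at least k annotators chose YTA".
    for k in range(n_annotators, -1, -1):
        tp = sum(1 for v, r in zip(votes, reference) if r and v >= k)
        fp = sum(1 for v, r in zip(votes, reference) if not r and v >= k)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve given as (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

When the vote counts carry no information about the reference label, the curve collapses onto the diagonal and the area is 0.5, which is the sense in which values close to 0.5 indicate near-random agreement.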
If we discard the sensitivity analysis and just look at the majority label from the
human annotators we find that the average majority-label accuracy is 62%. But this
masks a large discrepancy: when the Judge-BERT label is NTA, then the accuracy is
71.8%; when the Judge-BERT label is YTA, then the accuracy drops to worse than
random (44.4%). Given the 50/50 breakdown of NTA/YTA labels, we can deduce
that human annotators were far less likely than Judge-BERT to label a comment as
YTA.

2.3.2 Moral Valence and Popularity

Here we can begin to answer RQ1: Is moral valence correlated with the score of
a post? In other words, do posts with positive moral valence score higher or lower
than posts with negative moral valence? To answer this question, we extracted all
posts and their highest scoring top-level comment from 2018 from each subreddit

TABLE 2.4

NUMBER OF POSTS JUDGED WITH POSITIVE AND NEGATIVE MORAL VALENCE IN EACH SUBREDDIT (2018).

Subreddit                 Positive    Negative
/r/relationship_advice    303,860     113,987
/r/relationships          397,835     156,082
/r/dating_advice          108,980     45,279
/r/legaladvice            194,083     72,784
/r/dating                 77,717      30,569
/r/offmychest             224,187     80,089
/r/TrueOffMyChest         409,889     160,128
/r/confessions            67,768      26,659
/r/CasualConversation     40,772      10,303
/r/changemyview           49,561      18,761

in Table 2.3. The counts for number of positive and negative posts found for each
subreddit are listed in Table 2.4.
Popularity scores on Reddit exhibit a power-law distribution, so the mean-scores
and their differences will certainly be misleading. Instead, we plot the ratio of com-
ments judged to be positive against all comments as a function of the post score
cumulatively in Fig. 2.6. Higher values in the plot indicate more positive valence.
The results here are clear: post popularity was associated with positive moral va-
lence. Most of the subreddits appeared to have similar characteristics except for
/r/CasualConversation, which had a much higher positive valence (on average) than
the other subreddits. Mann-Whitney Tests for statistical significance on individual

[Figure 2.6: line plot. X-axis: Post Score (log scale, 10^0 to 10^5); y-axis: Valence Ratio
for Post Score ≤ x (0.5 to 0.8). One line per subreddit: /r/relationship_advice,
/r/relationships, /r/dating_advice, /r/legaladvice, /r/dating, /r/offmychest,
/r/TrueOffMyChest, /r/confessions, /r/CasualConversation, /r/changemyview.]

Figure 2.6. Posts judged to have positive valence as a function of post
score. Higher indicates more positive valence. Higher post scores are
associated with more positive valence (Mann Whitney τ ∈ [0.40, 0.47],
p < 0.001 two-tailed, Bonferroni corrected).

subreddits as well as the aggregation of these tests with Bonferroni correction found that
posts with positive valence had significantly higher scores than posts with negative
valence (τ ∈ [0.40, 0.47], p < 0.001 two-tailed).
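
The cumulative ratio plotted in Fig. 2.6 can be computed as in the sketch below, assuming each post is represented as a pair of its score and whether its top-rated comment was judged positive. The function name, the toy data, and the threshold grid are illustrative.

```python
def cumulative_positive_ratio(posts, thresholds):
    """Share of positively judged posts among all posts with score <= x.

    posts      -- iterable of (score, is_positive) pairs
    thresholds -- score cut-offs x at which to evaluate the ratio
    """
    ordered = sorted(posts, key=lambda post: post[0])
    curve = []
    seen = positive = 0
    for x in sorted(thresholds):
        # Absorb every post whose score does not exceed the current cut-off.
        while seen < len(ordered) and ordered[seen][0] <= x:
            positive += int(ordered[seen][1])
            seen += 1
        curve.append((x, positive / seen if seen else None))
    return curve


# Toy data: the only low-scoring post is negative, the rest are positive.
posts = [(1, False), (5, True), (50, True), (500, True)]
print(cumulative_positive_ratio(posts, [1, 10, 100, 1000]))
```

On this toy input the ratio rises from 0 at the lowest cut-off to 0.75 at the highest. Because each point aggregates all posts up to a cut-off, the curve is far less sensitive to the heavy tail of the score distribution than a comparison of mean scores would be.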
The correlation between posts with positive moral valence and higher scores can
be explained based on the process by which posts are made. Because posts are made
before votes are cast and the text of a post is (typically) unchanged, then the votes
received for a post are likely related to the moral valence of a given post. Posts are
created on a regular basis and will become more popular based on user votes. With
this in mind it is reasonable to assume that posts with positive moral valence are
upvoted sooner and more often than others, leading to higher total scores.
These findings appear to conflict with studies we previously mentioned that
showed that negative posts elicit anger and encourage a negative feedback loop on
social media [9, 37]. As these studies have shown, posts that elicit moral outrage end
up being shared more often on other social media sites. Our expectation for Reddit
was the same.
A further inspection of the posts indicated that posts classified as having posi-
tive moral valence often found users expressing that a moral norm had indeed been
breached. However, the difference in our results compared to others may be ex-
plained by perceived intent, that is, whether or not the moral violation occurred
from an intentional agent towards a vulnerable agent, cf. dyadic morality [119].
Our inspection of comments expressing negative moral judgement confirmed that the
perceived intent of the poster was critical to the judgement rendered. These negative
judgements typically highlighted what the poster did wrong and advised the poster
to reflect on their actions (or sometimes simply insulted the poster). Conversely,
we found that many posts judged to be positive clearly show that the poster was
the vulnerable agent in the situation to some other intentional agent. Responses to
these posts often displayed sympathy towards the poster and also outrage towards
the other party in the scenario. These instances are perhaps best classified as exam-
ples of empathetic anger [66], which is anger expressed over the harm that has been
done to another. We also note that some of the content labeled as having positive
moral valence is simply devoid of a moral scenario. Simply put, comments without a
moral impetus tended to be labeled as NTA. Examples of this can be seen primarily
in /r/CasualConversation, where the majority of posts are about innocuous topics.
Another possible explanation for our findings is that users on other online so-
cial media sites like Facebook and Twitter are more likely to like and share news
headlines that elicit moral outrage; these social signals are then used by the site’s
algorithms to spread the headline further throughout the site [17, 56]. Furthermore,
the content of the articles triggering these moral responses often covers current news
events throughout the world. Our Reddit dataset, on the other hand, typically deals
with personal stories and therefore tends not to have the same in-group/out-group
reactions as those found on viral Facebook or Twitter posts.

2.4 Assigning Judgements to Users

Next we investigate RQ2: Do certain subreddit-communities attract users whose
posts are typically classified by more negative or positive moral judgements? To
answer this question we need to reconsider our unit of analysis. Rather than assigning
moral valence to the individual post, in this analysis we consider the moral valence of
the user who made the post. To do this, we again collected all posts and comments
from the ten subreddits and found the highest scoring top-level comment for each post; we classify
whether that comment is judging the post to have positive or negative moral valence
and then tally this for the posting user.
Of course, users are also able to post comments and sub-comments. So we ex-
panded this analysis to include judgements of users from throughout the entire com-
ment tree. Each comment can have zero or more replies each with its own score. So,
for each comment we identified the reply with the highest score and classify whether
that reply is judging the comment to have positive or negative moral valence, and
then tally this for the commenting user. We did this for each comment that has at
least one reply at all levels in the comment tree.
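The tallying procedure described above can be sketched as a walk over the comment tree. This is a minimal sketch under stated assumptions: the `Node` structure and `toy_classifier` are hypothetical stand-ins (the latter for the Judge-BERT classifier, whose interface is not given here), and ties between equally scored replies are broken by order of appearance.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field

@dataclass
class Node:
    """A post or comment; `replies` holds its direct children."""
    author: str
    text: str
    score: int = 0
    replies: list = field(default_factory=list)

def toy_classifier(text):
    # Hypothetical stand-in for Judge-BERT: returns a valence label.
    return "negative" if "YTA" in text else "positive"

def tally_judgements(root, classify, tallies=None):
    """For every node with at least one reply, classify its
    highest-scoring reply and credit the label to the node's author."""
    if tallies is None:
        tallies = defaultdict(Counter)
    stack = [root]
    while stack:
        node = stack.pop()
        if node.replies:
            top_reply = max(node.replies, key=lambda r: r.score)
            tallies[node.author][classify(top_reply.text)] += 1
            stack.extend(node.replies)
    return tallies

post = Node("op", "My situation ...", 10, [
    Node("a", "YTA for sure", 5, [
        Node("op", "I did nothing wrong", 1),
        Node("b", "NTA, calm down", 3),
    ]),
    Node("c", "NTA", 9),
])
for user, counts in tally_judgements(post, toy_classifier).items():
    print(user, dict(counts))
```

Passing the same `tallies` dictionary across many posts accumulates the per-user counts of positive and negative judgements used in the rest of this section.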
Because we assign moral valence scores to users, we are able to capture all judgements
across the ten subreddits and better understand user behavior. It is important
to remember that the classifier classifies the moral valence of text – with some amount
of uncertainty – not the user specifically. So we emphasize that we did not label users
as “good” or “bad” explicitly; rather, we identified users as having submitted posts
and comments that were similar to comments that previously received positive or
negative moral judgement.
We included only users that were judged at least 50 times. Each user therefore
has an associated count of positive and negative judgements. This raises an interesting
question: are some users judged more positively or negatively than others? What
does that distribution look like? To understand this breakdown we first plot a Lorenz
curve in Fig. 2.7. We found that the distribution of moral valence is highly unequal:
about 10% of users receive almost 40% of the observed negative judgements (Gini
coefficient = 0.515).

[Figure 2.7: Lorenz curve; x-axis: fraction of population (0 to 1); y-axis: cumulative negative judgements (0 to 1).]

Figure 2.7. Lorenz curve depicting the judgement inequality among users;
Gini coefficient = 0.515
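For readers unfamiliar with the measure, the Lorenz curve and Gini coefficient can be computed from per-user counts of negative judgements roughly as follows. This is a sketch using the trapezoidal area under the Lorenz curve; the input counts are hypothetical and the exact estimator used for Fig. 2.7 is not stated in the text.

```python
def lorenz_curve(counts):
    """Return Lorenz points (population fraction, cumulative share),
    with users sorted from fewest to most negative judgements."""
    ordered = sorted(counts)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((i / len(ordered), running / total))
    return points

def gini(counts):
    """Gini coefficient = 1 - 2 * (area under the Lorenz curve)."""
    pts = lorenz_curve(counts)
    area = sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area

# Hypothetical negative-judgement counts for eight users.
negative_counts = [2, 3, 3, 5, 8, 13, 21, 60]
print(round(gini(negative_counts), 3))
```

A value of 0 means every user receives the same number of negative judgements, and values near 1 mean a few users receive nearly all of them.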
This indicated that a relatively small group of users received a disproportionate
share of negative judgements. To identify those users who received a statistically
significant proportion of negative judgements we performed a one-sided binomial test
on each user. Simply put, this test emits a negativity probability, i.e., the p-value:
the probability that a user would receive at least as many negative judgements by
chance alone. Lower values therefore indicate users who are more certainly negative.
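A one-sided binomial test of this kind can be sketched as below. The null rate `p0` is an assumption made for illustration (for example, the overall fraction of negative judgements in the data); the text does not state which null rate was used.

```python
from math import comb

def negativity_p_value(n_negative, n_total, p0):
    """One-sided binomial p-value: P(X >= n_negative) when each of the
    n_total judgements is negative with probability p0."""
    return sum(comb(n_total, k) * p0**k * (1 - p0) ** (n_total - k)
               for k in range(n_negative, n_total + 1))

# Hypothetical user: 50 judgements, 30 of them negative, null rate 0.3.
p = negativity_p_value(30, 50, p0=0.3)
print(f"p = {p:.6f}; flagged as negative at 0.05: {p < 0.05}")
```

Users with a small p-value are those whose share of negative judgements is unlikely under the assumed null rate.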
Finally, we illustrate the membership of each subreddit as a function of a user’s
negativity probability. As expected, Fig. 2.8 shows that as we increased the negativity
threshold from almost certainly negative to uncertainty (from left to right) we
also increased the fraction of comments observed. These curves therefore indicate the
density of comments that are made from negative users (for varying levels of negativity);
higher lines (especially on the left) indicate higher concentration of negativity.
We found that /r/confessions, /r/changemyview, and /r/TrueOffMyChest contain a
higher concentration of comments from more-negative users. On the opposite side
of the spectrum, we found that /r/CasualConversation and /r/legaladvice have deep
curves, which implies that these communities have fewer negative users than others.

[Figure 2.8: fraction of comments observed (y-axis, log scale, 10^-5 to 10^0) plotted against the negativity probability threshold (x-axis, 0 to 1), with one line for each of /r/relationship advice, /r/relationships, /r/dating advice, /r/legaladvice, /r/dating, /r/offmychest, /r/TrueOffMyChest, /r/confessions, /r/CasualConversation, and /r/changemyview.]

Figure 2.8. Number of comments (normalized) as the negativity
threshold is raised. As the negativity threshold is raised the fraction of
comments revealed tends towards 1. Higher lines indicate a higher
concentration of negative users and vice versa.

2.4.1 Three Types of Negative Users

We selected a sub-group of users that had a statistically significant negative moral
valence by identifying those that were found to have a p-value less than 0.05 from the
one-tailed binomial test. Within this group we investigated their posting habits to
determine what types of posts they made to garner such a large number of negative
judgements. From our analysis of these users we determined that they generally fell
into three different stylistic groups.

1. Explainer: These users argued that what they did was not that wrong.

2. Stubborn Opinion: Users that did not acquiesce to the prevailing opinion of
the responders.

3. Returner: Users that repeatedly posted the same situation hoping to elicit
more-favorable responses.

The first type of user that we observed was the Explainer. The explainer typically
made a post and received comments that condemned their immoral actions. In
response to this judgement, the explainer would reply to many of the comments
in an attempt to convince others that what they did was in fact moral. Often, this
only served to exacerbate the judgements made against them. This then led to further
negative judgements. In fact, we found that many of these users had only made a
handful of posts that each resulted in a large number of comments in self-defense. The
large number of users that responded to these comments with negative judgements
is similar to the effect of online firestorms [114] but at a scale contained to only an
individual post. For these types of posts we also note that some people did come to
the defense of the poster, which follows similar findings that people show sympathy
after a person has experienced a large amount of outrage [118].

[Figure 2.9: sequence of posts by one user under the title "Me and my partner are having a baby.", each paired with an example response.]

Post                                 Response
How noticeable is her belly bump?    You really are insecure about her losing her hot body.
Why is she so different now?         You think her being stressed, tired and sick is bullshit?
Why is she so distant?               Your other posts don’t make you seem as eager to be involved.
Her friends are asking about it.     Neither of you are mature enough to deal with a baby.
I’m moving on.                       Quit being a dumbass and deal with your responsibilities.

Figure 2.9. A diagram showing the posting habits of a Returner. Posts are
in the light blue boxes, with blue arrows representing the order of posts. An
example of a post response is shown in the white box with the red arrow
representing the post it came from. Each post is prefaced with the
overarching title, “Me and my partner are having a baby.” followed by the
current update on the situation. The response comments have also been
condensed from their full length.
The second type of user we observed is the Stubborn Opinion user. These users
are similar to but opposite from the Explainers. Rather than trying to change their
perspective, the Stubborn Opinion user would refuse to acquiesce to the prevailing
opinion of the comment thread. For example, users posting to /r/changemyview
that do not express a change of opinion despite the efforts and agreement of the
commenting users often incur comments casting negative judgement. This back-
and-forth sometimes becomes hostile. Many of these conversations end in personal
attacks from one of the participants, which has also been shown in previous work on
conversations in /r/changemyview [27].
The third type of user is the Returner. The returner sought repeated feedback
from Reddit on the same subject. For example, when returners made posts seeking
moral judgement, they often engaged in some of the discussion and might even agree
with some of the critical responses. Some time later, the user returned to edit their
original post or to make another post providing an update about their situation. An
example of a Returner is illustrated in Fig. 2.9. In this case, a user continued to request
advice after recently impregnating their partner. In these situations responding users
often found previous posts on the same topic made by the same user and then used
this information to build a stronger case against the user or to argue that the new
post was nothing but a thinly-veiled attempt to shine a more-favorable light on their
original situation. These attempts usually backfired and resulted in more negative
judgments being cast against the user.

2.4.2 Gender and Age Analysis

Our final task investigates RQ3: Are self-reported gender and age descriptions
associated with positive or negative moral judgements? Recent studies on this topic
have found that gender and moral judgements have a strong association [111]. Specifically,
women are perceived to be victims more often than men and harsher punishments
are sought for men. The rates at which men commit crimes tend to be higher
than the rates of female crime, and society generally views crimes as moral violations [32].
If we apply these recent findings to our current research question, we expect
to find that male users were judged negatively more often than females.
Many studies have analyzed the relationship between morality and age. Most
of these studies have followed groups of people through their teenage years as they
develop into adulthood [34, 3]. In these studies people were presented with moral
dilemmas and interviewed to assess their moral reasoning. This is similar to how we
expect people to respond to the scenarios presented on Reddit. One major deviation
from these studies is that our posts encompassed only a small portion of the scope
of possible moral scenarios.

TABLE 2.5

CONTINGENCY TABLES

/r/relationship advice

            Positive    Negative
Male          53,416      26,281
Female        57,126      20,714

/r/relationships

            Positive    Negative
Male         139,163      74,384
Female       216,190      78,823
This analysis is not usually possible on public social media services because gender
and age are not typically revealed and anonymous posting is allowed. Fortunately,
the posting guidelines of /r/relationships and /r/relationship advice required
posters to indicate their age and gender in a structured manner directly in the post
title. An example of this can be seen in Fig. 2.10,
where the poster uses [M27] to indicate that they identified as male aged 27 years
and that their partner [F25] was identified as female aged 25 years. Using these
conventions we were able to reliably extract users’ self-reported age and gender.

Figure 2.10. Example of a post title from /r/relationships. Subreddit rules
require the poster to indicate their age and gender, as well as the gender and age
of any other individuals.

We again applied our Judge-BERT model to assign a moral judgement to the post
based on the top-scoring comment. In total we extracted judgements from 508,560
posts on /r/relationships and 157,537 posts on /r/relationship advice. Because the
posting age of a user may not be above the age of majority, we were careful to only
collect data from users that are aged 18 and older. In general, the age breakdown
appears to closely follow Reddit’s age demographic, which has previously been shown to have a mean
age of 27 [55]. Of the posters collected from our two subreddits of interest, 90% were
between 18 and 30 years old.
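The title convention and the age filter described above can be sketched with a regular expression. This sketch handles only the bracketed form shown in the text ([M27], [F25]); the example title is invented, and treating the first tag as the poster's own is an assumption rather than something the text specifies.

```python
import re

# Matches bracketed tags such as [M27] or [F25]; other formats are ignored.
TAG = re.compile(r"\[([MF])(\d{1,2})\]")

def extract_poster(title, min_age=18):
    """Return (gender, age) for the first tag in the title, or None.

    Assumes the first tag describes the poster, and drops posters
    younger than min_age.
    """
    match = TAG.search(title)
    if match is None:
        return None
    gender, age = match.group(1), int(match.group(2))
    return (gender, age) if age >= min_age else None

# Hypothetical title in the bracketed style.
print(extract_poster("Me [M27] and my girlfriend [F25] disagree about moving"))
```

Titles without a recognizable tag, or with a poster under 18, yield `None` and would be excluded from the analysis.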
Our first task was to determine if any association exists between moral judgement
and gender. To answer this question we performed a χ² test of independence. Contingency
tables for this test are reported in Table 2.5. The χ² test reports a significant
association between gender and moral judgement in /r/relationships (χ²(1, N = 508,560) =
3874.6, p < .0001) and in /r/relationship advice (χ²(1, N = 157,537) = 762.2, p < .0001).
However, the χ² test on such large sample sizes usually results in statistical significance;
in fact, the χ² test tends to find statistical significance for populations greater
than 200 [126]. So we verified this association using φ, which measures the strength of
association controlled for population size. In this case, φ = 0.09 for /r/relationships
and φ = 0.07 for /r/relationship advice. These low values indicate that there was
only a small association between gender and moral judgement when adjusted for
sample size.
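To make the relationship between χ² and φ concrete, the sketch below applies the closed-form 2×2 statistic to the counts in Table 2.5 and then computes φ = sqrt(χ²/N). It assumes no continuity correction is applied; under that assumption the results agree with the reported values after rounding.

```python
import math

def chi2_and_phi(table):
    """Pearson chi-square (df = 1, no continuity correction) and phi
    for a 2x2 table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    phi = math.sqrt(chi2 / n)
    return chi2, phi

# Rows: male, female. Columns: positive, negative (Table 2.5).
tables = {
    "/r/relationships": [[139_163, 74_384], [216_190, 78_823]],
    "/r/relationship advice": [[53_416, 26_281], [57_126, 20_714]],
}
for name, table in tables.items():
    chi2, phi = chi2_and_phi(table)
    print(f"{name}: chi2 = {chi2:.1f}, phi = {phi:.2f}")
```

Because φ divides out the sample size, it stays small even though χ² is very large, which is the point made in the paragraph above.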

TABLE 2.6

LOGISTIC REGRESSION ANALYSIS

/r/relationship advice

Variable      Coefficient   p-value   95% CI
(Constant)    -1.1575       <0.001    (-1.2093, -1.1058)
Gender         0.3076       <0.001    (0.2806, 0.3241)
Age            0.0059       <0.001    (0.0039, 0.0080)

/r/relationships

Variable      Coefficient   p-value   95% CI
(Constant)    -1.0923       <0.001    (-1.1214, -1.0631)
Gender         0.3814       <0.001    (0.3693, 0.3935)
Age            0.0034       <0.001    (0.0023, 0.0046)

2.4.3 Logistic Regression Analysis

Our second task was to determine if gender and age were associated with moral
judgement. In other words, were young females, for instance, judged more positively
than, say, old males? To answer this question, we fit a two-variable logistic regression
model where the binary-variable gender is encoded as 0 for female and 1 for male.
We report the findings from the logistic regressor for each subreddit in Table 2.6.
These results indicate that males were judged more negatively than females. Specifically,
in /r/relationship advice being male was associated with a 35% increase in
the odds of receiving a negative judgement. Similarly, in /r/relationships being male was
associated with a 46% increase in the odds of receiving a negative judgement.

We also found that age had a relatively small effect on moral judgement: increased
age is slightly correlated with negative judgement. Specifically, in /r/relationship advice
an increase in age by one year was associated with a 0.59% increase in the odds of receiving a
negative judgement. In /r/relationships an increase in age by one year was associated
with a 0.34% increase in the odds of receiving a negative judgement.
Simply put, those who were older and those who were male (independently) were
statistically more likely to receive negative judgements from Reddit than those who
were younger and female. However, gender was a much larger contributing factor
than age, and neither association was particularly strong.
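The percentages above follow from exponentiating logistic regression coefficients, which gives odds ratios. As a worked illustration, the sketch below uses the /r/relationships coefficients from Table 2.6; the predicted-probability helper and the example age of 27 are illustrative additions, not part of the original analysis.

```python
import math

# Coefficients for /r/relationships as reported in Table 2.6.
CONSTANT, GENDER, AGE = -1.0923, 0.3814, 0.0034

def pct_change_in_odds(coef):
    """Percent change in the odds of a negative judgement per unit increase."""
    return (math.exp(coef) - 1) * 100

def prob_negative(is_male, age):
    """Predicted probability of a negative judgement (gender: 0 = female, 1 = male)."""
    logit = CONSTANT + GENDER * is_male + AGE * age
    return 1 / (1 + math.exp(-logit))

print(f"male vs. female: {pct_change_in_odds(GENDER):.1f}% higher odds")
print(f"per year of age: {pct_change_in_odds(AGE):.2f}% higher odds")
print(f"27-year-old male:   {prob_negative(1, 27):.3f}")
print(f"27-year-old female: {prob_negative(0, 27):.3f}")
```

Note that these are changes in odds rather than in probability; the difference in predicted probability between the two example users is smaller than the odds ratio alone might suggest.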

2.5 Conclusion

In this study, we showed that it is possible to learn the language of moral judge-
ments from text taken from /r/AmITheAsshole. We demonstrated that by extracting
the labels and fine-tuning a BERT language model we could achieve good performance
at predicting whether a user rendered a positive or negative moral judgement. This
performance was verified by human annotators on our subreddits of interest. Using
our trained classifier we then analyzed a group of subreddits that are thematically
similar to /r/AmITheAsshole for underlying trends. Our results showed that users
prefer posts that have a positive moral valence rather than a negative moral valence.
Another analysis revealed that a small portion of users are judged to have substan-
tially more negative moral valence than others and they tend towards subreddits such
as /r/confessions. We also showed that these highly negative moral valence users fall
into three different types based on their posting habits. Lastly, we demonstrated that
age and gender have a minimal effect on whether a user was judged to have positive
or negative moral valence.
Although the Judge-BERT classifier enabled us to perform a variety of analyses,
it does pose some limitations. The test subreddits do deviate from the types of
moral analysis observed in the training data. While we did show that Judge-BERT
generalizes to other subreddits, this does not change that moral judgement is not
the focus of /r/CasualConversation, for example. Another important limitation is
that these claims may not generalize to all of Reddit. Previous work has shown
that different subreddit communities can have unique norms [26]. For instance, our
finding that men were judged more negatively than women may only hold in the two
subreddits of analysis, /r/relationships and /r/relationship advice. Whether this rule
holds for all of Reddit is unconfirmed based on our study. This could also implicate
the classifier if the norms that are learned from /r/AmITheAsshole do not align with
other subreddits of analysis.
In the future we hope to implement argument mining in order to gain a better
understanding of the reasons for these judgements by extracting the underlying arguments
given by users. As previously mentioned, studies such as Forbes et al. [50] have
done this using a large-scale annotation effort. Creating an automated method that
performs well at this task would allow for a more in-depth analysis of the judgements
being cast on Reddit. Argument mining has already seen success on Reddit in
extracting the persuasive arguments from subreddits like /r/changemyview [45] and
would enable us to get a better understanding of moral judgements on social media.
This would also allow us to aggregate the underlying themes from these judgements
for further analysis. An argument mining system would also give us a
clearer picture of our current findings. We could revisit our research questions,
such as RQ3 as it relates to gender, and such a study could better
explain these findings.

CHAPTER 3

ENTITY GRAPHS FOR ONLINE DISCOURSE

The work presented in this chapter is a collaboration with Tim Weninger and was
published in the Journal of Knowledge and Information Systems in 2023 [11].

3.1 Introduction

In any conversation, members continuously track the topics and concepts that are
being discussed. The colloquialism “train-of-thought” is often used to describe the
path that a discussion takes, where a conversation may “derail,” or “come-full-circle,”
etc. An interesting untapped perspective of these ideas exists within the realm of
the Web and Social Media, where a train-of-thought could be analogous to a trail
over a graph of concepts. With this perspective, an individual’s ideas as expressed
through language can be mapped to explicit entities or concepts, and, therefore,
a single argument or train-of-thought can be treated as a path over the graph of
concepts. Within a group discussion, the entities, concepts, arguments, and stories
can be expressed as a set of distinct paths over a shared concept space, what we call
an entity graph.
Scholars have long studied discourse and the flow of narrative in group conversa-
tions, especially in relation to debates around social media [102] and intelligence [92].
The study of language and discourse is rooted in psychology [24] and consciousness
[23].
Indeed, the linguist Wallace Chafe considered “...conversation as a way separate
minds are connected into networks of other minds.” [24] Looking at online conversations
from this angle, a natural hypothesis arises: If we think of group discussion as
a graph of interconnected ideas, then can we learn patterns that are descriptive and
predictive of the discussion?
Fortunately, recent developments in natural language processing, graph mining,
and the analysis of discourse now permit the algorithmic modelling of human discus-
sion in interesting ways by piecing them together. This is a broad goal, but in the
present work we provide a first step towards graph mining over human discourse.
Another outcome of the digital age is that much of human discourse has shifted to
online social systems. Interpersonal communication is now observable at a massive
scale. Digital traces of emails, chat rooms, Twitter or other threaded conversations
that approximate in person communication are commonly available. A newer form of
digital group discussion can be seen in the dynamics of Internet fora where individuals
(usually strangers) discuss and debate a myriad of issues.
Technology that can parse and extract information from these conversations cur-
rently exists and operates with reasonable accuracy. From this large body of work,
the study of entity linking has emerged as a way to ground conversational statements
to well-defined entities, such as those that constitute knowledge bases and knowledge
graphs [125]. Wikification, i.e., where entities in prose are linked to Wikipedia entries
as if the text were written for Wikipedia, is one example of entity linking [30]. The Information
Cartography project is another example that uses these NLP-tools to create
visualizations that help users understand how related news stories are connected in
a simple, yet meaningful manner [124, 123, 78]. But because entity linking tech-
niques have been typically trained from Wikipedia or long-form Web-text, they have
a difficult time accurately processing conversational narratives, especially from social
media [40]. Fortunately, Social-NLP has made considerable
strides in recent years [108], providing the ability to extract grounded information
from informal, threaded online discourse [82].

Figure 3.1. Illustration of an entity graph created from threaded
conversations from /r/politics (blue edges) and /r/Conservative (red edges).
The x-axis represents the (threaded) depth at which each entity was
mentioned within conversations, extracted from Reddit, rooted at
Joe Biden. The y-axis represents the semantic space of each entity, i.e.,
similar entities are closer than dissimilar entities on the y-axis. Edge colors
denote whether the transition from one entity set to another occurs more
often in one group’s conversations than in the other’s. Node colors represent
equivalent entity sets along the x-axis. In this visualization we observe a
pattern of affective polarization as comments coming from /r/Conservative
are more likely to drive the conversation towards topics related to the
opposing political party.

Taking this perspective, the present work studies and explores the flow of entities
in online discourse through the lens of entity graphs. We focus our attention on
discussion threads from Reddit, but these techniques should generalize to online
discussions on similar platforms so long as the entity linking system can accurately
link the text to the correct entities. The threaded conversations provide a clear
indication of the reply-pattern, which allows us to chart and visualize conversation-
paths over entities.
To be clear, this perspective is the opposite of the conventional social networks
approach, where information and ideas traverse over user-nodes; on the contrary, we
consider discourse to be humans traversing over a graph of entities. The conventional
approach to social networks is important for areas such as influence maximization
[79] and the spread of behaviors [22]. Instead, the goal of our alternative perspective
is to discover this network of minds and uncover patterns of how they think over
topics. This alternative perspective is motivated by the large number of influence
campaigns [120], information operations [140], and the effectiveness of disinformation
[59]. These campaigns often operate by seeding conversations in order to exploit
conversation patterns and incite a particular group. Another motivation for our
proposed methodology is humans’ attraction towards homophily and the large number
of echo chambers that have been created online [33, 53]. Prior works [53] looking at
echo chambers in political discourse rely on this notion of the ideas spreading between
user-nodes. Other works looking at morality [17] also follow this notion of how moral
text spreads throughout a user network. We stress here that our entity graph will
allow for a flipped perspective of having users move across the graph of entities in
various types of conversations. This position allows for a different form of analysis
into how different groups or communities think as a whole.
Our way of thinking is illustrated in Fig. 3.1, which shows a subset of path traver-
sals, which we describe in detail later, from thousands of conversations in /r/politics
and /r/conservative that start from the entity Joe Biden. As a brief preview, we
find that conversations starting with Joe Biden tend to lead towards United States in
conversations from the /r/conservative subreddit (indicated by a red edge), but commonly
lead towards mentions of the Republican Party in conversations from /r/politics
(indicated by a blue-purple edge). From there the conversations move onward to various
other entities and topics that are cropped from Fig. 3.1 to maintain clarity.
In the present work, we describe how to create entity graphs and use them to
answer questions about the nature of online, threaded discourse. Specifically, we ask
three research questions:

RQ1 How predictable is online discourse? Can we accurately determine where a
conversation will lead?

RQ2 What do entity graphs of Reddit look like? In general, does online discourse
tend to splinter, narrow, or coalesce? Do conversations tend to deviate or stay
on topic?

RQ3 Can cognitive-psychological theories on spreading activation be applied to further
illuminate and compare online discourse?

We find that entity graphs provide a detailed yet holistic illustration of online
discourse in aggregate that allow us to address our proposed research questions.
Conversations have an enormous, visually random, empirical possibility space, but
attention tends to coalesce towards a handful of common topics as the depth increases.
Prediction is difficult, and gets more difficult the longer a conversation goes on.
Finally, we show that entity graphs present a particularly compelling tool by which
to perform comparative analysis. For example, we find, especially in recent years,
that conservatives and liberals both tend to focus their conversations on the out-group,
a notion known as affective polarization [72]. We also find that users tend to
stick to the enforced topics of a subreddit, as shown by how /r/news tends towards
entities from the United States and /r/worldnews tends towards non-US topics.

3.2 Methodology

3.2.1 Online Discourse Dataset

Of all the possible choices from which to collect online discourse, we find that
Reddit provides exactly the kind of data that can be used for this task. It is freely
and abundantly available [8], and it has a large number of users and a variety of
topics. Reddit has become a central source of data for many different works [93].
For example, recent studies on the linguistic analysis of Schizophrenia [161], hate
speech [25], misogyny [48], and detecting depression related posts [132] all make
substantial use of Reddit data.
The threading system that is built into Reddit comment-pages is important for
our analysis. Each comment thread begins with a high level topic (the post title),
that is often viewed as the start of a conversation around a specific topic. Users
often respond to the post with their own comments. These can be viewed as direct
responses to the initial post, and then each of these comments can have replies. This
threading system generates a large tree structure where the root is the post title. Of
course, such a threading system is only one possible realization of digital discussion,
but this system provides the ability to understand how conversations move as users
respond to each other in turn. Twitter, Facebook, and YouTube also have discussion
sections, but it is very difficult to untangle who is replying to whom in these (mostly)
unthreaded systems.
Reddit contains a large variety of subreddits, which are small communities focused
on a specific topic. We limit our analysis to only a small number of them, but for
each selection we obtain their complete comment history from January 2017 to June
2021. In total we selected five subreddits: /r/news, /r/worldnews, /r/Conservative,
/r/Coronavirus and /r/politics. We selected these subreddits because they are large
and attract a lot of discussion related to current events, albeit with their own per-
spectives and guidelines. These subreddits also contain a large number of entities,
which we plan to extract and analyze.
Like most social sites, Reddit post-engagement follows the 90-9-1 rule of Internet
engagement. Simply put, most users don’t post or comment, and most posts receive
almost no attention [93]. Because of this we limit our data to include only those
threads that are in the top 20% in terms of number of comments per post. Doing so
ensures that we mostly collect larger discussion threads that have an established back
and forth. We also ignore posts from well-known bot accounts (e.g., AutoMod,
LocationBot) to ensure we get actual user posts in the conversation.

[Figure 3.2: (a) a comment thread and (b) the entity tree obtained from it by entity extraction and linking. Each line below gives a comment followed by its entity set.]

Post title: "Germany challenges Russia over alleged cyberattacks" → {Germany, Russia}
Comment: "About time somebody had the balls to stand up to Russia" → {Russia}
Comment: "And then what happens? Sure Russia is messing around with other countries, but I doubt much will come from this." → {Russia}
Comment: "Well hopefully Germany can get the rest of Europe on board." → {Germany, Europe}
Comment: "It's sad people think that this is Russia. No one knows who it is. No one. Trump was right when he said it could be a 500 pound person in a basement with no affiliation." → {Donald_Trump, Russia}

Figure 3.2. (Left) Example comment thread with the post title as the root,
two immediate child comments, one of which has two additional child
comments. Entity mentions are highlighted in yellow. (Right) The resulting
entity tree where each comment is replaced by its entity set. Note the
case where the mention-text Trump in the comment thread is represented
by the standardized entity-label Donald Trump in the entity tree.
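The thread selection described in this subsection (keeping the most-commented 20% of threads and dropping bot comments) might look like the following sketch. The dictionary fields `num_comments` and `author` are hypothetical, and the bot list contains only the two accounts named in the text.

```python
BOT_AUTHORS = {"AutoMod", "LocationBot"}  # the two bots named in the text

def top_threads(threads, percent=20):
    """Keep the top `percent` of threads ranked by number of comments."""
    ranked = sorted(threads, key=lambda t: t["num_comments"], reverse=True)
    keep = (len(ranked) * percent + 99) // 100  # integer ceiling
    return ranked[:keep]

def drop_bots(comments):
    """Remove comments written by known bot accounts."""
    return [c for c in comments if c["author"] not in BOT_AUTHORS]

# Hypothetical threads with 1 to 10 comments each.
threads = [{"id": i, "num_comments": i} for i in range(1, 11)]
print([t["id"] for t in top_threads(threads)])
```

Integer arithmetic is used for the cutoff so that the number of threads kept does not depend on floating-point rounding.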

3.2.2 Entity Linking

We use entity linking tools to extract the entities from each post title and comment
in the dataset (cf. [125]). Entity linking tools seek to determine parts of free-form
text that represent an entity (a mention) and then map that mention to the appropri-
ate entity-listing in a knowledge base (disambiguation), such as Wikipedia. Existing
models and algorithms rely heavily on character matching between the mention-text
and the entity label, but more-recent models have employed deep representation
learning to make this task more robust [122].

An example of entity linking on a comment thread is illustrated in Fig. 3.2. Each
comment thread $T$ contains a post $R$, which serves as the root of the tree $c_r \in T$,
and comments $c_x \in T$, where the subscripts $r$ and $x$ index the post title and a
specific comment, respectively. Each comment can reply to the root, $c \rightarrow r$, or to
another comment, $c_x \rightarrow c_y$, thereby determining a comment's depth
$\ell \in [0, \ldots, L]$. Comments and post titles may or may not contain one or more
entities $S(c)$. These entity sets are likewise threaded, such that
$S(c_x) \rightarrow S(c_y)$ means that the entities in $c_x$ were responded to
with the entities in $c_y$, i.e., $c_x$ is the parent of $c_y$. With this formalism, the entity
linking task transforms a comment thread into an entity tree as seen in Fig. 3.2.
Specifically, we use the End-to-End (E2E) neural model created by Kolitsas
et al. [82] to perform entity linking on our selected subreddits. Previous work
has shown that entity linking on Reddit can be quite challenging due to the wide
variety of mentions used [12]. The E2E model has been shown to have high
precision on Reddit but lacks high recall [12]. We find this model
appropriate because we want the entities we find to be correct and reliable,
but we acknowledge that it may miss a portion of the less well-known entities, as well as
any new entities that arise from entity drift. The choice of this entity linker
also influenced our decision to analyze the selected subreddits, as its performance
is better on them. We also experimented with the popular REL
entity linker [137]. Although it retrieved many more entities from the comments,
we found a large number of those entities to be incorrect.
Using the E2E model we extract entities from each post title and comment individually
and construct the entity tree as illustrated in Fig. 3.2. Table 3.1 shows a
breakdown of the post, comment, and entity statistics for each subreddit considered
in the present work.
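
The transformation from a comment thread to an entity tree can be sketched as follows. This is only a minimal illustration under stated assumptions, not the pipeline used in this work: the `Comment` record, its field names, and the `link_entities` stub (a tiny lookup table standing in for a real entity linker such as E2E) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Comment:
    id: str
    parent_id: Optional[str]  # None marks the post title (the root)
    text: str

# Hypothetical stand-in for an entity linker: mention text -> entity label.
MENTIONS = {"Germany": "Germany", "Russia": "Russia",
            "Europe": "Europe", "Trump": "Donald_Trump"}

def link_entities(text):
    """Return the set of entity labels whose mention text occurs in `text`."""
    return frozenset(label for mention, label in MENTIONS.items() if mention in text)

def build_entity_tree(comments):
    """Replace each comment by its entity set S(c); keep its parent and depth."""
    by_id = {c.id: c for c in comments}
    tree = {}
    for c in comments:
        depth, cur = 0, c
        while cur.parent_id is not None:  # walk up to the root to get the depth
            cur = by_id[cur.parent_id]
            depth += 1
        tree[c.id] = {"parent": c.parent_id, "depth": depth,
                      "entities": link_entities(c.text)}
    return tree

thread = [
    Comment("r", None, "Germany challenges Russia over alleged cyberattacks"),
    Comment("c1", "r", "About time somebody stood up to Russia"),
    Comment("c2", "r", "It's sad people think this is Russia. Trump was right."),
    Comment("c3", "c1", "Well hopefully Germany can get the rest of Europe on board."),
]
tree = build_entity_tree(thread)
```

Here the root is placed at depth 0, and the mention text "Trump" is mapped to the standardized label `Donald_Trump`, mirroring the case noted in Fig. 3.2.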

TABLE 3.1

REDDIT DISCOURSE DATASET.

Subreddit         # Posts   # Comments   Total Entities   Unique Entities
/r/news             7,299      106,428          240,009            10,573
/r/worldnews       16,056      263,227          692,735            12,840
/r/politics        15,596      326,958          756,576            11,908
/r/Conservative     3,093       41,439          100,756             4,308
/r/Coronavirus     18,469      252,303          509,632            10,246

Top 20% of posts in terms of number of comments from January 2017
to June 2021.

3.2.3 Entity Graph

Given an entity tree, our next task is to construct a model that can be used to
make predictions about the shape and future of the conversation and that can also
serve as a visual, exploratory tool. Although entity trees may provide a good picture
of a single conversation, we want to investigate patterns more broadly. To
do this we consider conversations coming from a large number of entity trees in
aggregate. This model takes the form of a weighted directed graph $G = (V, E, w)$
where each vertex $v \in V$ is a tuple of an entity set $S(c)$ and its associated depth $\ell$ in
the comment tree, $v = (S(c), \ell)$. Each directed edge $e \in E$ connects two
vertices, $e = (v_1, v_2)$, such that the depth $\ell$ of $v_1$ is one less than the depth of
$v_2$. Each edge also carries a weight $w : E \rightarrow \mathbb{R}$ that represents
the frequency of the transition from one entity set to another. This directed graph
captures not only the specific concepts and ideas mentioned within the discourse, but
also the conversational flow over those concepts.
Continuing the example from above, Fig. 3.3 shows three individual paths P
representing the entity tree from Fig. 3.2(b). Each entity set moves from one depth

[Figure 3.3: (b) Entity Tree; (c) Conversation Paths, expanded over depths 1 to 3]

Figure 3.3. Paths extracted from the entity tree in Fig. 3.2(b) represented
by directed edges over entity sets.

to the next, representing the progression of the discussion.
During the construction of the entity paths, we remove comments that do not
have any replies. Short paths, those with a length less than three, do not offer much
information in terms of how the conversation will progress, because the conversation
empirically did not progress. It may be useful to analyze why some topics resulted
in no follow-on correspondence, but we leave this as a matter for future work.
Because we wish to explore online discourse in aggregate, this is the point where
we aggregate across many comment threads $T \in \mathcal{T}$, where $\mathcal{T}$ represents an entire
subreddit or an intentional mixture of subreddits depending on the task. We extract
all of the conversation paths from the comment threads in $\mathcal{T}$ to obtain a group of
conversation paths $\mathcal{P}$. To generate our graph we iterate over $\mathcal{P}$
and aggregate the paths to construct our entity graph. For every instance
of an entity set transition in a conversation path we increment the weight $w$ of its
respective edge in our entity graph. One key aspect is that we count each
transition only once per comment thread $T$. This ensures that entity transitions
do not get overcounted by virtue of a thread being larger and containing more
conversation paths overall.
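
The path extraction and aggregation steps described above can be sketched as follows. This is a hedged illustration rather than the implementation used here: the dictionary layout of an entity tree (`parent`, `depth`, `entities`), the function names, and the choice to apply the length filter to root-to-leaf paths are assumptions, and comments with empty entity sets are not treated specially.

```python
from collections import Counter

def root_to_leaf_paths(tree):
    """Yield each root-to-leaf path as a list of (entity_set, depth) vertices."""
    has_reply = {node["parent"] for node in tree.values()}
    for node_id in tree:
        if node_id in has_reply:  # interior node, not the end of a path
            continue
        path, cur = [], node_id
        while cur is not None:
            path.append((tree[cur]["entities"], tree[cur]["depth"]))
            cur = tree[cur]["parent"]
        yield path[::-1]  # reverse so that the root comes first

def build_entity_graph(trees, min_len=3):
    """Aggregate transitions over many threads, counting each once per thread."""
    weights = Counter()
    for tree in trees:
        seen = set()  # distinct transitions observed in this thread
        for path in root_to_leaf_paths(tree):
            if len(path) >= min_len:  # drop short paths
                seen.update(zip(path, path[1:]))
        weights.update(seen)  # +1 per distinct transition in this thread
    return weights

GR = frozenset({"Germany", "Russia"})
R = frozenset({"Russia"})
GE = frozenset({"Germany", "Europe"})
thread_a = {
    "r":  {"parent": None, "depth": 0, "entities": GR},
    "c1": {"parent": "r",  "depth": 1, "entities": R},
    "c2": {"parent": "c1", "depth": 2, "entities": GE},
    "c3": {"parent": "c1", "depth": 2, "entities": R},
}
thread_b = {
    "r":  {"parent": None, "depth": 0, "entities": GR},
    "c1": {"parent": "r",  "depth": 1, "entities": R},
    "c2": {"parent": "c1", "depth": 2, "entities": GE},
}
graph = build_entity_graph([thread_a, thread_b])
```

In `thread_a` the transition from `(GR, 0)` to `(R, 1)` lies on two paths, yet it contributes a weight of one for that thread; its final weight of two reflects the two threads in which it occurs.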
One of the limitations of the current graph structure is that the graph does not
capture conversation similarities if some of the entities overlap between two different
vertices. For instance, another entity tree may result in having an entity set S(cr )
that contains a subset of the entities in a given vertex. This new entity set may have
a similar conversational flow but will not be captured in our current entity graph
because the model does not allow for any entity overlap.
To help alleviate this issue we borrow from the notion of a hypergraph and perform
a star expansion on our graph $G$ [160]. A hypergraph is defined as $H = (X, E)$
where $X$ is the set of vertices and $E$ is a set of non-empty subsets of $X$ called
hyperedges. The star expansion process turns a hypergraph into a simple, bipartite
graph. It works by generating a new vertex in the graph for each hyperedge present
in the hypergraph and then connecting each original vertex to the hyperedge-vertex
of every hyperedge that contains it. This
generates a new graph $G(V, E)$ from $H$ by introducing a new vertex and edge for
each hyperedge such that $V = X \cup E$.
While our model is a graph, we can treat each entity set $S(c)$ as a hyperedge
in order to perform this star expansion. This gives us new vertices that represent each
individual entity and allows us to capture transitions from one entity set to another
if they share a subset of entities. An example of the resulting graph after performing
a star expansion can be seen in Fig. 3.4. The star expansion provides valid transition paths
that would otherwise not exist. When the star expansion
is performed, the edge weights between the new individual entity vertices
and their respective entity sets are set to the number of times that entity set occurred
at a given depth $\ell$. Although the star expansion generates a much larger
graph due to the large number of vertices, it proves useful for prediction and for
aligning entity set vertices in a visual space.
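
A minimal sketch of this star expansion is given below. It treats each entity set vertex as a hyperedge and adds one vertex per individual entity at the same depth. The input format (a counter of entity set occurrences per depth), the function names, and the orientation of the membership edges are assumptions made for illustration.

```python
from collections import Counter

def star_expand(set_counts):
    """Star-expand entity set vertices into individual entity vertices.

    set_counts maps (entity_set, depth) -> number of times that entity set
    occurred at that depth. Returns membership edges keyed by
    ((entity, depth), (entity_set, depth)) and weighted by that count.
    """
    membership = Counter()
    for (entity_set, depth), count in set_counts.items():
        for entity in entity_set:
            membership[((entity, depth), (entity_set, depth))] = count
    return membership

def sets_sharing_entity(membership, entity, depth):
    """Entity sets at `depth` that are reachable through a single shared entity."""
    return {s for (e, d), (s, _) in membership if e == entity and d == depth}

GR = frozenset({"Germany", "Russia"})
R = frozenset({"Russia"})
set_counts = Counter({(GR, 0): 3, (R, 0): 5})
membership = star_expand(set_counts)
shared = sets_sharing_entity(membership, "Russia", 0)
```

In this toy example the entity vertex `("Russia", 0)` links the two otherwise unrelated entity set vertices, which is the kind of overlap the unexpanded graph cannot express.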
This graph-model therefore represents the entities, their frequent combinations,

Figure 3.4. Entity graph constructed from a star expansion of the entity
tree in Fig. 3.2(b) and the conversation paths in Fig. 3.3(c). This model
represents the entities, their frequent combinations, and the paths
frequently used in their invocation.

and the paths frequently used in their invocation over a set of threaded conversations.

3.3 Conversation Prediction

Having generated these entity graphs we turn our attention to the three research
questions. RQ1 asks whether these entity graphs can be used to predict where a conversation
may lead. Clearly this is a difficult task, but recent advances in deep learning
and language models have led to major improvements and interest in conversational
AI [107], which has further led to the development of a number of models that use
entities and knowledge graphs [149] from various sources including Reddit [154]. The
main motivation of these tools is to use the topological structure of the knowledge
graphs (entities and their relationships) to improve a conversational agent's ability to
more naturally select the next entity in the conversation. The typical methodology
in related machine learning papers seeks to predict the next entity in some conversation
[96]. In these cases, a dataset of paths through a knowledge graph is constructed
from actual human conversations as well as one or more AI models. Then a human
[Figure 3.5: percentage of valid entities (y-axis) against conversation depth $\ell$ (x-axis) for /r/news, /r/worldnews, /r/politics, /r/coronavirus, and /r/conservative]

Figure 3.5. Percent of the predictions made on the testing set that, on
average, exist in the training set for 5-folds. Higher is better.

annotator picks the entity that they feel is most natural [96, 75].
Our methodology differs from these because we are not focused on building a machine
learning model that predicts these entities precisely. Our goal is to demonstrate
broader patterns of people conversing over and through the topics. To
this end, we do not evaluate with a standard machine learning paradigm that aims to
optimize metrics such as accuracy, precision, and recall. To demonstrate that
our entity graph captures broad patterns that can be further explored, we perform
two tasks: (1) a generalization task and (2) a similarity prediction task. Each task
uses 5-fold cross validation in which we split the entity graph 80/20 into $H_{train}$
and $H_{test}$, respectively. We perform this cross validation in a disjoint manner over
the Reddit threads that we have extracted. This creates five different entity graphs,
one for each split, and validates the model's generalization to unseen Reddit threads.
Although this disjoint split ensures the threads are separate, we do not consider the
temporal aspect of these threads.
The first task, generalization, gets at the heart of our broader question on the
predictability of conversation paths. In this task we simply calculate the number of
entity sets at each level in $H_{test}$ that also appear at the same level in $H_{train}$ of our
entity graph. Formally, we measure generalization as

$$1 - \frac{\lVert S_\ell \in H_{test} \setminus S_\ell \in H_{train} \rVert}{\lVert S_\ell \in H_{test} \rVert}$$

for each $\ell$.
In simple terms, generalization tells us, given an unseen conversation comment, whether
the model can make a prediction from that comment by matching at least one
entity in our entity graph model. This task therefore validates how well the model
captures general conversation patterns by matching at the entity level.
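
The generalization measure can be sketched in a few lines. The sketch below assumes that each graph is summarized as a collection of `(item, depth)` pairs, where an item may be an entity set or, after star expansion, an individual entity; the function name and input format are hypothetical.

```python
def generalization(test_vertices, train_vertices):
    """Per-depth share of test items that also appear in train at the same depth."""
    train = set(train_vertices)
    by_depth = {}
    for item, depth in test_vertices:
        by_depth.setdefault(depth, set()).add(item)
    scores = {}
    for depth, test_items in by_depth.items():
        # Items seen at this depth in the test graph but not in the training graph.
        missing = {item for item in test_items if (item, depth) not in train}
        scores[depth] = 1 - len(missing) / len(test_items)
    return scores

R = frozenset({"Russia"})
E = frozenset({"Europe"})
GR = frozenset({"Germany", "Russia"})
train_vertices = [(R, 0), (GR, 0), (R, 1)]
test_vertices = [(R, 0), (E, 0), (R, 1)]
scores = generalization(test_vertices, train_vertices)
```

In this toy example one of the two test items at depth 0 is missing from the training graph, giving a score of 0.5 at that depth, while the single item at depth 1 is found, giving a score of 1.0.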
Results of this analysis are shown in Fig. 3.5 where color and shape combinations
indicate the subreddit and $\ell$ is represented along the x-axis. Error bars represent
the 95% confidence interval of the mean across the 5 folds. We find that the entity
graph captures much more of the information early in conversations. As the depth
increases to three and beyond, we note a sharp drop in the overlap between the test
and training sets. The widening confidence interval also indicates that the amount
of information varies based on the test set. From these results, we conclude that
analyzing the flow of an unseen conversation early-on is reasonable, but findings
from deeper in the conversation may be difficult because key entities may be missing
from the entity graph.
The second task, similarity prediction, measures the similarity between
a predicted entity set and the actual entity set. This methodology uses the entity
embeddings from the E2E entity linking model to represent the entities in a vector
space. For each root in $H_{test}$ we find its matching root in $H_{train}$; if a match does not
exist, we discard it and move on to the next root. Then we make the Markovian assumption and perform
probabilistic prediction for each path in the training set via $\Pr(S_{\ell+1}(c_y) \mid S_\ell(c_x))$,
i.e., the empirical probability of a conversation moving to $S_{\ell+1}(c_y)$ given that the conversation
is currently at $S_\ell(c_x)$ in $H_{train}$. The probability of each transition is based on
the edge weights that we captured during the graph construction step. As determined
in the previous experiment, entity sets are increasingly unlikely to match exactly as
[Figure 3.6: box plots of WMD (y-axis) against conversation depth $\ell$ (x-axis) for /r/news, /r/worldnews, and /r/politics]

Figure 3.6. Box plot of Word Mover's Distance (WMD) as a function of the
conversation depth $\ell$. Lower is better. Box plots represent the WMD error of
entity representations predicted by the narrative hypergraph over all
entities, over all depths, over five folds.

the depth increases; so rather than a 0/1 loss, we measure the Word Mover's Distance
(WMD) between the predicted entities and the actual entities [84].
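
The Markovian prediction step can be sketched by normalizing the outgoing edge weights of each vertex. The sketch below is an assumption-laden illustration: the edge weight dictionary format and the function names are hypothetical, and the WMD comparison against entity embeddings is deliberately left out.

```python
def transition_probabilities(weights):
    """Normalize outgoing edge weights into empirical Pr(next | current)."""
    totals = {}
    for (src, _), w in weights.items():
        totals[src] = totals.get(src, 0) + w
    return {(src, dst): w / totals[src] for (src, dst), w in weights.items()}

def most_likely_next(probs, src):
    """Return the most probable successor of `src`, or None if it has none."""
    candidates = {dst: p for (s, dst), p in probs.items() if s == src}
    return max(candidates, key=candidates.get) if candidates else None

A0 = (frozenset({"Germany", "Russia"}), 0)
B1 = (frozenset({"Russia"}), 1)
C1 = (frozenset({"Germany", "Europe"}), 1)
weights = {(A0, B1): 3, (A0, C1): 1}
probs = transition_probabilities(weights)
prediction = most_likely_next(probs, A0)
```

With the toy weights above, the conversation moves from `A0` to `B1` with probability 0.75 and to `C1` with probability 0.25, so `B1` is the predicted next entity set.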
Results for this comparison are shown in Fig. 3.6 for three of the larger subreddits.
We again find that as the depth of the conversation increases the distance between
our predicted tree and the ground truth entities rises. These results indicate that as
a conversation continues, the variety of topics discussed tends to increase. Therefore,
predictions are unlikely to align well with the true conversation. This is
most clearly seen in the /r/politics plot in Fig. 3.6, where we note a sharp increase
in the later parts of the conversation. If the variety of topics were consistent, then we
would expect the WMD to stay relatively flat across conversation depths.

3.4 Conversation Traversals

Next, we investigate RQ2 through a visualization of the entity graph. Recall that
the entity graph contains entity sets over the depths of the conversation. Specifically,
we seek to understand what conversations on Reddit look like. Do they splinter,

narrow, or behave in some other way? We call the set of visual paths conversation
traversals because they indicate how users traverse the entity graph.
We generate these visual conversation traversals using a slightly modified force-directed
layout [51]. Graph layout algorithms operate like graph embedding algorithms
(e.g., LINE, node2vec), but rather than embedding graphs into a high-dimensional space,
visual graph layout tools embed nodes and edges into a 2D space. In our setting
we place some restrictions on the algorithm in order to force topics to coalesce
into a visually meaningful and standardized space. Specifically, we fix the position
of each vertex in our graph on the x-axis according to $\ell$. As in Fig. 3.4, individual
entity vertices always occur to the left of entity set vertices, so that the visualization
illustrates how conversations flow from start to finish in a left-to-right fashion.
This restriction forces the embedding algorithm to adjust the position only on
the y-coordinate, and this is necessary to allow the individual entity to entity set
edges from the star-expansion to pull entity set vertices close together if and only if
they share many common entities. Loosely connected or disconnected entities will
therefore not be pulled together. As a result, the y-axis tends to cluster entities and
entity-sets together in a semantically meaningful way.
Embedding algorithms are typically parameterized with a learning rate
that determines how much the learned representation can change at each
iteration. Because we want entities to be consistent horizontally, we modify the
learning rate function to increasingly dampen embedding updates over 100 iterations
per depth. For example, given an entity graph of depth $L = 10$, we would expect 1,000
iterations in total. We initially allow all entities and entity sets to update according to
the default learning rate, but as the iterations approach 100 the learning rate of the
entities and entity sets at $\ell = 1$ slowly dampens, and they eventually lock into place
at iteration 100. When these entities and entity sets lock we also lock those same
entities and entity sets at all other depths. This ensures that each of these entities
Figure 3.7. Entity graph showing the visual conversation traversals from
/r/news. This illustration shows the paths of conversations over entity sets.
The x-axis represents the depth of the conversation; entity sets are
clustered into a semantically meaningful space along the y-axis. Inset
graph highlights five example entity sets and their connecting conversation
paths. Node colors represent equivalent entity sets. In this example we
highlight how entity sets are placed in meaningful semantic positions in
relation to one another.

and entity sets will be drawn as a horizontal line at the given y position.
Then, from iterations 100 to 200, the learning rate of the entities and entity sets at
$\ell = 2$ slowly dampens, and they eventually lock into place at iteration 200. Meanwhile,
the entities and entity sets at deeper levels continue to be refined. In this way,
the semantically meaningful y-coordinates tend to propagate from left to right as the
node embedding algorithm iterates.
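
One possible form of this per-depth damping schedule is sketched below. The text does not specify the shape of the damping, so the linear ramp, the function name, and the parameter names are all assumptions; the sketch also omits the additional rule that an entity locked at one depth is locked at every other depth.

```python
def learning_rate(depth, iteration, base_lr=1.0, iters_per_depth=100):
    """Learning rate for vertices at a 1-indexed depth at a given iteration.

    Assumed schedule: full rate before the depth's own 100-iteration window,
    linear damping inside the window, and zero (locked) from its end onward.
    """
    lock_at = depth * iters_per_depth
    window_start = lock_at - iters_per_depth
    if iteration >= lock_at:
        return 0.0  # vertex is locked in place
    if iteration < window_start:
        return base_lr  # still free to move at the default rate
    return base_lr * (lock_at - iteration) / iters_per_depth

rates_depth_1 = [learning_rate(1, i) for i in (0, 50, 100)]
rates_depth_2 = [learning_rate(2, i) for i in (50, 150, 200)]
```

Under these assumptions, vertices at depth 1 are locked at iteration 100 while vertices at depth 2 are still updating at the full rate, which lets the y-coordinates settle from left to right.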
One complication is that the sheer number of entities and conversation paths
over those entities is too large to be meaningful to an observer. We therefore do not draw the
entity nodes generated by the star expansion and instead rewire entity sets
based on the possible paths through the individual entity nodes. We also tune the
edge opacity based on the edge weights.
We draw the resulting graph with D3 to provide an interactive visualization [10].

The conversation traversals of the entity graph generated from /r/news are illustrated
in Fig. 3.7. This illustration is cropped to remove the four deepest vertical axes
(on the right) and to show only the middle half of the illustration. A
zoomed-in version highlights some interesting entity sets present in the /r/news conversation.
Recall that the entity sets are consistent horizontally, so both red
circles on the left and the right of the inset plot indicate the entity set with
Donald Trump; likewise the blue circles on the left and the right of the inset both
represent Barack Obama. Edges moving visually left to right indicate topical paths
found in online discourse. In the /r/news subreddit, which tracks only US news,
Donald Trump and Barack Obama are frequently visited, but so too are national entities
like United States (not highlighted), Iraq, and others. It is difficult to see from this
illustration, but the expanded interactive visualization shows a common coalescing
pattern where large sets of entities and unique combinations of ideas typically coalesce
into simpler singleton entities like Barack Obama or United States.

3.4.1 Spreading Activation

Next, we adapt the illustration of conversation traversals to begin to answer RQ3.
Specifically, we are interested in whether differences in starting points, at the roots
of the comment tree, have any impact on the eventual shape of the conversation. For
example, given a conversation starting with Donald Trump, how will the conversation
take shape for liberals, and how might that conversation differ among conservatives?
This kind of analysis opens many possibilities for studying how
different groups of people think and articulate ideas about a given topic.
To help answer this question, we employ tools from the study of spreading activa-
tion [35]. Spreading activation is a concept from cognitive psychology that has been
used to model how ideas spread and propagate in the brain from an initial source.
A popular use for spreading activation has been on semantic networks to find the

Figure 3.8. Entity graph example of spreading activation on /r/news when
Barack Obama is selected as the starting entity. The x-axis represents the
(threaded) depth at which each entity was mentioned within conversations
rooted at Barack Obama. The y-axis represents the semantic space of each
entity, i.e., similar entities are closer than dissimilar entities on the y-axis.
Node colors represent equivalent entity sets. In this example, we observe
that conversations starting from Barack Obama tend to center around the
United States, political figures such as Donald Trump, and discussion around
whether his religion is Islam.

relatedness between different concepts.
Formally, spreading activation works by specifying two parameters: (1) a firing
threshold $F \in [0, 1]$ and (2) a decay factor $D \in [0, 1]$. The vertex/entity set
selected by a user is given an initial activation $A_i$ of 1. This is then propagated
to each connected vertex $j$ as $A_i \times w_j \times D$, where $w_j$ is the weight of the edge
connecting to that vertex. Each vertex then acquires its own activation value $A_j$
based on the total amount of signal received from all incoming edges. If a vertex has
acquired enough activation to exceed the firing threshold $F$, it too will fire, further
propagating forward through the graph. In the common setting, vertices are only
allowed to fire once and the spreading ends once there are no more vertices to
activate.
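
The procedure just described can be sketched as a breadth-first propagation. This is a minimal illustration, not the tool's actual code: the adjacency format, the function name, the default parameter values, and the assumption that edge weights are normalized to $[0, 1]$ are all hypothetical.

```python
from collections import deque

def spread_activation(edges, start, firing_threshold=0.1, decay=0.8):
    """Spreading activation in which every vertex fires at most once.

    edges maps a source vertex to a list of (target, weight) pairs, with
    weights assumed to lie in [0, 1]. Returns the activation of each vertex
    that received any signal.
    """
    activation = {start: 1.0}  # the selected vertex starts fully activated
    fired = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node in fired:
            continue
        fired.add(node)
        for target, weight in edges.get(node, []):
            # Accumulate the signal A_i * w_j * D arriving at the target.
            signal = activation[node] * weight * decay
            activation[target] = activation.get(target, 0.0) + signal
            if target not in fired and activation[target] > firing_threshold:
                queue.append(target)
    return activation

edges = {
    "Barack_Obama": [("United_States", 0.6), ("Islam", 0.1)],
    "United_States": [("Donald_Trump", 0.5)],
    "Islam": [("Religion", 1.0)],
}
activation = spread_activation(edges, "Barack_Obama")
```

In this toy graph `Islam` receives a signal below the firing threshold, so it is recorded but never fires, and `Religion` is therefore never reached.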
In our work we use spreading activation as a method for a user to select a starting
topic/entity set within the illustration of conversation traversals. The spreading activation
function then propagates the activation of entities along the conversation
paths to highlight those that are most likely to activate from a given starting point.
Because we permit the entity graph to be constructed (and labeled) from multiple
subreddits, we can also use the spreading activation function to compare and contrast
how users from different subreddits activate in response to a topic.
After spreading activation has been calculated, our interactive visualization tool
removes all vertices and links that are not part of the activated portion of the graph.
All of the vertices involved in spreading activation have their size scaled based
on how much activation they received. An example of this is cropped and illustrated
in Fig. 3.8, which shows how spreading activation occurs when the entity set
Barack Obama is activated within /r/news. Here we see that conversations starting
with (only) Barack Obama tend to move towards discussions about the United States.
We also note that the Islam entity is semantically far away from Barack Obama and
Donald Trump, as indicated by its placement on the y-axis. The results from using spreading
activation allow for a much more granular investigation of conversational flow.
These granular levels of conversational flow demonstrate that an individual can search
for patterns related to influence campaigns, echo chambers, and other social media
maladies across a number of topics.

3.5 Comparative Analysis

The visual conversation traversals appear to be helpful for investigating trends
within a group, but our final goal is to use them to compare and contrast how
different groups move through the conversation space. Our first attempt at this
was to overlay separate plots and compare the trends. This
proved challenging because it fails to capture the magnitude of any
differences between the groups for various entity set transitions. Our second attempt
instead modified the entity graph creation process to take in data from two different
subreddits. By using both communities we can capture how often an entity transition
occurs in each subreddit and use color gradients to indicate the relative strength of
each transition probability based on the edge weight we find in each subreddit. This
visually shows whether correlations occur between subreddits. In the present work, we
examined three different scenarios among the subreddits in our dataset.
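
One way to turn two sets of transition probabilities into a value that could drive such a color gradient is sketched below. The exact mapping used by the visualization is not specified in the text, so the share formula, the function name, and the input format are assumptions made purely for illustration.

```python
def relative_strength(probs_a, probs_b):
    """Share of community A for each transition: 1.0 = only A, 0.0 = only B."""
    shares = {}
    for edge in set(probs_a) | set(probs_b):
        pa = probs_a.get(edge, 0.0)
        pb = probs_b.get(edge, 0.0)
        if pa + pb > 0:  # skip transitions that carry no probability at all
            shares[edge] = pa / (pa + pb)
    return shares

# Toy transition probabilities for two hypothetical communities.
probs_red = {("Joe_Biden", "Republican_Party"): 0.25,
             ("Donald_Trump", "United_States"): 0.5}
probs_blue = {("Joe_Biden", "Republican_Party"): 0.75}
shares = relative_strength(probs_red, probs_blue)
```

A share near 1.0 would be drawn in the first community's color and a share near 0.0 in the second's, with intermediate values blended between the two.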

Scenario 1: Liberals and Conservatives. Determining how motivated groups communicate
about and respond to various topics is of enormous importance in modern
communication studies. For example, communication specialists and political scientists
are interested in understanding how users respond to coordinated influence
campaigns that flood social media channels with the same message [103]. Repetition
is key for the idea to stick, and we would expect then that these forms of messaging
would begin to appear in the entity graphs and possibly be visually indicated in the
conversation traversals.

Although a full analysis of this difficult topic is not within the purview of the current
work, we do perform a comparative analysis of /r/Conservative and /r/politics
as proxies for comparing conservative and liberal groups, respectively. We pay particular
attention to determining the topics and entities that each group tends
to move towards later (deeper) in the conversation. Such a comparative analysis may be
key to understanding how coordinated influence campaigns orient the conversation
of certain groups or derail them.
The comparative illustration using spreading activation was shown at the beginning
of the chapter in Fig. 3.1 and is not re-illustrated in this section. The illustration yields
some interesting findings. While one might expect /r/Conservative to discuss members
or individuals related to the Republican Party, we instead find that conversations
tend to migrate toward mentions of liberal politicians (e.g., Joe Biden), indicated by
red lines in Fig. 3.1. The reverse holds true as well: mentions of Joe Biden lead towards
mentions of the Republican Party by the liberal group, as indicated by the blue
line connecting the two. A brief inspection of the underlying comments reveals that
users in each subreddit tend to talk in a negative manner about the other party's
politicians. This is a clear example of affective polarization [72] being captured by
our visualization tool. Affective polarization is where individuals organize around
principles of dislike and distrust towards the out-group (the other political party)
even more so than trust in their in-group.
Another finding we observe is the more pronounced usage of the United States
by conservatives than by liberals. This observation could be explained by the finding
that conservatives show a much larger degree of overt patriotism than liberal individuals
[69], which has more recently led to a renewed interest in populism and
nationalism [39].

Scenario 2: US news and Worldnews. In our second scenario, we compare the conversations
from /r/news (red) and /r/worldnews (blue), which are geared towards
US-only news and non-US news, respectively.
The comparison between these subreddits reveals unsurprising findings. A much
larger portion of the entity sets comes from /r/worldnews, as its users discuss a much
broader range of topics. Many of the entity transitions that are dominated by
/r/worldnews come from discussions of other countries, events, and people outside
of the United States. The aspects that come primarily from /r/news
are topics surrounding the United States, China, and major political figures from the
United States. An example of this can be seen in Fig. 3.9, which illustrates spreading
activation starting from White House. Here, the dominating red lines, which reflect
transitions from within conversations on /r/news, converge to United States, even after
topics like Russia or Islam are discussed. An interesting side note is that many of
the unlabeled entities entering the conversation via blue lines (/r/worldnews) at $\ell = 5$
and $\ell = 6$ represent other countries such as Canada, Japan, Mexico, and Germany. The
findings from this comparative analysis are not surprising,
but they do show that the entity graph captures the
patterns one would expect to find when comparing these two subreddits.

Scenario 3: COVID and Vaccines. Our final analysis focuses on a single
subreddit, /r/Coronavirus, during two different time periods. A large
body of work has analyzed COVID-related discussion online, examining partisanship
[109], user reaction to misinformation [76], and differences in geographic concerns
[62]. The first segment (highlighted in red) comes from the period of January through
June 2020, which was during the emergence of the novel coronavirus. Although
the /r/Coronavirus subreddit had existed for many years prior, it became extremely
active during this time. The second segment is from the following year, January through

Figure 3.9. Illustration of an entity graph created from threaded
conversations from /r/news (red edges) and /r/worldnews (blue edges). The
x-axis represents the (threaded) depth at which each entity set was
mentioned within conversations rooted at White House. The y-axis
represents the semantic space of each entity set, i.e., similar entity sets are closer
than dissimilar entity sets on the y-axis. Node colors represent equivalent
entity sets. Conversations in /r/news tend to coalesce to United States,
while conversations in /r/worldnews tend to scatter into various other
countries (unlabeled black nodes connected by thin blue lines).

June 2021. This time period corresponded to the development, approval and early
adoption of vaccines.
Our analysis of this visualization yielded some interesting findings related to the
coronavirus pandemic, which we illustrate in Fig. 3.10. If we begin spreading activation
from United States, we find that most of the discussion leads to
China and Italy in 2020, which appears reasonable because of China's and Italy's early
struggles with virus outbreaks. In comparison, the 2021 data appeared more likely
to mention Sweden, India, and Germany, which had severe outbreaks during those
months. Our findings from spreading activation allow us to capture the shift
in countries of interest from 2020 to 2021 as the pandemic progressed.

3.6 Conclusion

In the current work we presented a new perspective by which to view and think
about online discourse. Rather than taking the traditional social networks view where
information flows over the human participants, our view is to consider human con-
versations as stepping over a graph of concepts and entities. We call these discourse
maps entity graphs and we show that they present a fundamentally different view of
online human communication.
Taking this perspective we set out to answer three research questions about (1)
discourse prediction, (2) illustration, and (3) behavior comparisons between groups.
We found that discourse remains difficult to predict, and that prediction gets harder
the deeper into the conversation we attempt it. We demonstrated that the
visual conversation traversals provide a view of group discourse, and we found that
online discourse tends to coalesce into narrow, simple topics as the conversation
deepens, although those topics can be wildly different from the starting topic. Finally,
we showed that the spreading activation function is able to focus the visualization to
provide a comparative analysis of competing group dynamics.

62
Figure 3.10. Comparison between the first 6 months of /r/Coronavirus
from 2020 to 2021. Illustration of an entity graph created from threaded
conversations from /r/Coronavirus in Jan–June of 2020 (red-edges) and
from Jan–June of 2021 (blue-edges). The x-axis represents the (threaded)
depth at which each entity set was mentioned within conversations rooted
at United States. The y-axis represents the semantic space of each entity
set, i.e., similar entity sets are closer than dissimilar entity sets on the
y-axis. Node colors represent equivalent entity sets. Conversations tended
to focus on China and Italy early in the pandemic, but turn towards a
broader topic space later in the pandemic.

3.6.1 Limitations

While the work in its current state is helpful for better understanding conver-
sations, it is not without its limitations. Foremost, in the present work we only
considered conversations on Reddit. Another limitation is that the entity linking
method we chose is geared towards high precision at the cost of recall. This
means that we can be confident that the entities extracted in the conversations are
mostly correct, but we have missed some portion of entities. The recall limitation
does inhibit the total number of entities we were able to collect; a better system
would provide for better insights in our downstream analysis. This issue can also be
highlighted with the long tail distribution of entities and the challenges this poses to
current methods [71]. An entity linking model that focuses on recall may still result
in useful graphs as prior works have found that many of the entities are considered
“close enough” even when they are not a perfect match to ground truth data [43].
Using a different entity linking model could lead to different patterns extracted from
our method. A model that optimizes for higher recall could create a much
larger entity graph, though it would likely contain a fair amount of noise due to the
precision-recall trade-off.
Another limitation inherent to the present work is the consideration of conver-
sations as threaded trees. This is an imperfect representation of natural, in-person
conversation, and still different from unthreaded conversations like those found on
Twitter and Facebook, which may require a vastly different entity graph construction
method. Finally, the interactive visualization tool is limited in its ability to process
enormous amounts of conversation data because of its reliance on JavaScript libraries
and interactive browser rendering.

3.6.2 Future Work

These limitations leave open avenues for further exploration in future work. Our
immediate goals are to use the entity graphs to better understand how narratives are
crafted and shaped across communities. Improvements in the entity linking process
and addition of concept vertices, pronoun anaphora resolution, threaded information
extraction and other advances in SocialNLP will serve to improve the technology
substantially. We also plan to ingest other threaded conversational domains such
as Hackernews, 4chan, and even anonymized email data. Extensions of this work
could also include capturing more information between entity transitions such as the
sentiment overlaid on a given entity or group of entities. This extra information
could allow us to create entity graphs that not only show the transition but also how
various groups speak and feel about those specific entities.

CHAPTER 4

TK-KNN: A BALANCED DISTANCE BASED SEMI-SUPERVISED LEARNING
METHOD FOR INTENT CLASSIFICATION.

Large language models like BERT [41] have significantly pushed the boundaries
of Natural Language Understanding (NLU) and created interesting applications such
as automatic ticket resolution [91]. A key component of such systems is a virtual
agent’s ability to understand a user’s intent to respond appropriately. Successful
implementation and deployment of models for these systems require a large amount
of labeled data to be effective. Although deployment of these systems often generates
a large amount of data that could be used for fine-tuning, the cost of labeling this
data is too high. Semi-supervised learning methodologies are an obvious solution
because they can significantly reduce the amount of human effort required to train
these kinds of models [85, 158], especially in image classification tasks [150, 101, 146].
However, as we shall see, applying these models to NLU and intent
classification is difficult because of the label distribution.
Indeed, the research most closely related to the present work is the Slot-List model by
Basu et al. [6], which focuses on the meta-learning aspect of semi-supervised learning
rather than using unlabeled data. In a similar vein, the GAN-BERT [36] model shows
that an adversarial learning regime can be devised to ensure that the extracted
BERT features are similar amongst the unlabeled and the labeled data sets and
to substantially boost classification performance. Other methods have investigated how
data augmentation can be applied to the NLP domain to enforce consistency in the
models [28], and several other methods have been proposed from the computer vision

[Figure 4.1 diagram. Input: unlabeled set of examples. Top: existing threshold selection strategy, imbalanced decision boundary. Bottom: our TK-KNN selection strategy, balanced decision boundary.]

Figure 4.1. Example of pseudo label selection when using a threshold (top)
versus the top-k sampling strategy (bottom). In this toy scenario, we chose
k = 2, where each class is represented by a unique shape. As the threshold
selection strategy pseudo-labels data elements (shown as yellow) that
exceed the confidence level, the model tends to become biased towards
classes that are easier to predict. This bias causes a cascade of mis-labels
that leads to even more bias towards the majority class.

community. However, a recent empirical study found that many of these methods
do not provide the same benefit to NLU tasks as they provide to computer vision
tasks [29] and can even hinder performance in certain instances.
Intent classification remains a challenging problem for multiple reasons. Gen-
erally, the number of intents a system must consider is relatively large, with sixty
classes or more. On top of that, most queries consist of only a short sentence or two.
This forces models to need many examples in order to learn nuance between different
intents within the same domain. In the semi-supervised setting, many methods set a
confidence threshold for the model and assign pseudo-labels to the unlabeled data if
their confidence is above the threshold [129]. This strategy permits high-confidence
pseudo-labeled data elements to be included in the training set, which typically re-
sults in performance gains. Unfortunately, this approach also causes the model to
become overconfident for classes that are easier to predict. The issue is more pro-
nounced for intent classification because of feedback loops that can quickly cause the
model to become biased towards a small number of classes.
In the present work, we describe the Top-K K-Nearest Neighbor (TK-KNN)
method for training semi-supervised models. The main idea of this method is illus-
trated in Figure 4.1. TK-KNN makes two improvements over other pseudo-labeling
approaches. First, to address the model overconfidence problem, we use a top-k sam-
pling strategy when assigning pseudo-labels. Second, we enforce a balanced set of
classes by taking the top-k predictions per class, not simply the top-k overall predic-
tions. Furthermore, when selecting the top-k examples the sampling strategy does not
simply rely on the model’s predictions, which tend to be noisy. Instead we leverage
the embedding space of the labeled and unlabeled examples to find those with simi-
lar embeddings and combine them with the models’ predictions. Experiments using
standard performance metrics of intent classification are performed on three datasets:
CLINC150 [86], Banking77 [20], and Hwu64 [89]. We find that the TK-KNN method

outperforms existing methods in most scenarios and performs exceptionally well in
the low-data scenarios.

4.1 Related Work

Intent Classification The task of intent classification has attracted much atten-
tion in recent years due to the increasing use of virtual customer service agents.
Recent research into intent classification systems has mainly focused on learning
out-of-distribution data [151, 155, 31, 157]. These techniques configure their experiments
to learn from a reduced number of the classes and treat the remaining classes as
out-of-distribution during testing. Although this research is indeed important in its
own regard, it deviates from the present work’s focus on semi-supervised learning.

Pseudo Labeling Pseudo labeling is a mainstay in semi-supervised learning [88,
112, 21]. In simple terms, pseudo labelling uses the model itself to acquire hard
labels for each of the unlabeled data elements. This is achieved by taking the argmax
of the model’s output and treating the resulting label as the example’s label. In
this learning regime, the hard labels are assigned to the unlabeled examples without
considering the confidence of the model’s predictions. These pseudo-labeled examples
are then combined with the labeled data to train the model iteratively. The model is
then expected to iteratively improve until convergence. The main drawback of this
method is that mislabeled data elements early in training can severely degrade the
performance of the system.
A common practice to help alleviate mislabeled samples is to use a threshold τ to
ensure that only high-quality (i.e., confident) labels are retained [129]. The addition
of confidence restrictions into the training process [129] has shown improvements but
also restricts the data used at inference time and introduces the confidence threshold
value as yet another hyperparameter that needs to be tuned.
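
The two selection rules described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this work: the function name pseudo_label, the tuple layout, and the toy probabilities are all hypothetical, and the class probabilities are assumed to have been produced by the model beforehand.

```python
def pseudo_label(probs, tau=None):
    """Assign hard pseudo-labels from per-example class probabilities.

    probs: list of probability lists, one inner list per unlabeled example.
    tau:   optional confidence threshold; None keeps every example.
    Returns a list of (example_index, label, confidence) tuples.
    """
    selected = []
    for idx, p in enumerate(probs):
        # Hard label = argmax over the class probabilities.
        label = max(range(len(p)), key=p.__getitem__)
        confidence = p[label]
        # Without a threshold every example is kept; with one, only
        # predictions strictly above tau are retained.
        if tau is None or confidence > tau:
            selected.append((idx, label, confidence))
    return selected


# Hypothetical predictions for three unlabeled examples over three classes.
toy_probs = [
    [0.97, 0.02, 0.01],
    [0.40, 0.35, 0.25],
    [0.10, 0.15, 0.75],
]
plain = pseudo_label(toy_probs)                   # plain pseudo labeling
thresholded = pseudo_label(toy_probs, tau=0.95)   # thresholded variant
```

In this toy case the thresholded variant discards two of the three examples, which illustrates how a high τ restricts the pseudo-labeled set to the classes the model already predicts confidently.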

Another major drawback of this selection method is that the model can become
very biased towards the easy classes in the early iterations of learning [2]. Recent
methods, such as FlexMatch [152], have discussed this problem at length and at-
tempted to address this issue with a curriculum learning paradigm that allows each
class to have its own threshold. These thresholds tend to be higher for majority
classes and lower for less-common classes. However, this only serves to exacerbate the
problem because the less-common classes will have less-confident labels. A previous
work by Zou et al. [162] proposes a similar class balancing parameter to be learned
per class, but is applied to the task of unsupervised domain adaptation. The closest
previous work to ours is co-training [98] that iteratively adds a single example from
each class throughout the self-training.
The TK-KNN strategy described in the present work addresses these issues by
learning the decision boundaries for all classes in a balanced way while still giving
preference to accurate labels by considering the proximity between the labeled and
the unlabeled examples in the embedding space.

Consistency Regularization Consistency regularization has shown remarkable
success in a variety of SSL tasks. The technique relies on the assumption that if a
realistic perturbation is applied to an unlabeled data point, the model’s prediction
should not change significantly, as first seen in [4]. In computer vision, common
techniques are accepted and used, such as flipping an image to act as the perturbed
version. For NLP tasks, a variety of recent techniques have been proposed in the literature
[49], with an empirical study finding mixed results in the semi-supervised setting
[29]. While some success has been found in NLP, there are no common methods that
have shown success and broad use. The data augmentations used in computer vision
are generally accepted as label-preserving and help make the model invariant to such
transformations. In many cases the various NLP transformations may not preserve

the label, or not perturb the data enough to help regularize the model. While our
work does not focus on leveraging consistency to improve model performance, we

Distance-based Pseudo labeling Another direction explored in recent work is to
consider the smoothness and clustering assumptions found in semi-supervised learning
[100] for pseudo labeling. The smoothness assumption states that if two points
lie in a high-density region their outputs should be the same. The clustering assumption
similarly states that if points are in the same cluster, they are likely from
the same class. Recent work by Zhu et al. [159] proposes a training-free approach
to detect corrupted labels. They use a k-style approach to detect corrupted labels
that share similar features. The results of this work show that the smoothness and
clustering assumptions are also applicable in a latent embedding space and therefore
data elements that are close in the latent space are likely to share the same clean
label.
Two other recent works have made use of these assumptions in semi-supervised
learning to improve their pseudo-labeling process. First, Taherkhani et al. [133]
use the Wasserstein distance to match clusters of unlabeled examples to labeled
clusters for pseudo-labeling. Second, the aptly-named feature affinity based pseudo-
labeling [42] method uses the cosine similarity between unlabeled examples and clus-
ter centers that have been discovered for each class. The selected pseudo label is
determined based on the highest similarity score calculated for the unlabeled exam-
ple.
Results from both of these works demonstrate that distance-based pseudo-labeling
strategies yield significant improvements over previous methods. However, both of
these methods depend on clusters formed from the labeled data. In the intent classifi-
cation task considered in the current study, the datasets sometimes have an extremely
limited number of labeled examples per class, with instances where there is only one

71
labeled example per class. This scarcity of labeled data makes forming reliable clusters
quite challenging. Therefore, the TK-KNN model described in the present work
adapts the K-Nearest Neighbors search strategy to help guide the pseudo-labeling
process.

4.2 Top-K KNN Semi-Supervised Learning

Algorithm 1 TK-KNN Sampling for a Cycle

Require: Labeled data $X = \{(x_n, y_n) : n \in (1, \ldots, N)\}$, unlabeled data $U = \{u_m : m \in (1, \ldots, M)\}$, $\beta$
1: Predict pseudo-labels for all U
2: Calculate cosine similarity via Eq. (4.2) for all U per class
3: Calculate score via Eq. (4.3)
4: Combine X and top-k per class from U

4.2.1 Problem Definition

We formulate the problem of semi-supervised intent classification as follows.
We are given a set of labeled intents $X = \{(x_n, y_n) : n \in (1, \ldots, N)\}$, where $x_n$ represents
the intent example and $y_n$ the corresponding intent class $c \in C$, and a set of unlabeled
intents $U = \{x_m : m \in (1, \ldots, M)\}$, where each instance $x_m$ is an intent example lacking
a label. Intents are fed to the model as input and the model outputs a predicted
intent class, denoted as $p_{model}(c \mid x, \theta)$, where $\theta$ represents some pre-trained model
parameters. Our goal is to learn the optimal parameters for $\theta$.

4.2.2 Method Overview

As described above, we first employ pseudo labeling to iteratively train (and re-train)
a model based on its most confident past predictions. In the first training cycle,

[Figure 4.2 diagram. Steps: 1. Train model (BERT) on labeled examples; 2. Predict unlabeled examples; 3. Compute cosine similarities to the nearest labeled neighbors; 4. Rank the examples; 5. Select top-k examples; 6. Add to the labeled set. The running example uses the intents Direct Deposit and Carry On.]

Figure 4.2: TK-KNN overview. The model is (1) trained on the small portion of
labeled data. Then, this model is used to predict (2) pseudo labels on the unlabeled
data. Then the cosine similarity (3) is calculated for each unlabeled data point with
respect to the labeled data points in each class. Yellow shapes represent unlabeled
data and green represent labeled data. Similarities are computed and unlabeled
examples are ranked (4) based on a combination of their predicted probabilities and
cosine similarities. Then, the top-k (k = 2) examples are selected (5) for each class.
These examples are finally added (6) to the labeled dataset to continue the iterative
learning process.

the model is trained on only the small portion of labeled data X. In the subsequent
cycles, the model is trained on the union of X and a subset of the unlabeled data
U that has been pseudo-labeled by the model in the previous cycle. Figure 4.2
illustrates an example of this training regime with the TK-KNN method.
We use the BERT-base [41] model with a classification head added on top.
The classification head consists of a dropout layer followed by a linear layer with
dropout and ends with an output layer that represents the dataset’s class set C.
However, other BERT-like models should work in this framework. We select the
BERT-base model for fair comparison with other methods.

4.2.3 Top-K Sampling

When applying pseudo-labeling, it is often observed that some classes are easier
to predict than others. In practice, this causes the model to become biased towards
the easier classes [2] and perform poorly on the more difficult ones. The Top-K
sampling process within the TK-KNN system seeks to alleviate this issue by growing
the pseudo-label set across all labels together.
When we perform pseudo labeling, we select the top-k predictions per class from
the unlabeled data. This selection neither uses nor requires any threshold; instead,
it limits each class to choose the predictions with the highest confidence. We rank
each predicted data element with a score based on the model’s predicted probability.

$$\mathrm{score}(u_m) = p_{model}(y \mid x_m; \theta) \tag{4.1}$$

After each training cycle, the number of pseudo labels in the dataset will have
increased by k times the number of classes. This process continues until all examples
are labeled or some number of pre-defined cycles has been reached. We employ
standard early stopping criteria [104] during each training cycle to determine whether

or not to stop training.
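
As a rough illustration of the balanced selection step, the sketch below ranks pseudo-labeled examples within each predicted class by the score of Eq. 4.1 and keeps the k best per class. The function name top_k_per_class, the tuple layout, and the toy predictions are hypothetical and are not taken from the code used in this work.

```python
from collections import defaultdict


def top_k_per_class(predictions, k):
    """Select the k highest-scoring examples for every predicted class.

    predictions: list of (example_id, predicted_class, score) tuples, where
                 score is the model's predicted probability (Eq. 4.1).
    Returns a dict mapping each class to its selected example ids.
    """
    by_class = defaultdict(list)
    for example_id, cls, score in predictions:
        by_class[cls].append((score, example_id))

    selected = {}
    for cls, items in by_class.items():
        # Rank within the class only, so no class can crowd out another.
        items.sort(key=lambda pair: pair[0], reverse=True)
        selected[cls] = [example_id for _, example_id in items[:k]]
    return selected


# Hypothetical predictions for six unlabeled utterances.
toy_predictions = [
    ("a", "carry_on", 0.99),
    ("b", "carry_on", 0.42),
    ("c", "carry_on", 0.89),
    ("d", "direct_deposit", 0.78),
    ("e", "direct_deposit", 0.92),
    ("f", "direct_deposit", 0.30),
]
chosen = top_k_per_class(toy_predictions, k=2)
```

Because the ranking happens inside each class, every class contributes at most k examples per cycle, with no confidence threshold involved.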

4.2.4 KNN-Alignment

Although our top-k selection strategy helps alleviate the model’s bias, it still relies
entirely on the model predictions. To enhance our top-k selection strategy, we utilize
a KNN search to modify the scoring function that is used to rank which pseudo-
labeled examples should be included in the next training iteration. The intuition for
the use of the KNN search comes from the findings in [159] where “closer” instances
are more likely to share the same label based on the neighborhood information when
some labels are corrupted, which often occurs in semi-supervised learning from the
pseudo-labeling strategy.
Specifically, we extract a latent representation from each example in our training
dataset, both the labeled and unlabeled examples. We formulate this latent repre-
sentation in the same way as Sentence-BERT [110] to construct a robust sentence
representation. This representation is defined as the mean-pooled representation of
the final BERT layer that we formally define as:

$$z = \mathrm{mean}([\mathrm{CLS}], T_1, T_2, \ldots, T_M) \tag{4.2}$$

Where CLS is the class token, T is each token in the sequence, M is the sequence
length, and z is the extracted latent representation. When we perform our pseudo
labeling process we extract the latent representation for all of our labeled data X as
well as our unlabeled data U .
For each unlabeled example, we calculate the cosine similarity between its latent
representation and the latent representations of the labeled counterparts belonging
to the predicted class.
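
The two operations just described, mean pooling (Eq. 4.2) and cosine similarity, can be sketched with plain Python lists. This is only an illustration under simplifying assumptions: token vectors are given as lists of floats rather than BERT outputs, the vectors are assumed to be non-zero, and the function names are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average the token vectors of one sequence into a single vector (Eq. 4.2)."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy two-dimensional "token" vectors for a labeled and an unlabeled example.
labeled_z = mean_pool([[1.0, 2.0], [3.0, 4.0]])
unlabeled_z = mean_pool([[2.0, 3.0], [2.0, 3.0]])
similarity = cosine_similarity(labeled_z, unlabeled_z)
```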
The highest cosine similarity score between the unlabeled example and its labeled

neighbors is used to calculate the score of an unlabeled example. An additional
hyperparameter, β, permits the weighing of the model’s prediction and the cosine
similarity for the final scoring function.

$$\mathrm{score}(u_m) = (1 - \beta) \times p_{model}(y \mid x_m; \theta) + \beta \times \mathrm{sim}(z_n, z_m) \tag{4.3}$$

With these scores we then follow the previously discussed top-k selection strategy
to ensure balanced classes. The addition of the K-nearest neighbor search helps us
to select more accurate labels early in the learning process. We provide pseudo code
for our pseudo-labeling strategy in Algorithm 1.
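
One cycle of the sampling procedure can also be sketched in Python. The sketch assumes that embeddings and predicted probabilities have already been computed, and it is not the implementation used in this work: the names tk_knn_select and cosine, the tuple layouts, and the toy vectors are hypothetical. Scoring an example with similarity 0.0 when its predicted class has no labeled examples is an assumption made only to keep the sketch total.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def tk_knn_select(labeled, unlabeled, k, beta):
    """Pick the top-k pseudo-labeled examples per class using Eq. 4.3.

    labeled:   list of (embedding, class) pairs.
    unlabeled: list of (example_id, embedding, predicted_class, probability).
    Returns a dict mapping each predicted class to its selected example ids.
    """
    labeled_by_class = defaultdict(list)
    for embedding, cls in labeled:
        labeled_by_class[cls].append(embedding)

    scored = defaultdict(list)
    for example_id, embedding, cls, prob in unlabeled:
        neighbors = labeled_by_class.get(cls, [])
        # Highest similarity to a labeled example of the predicted class
        # (assumption: 0.0 if that class has no labeled examples).
        nearest = max((cosine(embedding, n) for n in neighbors), default=0.0)
        score = (1 - beta) * prob + beta * nearest
        scored[cls].append((score, example_id))

    return {
        cls: [eid for _, eid in sorted(items, key=lambda pair: pair[0], reverse=True)[:k]]
        for cls, items in scored.items()
    }


# Toy data: one labeled example per class, three unlabeled examples.
toy_labeled = [([1.0, 0.0], "A"), ([0.0, 1.0], "B")]
toy_unlabeled = [
    ("u1", [0.9, 0.1], "A", 0.6),   # close to the labeled "A" example
    ("u2", [0.1, 0.9], "A", 0.9),   # confident, but far from labeled "A"
    ("u3", [0.0, 1.0], "B", 0.5),
]
with_knn = tk_knn_select(toy_labeled, toy_unlabeled, k=1, beta=0.75)
model_only = tk_knn_select(toy_labeled, toy_unlabeled, k=1, beta=0.0)
```

In this toy example, β = 0.75 prefers u1 for class A because it lies near the labeled example, whereas β = 0 falls back to the model's confidence alone and selects u2.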

4.2.5 Loss Function

As we use the cosine similarity to help our ranking method we want to ensure
that similar examples are grouped together in the latent space. While the cross entropy
loss is an ideal choice for classification, as it incentivizes the model to produce accu-
rate predictions, it does not guarantee that discriminative features will be learned,
which our pseudo labeling relies on. To address this issue, we supplemented the
cross-entropy loss with a supervised contrastive loss [80] and a differential entropy
regularization loss [116], and trained the model using all three losses jointly.

$$\mathcal{L}_{CE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i) \tag{4.4}$$

We select the supervised contrastive loss [80] because it ensures that our model
learns discriminative features by maximizing the distance between examples of
different classes and minimizing the distance between examples of the same class.
This ensures that our model will learn good representations
in the latent space that separate examples belonging to different classes. The supervised
contrastive loss relies on augmentations of the original examples. To get
these augmentations we simply apply dropout to the representations that we extract
from the model. As is standard for the supervised contrastive loss, we add a separate
projection layer to our model to align the representations. The representation fed
to the projection layer is the mean-pooled BERT representation as shown in Eq. 4.2.
This ensures that our model will learn good sentence representations that will be
used to select similar examples.

$$\mathcal{L}_{SCL} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathrm{sim}(z_i, z_p)/\tau)}{\sum_{a \in A} \exp(\mathrm{sim}(z_i, z_a)/\tau)} \tag{4.5}$$
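
For intuition, the supervised contrastive objective can be sketched in pure Python. The sketch assumes the standard exponentiated form of the loss from [80], uses cosine similarity for sim, and takes the comparison set for anchor i to be every other example in the batch. The function names, the temperature value of 0.1, and the toy embeddings are hypothetical; dropout-based augmentation and the projection layer are omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def supervised_contrastive_loss(embeddings, labels, tau=0.1):
    """Supervised contrastive loss summed over all anchors in a batch."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        # Positives: other examples sharing the anchor's label.
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Denominator: every other example in the batch.
        denom = sum(
            math.exp(cosine(embeddings[i], embeddings[a]) / tau)
            for a in range(n) if a != i
        )
        inner = sum(
            math.log(math.exp(cosine(embeddings[i], embeddings[p]) / tau) / denom)
            for p in positives
        )
        total += -inner / len(positives)
    return total


# Two tight groups of toy embeddings.
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
misaligned = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

The loss is small when the labels agree with the geometry of the embeddings and large when same-class examples sit far apart, which is the behavior the ranking step relies on.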

When adopting the contrastive loss, previous works [46] have discussed how the
model can collapse in dimensions as a result of the loss. We follow this work in
adopting a differential entropy regularizer in order to spread the representations out
more uniformly. The method we use is based on the Kozachenko and Leonenko [83]
differential entropy estimator:

$$\mathcal{L}_{KoLeo} = -\frac{1}{N} \sum_{i=1}^{N} \log(p_i) \tag{4.6}$$

Where $p_i = \min_{j \neq i} \lVert f(x_i) - f(x_j) \rVert$. This regularization helps to maximize the
distance between each point and its neighbors. By doing so it helps to alleviate
the collapse issue. We combine this term with the cross-entropy and contrastive
objectives, weighting it using a coefficient γ.

$$\mathcal{L}_{ALL} = \mathcal{L}_{CE} + \mathcal{L}_{SCL} + \gamma \mathcal{L}_{KoLeo} \tag{4.7}$$

The joint training of these individual components leads our model to have better
discriminative features that are more robust, which results in improved generalization
and prediction accuracy.
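
The differential entropy regularizer of Eq. 4.6 is simple enough to sketch directly. The code below is an illustration only: the function name koleo_loss and the toy points are hypothetical, Euclidean distance is assumed for the norm, and the batch is assumed to contain at least two distinct points so that every logarithm is defined.

```python
import math


def koleo_loss(points):
    """Kozachenko-Leonenko regularizer: -1/N * sum_i log(min_{j != i} ||x_i - x_j||)."""
    n = len(points)
    total = 0.0
    for i, p in enumerate(points):
        # Distance from point i to its nearest other point in the batch.
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        total += math.log(nearest)
    return -total / n


# The same triangle of points at two different scales.
tight = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
spread = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
loss_tight = koleo_loss(tight)
loss_spread = koleo_loss(spread)
```

Minimizing this term rewards batches whose points are far from their nearest neighbors, which is why it counteracts the collapse of the representation space.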

TABLE 4.1

BREAKDOWN OF THE INTENT CLASSIFICATION DATASETS.

Dataset      Intents   Domains   Train    Val     Test
CLINC150     151       10        15,250   3,100   5,550
Banking77    77        1         9,002    1,001   3,080
Hwu64        64        21        8,884    1,076   1,076

Note that BANKING77 and HWU64 do not provide validation sets, so we generated
a validation set from the original training set.

4.3 Experiments

4.3.1 Experimental Settings

Datasets We use three well-known benchmark datasets to test and compare the
TK-KNN model against other models on the intent classification task. The datasets
are CLINC150 [86], which contains 150 in-domain intent classes
from ten different domains and one out-of-domain class; BANKING77 [20], which
contains 77 intents, all related to the banking domain; and HWU64 [89], which includes
64 intents coming from 21 different domains. Banking77 and Hwu64 do not provide
validation sets, so we created our own from the original training sets. All datasets
are in English. A breakdown of each dataset is shown in Table 4.1.
We conducted our experiments with varying amounts of labeled data for each
dataset. All methods are run with five random seeds and the mean accuracy
of their results is reported [44]. This methodology permits tests of statistical
significance. Reported results are therefore accompanied by 95% confidence intervals.
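
As a sketch of how a mean and a 95% confidence interval can be derived from five runs, the code below uses a Student's t interval. The text does not state which interval estimator was used, so the t-based interval is an assumption, and the function name and toy accuracies are hypothetical. The constant 2.776 is the two-sided 95% critical value for four degrees of freedom.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom
# (five runs); a different number of runs needs a different value.
T_CRIT_DF4 = 2.776


def mean_and_ci(accuracies, t_crit=T_CRIT_DF4):
    """Return (mean, half-width of the confidence interval) over repeated runs."""
    mean = statistics.mean(accuracies)
    std_err = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, t_crit * std_err


# Hypothetical test accuracies from five random seeds.
toy_runs = [50.0, 52.0, 54.0, 56.0, 58.0]
run_mean, run_half_width = mean_and_ci(toy_runs)
```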

4.3.2 Baselines

To perform a proper and thorough comparison of TK-KNN with existing methods,
we implemented and repeated the experiments on the following models and strategies.

• Supervised: Uses only the labeled portion of the dataset to train the model without
any semi-supervised training. This model constitutes a competitive lower bound
of performance because of the limits in the amount of labeled data.
• Pseudo Labeling (PL) [88]: This strategy trains the model to convergence
then makes predictions on all of the unlabeled data examples. These examples
are then combined with the labeled data and used to re-train the model in an
iterative manner.
• Pseudo Labeling with Threshold (PL-T) [129]: This process follows the
pseudo labeling strategy but only selects unlabeled data elements which are
predicted above a threshold τ . We use a τ of 0.95 based on the findings from
previous work.
• Pseudo Labeling with Flexmatch (PL-Flex) [152]: Rather than using a
static threshold across all classes, a dynamic threshold is used for each class
based on a curriculum learning framework.
• GAN-BERT [36]: This method applies generative adversarial networks [60]
to a pre-trained BERT model. The generator is an MLP that takes in a noise
vector. The output head added to the BERT model acts as the discriminator
and includes an extra class for predicting whether a given data element is real
or not.
• MixText [28]: This method extends the MixUp [153] framework to NLP and
mixes the hidden representations of BERT. The method also
takes advantage of consistency regularization in the form of back translated
examples.
• TK-KNN : The method described in the present work using top-k sampling
with a weighted selection based on model predictions and cosine similarity to
the labeled samples.
• Top-k Upper: Top-k sampling method that always selects the correct
pseudo-label. This model serves as an upper bound.

4.3.3 Implementation Details

Each method uses the BERT base model with a classification head attached.
We use the base BERT implementation provided by Huggingface [142], which contains
a total of 110M parameters. All models are trained for 30 cycles of self-training.
The models are optimized with the AdamW optimizer with a learning rate of 5e-5.
Each model is trained until convergence by early stopping applied according to the
validation set. We use a batch size of 256 across experiments and limit the sequence
length to 64 tokens. For TK-KNN, we set k = 6 and β = 0.75 and report the results
for these settings. An ablation study of these two hyperparameters is presented later.

Computational Use. In total we estimate that we used around 18,000 GPU hours
for this project. For the final experiments and ablation studies we estimate that the
TK-KNN model used 4400 GPU hours. Experiments were carried out on Nvidia
Tesla P100 GPUs with either 12GB or 16GB of memory.

4.4 Results

Results from these experiments are shown in Table 4.2. These quantitative results
demonstrate that TK-KNN yielded the best performance on the benchmark datasets.
We observed the most significant performance gains for CLINC150 and BANKING77,
which have more classes. For instance, on the CLINC150 dataset
with 1% labeled data, our method performs 10.92 percentage points better than the second best
strategy, FlexMatch. As the portion of labeled data used increases, we notice that
the effectiveness of TK-KNN diminishes.
Another observation from these results is that the GAN-BERT model tends to
be unstable when the labeled data is limited. This causes the model to have a much
larger confidence interval than other methods. However, GAN-BERT does improve
as the proportion of labeled data increases. We also find that while the MixText
method shows improvements, the benefits of consistency regularization are not as
strong compared to works from the computer vision domain.
These results demonstrate the benefits of TK-KNN’s balanced sampling strategy

TABLE 4.2

RESULTS FOR CLINC150, BANKING77, AND HWU64

                                  Percent Labeled
Method           1%             2%             5%             10%
CLINC150
Supervised       27.35 ±1.71    49.15 ±1.99    67.96 ±0.85    75.05 ±1.57
PL               24.51 ±3.92    48.58 ±1.79    69.19 ±0.54    76.92 ±1.05
PL-T             39.05 ±3.26    56.65 ±1.53    71.25 ±0.5     79.29 ±1.62
PL-Flex          42.81 ±4.39    60.07 ±1.42    73.42 ±1.62    78.86 ±1.01
GAN-BERT         18.18 ±0.0     23.29 ±11.42   44.89 ±24.39   63.02 ±25.1
MixText          12.86 ±6.39    37.93 ±16.8    61.39 ±0.77    74.29 ±0.37
TK-KNN (Ours)    53.73 ±1.72    65.87 ±1.18    74.31 ±0.96    79.45 ±1.01
BANKING77
Supervised       34.73 ±1.5     47.51 ±2.89    70.27 ±1.08    80.82 ±0.41
PL               29.09 ±3.83    45.16 ±2.71    69.69 ±2.16    80.26 ±0.49
PL-T             35.12 ±3.86    51.67 ±3.14    71.16 ±1.98    81.88 ±0.43
PL-Flex          40.04 ±3.4     54.18 ±3.31    73.43 ±1.55    82.54 ±0.84
GAN-BERT         5.4 ±9.16      16.98 ±21.73   54.09 ±29.56   79.64 ±1.39
MixText          32.73 ±6.02    54.75 ±3.15    76.59 ±1.05    82.34 ±0.94
TK-KNN (Ours)    54.16 ±4.56    62.71 ±2.30    76.73 ±1.46    84.45 ±0.52
HWU64
Supervised       48.87 ±1.55    63.88 ±1.6     74.67 ±1.91    82.21 ±1.72
PL               48.46 ±1.86    64.39 ±1.66    75.76 ±1.69    82.49 ±0.94
PL-T             56.9 ±1.64     68.29 ±1.79    76.9 ±1.1      82.96 ±1.69
PL-Flex          60.15 ±3.27    69.87 ±0.93    77.99 ±1.4     83.83 ±1.2
GAN-BERT         33.36 ±16.55   32.9 ±29.07    72.32 ±1.41    81.78 ±1.64
MixText          33.3 ±8.98     56.46 ±11.08   66.65 ±7.28    79.72 ±1.27
TK-KNN (Ours)    65.33 ±2.29    73.03 ±1.31    79.63 ±0.56    84.59 ±0.58

Mean test accuracy results and their 95% confidence intervals across 5 repetitions
with different random seeds. All experiments used k = 6 and
β = 0.75. TK-KNN outperformed existing state of the art models, especially
when the label set is small.

[Figure 4.3 plot. y-axis: Test Accuracy (0.2 to 0.5); x-axis: Cycle (0 to 30); series: PL-T, PL-Flex, TK-KNN.]

Figure 4.3. Convergence analysis of pseudo-labelling strategies on
CLINC150 at 1% labeled data. TK-KNN clearly outperforms the other
pseudo-labelling strategies by balancing class pseudo labels after each
training cycle.

and its use of the distances in the latent space.

4.4.1 Overconfidence in Pseudo-Labelling Regimes

A key observation we found throughout self-training was that the performance
of existing pseudo-labelling methods tended to degrade as the number of cycles increased.
An example of this is illustrated in Figure 4.3. Here we see that when a
pre-defined threshold is used, the model tends to improve performance for the first
few training cycles. After that point, the pseudo-labeling becomes heavily biased
towards the easier classes. This causes the model to become overconfident in pre-
dictions for those classes and neglect more difficult classes. PL-Flex corrects this
issue but converges much earlier in the learning process. TK-KNN achieves the best
performance thanks to the slower balanced pseudo-labeling approach. This process
helps the model learn clearer decision boundaries for all classes simultaneously and
prevents the model from becoming overconfident in some classes.

[Figure 4.4 plots. Panels: CLINC150, BANKING77, HWU64; y-axis: Test Accuracy (0.4 to 0.8); x-axis: Cycle (0 to 30); series: TK-KNN, Top-K Upper Bound.]

Figure 4.4. Ablation results for each dataset using 1% labeled data.

4.4.2 Ablation Study

Because TK-KNN is different from existing methods in two distinct ways: (1) top-
k balanced sampling and (2) KNN ranking, we perform a set of ablation experiments
to better understand how each of these affects performance. Specifically, we test
TK-KNN under three scenarios: top-k sampling without balancing the classes, top-k
sampling with balanced classes, and top-k KNN without balancing for classes. When
we perform top-k sampling in an unbalanced manner, we ensure that the total data
sampled is still equal to k × C, where C is the number of classes.
The results from the ablation study demonstrate the effectiveness of both top-k
sampling and KNN ranking. A comparison between the unbalanced and balanced
versions of top-k sampling shows a drastic difference in performance across all
datasets. We highlight again that the performance difference is greatest in the lowest
resource setting, with a 12.47% increase in accuracy for CLINC150 in the 1% setting.
Results from the TK-KNN method with unbalanced sampling also show an
improvement over unbalanced sampling alone. This increase in performance is smaller
than the difference between unbalanced and balanced sampling but still highlights
the benefits of leveraging the geometry of the latent space for selective pseudo-labeling.
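The difference between the unbalanced and balanced scenarios can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiments: the softmax outputs are made up, the function names are hypothetical, and ranking uses only the model confidence (no KNN term). Both selectors share the same budget of k ∗ C examples.

```python
import numpy as np

def top_k_unbalanced(probs, k):
    """Pick the k * C most confident examples overall, ignoring class balance."""
    num_classes = probs.shape[1]
    confidence = probs.max(axis=1)
    budget = min(k * num_classes, len(confidence))
    return np.argsort(-confidence)[:budget]

def top_k_balanced(probs, k):
    """Pick up to k of the most confident examples for each predicted class."""
    predicted = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    chosen = []
    for c in range(probs.shape[1]):
        members = np.flatnonzero(predicted == c)
        # Rank this class's members from most to least confident.
        ranked = members[np.argsort(-confidence[members])]
        chosen.extend(ranked[:k].tolist())
    return np.array(chosen, dtype=int)

# Toy softmax outputs (made up) for C = 3 classes.
probs = np.array([
    [0.97, 0.02, 0.01],
    [0.95, 0.03, 0.02],
    [0.93, 0.04, 0.03],
    [0.30, 0.60, 0.10],
    [0.20, 0.25, 0.55],
    [0.35, 0.40, 0.25],
])
predicted = probs.argmax(axis=1)

unbalanced = top_k_unbalanced(probs, k=1)
balanced = top_k_balanced(probs, k=1)
print(np.bincount(predicted[unbalanced], minlength=3))
print(np.bincount(predicted[balanced], minlength=3))
```

With the same budget, the unbalanced selector spends every slot on the most confident class, while the balanced selector takes one example per class. The toy data exaggerates the effect, but it shows why the two scenarios can diverge.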

4.4.3 Upper Bound Analysis

We further ran experiments to gauge the performance of top-k sampling when
ground truth labels are fed to the model instead of predicted pseudo labels. This
experiment gives us an indicator as to how performance should increase throughout
the self-training process in an ideal pseudo-labeling scenario. We present the results
of this in Figure 4.4. As expected, the model tends to converge towards fully
supervised performance as the number of cycles increases and more data is
(pseudo-)labeled. Another point of interest is that the upper bound method can
continue learning with proper labels, while the TK-KNN method tends to converge
earlier. The upper bound method also shows a significant increase in accuracy in the
first few cycles. This highlights a need to further investigate methods for accurate
pseudo label selection, so that the model can continue to improve.

4.4.4 Parameter Search

TK-KNN relies on two hyperparameters, k and β, that can affect performance
based on how they are configured. We ran experiments to gauge their effect on
learning by testing k ∈ {4, 6, 8} and β ∈ {0.0, 0.25, 0.50, 0.75, 1.00}. When varying k
we hold β at 0.75. For β experiments we keep k = 6. When β = 0.0, this is equivalent
to just top-k sampling based on 4.3. Alternatively, when β = 1.0, this is equivalent to
only using the KNN similarity for ranking. Results from our experiments are shown
in Figure 4.5 for β and Figure 4.6 for k.
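The role of β can be sketched with a simple linear blend. This is an assumption made for illustration: it matches the two endpoints described above (confidence only at β = 0.0, KNN similarity only at β = 1.0), but the exact scoring function should be taken from the method section rather than from this sketch. The scores are made up and the function name is hypothetical.

```python
import numpy as np

def blended_score(confidence, knn_similarity, beta):
    """Linear blend (assumed form): beta=0 ranks by confidence, beta=1 by KNN similarity."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    confidence = np.asarray(confidence, dtype=float)
    knn_similarity = np.asarray(knn_similarity, dtype=float)
    return (1.0 - beta) * confidence + beta * knn_similarity

# Made-up scores for three unlabeled examples predicted as the same class.
confidence = np.array([0.90, 0.80, 0.60])
knn_similarity = np.array([0.20, 0.70, 0.95])

# The beta grid below is the one tested in this section.
for beta in (0.0, 0.25, 0.50, 0.75, 1.00):
    ranking = np.argsort(-blended_score(confidence, knn_similarity, beta))
    print(beta, ranking.tolist())
```

Printing the ranking for each β shows how the ordering of candidates shifts from confidence-driven to similarity-driven as β grows, which is the behavior the parameter search probes.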
As we varied the β parameter, we noticed that all configurations tended to have
similar training patterns. After we trained the model for the first five cycles, the
model tended to move in small jumps between subsequent cycles.

[Figure 4.5: line plot titled "Comparison of β Hyperparameter" showing test accuracy (y-axis) against training cycle (x-axis) for β = 0.0, 0.25, 0.5, 0.75, and 1.0.]

Figure 4.5. A comparison of TK-KNN on HWU64 with 1% labeled data as
β varies.

From the illustration, we can see that no single setting was always the best, but the
model tended to perform worse when β = 0.0, highlighting the benefits of including
our KNN similarity for ranking. The model reached the best performance when
β = 0.75, which occurs about a third of the way through the training process.
A comparison of values for k shows that TK-KNN is robust to adjustments in this
hyperparameter. We notice slight performance benefits from selecting a higher k of
6 or 8 in comparison to 4. When a higher value of k is used, the model sees an
increase in performance earlier in the self-training process, as it has more examples
to train from. This is only achievable, though, when high-quality, correct samples are
selected across the entire class distribution. If the selected value of k is too large,
more incorrect examples will be included early in the training process, which may
result in poor model performance.

[Figure 4.6: line plot titled "Comparison of k Hyperparameter" showing test accuracy (y-axis) against training cycle (x-axis) for k = 4, 6, and 8.]

Figure 4.6. A comparison of TK-KNN on HWU64 with 1% labeled data as
k varies.

4.5 Conclusions

This chapter introduces TK-KNN, a balanced distance-based pseudo-labeling
approach for semi-supervised intent classification. TK-KNN deviates from previous
pseudo-labeling methods as it does not rely on a threshold to select the samples.
Instead, we show that a balanced approach that combines the model prediction with
a K-Nearest Neighbor similarity measure allows for more robust decision boundaries
to be learned. Experiments on three popular intent classification datasets, CLINC150,
BANKING77, and HWU64, demonstrate that our method improved performance in
almost all scenarios.

4.6 Limitations

While our method shows noticeable improvements, it is not without limitations.
Our method does not require searching for a good threshold but instead requires
two different hyperparameters, k and β, that must be tuned. We offer a reasonable
method and findings for selecting both of these, but others may want to search for
other combinations depending on the dataset. A noticeable drawback of our
self-training method is that more training cycles are needed, especially if the
value of k is small. This requires much more GPU time to converge to a good point.
Further, we did not explore any heavily imbalanced datasets, so we do not know
how TK-KNN would perform under those scenarios.

CHAPTER 5

CONCLUSION

Understanding how narratives develop on social media is crucial for managing
public opinion, detecting and combating misinformation, and handling political and
social impacts. These narratives form the public opinion that guides decisions
made throughout people's daily lives. Additionally, understanding narratives
provides valuable cultural insights and can inform the design of more responsible
platform algorithms.
In this dissertation I covered three topics and their relation to narratives on
social media. First, I discussed moral judgements that are cast on social media. I
showed how a model could be trained on individuals' moral judgements and applied
to a broad variety of other situations. From there, I drew insights into group and
user behavior based on the patterns found in the judgements rendered on other
individuals. These moral judgements within narratives help to inform us of the
potential ethical considerations of a community and to better understand these issues.
Next, I discussed how to model and predict the flow of conversations on social media.
I demonstrated how to construct an entity graph that moves through the turns of
a conversation to see where it tends to go. A visual tool was presented that
helps surface these trends and allows an analyst to explore the potential paths. I
found from this analysis that affective polarization was apparent within different
political communities and that temporal shifts in topics were noticeable. Analyzing
these conversational flows allows for the comprehension of the dynamics of online
narrative discussions, enabling the identification of influential factors and patterns.
Finally, I proposed a new method for semi-supervised intent detection. With this
method I showed how it was possible to leverage a very small amount of labeled data
paired with unlabeled data to improve model performance. My method relied on
balanced sampling of the classes and a KNN objective to improve pseudo label
selection. Experiments from this work demonstrated the benefit of this methodology
over previous works on semi-supervised learning, especially those tailored to natural
language processing.
Intent detection can be applied to a variety of tasks, and it is particularly important
for studying narratives, where it can help detect manipulative content and aid in the
management of false narratives that may lead to social harm.

5.1 Future Work

In this work I covered different methods to help understand narratives on social
media. While each of these works offers interesting insights on its own, they are
not without limitations. Narratives are complex and can require all of these
methods to garner insights from user behavior. However, the current methods consider
each of these aspects in isolation from one another. Further work should explore
how to combine these methods. By combining these methods and perspectives, a more
comprehensive understanding of narratives on social media can be achieved. For
example, the insights gained from analyzing conversational flow can help
contextualize and explain the moral judgements that are being passed. The intents
and moral judgements that I extract could also be mapped into a flow over time.
This would allow for a more granular look at how individual ideologies shift
surrounding different real world events.
One of the first avenues to explore would be a small annotation effort to enable
intent classification on social media with the TK-KNN method presented here. As
discussed before, intent classification on social media has seen a limited amount of
interest [105, 131], due in part to the ever-changing conversations and high cost of
annotations. I define intent on social media as an action an individual wants to
take, or tells others to take. This form of messaging has become extremely prevalent
in narratives on social media, as people urge others to action. For instance, during
the French election in 2017 there were numerous topics fiercely debated between the
two candidates, with both using Twitter as their preferred platform to disseminate
their stance. To this end, intent could be defined in numerous ways and used with the
TK-KNN method. For this particular setup, intent could be viewed as users messaging
about voting for candidates, supporting or opposing issues, or more general emotion
detection. TK-KNN works as a short-text classifier and can be applied to any of these
setups. Furthermore, the intents of interest could be classes associated with moral
judgements, such as fairness, disgust, or pride.
Of particular interest is discovering new intents as they occur. Like many
machine learning methods, the semi-supervised technique operates on a closed world
assumption that all of the classes are known at inference time. In reality, especially
for social media, new classes will need to be discovered as real world events happen.
To address this, modifications would be needed to classify intents that have not been
seen before. These examples would need to be held out and then labeled in some
manner before being fed back to the semi-supervised algorithm to learn the new class.
One promising direction would be to leverage large generative models, such as GPT-3,
to automatically label and discover these new classes. This would further minimize the
need for human annotation and could allow for a system that updates itself quickly
and entirely on its own.
In-context learning has become a burgeoning method for addressing these
situations [19]. A major driver of this shift was the emergence of new capabilities
in language models as they were scaled up [141]. While many of these in-context
learning setups rely on just a few labeled examples, work is still being done
to understand how best to format these prompts. Prior methods have focused on
how best to retrieve good examples for in-context learning [115]. Methods such as
this have shown improved performance on tasks, but much is still not known about
the underlying mechanisms that make in-context learning work well. In particular, an
empirical study [95] has shown that removing accurate labels and replacing them
with random ones only minimally impacts overall performance. These findings
are important to future work in automated intent discovery as they highlight how
random labels can still improve the model's classification performance. Other
methods, such as channel prompting [94], flip the paradigm around by passing the
normally predicted portion to the model to force it to predict what the original input
was. This has been shown to increase performance across a variety of prompts and
demonstrates particularly strong performance when known labels are lacking.
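The difference between direct and channel prompting can be sketched with plain string templates. This is a minimal illustration under stated assumptions: the "Text:" and "Intent:" templates, the function names, and the example utterances are hypothetical, and the language model call that would score each prompt is deliberately left out.

```python
def direct_prompt(demonstrations, text):
    """Direct format: condition on the input text and let the model predict the label."""
    blocks = [f"Text: {x}\nIntent: {y}" for x, y in demonstrations]
    blocks.append(f"Text: {text}\nIntent:")
    return "\n\n".join(blocks)

def channel_prompts(demonstrations, text, labels):
    """Channel format: condition on each candidate label and score the input text.

    Returns {label: (context, continuation)}. A language model (not shown) would
    score each continuation given its context, and the label with the most
    likely continuation would be chosen.
    """
    prompts = {}
    for label in labels:
        blocks = [f"Intent: {y}\nText: {x}" for x, y in demonstrations]
        blocks.append(f"Intent: {label}\nText:")
        prompts[label] = ("\n\n".join(blocks), " " + text)
    return prompts

# Hypothetical demonstrations and query.
demos = [("what is my balance", "balance"), ("send 50 dollars to sam", "transfer")]
query = "move money to my savings account"

print(direct_prompt(demos, query))
for label, (context, continuation) in channel_prompts(demos, query, ["balance", "transfer"]).items():
    print("---", label)
    print(context + continuation)
```

In the direct format the model generates the label. In the channel format the label is given and the model is asked how likely the original input is, once per candidate label.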

BIBLIOGRAPHY

1. K. Ameriks and D. M. Clarke. Aristotle: Nicomachean Ethics. Cambridge
University Press, 2000.

2. E. Arazo, D. Ortego, P. Albert, N. E. O’Connor, and K. McGuinness.
Pseudo-labeling and confirmation bias in deep semi-supervised learning. In IJCNN,
2020.

3. C. Armon and T. L. Dawson. Developmental trajectories in moral reasoning
across the life span. Journal of moral education, 26(4):433–453, 1997.

4. P. Bachman, O. Alsharif, and D. Precup. Learning with pseudo-ensembles. In
NeurIPS, 2014.

5. R. Basak, S. Sural, N. Ganguly, and S. K. Ghosh. Online public shaming on
twitter: Detection, analysis, and mitigation. IEEE Transactions on Computational
Social Systems, 6(2):208–220, 2019.

6. S. Basu, A. Sharaf, A. Fischer, V. Rohra, M. Amoake, H. El-Hammamy,
E. Nosakhare, V. Ramani, B. Han, et al. Semi-supervised few-shot intent
classification and slot filling. arXiv preprint arXiv:2109.08754, 2021.

7. J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn. The
pushshift reddit dataset. ICWSM, Jan 2020. URL
http://arxiv.org/abs/2001.08435.

8. J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn. The
pushshift reddit dataset. Proceedings of the international AAAI conference on
web and social media, 2020.

9. K. Bebbington, C. MacLeod, T. M. Ellison, and N. Fay. The sky is falling:
evidence of a negativity bias in the social transmission of information. Evolution
and Human Behavior, 38(1):92–101, 2017.

10. M. Bostock, V. Ogievetsky, and J. Heer. D3 data-driven documents. IEEE
transactions on visualization and computer graphics, 17(12):2301–2309, 2011.

11. N. Botzer and T. Weninger. Entity graphs for exploring online discourse. arXiv
preprint arXiv:2304.03351, 2023.

12. N. Botzer, Y. Ding, and T. Weninger. Reddit entity linking dataset. Information
Processing & Management, 58(3):102479, 2021.
13. N. Botzer, S. Gu, and T. Weninger. Analysis of moral judgment on reddit.
IEEE Transactions on Computational Social Systems, 2022.
14. R. Boyd and P. J. Richerson. Punishment allows the evolution of cooperation
(or anything else) in sizable groups. Ethology and Sociobiology, 13(3):171–195,
May 1992. ISSN 0162-3095.
15. R. Boyd and P. J. Richerson. Punishment allows the evolution of cooperation
(or anything else) in sizable groups. Ethology and sociobiology, 13(3):171–195,
1992.
16. W. J. Brady and M. J. Crockett. How effective is online outrage? Trends in
cognitive sciences, 23(2), 2019.
17. W. J. Brady, M. Crockett, and J. J. Van Bavel. The mad model of moral con-
tagion: The role of motivation, attention, and design in the spread of moralized
content online. Perspectives on Psychological Science, 15(4):978–1010, 2020.
18. W. J. Brady, K. McLoughlin, T. N. Doan, and M. J. Crockett. How social
learning amplifies moral outrage expression in online social networks. Science
Advances, 7(33):eabe5641, 2021.
19. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Nee-
lakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot
learners. Advances in neural information processing systems, 33:1877–1901,
2020.
20. I. Casanueva, T. Temčinas, D. Gerz, M. Henderson, and I. Vulić. Efficient
intent detection with dual sentence encoders. In Workshop on Natural Language
Processing for Conversational AI, 2020.
21. P. Cascante-Bonilla, F. Tan, Y. Qi, and V. Ordonez. Curriculum labeling:
Revisiting pseudo-labeling for semi-supervised learning. In AAAI, 2021.
22. D. Centola. The spread of behavior in an online social network experiment.
science, 329(5996):1194–1197, 2010.
23. W. Chafe. Discourse, consciousness, and time: The flow and displacement of
conscious experience in speaking and writing. University of Chicago Press, 1994.
24. W. Chafe. Language and the flow of thought. The new psychology of language,
pages 93–111, 2017.
25. E. Chandrasekharan, U. Pavalanathan, A. Srinivasan, A. Glynn, J. Eisenstein,
and E. Gilbert. You can’t stay here: The efficacy of reddit’s 2015 ban examined
through hate speech. Proceedings of the ACM on Human-Computer Interaction,
1(CSCW):1–22, 2017.

26. E. Chandrasekharan, M. Samory, S. Jhaver, H. Charvat, A. Bruckman,
C. Lampe, J. Eisenstein, and E. Gilbert. The internet’s hidden rules: An
empirical study of reddit norm violations at micro, meso, and macro scales.
Proceedings of the ACM on Human-Computer Interaction, 2(CSCW):1–25,
2018.

27. J. P. Chang and C. Danescu-Niculescu-Mizil. Trouble on the horizon:
Forecasting the derailment of online conversations as they develop. In Proceedings
of the 2019 Conference on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 4745–4756, 2019.

28. J. Chen, Z. Yang, and D. Yang. Mixtext: Linguistically-informed interpolation
of hidden space for semi-supervised text classification. In ACL, 2020.

29. J. Chen, D. Tam, C. Raffel, M. Bansal, and D. Yang. An empirical survey of data
augmentation for limited data learning in nlp. arXiv preprint arXiv:2106.07499,
2021.

30. X. Cheng and D. Roth. Relational inference for wikification. Empirical Methods
in Natural Language Processing, 2013.

31. Z. Cheng, Z. Jiang, Y. Yin, C. Wang, and Q. Gu. Learning to classify open
intent via soft labeling and manifold mixup. IEEE/ACM Transactions on Audio,
Speech, and Language Processing, 30:635–645, 2022.

32. O. Choy, A. Raine, P. H. Venables, and D. P. Farrington. Explaining the gender
gap in crime: The role of heart rate. Criminology, 55(2):465–487, 2017.

33. M. Cinelli, G. De Francisci Morales, A. Galeazzi, W. Quattrociocchi, and
M. Starnini. The echo chamber effect on social media. Proceedings of the
National Academy of Sciences, 118(9):e2023301118, 2021.

34. A. Colby, L. Kohlberg, J. Gibbs, M. Lieberman, K. Fischer, and H. D. Saltzstein.
A longitudinal study of moral judgment. Monographs of the society for research
in child development, pages 1–124, 1983.

35. A. M. Collins and E. F. Loftus. A spreading-activation theory of semantic
processing. Psychological review, 82(6):407, 1975.

36. D. Croce, G. Castellucci, and R. Basili. Gan-bert: Generative adversarial
learning for robust text classification with a bunch of labeled examples. In ACL,
2020.

37. M. J. Crockett. Moral outrage in the digital age. Nature human behaviour, 1
(11):769–771, 2017.

38. M. De Choudhury, S. S. Sharma, T. Logar, W. Eekhout, and R. C. Nielsen. Gen-
der and cross-cultural differences in social media disclosures of mental illness.
In Proceedings of the 2017 ACM conference on computer supported cooperative
work and social computing, pages 353–369, 2017.
39. B. De Cleen. Populism and nationalism. The Oxford handbook of populism, 1:
342–362, 2017.
40. L. Derczynski, D. Maynard, G. Rizzo, M. Van Erp, G. Gorrell, R. Troncy,
J. Petrak, and K. Bontcheva. Analysis of named entity recognition and linking
for tweets. Information Processing & Management, 51(2):32–49, 2015.
41. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
42. G. Ding, S. Zhang, S. Khan, Z. Tang, J. Zhang, and F. Porikli. Feature
affinity-based pseudo labeling for semi-supervised person re-identification. IEEE
Transactions on Multimedia, 21(11):2891–2902, 2019.
43. Y. Ding, N. Botzer, and T. Weninger. Posthoc verification and the fallibility of
the ground truth. arXiv preprint arXiv:2106.07353, 2021.
44. R. Dror, G. Baumer, S. Shlomov, and R. Reichart. The hitchhiker’s guide to
testing statistical significance in natural language processing. In ACL, 2018.
45. S. Dutta, D. Das, and T. Chakraborty. Changing views: Persuasion modeling
and argument extraction from online discussions. Information Processing &
Management, 57(2):102085, 2020.
46. A. El-Nouby, N. Neverova, I. Laptev, and H. Jégou. Training vision transformers
for image retrieval. arXiv preprint arXiv:2102.05644, 2021.

47. D. Emelin, R. L. Bras, J. D. Hwang, M. Forbes, and Y. Choi. Moral stories:
Situated reasoning about norms, intents, actions, and their consequences. arXiv
preprint arXiv:2012.15738, 2020.

48. T. Farrell, M. Fernandez, J. Novotny, and H. Alani. Exploring misogyny across
the manosphere in reddit. Proceedings of the 10th ACM Conference on Web
Science, 2019.

49. S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and
E. Hovy. A survey of data augmentation approaches for nlp. arXiv preprint
arXiv:2105.03075, 2021.

50. M. Forbes, J. D. Hwang, V. Shwartz, M. Sap, and Y. Choi. Social chemistry 101:
Learning to reason about social and moral norms. In Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP),
pages 653–670, 2020.

51. T. M. Fruchterman and E. M. Reingold. Graph drawing by force-directed place-
ment. Software: Practice and experience, 21(11):1129–1164, 1991.

52. S. Fu, H. Li, Y. Liu, H. Pirkkalainen, and M. Salo. Social media overload,
exhaustion, and use discontinuance: Examining the effects of information over-
load, system feature overload, and social overload. Information Processing &
Management, 57(6):102307, 2020.

53. K. Garimella, G. De Francisci Morales, A. Gionis, and M. Mathioudakis.
Political discourse on social media: Echo chambers, gatekeepers, and the price of
bipartisanship, 2018.

54. E. Gilbert. Widespread underprovision on reddit. In CSCW, pages 803–808,
2013.

55. M. Gjurković, M. Karan, I. Vukojević, M. Bošnjak, and J. Šnajder. Pandora
talks: Personality and demographics on reddit. In Proceedings of the Ninth
International Workshop on Natural Language Processing for Social Media, pages
138–152, 2021.

56. M. Glenski and T. Weninger. Rating effects on social news posts and comments.
TIST, 8(6):1–19, 2017.

57. M. Glenski, C. Pennycuff, and T. Weninger. Consumers and curators: Browsing
and voting patterns on reddit. IEEE Transactions on Computational Social
Systems, 4(4):196–206, Dec 2017. ISSN 2373-7476.

58. M. Glenski, G. Stoddard, P. Resnick, and T. Weninger. Guessthekarma: a game
to assess social rating systems. CSCW, 2:1–15, 2018.

59. M. Glenski, E. Ayton, J. Mendoza, and S. Volkova. Multilingual multimodal
digital deception detection and disinformation spread across social platforms.
arXiv preprint arXiv:1909.05838, 2019.

60. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio. Generative adversarial networks. Communications
of the ACM, 63(11):139–144, 2020.

61. J. Graham, J. Haidt, S. Koleva, M. Motyl, R. Iyer, S. P. Wojcik, and P. H.
Ditto. Chapter Two - Moral Foundations Theory: The Pragmatic Validity of
Moral Pluralism, volume 47, pages 55–130. Academic Press, Jan 2013.

62. S. C. Guntuku, A. M. Buttenheim, G. Sherman, and R. M. Merchant. Twitter
discourse reveals geographical and temporal variation in concerns about
covid-19 vaccines in the united states. Vaccine, 39(30):4034–4038, 2021.

63. S. Guo, N. Mokhberian, and K. Lerman. A data fusion framework for multi-
domain morality learning. arXiv preprint arXiv:2304.02144, 2023.

64. J. Haidt. The righteous mind: Why good people are divided by politics and
religion. Vintage, 2012.

65. J. C. Harsanyi. Morality and the theory of rational behavior. Social research,
pages 623–656, 1977.

66. S. Hechler and T. Kessler. On the difference between moral outrage and em-
pathic anger: Anger about wrongful deeds or harmful consequences. Journal of
Experimental Social Psychology, 76:270–282, 2018.

67. D. Herman. Basic elements of narrative. John Wiley & Sons, 2009.

68. J. Hoover, G. Portillo-Wightman, L. Yeh, S. Havaldar, A. M. Davani, Y. Lin,
B. Kennedy, M. Atari, Z. Kamel, M. Mendlen, et al. Moral foundations twitter
corpus: A collection of 35k tweets annotated for moral sentiment. Social
Psychological and Personality Science, 11(8):1057–1071, 2020.

69. L. Huddy and N. Khatib. American patriotism, national identity, and political
involvement. American journal of political science, 51(1):63–77, 2007.

70. C. Hutto and E. Gilbert. Vader: A parsimonious rule-based model for senti-
ment analysis of social media text. In Proceedings of the International AAAI
Conference on Web and Social Media, volume 8, 2014.

71. F. Ilievski, P. Vossen, and S. Schlobach. Systematic study of long tail phe-
nomena in entity linking. Proceedings of the 27th International Conference on
Computational Linguistics, 2018.

72. S. Iyengar, Y. Lelkes, M. Levendusky, N. Malhotra, and S. J. Westwood. The
origins and consequences of affective polarization in the united states. Annual
Review of Political Science, 22:129–146, 2019.

73. R. Jiang, S. Chiappa, T. Lattimore, A. György, and P. Kohli. Degenerate
feedback loops in recommender systems. In Proceedings of the 2019 AAAI/ACM
Conference on AI, Ethics, and Society, pages 383–390, 2019.

74. J. J. Jordan and D. G. Rand. Signaling when no one is watching: A reputation
heuristics account of outrage and punishment in one-shot anonymous
interactions. Journal of personality and social psychology, 118(1):57, 2020.

75. J. Jung, B. Son, and S. Lyu. Attnio: Knowledge graph exploration with in-and-
out attention flow for knowledge-grounded dialogue. Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP),
2020.

76. N. Kalantari, D. Liao, and V. G. Motti. Characterizing the online discourse in
twitter: Users’ reaction to misinformation around covid-19 in twitter, 2021.

77. I. Kant. Critique of judgment. Hackett Publishing, 1987.

78. B. F. Keith Norambuena and T. Mitra. Narrative maps: An algorithmic ap-
proach to represent and extract information narratives. Proceedings of the ACM
on Human-Computer Interaction, 4(CSCW3):1–33, 2021.

79. D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence
through a social network. Proceedings of the ninth ACM SIGKDD international
conference on Knowledge discovery and data mining, 2003.

80. P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot,
C. Liu, and D. Krishnan. Supervised contrastive learning. Advances in neural
information processing systems, 33:18661–18673, 2020.

81. A. M. Kibriya, E. Frank, B. Pfahringer, and G. Holmes. Multinomial naive
bayes for text categorization revisited. In Australasian Joint Conference on
Artificial Intelligence, pages 488–499. Springer, 2004.

82. N. Kolitsas, O.-E. Ganea, and T. Hofmann. End-to-end neural entity linking.
arXiv preprint arXiv:1808.07699, 2018.

83. L. F. Kozachenko and N. N. Leonenko. Sample estimate of the entropy of a
random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.

84. M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. From word embeddings to
document distances. International conference on machine learning, 2015.

85. I. Laradji, P. Rodríguez, D. Vazquez, and D. Nowrouzezahrai. Ssr:
Semi-supervised soft rasterizer for single-view 2d to 3d reconstruction. arXiv preprint
arXiv:2108.09593, 2021.

86. S. Larson, A. Mahendran, J. J. Peper, C. Clarke, A. Lee, P. Hill,
J. K. Kummerfeld, K. Leach, M. A. Laurenzano, L. Tang, et al. An evaluation dataset for
intent classification and out-of-scope prediction. In EMNLP-IJCNLP, 2019.

87. Q. Le and T. Mikolov. Distributed representations of sentences and documents.
In ICML, pages 1188–1196, 2014.

88. D.-H. Lee et al. Pseudo-label: The simple and efficient semi-supervised learning
method for deep neural networks. In Workshop on challenges in representation
learning, ICML, 2013.

89. X. Liu, A. Eshghi, P. Swietojanski, and V. Rieser. Benchmarking natural
language understanding services for building conversational agents. arXiv preprint
arXiv:1903.05566, 2019.

90. G. Liveley. Narratology. Oxford University Press, 2019.

91. M. Marcuzzo, A. Zangari, M. Schiavinato, L. Giudice, A. Gasparetto, and
A. Albarelli. A multi-level approach for hierarchical ticket classification. In W-NUT,
2022.

92. M. Mateas and P. Sengers. Narrative intelligence. J. Benjamins Pub., 2003.

93. A. N. Medvedev, R. Lambiotte, and J.-C. Delvenne. The anatomy of reddit: An
overview of academic research. Dynamics on and of Complex Networks, 2017.

94. S. Min, M. Lewis, H. Hajishirzi, and L. Zettlemoyer. Noisy channel language
model prompting for few-shot text classification. In Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), pages 5316–5330, 2022.

95. S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and
L. Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning
work? arXiv preprint arXiv:2202.12837, 2022.

96. S. Moon, P. Shah, A. Kumar, and R. Subba. Opendialkg: Explainable
conversational reasoning with attention-based walks over knowledge graphs. Proceedings
of the 57th Annual Meeting of the Association for Computational Linguistics,
2019.

97. E. Ng. No grand pronouncements here...: Reflections on cancel culture and
digital media participation. Television & New Media, 21(6):621–627, 2020.

98. K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-
training. In Proceedings of the ninth international conference on Information
and knowledge management, pages 86–93, 2000.

99. A. D. Ong and D. J. Weiss. The impact of anonymity on responses to sensitive
questions. Journal of Applied Social Psychology, 30(8):1691–1708, 2000.

100. Y. Ouali, C. Hudelot, and M. Tami. An overview of deep semi-supervised
learning. arXiv preprint arXiv:2006.05278, 2020.

101. Y. Ouali, C. Hudelot, and M. Tami. Semi-supervised semantic segmentation
with cross-consistency training. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 2020.

102. R. Page. The narrative dimensions of social media storytelling. The handbook
of narrative analysis, pages 329–347, 2015.

103. C. Paul and M. Matthews. The russian “firehose of falsehood” propaganda
model. Rand Corporation, 2(7):1–10, 2016.

104. L. Prechelt. Early stopping-but when? In Neural Networks: Tricks of the trade,
pages 55–69. Springer, 1998.

105. H. Purohit, G. Dong, V. Shalin, K. Thirunarayan, and A. Sheth. Intent
classification of short-text on social media. In 2015 ieee international conference on
smart city/socialcom/sustaincom (smartcity), pages 222–228. IEEE, 2015.

106. A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language
understanding by generative pre-training. 2018.
107. A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn,
B. Hedayatnia, M. Cheng, A. Nagar, et al. Conversational ai: The science
behind the alexa prize. arXiv preprint arXiv:1801.03604, 2018.
108. C. Ran, W. Shen, and J. Wang. An attention factor graph model for tweet
entity linking. Proceedings of the 2018 World Wide Web Conference, 2018.
109. A. Rao, F. Morstatter, M. Hu, E. Chen, K. Burghardt, E. Ferrara, K. Lerman,
et al. Political partisanship and antiscience attitudes in online discussions about
covid-19: Twitter content analysis. Journal of medical Internet research, 23(6):
e26692, 2021.
110. N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese
bert-networks. In Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, 2019.
111. T. Reynolds, C. Howard, H. Sjåstad, L. Zhu, T. G. Okimoto, R. F. Baumeister,
K. Aquino, and J. Kim. Man up and take it: Gender bias in moral typecasting.
Organizational Behavior and Human Decision Processes, 161:120–141, 2020.
112. M. N. Rizve, K. Duarte, Y. S. Rawat, and M. Shah. In defense of pseudo-
labeling: An uncertainty-aware pseudo-label selection framework for semi-
supervised learning. arXiv preprint arXiv:2101.06329, 2021.
113. M. G. Rodriguez, K. Gummadi, and B. Schoelkopf. Quantifying information
overload in social media and its impact on social contagions. Eighth Interna-
tional AAAI Conference on Weblogs and Social Media, 2014.
114. K. Rost, L. Stahel, and B. S. Frey. Digital social norm enforcement: Online
firestorms in social media. PLoS one, 11(6):e0155923, 2016.
115. O. Rubin, J. Herzig, and J. Berant. Learning to retrieve prompts for in-
context learning. In Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies, pages 2655–2671, 2022.
116. A. Sablayrolles, M. Douze, C. Schmid, and H. Jégou. Spreading vectors for
similarity search. In ICLR 2019-7th International Conference on Learning
Representations, pages 1–13, 2019.
117. E. Sagi and M. Dehghani. Measuring moral rhetoric in text. Social Science
Computer Review, 32(2):132–144, Apr 2014. ISSN 0894-4393.
118. T. Sawaoka and B. Monin. The paradox of viral outrage. Psychological science,
29(10):1665–1678, 2018.

119. C. Schein and K. Gray. The theory of dyadic morality: Reinventing moral
judgment by redefining harm. Personality and Social Psychology Review, 22
(1):32–70, 2018.

120. N. N. Schia and L. Gjesvik. Hacking democracy: managing influence campaigns
and disinformation in the digital age. Journal of Cyber Policy, 5(3):413–428,
2020.

121. H. A. Schwartz, J. C. Eichstaedt, M. L. Kern, L. Dziurzynski, S. M. Ramones,
M. Agrawal, A. Shah, M. Kosinski, D. Stillwell, M. E. Seligman, et al.
Personality, gender, and age in the language of social media: The open-vocabulary
approach. PloS one, 8(9):e73791, 2013.

122. O. Sevgili, A. Shelmanov, M. Arkhipov, A. Panchenko, and C. Biemann. Neural
entity linking: A survey of models based on deep learning. arXiv preprint
arXiv:2006.00575, 2020.

123. D. Shahaf and C. Guestrin. Connecting the dots between news articles. Pro-
ceedings of the 16th ACM SIGKDD international conference on Knowledge
discovery and data mining, 2010.

124. D. Shahaf, J. Yang, C. Suen, J. Jacobs, H. Wang, and J. Leskovec. Information
cartography: creating zoomable, large-scale maps of information. Proceedings
of the 19th ACM SIGKDD international conference on Knowledge discovery
and data mining, 2013.

125. W. Shen, J. Wang, and J. Han. Entity linking with a knowledge base: Is-
sues, techniques, and solutions. IEEE Transactions on Knowledge and Data
Engineering, 27(2):443–460, 2014.

126. K. Siddiqui. Heuristics for sample size determination in multivariate statistical
techniques. World Applied Sciences Journal, 27(2):285–287, 2013.

127. P. Singer, F. Flöck, C. Meinhart, E. Zeitfogel, and M. Strohmaier. Evolution
of reddit: from the front page of the internet to a self-referential community?
In Proceedings of the 23rd international conference on world wide web, pages
517–522, 2014.

128. W. Sinnott-Armstrong. Consequentialism. In E. N. Zalta, editor, The Stanford
Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University,
2021.

129. K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D.
Cubuk, A. Kurakin, and C.-L. Li. Fixmatch: Simplifying semi-supervised
learning with consistency and confidence. In NeurIPS, 2020.

130. M. Sternberg. Telling in time (ii): Chronology, teleology, narrativity. Poetics
today, 13(3):463–541, 1992.

131. S. Subramani, H. Q. Vu, and H. Wang. Intent classification using feature sets
for domestic violence discourse on social media. In 2017 4th Asia-Pacific World
Congress on Computer Science and Engineering (APWC on CSE), pages 129–
136. IEEE, 2017.

132. M. M. Tadesse, H. Lin, B. Xu, and L. Yang. Detection of depression-related
posts in reddit social media forum. IEEE Access, 7:44883–44893, 2019.

133. F. Taherkhani, A. Dabouei, S. Soleymani, J. Dawson, and N. M. Nasrabadi.
Self-supervised wasserstein pseudo-labeling for semi-supervised image classification.
In CVPR, 2021.

134. M. Thelwall and E. Stuart. She’s reddit: A source of statistically significant
gendered interest information? Information processing & management, 56(4):
1543–1558, 2019.

135. J. Trager, A. S. Ziabari, A. M. Davani, P. Golazazian, F. Karimi-Malekabadi,
A. Omrani, Z. Li, B. Kennedy, N. K. Reimer, M. Reyes, et al. The moral
foundations reddit corpus. arXiv preprint arXiv:2208.05545, 2022.

136. J. A. Tucker, A. Guess, P. Barberá, C. Vaccari, A. Siegel, S. Sanovich, D. Stukal,
and B. Nyhan. Social media, political polarization, and political
disinformation: A review of the scientific literature. Political polarization, and political
disinformation: a review of the scientific literature (March 19, 2018), 2018.

137. J. M. van Hulst, F. Hasibi, K. Dercksen, K. Balog, and A. P. de Vries. Rel: An
entity linker standing on the shoulders of giants, 2020.

138. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, 2017.

139. Z. Wang and D. Jurgens. It’s going to be okay: Measuring access to support
in online communities. In Proceedings of the 2018 Conference on Empirical
Methods in Natural Language Processing, pages 33–45, 2018.

140. J. Weedon, W. Nuland, and A. Stamos. Information operations and
facebook. Retrieved from Facebook:
https://fbnewsroomus.files.wordpress.com/2017/04/facebook-and-information-operations-v1.pdf, 2017.

141. J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama,
M. Bosma, D. Zhou, D. Metzler, et al. Emergent abilities of large language
models. arXiv preprint arXiv:2206.07682, 2022.

142. T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac,
T. Rault, R. Louf, M. Funtowicz, et al. Huggingface’s transformers:
State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

143. E. Wulczyn, N. Thain, and L. Dixon. Ex machina: Personal attacks seen at
scale. In Proceedings of the 26th international conference on world wide web,
pages 1391–1399, 2017.

144. R. Xi and M. P. Singh. The blame game: Understanding blame assignment in
social media. IEEE Transactions on Computational Social Systems, 2023.

145. R. Xi and M. P. Singh. Morality in the mundane: Categorizing moral reasoning
in real-life social situations. arXiv preprint arXiv:2302.12806, 2023.

146. I. Z. Yalniz, H. Jégou, K. Chen, M. Paluri, and D. Mahajan. Billion-scale
semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546,
2019.

147. S. Yardi and D. Boyd. Dynamic debates: An analysis of group polarization over
time on twitter. Bulletin of science, technology & society, 30(5):316–327, 2010.

148. S. Yardi and D. Boyd. Dynamic debates: An analysis of group polarization
over time on twitter. Bulletin of Science, Technology & Society, 30(5):316–327,
2010. ISSN 0270-4676. doi: 10.1177/0270467610380011. URL
https://doi.org/10.1177/0270467610380011.

149. W. Yu, C. Zhu, Z. Li, Z. Hu, Q. Wang, H. Ji, and M. Jiang. A survey of
knowledge-enhanced text generation. arXiv preprint arXiv:2010.04389, 2020.

150. X. Zhai, A. Oliver, A. Kolesnikov, and L. Beyer. S4l: Self-supervised
semi-supervised learning. In Proceedings of the IEEE/CVF International Conference
on Computer Vision, 2019.

151. L.-M. Zhan, H. Liang, B. Liu, L. Fan, X.-M. Wu, and A. Lam. Out-of-scope
intent detection with self-supervision and discriminative training. arXiv preprint
arXiv:2106.08616, 2021.

152. B. Zhang, Y. Wang, W. Hou, H. Wu, J. Wang, M. Okumura, and T. Shinozaki.
Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling.
In NeurIPS, 2021.

153. H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond
empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

154. H. Zhang, Z. Liu, C. Xiong, and Z. Liu. Grounded conversation
generation as guided traverses in commonsense knowledge graphs. arXiv preprint
arXiv:1911.02707, 2019.

155. H. Zhang, H. Xu, and T.-E. Lin. Deep open intent classification with adaptive
decision boundary. In AAAI, 2021.

156. J. Zhang, J. Chang, C. Danescu-Niculescu-Mizil, L. Dixon, Y. Hua, D. Tara-
borelli, and N. Thain. Conversations gone awry: Detecting early signs of conver-
sational failure. In Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), pages 1350–1361, 2018.

157. Y. Zhou, P. Liu, and X. Qiu. Knn-contrastive learning for out-of-domain intent
classification. In ACL, 2022.

158. X. Zhu and A. B. Goldberg. Introduction to semi-supervised learning. Synthesis
lectures on artificial intelligence and machine learning, 3(1):1–130, 2009.

159. Z. Zhu, Z. Dong, and Y. Liu. Detecting corrupted labels without training a
model to predict. In ICML, 2022.

160. J. Y. Zien, M. D. Schlag, and P. K. Chan. Multilevel spectral hypergraph
partitioning with arbitrary vertex sizes. IEEE Transactions on computer-aided
design of integrated circuits and systems, 18(9):1389–1399, 1999.

161. J. Zomick, S. I. Levitan, and M. Serper. Linguistic analysis of schizophrenia in
reddit posts. Proceedings of the Sixth Workshop on Computational Linguistics
and Clinical Psychology, 2019.

162. Y. Zou, Z. Yu, B. Kumar, and J. Wang. Unsupervised domain adaptation for
semantic segmentation via class-balanced self-training. In Proceedings of the
European conference on computer vision (ECCV), pages 289–305, 2018.

This document was prepared & typeset with pdfLATEX, and formatted with
nddiss2ε classfile (v3.2017.2[2017/05/09])
