2016 Reagan Epj
1 Introduction
The power of stories to transfer information and define our own existence has been shown
time and again [–]. We are fundamentally driven to find and tell stories, likened to Pan
Narrans or Homo Narrativus. Stories are encoded in art, language, and even in the math-
ematics of physics: We use equations to represent both simple and complicated functions
that describe our observations of the real world. In science, we formalize the ideas that
best fit our experience with principles such as Occam’s Razor: The simplest story is the
one we should trust. We tend to prefer stories that fit into the molds which are familiar,
and reject narratives that do not align with our experience [].
We seek to better understand stories that are captured and shared in written form, a
medium that since inception has radically changed how information flows []. Without
evolved cues from tone, facial expression, or body language, written stories are forced to
capture the entire transfer of experience on a page. An often integral part of a written story
is the emotional experience that is evoked in the reader. Here, we use a simple, robust sen-
timent analysis tool to extract the reader-perceived emotional content of written stories
as they unfold on the page.
We objectively test aspects of the theories of folkloristics [, ], specifically the common-
ality of core stories within societal boundaries [, ]. A major component of folkloristics
is the study of society and culture through literary analysis. This is sometimes referred
to as narratology, which at its core is ‘a series of events, real or fictional, presented to the
© Reagan et al. 2016. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, pro-
vided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and
indicate if changes were made.
Reagan et al. EPJ Data Science ( 2016) 5:31 Page 2 of 12
reader or the listener’ []. In our present treatment, we consider the plot as the ‘back-
bone’ of events that occur in a chronological sequence (more detail on previous theories
of plot are in Appendix A in Additional file ). While the plot captures the mechanics of
a narrative and the structure encodes their delivery, in the present work we examine the
emotional arc that is invoked through the words used. The emotional arc of a story does
not give us direct information about the plot or the intended meaning of the story, but
rather exists as part of the whole narrative (e.g., an emotional arc showing a fall in senti-
ment throughout a story may arise from very different plot and structure combinations).
This distinction between the emotional arc and the plot of a story is one point of misunder-
standing in other work that has drawn criticism from the digital humanities community
[]. Through the identification of motifs [], narrative theories [] allow us to analyze,
interpret, describe, and compare stories across cultures and regions of the world []. We
show that automated extraction of emotional arcs is not only possible, but can also test previous
theories and provide new insights with the potential to quantify unobserved trends as
the field transitions from data-scarce to data-rich [, ].
The rejected master’s thesis of Kurt Vonnegut - which he personally considered his
greatest contribution - defines the emotional arc of a story on the ‘Beginning-End’ and ‘Ill
Fortune-Great Fortune’ axes []. Vonnegut finds a remarkable similarity between Cin-
derella and the origin story of Christianity in the Old Testament (see Figure S in Ap-
pendix B in Additional file ), leading us to search for all such groupings. In a recorded
lecture available on YouTube [], Vonnegut asserted:
‘There is no reason why the simple shapes of stories can’t be fed into computers, they
are beautiful shapes.’
For our analysis, we apply three independent tools: matrix decomposition by singular
value decomposition (SVD), agglomerative (hierarchical) clustering
with Ward’s method, and unsupervised learning by a self-organizing map (SOM, a
type of neural network). Each tool has different strengths: the SVD finds the
underlying basis of all of the emotional arcs, the clustering classifies the emotional arcs
into distinct groups, and the SOM generates arcs from noise which are similar to those in
our corpus using a stochastic process. It is only by considering the results of each tool in
support of each other that we are able to confirm our findings.
We proceed as follows. We first introduce our methods in Section 2, we then discuss
the combined results of each method in Section 3, and we present our conclusions in
Section 4. A graphical outline of the methodology and results can be found as Figure S
in Appendix B in Additional file .
2 Methods
2.1 Emotional arc construction
To generate emotional arcs, we analyze the sentiment of , word windows, which we
slide through the text (see Figure ). We rate the emotional content of each window us-
ing our Hedonometer with the labMT dataset, chosen for lexical coverage and its ability to
generate meaningful word shift graphs, specifically using , words as a minimum nec-
essary to generate meaningful sentiment scores [, ]. We emphasize that dictionary-
based methods for sentiment analysis usually perform worse than random on individual
sentences [, ], and although this issue can be resolved by using a rolling average of
Figure 2 Annotated emotional arc of Harry Potter and the Deathly Hallows, by JK Rowling, inspired by
the illustration made by Medaris for The Why Files [23]. The entire seven book series can be classified as a
‘Kill the monster’ plot [24], while the many sub plots and connections between them complicate the
emotional arc of each individual book: this plot could not be readily inferred from the emotional arc alone.
The emotional arc shown here captures the major highs and lows of the story, and should be familiar to any
reader well acquainted with Harry Potter. Our method does not pick up emotional moments discussed briefly,
perhaps in one paragraph or sentence (e.g., the first kiss of Harry and Ginny). We provide interactive
visualizations of all Project Gutenberg books at http://hedonometer.org/books/v3/1/ and a selection of classic
and popular books at http://hedonometer.org/books/v1/.
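As an illustration only, the windowed scoring behind an emotional arc such as the one in Figure 2 can be sketched as follows. The lexicon, the window size, the number of sampled points, and the function names are hypothetical stand-ins chosen for the example; they are not the labMT word list or the Hedonometer implementation.

```python
from typing import Dict, List, Sequence


def window_score(words: Sequence[str], lexicon: Dict[str, float]) -> float:
    """Average lexicon score over the words in one window that have a score."""
    rated = [lexicon[w] for w in words if w in lexicon]
    if not rated:
        return float("nan")  # no rated words in this window
    return sum(rated) / len(rated)


def emotional_arc(
    words: Sequence[str],
    lexicon: Dict[str, float],
    window: int,
    n_points: int,
) -> List[float]:
    """Slide a fixed-size window through the text and score each position.

    Window start positions are spread evenly from the first word to the last
    position at which a full window still fits.
    """
    if window > len(words):
        raise ValueError("window is longer than the text")
    if n_points < 2:
        raise ValueError("need at least two points to form an arc")
    last_start = len(words) - window
    arc = []
    for p in range(n_points):
        start = round(p * last_start / (n_points - 1))
        arc.append(window_score(words[start:start + window], lexicon))
    return arc


# Toy data (hypothetical scores, NOT labMT values): a text that starts
# happy and ends sad, so the arc should fall.
toy_lexicon = {"happy": 8.0, "love": 8.5, "sad": 2.0, "death": 1.5}
toy_text = ("happy love " * 50 + "sad death " * 50).split()
arc = emotional_arc(toy_text, toy_lexicon, window=20, n_points=5)
print(arc)
```

Averaging over a wide window is what makes a dictionary-based score stable; the trade-off, as noted in the caption above, is that brief emotional moments are smoothed away.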
set of books that represent English works of fiction. We start by selecting for only En-
glish books, with total words between , and ,, with more than downloads
from the Project Gutenberg website, and with Library of Congress Class corresponding
to English fiction.a To ensure that the -download limit is not influencing the results
here, we further test each method for , , , and download thresholds, in each
case confirming the download findings to be qualitatively unchanged. Next, we re-
move books with any word in the title from a list of keywords (e.g., ‘poems’ and ‘collection’,
full list in Appendix C in Additional file ). From within this set of books, we remove
the front and back matter of each book using regular expression pattern matches that
match on .% of the books included. Two slices of the data for download count and
the total word count are shown in Appendix C, Figure S in Additional file . We provide
a list of the book IDs which are included for download in the Online Appendices
at http://compstorylab.org/share/papers/reaganb/; the books are listed in Table S
in Appendix D in Additional file , and we attempt to provide the Project Gutenberg
ID when we mention a book by title herein. Given the Project Gutenberg ID n, the raw
ebook is available online from Project Gutenberg at http://www.gutenberg.org/ebooks/n/,
e.g., Alice’s Adventures in Wonderland by Lewis Carroll, has ID and is available at
http://www.gutenberg.org/ebooks//. We also provide an online, interactive version of
the emotional arc for each book indexed by the ID, e.g., Alice’s Adventures in Wonderland
is available at http://hedonometer.org/books/v//.
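The selection criteria described above can be sketched as a simple metadata filter. This is a minimal illustration under stated assumptions: the `Book` record, its field names, and the threshold and keyword values passed in the example are hypothetical, and only the Library of Congress class labels (from the endnote) and the Project Gutenberg URL pattern are taken from the text.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Set

# Library of Congress classes that the endnote lists for English fiction.
FICTION_CLASSES = {"PN", "PR", "PS", "PZ"}


@dataclass
class Book:
    """Hypothetical metadata record; field names are assumptions."""
    gutenberg_id: int
    title: str
    language: str
    loc_class: str
    n_words: int
    downloads: int


def select_books(
    books: Iterable[Book],
    min_words: int,
    max_words: int,
    min_downloads: int,
    title_keywords: Set[str],
) -> List[Book]:
    """Keep English fiction within the word-count range, above the download
    threshold, and without any excluded keyword in the title."""
    kept = []
    for book in books:
        title_words = set(re.findall(r"[a-z]+", book.title.lower()))
        if book.language != "en":
            continue
        if book.loc_class not in FICTION_CLASSES:
            continue
        if not (min_words <= book.n_words <= max_words):
            continue
        if book.downloads <= min_downloads:
            continue
        if title_words & title_keywords:
            continue
        kept.append(book)
    return kept


def gutenberg_url(gutenberg_id: int) -> str:
    """URL of the raw ebook for a given Project Gutenberg ID."""
    return f"http://www.gutenberg.org/ebooks/{gutenberg_id}/"


# Toy catalog with arbitrary thresholds (illustrative values only).
catalog = [
    Book(1, "A Toy Novel", "en", "PR", 50_000, 500),
    Book(2, "Collected Poems", "en", "PR", 50_000, 500),   # excluded by title
    Book(3, "Un Roman", "fr", "PQ", 50_000, 500),          # not English
    Book(4, "A Short Tale", "en", "PS", 2_000, 500),       # too short
    Book(5, "An Obscure Novel", "en", "PZ", 50_000, 3),    # too few downloads
]
selected = select_books(
    catalog,
    min_words=5_000,
    max_words=100_000,
    min_downloads=10,
    title_keywords={"poems", "collection"},
)
print([b.gutenberg_id for b in selected], gutenberg_url(selected[0].gutenberg_id))
```

Keeping the thresholds as parameters makes it straightforward to rerun the filter at several download thresholds, as in the robustness check described above.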
$$A = U \Sigma V^{T} = W V^{T},$$
where $U$ contains the projection of each sentiment time series onto each of the right singular
vectors (rows of $V^{T}$, eigenvectors of $A^{T}A$), which have singular values given along
the diagonal of $\Sigma$, with $W = U\Sigma$. Different intuitive interpretations of the matrices $U$, $\Sigma$,
and $V^{T}$ are useful in the various domains in which the SVD is applied; here, we focus on
right singular vectors as an orthonormal basis for the sentiment time series in the rows of
$A$, which we will refer to as the modes. We combine $\Sigma$ and $U$ into the single coefficient
matrix $W$ for clarity and convenience, such that $W$ now represents the mode coefficients.
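A short numerical sketch of this decomposition, using NumPy on a small random matrix in place of real emotional arcs, may help fix the notation. The matrix sizes and the variable names are assumptions made for the example; any preprocessing of the arcs is outside its scope.

```python
import numpy as np

# Toy stand-in for the matrix A: 6 "books" (rows), 40 time points (columns).
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 40))

# Thin SVD: U is 6x6, s holds the singular values, Vt is 6x40.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Mode coefficients W = U Sigma (scaling column j of U by s[j]).
W = U * s

# The rows of Vt (the modes) form an orthonormal basis, and W Vt recovers A.
print(np.allclose(A, W @ Vt))
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))

# Fraction of the total squared magnitude of A carried by each mode.
variance_fraction = s**2 / np.sum(s**2)
print(variance_fraction)
```

Row `i` of `W` gives the coefficients of book `i` on each mode, so a book's arc is the sum of the modes weighted by that row.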
$$D(b_i, b_j) = l^{-1} \sum_{t=1}^{l} \left| b_i(t) - b_j(t) \right|$$
$$\mathrm{Nbd}_k(i) = \left\{ j \in \mathcal{N} \,\middle|\, D(k, j) < \sqrt{N} \cdot (i + )^{\alpha} \right\}$$
for a node $k$ in the set of nodes $\mathcal{N}$, with distance function $D$ given above and total number
of nodes $N$. For results shown here we take $\alpha$ = –.. We implement the learning adaptation
function at training iteration $i$ as $f(i) = (i + )^{\beta}$, again with $\beta$ = –., a standard value
for the training hyper-parameters.
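The distance, neighborhood, and learning-rate functions above can be sketched in a few lines. Because the additive constant inside the parentheses and the values of the exponents are not recoverable from the text here, they appear as the explicit parameters `offset`, `alpha`, and `beta`; the values used in the example are arbitrary and purely illustrative, as are the function names.

```python
import math
from typing import List, Sequence


def arc_distance(b_i: Sequence[float], b_j: Sequence[float]) -> float:
    """Mean absolute difference between two arcs of equal length l."""
    if len(b_i) != len(b_j):
        raise ValueError("arcs must have the same length")
    return sum(abs(x - y) for x, y in zip(b_i, b_j)) / len(b_i)


def neighborhood(
    k: int,
    nodes: Sequence[Sequence[float]],
    i: int,
    alpha: float,
    offset: float,
) -> List[int]:
    """Indices j of nodes with D(k, j) below sqrt(N) * (i + offset)**alpha.

    `offset` and `alpha` are assumptions supplied by the caller.
    """
    radius = math.sqrt(len(nodes)) * (i + offset) ** alpha
    return [
        j for j in range(len(nodes))
        if arc_distance(nodes[k], nodes[j]) < radius
    ]


def learning_rate(i: int, beta: float, offset: float) -> float:
    """Learning adaptation f(i) = (i + offset)**beta at training iteration i."""
    return (i + offset) ** beta


# Toy node vectors and arbitrary hyper-parameters (illustrative only).
nodes = [[0.0, 0.0], [0.5, 0.5], [3.0, 3.0], [10.0, 10.0]]
for step in (0, 3, 99):
    print(step, neighborhood(0, nodes, step, alpha=-0.5, offset=1.0),
          learning_rate(step, beta=-0.5, offset=1.0))
```

With a negative exponent, both the neighborhood radius and the learning rate shrink as training proceeds, so early iterations move many nodes at once and later iterations refine individual nodes.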
3 Results
We obtain a collection of 1,327 books that are mostly, but not all, fictional stories by using
metadata from Project Gutenberg to construct a rough filter. We find broad support for
the following six emotional arcs:
• ‘Rags to riches’ (rise).
• ‘Tragedy’, or ‘Riches to rags’ (fall).
• ‘Man in a hole’ (fall-rise).
• ‘Icarus’ (rise-fall).
• ‘Cinderella’ (rise-fall-rise).
• ‘Oedipus’ (fall-rise-fall).
Importantly, we obtain these same six emotional arcs from all possible arcs by observing
them as the result of three methods: As modes from a matrix decomposition by SVD, as
clusters in a hierarchical clustering using Ward’s algorithm, and as clusters using unsuper-
vised machine learning. We examine each of the results in this section.
Figure 3 Top 12 modes from the singular value decomposition of 1,327 Project Gutenberg books. We
show in a lighter color modes weighted by their corresponding singular value, where we have scaled the
matrix $\Sigma$ such that the first entry is 1 for comparison (for reference, the largest singular value is 34.5). The
mode coefficients normalized for each book are shown in the right panel accompanying each mode, in the
range –1 to 1, with the ‘Tukey’ box plot.
We emphasize that by definition of the SVD, the mode coefficients in W can be either
positive or negative, such that the modes themselves explain variance with both the positive
and negative version. In the right panels of each mode in Figure we project the 1,327
stories onto each of the first six modes and show the resulting coefficients. While none are far
from (as would be expected), mode has a mean slightly above and both modes and
have means slightly below . To sort the books by their coefficient for each mode, we
normalize the coefficients within each book in the rows of W to sum to , accounting for
books with higher total energy, and these are the coefficients shown in the right panels of
each mode in Figure . In Appendix E in Additional file , we provide supporting, intu-
itive details of the SVD method, as well as example emotional arc reconstruction using the
modes (see Figures S-S in Additional file ). As expected, fewer than modes are enough
to reconstruct the emotional arc to a degree of accuracy visible to the eye.
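The per-book normalization and the reconstruction from a limited number of modes can be illustrated with a small sketch. Two points are assumptions rather than details taken from the text: the coefficients are normalized here so that their absolute values sum to one within each book (which keeps every coefficient in the range –1 to 1), and a random matrix stands in for real emotional arcs.

```python
import numpy as np

# Toy stand-in for the arc matrix: 8 "books", 30 time points.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 30))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
W = U * s  # mode coefficients, one row per book

# ASSUMPTION: normalize each row by the sum of its absolute values, so that
# books with larger overall coefficients become comparable.
W_norm = W / np.abs(W).sum(axis=1, keepdims=True)


def reconstruct(W: np.ndarray, Vt: np.ndarray, k: int) -> np.ndarray:
    """Rebuild every arc from its first k modes only."""
    return W[:, :k] @ Vt[:k]


# Reconstruction error shrinks as modes are added and vanishes once all
# modes are used.
errors = [np.linalg.norm(A - reconstruct(W, Vt, k)) for k in range(1, len(s) + 1)]
print(np.round(errors, 3))
```

On real arcs, which are far smoother than random noise, the error would be expected to drop much faster with the first few modes than it does in this toy case.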
We show labeled examples of the emotional arcs closest to the top modes in Figure
and Figure S in Additional file . We present both the positive and negative modes, and
the stories closest to each by sorting on the coefficient for that mode. For the positive sto-
ries, we sort in ascending order, and vice versa. Mode , which encompasses both the ‘Rags
to riches’ and ‘Tragedy’ emotional arcs, captures % of the variance of the entire space.
We examine the closest stories to both sides of modes -, and direct the reader to Fig-
ure S in Additional file for more details on the higher order modes. The two stories that
have the most support from the ‘Rags to riches’ mode are The Winter’s Tale (,) and Os-
car Wilde, Art and Morality: A Defence of ‘The Picture of Dorian Gray’ (,). Among
Figure 4 First 3 SVD modes and their negation with the closest stories to each. To locate the emotional
arcs on the same scale as the modes, we show the modes directly from the rows of $V^{T}$ and weight the
emotional arcs by the inverse of their coefficient in W for the particular mode. The closest stories shown for
each mode are those stories with emotional arcs which have the greatest coefficient in W. In parentheses for
each story is the Project Gutenberg ID and the number of downloads from the Project Gutenberg website,
respectively. Links below each story point to an interactive visualization on http://hedonometer.org which
enables detailed exploration of the emotional arc for the story.
the most categorical tragedies we find Lady Susan () and Warlord of Kor (,).
Number in the sorted list of tragedies is perhaps the most famous tragedy: Romeo and
Juliet by William Shakespeare. Mode is the ‘Man in a hole’ emotional arc, and we find
the stories which most closely follow this path to be The Magic of Oz () and Children of
the Frost (,). The negation of mode most closely resembles the emotional arc of the
‘Icarus’ narrative. For this emotional arc, the most characteristic stories are Shadowings
(,) and Battle-Pieces and Aspects of the War (,). Mode is the ‘Cinderella’ emotional
arc, and includes Mystery of the Hasty Arrow (,) and Through the Magic Door
(,). The negation of Mode , which we refer to as ‘Oedipus’, is found most characteris-
tically in This World is Taboo (,), Old Indian Days (), and The Evil Guest (,).
We also note that the spread of the stories from their core mode increases strongly for the
higher modes.
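Finding the stories closest to a mode, or to its negation, amounts to sorting the books on one column of the coefficient matrix. The sketch below uses a hand-made coefficient matrix and a hypothetical helper name; it illustrates the sorting step only and is not the code used to produce the figures.

```python
import numpy as np
from typing import List


def closest_stories(W: np.ndarray, mode: int, top: int, negate: bool = False) -> List[int]:
    """Row indices of the books with the largest coefficient on `mode`.

    With negate=True the sign is flipped first, which selects the books
    closest to the negated mode instead.
    """
    coeff = -W[:, mode] if negate else W[:, mode]
    order = np.argsort(coeff)[::-1]  # largest coefficient first
    return order[:top].tolist()


# Hand-made coefficients: 4 books (rows) on 2 modes (columns).
W = np.array([
    [0.9, 0.1],
    [-0.8, 0.2],
    [0.3, -0.7],
    [-0.1, 0.6],
])
print(closest_stories(W, mode=0, top=2))               # closest to mode 0
print(closest_stories(W, mode=0, top=2, negate=True))  # closest to its negation
```

In this toy example the first column plays the role of a rise/fall mode: books 0 and 2 sit on its positive side, and books 1 and 3 on its negative side.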
Figure 5 Dendrogram from the hierarchical clustering procedure using Ward’s minimum variance
method. For each cluster, a selection of the 20 most central books in a fully connected network of books is
shown along with the average emotional arc for all books in the cluster, the cluster ID, and the
number of books in each cluster (shown in parentheses). The cluster ID is given by numbering the clusters in
order of linkage starting at 0, with each individual book representing a cluster of size 1 such that the final
cluster (all books) has the ID 2(N – 1) for the N = 1,327 books. At the bottom, we show the average Silhouette
value for all books, with higher value representing a more appropriate number of clusters. For each of the 60
leaf nodes (right side) we show the number of books within the cluster and the most central book to that
cluster’s book network.
distance to other books in the cluster (e.g., considering each intra-cluster collection as a
fully connected weighted network, we take the most central node), and in parentheses the
number of books in that cluster. In other words, we label each cluster by considering the
network centrality of the fully connected cluster with edges weighted by the distance be-
tween stories. By cutting the dendrogram in Figure at various linkage costs we are able
to extract clusters of the desired granularity. For the cuts labeled C, C, and C, we show
these clusters in Figures S, S, and S in Additional file . We find the first four of our
final six arcs appearing among the eight most different clusters (Figure S in Additional
file ).
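The clustering, cutting, and labeling steps can be sketched with SciPy's hierarchical clustering routines on synthetic arcs. Several choices here are assumptions made for illustration: the synthetic rising and falling arcs, the use of `fcluster` with a fixed number of clusters to cut the tree, and the definition of the most central book as the one with the smallest total distance to the other books in its cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic arcs: five noisy rising arcs followed by five noisy falling arcs.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 50)
rising = [(t - 0.5) + 0.05 * rng.normal(size=t.size) for _ in range(5)]
falling = [(0.5 - t) + 0.05 * rng.normal(size=t.size) for _ in range(5)]
arcs = np.vstack(rising + falling)
N = arcs.shape[0]

# Ward's minimum variance linkage. Each of the N - 1 rows of Z records one
# merge; the cluster created by row r receives the ID N + r, so the final
# cluster (all books) has the ID 2(N - 1).
Z = linkage(arcs, method="ward")
final_cluster_id = N + Z.shape[0] - 1

# Cut the dendrogram into two flat clusters (labels start at 1).
labels = fcluster(Z, t=2, criterion="maxclust")


def most_central(arcs: np.ndarray, members: np.ndarray) -> int:
    """Member with the smallest summed mean-absolute distance to the others.

    ASSUMPTION: this is one simple notion of centrality in a fully connected
    network whose edges are weighted by the distance between stories.
    """
    sub = arcs[members]
    dist = np.abs(sub[:, None, :] - sub[None, :, :]).mean(axis=2)
    return int(members[np.argmin(dist.sum(axis=1))])


centers = {
    int(c): most_central(arcs, np.flatnonzero(labels == c))
    for c in np.unique(labels)
}
print(final_cluster_id, labels, centers)
```

Cutting the same linkage matrix at a different number of clusters, or at a chosen linkage cost, yields coarser or finer groupings without recomputing the tree.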
The clustering method groups stories with a ‘Man in a hole’ emotional arc for a range
of different variances, separate from the other arcs; in total, these clusters (panels A, E,
and I of Figure S in Additional file ) account for % of the Gutenberg corpus. The
remainder of the stories have emotional arcs that are clustered among the ‘Tragedy’ arc
(%), ‘Rags to riches’ arc (%), and the ‘Oedipus’ arc (%). A more detailed analysis of the
results from hierarchical clustering can be found in Appendix F in Additional file , and
this result generally agrees with other attempts that use only hierarchical clustering [].
Figure 6 Results of the SOM applied to Project Gutenberg books. Left panel: Nodes on the 2D SOM grid
are shaded by the number of stories for which they are the winner. Right panel: The B-matrix shows that there
are clear clusters of stories in the 2D space imposed by the SOM network.
Figure 7 Download statistics for stories whose SVD Modes comprise more than 2.5% of books, for N
the total number of books and Nm the number corresponding to the particular mode. Modes SV 3
through -SV 4 (both polarities of modes 3 and 4) exhibit a higher average number of downloads and more
variance than the others. Mode arcs are rows of $V^{T}$ and the download distribution is shown in log10 space from
20 to 30,000 downloads.
the English fiction Gutenberg Corpus with the null versions of each book and verify that
the emotional arcs of real stories are not simply an artifact. The singular value spectrum
from the SVD is flatter, with higher-frequency modes appearing more quickly, and in total
representing % of the total variance present in real stories (see Figures S and S in
Additional file ). Hierarchical clustering generates less distinct clusters with considerably
lower linkage cost (final linkage cost , vs ,) for the emotional arcs from nonsense
books, and the winning node vectors on a self-organizing map lack coherent structure (see
Figures S and S in Appendix H in Additional file ).
4 Conclusion
Using three distinct methods, we have demonstrated that there is strong support for six
core emotional arcs. Our methodology brings to bear a cross section of data science tools
with a knowledge of the potential issues that each presents. We have also shown that consideration
of the emotional arc for a given story is important for the success of that story.
Of course, downloads are only a rough proxy for success, and this work may provide an
outline for more detailed analysis of the factors that impact meaningful measures of suc-
cess, i.e., sales or cultural influence.
Our approach could be applied in the opposite direction: namely by beginning with the
emotional arc and aiding in the generation of compelling stories []. Understanding the
emotional arcs of stories may be useful to aid in constructing arguments [] and teaching
common sense to artificial intelligence systems [].
Extensions of our analysis that use a more curated selection of full-text fiction can an-
swer more detailed questions about which stories are the most popular throughout time,
and across regions []. Automatic extraction of character networks would allow a more
detailed analysis of plot structure for the Project Gutenberg corpus used here [, , ].
Bridging the gap between the full text stories [] and systems that analyze plot sequences
will allow such systems to undertake studies of this scale []. Place could also be used to
consider separate character networks through time, and to help build an analog to Randall
Munroe’s Movie narrative charts [].
We are producing data at an ever increasing rate, including rich sources of stories writ-
ten to entertain and share knowledge, from books to television series to news. Of profound
scientific interest will be the degree to which we can eventually understand the full landscape
of human stories, and data-driven approaches will play a crucial role.
Additional material
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors contributed equally to the writing of this paper. All authors read and approved the final manuscript.
Author details
1 University of Vermont, 85 South Prospect St, Burlington, VT 05405, USA. 2 University of Adelaide, Adelaide, SA 5005,
Australia.
Acknowledgements
PSD and CMD acknowledge support from NSF Big Data Grant #1447634.
Endnote
a The specific classes have labels PN, PR, PS, and PZ.
References
1. Pratchett T, Stewart I, Cohen J (2003) The science of Discworld II: the globe. Ebury Press, London
2. Campbell J (2008) The hero with a thousand faces, 3rd edn. New World Library, Novato
3. Gottschall J (2013) The storytelling animal: how stories make us human. Mariner Books, New York
4. Cave S (2013) The 4 stories we tell ourselves about death.
http://www.ted.com/talks/stephen_cave_the_4_stories_we_tell_ourselves_about_death
5. Dodds PS (2013) Homo narrativus and the trouble with fame. Nautilus magazine.
http://nautil.us/issue/5/fame/homo-narrativus-and-the-trouble-with-fame
6. Nickerson RS (1998) Confirmation bias: a ubiquitous phenomenon in many guises. Rev Gen Psychol 2:175-220
7. Gleick J (2011) The information: a history, a theory, a flood. Pantheon, New York
8. Propp V (1968) Morphology of the folktale (1928). University of Texas Press, Austin
9. MacDonald MR (1982) Storytellers sourcebook: a subject, title, and motif index to folklore collections for children.
Gale Group, Farmington Hills
10. da Silva SG, Tehrani JJ (2016) Comparative phylogenetic analyses uncover the ancient roots of Indo-European
folktales. R Soc Open Sci 3(1):150645. doi:10.1098/rsos.150645.
http://rsos.royalsocietypublishing.org/content/3/1/150645.full.pdf
11. Min S, Park J (2016) Narrative as a complex network: a study of Victor Hugo’s Les Misérables. In: Proceedings of HCI
Korea
12. Jockers M (2014) A novel method for detecting plot.
http://www.matthewjockers.net/2014/06/05/a-novel-method-for-detecting-plot/
13. Dundes A (1997) The motif-index and the tale type index: a critique. J Folklore Res 34:195-202
14. Dolby SK (2008) Literary folkloristics and the personal narrative. Trickster Press, Bloomington
15. Uther H-J (2011) The types of international folktales. A classification and bibliography. Based on the system of Antti
Aarne and Stith Thompson. Part I. Animal tales, tales of magic, religious tales, and realistic tales, with an introduction.
FF communications, vol 284. Finnish Academy of Science and Letters, Helsinki
16. Kirschenbaum MG (2007) The remaking of reading: data mining and the digital humanities. In: The national science
foundation symposium on next generation of data mining and cyber-enabled discovery for innovation, Maryland