
Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media

Twitter Sentiment Analysis: The Good the Bad and the OMG!

Efthymios Kouloumpis, i-sieve Technologies, Athens, Greece (epistimos@i-sieve.com)
Theresa Wilson, HLT Center of Excellence, Johns Hopkins University, Baltimore, MD, USA (taw@jhu.edu)
Johanna Moore, School of Informatics, University of Edinburgh, Edinburgh, UK (j.moore@ed.ac.uk)

(Work performed while at the University of Edinburgh.)
Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In this paper, we investigate the utility of linguistic features for detecting the sentiment of Twitter messages. We evaluate the usefulness of existing lexical resources as well as features that capture information about the informal and creative language used in microblogging. We take a supervised approach to the problem, but leverage existing hashtags in the Twitter data for building training data.

Introduction

In the past few years, there has been a huge growth in the use of microblogging platforms such as Twitter. Spurred by that growth, companies and media organizations are increasingly seeking ways to mine Twitter for information about what people think and feel about their products and services. Companies such as Twitratr (twitrratr.com), tweetfeel (www.tweetfeel.com), and Social Mention (www.socialmention.com) are just a few who advertise Twitter sentiment analysis as one of their services.

While there has been a fair amount of research on how sentiments are expressed in genres such as online reviews and news articles, how sentiments are expressed given the informal language and message-length constraints of microblogging has been much less studied. Features such as automatic part-of-speech tags and resources such as sentiment lexicons have proved useful for sentiment analysis in other domains, but will they also prove useful for sentiment analysis in Twitter? In this paper, we begin to investigate this question.

Another challenge of microblogging is the incredible breadth of topic that is covered. It is not an exaggeration to say that people tweet about anything and everything. Therefore, to be able to build systems to mine Twitter sentiment about any given topic, we need a method for quickly identifying data that can be used for training. In this paper, we explore one method for building such data: using Twitter hashtags (e.g., #bestfeeling, #epicfail, #news) to identify positive, negative, and neutral tweets to use for training three-way sentiment classifiers.

Related Work

Sentiment analysis is a growing area of Natural Language Processing, with research ranging from document-level classification (Pang and Lee 2008) to learning the polarity of words and phrases (e.g., Hatzivassiloglou and McKeown 1997; Esuli and Sebastiani 2006). Given the character limitations on tweets, classifying the sentiment of Twitter messages is most similar to sentence-level sentiment analysis (e.g., Yu and Hatzivassiloglou 2003; Kim and Hovy 2004); however, the informal and specialized language used in tweets, as well as the very nature of the microblogging domain, makes Twitter sentiment analysis a very different task. It is an open question how well the features and techniques used on more well-formed data will transfer to the microblogging domain.

Just in the past year there have been a number of papers looking at Twitter sentiment and buzz (Jansen et al. 2009; Pak and Paroubek 2010; O'Connor et al. 2010; Tumasjan et al. 2010; Bifet and Frank 2010; Barbosa and Feng 2010; Davidov, Tsur, and Rappoport 2010). Other researchers have begun to explore the use of part-of-speech features, but results remain mixed. Features common to microblogging (e.g., emoticons) are also common, but there has been little investigation into the usefulness of existing sentiment resources developed on non-microblogging data.

Researchers have also begun to investigate various ways of automatically collecting training data. Several researchers rely on emoticons for defining their training data (Pak and Paroubek 2010; Bifet and Frank 2010). Barbosa and Feng (2010) exploit existing Twitter sentiment sites for collecting training data. Davidov, Tsur, and Rappoport (2010) also use hashtags for creating training data, but they limit their experiments to sentiment/non-sentiment classification, rather than 3-way polarity classification, as we do.

Data

We use three different corpora of Twitter messages in our experiments. For development and training, we use the hashtagged data set (HASH), which we compile from the Edinburgh Twitter corpus (http://demeter.inf.ed.ac.uk), and the emoticon data set (EMOT) from http://twittersentiment.appspot.com. For evaluation, we use a manually annotated data set produced by the iSieve Corporation (www.i-sieve.com) (ISIEVE). The number of Twitter messages and the distribution across classes is given in Table 1.
         Positive         Negative         Neutral          Total
HASH     31,861 (14%)     64,850 (29%)     125,859 (57%)    222,570
EMOT     230,811 (61%)    150,570 (39%)    --               381,381
ISIEVE   1,520 (38%)      200 (5%)         2,295 (57%)      4,015

Table 1: Corpus statistics

Hashtagged data set

The hashtagged data set is a subset of the Edinburgh Twitter corpus. The Edinburgh corpus contains 97 million tweets collected over a period of two months. To create the hashtagged data set, we first filter out duplicate tweets, non-English tweets, and tweets that do not contain hashtags. From the remaining set (about 4 million), we investigate the distribution of hashtags and identify what we hope will be sets of frequent hashtags that are indicative of positive, negative, and neutral messages. These hashtags are used to select the tweets that will be used for development and training.

Table 2 lists the 15 most-used hashtags in the Edinburgh corpus. In addition to the very common hashtags that are part of the Twitter folksonomy (e.g., #followfriday, #musicmonday), we find hashtags that would seem to indicate message polarity: #fail, #omgthatssotrue, #iloveitwhen, etc.

Hashtag          Frequency    Synonyms
#followfriday    226,530      #ff
#nowplaying      209,970
#job             136,734      #tweetajob
#fb              106,814      #facebook
#musicmonday     78,585       #mm
#tinychat        56,376
#tcot            42,110
#quote           33,554
#letsbehonest    32,732       #tobehonest
#omgfacts        30,042
#fail            23,007       #epicfail
#factsaboutme    19,167
#news            17,190
#random          17,180
#shoutout        16,446

Table 2: Most frequent hashtags in the Edinburgh corpus

To select the final set of messages to be included in the HASH dataset, we identify all hashtags that appear at least 1,000 times in the Edinburgh corpus. From these, we selected the top hashtags that we felt would be most useful for identifying positive, negative and neutral tweets. These hashtags are given in Table 3. Messages with these hashtags were included in the final dataset, and the polarity of each message is determined by its hashtag.

Positive    #iloveitwhen, #thingsilike, #bestfeeling, #bestfeelingever, #omgthatssotrue, #imthankfulfor, #thingsilove, #success
Negative    #fail, #epicfail, #nevertrust, #worst, #worse, #worstlies, #imtiredof, #itsnotokay, #worstfeeling, #notcute, #somethingaintright, #somethingsnotright, #ihate
Neutral     #job, #tweetajob, #omgfacts, #news, #listeningto, #lastfm, #hiring, #cnn

Table 3: Top positive, negative and neutral hashtags used to create the HASH data set
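A minimal sketch of this hashtag-based labeling is given below. The hashtag lists mirror Table 3; the function name and the decision to discard tweets that match seed hashtags from more than one class are illustrative assumptions, not details taken from the paper.

```python
# Sketch of hashtag-based labeling. Seed hashtags mirror Table 3; skipping
# tweets that match more than one class is an assumption, not the paper's rule.
import re

HASHTAG_LABELS = {
    "positive": {"#iloveitwhen", "#thingsilike", "#bestfeeling", "#bestfeelingever",
                 "#omgthatssotrue", "#imthankfulfor", "#thingsilove", "#success"},
    "negative": {"#fail", "#epicfail", "#nevertrust", "#worst", "#worse", "#worstlies",
                 "#imtiredof", "#itsnotokay", "#worstfeeling", "#notcute",
                 "#somethingaintright", "#somethingsnotright", "#ihate"},
    "neutral":  {"#job", "#tweetajob", "#omgfacts", "#news", "#listeningto",
                 "#lastfm", "#hiring", "#cnn"},
}

def label_by_hashtag(tweet: str) -> str | None:
    """Return a polarity label if the tweet contains seed hashtags of exactly one class."""
    tags = {t.lower() for t in re.findall(r"#\w+", tweet)}
    matched = {label for label, seeds in HASHTAG_LABELS.items() if tags & seeds}
    return matched.pop() if len(matched) == 1 else None  # ambiguous or unlabeled -> skip

print(label_by_hashtag("Finally passed my driving test #bestfeeling"))  # -> 'positive'
```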
Emoticon data set

The Emoticon data set was created by Go, Bhayani, and Huang for a project at Stanford University by collecting tweets with positive :) and negative :( emoticons. Messages containing both positive and negative emoticons were omitted. They also hand-tagged a number of tweets to use for evaluation, but for our experiments, we only use their training data. This set contains 381,381 tweets, 230,811 positive and 150,570 negative. Interestingly, the majority of these messages do not contain any hashtags.
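The selection rule described above (keep a tweet only if it contains positive emoticons or negative emoticons, but not both) can be sketched as follows; the exact emoticon inventory used by Go, Bhayani, and Huang is an assumption here.

```python
# Sketch of the emoticon selection rule: a tweet is labeled only if it contains
# one kind of emoticon. The emoticon sets are assumed for illustration.
POS_EMOTICONS = {":)", ":-)", ":D", "=)"}
NEG_EMOTICONS = {":(", ":-(", "=("}

def emoticon_label(tweet: str) -> str | None:
    has_pos = any(e in tweet for e in POS_EMOTICONS)
    has_neg = any(e in tweet for e in NEG_EMOTICONS)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # omitted: both kinds present, or no emoticon at all
```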
iSieve data set

The iSieve data contains approximately 4,000 tweets. It was collected and hand-annotated by the iSieve Corporation. The data in this collection was selected to be on certain topics, and the label of each tweet reflects its sentiment (positive, negative, or neutral) towards the tweet's topic. We use this data set exclusively for evaluation.
Preprocessing

Data preprocessing consists of three steps: 1) tokenization, 2) normalization, and 3) part-of-speech (POS) tagging. Emoticons and abbreviations (e.g., OMG, WTF, BRB) are identified as part of the tokenization process and treated as individual tokens. For the normalization process, the presence of abbreviations within a tweet is noted and then abbreviations are replaced by their actual meaning (e.g., BRB -> be right back). We also identify informal intensifiers such as all-caps (e.g., "I LOVE this show!!!") and character repetitions (e.g., "I've got a mortgage!! happyyyyyy"), and note their presence in the tweet. All-caps words are made into lower case, and instances of repeated characters are replaced by a single character. Finally, the presence of any special Twitter tokens is noted (e.g., #hashtags, usertags, and URLs) and placeholders indicating the token type are substituted. Our hope is that this normalization improves the performance of the POS tagger, which is the last preprocessing step.
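The sketch below illustrates what this normalization might look like. The abbreviation dictionary, placeholder strings, and regular expressions are assumptions made for the example; they are not the authors' implementation.

```python
# Illustrative sketch of the normalization step described above. The abbreviation
# dictionary, placeholders, and regexes are assumptions, not the paper's resources.
import re

ABBREVIATIONS = {"brb": "be right back", "omg": "oh my god", "wtf": "what the f***"}

def normalize(tweet: str) -> tuple[str, dict]:
    feats = {
        "has_abbrev": False,
        "has_allcaps": bool(re.search(r"\b[A-Z]{2,}\b", tweet)),
        "has_char_repeat": bool(re.search(r"(\w)\1{2,}", tweet)),
    }
    # Substitute placeholders for special Twitter tokens.
    tweet = re.sub(r"https?://\S+", "URL", tweet)
    tweet = re.sub(r"@\w+", "USER", tweet)
    tweet = re.sub(r"#\w+", "HASHTAG", tweet)
    # Expand abbreviations, lower-case all-caps words, collapse repeated characters.
    tokens = []
    for tok in tweet.split():
        low = tok.lower()
        if low in ABBREVIATIONS:
            feats["has_abbrev"] = True
            tok = ABBREVIATIONS[low]
        elif tok.isupper() and len(tok) > 1:
            tok = tok.lower()
        tok = re.sub(r"(\w)\1{2,}", r"\1", tok)  # happyyyyyy -> happy
        tokens.append(tok)
    return " ".join(tokens), feats
```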
Features

We use a variety of features for our classification experiments. For the baseline, we use unigrams and bigrams. We also include features typically used in sentiment analysis, namely features representing information from a sentiment lexicon and POS features. Finally, we include features to capture some of the more domain-specific language of microblogging.
n-gram features

To identify a set of useful n-grams, we first remove stopwords. We then perform rudimentary negation detection by attaching the word "not" to a word that precedes or follows a negation term. This has proved useful in previous work (Pak and Paroubek 2010). Finally, all unigrams and bigrams are identified in the training data and ranked according to their information gain, measured using Chi-squared. For our experiments, we use the top 1,000 n-grams in a bag-of-words fashion. (The number of n-grams to include as features was determined empirically using the training data.)
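A minimal sketch of this selection step is shown below, assuming scikit-learn for n-gram extraction and chi-squared scoring; the negation list, tokenizer, and the use of scikit-learn itself are assumptions (the paper does not name a toolkit).

```python
# Sketch of n-gram selection: mark negated words, extract unigrams and bigrams,
# keep the 1,000 with the highest chi-squared score. scikit-learn is a stand-in;
# the paper does not specify a toolkit.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

NEGATIONS = {"not", "no", "never"}  # assumed negation terms

def mark_negation(tokens):
    """Attach 'not_' to tokens adjacent to a negation term."""
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok in NEGATIONS:
            if i > 0:
                out[i - 1] = "not_" + out[i - 1]
            if i + 1 < len(tokens):
                out[i + 1] = "not_" + out[i + 1]
    return out

def top_ngrams(texts, labels, k=1000):
    docs = [" ".join(mark_negation(t.split())) for t in texts]
    vec = CountVectorizer(ngram_range=(1, 2), stop_words="english")
    X = vec.fit_transform(docs)
    scores, _ = chi2(X, labels)
    best = np.argsort(scores)[::-1][:k]
    return [vec.get_feature_names_out()[i] for i in best]
```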
Lexicon features

Words listed in the MPQA subjectivity lexicon (Wilson, Wiebe, and Hoffmann 2009) are tagged with their prior polarity: positive, negative, or neutral. We create three features based on the presence of any words from the lexicon.
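A sketch of these three features follows, assuming the MPQA lexicon has already been loaded into a word-to-prior-polarity dictionary (loading the lexicon file is omitted); treating them as binary presence flags is an interpretation of "presence" in the description above.

```python
# Sketch of the three lexicon features, assuming mpqa_prior maps each lexicon
# word to 'positive', 'negative', or 'neutral'. Loading MPQA itself is omitted.
def lexicon_features(tokens, mpqa_prior):
    present = {mpqa_prior[t] for t in tokens if t in mpqa_prior}
    return {
        "has_positive_word": "positive" in present,
        "has_negative_word": "negative" in present,
        "has_neutral_word": "neutral" in present,
    }
```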

Part-of-speech features

For each tweet, we have features for counts of the number of verbs, adverbs, adjectives, nouns, and any other parts of speech.
Micro-blogging features

We create binary features that capture the presence of positive, negative, and neutral emoticons and abbreviations and the presence of intensifiers (e.g., all-caps and character repetitions). For the emoticons and abbreviations, we use the Internet Lingo Dictionary (Wasden 2006) and various internet slang dictionaries available online.
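The sketch below combines the part-of-speech counts and the micro-blogging features described above. NLTK's tagger and the tiny emoticon/abbreviation polarity sets are stand-ins: the paper derives its lists from the Internet Lingo Dictionary and online slang dictionaries and does not name a POS tagger. `norm_feats` is the flag dictionary from the normalization sketch earlier.

```python
# Sketch of POS-count and micro-blogging features. The tagger choice and the
# polarity sets are assumptions made for illustration.
import nltk  # requires: nltk.download('averaged_perceptron_tagger')

POLAR_TOKENS = {
    "positive": {":)", ":D", "lol", "yay"},
    "negative": {":(", "wtf", "ugh"},
    "neutral":  {"brb", "imo"},
}

def pos_count_features(tokens):
    counts = {"verbs": 0, "adverbs": 0, "adjectives": 0, "nouns": 0, "other": 0}
    for _, tag in nltk.pos_tag(tokens):
        if tag.startswith("VB"):
            counts["verbs"] += 1
        elif tag.startswith("RB"):
            counts["adverbs"] += 1
        elif tag.startswith("JJ"):
            counts["adjectives"] += 1
        elif tag.startswith("NN"):
            counts["nouns"] += 1
        else:
            counts["other"] += 1
    return counts

def microblog_features(tokens, norm_feats):
    toks = {t.lower() for t in tokens}
    return {
        "has_pos_emoticon_or_abbrev": bool(toks & POLAR_TOKENS["positive"]),
        "has_neg_emoticon_or_abbrev": bool(toks & POLAR_TOKENS["negative"]),
        "has_neutral_abbrev": bool(toks & POLAR_TOKENS["neutral"]),
        "has_intensifier": norm_feats["has_allcaps"] or norm_feats["has_char_repeat"],
    }
```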
Experiments and Results

Our goal for these experiments is two-fold. First, we want to evaluate whether our training data with labels derived from hashtags and emoticons is useful for training sentiment classifiers for Twitter. Second, we want to evaluate the effectiveness of the features from the previous section for sentiment analysis in Twitter data. How useful is the sentiment lexicon developed for formal text on the short and informal tweets? How much gain do we get from the domain-specific features?

For our first set of experiments we use the HASH and EMOT data sets. We start by randomly sampling 10% of the HASH data to use as a validation set. This validation set is used for n-gram feature selection and for parameter tuning. The remainder of the HASH data is used for training. To train a classifier, we sample 22,247 tweets from the training data (equivalent to 10% of the training data; we experimented with different sample sizes, and this size gave the best results on the validation data) and use this data to train AdaBoost.MH (Schapire and Singer 2000) models with 500 rounds of boosting (the number of rounds was determined empirically using the validation set). We also experimented with SVMs, which gave similar trends, but lower results overall. We repeat this process ten times and average the performance of the models.
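A minimal sketch of this repeated-sampling protocol is shown below. scikit-learn's multi-class AdaBoost is used only as a stand-in for AdaBoost.MH/BoosTexter, and `extract_features` is a placeholder for the feature extraction described earlier, so this illustrates the experimental loop rather than the authors' exact setup.

```python
# Sketch of the training/evaluation loop: sample ~10% of the training pool, train a
# boosted model with 500 rounds, repeat ten times, and average the F-measure.
# AdaBoostClassifier stands in for AdaBoost.MH; extract_features() is a placeholder.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import f1_score

def run_experiment(train_tweets, train_labels, val_tweets, val_labels,
                   extract_features, sample_size=22_247, repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    vec = DictVectorizer()
    scores = []
    for _ in range(repeats):
        idx = rng.choice(len(train_tweets), size=sample_size, replace=False)
        X_tr = vec.fit_transform([extract_features(train_tweets[i]) for i in idx])
        y_tr = [train_labels[i] for i in idx]
        X_val = vec.transform([extract_features(t) for t in val_tweets])
        clf = AdaBoostClassifier(n_estimators=500).fit(X_tr, y_tr)
        scores.append(f1_score(val_labels, clf.predict(X_val), average="macro"))
    return float(np.mean(scores))
```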
Because the EMOT data set has no neutral data and our experiments involve 3-way classification, it is not included in the initial experiments. Instead, we explore whether it is useful to use the EMOT data to expand the HASH data and improve sentiment classification. 19,000 messages from the EMOT data set, divided equally between positive and negative, are randomly selected and added to the HASH data and the experiments are repeated.

To get a sense for an upper bound on the performance we can expect for the HASH-trained models and whether including the EMOT data may yield improvements, we first check the results of the models on the validation set. Figure 1 shows the average F-measure for the n-gram baseline and all the features on the HASH and the HASH+EMOT data. On this data, adding the EMOT data to the training does lead to improvements, particularly when all the features are used.

[Figure 1 (bar chart, omitted): Average F-measure on the validation set over models trained on the HASH and HASH+EMOT data; feature sets compared: n-grams, all]

Turning to the test data, we evaluate the models trained on the HASH and the HASH+EMOT data on the ISIEVE data set. Figure 2 shows the average F-measure for the baseline and four combinations of features: n-grams and lexicon features (n-gram+lex), n-grams and part-of-speech features (n-gram+POS), n-grams, lexicon features and microblogging features (n-grams+lex+twit), and finally all the features combined. Figure 3 shows the accuracy for these same experiments.

Interestingly, the best performance on the evaluation data comes from using the n-grams together with the lexicon features and the microblogging features. Including the part-of-speech features actually gives a drop in performance. Whether this is due to the accuracy of the POS tagger on the tweets or whether POS tags are less useful on microblogging data will require further investigation.

Also, while including the EMOT data for training gives a nice improvement in performance in the absence of microblogging features, once the microblogging features are included, the improvements drop or disappear. The best results on the evaluation data come from the n-grams, lexical and Twitter features trained on the hashtagged data alone.
[Figure 2 (bar chart, omitted): Average F-measure on the test set over models trained on the HASH and HASH+EMOT data; feature sets compared: n-grams, n-grams+lex, n-grams+POS, n-grams+lex+twit, all]

[Figure 3 (bar chart, omitted): Average accuracy on the test set over models trained on the HASH and HASH+EMOT data; feature sets compared: n-grams, n-grams+lex, n-grams+POS, n-grams+lex+twit, all]

Conclusions

Our experiments on Twitter sentiment analysis show that part-of-speech features may not be useful for sentiment analysis in the microblogging domain. More research is needed to determine whether the POS features are just of poor quality due to the results of the tagger or whether POS features are just less useful for sentiment analysis in this domain. Features from an existing sentiment lexicon were somewhat useful in conjunction with microblogging features, but the microblogging features (i.e., the presence of intensifiers and positive/negative/neutral emoticons and abbreviations) were clearly the most useful.

Using hashtags to collect training data did prove useful, as did using data collected based on positive and negative emoticons. However, which method produces the better training data and whether the two sources of training data are complementary may depend on the type of features used. Our experiments show that when microblogging features are included, the benefit of emoticon training data is lessened.

References

Barbosa, L., and Feng, J. 2010. Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of Coling.
Bifet, A., and Frank, E. 2010. Sentiment knowledge discovery in Twitter streaming data. In Proceedings of the 13th International Conference on Discovery Science.
Davidov, D.; Tsur, O.; and Rappoport, A. 2010. Enhanced sentiment learning using Twitter hashtags and smileys. In Proceedings of Coling.
Esuli, A., and Sebastiani, F. 2006. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of LREC.
Hatzivassiloglou, V., and McKeown, K. 1997. Predicting the semantic orientation of adjectives. In Proceedings of ACL.
Jansen, B. J.; Zhang, M.; Sobel, K.; and Chowdury, A. 2009. Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology 60(11):2169-2188.
Kim, S.-M., and Hovy, E. 2004. Determining the sentiment of opinions. In Proceedings of Coling.
O'Connor, B.; Balasubramanyan, R.; Routledge, B.; and Smith, N. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of ICWSM.
Pak, A., and Paroubek, P. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of LREC.
Pang, B., and Lee, L. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2):1-135.
Schapire, R. E., and Singer, Y. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning 39(2/3):135-168.
Tumasjan, A.; Sprenger, T. O.; Sandner, P.; and Welpe, I. 2010. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In Proceedings of ICWSM.
Wasden, L. 2006. Internet Lingo Dictionary: A Parent's Guide to Codes Used in Chat Rooms, Instant Messaging, Text Messaging, and Blogs. Technical report, Idaho Office of the Attorney General.
Wilson, T.; Wiebe, J.; and Hoffmann, P. 2009. Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics 35(3):399-433.
Yu, H., and Hatzivassiloglou, V. 2003. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Proceedings of EMNLP.
