appspot.com. For evaluation we use a manually annotated data set produced by the iSieve Corporation² (ISIEVE). The number of Twitter messages and the distribution across classes is given in Table 1.

              Positive        Negative        Neutral          Total
    HASH      31,861 (14%)    64,850 (29%)    125,859 (57%)    222,570
    EMOT      230,811 (61%)   150,570 (39%)   --               381,381
    ISIEVE    1,520 (38%)     200 (5%)        2,295 (57%)      4,015

    Table 1: Corpus statistics

Hashtagged data set

The hashtagged data set is a subset of the Edinburgh Twitter corpus. The Edinburgh corpus contains 97 million tweets collected over a period of two months. To create the hashtagged data set, we first filter out duplicate tweets, non-English tweets, and tweets that do not contain hashtags. From the remaining set (about 4 million), we investigate the distribution of hashtags and identify what we hope will be sets of frequent hashtags that are indicative of positive, negative, and neutral messages. These hashtags are used to select the tweets that will be used for development and training.

Table 2 lists the 15 most-used hashtags in the Edinburgh corpus. In addition to the very common hashtags that are part of the Twitter folksonomy (e.g., #followfriday, #musicmonday), we find hashtags that would seem to indicate message polarity: #fail, #omgthatssotrue, #iloveitwhen, etc.

    Hashtag          Frequency   Synonyms
    #followfriday    226,530     #ff
    #nowplaying      209,970
    #job             136,734     #tweetajob
    #fb              106,814     #facebook
    #musicmonday     78,585      #mm
    #tinychat        56,376
    #tcot            42,110
    #quote           33,554
    #letsbehonest    32,732      #tobehonest
    #omgfacts        30,042
    #fail            23,007      #epicfail
    #factsaboutme    19,167
    #news            17,190
    #random          17,180
    #shoutout        16,446

    Table 2: Most frequent hashtags in the Edinburgh corpus

To select the final set of messages to be included in the HASH dataset, we identify all hashtags that appear at least 1,000 times in the Edinburgh corpus. From these, we selected the top hashtags that we felt would be most useful for identifying positive, negative and neutral tweets. These hashtags are given in Table 3. Messages with these hashtags were included in the final dataset, and the polarity of each message is determined by its hashtag.

    Positive   #iloveitwhen, #thingsilike, #bestfeeling, #bestfeelingever, #omgthatssotrue, #imthankfulfor, #thingsilove, #success
    Negative   #fail, #epicfail, #nevertrust, #worst, #worse, #worstlies, #imtiredof, #itsnotokay, #worstfeeling, #notcute, #somethingaintright, #somethingsnotright, #ihate
    Neutral    #job, #tweetajob, #omgfacts, #news, #listeningto, #lastfm, #hiring, #cnn

    Table 3: Top positive, negative and neutral hashtags used to create the HASH data set
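To make the hashtag-based labeling concrete, the sketch below shows one way the selection step could be implemented. It is only an illustration: the hashtag lists are copied from Table 3, but the function name, the tie-handling rule, and the variable names are assumptions of ours, not taken from the original pipeline.

    # Illustrative sketch of hashtag-based labeling (assumed implementation,
    # not the authors' code). Hashtag sets are taken from Table 3.
    POSITIVE = {"#iloveitwhen", "#thingsilike", "#bestfeeling", "#bestfeelingever",
                "#omgthatssotrue", "#imthankfulfor", "#thingsilove", "#success"}
    NEGATIVE = {"#fail", "#epicfail", "#nevertrust", "#worst", "#worse", "#worstlies",
                "#imtiredof", "#itsnotokay", "#worstfeeling", "#notcute",
                "#somethingaintright", "#somethingsnotright", "#ihate"}
    NEUTRAL  = {"#job", "#tweetajob", "#omgfacts", "#news", "#listeningto",
                "#lastfm", "#hiring", "#cnn"}

    def label_by_hashtag(tweet_text):
        """Return 'positive', 'negative', 'neutral', or None for a single tweet."""
        tags = {tok.lower().rstrip(".,!?") for tok in tweet_text.split() if tok.startswith("#")}
        hits = {label for label, tagset in
                (("positive", POSITIVE), ("negative", NEGATIVE), ("neutral", NEUTRAL))
                if tags & tagset}
        # Keep the tweet only if its hashtags point to exactly one class
        # (how conflicting hashtags were handled is not specified in the paper;
        # dropping such tweets is an assumption).
        return hits.pop() if len(hits) == 1 else None

    # label_by_hashtag("Monday morning again #fail")  -> 'negative'
    # label_by_hashtag("New opening #job #hiring")    -> 'neutral'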
Emoticon data set

The Emoticon data set was created by Go, Bhayani, and Huang for a project at Stanford University by collecting tweets with positive :) and negative :( emoticons. Messages containing both positive and negative emoticons were omitted. They also hand-tagged a number of tweets to use for evaluation, but for our experiments, we only use their training data. This set contains 381,381 tweets, 230,811 positive and 150,570 negative. Interestingly, the majority of these messages do not contain any hashtags.

iSieve data set

The iSieve data contains approximately 4,000 tweets. It was collected and hand-annotated by the iSieve Corporation. The data in this collection was selected to be on certain topics, and the label of each tweet reflects its sentiment (positive, negative, or neutral) towards the tweet's topic. We use this data set exclusively for evaluation.

Preprocessing

Data preprocessing consists of three steps: 1) tokenization, 2) normalization, and 3) part-of-speech (POS) tagging. Emoticons and abbreviations (e.g., OMG, WTF, BRB) are identified as part of the tokenization process and treated as individual tokens. For the normalization process, the presence of abbreviations within a tweet is noted and then abbreviations are replaced by their actual meaning (e.g., BRB -> be right back). We also identify informal intensifiers such as all-caps (e.g., "I LOVE this show!!!") and character repetitions (e.g., "I've got a mortgage!! happyyyyyy"), and note their presence in the tweet. All-caps words are converted to lower case, and instances of repeated characters are replaced by a single character. Finally, the presence of any special Twitter tokens is noted (e.g., #hashtags, usertags, and URLs) and placeholders indicating the token type are substituted. Our hope is that this normalization improves the performance of the POS tagger, which is the last preprocessing step.
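The sketch below illustrates the kind of normalization described above. It is a simplified assumption of how these steps could be coded; the abbreviation dictionary, the placeholder strings, and the function name are ours, not the authors'.

    import re

    # Illustrative normalization sketch (assumed implementation, not the authors' code).
    ABBREVIATIONS = {"brb": "be right back", "omg": "oh my god"}  # small illustrative dictionary

    def normalize(tweet):
        features = {"has_abbrev": False, "has_allcaps": False, "has_char_repeat": False}
        out = []
        for tok in tweet.split():
            low = tok.lower()
            # Replace special Twitter tokens with type placeholders.
            if low.startswith("http://") or low.startswith("https://"):
                out.append("URL"); continue
            if tok.startswith("@"):
                out.append("USER"); continue
            if tok.startswith("#"):
                out.append("HASHTAG"); continue
            # Note and expand abbreviations.
            if low in ABBREVIATIONS:
                features["has_abbrev"] = True
                out.extend(ABBREVIATIONS[low].split()); continue
            # Note all-caps intensifiers, then lower-case them.
            if tok.isupper() and len(tok) > 1:
                features["has_allcaps"] = True
                tok = tok.lower()
            # Note character repetitions and collapse each run to a single character.
            if re.search(r"(.)\1{2,}", tok):
                features["has_char_repeat"] = True
                tok = re.sub(r"(.)\1{2,}", r"\1", tok)
            out.append(tok)
        return " ".join(out), features

A call such as normalize("OMG I love this showwww http://example.com") would expand the abbreviation, flag and collapse the character repetition, and substitute the URL placeholder; the POS tagger would then run on the normalized text.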
Features

We use a variety of features for our classification experiments. For the baseline, we use unigrams and bigrams. We also include features typically used in sentiment analysis, namely features representing information from a sentiment lexicon and POS features. Finally, we include features to capture some of the more domain-specific language of microblogging.

² www.i-sieve.com
n-gram features

To identify a set of useful n-grams, we first remove stopwords. We then perform rudimentary negation detection by attaching the word not to a word that precedes or follows a negation term. This has proved useful in previous work (Pak and Paroubek 2010). Finally, all unigrams and bigrams are identified in the training data and ranked according to their information gain, measured using Chi-squared. For our experiments, we use the top 1,000 n-grams in a bag-of-words fashion.³
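A compact sketch of this feature-extraction step is given below. It assumes scikit-learn's chi-squared scorer is an acceptable stand-in for the Chi-squared ranking described above, prefixes "not_" as the concrete form of attaching "not", and uses an illustrative negation list; none of these choices are prescribed by the paper.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    NEGATIONS = {"not", "no", "never", "cannot"}   # illustrative list

    def mark_negation(tokens):
        """Attach 'not' to the word immediately before and after a negation term."""
        out = list(tokens)
        for i, tok in enumerate(tokens):
            if tok in NEGATIONS:
                if i > 0:
                    out[i - 1] = "not_" + tokens[i - 1]
                if i + 1 < len(tokens):
                    out[i + 1] = "not_" + tokens[i + 1]
        return out

    # Unigrams and bigrams after stopword removal, keeping the 1,000 top-ranked n-grams.
    vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
    selector = SelectKBest(chi2, k=1000)
    # X = vectorizer.fit_transform(negation_marked_tweets)   # placeholder input
    # X_top = selector.fit_transform(X, labels)              # placeholder labels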
Lexicon features

Words listed in the MPQA subjectivity lexicon (Wilson, Wiebe, and Hoffmann 2009) are tagged with their prior polarity: positive, negative, or neutral. We create three features based on the presence of any words from the lexicon.
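Read literally, this amounts to three indicator features, one per prior-polarity class. The sketch below shows that reading, with an assumed dictionary mapping each lexicon word to its MPQA prior polarity.

    def lexicon_features(tokens, mpqa_prior_polarity):
        """Three features: presence of positive, negative, and neutral lexicon words.
        `mpqa_prior_polarity` is assumed to map a word to 'positive'/'negative'/'neutral'."""
        polarities = {mpqa_prior_polarity[t] for t in tokens if t in mpqa_prior_polarity}
        return {
            "has_positive_lexicon_word": "positive" in polarities,
            "has_negative_lexicon_word": "negative" in polarities,
            "has_neutral_lexicon_word":  "neutral" in polarities,
        }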
Part-of-speech features

For each tweet, we have features for counts of the number of verbs, adverbs, adjectives, nouns, and any other parts of speech.

Micro-blogging features

We create binary features that capture the presence of positive, negative, and neutral emoticons and abbreviations and the presence of intensifiers (e.g., all-caps and character repetitions). For the emoticons and abbreviations, we use the Internet Lingo Dictionary (Wasden 2006) and various internet slang dictionaries available online.
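The following sketch shows how the POS-count and micro-blogging indicator features could be assembled. The tagger, the Penn Treebank tag prefixes, and the small emoticon/abbreviation sets are illustrative assumptions (neutral sets and the full dictionaries are omitted for brevity).

    import nltk  # assumes a POS tagger is available, e.g., nltk.pos_tag

    POSITIVE_EMOTICONS = {":)", ":-)", ":D"}   # illustrative
    NEGATIVE_EMOTICONS = {":(", ":-("}
    POSITIVE_ABBREVS   = {"lol", "rofl"}
    NEGATIVE_ABBREVS   = {"wtf", "smh"}

    def pos_count_features(tokens):
        """Counts of verbs, adverbs, adjectives, nouns, and everything else."""
        counts = {"verbs": 0, "adverbs": 0, "adjectives": 0, "nouns": 0, "other": 0}
        for _, tag in nltk.pos_tag(tokens):
            if tag.startswith("VB"):   counts["verbs"] += 1
            elif tag.startswith("RB"): counts["adverbs"] += 1
            elif tag.startswith("JJ"): counts["adjectives"] += 1
            elif tag.startswith("NN"): counts["nouns"] += 1
            else:                      counts["other"] += 1
        return counts

    def microblog_features(tokens, has_allcaps, has_char_repeat):
        """Binary micro-blogging features; intensifier flags come from normalization."""
        toks = set(tokens)
        lower = {t.lower() for t in toks}
        return {
            "pos_emoticon": bool(toks & POSITIVE_EMOTICONS),
            "neg_emoticon": bool(toks & NEGATIVE_EMOTICONS),
            "pos_abbrev":   bool(lower & POSITIVE_ABBREVS),
            "neg_abbrev":   bool(lower & NEGATIVE_ABBREVS),
            "intensifier":  has_allcaps or has_char_repeat,
        }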
Experiments and Results

Our goal for these experiments is two-fold. First, we want to evaluate whether our training data with labels derived from hashtags and emoticons is useful for training sentiment classifiers for Twitter. Second, we want to evaluate the effectiveness of the features from the previous section for sentiment analysis in Twitter data. How useful is the sentiment lexicon developed for formal text on the short and informal tweets? How much gain do we get from the domain-specific features?

For our first set of experiments we use the HASH and EMOT data sets. We start by randomly sampling 10% of the HASH data to use as a validation set. This validation set is used for n-gram feature selection and for parameter tuning. The remainder of the HASH data is used for training. To train a classifier, we sample 22,247⁴ tweets from the training data and use this data to train AdaBoost.MH (Schapire and Singer 2000) models with 500 rounds of boosting.⁵,⁶ We repeat this process ten times and average the performance of the models.
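The evaluation loop could look roughly like the following. Since AdaBoost.MH itself is not available in scikit-learn, the sketch uses sklearn's AdaBoostClassifier purely as a stand-in, and the feature-matrix arguments are placeholders.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.metrics import f1_score

    def average_f_measure(X_train, y_train, X_val, y_val, runs=10, sample_size=22247):
        """Repeatedly sample training tweets, fit a boosted model, and average F-measure
        on the validation split. AdaBoostClassifier is only a stand-in for AdaBoost.MH."""
        rng = np.random.default_rng(0)
        scores = []
        for _ in range(runs):
            idx = rng.choice(len(y_train), size=sample_size, replace=False)
            model = AdaBoostClassifier(n_estimators=500)  # 500 rounds of boosting
            model.fit(X_train[idx], y_train[idx])
            scores.append(f1_score(y_val, model.predict(X_val), average="macro"))
        return float(np.mean(scores))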
Because the EMOT data set has no neutral data and our experiments involve 3-way classification, it is not included in the initial experiments. Instead, we explore whether it is useful to use the EMOT data to expand the HASH data and improve sentiment classification. 19,000 messages from the EMOT data set, divided equally between positive and negative, are randomly selected and added to the HASH data, and the experiments are repeated.

To get a sense for an upper bound on the performance we can expect for the HASH-trained models and whether including the EMOT data may yield improvements, we first check the results of the models on the validation set. Figure 1 shows the average F-measure for the n-gram baseline and all the features on the HASH and the HASH+EMOT data. On this data, adding the EMOT data to the training does lead to improvements, particularly when all the features are used.

Figure 1: Average F-measure on the validation set over models trained on the HASH and HASH+EMOT data

Turning to the test data, we evaluate the models trained on the HASH and the HASH+EMOT data on the ISIEVE data set. Figure 2 shows the average F-measure for the baseline and four combinations of features: n-grams and lexicon features (n-grams+lex), n-grams and part-of-speech features (n-grams+POS), n-grams, lexicon features and microblogging features (n-grams+lex+twit), and finally all the features combined. Figure 3 shows the accuracy for these same experiments.

Figure 3: Average accuracy on the test set over models trained on the HASH and HASH+EMOT data

Interestingly, the best performance on the evaluation data comes from using the n-grams together with the lexicon features and the microblogging features. Including the part-of-speech features actually gives a drop in performance. Whether this is due to the accuracy of the POS tagger on the tweets or whether POS tags are less useful on microblogging data will require further investigation.

Also, while including the EMOT data for training gives a nice improvement in performance in the absence of microblogging features, once the microblogging features are included, the improvements drop or disappear. The best results on the evaluation data come from the n-grams, lexicon, and Twitter features trained on the hashtagged data alone.

³ The number of n-grams to include as features was determined empirically using the training data.
⁴ This is equivalent to 10% of the training data. We experimented with different sample sizes for training the classifier, and this gave the best results based on the validation data.
⁵ The number of rounds of boosting was determined empirically using the validation set.
⁶ We also experimented with SVMs, which gave similar trends, but lower results overall.
Conclusions

Our experiments on Twitter sentiment analysis show that part-of-speech features may not be useful for sentiment analysis in the microblogging domain. More research is needed to determine whether the POS features are just of poor quality due to the results of the tagger or whether POS features are just less useful for sentiment analysis in this domain. Features from an existing sentiment lexicon were somewhat useful in conjunction with microblogging features, but the microblogging features (i.e., the presence of intensifiers and positive/negative/neutral emoticons and abbreviations) were clearly the most useful.

Using hashtags to collect training data did prove useful, as did using data collected based on positive and negative emoticons. However, which method produces the better training data and whether the two sources of training data are complementary may depend on the type of features used. Our experiments show that when microblogging features are included, the benefit of emoticon training data is lessened.
References

Barbosa, L., and Feng, J. 2010. Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of Coling.

Bifet, A., and Frank, E. 2010. Sentiment knowledge discovery in Twitter streaming data. In Proceedings of the 13th International Conference on Discovery Science.

Davidov, D.; Tsur, O.; and Rappoport, A. 2010. Enhanced sentiment learning using Twitter hashtags and smileys. In Proceedings of Coling.

Pak, A., and Paroubek, P. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of LREC.

Pang, B., and Lee, L. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2):1-135.

Schapire, R. E., and Singer, Y. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning 39(2/3):135-168.

Tumasjan, A.; Sprenger, T. O.; Sandner, P.; and Welpe, I. 2010. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In Proceedings of ICWSM.

Wasden, L. 2006. Internet Lingo Dictionary: A Parent's Guide to Codes Used in Chat Rooms, Instant Messaging, Text Messaging, and Blogs. Technical report, Idaho Office of the Attorney General.

Wilson, T.; Wiebe, J.; and Hoffmann, P. 2009. Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics 35(3):399-433.

Yu, H., and Hatzivassiloglou, V. 2003. Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences. In Proceedings of EMNLP.