Economically-Efficient Sentiment Stream Analysis
Roberto Lourenço Jr.
Adriano Veloso
Adriano Pereira
Computer Science Dept.
Universidade Federal de
Minas Gerais
Computer Science Dept.
Universidade Federal de
Minas Gerais
Computer Science Dept.
Universidade Federal de
Minas Gerais
robertolojr@dcc.ufmg.br
Wagner Meira Jr.
adrianov@dcc.ufmg.br
Renato Ferreira
adrianoc@dcc.ufmg.br
Srinivasan Parthasarathy
Computer Science Dept.
Universidade Federal de
Minas Gerais
Computer Science Dept.
Universidade Federal de
Minas Gerais
Dept. of Computer Science
and Engineering
The Ohio-State University
meira@dcc.ufmg.br
renato@dcc.ufmg.br
srini@cse.ohio-state.edu
ABSTRACT
1. INTRODUCTION
Text-based social media channels, such as Twitter, produce torrents
of opinionated data about the most diverse topics and entities. The
analysis of such data (aka. sentiment analysis) is quickly becoming a key feature in recommender systems and search engines. A
prominent approach to sentiment analysis is based on the application of classification techniques, that is, content is classified according to the attitude of the writer. A major challenge, however, is
that Twitter follows the data stream model, and thus classifiers must
operate with limited resources, including labeled data and time for
building classification models. Also challenging is the fact that
sentiment distribution may change as the stream evolves. In this
paper we address these challenges by proposing algorithms that select relevant training instances at each time step, so that training
sets are kept small while providing to the classifier the capabilities
to suit itself to, and to recover itself from, different types of sentiment drifts. Simultaneously providing capabilities to the classifier,
however, is a conflicting-objective problem, and our proposed algorithms employ basic notions of Economics in order to balance
both capabilities. We performed the analysis of events that reverberated on Twitter, and the comparison against the state-of-the-art
reveals improvements both in terms of error reduction (up to 14%)
and reduction of training resources (by orders of magnitude).
The need for real-time text analytics is clear and present given
the ubiquitous reach of social media sites like Facebook and Twitter. Specifically, recognizing customer sentiment in real-time and
enabling advertising on-the-fly have the potential to be a breakthrough technology [20]. Early examples of such technology in
use were demonstrated in this year’s National Football League’s
Superbowl (a premier sporting event in the USA) where a well
known manufacturer of Oreo cookies took advantage of a third
quarter blackout (and associated Twitter sentiment) to embed a contextual advertisement. Another example at the same event was the
advertisement for a Hollywood movie, where, based on the initial
advertisement which happened before the start of the first quarter
(and associated Twitter sentiment), the decision on which of several
possible advertisements to run later on in the program was apparently taken as a runtime decision. Examples like these are likely
to occur more frequently due to lightweight and easy communication mechanisms, such as Twitter microblogging, which makes
people eager not only to exchange information, but also to convey
their opinions and emotions. People watch events together on television, while tweeting out about things happening around them. As
a result, opinionated content is created almost at the same time the
event is happening in the real world, and becomes available shortly
after. The analysis of such content (aka. sentiment analysis) in
order to exploit the aggregate sentiment of the online crowd goes
beyond advertising, and is becoming crucial to recommender systems and search engines.
There is a growing trend in performing sentiment analysis using classification-related techniques: a process that automatically
builds a classification model by learning, from a set of previously
labeled data (i.e., the training-set), the underlying characteristics
that distinguish one sentiment from another (i.e., happiness, madness, surprise, suspicion). The success of these classifiers rests on
their ability to judge attitude by means of textual-patterns present
in the data, which usually appear in the form of (idiomatic) expressions and combinations of words. Sentiment analysis over Twitter
real-time messages, however, is particularly challenging, because:
(i) Twitter follows the data stream model1 , requiring classifiers to
operate with limited computing and training resources, and (ii) either sentiment distribution or the characteristics related to certain
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis;
I.5.2 [Pattern Recognition]: Classifier Design and Evaluation
General Terms
Algorithms, Experimentation, Measurement, Performance
Keywords
Sentiment Analysis; Economic Efficiency; Streams and Drifts
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SIGIR’14, July 6–11, 2014, Gold Coast, Queensland, Australia.
Copyright 2014 ACM 978-1-4503-2257-7/14/07 ...$15.00.
1
There are three main source streams in Twitter. The Firehose provides all status updates from everyone in real-time. Spritzer and
Gardenhose are two sub-samples of the Firehose. The current sampling rates are 5% and 15%, respectively.
637
correspond to cases for which no Pareto improvement is possible. These messages privilege either adaptiveness or memorability, and thus they are selected to compose the current
training set from which the classifier is built.
sentiments may change over time in almost unforeseen ways (i.e.,
sentiment drift).
Our Approach to Sentiment Stream Analysis
A possible strategy to cope with the aforementioned challenges is
to employ selective sampling algorithms in order to focus only on
the most relevant training examples/messages at each time step and
to creating training sets from which classifiers are built. Such training sets are kept as small as possible to ensure fast learning times,
since a new classifier must be built at each time step, after a new
target message arrives. Also, messages should be selected so that
the resulting training set provides sufficient resources to enable the
resulting classifier to be effective under the occurrence of drifts.
In order to provide sufficient training resources while keeping sets
small, our algorithms select training messages by taking into account two important properties, that we define as adaptiveness and
memorability. Informally, adaptiveness enables the classifier to
adapt itself to drifts, and thus, improving adaptiveness involves incorporating fresh messages into the current training set, while discarding obsolete ones. Memorability, on the other hand, involves
retaining messages belonging to pre-drift distributions, therefore
enabling the classifier to recover itself from drifts.
We hypothesize that adaptiveness and memorability are both necessary to make classifiers robust to drifts. However, given their antagonistic natures, improving both properties may lead to a conflictingobjective problem, in which the attempt to improve memorability further may result in worsening adaptiveness. Thus, we tackle
the problem by proposing selective sampling algorithms based on
multi-objective optimization, that is, we propose to select training
messages so that the resulting classifier achieves a proper balance
between memorability and adaptiveness. Our algorithms are based
on central concepts in Economics, namely Pareto and Kaldor-Hicks
efficiency criteria [19,22,28]. The Pareto Efficiency criterion informally states that “when some action could be done to make someone better off without hurting anyone else, then it should be done.”
This action is called Pareto improvement, and a system is said to be
Pareto-Efficient if no such improvement is possible. The KaldorHicks criterion is less stringent and states that “when some action
could be done to make someone better off, and this could compensate those that are made worse off, then it should be done.”
Contributions and Findings
The main contribution of this paper is to exploit the intuition behind
the aforementioned concepts for devising new algorithms for sentiment stream analysis. In practice, we claim the following benefits
and contributions:
• We formulate simple-to-compute yet effective utility measures that capture the notions of adaptiveness and memorability. For instance, the similarity between messages that are
candidate to compose the current training set and the target
message, as well as the freshness of the candidate messages,
are measures that tend to privilege adaptiveness. In contrast,
candidate messages are also randomly shuffled, thus privileging memorability. These utility measures result in a utility
space, and the extent to which each candidate message contributes to adaptiveness and memorability depends on where
it is placed in this space.
• We exploit the concept of Pareto Efficiency by separating
messages (viewed as points in the utility space) that are not
dominated by any other message. These messages compose
the Pareto frontier [28], and messages lying in this frontier
638
• We exploit the concept of Kaldor-Hicks Efficiency by selecting an additional set of messages that, although not lying in
the Pareto frontier, correspond to a positive trade-off between
adaptiveness and memorability. These messages are selected
to compose the current training set from which the classifier
is built.
• Our algorithms may operate either on an instance-basis or
in batch-mode, by employing classification models based on
sentiment rules that are kept incrementally as the stream evolves
and training sets are modified.
To evaluate the effectiveness of our algorithms, we performed
experiments using Twitter data collected from three important events
in 2010, spanning different sentiments expressed in different languages. Results show that our algorithms make classifiers extremely
effective, with gains in prediction performance that are up to 14%
when compared against the state-of-the-art. Further, the amount of
training resources needed is decreased by two orders of magnitude.
2. RELATED WORK
In the data stream model, data arrives at high speed and algorithms must work in real time and with limited resources. Further,
in some domains, algorithms must deal either with burst detection [42] and concept drift (i.e., data which nature or distribution
change over time). Žliobaitė [35] categorizes such drifts as sudden,
gradual, incremental and recurring. When data distribution or nature change over time, its relevance must be recalculated to avoid
harming the model. This kind of data stream is known as evolving
data streams.
Many techniques have been proposed to allow accurate classification in evolving data streams. Núñez et al. [27] proposed a
method for keeping a variable training window by adjusting internal structures of decision trees. An ensemble of Hoeffding trees
have been proposed in [5], each tree is limited to a small subset
of attributes. Gama et al. [17] proposed a mechanism to discard
old information based on sliding windows. Bifet et al. [6, 7] proposed an adaptive sliding window algorithm, called ADWIN, suitable for data streams with sudden drifts. The approach presented
in [24] suggests that a time-based forgetting function, which makes
more recent observations more significant, provides adaptiveness
to the classifier. Klinkenberg [23] compares example selection, often used in windowing approaches with example weights. Experiments show that both approaches are effective. In [30] the authors
proposed an approach based on a training augmentation procedure,
which automatically incorporates relevant training messages into
the training-set.
Some works have focused on feature similarity, such as Torres
et al. [31] that studied different methods for data stream classification and proposed a new way of keeping the representative data
models based on similarity measures. Feng et al. [16] extracted the
concept from each data block using feature similarity probabilities.
Masud et al. [25] proposed a novel technique to overcome the lack
of labeled examples by building models from unlabeled instances
and a small amount of labeled ones. Zhu et al. [41] employed active learning to produce a classifier ensemble that selects labeled
instances from data streams to build classifiers. Also, in [37, 38]
active learning approaches are presented for data streams that explicitly handle concept drifts. They are based on uncertainty [21],
Sentiment Scoring
dynamic allocation of labeling efforts over time, and randomization
of the search space. Žliobaitė et al. [36] proposed a system that implements active learning strategies, extending the Massive Online
Analysis (MOA) framework [8].
Works above cited attempt to face concept drift in data stream
through manipulation of classifiers, with mechanisms such as training windows and decay functions, active learning and sampling. In
this paper we present new algorithms that select high-utility examples in order to provide adaptiveness and memorability to the
classifier. In order to balance adaptiveness and memorability, we
formalized this issue as a multi-objective problem. The sample selection is performed using economic efficiency criteria: Pareto and
Kaldor-Hicks. We did not find in the recent literature approaches
that employ multi-objective models based on economic efficiency
criteria to deal with issues in the data stream environment.
We denote as R(tn ) the classifier obtained at time step n, by extracting rules from Dn . Basically, the classifier is a poll of rules,
and each rule {X →
− si } ∈ R(tn ) is a vote given for sentiment si .
Given message tn , a rule is a valid vote if it is applicable to tn .
Definition 2. A rule {X →
− si } ∈ R(tn ) is said to be applicable
to message tn ∈ T if all terms in X are in tn .
We denote as Ra (tn ) the set of rules in R(tn ) that are applicable to message tn . Thus, only rules in Ra (tn ) are considered as
valid votes when scoring sentiments in tn . Further, we denote as
Rsai (tn ) the subset of R(tn ) containing only rules predicting sentiment si . Votes in Rsai (tn ) have different weights, depending on
the confidence of the corresponding rules. The weighted votes for
sentiment si are averaged, giving the score for si with regard to tn :
3. ALGORITHMS
In this section we present novel selective sampling approaches
for learning classifiers to distinguish between different sentiments
expressed in Twitter messages. We start by discussing models based
on specialized association rules. Then we present measures for
adaptiveness and memorability, and describe the message utility
space. Finally, we discuss Pareto and Kaldor-Hicks criteria, and
algorithms that select training messages using these criteria.
3.1
X θ(X →
− si )
(1)
|Rsai (tn )|
Finally, the scores are normalized, thus giving the likelihood of
sentiment si being the attitude in message tn :
s(tn , si ) =
p̂(si |tn ) =
Sentiment Stream Analysis
(2)
j=1
In our context, the task of learning sentiment streams is precisely
defined as follows. At time step n, we have as input a training
set referred to as Dn , which consists of a set of records of the
form < d, si >, where d is a message (represented as a list of
terms), and si is the sentiment implicit in d. The sentiment variable s draws its values from a pre-defined, fixed and discrete set of
possibilities (e.g., s1 , s2 , . . ., sk ). The training set is used to build
a classifier relating textual patterns in the messages to their corresponding sentiments. A sequence of future messages referred to
as T = {tn , tn+1 , . . .}, consists of messages for which only their
terms are known, while the corresponding sentiments are unknown.
The classifier obtained from Dn is used to score the sentiments for
message tn in T . Messages in T are eventually incorporated into
the next training set.
There are countless strategies for devising a classifier for sentiment analysis. Many of these strategies, however, are not wellsuited to deal with data streams. Some are specifically devised for
offline classification [12, 14], and this is problematic because producing classifiers on-the-fly would be unacceptably costly. In such
circumstances, alternate classification strategies may become more
convenient [33].
3.2
s(tn , si )
k
X
s(tn , sj )
Rule Extraction
The simplest approach to rule extraction is the offline one. In this
case, rule extraction is divided into two steps: support counting
and confidence computation. Once the support σ(X ) is known, it
is straightforward to compute the confidence θ(X −
→ si ) for the
corresponding rules [40]. There are several smart support-counting
strategies [1,18,40], and many fast implementations [3] that can be
used. We employ the vertical counting strategy, which is based on
the use of inverted lists [39]. Specifically, an inverted list associated
with termset X , is denoted as L(X ), and contains the identifiers of
the messages in Dn having termset X as a subset. An inverted
list L(X ) is obtained by performing the intersection of two proper
subsets of termset X . The support of termset X is given by the
cardinality of L(X ), that is, σ(X ) = |L(X )|.
Usually, the support for different sets of terms in Dn are computed in a bottom-up way, which starts by scanning all messages
in Dn and computing the support of each term in isolation. In the
next iteration, pairs of terms are enumerated, and their support values are calculated by performing the intersection of the corresponding proper subsets. The search for sets of terms proceeds, and the
enumeration process is repeated until the support values for all sets
of terms in Dn are finally computed. Obviously, the number of
rules increases exponentially with the size of the vocabulary (i.e.,
the number of distinct terms in Dn ), and computational cost restrictions have to be imposed during rule extraction. Typically, the
search space for rules is restricted by pruning rules that do not appear frequently in Dn (i.e., the minimum support approach). While
such restrictions make rule extraction feasible, they also lead to
lossy classifiers, since some rules are pruned and therefore are not
included into R(tn ).
Sentiment Rules and Classifiers
Next we describe classifiers composed of association rules, and
how these rules are used for sentiment-scoring. Such classifiers are
built on-the-fly [32,34], being thus well-suited for sentiment stream
analysis, as shown in [30].
Definition 1. A sentiment rule is a specialized association rule
X −
→ si , where the antecedent X is a set of terms (i.e., a termset),
and the consequent si is the predicted sentiment. The domain for
X is the vocabulary of the training set Dn . The support of X is denoted as σ(X ), and is the number of messages in Dn having X as
a subset. The confidence of rule X −
→ si is denoted as θ(X −
→ si )
σ(X ∪ si )
.
and is given as
σ(X )
Online Rule Extraction. An alternative to offline rule extraction
is to extract rules on-the-fly. Such alternative, which we call online rule extraction, has several advantages [30]. For instance, it
becomes possible to efficiently extract rules from Dn without per-
639
Utility Measures
forming support-based pruning. The idea behind online rule extraction is to ensure that only applicable rules are extracted by projecting Dn on a demand-driven basis. More specifically, rule extraction
is delayed until a message tn ∈ T is given. Then, terms in tn are
used as a filter which configures Dn in a way that only rules that are
applicable to tn can be extracted. This filtering process produces
a projected training-set, denoted as Dn∗ , which contains only terms
that are present in message tn .
At each time step, the classifier must score sentiments that are expressed in the target message. Some of the utility measures we are
going to discuss next are based on the distance to the target message. By minimizing such distance we are essentially maximizing
adaptiveness, since the selected messages are similar to the target
message. As for memorability, we are going to discuss a utility
measure based on randomly shuffling candidate messages:
Lemma 1. All rules extracted from Dn∗ are applicable to tn .
• Distance in space − The similarity between the target message tn and an arbitrary message tj is given by the number
of rules in the classifier Ra (tn ) that are also applicable to
tj . Differently from traditional measures such as cosine and
Jaccard [2], the rule-based similarity considers not only isolated terms, but also combination of terms. Thus, the utility
of message tj is given as:
Dn∗
Proof. Since all training messages in
contain only terms that
are present in message tn , the existence of a rule X −
→ si extracted
from Dn∗ , such that X * tn , is impossible.
Lemma 1 implies that online rule extraction assures that R(tn ) =
Ra (tn ). The next theorem states that search space for rules induced by Dn∗ is much narrower than the search space for rules induced by Dn . Thus, rules can be efficiently extracted from Dn∗ , no
matter the minimum-support value (which can be arbitrary low).
Us (tj ) =
Theorem 1. The number of rules extracted from Dn∗ increases
polynomially with the number of distinct terms in Dn .
(3)
• Distance in time − Let γ(tj ) be a function that returns the
time in which message tj arrived. The utility of message tj
is given as:
Proof. Let k be the number of distinct terms in Dn . Since an arbitrary message tn ∈ T contains at most l terms (with l ≪ k), then
any rule applicable to tn can have at most l terms in its antecedent.
That is, for any rule {X →
− si }, such that X ⊆ tn , |X | ≤ l. Consequently,
the number
of possible rules that are applicable to tn is
l + 2l + . . . + ll = O(2l ) ≪ O(kl ). Thus, the number of applicable rules increases polynomially in k.
Ut (tj ) =
γ(tj )
γ(tn )
(4)
• Memorability − In order to provide memorability, the training set must contain messages posted in different time periods. A simple way to force this is to generate a random permutation of the candidate messages, that is, randomly shuffling the candidate messages [15]. Let α(tj ) be a function
that returns the position of message tj in the shuffle. The
utility of message tj is given as:
Extending Classifiers Dynamically. Let R = {R(t1 ) ∪ R(t2 )
∪ . . . ∪ R(tn )}. With online rule extraction, R is extended dynamically as messages ti ∈ T are processed. Initially R is empty;
a classifier Rti is appended to R every time a message ti is processed. Producing a classifier R(ti ) involves extracting rules from
the corresponding training-set. This operation has a significant
computational cost, since it is necessary perform multiple accesses
to Di . Different messages in T = {t1 , t2 , . . . , tm } may demand
different classifiers {Rt1 , Rt2 , . . . , Rtm }, but different classifiers
may share some rules (i.e., {Rti ∩ Rtj } 6= ∅). In this case, memorization is very effective in avoiding work replication, reducing
the number of data access operations. Thus, before extracting rule
X −
→ si , the classifier first checks whether this rule is already in R.
If an entry is found, then the rule in R is used instead of extracting
it from the training-set. If it is not found, the rule is extracted from
the training-set and then it is inserted into R.
3.3
|Ra (tn ) ∩ Ra (tj )|
|{Ra (tn )}|
Ur (tj ) =
α(tj )
|Dn |
(5)
Each candidate message is judged based on these three utility
measures. The need to judge one situation better than another motivates much of Economics, and next we discuss concepts from Economics and how they can be applied to select messages to compose
the training set.
3.4
Economic Efficiency
When the society is economically efficient, any changes made to
assist one person would harm another. The same intuition could be
exploited for the sake of selecting messages to compose the training
set at each time step. In this case, a training set is economically
efficient if it is only possible to improve memorability at the cost
of adaptiveness, and vice-versa [26, 29].
There is an alternative, less stringent notion of efficiency, which
is based on the principle of compensation [13]. Under new arrangements in the society, some may be better off while others may
be worse off. Compensation holds if those made better off under
the new set of conditions could compensate those made worse off.
Next we discuss algorithms that exploit these two notions of economic efficiency in order to select messages to compose the training sets.
Utility Space and Selective Sampling
Our approach to sentiment stream analysis is based on selecting high-utility messages to compose the training set at each time
step. Training sets must provide adaptiveness and memorability to
the corresponding classifiers. Improving adaptiveness and memorability simultaneously, however, is a conflicting-objective problem. Instead, our approaches create training sets that balance between adaptiveness and memorability. Specifically, at each time
step, candidate messages are placed into an n-dimensional space,
in which each dimension corresponds to a utility measure which is
either related to adaptiveness or memorability.
640
Pareto frontier
e)
pac
in s
e
c
s
stan
nes
(Di
tive
p
a
Ad
(Dist
ance
Adap in space)
tiven
ess
•
•
•
• •
•
•
• ••
•
•
•
•
•
•
•
•
ity
abil
mor
e
M
•
•
••
•
•
•
•
•
•
••
••
•• •
•
◦
•
◦◦
•
Adap
tiven
ess
ance
in tim
e)
Adap
tiven
(Dist
ess
ance
in tim
e)
•
•
◦
◦
◦
◦
◦◦
(Dist
•
••
Figure 3: Points lying in the Pareto frontier.
messages (di , dj ) ∈ Pn for which di dominates dj .
• •
••
••
• •
ity
Messages that are not dominated by any other message, lie on the
Pareto frontier [28]. Therefore, by definition, the Pareto-efficient
training set at time step n, Pn , is composed by all the messages
lying in the Pareto frontier that is built from Dn . There are efficient
algorithms for building and maintaining the Pareto frontier, and we
employed the algorithm proposed in [11] which ensures O(|Dn |)
complexity. We denote the process of exploiting Pareto-efficient
training sets as Pareto-Efficient Selective Sampling, or simply PESS.
Figure 3 shows an illustrative example of a Pareto frontier built
from arbitrary points in the utility space.
Figure 1: Illustrative example. The 3D utility space.
Pareto Frontier
Messages that are candidate to compose the training set at time step
n are placed in a 3-dimensional space, according to their utility
measures, as shown in Figure 1. Thus, each message a is a point in
such utility space, and is given as < Us (a), Ut (a), Ur (a) >.
Kaldor-Hicks Region
The PESS strategy follows a stringent criterion, which tends to select only few messages to compose the training sets. As a result, the
training sets may become excessively small and prone to noise. The
Kaldor-Hicks criterion, on the other hand, follows a cost-benefit
analysis and circumvents the small training set problem by stating
that efficiency is achieved if those that are made better off could in
theory compensate those that are made worse off. Thus, under the
Kaldor-Hicks criterion, an utility measure can compensate other
utility measures, and therefore, this criterion selects messages that
are located inside a region which is below the Pareto frontier. To
define this region we must first define the overall utility of a message.
Definition 3. Message a is said to dominate message b iff both of
the following conditions are hold:
• Us (a) ≥ Us (b) and Ut (a) ≥ Ut (b) and Ur (a) ≥ Ur (b)
• Us (a) > Us (b) or Ut (a) > Ut (b) or Ur (a) > Ur (b)
Therefore, the dominance operator relates two messages so that the
result of the operation has two possibilities as shown in Figure 2:
(i) one message dominates another or (ii) the two messages do not
dominate each other.
Definition 5. Assuming that all measures are equally important,
the overall utility of an arbitrary message di ∈ Dn is:
b
◦
a
◦
•
•
◦
Mem
orabi
l
•
•
•
◦
U (di ) = Us (di ) + Ut (di ) + Ur (di )
◦
◦
◦
c
◦◦
◦
◦
◦
◦
◦
◦
(6)
That is, the overall utility of a message is given as the sum of its
utility measures. Also, the baseline message, which is denoted as
d∗ , is defined as:
◦◦
d∗ = {di ∈ Pn |∀dj ∈ Pn : U (di ) ≤ U (dj )}
(7)
That is, the baseline is the message lying in the frontier for which
the overall utility assumes its lowest value.
Figure 2: The dominance operator: neither a or b dominates
each other, but b dominates c.
The Kaldor-Hicks region is composed of messages for which the
overall utility is not smaller than the baseline overall utility. Such
baseline utility is the utility associated with the message lying in
the Pareto frontier for which the overall utility is the lowest.
Definition 4. Training set Pn = {d1 , d2 , . . . , dm } is said to be
Pareto-efficient at time step n, if Pn ⊆ Dn and there is no pair of
641
into Dn+b , since some messages in B may be similar to each
other. Under this setting, not all messages in B need to be labeled, since only a subset of B is included into Dn+b . Therefore, there is a trade-off between batch size and labeling effort, and a similarity threshold, denoted as δ, controls the
messages in the batch that must be labeled.
Definition 6. Training set Kn = {d1 , d2 , . . . , dm } is said to be
Kaldor-Hicks-efficient at time step n, if Pn ⊆ Kn ⊆ Dn , and
there is no message di ∈ Kn such that U (d∗ ) > U (di ).
We denote the process of exploiting Kaldor-Hicks-efficient training sets as Kaldor-Hicks-Efficient Selective Sampling, or simply
KHSS. Figure 4 shows an illustrative example of a Kaldor-Hicks
region built from arbitrary points in the utility space.
In the following we describe our evaluation scenarios and discuss
the performance of the classifiers.
Kaldor-Hicks region
•
•
•
••
•
•
◦
◦
◦
◦
◦◦
Dilma Rousseff Election Campaign
Figure 4: Points inside the Kaldor-Hicks region.
4. EXPERIMENTAL EVALUATION
In this section we empirically analyze the performance of our
classifiers. We employ the mean squared error (MSE) as the basic
evaluation measure in our experiments, since we are primarily interested in evaluating sentiment scoring given by Equation 2. The
MSE measure is given as:
MSE =
1 X
(1 − p̂(si |ti ))2
|T |
Brazilian Presidential Elections
The presidential election campaigns were held from June to October 2010. the candidate Dilma Rousseff launched a Twitter page
during a public announcement, and she used Twitter as one of the
main sources of information for her voters. The campaign attracted
more than 500,000 followers and as a result Dilma was the second
most cited person on Twitter in 2010. The election came to a second round vote, and Dilma Rousseff won the runoff with 56% of
the votes.
•
•
•
4.1
(8)
∀ti ∈T
where si is the correct sentiment associated with message ti ∈
T , and p̂(si |ti ) is the sentiment score assigned by the classifier to
message ti ∈ T .
To evaluate the amount of computing resources used as the stream
evolves, we employ the RAM-Hours measure [9], where every RAMHour equals a GB of RAM deployed for 1 hour of execution. We
also evaluate the amount of training resources used over time, as
the number of messages labeled during the process. We used Hoeffding Adaptive Trees [4, 10] (abbreviated as HAT), Active Classifier [37, 38] (abbreviated as AC), and Incremental Lazy Associative Classifier [30] (abbreviated as ILAC) as baselines. All datasets
used in our experiments were manually labeled by three to five
human annotators. We expended significant time, effort, and resources to obtain high quality (labeled) data from Twitter streams,
which shall be made available at publication time.
All experiments were performed on a 1.93 GHz Core i7 machines with 8GB of memory, using the MOA system [8], an environment for running experiments with evolving data streams. Our
evaluation follows the Test-Then-Train methodology, in which each
individual message in T is used to test the classifier and then it becomes available for training. Finally, we consider three possible
settings:
• Instance Processing − Once processed, message tn is included into Dn+1 , and then a new classifier is built. Under
this setting, message tn is mandatorily labeled.
• Batch Processing − After a batch of b messages B = {tn ,
tn+1 , . . . , tn+b } is processed, only a subset of B is included
642
We collected 66,643 messages in Portuguese referencing Dilma
Rousseff in Twitter during her campaign. We labeled these messages in order to track the population sentiment of approval during
this period. As shown in Figure 5 (a), approval varied significantly
over the time due to several polemic statements and political attacks, and our goal is to score approval during her campaign.
Figure 5 (b) shows the results in terms of MSE obtained for the
evaluation of the classifiers in this dataset. All classifiers evaluated in this experiment operate on an instance basis. The x-axis
represents different time steps (i.e., each message that passes in
the stream), while the y-axis shows the MSE so far. As it can be
seen, a better approximation is obtained using our proposed algorithms, namely PESS and KHSS. AC and ILAC were very competitive during all the campaign. Both PESS and KHSS algorithms
started much better than the other competing algorithms, but slowly
converges to the baseline numbers as the stream evolves.
Figure 5 (c) shows results concerning the proposed algorithms
when operating in batch mode. The figure shows the number of
messages that were labeled during the process as a function of δ,
the minimum similarity threshold discussed in Section 3.1. Basically, we calculate the Jaccard coefficient associated with each possible pair of messages in the batch, and if the coefficient is greater
than δ, the corresponding messages are merged into a new one.
The process continues merging similar messages until no pair of
messages are similar enough, and the process stops. At the end,
only the merged messages are labeled. Clearly, higher values of δ
implies that less messages are merged, and thus incurring more labeling effort. Further, as the figure shows, the dependence between
labeling effort and δ tends to be linear.
By varying δ, we also study the trade-off between labeling effort and MSE. As shown in Figure 5 (d), MSE decreases if more
labeling effort is spent during the process. Specifically, best results
are achieved when about 40% of the messages in the stream are labeled during the process. Although both PESS and KHSS require
the same amount of training resources, KHSS provides slightly better MSE numbers. Furthermore, smaller batch sizes incur in less
labeling effort for this dataset.
We assume that HAT requires only the target message for updating its tree model, and thus we consider that the training set is composed only by the target message. The AC algorithm requires much
more messages within each training set. An abrupt decrease in the
number of training messages is always observed after drifts. The
proposed PESS algorithm requires very small training sets, since
1
0.7
0.8
0.4
0.8
0.4
0.3
0.2
0.2
0.7
0.6
(batch-0.01)
(batch-0.01)
(batch-0.05)
(batch-0.05)
(batch-0.10)
(batch-0.10)
0.5
0.4
0.3
0.2
0.1
0
05/01 06/01 07/01 08/01 09/01 10/01 11/01
0.1
0
0
0
0.2
Time
0.28
0.27
0.8
1
0
0.2
1
1
10000
0.24
0.01
1000
AC
HAT
ILAC
PESS (instance)
KHSS (instance)
100
0.001
0.0001
AC
HAT
ILAC
PESS (instance)
KHSS (instance)
PESS (batch-0.05)
KHSS (batch-0.05)
1e-05
10
1e-06
0.23
0.22
1
0.4
0.6
Labeling Effort
0.8
0.1
0.25
0.2
0.6
(c) X-Y scatter plot correlating the
minimum similarity threshold δ and labeling effort.
100000
(batch-0.01)
(batch-0.01)
(batch-0.05)
(batch-0.05)
(batch-0.10)
(batch-0.10)
0.26
0
0.4
σ
RAM-Hours
PESS
KHSS
PESS
KHSS
PESS
KHSS
0.6
(b) MSE numbers as the stream evolves.
Training Set Size
0.3
0.29
0.4
Stream Progress
(a) Approval over Dilma Rousseff’s
campaign. Approval sentiment varied
greatly from 05/2010 to 11/2010.
MSE
PESS
KHSS
PESS
KHSS
PESS
KHSS
0.9
Labeling Effort
MSE
Approval
0.5
0.6
1
AC
HAT
ILAC
PESS (instance)
KHSS (instance)
0.6
0.8
1
(d) X-Y scatter plot correlating labeling effort and MSE numbers.
1e-07
0
0.2
0.4
0.6
0.8
Stream Progress
1
(e) Size of the training sets as the
stream evolves.
0
0.2
0.4
0.6
0.8
Stream Progress
1
(f) RAM-Hours as the stream evolves.
Figure 5: Brazilian Presidential Elections. Tweets are in Portuguese.
Again, MSE numbers decrease as more labeling effort is spent during the process. This trend is particularly evidenced for smaller
batch sizes. Further, the KHSS algorithms shows a better tradeoff between labeling effort and MSE. Finally, Figure 6 (c) shows
RAM-Hours numbers for the evaluated algorithms. The AC algorithm, as well as PESS (instance) and KHSS (instance) are, again,
extremely competitive in terms of amount of computing resources
required. Further, the amount of resources required during the process significantly increases when PESS and KHSS operate on batch
mode, but still, ILAC is the worst performer.
the Pareto frontier at each time step is composed by few messages,
but these messages are still able to make the classifier robust to
drifts as the stream evolves. Further, despite being less stringent
than PESS, the proposed KHSS algorithm also requires small training sets, as shown in Figure 5 (e).
Figure 5 (f) shows RAM-Hours numbers for the algorithms. AC,
as well as PESS (instance) and KHSS (instance), are clearly the
best performers in terms of amount of computing resources required. Also, resources required during the process significantly
increases when PESS and KHSS operate on batch mode, but still,
ILAC is the worst performer.
4.2
4.3
TIME’s Person of the Year
FIFA World Cup
The 2010 Soccer World Cup involved 32 teams. The Brazilian
team was defeated by the Dutch team on 07-02-2010, after a controversial match. The Brazilian team scored first, but soon after the
Dutch team scored twice and won the match. A specific player,
Felipe Melo, had decisive participation (for better and worse) in all
three goals. Specifically, Figure 7 (a) shows how the appreciation
for Felipe Melo varied during the match.
Every year, TIME magazine selects the person (or a group of persons) that has mostly influenced during the year. The chosen person
for 2010 was Mark Zuckerberg. The reader choice, however, was
Julian Assange, with an overwhelming superiority of votes.
Zuckerberg and Assange
We collected 5,616 messages in English referencing Julian Assange and Mark Zuckerberg from 1-15-2010 to 12-21-2010. We labeled them in order to track diverse sentiments regarding the magazine’s decision. Sentiments include (dis)approval, surprise (since
the reader choice was pointing to Julian Assange), and even fury.
Figure 6 (a) shows the results in terms of MSE. As it can be
seen, a better approximation is obtained by HAT and ILAC. For
this dataset, AC was not effective in the first time steps. At the end
of the process, both PESS (instance) and KHSS (instance) algorithms achieved competitive numbers when compared against the
best performers.
Figure 6 (b) shows the trade-off between labeling effort and MSE.
The Brazilian Defeat
We collected 3,214 messages in Portuguese referencing Felipe Melo
that were posted in Twitter as the match was happening. We labeled
them in order to track the appreciation for the participation of Felipe Melo.
Figure 7 (b) shows the results in terms of MSE.As it can be
seen, the AC algorithm achieved the worst MSE numbers for this
dataset. On the other hand, HAT, ILAC, as well as PESS (instance) and KHSS (instance) showed extremely competitive numbers. This is expected, since this dataset contains three sudden
drifts (as shown in Figure 7 (a)), and HAT, ILAC, PESS (instance)
643
0.5
AC
HAT
ILAC
PESS (instance)
KHSS (instance)
0.5
PESS
KHSS
PESS
KHSS
PESS
KHSS
0.45
0.4
0.01
(batch-0.01)
(batch-0.01)
(batch-0.05)
(batch-0.05)
(batch-0.10)
(batch-0.10)
0.001
0.0001
RAM-Hours
0.6
MSE
MSE
0.4
0.35
0.3
0.3
0.2
1e-05
1e-06
0.25
0.1
1e-08
0.2
0
0.2
0.4
0.6
0.8
1
1e-09
0
0.2
Stream Progress
0.4
0.6
0.8
1
0
Labeling Effort
(a) MSE numbers as the stream
evolves.
AC
HAT
ILAC
PESS (instance)
KHSS (instance)
PESS (batch-0.05)
KHSS (batch-0.05)
1e-07
(b) X-Y scatter plot correlating labeling effort and MSE numbers.
0.2
0.4
0.6
0.8
1
Stream Progress
(c) RAM-Hours as the stream evolves.
Figure 6: Person of the Year. Tweets are in English.
6. ACKNOWLEDGMENTS
and KHSS (instance) were all able to ensure adaptiveness. For this
dataset, memorability is not mandatory (as the sentiment distribution never returns to a pre-drift distribution), and thus PESS (instance) and KHSS (instance) were not able to provide significant
improvements, although being the best performers overall.
Figure 7 (c) shows a X-Y scatter plot correlating δ and labeling effort. The correlation is almost linear. The trade-off between
labeling effort and MSE is shown in Figure 7 (d). Clearly, MSE decreases with the effort spent to label messages. Figure 7 (e) shows
the number of messages composing the training set at each time
step. As in previous cases, AC and ILAC require much more training resources than other competing algorithms. PESS (instance)
as well as KHSS (instance) require much less training messages,
again, showing that the selective sampling strategy is effective in
producing small and effective sets at each time step.
Finally, Figure 7 (f) shows RAM-Hours numbers. In this case,
AC, as well as PESS (instance) and KHSS (instance), are clearly
the best performers in terms of amount of computing resources required. Further, the amount of resources required during the process significantly increases when PESS and KHSS operate on batch
mode, but still, as in other datasets, ILAC is the worst performer.
Adriano Veloso, Adriano Pereira, Wagner Meira Jr., and Renato
Ferreira would like to acknowledge grants from CNPq, CAPES,
Fapemig, Finep, and InWeb − the Brazilian National Institute of
Science and Technology for the Web. Srinivasan Parthasarathy
would like to acknowledge NSF grant IIS 1111118 and a Google research award. Roberto Oliveira Jr. would like to acknowledge that
some aspects of this work was conducted while he was a visiting
researcher in Srinivasan Parthasarathy’s lab at Ohio State University.
7. REFERENCES
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association
rules between sets of items in large databases. In SIGMOD,
pages 207–216. ACM, 1993.
[2] R. Baeza-Yates and B. R-Neto. Modern Information
Retrieval. Addison-Wesley-Longman, 1999.
[3] R. Bayardo, B. Goethals, and M. Zaki, editors. Workshop on
Frequent Itemset Mining Implementations, volume 126,
2004.
[4] A. Bifet and E. Frank. Sentiment knowledge discovery in
twitter streaming data. In Disc. Science, pages 1–15, 2010.
[5] A. Bifet, E. Frank, G. Holmes, and B. Pfahringer. Ensembles
of restricted hoeffding trees. TIST, 3(2):30:1–30:20, 2012.
[6] A. Bifet and R. Gavaldà. Learning from time-changing data
with adaptive windowing. In SDM, 2007.
[7] A. Bifet and R. Gavaldà. Adaptive learning from evolving
data streams. In IDA, pages 249–260, 2009.
[8] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA:
Massive online analysis. JMLR, 11:1601–1604, 2010.
[9] A. Bifet, G. Holmes, B. Pfahringer, and E. Frank. Fast
perceptron decision tree learning from evolving data streams.
In PAKDD, pages 299–310, 2010.
[10] A. Bifet, G. Holmes, B. Pfahringer, and R. Gavaldà.
Detecting sentiment change in twitter streaming data. JMLR,
17:5–11, 2011.
[11] S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline
operator. In ICDE, pages 421–430, 2001.
[12] L. Breiman, J. Friedman, R. Olshen, and C. Stone.
Classification and regression trees. Wadsworth Intl., 1984.
[13] J. Chipman. Compensation principle. In S. N. Durlauf and
L. E. Blume, editors, The New Palgrave Dictionary of
Economics. Palgrave Macmillan, 2008.
[14] C. Cortes and V. Vapnik. Support-vector networks. Machine
Learning, 20(3):273–297, 1995.
5. CONCLUSIONS
This paper focused on sentiment analysis on Twitter streams. We
have introduced new algorithms for active training-set formation,
which we denote as Pareto-Efficient Selective Sampling (PESS)
and Kaldor-Hicks Selective Sample (KHSS). The proposed algorithms provide the resulting classifier with memorability and adaptiveness. We formalized the selective sampling process as a multiobjective optimization procedure, which finds a proper balance between adaptiveness and memorability. Adaptiveness is assessed by
computing the distance in time and space between the target message and the candidate ones. Also, candidate messages are randomly shuffled, thus providing memorability to the resulting classifier. The message utility space is composed by such dimensions,
and we compute the Pareto Frontier in this space in order to pick
up messages satisfying the Pareto improvement condition, finding a proper balance between adaptiveness and memorability. The
Kaldor-Hicks criterion enables memorability to compensate adaptiveness, or vice-versa. A systematic evaluation involving recent
events demonstrated the effectiveness of our algorithms.
As future work, we intend to extend our strategies for algorithms
that do not depend on manual labeling.
644
1
0.35
0.8
0.4
0.8
0.2
0.15
0.1
0.2
0.7
0.6
(batch-0.01)
(batch-0.01)
(batch-0.05)
(batch-0.05)
(batch-0.10)
(batch-0.10)
0.5
0.4
0.3
0.2
0.05
0
00:00 00:30 01:00 01:30 02:00 02:30 03:00
0.1
0
0
0
0.2
Time
0.8
1
0
0.2
0.15
0.1
10
1e-05
1e-06
AC
HAT
ILAC
PESS (instance)
KHSS (instance)
PESS (batch-0.05)
KHSS (batch-0.05)
1e-07
1e-08
0.4
0.6
Labeling Effort
0.8
0
0.2
0.4
0.6
0.8
1
1e-09
0
Stream Progress
(d) X-Y scatter plot correlating labeling effort and MSE numbers.
1
0.0001
1
0.2
0.8
(c) X-Y scatter plot correlating the
minimum similarity threshold δ and labeling effort.
ILAC
KHSS (self-labeling-5-0.7)
PESS (self-labeling-5-0.8)
0.05
0
0.6
0.001
100
(batch-0.01)
(batch-0.01)
(batch-0.05)
(batch-0.05)
(batch-0.10)
(batch-0.10)
0.2
0
0.4
σ
RAM-Hours
0.3
0.25
0.6
(b) MSE numbers as the stream
evolves.
Training Set Size
PESS
KHSS
PESS
KHSS
PESS
KHSS
0.35
0.4
Stream Progress
(a) Appreciation associated with Felipe
Melo over the match.
MSE
PESS
KHSS
PESS
KHSS
PESS
KHSS
0.9
Labeling Effort
MSE
Appreciation
0.25
0.6
1
AC
HAT
ILAC
PESS (instance)
KHSS (instance)
0.3
(e) Size of the training sets as the
stream evolves.
0.2
0.4
0.6
0.8
Stream Progress
1
(f) RAM-Hours as the stream evolves.
Figure 7: The Brazilian Defeat. Tweets are in Portuguese.
[15] R. Durstenfeld. Algorithm 235: Random permutation.
Commun. ACM, 7(7):420, 1964.
[16] L. Feng, F. Chen, and Y. Yao. A concept similarity based
data stream classification model. Journal of Information &
Computational Science, 10(4):949–957, 2013.
[17] J. Gama, R. S. ao, and P. Rodrigues. Issues in evaluation of
stream learning algorithms. In SIGKDD, page 329, 2009.
[18] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns
without candidate generation: A frequent-pattern tree
approach. Data Mining Knowledge Discovery, 8(1):53–87,
2004.
[19] J. Hicks. The foundations of welfare economics. The
Economic Journal, 49(196):696–712, 1939.
[20] R. Hof. Real-time advertising has arrived, thanks to Oreo and
The Super Bowl, April 2013. www.forbes.com/.
[21] C. Jin, K. Yi, L. Chen, J. Yu, and X. Lin. Sliding-window
top-k queries on uncertain streams. VLDB J., 19(3):411–435,
2010.
[22] N. Kaldor. Welfare propositions in economics and
interpersonal comparisons of utility. The Economic Journal,
49(195):549–552, 1939.
[23] R. Klinkenberg. Learning drifting concepts: Example
selection vs. example weighting. Intell. Data Anal., 8(3),
2004.
[24] I. Koychev. Gradual forgetting for adaptation to concept
drift. In ECAI, pages 101–106, 2000.
[25] M. Masud, J. Gao, L. Khan, J. Han, and B. Thuraisingham.
A practical approach to classify evolving data streams:
Training with limited amount of labeled data. In ICDM,
pages 929–934, 2008.
[26] M. Moreira, J. dos Santos, and A. Veloso. Learning to rank
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
[35]
[36]
[37]
645
similar apparel styles with economically-efficient rule-based
active learning. In ICMR, pages 361–369, 2014.
M. N. nez, R. Fidalgo, and R. Morales. Learning in
environments with unknown dynamics: Towards more robust
concept learners. JMLR, 8, 2007.
F. Palda. Pareto’s Republic and the new Science of Peace.
Cooper-Wolfling, 2011.
M. Ribeiro, A. Lacerda, A. Veloso, and N. Ziviani.
Pareto-efficient hybridization for multi-objective
recommender systems. In RecSys, pages 19–26, 2012.
I. Santana, J. Gomide, A. Veloso, W. M. Jr., and R. Ferreira.
Effective sentiment stream analysis with self-augmenting
training and demand-driven projection. In SIGIR, pages
475–484. ACM, 2011.
D. Torres, J. Ruiz, and Y. Sarabia. Classification model for
data streams based on similarity. In IEA, pages 1–9, 2011.
A. Veloso, W. M. Jr., and M. Zaki. Lazy associative
classification. In ICDM, pages 645–654, 2006.
A. Veloso, W. Meira Jr., M. Gonçalves, H. de Almeida, and
M. Zaki. Calibrated lazy associative classification. Inf. Sci.,
181(13):2656–2670, 2011.
A. Veloso, M. Otey, S. Parthasarathy, and W. Meira Jr.
Parallel and distributed frequent itemset mining on dynamic
datasets. In HiPC, pages 184–193, 2003.
I. Žliobaitė. Learning under concept drift: an overview.
CoRR, abs/1010.4784, 2010.
I. Žliobaitė, A. Bifet, G. Holmes, and B. Pfahringer. MOA
concept drift active learning strategies for streaming data.
JMLR, 17:48–55, 2011.
I. Žliobaitė, A. Bifet, B. Pfahringer, and G. Holmes. Active
algorithms for fast discovery of association rules. In
SIGKDD, pages 283–286, 1997.
[41] X. Zhu, P. Zhang, X. Lin, and Y. Shi. Active learning from
stream data using optimal weight classifier ensemble. IEEE
Transactions on Systems, Man, and Cybernetics, Part B:
Cybernetics, 40(6):1607–1621, 2010.
[42] Y. Zhu and D. Shasha. Efficient elastic burst detection in data
streams. In KDD, pages 336–345, 2003.
learning with evolving streaming data. In Machine Learning
and Knowledge Discovery in Databases, volume 6913, pages
597–612. 2011.
[38] I. Žliobaitė, A. Bifet, B. Pfahringer, and G. Holmes. Active
learning with drifting streaming data. IEEE Trans. on Neural
Networks and Learning Systems, PP(99):1–1, 2013.
[39] M. Zaki and K. Gouda. Fast vertical mining using diffsets. In
SIGKDD, pages 326–335, 2003.
[40] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New
646