000303069100010
Abstract—In this paper, we design and develop a movie-rating and review-summarization system in a mobile environment. The movie-rating information is based on the sentiment-classification result. The condensed descriptions of movie reviews are generated from the feature-based summarization. We propose a novel approach based on latent semantic analysis (LSA) to identify product features. Furthermore, we find a way to reduce the size of summary based on the product features obtained from LSA. We consider both sentiment-classification accuracy and system response time to design the system. The rating and review-summarization system can be extended to other product-review domains easily.

Index Terms—Feature extraction, natural language processing (NLP), text analysis, text mining.

Manuscript received July 13, 2010; revised November 4, 2010 and January 18, 2011; accepted February 28, 2011. Date of publication April 29, 2011; date of current version April 11, 2012. This work was supported in part by the National Science Council (NSC) under Grant NSC-99-2221-E-009-150 and Grant NSC-099-2811-E-009-041. This paper was recommended by Associate Editor G. I. Papadimitriou.
C.-L. Liu, W.-H. Hsaio, and C.-H. Lee are with the Department of Computer Science, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: clliu@mail.nctu.edu.tw; mr.papa@msa.hinet.net; chl@cs.nctu.edu.tw; badlaugh.cs96g@g2.nctu.edu.tw).
G.-C. Lu is with the Global Legal Division iTEC, Hon Hai Precision Industry Company Ltd., Taipei 236, Taiwan (e-mail: badlaugh.cs96g@g2.nctu.edu.tw).
E. Jou is with the Institute for Information Industry, Taipei 106, Taiwan (e-mail: emeryjou@iii.org.tw).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSMCC.2011.2136334

I. INTRODUCTION

PEOPLE’s opinion has become one of the extremely important sources for various services in ever-growing popular social networks. In particular, online opinions have turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities, and manage their reputations. Meanwhile, cellular phones have definitely become the most-vital part of our lives. There is no doubt that the mobile platform is currently one of the most popular platforms in the world. However, digital content displayed in cellular phones is limited in size, since cellular phones are physically small. Hence, a mechanism that can provide users with condensed descriptions of documents will facilitate the delivery of digital content in cellular phones. This paper explores and designs a mobile system for movie rating and review summarization in which semantic orientation of comments, the limitation of small display capability of cellular devices, and system response time are considered.

Practically, when we are not familiar with a specific product, we ask our trusted sources to recommend one. Today, the popularity of the Internet drives people to search for other people’s opinions from the Internet before purchasing a product or seeing a movie. Many websites provide user rating and commenting services, and these reviews could reflect users’ opinions about a product. For example, the customer-review section in Amazon.com lists the number of reviews, the percentage for different ratings, and comments from reviewers. When people want to purchase books, CDs, or DVDs, these comments and ratings usually influence their purchasing behaviors. In addition to these websites, a search engine is another important source for people to search for other people’s opinions. When a user enters a query into a search engine, the search engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document’s title and, sometimes, parts of the text.

Current search engines can efficiently help users obtain a result set, which is relevant to user’s query. However, the semantic orientation of the content, which is very important information in the reviews or opinions, is not provided in the current search engine. For example, Google will return around 7 380 000 hits for the query “Angels and Demons review.” If search engines can provide statistical summaries from the semantic orientations, it will be more useful to the user who polls the opinions from the Internet. A scenario for the aforementioned movie query may yield such report as “There are 10 000 hits, of which 80% are thumbs up and 20% are thumbs down.” This type of service requires the capability of discovering the positive reviews and negative reviews.

In recent years, the problem of “opinion mining” has seen increasing attention [1]–[3]. With the proliferation of reviews, ratings, recommendations, and other forms of online expression, online opinion could provide important information for businesses to market their products, identify new opportunities, and manage their reputations. For example, most recommendation systems attempt to alleviate information overload by identifying which items a user will find worthwhile, and collaborative filtering used in this process relies on the opinions of similar customers to recommend items [4]. Essentially, the task of determining whether a movie review is positive or negative is similar to the traditional binary-classification problem. Given a review, the classifier tries to classify the review into positive category or negative category. However, opinions in natural language are usually expressed in subtle and complex ways. Thus, the challenges may not be addressed by simple text-categorization approaches such as n-gram or keyword-identification approaches [5].

In this paper, we collected movie reviews from Internet Blogs that do not consist of any rating information. Sentiment analysis
is performed to determine the semantic orientation of the reviews and movie-rating score is based on the sentiment-analysis result. In addition to the accuracy of the classification, system response time is also taken into account in our system design. Although this paper focuses on movie review, the whole design is not only for movie-review domain. The same design can be applied to other domains such as restaurant, hotel, etc. Meanwhile, increasingly more cellular phones have begun using global positioning system (GPS) functionality, which can utilize user’s current location to provide enhanced services and make cellular phones become context aware. Moreover, the opinion-mining result can be used by recommendation systems to identify which items a user will find worthwhile. For example, when people want to have dinner with their friends, restaurant recommendation system can provide a restaurant list based on their current GPS location, opinion-mining result, and their preferences.

In cellular-phone environment, it is inappropriate to display detailed review due to the size of the screen. Hence, we employ summarization technique to reduce the size of information. The system will summarize the reviews (including positive reviews and negative reviews) and provide the user an overview about the reviews. Meanwhile, movie-review summarization is similar to customer review that focuses on product feature [6]. In this paper, we employ feature-based summarization for movie review. Product feature and opinion-word identification are essential to feature-based summarization. We propose a latent-semantic-analysis (LSA) based product-feature-identification approach to identify product features. Moreover, we extend the result to propose an LSA-based filtering mechanism, which can further reduce the size of the summarization according to the features.

The main contributions of this paper are the following.
1) Design and develop a movie-rating and review-summarization system in a mobile environment. We considered system response time issue to design the mobile application, and the same system design can be extended to other domains with a little modification.
2) Propose a novel approach based on LSA to identify product features. Product features and opinion words are used to select appropriate sentences to become a review summarization.
3) Propose an LSA-based filtering mechanism to allow the users to choose the features in which they are interested, and this mechanism could reduce the size of summary efficiently.

The rest of this paper is organized as follows. In Section II, related surveys are presented. In Section III, the LSA-based product feature identification approach is introduced. In Section IV, system design is presented. In Section V, several experiments are introduced. In Section VI, the conclusion is presented.

II. RELATED SURVEYS

A. Sentiment Analysis

Since a document is composed of sentences and a sentence is composed of terms, it is reasonable to determine the semantic orientation of the text from terms. As a result, the sentiment-analysis research started from the determination of the semantic orientation of the terms. Hatzivassiloglou and McKeown [7] employed textual conjunctions such as “fair and legitimate” or “simplistic but well-received” to separate similarly connoted and oppositely connoted words. Esuli and Sebastiani [3] proposed to determine the orientation of subjective terms based on the quantitative analysis of the glosses of such terms, i.e., the textual definitions that are given in online dictionaries. The process is based on the assumption that terms with similar orientation tend to have “similar” glosses (i.e., textual definitions). Thus, synonyms and antonyms could be used to define a relation of orientation. Esuli and Sebastiani [8] described SENTIWORDNET, which is a lexical resource in which each WordNet synset is associated with three numerical scores, i.e., Obj(s), Pos(s), and Neg(s), thus describing how objective, positive, and negative the terms contained in the synset are.

Traditionally, sentiment classification can be regarded as a binary-classification task [1], [2], [9]. Turney [2] proposed to determine the orientation of terms by bootstrapping from a pair of two minimal sets of “seed” terms by counting the number of hits returned from search engine with a NEAR operator. The NEAR operator requires these two phrases or terms to be within a specified word count of one another to be counted as a successful result. AltaVista search engine¹ allows the user to specify a word distance of his/her choice, but the maximum distance is ten words. The relationship between a given phrase and a set of seeds was used to place it into a positive or negative subjectivity class. Pang et al. [1] found out that standard machine learning outperforms human-proposed baselines. They employed naive Bayes, maximum-entropy classification, and support vector machines (SVMs) [10] to perform sentiment-classification task on movie-review data. According to their experiment, SVMs tended to do the best, and unigram with presence information turns out to be the most effective feature.

In recent years, some researchers have extended sentiment analysis to the ranking problem, where the goal is to assess review polarity on a multipoint scale [11]–[13]. Snyder and Barzilay [13] addressed the problem of analyzing multiple related opinions in a text and presented an algorithm that jointly learns ranking models for individual aspects by modeling the dependencies between assigned ranks. Goldberg and Zhu [12] proposed a graph-based semisupervised learning algorithm to address the sentiment-analysis task of rating inference and their experiments showed that considering unlabeled reviews in the learning process can improve rating inference performance.

B. Feature-Based Summarization

In product-review summarization, people are interested in the reasons why this product is worth buying rather than the principal meaning of the comment. Thus, feature-based summarization [6] is used in movie-review summarization. The feature-based summarization will focus on the product features on which the customers have expressed their opinions. In addition to product features, the summarization should include

¹ AltaVista: http://www.altavista.com/
LIU et al.: MOVIE RATING AND REVIEW SUMMARIZATION IN MOBILE ENVIRONMENT 399
² http://opennlp.sourceforge.net/

III. LATENT-SEMANTIC-ANALYSIS-BASED PRODUCT-FEATURE IDENTIFICATION

In this paper, we propose a novel approach based on LSA to identify related product-feature terms. Essentially, LSA is a theory and method to analyze relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA can be applied to any type of count data over a discrete dyadic domain, which is so-called two-mode data [16]. Supposing that a collection of documents $D = \{d_1, \ldots, d_n\}$ with terms from $W = \{w_1, \ldots, w_m\}$ are given, then the system can construct a cooccurrence matrix $M$, where its dimension is $n \times m$ and each entry $M_{ij}$ denotes the number of times the term $w_j$ occurred in document $d_i$. Each document $d_i$ is represented using a row vector, while each term $w_j$ is represented using a column vector. As shown in (1), LSA applies singular-value decomposition (SVD) to the term-document matrix $M$, and a low-rank approximation of the matrix $M$ could be used to determine patterns in the relationships between the terms and concepts contained in the text

\[ M = U \Sigma V^{T} \tag{1} \]

where $U$ and $V$ are matrices with orthonormal columns (i.e., $U^{T} U = V^{T} V = I$), and $\Sigma$ is a diagonal matrix whose diagonal elements are the singular values of $M$.

The original term-document matrix could be approximated by reducing the dimensions of the term–document space, and this will allow the underlying latent relationships between terms and documents to be exploited during searching. Equation (2) shows that the reduced matrix $\tilde{M}$ is obtained by reducing the dimensionality, where the system truncates the singular-value matrix $\Sigma$ to size $k$. It is this dimensionality-reduction step, i.e., the combining of surface information into a deeper abstraction, which captures the mutual implications of words and passages. Therefore, even though the original vector space is sparse, the corresponding low-dimensional space is typically not sparse. Practically, the number of dimensions retained in LSA is an empirical issue [17]. We conducted the experiments under different dimensions in the experiment section

\[ \tilde{M} = U \tilde{\Sigma} V^{T} \approx U \Sigma V^{T} = M. \tag{2} \]

Algorithm 1 shows the algorithm, where the inputs include a term-document matrix, several product-feature seeds, the reduced dimensionality in SVD operation, and the number of extracted features for each seed. In Algorithm 1, lines 3 and 4 are employed to perform linear algebra SVD operation on the term-document matrix, and lines 5–16 are used to compute the similarities between the seed product-feature vector and,
400 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 42, NO. 3, MAY 2012
pairwise, the other term vectors. The top ones will be collected as related product-feature terms for a specific product feature. The procedure getTermVectorFromTermDocMatrix is used to obtain the term-vector representation of a product feature. The seed is supposed to be one of the terms in the term-document matrix, and it is easy to obtain its corresponding document-vector representation. Meanwhile, sim in line 7 is used to store the similarities between the seed and the other terms. After sorting in descending order, it is easy to obtain the top ones and their corresponding feature names in procedure getTopRelatedFeatures.

When the above steps are completed, each product-feature seed can have its own semantically related term set. The advantage of this approach is that it could be applied to all the languages, it does not need any external dictionary, since LSA is language-independent, and it is based on linear algebra SVD operation.

IV. SYSTEM DESIGN

Fig. 1 shows the system flow. The input is a movie name and the system will use the movie name to retrieve reviews about this movie from movie-review database. These movie reviews

Fig. 1. Movie review and summarization flow.

A. Dataset

In this paper, we collected the Chinese movie reviews from Internet Blogs. Since the original data are a hypertext markup language (HTML) document, HTML-tag-removal process is required to extract the text information. Training data are necessary for SVM to train a classification model, and manual classification is performed to classify the training reviews into positive or negative reviews. We randomly selected 500 positive reviews and 500 negative reviews as the data for classification-model building. In addition to the model-building data, we further collected around 8000 movie reviews from the Internet, and these reviews will be used as movie-review database.

B. Sentiment Classification

As mentioned above, sentiment classification is similar to traditional binary-classification problem. Currently, many classification algorithms such as SVM [1], [10], [18], [19], decision trees [20], and neural networks [21] have been proposed and shown their capabilities in different domains. SVM is one of the state-of-the-art algorithms. SVM has been shown to be highly effective in traditional text categorization. SVM measures the complexity of hypotheses based on the margin with which they separate the data instead of the number of features. One remarkable property of SVM is that their ability to learn can be independent of the dimensionality of the feature space.

In natural-language processing (NLP) and information retrieval (IR), bag-of-words model tries to use an unordered collection of words to represent a text, disregarding grammar and even word order. In other words, each word in the text contributes to a feature of the document. In this paper, we employ similar approach to construct a feature vector of the document. Stop words are removed first and then each distinct word $W_i$ in the document is used to represent a feature. As a result, a document could be represented by a feature vector, and many machine-learning algorithms could be applied to perform classification tasks. We employed SVM to perform the classification and libsvm [22] package is used in the system. The kernel function used in the system is the radial basis function (RBF) and K-fold cross validation (i.e., K = 5) is conducted in the experiment.

The classification result will be the basis of the rating. With the proportion of positive and negative reviews, the system could provide the rating information to end users. For example, if there are 100 movie reviews for a specific movie and 80 reviews are positive, the rating of this movie will be four stars.
C. Review Summarization

1) Product-Feature Identification: As mentioned above, we propose an LSA-based product-feature-identification algorithm and system can obtain a semantically related feature set for each seed. We compared three product-feature-identification approaches, i.e., the LSA-based approach, frequency-based approach, and PLSA-based approaches, in the experiment section.

2) Opinion-Word Identification: In addition to feature identification, opinion words about the product features are important as well. Hu and Liu [6] extracted the opinion words by retrieving the nearby adjective of product features. In addition to language sentence-structure characteristic, Zhuang et al. [14] used the dependency grammar graph to find out some relations between feature words and the corresponding opinion words in training data. They both rely on language sentence structure to extract opinion words; therefore, these approaches will be applicable to those language sentences having such a characteristic. Many languages do not possess the aforementioned sentence structure. Hence, we propose to use a statistical approach to discover opinion words. First, we take into account POS-tagging information of the opinion words. According to our analysis, adjectives are usually used to describe sentiment in Chinese; therefore, these terms become the candidate opinion words. Second, term frequency is taken into account; therefore, frequency of the opinion words should exceed a threshold value. Let AVG be the average of sum of square of frequency of all items as shown in (3) below. A term $\mathrm{term}_i$ will be selected only if its square of frequency is equal to or larger than AVG. We manually selected positive and negative sentences from 500 positive reviews and 500 negative reviews, respectively. Positive opinion words and negative opinion words could be further obtained based on term frequency and POS tagging.

\[ S_f = \sum_{i=1}^{n} \{\mathrm{Frequency}(\mathrm{term}_i)\}^{2}, \qquad \mathrm{AVG} = S_f / n. \tag{3} \]

Fig. 2. Rating and summarization screenshot.

features. The system allows the user to determine the feature f in which he/she is interested. When the user determines f, the system will generate a summary, which is related to product features F.

Practically, a positive movie review may include negative comments about specific aspects and vice versa. In this paper, we propose to analyze the polarity of a movie review using SVM and analyze the polarity of a sentence using opinion words. In feature-based summarization, the system can utilize the polarity of opinion words to determine the polarity of sentences. Hence, the system can provide both positive- and negative-review summarization, regardless of the polarity of a review.

With the proportion of positive and negative reviews, the system could provide the rating information to end users. The rating information combined with review summary could give end users the rating and summarization information about the movie. The “Feature” section in Fig. 2 is a pull-down menu, which allows the users to choose the features in which they are interested. Meanwhile, positive summarization and negative summarization can be presented to users, regardless of a movie’s rating.
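
The frequency threshold in (3) can be sketched in a few lines of Python. The function name `select_opinion_words` and the input format (a mapping from candidate term to raw frequency) are assumptions for illustration, and the sketch assumes the candidates have already been restricted to adjectives by a POS tagger, which is not shown.

```python
from typing import Dict, List


def select_opinion_words(frequencies: Dict[str, int]) -> List[str]:
    """Keep candidate terms whose squared frequency is at least AVG.

    AVG is the mean of the squared frequencies over all n candidates,
    following (3): S_f = sum of Frequency(term_i)^2, AVG = S_f / n.
    Assumption: `frequencies` holds adjective candidates only.
    """
    n = len(frequencies)
    if n == 0:
        return []
    s_f = sum(freq * freq for freq in frequencies.values())
    avg = s_f / n
    # Sorted alphabetically only to make the output deterministic.
    return sorted(term for term, freq in frequencies.items()
                  if freq * freq >= avg)


if __name__ == "__main__":
    # Hypothetical counts: squares are 100, 36, 4, 1, so AVG = 141 / 4 = 35.25.
    counts = {"good": 10, "bad": 6, "dull": 2, "fine": 1}
    print(select_opinion_words(counts))  # ['bad', 'good']
```

Because the threshold uses squared frequencies, a few very frequent terms raise AVG sharply, so the selection tends to keep only the dominant candidates.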
TABLE III
SENTIMENT-CLASSIFICATION RESULTS USING PUBLIC MOVIE-REVIEW DATASET

B. Product-Feature Identification

In product-feature identification, we compared our LSA-based approach with two other approaches, which are frequency-based and PLSA-based. We performed experiments using the movie-review documents mentioned above, which is available at http://www.cs.cornell.edu/People/pabo/movie-review-data/. The dataset includes 1000 positive and 1000 negative movie reviews. Since nouns are the candidates of product features, only nouns will be used in this experiment and the total number of nouns is 29 632. In addition to movie-review dataset, we employed the movie-review glossary, which is available at http://www.movieprofiler.com/movieglossary, as the basis of the comparison. The movie-review glossary is created for movie reviewers, critics, and film students alike, as well as the general public interested in movie reviewing and film making-related terminology. The number of terminologies is 1069. Since many terminologies are only used in movie industry, additional filtering is applied to the dataset. Only the terms appearing in the movie-review data will be kept. The number of terminologies left is 383. A copy of the terminologies obtained from movieprofiler.com and the terminologies used in this paper are available at http://islab.cis.nctu.edu.tw/download/. Precision, recall, and F-value are employed to evaluate system performance.

In frequency-based approach, all the nouns are ranked according to their frequencies, and then, the top ones are selected as product features. Table IV shows the top ten terms using frequency-based approach. Frequency-based approach can identify the terms that are often used in movie reviews. Hence, the terms like story, character, and plot can be identified. In the LSA-based approach, Algorithm 1 is used to identify product features and the seeds include scene, plot, director, actor, and story. The truncated dimension of LSA is 500 in this paper. Table V shows the top ten features for each seed. In addition to product-feature identification, the top ten features for each seed can be regarded as being semantically related to the seed.

In PLSA-based approach, we applied PLSA [23] to the dataset. Essentially, PLSA is based on a mixture decomposition derived from a latent class model. The standard procedure for maximum-likelihood estimation in latent-variable models is the expectation–maximization (EM) algorithm [24], which includes the E-step and the M-step. In E-step, the posterior probabilities are computed for the latent variable $z$ based on the current estimates of the parameters. In M-step, the parameters are updated based on the posterior probabilities obtained in the previous E-step. When given each occurrence of a word $w \in W = \{w_1, \ldots, w_M\}$ in a document $d \in D = \{d_1, \ldots, d_N\}$, the E-step is given by

\[ P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\, P(z_l \mid d_i)}. \tag{4} \]
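
The PLSA E-step in (4) can be written out directly. The following Python sketch is illustrative only: the function name `plsa_e_step`, the nested-list layout of the parameters, and the uniform fallback for a zero denominator are assumptions, not details given in the text.

```python
from typing import List


def plsa_e_step(p_w_given_z: List[List[float]],
                p_z_given_d: List[List[float]]) -> List[List[List[float]]]:
    """Compute the posteriors P(z_k | d_i, w_j) of the PLSA E-step.

    p_w_given_z[k][j] holds P(w_j | z_k); p_z_given_d[i][k] holds P(z_k | d_i).
    The result is indexed as posterior[i][j][k].
    """
    num_topics = len(p_w_given_z)
    num_words = len(p_w_given_z[0])
    posterior = []
    for doc_topics in p_z_given_d:
        doc_posterior = []
        for j in range(num_words):
            # Numerator of (4) for every latent class z_k.
            joint = [p_w_given_z[k][j] * doc_topics[k]
                     for k in range(num_topics)]
            total = sum(joint)  # denominator of (4)
            if total > 0.0:
                doc_posterior.append([value / total for value in joint])
            else:
                # Assumption: fall back to a uniform posterior when the
                # word has zero probability under every latent class.
                doc_posterior.append([1.0 / num_topics] * num_topics)
        posterior.append(doc_posterior)
    return posterior


if __name__ == "__main__":
    # Hypothetical parameters: two latent classes, two words, one document.
    p_w_given_z = [[0.5, 0.5], [0.9, 0.1]]
    p_z_given_d = [[0.5, 0.5]]
    print(plsa_e_step(p_w_given_z, p_z_given_d)[0][0])
```

For each document-word pair the returned posteriors sum to one over the latent classes, which is the normalization expressed by the denominator of (4).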
TABLE VI
FIVE ASPECTS GENERATED USING PLSA
TABLE VII
CLUSTERING RESULT USING 20 NEWSGROUPS DATASET
TABLE VIII
THREE ASPECTS GENERATED USING PLSA (20 NEWSGROUPS DATASET)
and then predict the semantic orientation of the review. If 40 000 features are used, it would take around 120 s to load the model. Hence, we employed frequency criterion to reduce the number of features. Currently, our system uses 1902 features, and it takes less than 6 s to load model and predict the review.

In product-feature identification, the experiment shows that LSA-based approach outperforms frequency-based and PLSA-based approaches. As a by-product, our LSA-based system can identify a related term set for each seed. We propose an LSA-based filtering mechanism to employ these semantically related terms to reduce the size of summary. Only the sentences containing these terms will be presented to users. Moreover, the LSA-based product-feature-identification approach could be generalized to other product-review domains, since the linear algebra SVD operation could be applied to any language.

Meanwhile, we conducted an experiment on the truncated dimension of LSA. Several truncated-dimension values were used, and their results were compared with frequency-based approach. The experimental result shows that when the truncated dimension is more than 500, the differences are minor.

Moreover, we used 20 newsgroups dataset to evaluate PLSA’s clustering performance. The result shows that PLSA could outperform k-means and LSA. One of the important features of the newsgroup dataset is that the newsgroups in the experiment are highly unrelated. In other words, the boundaries between these aspects are very clear. However, the movie-review dataset does not possess such a characteristic. The articles in the movie review are similar, since they all focus on movie reviews. Hence, it might be the reason why PLSA could not determine the boundaries between the aspects of movie reviews.

Currently, feature-based summarization is sentence-level summarization. Although summary sentences are about product features and opinion words, these sentences are obtained from different paragraphs or movie reviews. It is obvious that a fluency problem exists in the summary. Thus, it will be our future work to achieve greater fluency of the summarization.

VI. CONCLUSION

In this paper, we design and implement a movie-rating and review-summarization system in mobile environment. Sentiment classification is applied to the movie reviews, and rating information is based on sentiment-classification results. In feature-based summarization, product-feature identification plays an essential role, and we propose a novel approach based on LSA to identify related product features. Moreover, we use a statistical approach to identify opinion words. Product features and opinion words will be used as the basis for feature-based summarization.

In a system-performance-analysis experiment, the number of features plays an important role in SVM-model loading and prediction. We use frequency criterion to reduce the number of features, and the experiment shows that it takes less than 6 s to load the SVM model and classify the reviews. Furthermore, we propose an LSA-based filtering approach to reduce the size of the summary based on the user’s preferred aspect. The design proposed in this paper could fully utilize the Internet content to provide a new product-review summarization and rating service. The design can also be extended to other product-review domains easily.

REFERENCES

[1] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: Sentiment classification using machine learning techniques,” in Proc. ACL-02 Conf. Empirical Methods Natural Lang. Process., 2002, pp. 79–86.
[2] P. D. Turney, “Thumbs up or thumbs down?: Semantic orientation applied to unsupervised classification of reviews,” in Proc. 40th Annu. Meeting Assoc. Comput. Linguist., 2002, pp. 417–424.
[3] A. Esuli and F. Sebastiani, “Determining the semantic orientation of terms through gloss classification,” in Proc. 14th ACM Int. Conf. Inf. Knowl. Manage., 2005, pp. 617–624.
[4] S. H. Choi, Y.-S. Jeong, and M. K. Jeong, “A hybrid recommendation method with reduced data for large-scale application,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 5, pp. 557–566, Sep. 2010.
[5] T. Mullen and N. Collier, “Sentiment analysis using support vector machines with diverse information sources,” in Proc. EMNLP, 2004, pp. 412–418.
[6] M. Hu and B. Liu, “Mining and summarizing customer reviews,” in Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2004, pp. 168–177.
[7] V. Hatzivassiloglou and K. R. McKeown, “Predicting the semantic orientation of adjectives,” in Proc. 8th Conf. Eur. Chap. Assoc. Comput. Linguist., Morristown, NJ: Assoc. Comput. Linguist., 1997, pp. 174–181.
[8] A. Esuli and F. Sebastiani, “SENTIWORDNET: A publicly available lexical resource for opinion mining,” in Proc. 5th Conf. Lang. Res. Eval., 2006, pp. 417–422.
[9] K. Dave, S. Lawrence, and D. M. Pennock, “Mining the peanut gallery: opinion extraction and semantic classification of product reviews,” in Proc. 12th Int. Conf. World Wide Web, New York: ACM, 2003, pp. 519–528.
[10] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[11] B. Pang and L. Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales,” in Proc. 43rd Annu. Meet. Assoc. Comput. Linguist, Morristown, NJ: Assoc. Comput. Linguist., 2005, pp. 115–124.
[12] A. B. Goldberg and X. Zhu, “Seeing stars when there aren’t many stars: Graph-based semi-supervised learning for sentiment categorization,” in Proc. TextGraphs: First Workshop Graph Based Methods Nat. Lang. Process, Morristown, NJ: Assoc. Comput. Linguist., 2006, pp. 45–52.
[13] B. Snyder and R. Barzilay, “Multiple aspect ranking using the good grief algorithm,” in Proc. HLT-NAACL, 2007, pp. 300–307.
[14] L. Zhuang, F. Jing, and X.-Y. Zhu, “Movie review mining and summarization,” in Proc. 15th ACM Int. Conf. Inf. Knowl. Manage., 2006, pp. 43–50.
[15] Y. Lu, C. Zhai, and N. Sundaresan, “Rated aspect summarization of short comments,” in Proc. 18th Int. Conf. World Wide Web, New York: ACM, 2009, pp. 131–140.
[16] T. Hofmann, J. Puzicha, and M. I. Jordan, “Learning from dyadic data,” in Proc. Conf. Adv. Neural Inform. Process. Syst. II, Cambridge, MA: MIT Press, 1999, pp. 466–472.
[17] T. K. Landauer, P. W. Foltz, and D. Laham, “Introduction to latent semantic analysis,” Discourse Processes, vol. 25, pp. 259–284, 1998.
[18] T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Norwell, MA: Kluwer, 2002.
[19] C. Silva, U. Lotrič, B. Ribeiro, and A. Dobnikar, “Distributed text classification with an ensemble kernel-based learning approach,” IEEE Trans. Syst., Man, Cybern. C: Appl. Rev., vol. 40, no. 3, pp. 287–297, May 2010.
[20] L. Rokach and O. Maimon, “Top-down induction of decision trees classifiers—A survey,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 35, no. 4, pp. 476–487, Nov. 2005.
[21] G. P. Zhang, “Neural networks for classification: A survey,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 30, no. 4, pp. 451–462, Nov. 2000.
[22] (2001). LIBSVM: A library for support vector machines [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[23] T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Mach. Learn., vol. 42, no. 1/2, pp. 177–196, 2001.
[24] A. P. Dempster, N. M. Laird, and D. B. Rubin. (1977). Maximum likelihood from incomplete data via the em algorithm. J. R. Stat. Soc., Series B [Online]. vol. 39, no. 1, pp. 1–38. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.133.4884.
[25] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York: Cambridge Univ. Press, 2008.
[26] D. Ramage, P. Heymann, C. D. Manning, and H. Garcia-Molina, “Clustering the tagged web,” in Proc. 2nd ACM Int. Conf. Web Search Data Mining, New York: ACM, 2009, pp. 54–63.

Chia-Hoang Lee received the Ph.D. degree in computer science from the University of Maryland, College Park, in 1983.
He is currently a Professor with the Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan. He was a Faculty Member with the University of Maryland and Purdue University, West Lafayette, IN. His current research interests include artificial intelligence, human–machine interface systems, natural-language processing, and opinion mining.