Survey of Sentiment Classification Techniques Used for Indian Regional Languages
ABSTRACT
Sentiment Analysis is a natural language processing task that extracts sentiment from various text forms
and classifies them according to positive, negative or neutral polarity. It analyzes emotions, feelings, and
the attitude of a speaker or a writer towards a context. This paper gives a comparative study of various
sentiment classification techniques and discusses in detail the two main categories of sentiment
classification techniques: machine learning based and lexicon based. The paper also presents the challenges
associated with sentiment analysis along with the lexical resources available.
KEYWORDS
NLP, sentiment, sentiment analysis, classification techniques, challenges, lexical resources, features,
machine learning, lexicon based.
1.INTRODUCTION
Sentiment Analysis (SA) is a natural language processing task that deals with finding orientation
of opinion in a piece of text with respect to a topic [1]. It deals with analyzing emotions, feelings,
and the attitude of a speaker or a writer from a given piece of text. Sentiment analysis involves
capturing a user's behaviour, likes and dislikes from the text. The main goal
behind sentiment analysis is to identify the sentiment associated with the text by extracting
its sentimental context.
The purpose of sentiment analysis is to determine the attitude or inclination of a communicator
through the contextual polarity of their speaking or writing. Their attitude may be reflected in
their own judgment, emotional state of the subject, or the state of any emotional communication
they are using to affect a reader or listener. It tries to determine a person's state of mind on
the subject they are communicating about. This information can be mined from various data
sources such as texts, tweets, blogs, social media, news articles and product comments.
There are different classification levels in SA: document-level, sentence-level and aspect-level.
Document-level SA aims to classify the opinion of a whole document as expressing a positive or
negative sentiment. Sentence-level SA aims to classify the sentiment expressed in each sentence,
which involves identifying whether the sentence is subjective or objective. Aspect-level SA aims to
classify the sentiment with respect to specific aspects of entities, which is done by identifying
the entities and their aspects. For instance, researchers need a tool to generate summaries for
deciding whether to read the entire document or not and for summarizing information searched by
a user on the internet. News groups can use multi-document summarization to cluster the information
from different media and summa.
DOI:10.5121/ijcsa.2015.5202
International Journal on Computational Sciences & Applications (IJCSA) Vol.5, No.2,April 2015
The paper presents a detailed survey of various sentiment classification techniques. Related work
and past literature are discussed in section 2. The baseline algorithm is defined in section 3 along
with the challenges associated with performing sentiment analysis. The two main categories of sentiment
classification techniques, machine learning based SA and lexicon based SA, are discussed in
detail in section 4 along with a comparison of the methods cited for Indian regional languages.
Finally, section 5 concludes the paper.
2.LITERATURE SURVEY
In this section we cite the relevant past literature of research work done in the field of sentiment
analysis for Indian languages.
Amitava Das and Sivaji Bandyopadhyay developed SentiWordNet for the Bengali language, which is an
automatically constructed lexical resource in which WordNet synsets are assigned a positive and a
negative score. SentiWordNet and the Subjectivity Word List are used to generate a merged sentiment
lexicon from which duplicate words are removed. Bengali SentiWordNet is created by applying
word-level lexical transfer, using an English-Bengali dictionary, to the content available in
SentiWordNet [4].
The authors of [1] developed a modified approach to identify the sentiments associated with Hindi
content by handling negation and discourse relations. When specific sentiment words were not
found in the existing Hindi SentiWordNet (HSWN), they updated it by
extracting words with the same meaning from the English SentiWordNet. Through the handling of negation and
discourse associated with the text, their proposed algorithm achieved approximately 80% accuracy on
the classification of Hindi reviews.
Amandeep Kaur and Vishal Gupta proposed an algorithm for sentiment analysis of Punjabi text [8].
They used the Hindi WordNet to develop a subjective lexicon for the Punjabi language,
drawing on three popular methods for generating a subjective lexicon: use of a bilingual
dictionary, machine translation and use of WordNet. They then devised an algorithm
combining the unigram method and a simple scoring method, which provides better efficiency. The
overall efficiency of the proposed algorithm is 54.2%.
Aditya Joshi and Pushpak Bhattacharyya [2] proposed a fallback strategy for finding the sentiment
associated with Hindi text. Machine translation based, in-language and resource based SA were
the three approaches proposed for SA in Hindi. Through WordNet linking, words present in the English
SentiWordNet were replaced by equivalent Hindi words to construct the Hindi SentiWordNet
(HSWN). For in-language SA, an SVM classifier was used to determine the polarity of the opinion
expressed in the text. In the machine translation based method, Google Translate was used to
translate the Hindi corpus into English, and the resulting corpus was used as input to the classifier to
determine polarity. In the resource based approach, each synset in the English
SentiWordNet is mapped to the corresponding synset in Hindi to build the SentiWordNet (H-SWN)
for Hindi. The best accuracy, 78.14%, was achieved through in-language sentiment analysis of Hindi
documents.
Kishorjit and Sivaji Bandyopadhyay proposed verb based Manipuri sentiment analysis [9] using
the conditional random field (CRF) approach. It is a supervised approach in which the
system learns from training data and can then be applied to other texts. The
text is first processed for part of speech (POS) tagging using CRF. With the help of the POS tagger, the verbs of each
sentence are identified, and a modified lexicon of verbs is used to determine the polarity of the
sentiment in the sentence, because the sentiment of a sentence is highly dependent on its verbs.
Their proposed algorithm achieved approximately 75% accuracy on sentiment analysis of
Manipuri text.
The authors of [7] used a self-learning neural network, which takes linguistic and part of speech emotive
features as input, to detect the sentiment in Tamil content. The primary inputs to the
neural network are: first, the outputs of a domain classifier, which retrieves nouns, verbs and term
domain frequencies; second, two schmaltzy analyzers, the Negation Scorer and the Flow
Scorer, which assign a score to each document based on the pleasantness of the words in it; and
lastly, three taggers for noun, verb and urichol. Unsupervised learning was selected and Hebbian
learning was incorporated, since emotion recognition has to deal with emotional intelligence. A
two-dimensional animated face generator is used to show the resultant emotion, which is identified by
assigning weights to features based on their affective influence.
Das and Bandyopadhyay [5] used different strategies to predict the sentiment of a word in a given
text. In one strategy, they manually annotated the words with their associated polarity. In
another strategy, a bilingual dictionary for English and
Indian languages was used to determine the polarity of the text. In the next strategy, the synonym and antonym relations of words in
WordNet were used to determine the polarity. The final strategy used learning from pre-annotated
corpora to find the polarity of the text.
The authors of [6] used a sentiment-annotated lexicon based approach for extracting sentiment
from Urdu text, with the analysis based on SentiUnits. SentiUnits are
expressions made of one or more words which carry the sentiment information of the whole
sentence. Shallow parsing is used for the identification and extraction of SentiUnits from the given
text. Two types of SentiUnits were used: single adjective phrases and multiple adjective
phrases. Adjectives, modifiers and orientation are the three attributes associated with SentiUnits.
The process of sentiment analysis is composed of three phases: pre-processing, where normalization
and segmentation of the text are performed; shallow parsing, which extracts the
SentiUnits; and classification, where the extracted SentiUnits are compared with the lexicon and their
polarities are calculated as positive or negative, with the overall polarity obtained
by combining these polarities.
The authors of [3] built a robust sense based classifier, which is a supervised document level
sentiment classifier working in a semantic space based on WordNet senses. Through the use of
similarity metrics, an unknown synset in the test set was replaced by a similar synset from the training set.
Words in the corpus were annotated with their senses using combinations of manual sense
annotation and automatic iterative WSD. In the synset replacement algorithm, if a synset encountered in
a test document is not found in the training corpus, it is replaced by one of the synsets present in
the training corpus. The substitute synset is determined on the basis of its similarity with the
synset in the test document. The synset that is replaced is referred to as an unseen synset, as it is
not known to the trained model.
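The synset replacement step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [3]: the synset identifiers and the similarity scores in SIMILARITY are hypothetical, and the lookup table stands in for whichever WordNet similarity metric is actually used.

```python
# Hypothetical pairwise similarity scores between synset identifiers
# (values are made up for illustration only).
SIMILARITY = {
    frozenset({"joyful.a.01", "happy.a.01"}): 0.9,
    frozenset({"joyful.a.01", "sad.a.01"}): 0.2,
}


def similarity(synset_a, synset_b):
    """Look up a symmetric similarity score; unknown pairs score 0."""
    return SIMILARITY.get(frozenset({synset_a, synset_b}), 0.0)


def replace_unseen_synsets(test_synsets, training_synsets):
    """Replace each test synset unseen in training by its most similar training synset."""
    known = sorted(training_synsets)  # fixed order so ties are resolved deterministically
    result = []
    for synset in test_synsets:
        if synset in training_synsets:
            result.append(synset)
        else:
            result.append(max(known, key=lambda candidate: similarity(synset, candidate)))
    return result


training = {"happy.a.01", "sad.a.01"}
replaced = replace_unseen_synsets(["joyful.a.01", "sad.a.01"], training)
```

Here the unseen synset "joyful.a.01" is mapped to "happy.a.01", while "sad.a.01" is kept because it already occurs in the training corpus.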
The authors of [10] proposed Cross-Lingual Sentiment Analysis (CLSA) using WordNet senses as
features for supervised sentiment classification where machine translation is not available
between the specific languages. The concept of linked WordNets is used to bridge the gap
between the two languages. The WordNets of Hindi and Marathi were developed using an expansion
approach and share the same synset identifiers, so the words in corresponding synsets represent
translations of each other in specific contexts. Words of the training as well as the test corpus
were mapped to their WordNet synset identifiers, which yields new corpora represented in a common
feature space, i.e. the sense space. A classification model is learnt on the training
corpus and tested on the test corpus. Accuracies of 72% and 84% were obtained for Hindi and Marathi
respectively in sentiment classification of these languages.
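The idea of a shared sense space can be illustrated with a small sketch. The two dictionaries below are hypothetical stand-ins for linked Hindi and Marathi WordNets, the synset identifiers are invented, and each word is assumed to have a single sense, so no word sense disambiguation is performed.

```python
# Hypothetical word-to-synset mappings; linked WordNets share synset identifiers.
HINDI_SENSES = {"अच्छा": "SYN_GOOD", "बुरा": "SYN_BAD"}
MARATHI_SENSES = {"चांगला": "SYN_GOOD", "वाईट": "SYN_BAD"}


def to_sense_space(tokens, word_to_synset):
    """Map each known word to its synset identifier; unknown words are dropped."""
    return [word_to_synset[token] for token in tokens if token in word_to_synset]


hindi_features = to_sense_space(["अच्छा", "बुरा"], HINDI_SENSES)
marathi_features = to_sense_space(["चांगला", "वाईट"], MARATHI_SENSES)
```

Because both documents end up as sequences of the same identifiers, a classifier trained on features from one language can be applied to features from the other.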
3.SENTIMENT ANALYSIS
Sentiment analysis is the computational study of emotions, opinions and, mainly, the sentiment
expressed in a text by its author. It is a difficult task because of the many challenges
that arise when processing natural language. Any sentiment analysis system first needs
to extract features, i.e. sentimental words or phrases, from the given text; a
suitable text classifier then determines the overall sentiment associated with the text.
We first have to tokenize the sentence, which involves breaking a stream of text up into words,
phrases, symbols, or other meaningful elements called tokens. This is done by
splitting the text at spaces and punctuation marks, which forms a bag of words. Care should be taken so
that short forms such as don't, I'll and she'd remain as one word.
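The tokenization step described above can be sketched as follows. The regular expression is an assumption chosen for illustration: it keeps internal apostrophes inside a word and emits every other punctuation mark as its own token. It is only exercised here on the English short forms mentioned in the text; scripts that rely on combining vowel signs may need a different pattern.

```python
import re
from collections import Counter

# Hypothetical pattern: a word may contain internal apostrophes (don't, I'll);
# any other non-space, non-word character becomes a separate token.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens, keeping short forms whole."""
    return TOKEN_PATTERN.findall(text)


def bag_of_words(tokens):
    """Count lower-cased word tokens, ignoring punctuation tokens."""
    return Counter(token.lower() for token in tokens if re.match(r"\w", token))


tokens = tokenize("I don't like it, she'd say.")
counts = bag_of_words(tokenize("Good, good film"))
```

The first call keeps "don't" and "she'd" as single tokens, and the second shows how the token list collapses into an unordered bag of words.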
Then the important features are identified, such as term presence and frequency, parts of
speech (POS), opinion words and phrases, and negations. We need to take care of
negations, since they reverse polarities, and decide whether we want to use only adjectives,
adjectives plus adverbs, or simply all the words as features. Lexicon-based or statistical feature
selection methods can be used to select features from documents, treating each document as a Bag of
Words (BOW) or as a string. Stemming and removal of stop words are the most common feature
selection steps.
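A minimal sketch of this feature identification step is given below. The stop word and negation lists are tiny hypothetical examples, and marking every word after a negation with a NOT_ prefix until the next punctuation mark is one common convention assumed here, not a method prescribed by the surveyed papers.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is", "was", "it", "this"}  # hypothetical, tiny list
NEGATIONS = {"not", "no", "never"}                    # hypothetical
CLAUSE_END = {".", ",", ";", "!", "?"}


def mark_negation(tokens):
    """Drop stop words and prefix NOT_ to words between a negation and the next punctuation mark."""
    marked, negating = [], False
    for token in tokens:
        lower = token.lower()
        if lower in CLAUSE_END:
            negating = False      # the scope of a negation ends at punctuation
        elif lower in NEGATIONS:
            negating = True
        elif lower not in STOP_WORDS:
            marked.append("NOT_" + lower if negating else lower)
    return marked


def extract_features(tokens):
    """Return (term presence, term frequency) for a tokenized document."""
    frequency = Counter(mark_negation(tokens))
    presence = {term: 1 for term in frequency}
    return presence, frequency


sample = ["This", "film", "is", "not", "good", ",", "but", "the", "music", "is", "good"]
presence, frequency = extract_features(sample)
```

With this marking, the negated "good" and the plain "good" become two different features, so a classifier or lexicon lookup can treat them with opposite polarity.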
After tokenizing and deciding which features to use, we need to classify the sentiment: is it
good or bad? Classification can be done with different algorithms, for example Naive Bayes,
Support Vector Machines or Maximum Entropy. Lexical resources like dictionaries, WordNet and
SentiWordNet are used by these classifier algorithms. This technique attempts to determine
whether a text is objective or subjective and whether a subjective text contains positive or
negative sentiments. The system automatically collects, clusters, categorizes, and summarizes
news from several sites on the web on a daily basis. A summarization machine can be viewed as a
system which accepts either a single document or multiple documents or a query as an input and
produces an abstract or extract summary.
Supervised learning methods depend on the existence of labelled training documents.
The supervised learning process has two steps: learning (training), in which a model is learnt from the
training data, and testing, in which the model is tested on unseen test data to assess its accuracy.
There are different types of supervised classifiers, such as rule-based classifiers, decision
tree classifiers, linear classifiers and probabilistic classifiers.
a. Probabilistic Classifier
Probabilistic classifiers use mixture models for classification, assuming that each
class is a component of the mixture. Each mixture component is a
generative model which provides the probability of sampling a particular term for that
component.
1.Naive Bayes Classifier (NB)
The Naive Bayes classification model computes the posterior probability of a class
based on the way words are distributed in the particular
document. The positions of the words in the document are not considered for classification in this
model, as it uses the BOW feature extraction technique. Bayes' theorem is used to predict the
probability that a given feature set belongs to a particular label:
$$P(\text{label} \mid \text{features}) = \frac{P(\text{label}) \cdot P(\text{features} \mid \text{label})}{P(\text{features})}$$
P(label) signifies the prior probability of a label. P(features | label) signifies the likelihood, i.e. the probability
of observing a particular feature set given that label. P(features) specifies the probability
that a given feature set occurs. On the basis of the naive assumption, i.e. that all features
are independent given the label, the equation can be rewritten as:
$$P(\text{label} \mid \text{features}) = \frac{P(\text{label}) \cdot P(f_1 \mid \text{label}) \cdot \ldots \cdot P(f_n \mid \text{label})}{P(\text{features})}$$
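The factorised equation can be turned into a small classifier as sketched below. Several choices here are assumptions made for illustration: add-one smoothing is used so that unseen words do not give zero probability, log probabilities are summed instead of multiplying raw probabilities, and P(features) is omitted because it is the same for every label and does not change which label scores highest. The training documents are toy data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Bag-of-words Naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, documents, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {word for counts in self.word_counts.values() for word in counts}
        return self

    def log_score(self, tokens, label):
        """log P(label) + sum of log P(f_i | label); P(features) is left out."""
        total_docs = sum(self.label_counts.values())
        score = math.log(self.label_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for token in tokens:
            if token not in self.vocabulary:
                continue  # words never seen in training carry no evidence
            count = self.word_counts[label][token] + 1
            score += math.log(count / (total_words + len(self.vocabulary)))
        return score

    def predict(self, tokens):
        return max(self.label_counts, key=lambda label: self.log_score(tokens, label))


train_docs = [["good", "great", "film"], ["bad", "boring", "film"],
              ["great", "acting"], ["boring", "plot"]]
train_labels = ["pos", "neg", "pos", "neg"]
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["great", "film"])
```

On this toy data the document containing "great" and "film" is assigned the label "pos", because "great" occurs only in the positive training documents.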
2.Bayesian Network (BN)
The Bayesian Network model is a directed acyclic graph in which nodes represent
random variables and edges represent conditional dependencies. A complete
joint probability distribution (JPD) is specified for the model, as it is regarded as a complete
model of the variables and their relationships. BN is not frequently used for text mining, as
its computational complexity is very high.
where support signifies absolute number of instances present in the training data set which are
relevant to the rule specified and Confidence refers to the conditional probability that the right
hand side of the rule is satisfied when the left-hand side is satisfied for a given input.
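The two measures can be computed as in the sketch below. The rule form is an assumption: the left-hand side is taken to be a set of words that must all occur in a document and the right-hand side a class label, and support is interpreted as the number of training instances that satisfy both sides. The training instances are toy data.

```python
def support_and_confidence(instances, required_words, label):
    """Return (support, confidence) of the rule: required_words present -> label."""
    # Labels of all instances that satisfy the left-hand side of the rule.
    lhs_labels = [lab for words, lab in instances if required_words <= set(words)]
    # Assumed reading of support: instances satisfying both sides of the rule.
    support = sum(1 for lab in lhs_labels if lab == label)
    confidence = support / len(lhs_labels) if lhs_labels else 0.0
    return support, confidence


instances = [
    (["good", "film"], "pos"),
    (["good", "plot"], "pos"),
    (["good", "but", "boring"], "neg"),
    (["bad"], "neg"),
]
support, confidence = support_and_confidence(instances, {"good"}, "pos")
```

For the rule "good -> pos", three instances satisfy the left-hand side and two of them carry the label, so the confidence is two thirds.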
4.1.2.Unsupervised learning
Unsupervised learning deals with finding hidden structure in an unlabelled data set. There is no
error or reward signal to evaluate a potential solution, as the examples given to the learner are
unlabelled. Unsupervised learning methods are useful when the documents to classify
are unlabelled. k-nearest neighbour (KNN) is an instance-based machine learning algorithm in which
an object is assigned the class that is most common among its k nearest neighbours. Objects are
thus classified based on their similarity to objects in the training data. The selection
process is based on either majority voting or distance-weighted voting.
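Both voting schemes can be sketched in a few lines. The feature vectors below are hypothetical, for instance counts of positive and negative words per document, and Euclidean distance is an assumed similarity measure; the small constant added to the distance only avoids division by zero.

```python
import math
from collections import defaultdict


def knn_classify(training, query, k=3, weighted=False):
    """Classify query by majority or distance-weighted vote of its k nearest neighbours."""
    neighbours = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for vector, label in neighbours:
        distance = math.dist(vector, query)
        # Majority voting counts each neighbour once; weighted voting favours closer ones.
        votes[label] += 1.0 / (distance + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


# Hypothetical vectors: (number of positive words, number of negative words).
training = [((3, 0), "pos"), ((2, 1), "pos"),
            ((0, 3), "neg"), ((1, 2), "neg"), ((0, 2), "neg")]
majority_label = knn_classify(training, (2, 0), k=3)
weighted_label = knn_classify(training, (2, 0), k=3, weighted=True)
```

For the query (2, 0) the three nearest neighbours are two positive examples and one negative example, so both voting schemes return "pos".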
4.2.Lexicon-based approach
The lexicon-based approach involves calculating the sentiment polarity of a review using the
semantic orientation of the words or sentences in the review. The semantic orientation is a measure of
subjectivity and opinion in text. A sentiment lexicon contains lists of words and expressions used to
express people's subjective feelings and opinions. For example, one starts with lexicons of positive and negative
words and analyzes the document whose sentiment needs to be found. If the document has
more positive words than negative words, it is positive; otherwise it is negative. The lexicon-based approach
to sentiment analysis is unsupervised [13] because it does not require prior training in
order to classify the data.
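The word-counting example above can be written as a short sketch. The two word lists are tiny hypothetical lexicons, and returning "neutral" on a tie is an assumption added here; the description above only distinguishes positive from negative.

```python
POSITIVE_WORDS = {"good", "great", "excellent"}  # hypothetical positive lexicon
NEGATIVE_WORDS = {"bad", "boring", "poor"}       # hypothetical negative lexicon


def lexicon_polarity(tokens):
    """Label a tokenized document by comparing counts of positive and negative words."""
    positive = sum(1 for token in tokens if token.lower() in POSITIVE_WORDS)
    negative = sum(1 for token in tokens if token.lower() in NEGATIVE_WORDS)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"  # assumed tie-breaking rule


review_label = lexicon_polarity(["Great", "acting", "but", "boring", "plot", "good", "music"])
```

No training data is involved: the label depends only on the lexicon entries found in the document.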
Sentiment lexicons are constructed through manual construction, corpus-based methods and
dictionary-based methods. The manual construction of a sentiment lexicon is
difficult and time-consuming, as it requires humans to assign polarities to sentimental words
by hand. The dictionary-based method is an iterative technique: a small
set of sentimental words is first selected manually, and this seed set then grows iteratively
by adding synonyms and antonyms from WordNet. The iterative process continues
until no new words remain to be added to the seed list. The dictionary-based approach has the
limitation that it cannot find opinion words with domain-specific orientations. Corpus-based
techniques rely on syntactic patterns in large corpora. Corpus-based methods can produce opinion
words with relatively high accuracy. Most of these corpus-based methods need very large labelled
training data, but they make it easy to find domain-specific opinion words and the orientations of these
words towards a context.
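The iterative dictionary-based expansion can be sketched as follows. The SYNONYMS and ANTONYMS tables are small hypothetical stand-ins for WordNet relations, and the rule that synonyms inherit the polarity of a word while antonyms receive the opposite polarity is the usual assumption behind this method.

```python
# Hypothetical stand-ins for WordNet synonym and antonym relations.
SYNONYMS = {"good": {"fine", "nice"}, "nice": {"pleasant"},
            "bad": {"poor"}, "poor": {"awful"}}
ANTONYMS = {"good": {"bad"}, "bad": {"good"}}


def expand_lexicon(seed_polarity):
    """Grow a seed lexicon (word -> +1 or -1) until no new words can be added."""
    lexicon = dict(seed_polarity)
    frontier = list(lexicon)
    while frontier:
        word = frontier.pop()
        polarity = lexicon[word]
        # Synonyms keep the polarity of the word, antonyms get the opposite one.
        related = [(synonym, polarity) for synonym in SYNONYMS.get(word, ())]
        related += [(antonym, -polarity) for antonym in ANTONYMS.get(word, ())]
        for other, other_polarity in related:
            if other not in lexicon:
                lexicon[other] = other_polarity
                frontier.append(other)
    return lexicon


lexicon = expand_lexicon({"good": 1})
```

Starting from the single seed word "good", the loop stops once every reachable synonym and antonym has been assigned a polarity.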
The following tables present a comparison of the sentiment classification techniques cited for Indian
regional languages.
Table 1. Sentiment Analysis techniques used in Hindi text.
Type: Hindi [2]
Corpus: 250 Hindi movie reviews and English movie reviews
Classification technique: In-language using SVM; Machine Translation based using SVM; Resource-based using SentiWord list
Features: Term frequency; Term presence; TF-IDF; TF-IDF
Accuracy: 74.57%; 72.57%; 78.14%; 65.96%; 56.35%; 60.31%
Benefit:
Limit: The error of the machine translation system affects the performance of MT-based SA.
Type: Hindi [1]
Corpus: Hindi movie reviews
Classification technique: Semantic approach using HindiSentiWordNet (HSWN)
Features: Improved HSWN; Improved HSWN + negation; Improved HSWN + negation + discourse
Accuracy: 69.78%; 78.39%; 80.21%
Benefit:
Limit:
Table 3. Sentiment Analysis techniques used in Manipuri text.
Type: Manipuri [9]
Corpus: Manipuri newspaper, 2,75,000 words in total
Part Of Speech(POS)
Accuracy
Benefit
Limit
Type: Punjabi [8]
Corpus: Documents written in Punjabi language
Part of Speech (POS)
Accuracy: Recall: 70, Precision: 78, F-Score: 67
Benefit
Limit
Better accuracy
Performance is low.
Lexicon developed for Hindi language has limited coverage.
Type: Tamil [7]
Corpus: Tamil news text
Part of Speech (POS)
Accuracy: Precision: 60
Benefit
Limit
Table 6. Sentiment Analysis techniques used in Urdu text.
Type: Urdu [6]
Corpus: Movie; Product
Part of Speech (POS)
Accuracy: 72%; 78%
Benefit:
Limit:
5.CONCLUSIONS
Sentiment analysis makes it possible to determine the attitude or inclination of a communicator through the
contextual polarity of their speaking or writing. Sentiments can be mined from texts, tweets,
blogs, social media, news articles, comments or any other source of information.
Sentiment analysis has become quite popular and has led to the building of better products,
a better understanding of users' opinions, and better execution and management of business decisions. People rely on
reviews and opinions and make decisions based on them. This research area has given more importance
to mass opinion than to word-of-mouth.
A large amount of work in sentiment analysis has been done for the English language, as English is a
global language, but there is a need to perform sentiment analysis in other languages as well. A large
amount of content in other languages is available on the Web and needs to be mined to
determine its sentiment.
ACKNOWLEDGEMENTS
I take this opportunity to express my gratitude to all the people who contributed in
some way to the work described in this paper. My deepest thanks go to my project guide for giving
timely inputs and intellectual freedom in my work. I also thank the head of the
computer department and the principal of Pillai Institute of Information Technology, New
Panvel, for extending their support.
REFERENCES
[1] Mittal, Namita, et al. "Sentiment Analysis of Hindi Review based on Negation and Discourse
Relation," 11th Workshop on Asian Language Resources (ALR), In Conjunction with IJCNLP. 2013.
[2] Joshi, Aditya, A. R. Balamurali, and Pushpak Bhattacharyya, "A fall-back strategy for sentiment
analysis in Hindi: a case study," Proceedings of the 8th ICON (2010).
[3] Balamurali, A. R., Aditya Joshi, and Pushpak Bhattacharyya, "Robust sense-based sentiment
classification," Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and
Sentiment Analysis, Association for Computational Linguistics, 2011.
[4] Das, Amitava, and Sivaji Bandyopadhyay. "SentiWordNet for Bangla," Knowledge Sharing Event-4:
Task 2 (2010).
[5] Das, Amitava, and Sivaji Bandyopadhyay, "SentiWordNet for Indian languages," The 8th Workshop
on Asian Language Resources. 2010.
[6] Syed, Afraz Z., Muhammad Aslam, and Ana Maria Martinez-Enriquez. "Lexicon based sentiment
analysis of Urdu text using SentiUnits," Advances in Artificial Intelligence, Springer Berlin
Heidelberg, 2010.
[7] Giruba Beulah and Madhan Karky, "On Emotion Detection from Tamil Text."
[8] Kaur, Amandeep, and Vishal Gupta, "Proposed Algorithm of Sentiment Analysis for Punjabi Text,"
Journal of Emerging Technologies in Web Intelligence 6.2 (2014): 180-183.
[9] Nongmeikapam, Kishorjit, et al. "Verb Based Manipuri Sentiment Analysis," 2014.
[10] Balamurali, A. R. "Cross-lingual sentiment analysis for Indian languages using linked wordnets."
(2012).
[11] Pang, Bo, and Lillian Lee. "Opinion mining and sentiment analysis," Foundations and trends in
information retrieval 2.1-2 (2008): 1-135.
[12] Strapparava, Carlo, Alessandro Valitutti, and Oliviero Stock. "The affective weight of lexicon,"
Proceedings of the Fifth International Conference on Language Resources and Evaluation. 2006.
[13] VOHRA, MRSM, and JB TERAIYA, "A Comparative Study of Sentiment Analysis Techniques,"
JIKRCE, 2013.
[14] Medhat, Walaa, Ahmed Hassan, and Hoda Korashy. "Sentiment analysis algorithms and applications:
A survey." Ain Shams Engineering Journal (2014).
[15] Remus, Robert, Uwe Quasthoff, and Gerhard Heyer, "SentiWS - A Publicly Available
German-language Resource for Sentiment Analysis," LREC, 2010.
[16] Esuli, Andrea, and Fabrizio Sebastiani, "SentiWordNet: A publicly available lexical resource for
opinion mining," Proceedings of LREC, 2006.
[17] http://www.cfilt.iitb.ac.in/resources/surveys/SA-Literature%20Survey-2012-Akshat.pdf.
[18] https://class.coursera.org/nlp/lecture/145
[19] http://wsd.nlm.nih.gov
Authors
Pooja Pandey is currently a graduate student pursuing a master's degree in Computer
Engineering at PIIT, New Panvel, University of Mumbai, India. She received
her B.E. in Computer Engineering from University of RGPV. She has 2 years of
experience in the field of teaching. Her areas of interest are natural language
processing, theory of computation and ethical hacking.
Sharvari Govilkar is an Associate Professor in the Computer Engineering Department at
PIIT, New Panvel, University of Mumbai, India. She received her M.E. in
Computer Engineering from University of Mumbai. She is currently pursuing her
PhD in Information Technology at University of Mumbai. She has 17 years
of experience in teaching. Her areas of interest are text mining, natural language
processing, compiler design and information retrieval.