Arabic Sentiment Analysis Using Supervised Classification
Islam Qarqaz
Department of Computer Science
Jordan University of Science and Technology
Irbid 22110, Jordan
Abstract—Sentiment analysis is a process during which the polarity (i.e., positive, negative, or neutral) of a given text is determined. In general, there are two approaches to this problem: the machine learning approach and the lexicon-based approach. The current paper deals with sentiment analysis of Arabic reviews from a machine learning perspective. Three classifiers were applied to an in-house developed dataset of tweets and comments. In particular, the Naïve Bayes, SVM, and K-Nearest Neighbor classifiers were run on this dataset. The results show that SVM gives the highest precision while KNN (K=10) gives the highest recall.
Keywords—sentiment analysis; sentiment classification; opinion mining; text mining; Arabic language.
I. INTRODUCTION
Sentiment analysis or opinion mining is a field of study that attempts to analyze people's opinions, sentiments, attitudes, and emotions toward entities such as products, services, and organizations. The expression sentiment analysis first appeared in [11] (Nasukawa and Yi, 2003), and the expression opinion mining first appeared in [10] (Dave, Lawrence, and Pennock, 2003). Many authors use the two phrases interchangeably, and we do the same in this work.
Sentiment analysis can be viewed as a classification process that determines whether a given document or text expresses a positive or a negative opinion. Sentiment classification is useful in business intelligence applications and recommender systems [20]. There are also potential applications to message filtering [19].
In recent years, sentiment analysis has gained considerable attention, and its applications have spread to almost every possible domain. Many systems and applications have been built for English and other Indo-European languages. However, few studies have focused on Arabic, which is the native language of more than 300 million speakers. This paper studies sentiment analysis of public Arabic tweets and comments on social media using classification models built with RapidMiner [16], an open-source data mining and machine learning platform.
The rest of this paper is structured as follows: Section II describes related work. Section III presents the software used in this study together with the dataset. Section IV presents supervised classification. Section V discusses the experiments and the analysis of the results. Finally, Section VI presents the conclusions of this study and highlights some future work.
II. RELATED WORK

El-Halees [7] presented a combined approach for extracting opinions from Arabic documents. The approach consists of three methods: first, a lexicon-based method classifies an initial set of documents; second, the classified documents are used as a training set for a Maximum Entropy model; finally, a K-Nearest Neighbor classifier is used to classify the remaining documents.
Rushdi-Saleh, Martín-Valdivia, Ureña-López, and Perea-Ortega presented an Opinion Corpus for Arabic (OCA) in [8]. The corpus is composed of Arabic reviews extracted from specialized web pages on movies and films. They employed two classifiers, namely Support Vector Machines and Naïve Bayes.
Al-Subaihin, Al-Khalifa, and Al-Salman [9] proposed a sentiment analysis tool for modern Arabic that uses human-based computing to construct, and dynamically develop and maintain, the tool's lexicons. They also examined the problems of conducting sentiment analysis on Arabic text on the World Wide Web, and the solution they proposed is a lexicon-based approach.
III. THE SOFTWARE AND THE DATASET

A. RapidMiner

RapidMiner [16] is a Java-based open-source data mining and machine learning platform. It provides a graphical user interface (GUI) in which users can design machine learning processes without writing code. A process is then stored as an XML (eXtensible Markup Language) file that defines the operations to be applied to the given dataset. Perhaps one of the most valuable extensions to RapidMiner is the Text Processing package, which includes many operators that support text mining; for example, there are operators for tokenization, stemming, stopword filtering, and n-gram generation. The main reason for choosing RapidMiner is that the Text Processing package can handle the Arabic language.
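To make the pipeline concrete, here is a minimal Python sketch of a comparable preprocessing chain (tokenization, stopword filtering, stemming, and n-gram generation) built with NLTK; this is an illustrative stand-in, not the RapidMiner operators themselves, and it assumes the NLTK stopword corpus has been downloaded.

```python
# A sketch of a preprocessing chain analogous to RapidMiner's Text
# Processing operators, using NLTK (requires: nltk.download('stopwords')).
from nltk import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer  # a light stemmer for Arabic
from nltk.util import ngrams

def preprocess(text, n=2):
    stemmer = ISRIStemmer()
    arabic_stops = set(stopwords.words('arabic'))
    # Tokenize and drop stopwords, then stem the surviving tokens.
    tokens = [t for t in wordpunct_tokenize(text) if t not in arabic_stops]
    stems = [stemmer.stem(t) for t in tokens]
    # Emit unigrams plus joined n-grams, mirroring "Generate n-Grams".
    return stems + ['_'.join(g) for g in ngrams(stems, n)]
```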
B. The Dataset
We generated our dataset by collecting tweets and Facebook comments from the Internet. These tweets and comments address general topics such as education, sports, and political news. For the tweets, we utilized the crawler and annotation tool presented in [12]. The authors in [12] designed a crawler to collect tweets from Twitter and relied on crowdsourcing for tweet annotation. Initially, 10,000 tweets were collected and annotated. When the collected tweets were carefully examined, we realized that they suffered from several problems: a high number of duplicate tweets, most likely the result of re-tweeting, and automatically collected tweets that were empty or contained only the address of the sender. Such tweets were removed from our dataset.
We also manually collected 500 comments from Facebook. Many of these comments were removed, either because they were written in Arabizi, a style widely used by Arab users of social media in which Roman letters are used to write Arabic words, or because the comment consisted of emoticons only. Table 1 shows the number of tweets and comments that remained, together with their sentiment orientation.
Table 1: Sentiment orientation of the remaining tweets and comments.

                  Positive   Negative   Total
Tweets/Comments   1073       1518       2591
The crowdsourcing tool presented in [12] was used to label the tweets. Volunteers have to create a username and password to use the tool. Once they log onto the system, one tweet or comment appears at a time, and the user chooses a label from (1) Positive, (2) Negative, (3) Neutral, and (4) Other.
Positive tweets are given the label "1", negative tweets the label "-1", and neutral tweets the label "0". If a tweet is empty or otherwise problematic, the option "Other" is used and the tweet is deleted from the dataset. Every tweet had to be rated by at least three different users, and majority voting was used to assign its final label. As a quality assurance measure, one of the raters was one of the researchers in [12]. After labelling was complete, we stored the positive tweets in one file and the negative tweets in another. The authors of the current paper manually labeled the Facebook comments.
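As a minimal sketch of the majority-voting step, assuming each tweet carries the list of labels assigned by its (at least three) raters; the function below is hypothetical and not taken from [12]:

```python
from collections import Counter

def majority_label(ratings):
    """Return the majority label among rater labels, e.g.
    majority_label([1, 1, -1]) -> 1; None signals no strict majority."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count > len(ratings) / 2 else None
```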
IV. SUPERVISED CLASSIFICATION

B. Naïve Bayes
Naïve Bayes is a classifier based on Bayes' rule, which can be written as follows:
P(c|d) = P(d|c) P(c) / P(d)    (1)

where c denotes a class and d denotes a document.
The main idea of the Naïve Bayes classifier is to assume that the predictor variables are independent, which substantially reduces the computation of the probabilities. This classifier gives good results and has been used in much research, such as the work reported in [21] and [24].
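As an illustration of the technique (not the paper's actual RapidMiner process), the following Python sketch trains a multinomial Naïve Bayes text classifier with scikit-learn; the two documents are invented placeholders for the Arabic tweets and comments:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Placeholder documents; the real data are Arabic tweets/comments
# labeled 1 (positive) or -1 (negative).
docs = ["great service fast delivery", "terrible product waste of money"]
labels = [1, -1]

vectorizer = CountVectorizer()         # bag-of-words term counts
X = vectorizer.fit_transform(docs)
clf = MultinomialNB().fit(X, labels)   # estimates P(c) and the P(d|c) terms
print(clf.predict(vectorizer.transform(["great fast product"])))
```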
the English language when compared to other classifiers. Also, the work reported in [22] shows that SVM
gives good results for sentiment analysis of reviews written in Chinese.
X-Validation:
Validation is an important step that allows us to test the accuracy of algorithms. The most common approaches to validation are the hold-out method and the cross-validation method. In the hold-out method, part of the data or reviews is held out for testing and the remaining part is used for training the classifier. The cross-validation method, by comparison, splits the data into testing and training sets as in the hold-out method, but the data is scanned several times and each division or part of the data is used in both the training and the testing phases. To be clear, in 10-fold cross-validation the data is divided into 10 parts; one part is used for testing and 9 for training in the first run. In the second run, a different part is used for testing and 9 parts, including the one that was used for testing in run one, are used for training. The runs continue until each part has been given the chance to be part of both the training data and the testing data. The final accuracy is the average of the accuracies obtained over the 10 runs. In the current research, we used 10-fold cross-validation.
The X-Validation operator in RapidMiner is a nested operator that contains one operator for the classifier and another operator for calculating the performance of the classifier. The classifiers we used here are SVM, K-NN, and Naïve Bayes.
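For readers who prefer code to the GUI, the sketch below reproduces the same experimental shape, 10-fold cross-validation over SVM, K-NN (K=10), and Naïve Bayes, in scikit-learn; the random feature matrix stands in for a real document-term matrix, and none of this code comes from the paper itself:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(100, 20))  # stand-in document-term counts
y = np.repeat([1, -1], 50)              # stand-in sentiment labels

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "KNN (K=10)": KNeighborsClassifier(n_neighbors=10),
}
for name, clf in classifiers.items():
    # cv=10 mirrors the 10-fold X-Validation operator.
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```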
The Performance operator is responsible for calculating the accuracy of the classifier. It has many parameters from which one can choose when deciding on a method for calculating the accuracy of the classifier. In the current research, we used precision and recall as measures of accuracy. To calculate these, we need:
TP: the number of reviews that were correctly classified by the classifier to belong to the current class.
TN: the number of reviews that were correctly classified by the classifier not to belong to the current class.
FP: the number of reviews that were mistakenly classified by the classifier to belong to the current class.
FN: the number of reviews that were mistakenly classified by the classifier not to belong to the current
class.
Therefore, Precision = TP/(TP+FP) and Recall = TP/(TP+FN) for binary classification tasks. As the problem we are dealing with consists of two classes, we calculated the precision and recall per class, and we also calculated the macro-precision and macro-recall over the two classes together. Table 2 shows the class precision and recall for the Naïve Bayes classifier. As can be seen from the table, the precision for the Negative class is 78.20 while the precision for the Positive class is 54.21. This variance is mainly attributed to the fact that the dataset is not balanced; the number of negative reviews is larger than the number of positive reviews. Recall, on the other hand, equals 52.70 for the Negative class and 79.22 for the Positive class. In our opinion, the recall for the Negative class is lower because the classifier must retrieve a large number of negative reviews to obtain a high recall, as the number of negative reviews is large.
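As a worked check of these definitions, the snippet below recomputes the per-class and macro-averaged figures directly from the Naïve Bayes confusion counts in Table 2 (a sketch; the paper obtained these numbers from RapidMiner's Performance operator):

```python
def prec_rec(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Negative class (Table 2): TP=800, FP=223, FN=718.
p_neg, r_neg = prec_rec(800, 223, 718)   # -> (0.782, 0.527)
# Positive class (Table 2): TP=850, FP=718, FN=223.
p_pos, r_pos = prec_rec(850, 718, 223)   # -> (0.542, 0.792)

print("macro-precision:", (p_neg + p_pos) / 2)  # ~0.662
print("macro-recall:", (r_neg + r_pos) / 2)     # ~0.660
```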
Tables 3 and 4 depict the class precision and recall for the SVM and K-NN (K=10) classifiers,
respectively.
Table 5 shows the Macro-Recall and Macro-Precision for the three classifiers. As the table shows, the
highest Macro-Recall was achieved in the case of KNN. The highest Macro-Precision was achieved in the
case of SVM.
Table 2: Class Precision and Recall for the Naïve Bayes Classifier

                     True Negative   True Positive   Class Precision
Predicted Negative   800             223             78.20
Predicted Positive   718             850             54.21
Class Recall         52.70           79.22
Table 3: Class Precision and Recall for the SVM Classifier

                     True Negative   True Positive   Class Precision
Predicted Negative   1474            806             64.65
Predicted Positive   44              267             85.85
Class Recall         97.10           24.88
ROC (Receiver Operating Characteristic) curves are graphical methods for depicting the accuracy of classifiers. A ROC chart describes the effectiveness of a classifier that allocates items into one of two categories depending on whether or not they exceed a threshold. In ROC curves, the x-axis is the false positive rate (FPR) and the y-axis is the true positive rate (TPR). The FPR is the fraction of negative examples that are misclassified, and the TPR is the fraction of positive examples that are correctly labeled. The best point in the ROC space is located in the upper left corner, coordinate (0,1), and is sometimes called perfect classification. The diagonal of the ROC space represents random guessing, as is the case when flipping a coin. Points located above the diagonal are better than random guesses, and points located under the diagonal are worse than random guesses. Figure 1 shows the ROC curves for the three classifiers. As can be seen from the figure, all classifiers performed better than random guessing; in fact, the results are located in the left side of the ROC space, which indicates that the classifiers did a good job.
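As an illustrative sketch (again outside RapidMiner), scikit-learn can compute the FPR/TPR points that make up such a curve from a classifier's scores; the labels and scores below are toy placeholders:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy test labels and positive-class scores; real inputs would come
# from each classifier's confidence on the held-out folds.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.7, 0.6])

fpr, tpr, _ = roc_curve(y_true, scores)       # x- and y-axis points
print(list(zip(fpr, tpr)))
print("AUC:", roc_auc_score(y_true, scores))  # area under the ROC curve
```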
Table 4: Class Precision and Recall for the KNN Classifier (K=10)

                     True Negative   True Positive   Class Precision
Predicted Negative   1260            482             72.33
Predicted Positive   258             591             69.61
Class Recall         83.00           55.08
VI. CONCLUSIONS AND FUTURE WORK

Certainly, there are many ways in which this work can and will be improved. Firstly, the size of the dataset is rather small; if we want to draw solid conclusions then we definitely need larger datasets. Secondly, crowdsourcing is a useful tool for labelling or annotating large amounts of data, and in this work we utilized crowdsourcing to label our dataset. Thirdly, semi-supervised learning could be applied to sentiment analysis of Arabic text, as this technique has been applied successfully to other languages, as described in the research reported in [13], [14], [15], and [17].
REFERENCES
[1] Saif H., He Y., and Alani H., "Semantic sentiment analysis of Twitter." The Semantic Web–
ISWC. Springer, pp. 508-524, 2012.
[2] Abdul-Mageed M., Diab M., and Korayem M., "Subjectivity and sentiment analysis of modern
standard Arabic." Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies. Vol. 2. 2011.
[3] Almas Y., and Ahmad K., "A note on extracting sentiments in financial news in English, Arabic &
Urdu." The Second Workshop on Computational Approaches to Arabic Script-based Languages.
2007.
[4] Abdul-Mageed M., and Diab M., "AWATIF: A multi-genre corpus for Modern Standard Arabic
subjectivity and sentiment analysis." Proceedings of LREC, Istanbul, Turkey, 2012.
[5] Elhawary M., and Elfeky M., "Mining Arabic Business Reviews." Data Mining Workshops
(ICDMW), pp. 1108-1113, 2010.
[6] Pang B., and Lee L. "A sentimental education: Sentiment analysis using subjectivity summarization
based on minimum cuts." Proceedings of the 42nd annual meeting on Association for Computational
Linguistics. 2004.
[7] El-Halees A., "Arabic Opinion Mining Using Combined Classification Approach." Proceedings of the
International Arab Conference on Information Technology, ACIT. 2011.
[8] Rushdi-Saleh M., Martín-Valdivia M., Ureña-López L., and Perea-Ortega J.M., "Bilingual
Experiments with an Arabic-English Corpus for Opinion Mining." 2011.
[9] Al-Subaihin A., Al-Khalifa H., and Al-Salman A.M., "A proposed sentiment analysis tool for
modern arabic using human-based computing." Proceedings of the 13th International Conference on
Information Integration and Web-based Applications and Services. ACM, 2011.
[10] Dave K., Lawrence S., & Pennock D.M., "Mining the peanut gallery: Opinion extraction and
semantic classification of product reviews." In Proceedings of the 12th international conference on
World Wide Web, pp. 519-528. ACM, 2003.
[11] Nasukawa T., and Yi J., "Sentiment analysis: Capturing favorability using natural language
processing." Proceedings of the 2nd international conference on Knowledge capture. ACM, 2003.
[12] [Self reference to the authors, names were removed as per Journal instructions] " Sentiment
Analysis." June 2012.
[13] Rao D., and Ravichandran D., "Semi-supervised polarity lexicon induction." Proceedings of the 12th
Conference of the European Chapter of the Association for Computational Linguistics. Association
for Computational Linguistics, 2009.
[14] Dasgupta S., and Ng V., "Mine the easy, classify the hard: a semi-supervised approach to automatic
sentiment classification." In Proceedings of the Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP:
Volume 2, pp. 701-709. Association for Computational Linguistics, 2009.
[15] Sindhwani V., and Melville P., "Document-word co-regularization for semi-supervised sentiment
analysis." 8th IEEE International Conference on Data Mining (ICDM'08), pp. 1025-1030, 2008.
[16] RapidMiner, http://rapid-i.com/, last accessed on 31-Jan-201.
[17] Goldberg A.B., and Zhu X., "Seeing stars when there aren't many stars: graph-based semi-
supervised learning for sentiment categorization." Proceedings of the First Workshop on Graph Based
Methods for Natural Language Processing. Association for Computational Linguistics, 2006.
[18] Kumar A., and Sebastian T.M., "Sentiment Analysis on Twitter." IJCSI International Journal of
Computer Science Issues, Vol. 9, Issue 3, pp. 372-378, 2012.
[19] Malouf R., and Mullen T., "Taking sides: User classification for informal online political
discourse." Internet Research 18.2: pp. 177-190, 2008.
[20] Glance N., Hurst M., Nigam K., Siegler M., Stockton R., and Tomokiyo T., "Deriving marketing
intelligence from online discussion." In Proceedings of the 11th ACM SIGKDD international
conference on knowledge discovery in data mining, pp. 419-428, 2005.
[21] Rish I., "An empirical study of the naive Bayes classifier." IJCAI workshop on empirical methods in
artificial intelligence. Vol. 3. No. 22. 2001.
[22] Wan X., "Co-training for cross-lingual sentiment classification." Proceedings of the Joint Conference
of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural
Language Processing of the AFNLP: Volume 1. Association for Computational Linguistics, 2009.
[23] Tokuhisa R., Inui K., and Matsumoto Y., "Emotion classification using massive examples extracted from
the web." Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1.
Association for Computational Linguistics, 2008.
[24] McCallum A., and Nigam K. "A comparison of event models for Naive Bayes text
classification." AAAI-98 workshop on learning for text categorization. Vol. 752. 1998.
[25] Ye Q., Zhang Z., and Law R., "Sentiment classification of online reviews to travel destinations by
supervised machine learning approaches." Expert Systems with Applications, Vol. 36, Issue 3, pp.
6527-6535, 2009.
[26] Fung G., and Mangasarian O.L., "Incremental support vector machine classification." Proceedings of the
Second SIAM International Conference on Data Mining, Arlington, Virginia. 2002.