The 1st International Workshop on Social Networks Analysis, Management and Security (SNAMS - 2014), August 2014, Barcelona,


Arabic Sentiment Analysis using Supervised Classification

Rehab M. Duwairi
Department of Computer Information Systems
Jordan University of Science and Technology
Irbid 22110, Jordan

Islam Qarqaz
Department of Computer Science
Jordan University of Science and Technology
Irbid 22110, Jordan

Abstract—Sentiment analysis is a process during which the polarity (i.e. positive, negative or
neutral) of a given text is determined. In general, there are two approaches to this problem:
the machine learning approach and the lexicon-based approach. The current paper deals with
sentiment analysis in Arabic reviews from a machine learning perspective. Three classifiers were
applied to an in-house dataset of tweets and comments: Naïve Bayes, SVM,
and K-Nearest Neighbor (KNN). The results show that SVM gives the
highest precision, while KNN (K=10) gives the highest recall.

Keywords— sentiment analysis; sentiment classification; opinion mining; text mining, Arabic language.

Sentiment analysis, or opinion mining, is a field of study that attempts to analyze people's opinions,
sentiments, attitudes, and emotions toward entities such as products, services, and organizations. The
expression sentiment analysis first appeared in [11] (Nasukawa and Yi 2003), and the expression
opinion mining first appeared in [10] (Dave, Lawrence and Pennock 2003). Several authors use
the phrases sentiment analysis and opinion mining interchangeably, and we do the same in this work.
Sentiment analysis can be viewed as a classification process that determines whether a certain
document or text expresses a positive or a negative opinion. Sentiment classification is
helpful in business intelligence applications and recommender systems [20]. There are also potential
applications to message filtering [19].
In recent years, sentiment analysis has gained considerable attention, and its applications have spread
to almost every possible domain. Many systems and applications have been built for English and other
Indo-European languages. However, few studies have focused on the Arabic language, which is the native
language of more than 300 million speakers. This paper studies sentiment analysis
of public Arabic tweets and comments in social media using classification models built with
RapidMiner [16], an open source data mining and machine learning package.
The rest of this paper is structured as follows. Section II describes related work.
Section III presents the software used in this work and the dataset.
Section IV presents supervised classification. Section V discusses experimentation and result
analysis. Finally, Section VI presents the conclusions of this study and highlights some future work.


Researchers have proposed many different approaches for sentiment analysis. In general, there are
two main families of methods: supervised machine learning techniques, which are the subject of
this paper, and unsupervised techniques. Many studies have focused on
sentiment analysis for English and other Indo-European languages. Pang and Lee [6] used
machine learning techniques for sentiment classification. They employed three classifiers (Naïve Bayes,
Maximum Entropy, and Support Vector Machines). Their data source was the Internet Movie
Database (IMDB); they selected only reviews where the author rating was expressed either with stars or
with some numerical value. Dave, Lawrence, and Pennock [10] proposed an approach that begins with
training a classifier using a corpus of self-tagged reviews available from major web sites. They
used n-grams on two tests, and the results showed that this approach outperformed traditional machine learning.
Many studies have addressed analyzing sentiment and extracting opinions from the World Wide
Web. This is important due to the large amount of data contributed by users of websites such as
social networks (Facebook, Twitter, etc.). For example, Hassan, Yulan, and Alani [1] studied semantic
sentiment analysis of Twitter. The authors used three different Twitter datasets for their experiments.
They proposed the use of semantic features in Twitter sentiment classification and explored three
different approaches for incorporating them into the analysis: replacement, augmentation, and
interpolation. In [18], Kumar and Sebastian presented a novel approach for sentiment analysis on Twitter
data, which is based on extracting the opinion words in tweets.
There are few studies on sentiment analysis for the Arabic language. For example, Abdul-Mageed and
Diab presented a newly developed, manually annotated corpus of Modern Standard Arabic (MSA)
together with a new polarity lexicon [2]. They ran their experiments on three different pre-processing
settings based on tokenized text from the Penn Arabic Treebank (PATB). They adopted a two-stage
classification approach. In the first stage, they built a binary classifier to separate objective from subjective
cases. In the second stage, they applied binary classification that distinguishes positive from negative.
In [4], the same researchers as in [2] reported efforts to bridge the gap in Arabic research by
presenting AWATIF, a multi-genre corpus for Modern Standard Arabic Subjectivity and Sentiment
Analysis (MSA SSA). They extended their previous work by showing how annotation studies within
subjectivity and sentiment analysis can both be inspired by existing linguistic theory and cater for genre
Almas and Ahmad [3] targeted three languages (English, Arabic, and Urdu) in their work. They
described a method, called local grammar, for automatically extracting specialist terms. The authors
compared the behavior of single and compound tokens in specialist and general language corpora to
determine whether a token behaves like a sentiment term or not.
Elhawary and Elfeky [5] extracted business reviews, written in Arabic, that are scattered on the web.
They built a system that comprises two components: a reviews classifier that decides whether a web
page contains reviews or not, and a sentiment analyzer that identifies whether the review text is
positive, negative, neutral, or mixed.

El-Halees [7] presented a combined approach that extracts opinions from Arabic documents.
The approach consists of three methods. First, a lexicon-based method classifies some
documents. Second, the classified documents are used as a training set for a Maximum Entropy model. Last,
the K-nearest neighbor classifier is used to classify the rest of the documents.
Rushdi-Saleh, Martín-Valdivia, Ureña-López, and Perea-Ortega presented an Opinion Corpus for Arabic (OCA) in [8]. It is composed
of Arabic reviews of movies and films extracted from specialized web pages.
They utilized two classifiers, namely Support Vector Machines and Naïve Bayes.
Al-Subaihin, Al-Khalifa, and Al-Salman [9] proposed a sentiment analysis tool for modern Arabic
using human-based computing, which helps construct, dynamically develop, and maintain the
tool's lexicons. They also examined the problem of conducting sentiment analysis on Arabic text on the
World Wide Web. Their proposed solution is a lexicon-based approach.


A. RapidMiner

RapidMiner [16] is a Java-based open source data mining and machine learning package. It has a
graphical user interface (GUI) where users can design a machine learning process without having to
write code. The process is then transformed into an XML (Extensible Markup Language) file that defines the
operations to be applied to the given dataset. Perhaps one of the most valuable extensions
to RapidMiner is the Text Processing package, which includes many operators that support text mining. For
example, there are operators for tokenization, stemming, filtering stopwords, and generating n-grams. The
main reason for choosing RapidMiner is that the Text Processing package can deal with the Arabic language.

B. The Dataset

We generated our dataset by collecting tweets and Facebook comments from the internet. These
tweets and comments address general topics such as education, sports, and political news. For the
tweets, we utilized the crawler and annotation tool presented in [12].
The authors of [12] designed a crawler to collect tweets from Twitter. They also relied on
crowdsourcing for tweet annotation. Initially, 10,000 tweets were collected and annotated. When the
collected tweets were carefully examined, we realized that they suffered from several problems.
There was a high number of duplicate tweets, possibly the result of re-tweeting, and some of the
automatically collected tweets were empty and contained only the address of the sender. Such tweets were
removed from our dataset.
We also manually collected 500 comments from Facebook. Many of these comments were removed,
either because they were written in Arabizi (a style widely used by Arab users of social media, in
which Roman letters are used to write Arabic words) or because the comment consisted of emoticons only. Table 1
shows the number of tweets and comments that remained, together with their sentiment orientation.

Table 1: Number of negative and positive tweets/comments in the dataset.

                  Positive   Negative   Total
Tweets/Comments   1073       1518       2591

The crowdsourcing tool presented in [12] was used to label the tweets. Volunteers have to create a
username and password to use the tool. Once they log onto the system, one tweet or comment appears
at a time. The user chooses a label from (1) Positive, (2) Negative, (3) Neutral, and (4) Other.
Positive tweets are given the label "1", negative tweets are given the label "-1", and neutral tweets are
given the label "0". If the tweet is empty or suffers from some other problem, the option "other" is used and that tweet
is deleted from the dataset. Every tweet must be rated by at least three different users, and majority voting
is used to assign the final label for every tweet. As a quality assurance measure, one of the raters was one of
the researchers in [12]. After labelling was complete, we stored the positive tweets in one file and the
negative tweets in another file. The authors of the current paper manually labeled the Facebook comments.
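The labelling rule described above (at least three raters, majority voting, and removal of tweets marked "other") can be sketched as follows. This is only an illustration: the function name `final_label` is hypothetical, and the handling of ties (returning `None`) is an assumption, since the text does not say how the tool in [12] resolves them.

```python
from collections import Counter

def final_label(ratings):
    """Return the majority label among the ratings, or None if there is no majority.

    ratings: labels from different raters, e.g. 1, -1, 0 or "other".
    """
    if len(ratings) < 3:
        raise ValueError("each tweet needs at least three ratings")
    label, count = Counter(ratings).most_common(1)[0]
    # Assumption: a label wins only with a strict majority of the votes.
    return label if count > len(ratings) / 2 else None

def keep_for_dataset(ratings):
    """A tweet is kept only if its majority label is positive or negative."""
    return final_label(ratings) in (1, -1)
```

For example, ratings of `[1, 1, -1]` give the label `1`, while a tweet rated `["other", "other", 1]` is dropped.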


A. K-nearest neighbor classifier (KNN)

This simple classifier finds the K nearest neighbors of an unannotated document among the training
documents and classifies the document based on these K neighbors. Specifically, it calculates
the similarity between the unlabeled document and every document in the training dataset. After
that, the labels of the K most similar documents are considered. The final label of the new document is
determined using majority voting or a weighted average of the labels of these K neighbors. In [23], the
authors used KNN to classify emotions contained in examples, written in Japanese, extracted from the web.
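The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the RapidMiner implementation: cosine similarity over raw token counts is an assumed similarity measure, and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train, tokens, k=10):
    """train: list of (token_list, label). Returns the majority label of the k nearest documents."""
    query = Counter(tokens)
    # Rank training documents from most to least similar to the query.
    ranked = sorted(train, key=lambda ex: cosine(query, Counter(ex[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With a toy training set of two positive and two negative documents, `knn_predict(train, ["good", "film"], k=3)` picks the label shared by most of the three most similar documents.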

B. Naïve Bayes
This classifier is based on Bayes' rule, which is given by the following formula:

$$P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)} \qquad (1)$$

Here, c denotes a class and d a document. The main idea of the Naïve Bayes classifier is to assume that
the predictor variables are independent, which substantially reduces the computation of probabilities.
This classifier gives good results and has been used in many studies, such as the work reported in [21] and [24].
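To make the independence assumption concrete, the following sketch scores each class c by log P(c) plus the sum of log P(t|c) over the tokens t of the document; P(d) is the same for every class and can be dropped. The multinomial event model and add-one (Laplace) smoothing are assumptions made for this illustration; the paper does not state which variant RapidMiner applies.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Returns (priors, per-class token counts, vocabulary)."""
    class_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, token_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log P(c) + sum of log P(t|c)."""
    priors, token_counts, vocab = model
    best, best_score = None, -math.inf
    for c, prior in priors.items():
        total = sum(token_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            if t in vocab:  # tokens never seen in training are ignored
                # Add-one smoothing (an assumption) avoids zero probabilities.
                score += math.log((token_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best
```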

C. Support Vector Machines (SVM)

SVM is an effective traditional text categorization framework. Its main idea is to find the hyperplane,
represented as a vector, that separates document vectors in one class from document vectors
in other classes [26]. SVM has shown very good performance and high accuracy in many studies directed
towards sentiment analysis in many languages. The work reported in [25] shows that SVM did well with
the English language when compared to other classifiers. Also, the work reported in [22] shows that SVM
gives good results for sentiment analysis of reviews written in Chinese.


All the experiments in this research were carried out using RapidMiner [16], which was
described in Section III. The classification task for a given classifier is designed as a process
that consists of several operators, which are described next.
Process Documents from Files:
This is a container operator, i.e. it contains other operators related to text processing. In this work, the
Tokenize, Stem(Arabic), Filter Stopwords(Arabic), and Generate-n-Grams(Terms) operators were used.
The Tokenize operator is responsible for splitting the text of the review into tokens or words. The
Stem(Arabic) operator is responsible for reducing an Arabic token to its stem or root. RapidMiner [16] also
has another operator called Stem(Arabic, Light). This operator does not reduce a word to its proper root;
rather, it removes common prefixes and suffixes from words or tokens. The Filter Stopwords(Arabic) operator
removes Arabic noise words that do not affect the classification task. When dealing with sentiment
analysis, the usage of this operator is tricky, because negation words are considered stopwords for topical
classification and are thus removed. On the other hand, negation words are critical for sentiment analysis as
they can reverse the sentiment from positive to negative and vice versa. The Generate-n-Grams operator
can slightly alleviate this problem by generating sequences of n words, where each sequence is considered
one token. Here, n specifies the number of words or terms in a sequence. In this work, n was set to 2, i.e.
we generated bi-grams. There are more sophisticated methods to deal with valence shifters such
as negations. For example, one could use a parser that searches for valence shifters, attaches them
to the proper term, and determines the sentiment of the sequence as a whole. For instance, good (جيد) is a
positive word and not-good (ليسى جيدا) is a negative word in a given context.
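The effect of bi-gram generation on negation can be illustrated with a short sketch. The helper below keeps the original tokens and appends each pair of adjacent tokens as a single new token; joining the two words with an underscore is an assumption made here for readability, and English words stand in for Arabic tokens.

```python
def with_bigrams(tokens):
    """Return the tokens followed by all bi-grams of adjacent tokens."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

# The negation stays attached to the word it modifies:
# with_bigrams(["not", "good", "film"])
# -> ["not", "good", "film", "not_good", "good_film"]
```

Because "not_good" is a separate feature from "good", a classifier can learn opposite polarities for the two, provided the negation word has not been removed as a stopword before the bi-grams are generated.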
The Process Documents from Files operator takes as input folders that contain text files. In this work,
two folders were fed to this operator: one folder that contains the positive reviews and a second
folder that contains the negative reviews. This operator has a set of parameters that are generally useful when
dealing with text. For example, the Vector Creation parameter allows the user to choose a weighting
scheme for the terms from TF (Term Frequency), TF-IDF (Term Frequency, Inverse Document
Frequency) and others.
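As an illustration of the TF-IDF weighting scheme mentioned above, the sketch below uses one common textbook variant: term frequency is the count of a term divided by the document length, and inverse document frequency is log(N/df). This is an assumption for illustration only; RapidMiner's exact formula and normalization may differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in docs for t in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()})
    return weights
```

A term that occurs in every document receives weight 0 under this variant, which reflects the intuition that such terms do not help to discriminate between documents.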

Validation is an important step that allows us to test the accuracy of algorithms. The most common
approaches to validation are the hold-out method and the cross-validation method. In the hold-out method, part of
the data or reviews is held out for testing and the remaining part is used for training the classifier. The
cross-validation method also splits the data into testing and training parts,
but the data is scanned several times so that each part of the data is used in both the training
and testing phases. For example, in the 10-fold cross-validation method, the data is divided into 10
parts; in the first run, one part is used for testing and 9 for training. In the second run, a different part is used
for testing and 9 parts, including the one that was used for testing in the first run, are used for training. The runs
continue until each part has been used both as training data and as testing data.
The final accuracy is the average of the accuracies obtained in the 10 runs. In the current research, we have
used 10-fold cross validation.
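The splitting scheme can be sketched as an index generator. The sketch assigns documents to folds in an interleaved fashion without shuffling or stratification; this is a simplifying assumption, and the function name is hypothetical.

```python
def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) for each of the k runs over n documents."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        # All folds except the i-th one form the training data of run i.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

# With the 2591 documents of the dataset, every document is used for testing exactly once.
splits = list(k_fold_indices(2591, k=10))
```

Averaging the per-run scores over `splits` gives the final cross-validated figure.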
The X-Validation operator is a nested one that consists of an operator for the classifier and another operator
for calculating the performance of the classifier. The classifiers used here are SVM, K-NN,
and Naïve Bayes.
The Performance operator is responsible for calculating the accuracy of the classifier. It has many
parameters that one can choose from when deciding on a method for calculating the accuracy of the
classifier. In the current research, we used precision and recall as measures of accuracy. To calculate these
we need:
TP: the number of reviews that were correctly classified by the classifier to belong to the current class.
TN: the number of reviews that were correctly classified by the classifier not to belong to the current class.
FP: the number of reviews that were mistakenly classified by the classifier to belong to the current class.
FN: the number of reviews that were mistakenly classified by the classifier not to belong to the current class.
Therefore, Precision = TP/(TP+FP) and Recall = TP/(TP+FN) for binary classification tasks. As the
problem we are dealing with consists of two classes, we calculated the precision and recall per class, and we
also calculated the macro-precision and macro-recall (the unweighted means of the per-class values) over
the two classes. Table 2 shows the class precision and recall for the Naïve Bayes classifier. As can be seen
from the table, the precision of the Negative class is 78.20 while the precision of the Positive class is 54.21.
This difference is mainly attributed to the fact that the dataset is not balanced; the number of negative reviews
is larger than the number of positive reviews. Recall, on the other hand, equals 52.70 for the Negative class
and 79.22 for the Positive class. The reason for the lower recall of the Negative class, in our opinion, is that
the Negative class is large, so the classifier has to retrieve a large number of negative reviews to obtain
high recall.
Tables 3 and 4 depict the class precision and recall for the SVM and K-NN (K=10) classifiers, respectively.
Table 5 shows the Macro-Recall and Macro-Precision for the three classifiers. As the table shows, the
highest Macro-Recall was achieved by KNN, and the highest Macro-Precision was achieved by SVM.
Table 2: Class Precision and Recall for the Naïve Bayes Classifier.

                   True Negative   True Positive   Class Precision (%)
Pred. Negative     800             223             78.20
Pred. Positive     718             850             54.21
Class Recall (%)   52.70           79.22
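The figures in Table 2 can be reproduced from the four counts with the precision and recall formulas given in the text. The sketch below treats rows as predicted classes and columns as true classes, which is the reading under which the reported percentages match; the variable names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Naive Bayes counts from Table 2.
# Negative class: TP = 800, FP = 223 (positives predicted negative), FN = 718.
neg_precision, neg_recall = precision_recall(tp=800, fp=223, fn=718)
# Positive class: TP = 850, FP = 718 (negatives predicted positive), FN = 223.
pos_precision, pos_recall = precision_recall(tp=850, fp=718, fn=223)

# Macro averages are the unweighted means of the per-class values.
macro_precision = (neg_precision + pos_precision) / 2
macro_recall = (neg_recall + pos_recall) / 2
```

Multiplied by 100, these values agree with the class precision and recall in Table 2 and with the Naïve Bayes macro figures (66.205 and 65.96) in Table 5.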

Table 3: Class Precision and Recall for the SVM Classifier.

                   True Negative   True Positive   Class Precision (%)
Pred. Negative     1474            806             64.65
Pred. Positive     44              267             85.85
Class Recall (%)   97.10           24.88

ROC (Receiver Operating Characteristic) curves are graphical methods used for depicting the accuracy
of classifiers. A ROC chart describes the effectiveness of a classifier that allocates items
into one of two categories depending on whether or not they exceed a threshold. In ROC curves, the
x-axis is the false positive rate (FPR) and the y-axis is the true positive rate (TPR). The FPR is the
fraction of negative examples that are misclassified, and the TPR is the fraction of positive examples that
are correctly labeled. The best point in the ROC space is located in the upper-left corner, at coordinate
(0,1), and is sometimes called perfect classification. The diagonal of the ROC space represents random
guesses, as is the case with flipping coins. Points located above the diagonal are better than
random guesses, and points located under the diagonal are worse than random guesses.
Figure 1 shows the ROC curve for the three classifiers. As can be seen from the figure, all classifiers
performed better than random guesses. In fact, the results are located in the left side of the ROC space,
which indicates that the classifiers did a good job.
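As a rough cross-check, the confusion matrices in Tables 2 to 4 each give one operating point in the ROC space when the Positive class is treated as the positive label. The sketch below computes these points; they are single points only, not the full curves of Figure 1, which would require the classifiers' confidence scores.

```python
# Counts taken from Tables 2-4, with Positive as the positive label.
confusions = {
    "Naive Bayes": {"tp": 850, "fn": 223, "fp": 718, "tn": 800},
    "SVM":         {"tp": 267, "fn": 806, "fp": 44,  "tn": 1474},
    "KNN (K=10)":  {"tp": 591, "fn": 482, "fp": 258, "tn": 1260},
}

def roc_point(c):
    """Return (FPR, TPR) for one confusion matrix."""
    fpr = c["fp"] / (c["fp"] + c["tn"])  # fraction of negatives misclassified
    tpr = c["tp"] / (c["tp"] + c["fn"])  # fraction of positives correctly labeled
    return fpr, tpr

points = {name: roc_point(c) for name, c in confusions.items()}
# A point lies above the diagonal (better than random guessing) when TPR > FPR.
above_diagonal = {name: tpr > fpr for name, (fpr, tpr) in points.items()}
```

All three points lie above the diagonal, which is consistent with the observation that every classifier performed better than random guessing.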


This work has considered sentiment analysis in Arabic text. A dataset, which consists of 2591
tweets/comments, was collected and labelled using crowdsourcing. The Naïve Bayes, SVM and KNN
classifiers were used to detect the polarity of a given review. 10-fold cross validation was used to split the
data into training and testing sets. The best macro-precision, 75.25, was achieved by SVM. The best
macro-recall, 69.04, was achieved by KNN (K=10).

Table 4: Class Precision and Recall for the KNN Classifier (K=10).

                   True Negative   True Positive   Class Precision (%)
Pred. Negative     1260            482             72.33
Pred. Positive     258             591             69.61
Class Recall (%)   83.00           55.08

There are many ways in which this work can be improved. Firstly, the dataset is
rather small; drawing solid conclusions requires larger datasets. Secondly,
crowdsourcing is a useful tool for labelling or annotating large amounts of data. In this
work we utilized crowdsourcing to label our dataset. Thirdly, semi-supervised learning could be applied to
sentiment analysis in Arabic text, as this technique has been applied successfully to other languages, as
described in the research reported in [13], [14], [15], and [17].

Table 5: Macro-Precision and Macro-Recall for the three classifiers.

              Macro-Precision   Macro-Recall
Naïve Bayes   66.205            65.96
SVM           75.25             60.99
KNN (K=10)    70.97             69.04

Figure 1: The ROC Curve for the Three Classifiers

[1] Hassan S., Yulan H., and Alani H., "Semantic sentiment analysis of Twitter." The Semantic Web–
ISWC. Springer, pp. 508-524, 2012.
[2] Abdul-Mageed. M., Diab. M., and Korayem M., "Subjectivity and sentiment analysis of modern
standard Arabic." Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies. Vol. 2. 2011.
[3] Almas Y., and Ahmad K., "A note on extracting sentiments in financial news in English, Arabic &
Urdu." The Second Workshop on Computational Approaches to Arabic Script-based Languages.
[4] Abdul-Mageed M., and Diab M., "AWATIF: A multi-genre corpus for Modern Standard Arabic
subjectivity and sentiment analysis." Proceedings of LREC, Istanbul, Turkey, 2012.
[5] Elhawary M., and Elfeky M., "Mining Arabic Business Reviews." Data Mining Workshops
(ICDMW), pp. 1108-1113, 2010.
[6] Pang B., and Lee L. "A sentimental education: Sentiment analysis using subjectivity summarization
based on minimum cuts." Proceedings of the 42nd annual meeting on Association for Computational
Linguistics. 2004.

[7] El-Halees A., "Arabic Opinion Mining Using Combined Classification Approach." Proceedings of the
International Arab Conference on Information Technology, ACIT. 2011.
[8] Rushdi-Saleh M., Martín-Valdivia M., Ureña-López L. & Perea-Ortega J.M., Bilingual Experiments
with an Arabic-English Corpus for Opinion Mining, 2011.
[9] Al-Subaihin A., Al-Khalifa H., and Al-Salman A.M., "A proposed sentiment analysis tool for
modern arabic using human-based computing." Proceedings of the 13th International Conference on
Information Integration and Web-based Applications and Services. ACM, 2011.
[10] Dave K., Lawrence S., & Pennock D.M., "Mining the peanut gallery: Opinion extraction and
semantic classification of product reviews." In Proceedings of the 12th international conference on
World Wide Web, pp. 519-528. ACM, 2003.
[11] Nasukawa T., and Yi J., "Sentiment analysis: Capturing favorability using natural language
processing." Proceedings of the 2nd international conference on Knowledge capture. ACM, 2003.
[12] [Self reference to the authors, names were removed as per Journal instructions] " Sentiment
Analysis." June 2012.
[13] Rao D., and Ravichandran D., "Semi-supervised polarity lexicon induction." Proceedings of the 12th
Conference of the European Chapter of the Association for Computational Linguistics. Association
for Computational Linguistics, 2009.
[14] Dasgupta S., and Ng V., "Mine the easy, classify the hard: a semi-supervised approach to automatic
sentiment classification." In Proceedings of the Joint Conference of the 47th Annual Meeting of the
ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP:
Volume 2, pp. 701-709. Association for Computational Linguistics, 2009.
[15] Sindhwani V., and Melville P., "Document-word co-regularization for semi-supervised sentiment
analysis." 8th IEEE International Conference on Data Mining (ICDM'08), pp. 1025-1030, 2008.
[16] Rapidminer,, last access on 31-Jan-201.
[17] Goldberg A.B., and Zhu X., "Seeing stars when there aren't many stars: graph-based semi-
supervised learning for sentiment categorization." Proceedings of the First Workshop on Graph Based
Methods for Natural Language Processing. Association for Computational Linguistics, 2006.
[18] Kumar A., and Sebastian T.M., "Sentiment Analysis on Twitter." IJCSI International Journal of
Computer Science, Issue 9.3, pp. 372-378, 2012.
[19] Malouf R., and Mullen T., "Taking sides: User classification for informal online political
discourse." Internet Research 18.2, pp. 177-190, 2008.
[20] Glance N., Hurst M., Nigam K., Siegler M., Stockton R., & Tomokiyo T., "Deriving marketing
intelligence from online discussion." In Proceedings of the 11th ACM SIGKDD international
conference on knowledge discovery in data mining, pp. 419-428, 2005.
[21] Rish I., "An empirical study of the naive Bayes classifier." IJCAI workshop on empirical methods in
artificial intelligence. Vol. 3. No. 22. 2001.
[22] Wan X., "Co-training for cross-lingual sentiment classification." Proceedings of the Joint Conference
of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural
Language Processing of the AFNLP: Volume 1. Association for Computational Linguistics, 2009.
[23] Tokuhisa R., Kentaro I., and Yuji M., "Emotion classification using massive examples extracted from
the web." Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1.
Association for Computational Linguistics, 2008.
[24] McCallum A., and Nigam K. "A comparison of event models for Naive Bayes text
classification." AAAI-98 workshop on learning for text categorization. Vol. 752. 1998.
[25] Ye Q., Zhang Z., and Law R., "Sentiment classification of online reviews to travel destinations by
supervised machine learning approaches." Expert Systems with Applications, Vol. 36, Issue 3, pp.
6527-6535, 2009.

[26] Fung G., and Mangasarian O.L., "Incremental support vector machine classification." Proceedings of the
Second SIAM International Conference on Data Mining, Arlington, Virginia. 2002.

