Urdu Sentiment Analysis With Deep Learning Methods
ABSTRACT Although over 169 million people in the world are familiar with the Urdu language and a large
quantity of Urdu data is being generated on different social websites daily, very few research studies and
efforts have been completed to build language resources for the Urdu language and examine user sentiments.
The primary objective of this study is twofold: (1) develop a benchmark dataset for the resource-deprived Urdu
language for sentiment analysis and (2) evaluate various machine and deep learning algorithms for sentiment analysis.
To find the best technique, we compare two modes of text representation: a count-based one, where the text is
represented using word n-gram feature vectors, and one based on fastText pre-trained word
embeddings for Urdu. We consider a set of machine learning classifiers (RF, NB, SVM, AdaBoost, MLP,
LR) and deep learning classifiers (1D-CNN and LSTM) to run the experiments for all the feature types.
Our study shows that the combination of word n-gram features with LR outperformed other classifiers for
the sentiment analysis task, obtaining the highest F1 score of 82.05% using a combination of features.
INDEX TERMS Urdu sentiment analysis, machine learning, deep learning, natural language processing.
L. Khan et al.: Urdu Sentiment Analysis With Deep Learning Methods
Mostly, Urdu websites are developed in a descriptive arrangement rather than a proper text encoding structure; due to this hurdle, it is challenging to create a benchmark corpus in Urdu. Urdu sentiment analysis has not yet been investigated completely despite the language's considerable use; most existing studies focus on different aspects of language processing [12], [13].
In this paper, the primary focus is to contribute a benchmark corpus for Urdu sentiment analysis. Our corpus is known as the Urdu Corpus for Sentiment Analysis (UCSA). This new dataset and the accompanying experiments provide a benchmark enabling further research in sentiment analysis in the Urdu language.
The main contributions of this research are as follows:
• A new sentiment analysis corpus in Urdu is collected that contains user reviews about various services: products, games, and politics. It is manually annotated by experts following a set of guidelines (publicly available; see a link below);
• We provide baseline results for state-of-the-art machine learning (RF, NB, SVM, AdaBoost, MLP, LR) and deep learning (1D-CNN, LSTM) models on our UCSA corpus using two text representations: word n-gram features and fastText pre-trained word embeddings;
• To the best of our knowledge, no research study shows the use of deep learning models with pre-trained word embedding models for Urdu sentiment analysis; therefore, we studied the effectiveness of word embedding models in resource-deprived languages such as Urdu.
Our corpus UCSA is publicly available.2
2 http://ieee-dataport.org/documents/urdu-corpus-sentiment-analysis; last visited: 20-06-2021
The rest of the paper is organized as follows. Section II presents the background and related work. Section III describes the corpus collection details. Section IV presents the methodology of the paper. Section V analyzes the experimental setting and results. Finally, Section VI concludes the paper.
II. BACKGROUND AND RELATED WORK
In this section, we discuss well-known datasets as well as machine and deep learning techniques for sentiment analysis.
Among efforts to create benchmark datasets for sentiment analysis, the SemEval contests are considered some of the most notable in the literature. In the series of SemEval competitions on sentiment analysis, researchers performed distinct tasks using different datasets. These datasets were developed in Arabic and English [14]. Generally, they contain user tweets from Twitter that relate to different products such as laptops, TVs, and mobiles. The 2013 edition of the SemEval corpus consists of Twitter and SMS data; the tweets were divided into three sets: training (9,728), development (1,654), and test (3,813), while the 2,093 SMS messages were used for testing only. Similarly, the 2014 version of the SemEval Twitter dataset contains 1,853 user tweets and 1,142 LiveJournal news [15]. The 2016 and 2017 versions of the SemEval datasets were split into training, development, and test sets for each subtask [16]. In this edition there were five subtasks: A, B, C, D, and E.
In addition to the SemEval efforts, Korean, German, and Indonesian languages have also been investigated for sentiment analysis. A Korean dataset (KOSAC) was created that contains 332 news articles. The authors' primary aim was to examine sentiments in Korean, and they used the Korean subjectivity markup language to annotate their dataset [17]. Another dataset was developed that contains customer reviews about various Amazon products [18]. An Amazon review parser was used for the dataset collection. Human experts annotated each review according to its semantic meaning. A total of 63,067 reviews were collected about different products. Another effort was made to develop an Indonesian corpus. The Twitter Streaming API was used to collect this dataset, and the authors also used geolocation to collect only Indonesian-dialect tweets. Their Indonesian dataset contains 5.3 million tweets [19].
Recently, deep learning methods were implemented to investigate text representations and to overcome the problem of sentiment classification on large social network datasets [20]–[22]. In addition, improved word vectors (IWVs) were recommended for word embedding because of their higher performance in the domain of sentiment analysis [23].
A few studies were performed on the sentiment analysis of social network data to support intelligent transportation systems [24]–[26]. Data were gathered from various social networking sites such as Facebook, Twitter, and TripAdvisor. They achieved an accuracy of 93% on their sentiment analysis dataset. In addition, based on social network data, a real-time observation framework was suggested to detect traffic accidents and analyze traffic conditions by using BiLSTM [26]. They achieved an accuracy of 97% for traffic event detection analysis.
Although a considerable quantity of data is available on the internet, research on sentiment analysis in Urdu is still at an initial level compared to resource-rich languages such as English. A large quantity of data is required to create a benchmark dataset for sentiment analysis. The drawbacks of existing corpora are that they are too small or contain data about limited genres.
In the first study [27], the authors collected user reviews to create two corpora to assess their models' efficiency. The first corpus contains 322 positive and 328 negative movie reviews. The second corpus contains reviews about electronic appliances. This dataset contains 650 user reviews, among which 322 are positive and 328 are negative. In this study, they used a grammar-based approach and focused on the grammatical structure of sentences. They achieved 82.5%
This section focuses on the experimental details of our
machine learning and deep learning models such as the
support vector machine (SVM), naïve Bayes (NB), random
forest (RF), AdaBoost, multilayer perceptron (MLP), logistic
regression (LR), 1-dimensional convolutional neural network
(1D-CNN), and long short-term memory (LSTM). All these
FIGURE 2. High-level system architecture for Urdu sentiment analysis.
TABLE 2. Urdu sentiment analysis results using machine learning models with word N-gram features.
TABLE 3. Urdu sentiment analysis results using deep learning models with pre-trained word embeddings.
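As a rough illustration of the count-based representation behind Table 2, the sketch below builds word n-gram count features in plain Python. The helper names (`word_ngrams`, `count_features`), the whitespace tokenization, and the toy English sentence are assumptions made for illustration and are not taken from the paper. Reading a feature combination as a range (for example, 1-2 as unigrams plus bigrams) is likewise an assumption.

```python
from collections import Counter


def word_ngrams(tokens, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, grouped by n."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def count_features(text, n_min=1, n_max=1):
    """Map a whitespace-tokenized text to a bag of n-gram counts."""
    return Counter(word_ngrams(text.split(), n_min, n_max))


# Toy review (hypothetical; the corpus itself is in Urdu).
review = "the drama was not good not good at all"

unigrams = count_features(review, 1, 1)
uni_bi = count_features(review, 1, 2)  # assumed reading of a (1-2) combination

print(unigrams["not"])      # count of the unigram "not"
print(uni_bi["not good"])   # count of the bigram "not good"
```

In this toy case the bigram "not good" keeps the negation attached to the word it modifies, which a unigram-only representation cannot express. The resulting count dictionaries would then be turned into fixed-length vectors over a shared vocabulary before being passed to a classifier.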
σ represents the sigmoid function in the above equation, while $W_f$ and $b_f$ specify the weight matrix and bias, respectively, of the forget state.
2) STEP 2
In step 2, we store the new input $X_t$ and update the cell state. We execute two actions: one for the sigmoid layer and the other for the tanh layer. The sigmoid layer decides which information needs to be updated or discarded, while the tanh layer allocates weights to the passing values. These values are then multiplied to update the cell state, and the new memory is added to the old memory $Y_{t-1}$, which results in $Y_t$ [2].
$$i_t = \sigma(W_i [h_{t-1}, X_t] + b_i), \tag{2}$$
$$N_t = \tanh(W_n [h_{t-1}, X_t] + b_n), \tag{3}$$
$$Y_t = Y_{t-1} * f_t + N_t * i_t, \tag{4}$$
where $Y_{t-1}$ and $Y_t$ are the cell states at times $t-1$ and $t$, while $W$ denotes the weight matrices and $b$ the bias of the cell state.
3) STEP 3
In this last step, we obtain the output values $h_t$, which depend on the output cell state $Y_t$ but in a filtered form. To create the output, the sigmoid layer chooses the part of the cell state to emit. After that, the output of the sigmoid gate, $o_t$, is multiplied by the new values that the tanh layer produces from the cell state $Y_t$:
$$o_t = \sigma(W_o [h_{t-1}, X_t] + b_o), \tag{5}$$
$$h_t = o_t * \tanh(Y_t), \tag{6}$$
where $W_o$ and $b_o$ denote the weight matrix and bias of the output gate.
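The three steps can be traced numerically with the minimal sketch below, which runs one LSTM time step for a single hidden unit in plain Python. It follows the symbols used in the text for the forget gate, input gate, candidate values, cell state, and output (f_t, i_t, N_t, Y_t, h_t). It assumes the standard forget-gate form f_t = σ(W_f [h_{t−1}, X_t] + b_f) and names the sigmoid output gate `o_t`. The function name `lstm_step` and all weight values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))


def lstm_step(x_t, h_prev, y_prev, params):
    """One LSTM step for a single hidden unit.

    x_t is the input vector, h_prev the previous output (scalar), and
    y_prev the previous cell state (scalar). Each gate has a weight
    vector over the concatenation [h_prev, x_t] and a scalar bias.
    """
    z = [h_prev] + list(x_t)                                 # [h_{t-1}, X_t]
    f_t = sigmoid(dot(params["W_f"], z) + params["b_f"])     # step 1: forget gate
    i_t = sigmoid(dot(params["W_i"], z) + params["b_i"])     # Eq. (2)
    n_t = math.tanh(dot(params["W_n"], z) + params["b_n"])   # Eq. (3)
    y_t = y_prev * f_t + n_t * i_t                           # Eq. (4)
    o_t = sigmoid(dot(params["W_o"], z) + params["b_o"])     # Eq. (5), output gate
    h_t = o_t * math.tanh(y_t)                               # Eq. (6)
    return h_t, y_t


# Hypothetical weights: all zeros, so every gate outputs 0.5 and N_t is 0.
zero = {"W_f": [0.0, 0.0, 0.0], "b_f": 0.0,
        "W_i": [0.0, 0.0, 0.0], "b_i": 0.0,
        "W_n": [0.0, 0.0, 0.0], "b_n": 0.0,
        "W_o": [0.0, 0.0, 0.0], "b_o": 0.0}

h_t, y_t = lstm_step(x_t=[1.0, -1.0], h_prev=0.0, y_prev=2.0, params=zero)
print(y_t)  # half of the old memory survives the forget gate
print(h_t)
```

With all-zero weights the forget gate passes exactly half of the previous cell state and no new memory is added, which makes the effect of Eq. (4) easy to check by hand. A real layer would use weight matrices and vector-valued states rather than scalars.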
FIGURE 8. Performance comparison of machine learning models using combination (1-2) of features.
FIGURE 9. Performance comparison of machine learning models using combination (1-3) of features.
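The models in these comparisons are scored with precision, recall, and F1. As a minimal sketch of how those measures follow from confusion counts, the plain-Python helpers below compute them from true positives, false positives, and false negatives. The function names and the counts are hypothetical and are not results from the paper. The last line only checks that the precision (79.95%) and recall (84.26%) reported for LR are consistent with its reported F1 score.

```python
def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_score(precision, recall)


# Hypothetical counts for the positive class.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(p, r, f1)

# Consistency check with the LR precision and recall reported in percent.
print(round(f1_score(79.95, 84.26), 2))  # 82.05
```

Guarding the divisions keeps the helpers defined when a class receives no predictions, a choice made here for robustness rather than something the paper specifies.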
E. EVALUATION MEASURES
We evaluate the effectiveness of our sentiment analysis models using Recall (R), Precision (P), and F1-measure. The mathematical equations are as follows:
$$\text{Precision} = \frac{TP}{TP + FP},$$
$$\text{Recall} = \frac{TP}{TP + FN},$$
$$F_1 = \frac{2 \times P \times R}{P + R},$$
where TP and FP stand for true positive and false positive, and FN stands for false negative.
V. EXPERIMENTAL SETTINGS AND RESULTS
We performed our experiments on UCSA, which is publicly available to the research community. UCSA contains 9,601 Urdu reviews, which belong to politics, dramas, movies, TV talk shows, sports, and software domains. The dataset is split into training, which contains 80% of user reviews, and testing, which contains 20%. In all the experiments for machine learning models we used default
parameters. For deep learning algorithms, we used mean square error (MSE) as the loss function and Adam as the optimizer. We set the number of epochs to 25.
A. RESULT AND DISCUSSION
Each of the six machine learning classifiers is run on the UCSA dataset using word n-gram features. All the results are carefully examined to identify the machine learning classifier and features that achieve better results than the others in terms of accuracy, precision, recall, and F1 score. As the Table 2 results show, all the machine learning classifiers perform quite poorly with the trigram feature. Generally, there are discriminative models (SVM, LR, etc.) and generative classification models (NB).
Both SVM and LR achieve satisfactory results, as both classifiers belong to discriminative models. Logistic regression is a supervised machine learning algorithm that is used when problems are categorical in nature, and it is the most commonly used classifier when the data have two classes, either positive or negative. Overall, the highest accuracy of 81.94%, precision of 79.95%, recall of 84.26%, and F1 score of 82.05% were achieved by LR with the combination of n-gram features. The SVM classifier achieves the second highest accuracy, precision, recall and F1 score, which were 81.47%, 80.32%, 82.36%, and 81.47%, respectively, with the unigram feature.
The worst accuracy of all classifiers, 55.25%, was obtained by RF with trigram features. All classifiers perform better with bigram features than with trigram features. The overall results using different machine learning models with different features are shown in Table 2. Figures 5, 6, 7, 8 and 9 describe the comparison of each model in terms of accuracy, precision, recall and F1 measure with word n-gram features.
Table 3 presents the results of deep learning models on our dataset. LSTM achieves slightly better results than the 1D-CNN model in terms of accuracy, which is 75.96% for LSTM and 75.73% for 1D-CNN. Deep learning results are slightly lower than those of the machine learning models. This is because some of the words are out of vocabulary in the fastText pre-trained model. Therefore, in machine and deep learning our results are in line with state-of-the-art results.
As previously stated, there is a lack of research using machine learning algorithms in Urdu sentiment analysis. Very few studies are found in this context, and they used different machine learning classifiers on very small datasets. Our dataset contains more user reviews than those of previous studies. The results of our study reveal that each model in our study performs better than existing models. A comparison of our study with existing studies is presented in Table 4.
VI. CONCLUSION AND FUTURE WORK
Few research studies have been reported in the Urdu sentiment analysis domain. In this paper, high classification accuracy has been achieved for Urdu sentiment analysis using various machine and deep learning models. After performing various experiments based on two text representations, n-gram features and pre-trained word embeddings, we achieve the highest F1 score of 82.05% using LR with a combination of features. The SVM classifier is the second highest performer for this task, and its average performance is better than that of all other classifiers. This study opens a new domain for future researchers to explore resource-deprived languages. One of the limitations of this study is that it includes only positive and negative classes; our future work will include a neutral class in our dataset. In the future, we will also include state-of-the-art classifiers such as BERT in the benchmark techniques.
REFERENCES
[1] S.-U. Hassan, A. Akram, and P. Haddawy, "Identifying important citations using contextual information from full text," in Proc. ACM/IEEE Joint Conf. Digit. Libraries (JCDL), Jun. 2017, pp. 1–8.
[2] Y. Liu, F. Du, J. Sun, T. Silva, Y. Jiang, and T. Zhu, "Identifying social roles using heterogeneous features in online social networks," J. Assoc. Inf. Sci. Technol., vol. 70, pp. 660–674, Mar. 2019.
[3] Z. Luo, S. Huang, and K. Q. Zhu, "Knowledge empowered prominent aspect extraction from product reviews," Inf. Process. Manage., vol. 56, no. 3, pp. 408–423, May 2019.
[4] F. Anwaar, N. Iltaf, H. Afzal, and R. Nawaz, "HRS-CE: A hybrid framework to integrate content embeddings in recommender systems for cold start items," J. Comput. Sci., vol. 29, pp. 9–18, Nov. 2018.
[5] R. Nawaz, P. Thompson, and S. Ananiadou, "Identification of manner in bio-events," in Proc. LREC, 2012, pp. 3505–3510.
[6] H. Qadir, O. Khalid, M. U. S. Khan, A. U. R. Khan, and R. Nawaz, "An optimal ride sharing recommendation framework for carpooling services," IEEE Access, vol. 6, pp. 62296–62313, 2018.
[7] M. Z. Asghar, A. Sattar, A. Khan, A. Ali, F. M. Kundi, and S. Ahmad, "Creating sentiment lexicon for sentiment analysis in Urdu: The case of a resource-poor language," Expert Syst., vol. 36, no. 3, Jun. 2019, Art. no. e12397.
[8] A. Z. Syed, M. Aslam, and A. M. Martinez-Enriquez, "Lexicon based sentiment analysis of Urdu text using sentiunits," in Proc. Mexican Int. Conf. Artif. Intell. Berlin, Germany: Springer, 2010, pp. 32–43.
[9] M. Ijaz and S. Hussain, "Corpus based Urdu lexicon development," in Proc. Conf. Lang. Technol. (CLT), Peshawar, Pakistan, vol. 73, 2007, pp. 1–12.
[10] W. Anwar, X. Wang, and X.-L. Wang, "A survey of automatic Urdu language processing," in Proc. Int. Conf. Mach. Learn. Cybern., 2006, pp. 4489–4494.
[11] A. Daud, W. Khan, and D. Che, "Urdu language processing: A survey," Artif. Intell. Rev., vol. 47, no. 3, pp. 279–311, Mar. 2017.
[12] S. Kiritchenko, S. Mohammad, and M. Salameh, "SemEval-2016 task 7: Determining sentiment intensity of English and Arabic phrases," in Proc. 10th Int. Workshop Semantic Eval. (SemEval), 2016, pp. 42–51.
[13] J. Villena-Román, J. García-Morera, and J. C. González-Cristóbal, "DAEDALUS at SemEval-2014 task 9: Comparing approaches for sentiment analysis in Twitter," in Proc. 8th Int. Workshop Semantic Eval. (SemEval), 2014, pp. 218–222.
[14] P. Nakov, A. Ritter, S. Rosenthal, F. Sebastiani, and V. Stoyanov, "SemEval-2016 task 4: Sentiment analysis in Twitter," 2019, arXiv:1912.01973. [Online]. Available: https://arxiv.org/abs/1912.01973
[15] H. Jang, M. Kim, and H. Shin, "KOSAC: A full-fledged Korean sentiment analysis corpus," in Proc. 27th Pacific Asia Conf. Lang., Inf., Comput. (PACLIC), 2013, pp. 366–373.
[16] L.-S. Chen, C.-H. Liu, and H.-J. Chiu, "A neural network based approach for sentiment classification in the blogosphere," J. Informetrics, vol. 5, no. 2, pp. 313–322, Apr. 2011.
[17] A. F. Wicaksono, C. Vania, B. Distiawan, and M. Adriani, "Automatically building a corpus for sentiment analysis on Indonesian tweets," in Proc. 28th Pacific Asia Conf. Lang., Inf. Comput., 2014, pp. 185–194.
[18] A. Z. Syed, M. Aslam, and A. M. Martinez-Enriquez, "Associating targets with SentiUnits: A step forward in sentiment analysis of Urdu text," Artif. Intell. Rev., vol. 41, no. 4, pp. 535–561, Apr. 2014.
[19] Z. U. Rehman and I. S. Bajwa, "Lexicon-based sentiment analysis for Urdu language," in Proc. 6th Int. Conf. Innov. Comput. Technol. (INTECH), Aug. 2016, pp. 497–501.
[20] W. Zhao, Z. Guan, L. Chen, X. He, D. Cai, B. Wang, and Q. Wang, "Weakly-supervised deep embedding for product review sentiment analysis," IEEE Trans. Knowl. Data Eng., vol. 30, no. 1, pp. 185–197, Jan. 2018.
[21] M. Kamkarhaghighi and M. Makrehchi, "Content tree word embedding for document representation," Expert Syst. Appl., vol. 90, pp. 241–249, Dec. 2017.
[22] Z. Hu, J. Hu, W. Ding, and X. Zheng, "Review sentiment analysis based on deep learning," in Proc. IEEE 12th Int. Conf. e-Bus. Eng., Oct. 2015, pp. 87–94.
[23] S. M. Rezaeinia, A. Ghodsi, and R. Rahmani, "Improving the accuracy of pre-trained word embeddings for sentiment analysis," 2017, arXiv:1711.08609. [Online]. Available: https://arxiv.org/abs/1711.08609
[24] F. Ali, D. Kwak, P. Khan, S. El-Sappagh, A. Ali, S. Ullah, K. H. Kim, and K.-S. Kwak, "Transportation sentiment analysis using word embedding and ontology-based topic modeling," Knowl.-Based Syst., vol. 174, pp. 27–42, Jun. 2019.
[25] F. Ali, S. El-Sappagh, S. M. R. Islam, A. Ali, M. Attique, M. Imran, and K.-S. Kwak, "An intelligent healthcare monitoring framework using wearable sensors and social networking data," Future Gener. Comput. Syst., vol. 114, pp. 23–43, Jan. 2021.
[26] F. Ali, A. Ali, M. Imran, R. A. Naqvi, M. H. Siddiqi, and K.-S. Kwak, "Traffic accident detection and condition analysis based on social networking data," Accident Anal. Prevention, vol. 151, Mar. 2021, Art. no. 105973.
[27] N. Mukhtar and M. A. Khan, "Urdu sentiment analysis using supervised machine learning approach," Int. J. Pattern Recognit. Artif. Intell., vol. 32, no. 2, Feb. 2018, Art. no. 1851001.
[28] M. Pontiki, D. Galanis, H. Papageorgiou, I. Androutsopoulos, S. Manandhar, M. Al-Smadi, M. Al-Ayyoub, Y. Zhao, B. Qin, O. De Clercq, and V. Hoste, "SemEval-2016 task 5: Aspect based sentiment analysis," in Proc. Int. Workshop Semantic Eval., 2016, pp. 19–30.
[29] H. Masroor, M. Saeed, M. Feroz, K. Ahsan, and K. Islam, "Transtech: Development of a novel translator for Roman Urdu to English," Heliyon, vol. 5, no. 5, May 2019, Art. no. e01780.
[30] A. Rafae, A. Qayyum, M. Moeenuddin, A. Karim, H. Sajjad, and F. Kamiran, "An unsupervised method for discovering lexical variations in Roman Urdu informal text," in Proc. Conf. Empirical Methods Natural Lang. Process., 2015, pp. 823–828.
[31] D. Maynard and K. Bontcheva, "Challenges of evaluating sentiment analysis tools on social media," in Proc. 10th Int. Conf. Lang. Resour. Eval. (LREC), 2016, pp. 1142–1148.
[32] M. Ganapathibhotla and B. Liu, "Mining opinions in comparative sentences," in Proc. 22nd Int. Conf. Comput. Linguistics (COLING), 2008, pp. 241–248.
[33] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov, "Learning word vectors for 157 languages," 2018, arXiv:1802.06893. [Online]. Available: https://arxiv.org/abs/1802.06893
[34] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Trans. Assoc. Comput. Linguistics, vol. 5, pp. 135–146, Dec. 2017.
[35] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," 2013, arXiv:1310.4546. [Online]. Available: https://arxiv.org/abs/1310.4546
[36] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Aug. 2011.
[37] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," 2014, arXiv:1404.2188. [Online]. Available: https://arxiv.org/abs/1404.2188
[38] Y. Kim, "Convolutional neural networks for sentence classification," Sep. 2014, arXiv:1408.5882v2. [Online]. Available: https://arxiv.org/abs/1408.5882v2
[39] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
LAL KHAN was born in D. G. Khan, Punjab, Pakistan, in 1990. He received the M.S. degree in computer science from the Federal Urdu University of Arts, Science and Technology, Islamabad, in 2017. He is currently a Ph.D. Scholar with the Department of Computer Science and Information Engineering, Chang Gung University, Taiwan. He is also working on NLP tasks for resource-deprived languages. His research interests include machine learning, deep learning, natural language processing (NLP), and speech recognition.
AMMAR AMJAD received the master's degree in computer science from the National College of Business Administration and Economics, in March 2017. He is currently pursuing the Ph.D. degree in electrical engineering with the Division of Computer Science and Information Engineering, Chang Gung University, Taiwan. His main research interests include speech processing, language learning, speech analysis, speech synthesis, voice pathologies, auditory neuroscience, and machine learning.
NOMAN ASHRAF received the master's degree in computer science from the National University of Computer and Emerging Sciences, Islamabad, Pakistan. He worked as a Lecturer with The University of Lahore, Pakistan, from 2017 to 2019. He is currently a Ph.D. Scholar with the Centro de Investigación en Computación, Instituto Politécnico Nacional (IPN). His research interests include natural language processing (NLP), machine learning, and deep learning.
HSIEN-TSUNG CHANG (Member, IEEE) received the M.S. and Ph.D. degrees from the Department of Computer Science and Information (CSIE), National Chung Cheng University, in July 2000 and July 2007, respectively. He joined the Faculty of the Department of Computer Science and Information Engineering, Chang Gung University, and served as an Associate Professor. He is currently a member of the Artificial Intelligence Research Center, Chang Gung University, and the Department of Physical Medicine and Rehabilitation, Chang Gung Memorial Hospital. He is also the Director of the Web Information and Data Engineering Laboratory (WIDE Lab). His research interests include artificial intelligence, natural language processing, information retrieval, big data, web services, and search engines.
ALEXANDER GELBUKH is currently a Research Professor and the Head of the Natural Language Processing Laboratory, Center for Computing Research, Instituto Politécnico Nacional, Mexico, and an Honorary Professor of Amity University, India. He has authored or coauthored more than 500 publications in computational linguistics, natural language processing, and artificial intelligence, recently with a focus on sentiment analysis and opinion mining. He is a member of the Mexican Academy of Sciences, a Founding Member of the Mexican Academy of Computing, and a National Researcher of Mexico (SNI) at excellence level 3 (highest). He is the Editor-in-Chief, an associate editor, or an editorial board member of more than 20 international journals, and has been the chair or the program committee chair of over 50 international conferences.