Abstract
Massive amounts of unstructured content are generated every day on social media platforms such as Facebook, Twitter, and blogs. Analyzing and extracting useful information from this vast amount of text is a challenging process, and social media has opened extensive opportunities for researchers and practitioners to work in this area. Most text content on social media tends to be either in English or in code-mixed regional languages. In a multilingual country like India, code-mixing is common in social media discussions. Multilingual users frequently write in Roman script, a convenient mode of expression, instead of the regional language script, and often mix English into their native languages. Stylistic and grammatical irregularities are significant challenges in processing code-mixed text with conventional methods. This paper presents a new word embedding based on character-level representations as features for POS tagging of code-mixed text in Indian languages, using the ICON-2015 and ICON-2016 NLP tools contest data sets. The proposed word embedding features are context-appended, and the well-known Support Vector Machine (SVM) classifier is used to train the system. We combined the Facebook, Twitter, and WhatsApp code-mixed data of three Indian languages to train a transfer-learning-based, language-independent and source-independent POS tagger. The experimental results show that the proposed transfer method achieved state-of-the-art accuracy in 12 of the 18 systems for the ICON data set.
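To make the idea of context-appended character-level features concrete, the following is a minimal sketch and not the embedding model used in this work. It assumes hashed character trigram counts as a stand-in for the learned character-level word embedding. The names `char_ngrams`, `word_vector`, `context_appended`, the width `DIM`, and the example sentence are all hypothetical and chosen only for illustration.

```python
import hashlib
from typing import List

DIM = 8  # hypothetical embedding width, chosen only for illustration


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Return character n-grams of a word padded with boundary markers."""
    padded = f"<{word.lower()}>"
    if len(padded) <= n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_vector(word: str, dim: int = DIM) -> List[float]:
    """Stand-in character-level vector: hashed n-gram counts, L1-normalised.

    Assumption: this replaces the learned embedding with a simple hashing trick.
    """
    vec = [0.0] * dim
    grams = char_ngrams(word)
    for gram in grams:
        # md5 gives a bucket that is stable across runs (unlike built-in hash()).
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    total = len(grams)
    return [v / total for v in vec]


def context_appended(tokens: List[str], window: int = 1,
                     dim: int = DIM) -> List[List[float]]:
    """Concatenate each token's vector with the vectors of its neighbours.

    Positions outside the sentence are filled with zero vectors.
    """
    vectors = [word_vector(t, dim) for t in tokens]
    pad = [0.0] * dim
    features = []
    for i in range(len(tokens)):
        row: List[float] = []
        for offset in range(-window, window + 1):
            j = i + offset
            row.extend(vectors[j] if 0 <= j < len(tokens) else pad)
        features.append(row)
    return features


# Made-up Romanized Hindi-English sentence, not taken from the ICON data.
sentence = ["yaar", "movie", "bahut", "achhi", "thi"]
features = context_appended(sentence, window=1)
print(len(features), len(features[0]))
```

Each row holds the vectors of the previous, current, and next token, so a window of 1 triples the feature width. Rows of this form, paired with gold POS tags, could then be passed to a linear SVM; that training step is omitted here.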



Notes
The term "Code-Mixed Cross-Script" is referred to as code-mixed script throughout this paper.
Even though we used only the data set provided by the task organizers, we considered our task submission unconstrained because data sets of other languages and other sources were used to learn the word embedding and character embedding.
Constrained means the participating team is allowed to use only the corpus given by the organizers for training. No external resources are allowed.
Unconstrained means the participating team can use any external resource (an available POS tagger, NER, parser, or any additional data) to train their system.
Stylistic features are used in the constrained model, and Word2vec is used in the unconstrained model.
Acknowledgements
We would like to thank the ICON-2015 and ICON-2016 tools contest organizers for organizing the NLP event in India. We would also like to thank Dr. Amitav Das for initiating this research along with the tools contest.
Cite this article
Madasamy, A.K., Padannayil, S.K. Transfer learning based code-mixed part-of-speech tagging using character level representations for Indian languages. J Ambient Intell Human Comput 14, 7207–7218 (2023). https://doi.org/10.1007/s12652-021-03573-3
DOI: https://doi.org/10.1007/s12652-021-03573-3