Abstract
The widespread use of social media has generated considerable research interest in information retrieval, natural language processing, and machine learning. The wide diversity of languages used on social media creates a need for accurate automated language identification tools. In this research, we develop a language identification tool that automatically identifies social media posts written in Indonesian, Javanese, Sundanese, and Minangkabau. The latter three are among the most widely spoken regional languages in Indonesia. We conducted experiments comparing three popular methods for building language identification tools: N-grams, statistical models, and the Small Words technique. Our experiments, which used articles from the internet for training and a social media data set that we constructed for testing, show that the statistical method obtains the best result among the methods compared.
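To make the N-gram method more concrete, the following is a minimal sketch of one common variant: rank-ordered character n-gram profiles compared with an "out-of-place" distance. It is not taken from the paper and may differ from the authors' implementation. The function names, the `top_k` cutoff, and the two toy training sentences are all assumptions made for illustration; a real system would train on far larger corpora for each of the four languages.

```python
from collections import Counter

def char_ngrams(text, n_max=3):
    """Yield character n-grams (length 1..n_max) from each padded word."""
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]

def build_profile(text, top_k=300):
    """Map the top_k most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(char_ngrams(text))
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams unseen in the language cost max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def identify(text, profiles, top_k=300):
    """Return the language whose profile is closest to the text's profile."""
    doc = build_profile(text, top_k)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang], top_k))

# Toy training data, invented for illustration only (hypothetical, not from the paper).
TRAIN = {
    "id": "saya sedang makan nasi di rumah",  # Indonesian
    "jv": "aku lagi mangan sega ing omah",    # Javanese
}
profiles = {lang: build_profile(text) for lang, text in TRAIN.items()}

print(identify("aku mangan sega", profiles))
```

With profiles this small the ranking is dominated by noise, so the sketch shows only the mechanics: build one profile per language, then pick the language with the smallest distance.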
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Martadinata, P., Trisedya, B.D., Manurung, H.M., Adriani, M. (2016). Building Indonesian Local Language Detection Tools Using Wikipedia Data. In: Murakami, Y., Lin, D. (eds) Worldwide Language Service Infrastructure. WLSI 2015. Lecture Notes in Computer Science, vol 9442. Springer, Cham. https://doi.org/10.1007/978-3-319-31468-6_8
DOI: https://doi.org/10.1007/978-3-319-31468-6_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31467-9
Online ISBN: 978-3-319-31468-6
eBook Packages: Computer Science, Computer Science (R0)