Building Indonesian Local Language Detection Tools Using Wikipedia Data

Martadinata, Puji; Trisedya, Bayu Distiawan; Manurung, Hisar Maruli; Adriani, Mirna

doi:10.1007/978-3-319-31468-6_8

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9442)

Included in the following conference series:

International Workshop on Worldwide Language Service Infrastructure

Abstract

The widespread use of social media has generated considerable research interest in information retrieval, natural language processing, and machine learning. The vast diversity of languages used on social media creates a need for accurate automated language identification tools. In this research, we develop a language identification tool that automatically identifies social media posts in Indonesian, Javanese, Sundanese, and Minangkabau. The latter three are among the most widely spoken regional languages in Indonesia. We conducted experiments to compare three popular methods for building language identification tools: N-grams, statistical models, and the Small Words technique. Our experiments, which used articles from the internet for training and social media data that we constructed for testing, show that the statistical method obtains the best results among the methods compared.
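
To make the N-gram approach mentioned in the abstract concrete, the following is a minimal sketch of a rank-order character n-gram classifier in the style commonly attributed to Cavnar and Trenkle. It is not the authors' implementation, and it may differ from the variant evaluated in the paper. The function names, the language codes, the `n_max` and `top_k` parameters, and the one-sentence training texts are all assumptions made for illustration; the sentences are rough hand-written approximations, not the paper's training or test data.

```python
from collections import Counter

# Toy training texts (hypothetical, hand-written; a real system would use
# large corpora per language, e.g. encyclopedia articles).
TRAIN = {
    "id": "saya tidak tahu apa yang terjadi di sana",       # Indonesian
    "jv": "aku ora ngerti apa sing kedadeyan ing kana",     # Javanese
    "su": "abdi henteu terang naon anu kajadian di ditu",   # Sundanese
    "min": "ambo indak tahu apo nan tajadi di sinan",       # Minangkabau
}


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k character n-grams (n = 1..n_max), most frequent first."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def classify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc_profile, profiles[lang]))


# Build one profile per language from the toy training texts.
profiles = {lang: ngram_profile(text) for lang, text in TRAIN.items()}

if __name__ == "__main__":
    print(classify("aku ora ngerti apa sing kedadeyan ing kana", profiles))
```

With training texts this small the classifier is only reliable on input that closely resembles the training sentences; the sketch is meant to show the profile-and-distance mechanics, not to reproduce any result reported in the paper.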

Notes

1. www.wikipedia.com.
2. twitter.com.
3. http://odur.let.rug.nl/~vannoord/TextCat/.
4. http://search.cpan.org/~mpiotr/Lingua-Ident-1.7/Ident.pm.
5. http://search.cpan.org/~ambs/Lingua-Identify-0.56/lib/Lingua/Identify.pm.

Author information

Authors and Affiliations

Faculty of Computer Science, Universitas Indonesia, Depok, Indonesia
Puji Martadinata, Bayu Distiawan Trisedya, Hisar Maruli Manurung & Mirna Adriani

Corresponding author

Correspondence to Puji Martadinata.

Editor information

Editors and Affiliations

Unit of Design, Kyoto University, Kyoto, Japan
Yohei Murakami
Kyoto University, Kyoto, Japan
Donghui Lin

About this paper

Cite this paper

Martadinata, P., Trisedya, B.D., Manurung, H.M., Adriani, M. (2016). Building Indonesian Local Language Detection Tools Using Wikipedia Data. In: Murakami, Y., Lin, D. (eds) Worldwide Language Service Infrastructure. WLSI 2015. Lecture Notes in Computer Science, vol 9442. Springer, Cham. https://doi.org/10.1007/978-3-319-31468-6_8

DOI: https://doi.org/10.1007/978-3-319-31468-6_8
Published: 13 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31467-9
Online ISBN: 978-3-319-31468-6
eBook Packages: Computer Science, Computer Science (R0)
