Dzongkha Word Segmentation
Sithar Norbu, Pema Choejey, Tenzin Dendup
Research Division
Department of Information Technology & Telecom
{snorbu, pchoejay, tdendup}@dit.gov.bt
Sarmad Hussain, Ahmed Mauz
Center for Research in Urdu Language Processing
National University of Computer & Emerging Sciences
{sarmad.hussain, ahmed.mauz}@nu.edu.pk
Abstract

Dzongkha, the national language of Bhutan, is continuous in written form and does not mark word boundaries. Dzongkha word segmentation is one of the fundamental problems and a prerequisite that needs to be solved before more advanced Dzongkha text processing and other natural language processing tools can be developed. This paper presents our initial attempt at segmenting Dzongkha sentences into words. The paper describes the implementation of maximal matching (a dictionary-based approach) followed by a bigram technique (a non-dictionary-based approach) for segmenting Dzongkha script. Although the techniques used are basic and naive, they provide a baseline for the Dzongkha word segmentation task. Preliminary experiments yield an overall segmentation accuracy of 91.5%. However, segmentation accuracy depends on the type of document domain and on the size and quality of the lexicon and the corpus. Some related issues and future directions are also discussed.
Keywords: Dzongkha script, word segmentation, maximal matching, bigram technique, smoothing technique.

1 Introduction
Segmentation of a sentence into words is one of the necessary preprocessing tasks and is essential in natural language processing. This is because the word is, both syntactically and semantically, the fundamental unit for analyzing language structure. As in other languages, Dzongkha word segmentation is viewed as one of the fundamental and foremost steps in Dzongkha-related language processing tasks.

The most challenging feature of the Dzongkha script is the lack of word boundary separation between words1. In order to carry out further linguistic and natural language processing tasks, the script must first be transformed into a sequence of words; segmenting text into words is therefore an essential step in natural language processing. Like the Chinese, Japanese and Korean (CJK) languages, Dzongkha is written continuously without any word delimiter, which poses a major problem for natural language processing tasks. For CJK, Thai and Vietnamese, many solutions have already been published; for Dzongkha, this is the first word segmentation solution to be documented.

In this paper, we describe Dzongkha word segmentation, which is performed first using a dictionary-based approach in which the maximal matching algorithm is applied to the input text. Given a collection of lexicon entries, the maximal matching algorithm selects, from all possible segmentations of the input sentence, the segmentation that yields the minimum number of word tokens. A non-dictionary-based approach is then applied using a bigram technique. The probabilistic model of a word sequence is studied using Maximum Likelihood Estimation (MLE). The MLE approach has an obvious disadvantage because of the unavoidably limited size of the training corpora (Nugues, 2006). To address this problem of data sparseness, the Katz back-off model with Good-Turing smoothing is applied.

1 http://www.learntibetan.net/grammar/sentence.htm
2 Dzongkha Script
Dzongkha is the official and national language of Bhutan. It is spoken as the first language by approximately 130,000 people and as the second language by about 470,000 people (Van Driem and Tshering, 1998). Dzongkha is closely related to the Sino-Tibetan languages and is a member of the Tibeto-Burmese language family. It is an alphabetic language, with phonetic characteristics that mirror those of Sanskrit. Like many of the alphabets of India and South East Asia, the Bhutanese script, called Dzongkha script, is also syllabic2. A syllable can contain as little as one character or as many as six characters, and a word can consist of one, two or multiple syllables. In written form, Dzongkha script contains a dot called Tsheg ( ་ ) that serves as a syllable and phrase delimiter, but words are not separated at all.

For example,

Dzongkha       Transliteration   English                  Syllables
དམརཔོ           dmarpo            red                      single-syllabled
སོབ་དཔོན          slop-pon          teacher                  two-syllabled
འཇམ་ཏོག་ཏོ        hjam-tog-to       easy                     three-syllabled
འར་རི་འར་རི        har-ri-hur-ri     crowdedness/confusion    four-syllabled

Table 1: Different syllabled Dzongkha scripts.

2 http://omniglot.com/writing/tibetan.htm
The sentence is terminated with a vertical stroke called Shad ( ། ), which acts as a full stop. The frequently occurring whitespace in a Dzongkha sentence serves as a phrase boundary or comma, and is a faithful representation of speech: after all, in speech we pause not between words, but either after certain phrases or at the end of a sentence.

A sample Dzongkha sentence reads as follows:

རོང་ཁ་གོང་འཕེལ་ལན་ཚོགས་འདི་ འབག་རལ་ཁབ་ནང་ གཞང་གི་ཁ་ཐག་ལས་ འབག་གི་རལ་ཡོངས་སད་ཡིག་
རོང་ཁའི་སིད་བས་རམ་མི་དང་ རོང་ཁའི་མཐར་ཐག་གི་དབང་འཛིན་པ་ རང་དབང་རང་སོང་གི་འདས་ཚོགས་ མཐོ་ཤོས་ཅིག་ཨིན།
འདས་ཚོགས་འདི་ འབག་རལ་བཞི་པ་མི་དབང་མངའ་བདག་རིན་པོ་ཆེ་ དཔལ་འཇིགས་མེད་སེང་གེ་དབང་ཕག་མཆོག་གི་
ཐགས་དགོངས་དང་འཁིལ་ཏེ་ སི་ལོ་ ༡༩༨༦ ལ་གཞི་བཙགས་གནང་གནངམ་ཨིན།

(English translation of the example text)

[The Dzongkha Development Commission is the leading institute in the country for the advancement of Dzongkha, the national language of Bhutan. It is an independent organization established by the Fourth King of Bhutan, His Majesty the King Jigme Singye Wangchuck, in 1986.]
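Since Tsheg delimits syllables and Shad terminates sentences, a first preprocessing pass can be sketched as follows (a minimal illustration in Python, not the system's actual code):

```python
# Minimal sketch (not the system's code): break raw Dzongkha text
# into sentences at Shad and syllables at Tsheg.
TSHEG = "\u0F0B"   # ་  syllable delimiter
SHAD = "\u0F0D"    # །  sentence terminator

def sentences(text: str):
    """Split raw text into sentences at Shad, dropping empties."""
    return [s.strip() for s in text.split(SHAD) if s.strip()]

def syllables(sentence: str):
    """Split a sentence into syllables at Tsheg (whitespace ignored)."""
    return [syl for chunk in sentence.split()
            for syl in chunk.split(TSHEG) if syl]

text = "འདི་རོང་ཁ་གི་ ཞིབ་འཚོལ་ཡིག་ཆ་ ཨིན།"
for s in sentences(text):
    print(syllables(s))
```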
3 Materials and Methods

Since our language has no word boundary delimiter, the major resource for Dzongkha word segmentation is a collection of lexicon entries (a dictionary). For such languages, dictionaries are needed to segment running text. Therefore, the coverage of the dictionary plays a significant role in the accuracy of word segmentation (Pong and Robert, 1994).

The dictionary that we used contains 23,333 word entries. The lexicon was collected from the "Dzongkha Dictionary", 2nd Edition, published by the Dzongkha Development Authority, Ministry of Education, 2005 (ddc@druknet.bt). A manually segmented text corpus containing 41,739 tokens is also used for the method. The text corpora were collected from different sources such as newspaper articles, dictionaries and printed books, and belong to domains such as World Affairs, Social Sciences, Arts, Literature, Adventures, Culture and History. Some texts such as poetry and songs were added manually.
Table 2 below gives a glimpse of the textual domains contained in the text corpora used for the method (Chungku et al., 2010).
Domain          Sub domain                   (%)
World Affairs   Bilateral relations          12%
Social Science  Political Science            2%
Arts            Poetry/Songs/Ballad          9%
Literatures     Essays/Letters/Dictionary    72%
Adventures      Travel Adventures            1%
Culture         Culture Heritage/Tradition   2%
History         Myths/Architecture           2%

Table 2: Textual domains contained in the corpus.

Figure 1 below shows the Dzongkha word segmentation process.

[Figure 1: Dzongkha Word Segmentation Process.]

Dzongkha word segmentation implements the principle of the maximal matching algorithm followed by a statistical (bigram) method. It first uses a word list/lexicon to segment the raw input sentences. It then uses MLE principles to estimate the bigram probabilities for each segmented word. All possible segmentations of an input sentence produced by maximal matching are then re-ranked, and the most likely segmentation is picked from the set of possible segmentations using a statistical approach (the bigram technique). This decides the best possible segmentation among all the word sequences generated by the maximal matching algorithm (Huor et al., 2007). These mechanisms are described in the following sections.

3.1 Maximal Matching Algorithm

The basic idea of the maximal matching algorithm is that it first generates all possible segmentations for an input sentence and then selects the segmentation that contains the minimum number of word tokens. It uses dictionary lookup.

We used the following steps to segment a given input sentence:

1. Read the input string of text. If an input line contains more than one sentence, a sentence separator is applied to break the line into individual sentences.
2. Split the input string of text by Tsheg ( ་ ) into syllables.
3. Taking the next syllable, generate all possible strings.
4. If the number of strings is greater than n for some value n3: look up the series of strings in the dictionary to find matches and assign some weight-age4 accordingly; sort the strings on the given weight-age; delete (number of strings − n) low-count strings.
5. Repeat from Step 2 until all syllables are processed.

The above-mentioned steps produce all possible segmented words from the given input sentence based on the provided lexicon. Thus, the overall accuracy and performance depend on the coverage of the lexicon (Pong and Robert, 1994).

3 The greater the value of n, the better the chances of selecting the sentence with the fewest words from the possible segmentations.
4 If a possible string is found in the dictionary entries, the number of syllables in the string is counted. The weight-age for the string is then calculated as (number of syllables)^2; otherwise it carries a weight-age of 0.
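As an illustration of Steps 1-5, the following Python sketch (our own simplified rendering, not the system's implementation) keeps, at each syllable position, the n best partial segmentations, scoring dictionary matches by squared syllable count as in footnote 4. The toy lexicon and the whitespace- and Shad-stripped sample sentence are hypothetical illustrations:

```python
# Simplified sketch of the beam-pruned maximal matching in Steps 1-5.
TSHEG = "\u0F0B"  # ་ syllable delimiter

def weightage(word, lexicon):
    """Footnote 4: (number of syllables)^2 if word is in the lexicon, else 0."""
    n_syl = word.count(TSHEG) + 1
    return n_syl * n_syl if word in lexicon else 0

def segment(sentence, lexicon, n=10):
    """Return candidate segmentations, highest total weight-age first."""
    syls = [s for s in sentence.split(TSHEG) if s]
    beam = [[]]  # each hypothesis is the list of words built so far
    for syl in syls:  # Step 3: extend every hypothesis with the next syllable
        extended = []
        for words in beam:
            extended.append(words + [syl])  # start a new word
            if words:                       # or grow the previous word
                extended.append(words[:-1] + [words[-1] + TSHEG + syl])
        # Step 4: sort by weight-age and drop all but the n best strings.
        extended.sort(key=lambda ws: sum(weightage(w, lexicon) for w in ws),
                      reverse=True)
        beam = extended[:n]
    return beam

# Toy lexicon (hypothetical entries, for illustration only):
lexicon = {"འདི", "རོང་ཁ", "གི", "ཞིབ་འཚོལ", "ཡིག་ཆ", "ཨིན"}
print(segment("འདི་རོང་ཁ་གི་ཞིབ་འཚོལ་ཡིག་ཆ་ཨིན", lexicon)[0])
```

With this toy lexicon the top hypothesis is the six-word segmentation འདི | རོང་ཁ | གི | ཞིབ་འཚོལ | ཡིག་ཆ | ཨིན, since longer dictionary matches receive quadratically larger weight-age and hence fewer word tokens survive the pruning.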
3.2 Bigram Method
(a) Maximum Likelihood Estimation5

In the bigram method, we make the approximation that the probability of a word depends only on the identity of the immediately preceding word. That is, we calculate the probability of the next word given the previous word, as follows:

P(w_1^n) = \prod_{i=1}^{n} P(w_i | w_{i-1})

where

P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})

Here count(w_{i-1} w_i) is the total number of occurrences of the word sequence w_{i-1} w_i in the corpus, and count(w_{i-1}) is the total number of occurrences of the word w_{i-1} in the corpus.

To make P(w_i | w_{i-1}) meaningful for i = 1, we use the distinguished token <s> at the beginning of the sentence; that is, we pretend w_0 = <s>. In addition, to make the sum of the probabilities of all strings equal 1, it is necessary to place a distinguished token </s> at the end of the sentence.

One of the key problems with MLE is insufficient data. That is, because of the unavoidably limited size of the training corpus, the vast majority of words are uncommon, and some bigrams may not occur at all in the corpus, leading to zero probabilities. Therefore, the following smoothing techniques were used to estimate the probabilities of unseen bigrams.

5 P.M. Nugues, An Introduction to Language Processing with Perl and Prolog: An Outline of Theories, Implementation, and Application with Special Consideration of English, French, and German (Cognitive Technologies), pp. 95-104.
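A minimal sketch of these MLE bigram estimates, and of scoring a candidate segmentation by the product of its bigram probabilities, is given below (illustrative Python under our own naming; it assumes the corpus is already segmented into word lists):

```python
# Minimal sketch of the MLE bigram model described above.
from collections import Counter

def bigram_mle(sentences):
    """Estimate P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(padded[:-1])             # count(w_{i-1})
        bigrams.update(zip(padded, padded[1:]))  # count(w_{i-1} w_i)
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

def score(words, probs):
    """P(w_1^n) as a product of bigram probabilities (0 if unseen)."""
    p, prev = 1.0, "<s>"
    for w in list(words) + ["</s>"]:
        p *= probs.get((prev, w), 0.0)
        prev = w
    return p

# Toy corpus; in the system this is the 41,739-token segmented corpus.
corpus = [["འདི", "རོང་ཁ", "གི", "ཞིབ་འཚོལ", "ཡིག་ཆ", "ཨིན"]]
probs = bigram_mle(corpus)
print(score(["འདི", "རོང་ཁ", "གི", "ཞིབ་འཚོལ", "ཡིག་ཆ", "ཨིན"], probs))
```

Candidate segmentations produced by maximal matching can then be re-ranked by this score; the zero probability of unseen bigrams is exactly the sparseness problem that the smoothing below addresses.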
(b) Smoothing Bigram Probabilities

The above problem of data sparseness underestimates the probability of some of the sentences in the test set. The smoothing technique helps to prevent such errors by making the probabilities more uniform. Smoothing is the process of flattening the probability distribution implied by a language model so that all reasonable word sequences can occur with some probability. This often involves adjusting zero probabilities upward and high probabilities downward. In this way, smoothing not only helps prevent zero probabilities, but also significantly improves the overall accuracy of the model (Chen and Goodman, 1998).

In Dzongkha word segmentation, the Katz back-off model based on the Good-Turing smoothing principle is applied to handle the issue of data sparseness. The basic idea of the Katz back-off model is to use the frequency of n-grams and, if no n-grams are available, to back off to (n-1)-grams, then to (n-2)-grams, and so on (Chen and Goodman, 1998).

The summarized procedure of the Katz smoothing technique is given by the following:6

P_katz(w_i | w_{i-1}) =
    C(w_{i-1} w_i) / C(w_{i-1})           if r > k
    d_r · C(w_{i-1} w_i) / C(w_{i-1})     if k ≥ r > 0
    α(w_{i-1}) · P(w_i)                   if r = 0

where r is the count of the bigram w_{i-1} w_i, and k is taken as some value in the range of 5 to 10; counts above k are not re-estimated. The discount ratio is

d_r = (r*/r − (k+1) n_{k+1} / n_1) / (1 − (k+1) n_{k+1} / n_1)

where r* = (r+1) n_{r+1} / n_r is the Good-Turing re-estimated count and n_r is the number of bigrams that occur exactly r times, and

α(w_{i-1}) = (1 − Σ_{w_i : r > 0} P_katz(w_i | w_{i-1})) / (1 − Σ_{w_i : r > 0} P_katz(w_i))

With the above equations, bigrams with non-zero count r are discounted according to the discount ratio d_r ≈ r*/r; i.e., the counts subtracted from the non-zero counts are redistributed among the zero-count bigrams according to the next lower-order distribution, the unigram model.

6 X. Huang, A. Acero, H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development (Prentice-Hall Inc., New Jersey 07458, 2001), pp. 559-561.
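Concretely, the back-off computation can be sketched as follows (a simplified illustration under our own naming; bigram_count, unigram_count, d, p_uni and alpha are assumed to be precomputed, and this is not the system's code):

```python
# Sketch of the Katz back-off estimate above (illustrative only).
K = 5  # counts above K are kept as-is (the paper uses k in 5..10)

def p_katz(w_prev, w, bigram_count, unigram_count, d, p_uni, alpha):
    r = bigram_count.get((w_prev, w), 0)
    if r > K:                          # r > k: reliable, plain MLE ratio
        return r / unigram_count[w_prev]
    if r > 0:                          # k >= r > 0: discount by d_r
        return d[r] * r / unigram_count[w_prev]
    return alpha(w_prev) * p_uni[w]    # r = 0: back off to the unigram
```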
4 Evaluations and Results
Subjective evaluation has been performed by comparing the experimental results with manually segmented tokens. The method was evaluated using different sets of test documents from various domains, consisting of 714 manually segmented words. Table 3 summarizes the evaluation results.

Document text      Correctly segmented tokens /   Accuracy
                   total no. of words
Astrology.txt      102/116                        87.9%
dzo_linux.txt      85/93                          91.4%
movie_awards.txt   76/84                          90.5%
News.txt           78/85                          91.8%
Notice.txt         83/92                          90.2%
Religious.txt      63/73                          89.0%
Song.txt           57/60                          95.0%
Tradition.txt      109/111                        98.2%
Total              653/714                        91.5%

Table 3: Evaluation Results

Accuracy in percentage is measured as:

Accuracy(%) = (N / T) × 100

where N is the number of correctly segmented tokens and T is the total number of manually segmented tokens (the total number of words).

We have taken extracts of different test data in the hope that they contain a fair amount of general terms, technical terms and common nouns. The manually segmented corpus containing 41,739 tokens is used for the method.

In the sample comparison below, the symbol ( ་ ) does not mark a segmentation unit, whereas ( ། ) marks a segmentation unit, despite actually being a mark for a comma or full stop. The whitespace in the sentence marks a phrase boundary or comma, and is a faithful representation of speech, where we pause not between words but either after certain phrases or at the end of a sentence.

Consider the sample input sentence:

རོང་ཁ་ལི་ནགསི་འདི་ རོང་ཁ་གོག་རིག་ནང་བཙགས་ནིའི་དོན་ལ་ རབ་སོར་ཧིལ་བ་གཅིག་ཁར་བསོམ་མི་
རང་དབང་ལི་ནགསི་ བཀོལ་སོད་རིམ་ལགས་འདི་གི་ ཉེ་གནས་སེ་མཐན་འགར་བཟོ་ཡོད་པའི་ཐོན་རིམ་ཅིག་ཨིན།
དེ་གིས་ ཆ་ཚང་སེ་སད་བསར་འབད་ཡོད་པའི་ལག་ལེན་པའི་ངོས་འད་བ་ཚ་སོནམ་ཨིན།

Manually segmented version of the sample input sentence:

རོང་ཁ།ལི་ནགསི།འདི། རོང་ཁ།གོག་རིག།ནང།བཙགས།ནིའི།དོན་ལ། རབ་སོར།ཧིལ་བ།གཅིག།ཁར།བསོམ།མི།
རང་དབང།ལི་ནགསི། བཀོལ་སོད།རིམ་ལགས།འདི།གི། ཉེ་གནས།སེ།མཐན་འགར།བཟོ།ཡོད།པའི།ཐོན།རིམ།ཅིག།ཨིན།
དེ་གིས། ཆ་ཚང།སེ།སད་བསར།འབད།ཡོད།པའི།ལག་ལེན།པའི།ངོས།འད་བ།ཚ།སོནམ།ཨིན།

Using the maximal matching algorithm:

རོང་ཁ། ལི། ནགསི། འདི། རོང་ཁ། གོག་རིག། ནང། བཙགས། ནིའི། དོན། ལ། རབ་སོར། ཧིལ། བ། གཅིག། ཁར། བསོམ།
མི། རང་དབང། ལི། ནགསི། བཀོལ་སོད། རིམ་ལགས། འདི་གི། ཉེ་གནས། སེ། མཐན་འགར། བཟོ། ཡོད། པའི། ཐོན། རིམ།
ཅིག་ཨིན། དེ། གིས། ཆ་ཚང། སེ། སད་བསར། འབད། ཡོད། པའི། ལག་ལེན། པའི། ངོས། འད། བ། ཚ། སོནམ། ཨིན།

System segmented version of the sample input sentence (underlined text shows the incorrect segmentation):

རོང་ཁ།ལི་ནགསི་འདི། རོང་ཁ།གོག་རིག།ནང།བཙགས།ནིའི་དོན་ལ། རབ་སོར།ཧིལ་བ།གཅིག།ཁར།བསོམ།མི།
རང་དབང།ལི་ནགསི།བཀོལ་སོད།རིམ་ལགས།འདི།གི། ཉེ་གནས།སེ།མཐན་འགར།བཟོ།ཡོད།པའི།ཐོན།རིམ།ཅིག།ཨིན།
དེ།གིས།ཆ་ཚང།སེ།སད་བསར།འབད།ཡོད།པའི།ལག་ལེན།པའི།ངོས།འད་བ།ཚ།སོནམ། ཨིན།
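The comparison against the manually segmented reference can be scripted along these lines (a sketch only; the paper does not specify its exact token-matching procedure, so we make the simplifying assumption that a system token counts as correct when it matches the reference token at the same running position):

```python
# Sketch of Accuracy(%) = N / T * 100 over Shad-delimited tokens.
SHAD = "\u0F0D"  # ། separates segmentation units in the examples above

def tokens(segmented_text):
    return [t.strip() for t in segmented_text.split(SHAD) if t.strip()]

def accuracy(system_text, reference_text):
    sys_toks, ref_toks = tokens(system_text), tokens(reference_text)
    # N: positionally matching tokens (a simplifying assumption;
    # a real evaluation would align the two sequences first).
    n = sum(1 for s, r in zip(sys_toks, ref_toks) if s == r)
    return 100.0 * n / len(ref_toks)  # T: total manually segmented tokens
```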
5 Discussions

During the process of word segmentation, it became clear that the maximal matching algorithm is simple and effective, but can produce accurate segmentation only if all the words are present in the lexicon. Since not every word can be found in the lexicon database in a real application, the performance of word segmentation degrades when it encounters words that are not in the lexicon (Chiang et al., 1992).

The following are the significant problems with the dictionary-based maximal matching method arising from the coverage of the lexicon (Emerson, 2000):

- incompleteness and inconsistency of the lexicon database
- absence of technical domains in the lexicon
- transliterated foreign names
- some of the common nouns not included in the lexicon
- lexicon/word lists do not contain genitive endings such as པའི (which expresses the genitive relationship as a quality or characteristic of the second element, for example, དབལ་པའི་བ 'son of a pauper') and the possessive འི (for example, first person singular ངེའི་བམོ, which actually is ངི་གི་བམོ 'my daughter'), which indicates possession or a part-to-whole relationship, like English 'of'.

A Dzongkha sentence like:

འདི་རོང་ཁ་གི་ ཞིབ་འཚོལ་ཡིག་ཆ་ ཨིན།

may include the following ambiguous possible segmentations based on simple dictionary lookup:

1. འདི།རོང་ཁ།གི།ཞིབ་འཚོལ།ཡིག་ཆ།ཨིན
this | Dzongkha | of | research | written document | is

2. འདི།རོང་ཁ།གི།ཞིབ།འཚོལ།ཡིག་ཆ།ཨིན
this | Dzongkha | of | arrange together | search/expose | written document | is

3. འདི།རོང།ཁ།གི།ཞིབ་འཚོལ།ཡིག་ཆ།ཨིན
this | fortress | mouth/surface | of | research | written document | is

These problems of ambiguous word divisions and unknown proper names are lessened and partially solved when the output is re-ranked using the bigram technique. Still, solutions to the following issues need to be discussed in the future. Although the texts were collected from the widest range of domains possible, the lack of available electronic resources of informative text adds to the following issues:

- the small size of the corpus was not very impressive for the method
- ambiguity and inconsistency in the manual segmentation of tokens in the corpus, resulting in incompatibility and sometimes in conflict.

Ambiguity and inconsistency occur because of the difficulty of identifying a word. Since the manual segmentation of corpus entries was carried out by humans rather than by computer, those humans have to be well skilled in identifying and understanding what a word is.

The problems with Dzongkha script that also hamper the accuracy of Dzongkha word segmentation include issues such as the ambiguous use of Tsheg ( ་ ) in different documents. There are two different types of Tsheg: Unicode 0F0B ( ་ ), called Tibetan Mark Intersyllabic Tsheg, is a normal tsheg that provides a break opportunity; Unicode 0F0C ( ༌ ), called Tibetan Mark Delimiter Tsheg Bstar, is a non-breaking tsheg that inhibits line breaking.

For example, the input sentence with Tsheg 0F0B:

སངས་རས་དང་ཚེ་རིང་གཉིས་ བར་དོན་དང་འཕལ་རིག་ནང་ ལ་འབད་དོ་ཡོདཔ་ཨིན་པས།

achieves 100% segmentation as follows:
སངས་རས། དང། ཚེ་རིང། གཉིས། བར་དོན། དང། འཕལ། རིག། ནང། ལ། འབད། དོ། ཡོདཔ། ཨིན། པས།

whereas the same input sentence with Tsheg 0F0C is incorrectly segmented as follows:

སངས༌རས༌དང༌ཚེ༌རིང༌གཉིས༌། བར༌དོན༌དང༌འཕལ༌རིག༌ནང༌། ལ༌འབད༌དོ༌ཡོདཔ༌ཨིན༌པས།

There are also cases of shortening of words, removal of inflectional endings, and abbreviation of words for the convenience of the writer. This is not reflected in the dictionaries, thus affecting the accuracy of the segmentation. The following words have a special abbreviated way of writing a letter or sequence of letters at the end of a syllable:

རོ་རེ as རོེ
ཡེ་ཤེས as ཡེས
etc.
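One plausible mitigation for the 0F0B/0F0C ambiguity, not implemented in the system described here but a natural preprocessing step, is to normalize the non-breaking tsheg to the ordinary one before segmentation:

```python
# Hypothetical preprocessing step (not part of the described system):
# map the non-breaking Tsheg Bstar (U+0F0C) to the ordinary
# intersyllabic tsheg (U+0F0B) so both delimit syllables identically.
def normalize_tsheg(text: str) -> str:
    return text.replace("\u0F0C", "\u0F0B")

assert normalize_tsheg("སངས༌རས") == "སངས་རས"
```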
6 Conclusion and Future works

This paper describes an initial effort in segmenting Dzongkha script. In this preliminary analysis of Dzongkha word segmentation, preprocessing and normalization are not dealt with; numbers, special symbols and characters are also not included. These issues will have to be studied in the future. Much discussion and work also remain to improve the performance of word segmentation. Although the study was a success, there are still some obvious limitations, such as its dependency on dictionaries/lexicons, and the current Dzongkha lexicon is not comprehensive. There is also no large corpus collection covering various domains. Future work may include overall improvement of the method for better efficiency, effectiveness and functionality by exploring different algorithms. Furthermore, the inclusion of POS tag sets applied to n-gram techniques, which has proven helpful in handling the unknown word problem, might enhance the performance and accuracy. Increasing the corpus size might also help to improve the results.

Acknowledgment

This research work was carried out as a part of the PAN Localization Project (http://www.PANL10n.net) with the aid of a grant from the International Development Research Centre (IDRC), Ottawa, Canada, administered through the Center for Research in Urdu Language Processing (CRULP), National University of Computer and Emerging Sciences (NUCES), Pakistan. The research team would also like to express its gratitude to all the PAN Localization Project members of the Bhutanese team based at the Department of Information Technology and Telecom, for their efforts in collecting, preparing and providing the lexicon, corpus, and useful training and testing materials, and for their valuable support and contribution that made this research successful.

References
Chen, Stanley F. and Joshua Goodman. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Computer Science Group, Harvard University, Cambridge, Massachusetts.

Chiang, T-Hui, J-Shin Chang, M-Yu Lin and K-Yih Su. 1992. Statistical models for word segmentation and unknown word resolution. Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan.

Chungku, Jurmey Rabgay and Gertrud Faaβ. 2010. NLP Resources for Dzongkha. Department of Information Technology & Telecom, Ministry of Information & Communications, Thimphu, Bhutan.

Durrani, Nadir and Sarmad Hussain. 2010. Urdu Word Segmentation. Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, June 2010.

Emerson, Thomas. 2000. Segmenting Chinese in Unicode. 16th International Unicode Conference, Amsterdam, The Netherlands, March 2000.

Haizhou, Li and Yuan Baosheng. 1998. Chinese Word Segmentation. Language, Information and Computation (PACLIC 12), 1998.

Haruechaiyasak, C., S. Kongyoung and M.N. Dailey. 2008. A Comparative Study on Thai Word Segmentation Approaches. In Proceedings of ECTI-CON, 2008.

Huang, X., A. Acero and H.-W. Hon. 2001. Spoken Language Processing: A Guide to Theory, Algorithm and System Development (pp. 539-578). Prentice-Hall Inc., New Jersey 07458.

Huor, C.S., T. Rithy, R.P. Hemy, V. Navy, C. Chanthirith and C. Tola. 2007. Word Bigram Vs Orthographic Syllable Bigram in Khmer Word Segmentation. PAN Localization Working Papers 2004-2007. PAN Localization Project, National University of Computer and Emerging Sciences, Lahore, Pakistan.

Jurafsky, D. and J.H. Martin. 1999. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (pp. 189-230). Prentice-Hall Inc., New Jersey 07458.

Nugues, P.M. 2006. An Introduction to Language Processing with Perl and Prolog: An Outline of Theories, Implementation, and Application with Special Consideration of English, French, and German (Cognitive Technologies) (pp. 87-104). Springer-Verlag, Berlin Heidelberg.

Pong, L.W. and Robert. 1994. Chinese word segmentation based on maximal matching and bigram techniques. The Association for Computational Linguistics and Chinese Language Processing. On-line: http://www.aclclp.org.tw/rocling/1994/P04.pdf

Sunthi, Thepchai. 2007. Word Segmentation and POS tagging. ADD-2 Workshop, SIIT, NECTEC, Thailand.

Van Driem, George and Karma Tshering (Collab.). 1998. Languages of the Greater Himalayan Region.