IITC 2008p4 PDF
The lexicon comprises several files programmed in lexc (lexicon compiler). There is one lexicon file for each class of verb root, and these files contain the verb roots and morphemes recorded during implementation and training. As mentioned in section 2.3, four basic verb root classes were identified. However, some of these verb groups could be further divided into sub-groups depending on their orthography. For example, the roots adi අදි and ari අරි both belong to the same ‘i group’, yet they behave differently in the past tense: adi අදි becomes ædda ඇද්ද while ari අරි becomes æriya ඇරිය. This phenomenon is not isolated; since other roots display similar behaviour [1], they were put into separate sub-groups, and a separate lexicon was created for each sub-group. Altogether there are 11 such lexicon files. The format of the lexicon files programmed in lexc is given in Figure 3.

The steps involved in the operation of the parser can be summarized as follows:
1. The input text file is fed to the transliteration module. This file contains the Sinhala words that need to be analyzed or generated, and the text should be in Sinhala Unicode.
2. The transliterator converts the contents of the input text file into Romanized Sinhala text.
3. xfst (Xerox Finite-State Tool) is invoked and the compiled finite state network is loaded onto the stack. The transliterated input text file is given as the input to xfst’s ‘apply up’ or ‘apply down’ command, depending on whether the parser is in analysis or generation mode.
4. xfst analyzes or generates the input strings and the output is written to a text file in ANSI encoding.
5. The output file is processed by the result formatter and a formatted text file in Unicode encoding is produced.
6. The formatted output file is input into the transliterator and the transliterated output file is produced.
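The six steps above can be sketched as a chain of functions. This is only an illustration of the data flow, not the project's code: `run_parser` and the toy lookup tables are hypothetical, the call to xfst is replaced by a dictionary lookup, and the Romanized form `kaLa` assumes a simple character-by-character use of the transliteration scheme.

```python
from typing import Callable, List

Step = Callable[[str], str]


def run_parser(words: List[str], mode: str, to_roman: Step,
               apply_network: Callable[[str, str], str],
               format_result: Step, to_sinhala: Step) -> List[str]:
    """Run words through steps 1-6; mode is 'up' (analysis) or 'down' (generation)."""
    if mode not in ("up", "down"):
        raise ValueError("mode must be 'up' or 'down'")
    romanized = [to_roman(w) for w in words]            # steps 1-2: Unicode -> Roman
    raw = [apply_network(w, mode) for w in romanized]   # steps 3-4: apply up / apply down
    formatted = [format_result(r) for r in raw]         # step 5: result formatter
    return [to_sinhala(f) for f in formatted]           # step 6: Roman -> Unicode


# Toy stand-ins (hypothetical): whole-word lookups instead of the real components.
TOY_ROMAN = {"කළ": "kaLa"}
TOY_NETWORK = {("kaLa", "up"): " kara+Kru+Past "}
TOY_SINHALA = {"kara+Kru+Past": "කර+Kru+Past"}

result = run_parser(
    ["කළ"], "up",
    to_roman=lambda w: TOY_ROMAN[w],
    apply_network=lambda w, mode: TOY_NETWORK[(w, mode)],
    format_result=str.strip,
    to_sinhala=lambda s: TOY_SINHALA[s],
)
print(result)
```

Passing each stage in as a function keeps the sketch independent of how xfst is actually invoked, which the steps above leave to the implementation.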
4.2 Transliterator
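A transliterator of this kind can be sketched as a lookup table applied in both directions. The following is a minimal sketch under stated assumptions: the rows are copied from the project's transliteration table, but the function names (`invert`, `encode`, `decode`), the character-by-character substitution, and the longest-match decoding are illustrative choices, not the project's confirmed implementation.

```python
from typing import Dict

# A few rows copied from the transliteration table; the full scheme has many more.
SCHEME: Dict[str, str] = {
    "අ": "a", "ක": "ka", "ග": "ga", "ම": "ma", "ර": "ra", "ළ": "La",
}


def invert(scheme: Dict[str, str]) -> Dict[str, str]:
    """Build the Roman -> Sinhala table, refusing ambiguous Roman symbols."""
    inverse: Dict[str, str] = {}
    for sinhala, roman in scheme.items():
        if roman in inverse:
            raise ValueError(f"Roman symbol {roman!r} is used twice")
        inverse[roman] = sinhala
    return inverse


def encode(text: str, scheme: Dict[str, str] = SCHEME) -> str:
    """Replace each Sinhala character by its Roman symbol; others pass through."""
    return "".join(scheme.get(ch, ch) for ch in text)


def decode(text: str, inverse: Dict[str, str]) -> str:
    """Map Roman symbols back to Sinhala, trying the longest symbol first."""
    longest = max(len(symbol) for symbol in inverse)
    out = []
    i = 0
    while i < len(text):
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            if chunk in inverse:
                out.append(inverse[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])  # unknown character: keep as-is
            i += 1
    return "".join(out)


INVERSE = invert(SCHEME)
print(encode("කර"), decode("kaLa", INVERSE))
```

Checking that no Roman symbol is used twice matters because the parser output has to be converted back to Sinhala Unicode after xfst has run.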
Unicode support: The Xerox compiler version 8.1.3 does not have Unicode support. Since it was not possible to acquire a version that does, a way to handle the Sinhala language without using Unicode had to be found. A transliteration scheme was designed and implemented as a solution.

4.7 Parser Output
The parser is capable of analyzing and generating strings at the word level in both Unicode Sinhala format and Romanized Sinhala, using the transliteration format used in the project. Samples of some verbs which have been correctly analyzed are given below:
• කළ (kələ) => කර+Kru+Past - Krudantha, Past tense of root කර (kərə).
• කෙළේ (kəle:) => කර+VNF+Derived+Dec - Non Finite, Derived, Declarative form of root කර (kərə).
• ගියේය (giye:yə) => ය+VFM+Derived+Past+3P+Sg+Mus - Finite, Derived, Past tense, 3rd Person, Singular, Masculine verb of root ය (yə).
Likewise, some examples of verb forms which have been accurately generated are given below:
• ලබ+VFM+Pure+Past+3P+Sg => ලැබීය (læbi:yə) - Finite, Pure, Past tense, 3rd Person, Singular form of root ලබ (labə).
• හිත+VNF+Derived+Nom => හිතනවා (hithənəva:) - Non Finite, Derived, Nominative verb of root හිත (hithə).
• කිය+VFM+Derived+Past+3P+Sg+Fem => කීවාය (ki:va:yə) - Finite, Derived, Past tense, 3rd Person, Singular, Feminine verb of root කිය (kiyə).

5. Experiments & Results

5.1 Training

Methodology. The initial verb roots and rules were acquired from the textbook Kriya Pathaya [1]. These were used during the development of the parser. After completing the implementation, the corpus was used to acquire words for training purposes. First, a list of verbs was manually filtered out of the corpus. Since the corpus itself contained distinct words, the verb list also comprised distinct verbs. Next, 700 verbs out of a list of 1631 were chosen from the verb list for training. The training set was acquired by grouping the corpus into 200-word sets and extracting every other set. Thus, the training set was formed from the verbs 1-200, 401-600, 701-900, and 901-1000. This was done because an allowance for different data domains was needed. It was assumed that the corpus contained data from different domains and that words located close together in the corpus displayed similar morphotactic characteristics.
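The selection of the training set can be illustrated with a short sketch. It takes the index ranges exactly as they are listed in the Methodology paragraph and confirms that they add up to the 700 training verbs; the function name `select_training` and the placeholder verb names are hypothetical, since the actual verb list is not reproduced here.

```python
from typing import List, Sequence, Tuple

# Inclusive, 1-based ranges exactly as listed in the text.
TRAINING_RANGES: List[Tuple[int, int]] = [
    (1, 200), (401, 600), (701, 900), (901, 1000),
]


def select_training(verbs: Sequence[str],
                    ranges: Sequence[Tuple[int, int]] = TRAINING_RANGES) -> List[str]:
    """Collect the verbs whose 1-based position falls inside any listed range."""
    selected: List[str] = []
    for first, last in ranges:
        selected.extend(verbs[first - 1:last])  # convert to a 0-based slice
    return selected


# Placeholder names stand in for the 1631 manually filtered verbs.
verb_list = [f"verb_{i}" for i in range(1, 1632)]
training_set = select_training(verb_list)
print(len(training_set))
```

The sketch uses the listed ranges rather than deriving them from an alternation rule, because a strict every-other-block rule over 200-word sets would produce different ranges from those given in the text.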
The training results with respect to the number of errors are shown in Figure 6. Although this helps to give a general picture, it must be noted that the training set itself was not completely accurate, i.e. the verb list contained several manual processing errors, such as the inclusion of non-verbs and of compound words (e.g. සිදුකරන sidukərənə).

4.8 Problems & Challenges

Insufficient domain knowledge: The primary difficulty faced was the lack of linguistic expertise. Since the author’s knowledge of Sinhala verb structure was limited, the first task was to learn the language features. As the study and understanding of Sinhala’s linguistic structures was very important in designing a linguistic model, it was crucial that a thorough study was done before going into implementation.
5.3 Testing
Transliteration scheme (each Sinhala character is followed by its Roman symbol):

Sinhala  Roman    Sinhala  Roman    Sinhala  Roman
◌ං       H        ◌ඃ       M        ◌ෙ       e
අ        a        ආ        A        ◌ෛ       å
ඇ        æ        ඈ        Æ        ◌ෝ       O
ඉ        I        ඊ        I        ◌ෟ       -
උ        u        ඌ        U        ◌ෳ       -
ඍ        R        ඎ        RR       ◌ේ       E
එ        e        ඒ        E        ◌ො       o
ඓ        ã        ඔ        o        ◌ෞ       à
ඕ        O        ඖ        au       ◌ෲ       õ
ක        ka       ඛ        Ka       ග        ga
ඝ        Ga       ඞ        ña       ඟ        Ya
ච        ca       ඡ        Ca       ජ        ja
ඣ        Ja       ඤ        qa       ඥ        Qa
ඦ        ôa       ට        ta       ඨ        Ta
ඩ        da       ඪ        Da       ණ        Na
ඬ        Fa       ත        wa       ථ        Wa
ද        xa       ධ        Xa       න        na
ඳ        Va       ප        pa       ඵ        Pa
බ        ba       භ        Ba       ම        ma
ඹ        Sa       ය        ya       ර        ra
ල        la       ව        va       ශ        za
ෂ        Za       ස        sa       හ        ha
ළ        La       ෆ        fa       ◌්       -
◌ා       A        ◌ැ       æ        ◌ෑ       Æ
◌ි       I        ◌ී       I        ◌ු       u
◌ූ       U        ◌ෘ       ß

References

[1]. J. B. Disanayake. Basaka Mahima: 11 - Kriya Pathaya. s.l. : S. Godage & Brothers, 2001.
[2]. Koskenniemi, Kimmo. A general computational model for word-form recognition and production. Stanford, California : Association for Computational Linguistics, 1984.
[3]. Lauri Karttunen, Kenneth R. Beesley. Finite State Morphology. s.l. : CSLI Publications, 2003.
[4]. Herath D.L, Weerasinghe A.R. A stemming algorithm to analyze inflectional morphology of Sinhala nouns. Unpublished.
[5]. Koskenniemi, K. Two-level morphology: A general computational model for word-form recognition and production. s.l. : Association for Computational Linguistics, 1983. Vol. Publication 11.
[6]. Lauri Karttunen, Kenneth R. Beesley. A short history of two-level morphology. s.l. : COLING 1992: The 15th International Conference on Computational Linguistics, 2001.
[7]. Xerox research centre. [Online] http://www.xrce.xerox.com./.
[8]. A. Bharathi et al. Natural Language Processing : A Paninian Perspective. New Delhi : Prentice-Hall of India, 1996.
[9]. Pawan Goyal, Vipul Arora, Laxmidhar Behera. Analysis of Sanskrit text : Parsing and semantic. Rocquencourt, France : s.n., 2007.
[10]. Anupam. Sanskrit as Indian networking language : A Sanskrit parser. 2004.
[11]. Girish Nath Jha, Muktanand Agrawal, Subash, Sudhir K. Mishra, Diwakar Mani, Diwakar. Inflectional Morphology Analyzer for Sanskrit. Rocquencourt, France : s.n., 2007.

Acknowledgment

I am deeply indebted to Dr. A R Weerasinghe for supervising this project and offering me valuable insight from time to time.
Many thanks to Dr. Lalith Premaratne for his guidance throughout the course of this work as the examiner of my project.
I am sincerely grateful to Dr. Chamath Keppetiyagame for co-ordinating the research projects and advising on the Sinhala LaTeX system.
I am also thankful to Mr.Dulip Herath & all the