IITC 2008p4 PDF


A Morphological Parser for Sinhala Verbs

T.N.E. Fernando1, A.R. Weerasinghe2


University of Colombo School of Computing
35, Reid Avenue, Colombo 07, Sri Lanka.
Email: 1niro_ucsc@yahoo.com, 2arw@ucsc.cmb.ac.lk
Address: 161/14A, Lumbini Mavatha, Dalugama, Kelaniya, Sri Lanka

Abstract

This paper presents a morphological parser capable of analyzing and generating Sinhala verbs. Morphological analysis and generation play a vital role in many applications related to natural language processing, such as spell checkers, grammar checkers, intelligent information retrieval, machine translation and other complex applications. The parser consists of a lexicon of more than 400 verb stems and handles 45 inflectional rules for each stem. Analysis produces the verb stem together with its feature tags depicting verb class, person, number, tense, gender, mood, voice, etc. The parser is modelled in the framework of the two-level morphology model using Xerox finite state morphology tools. To our knowledge, this is the first such parser for Sinhala verbs.

Keywords: Morphology, Natural Language Processing, Parsing, Sinhala

1. Introduction

Morphology is an important area in computational linguistics that studies word structure, how words are formed and how they are related to other words. Morphological parsing is a computation which takes as input a derived form of a word and outputs the dictionary form of the word, and vice versa. It can yield information that is useful in many NLP applications such as spell checkers, grammar checkers, machine translation and intelligent information retrieval.
This paper presents a morphological parser that takes Sinhala verbs as input and outputs, for each verb, its stem plus detailed feature structure information. In generation mode, this functionality is reversed.

1.1 The Sinhala Language

Sinhala, a language belonging to the Indo-Aryan branch of the Indo-European languages, has a rich system of morphology. It is written in a syllabic script with its own writing system, which is an offspring of the Brahmi script.

1.2 The Sinhala Verb Morphology

Morphological analysis, particularly in a morphologically rich language such as Sinhala, can reveal much information. For example, the verb නටමි natəmi (I dance) implies that the subject of the sentence from which this verb was extracted must be first person, singular. The Sinhala verb chiefly comprises a verb root and a set of auxiliaries which enhance the meaning given in the verb root. For a word to be identified as a verb, it must contain a verb root and its last morpheme (suffix) must denote action, i.e., it must be a ‘verb suffix’. The suffix must be taken into account because even nouns can be created by using ‘verb’ roots [1]. For example, consider the following list of nouns which all share the verb root බල balə (to look): බලන්නා balanna: (the person who looks), බැලීම bæli:mə (the act of looking), බලන්නෙක් balannek (a person who looks).
Owing to the complexity of Sinhala morphology, it is impossible to have an exhaustive lexical listing. For instance, one stem could generate more than 45 inflected verb forms. That is why there is an urgent need for a parser that will use the morphological system to compute the part of speech and inflectional categories of Sinhala words.

1.3 Approach

The parser was modelled using Koskenniemi’s Two Level Morphology (TWOL) approach [2] and implemented using the Xerox finite state tool [3] (xfst version 8.1.3 non-commercial). The lexicon was programmed in the lexc language and contains over 400 verb stems that were collected by compiling a section of the corpus. Orthographic and phonological rules are modelled using the regex language and represent 45 basic inflectional forms.
1.4 Scope

One reason for limiting this study to verbs only is that considering the whole Sinhala language (nouns, adjectives, etc) would require a very large amount of effort, resources and time, and is thus outside the scope of an undergraduate study. Also, a morphological parser for Sinhala nouns has already been developed by the LTRC, University of Colombo School of Computing [4]. Furthermore, the scope is based on the language rules presented by Prof. J.B.Disanayake in his book Kriya Pathaya, as it is one of the most comprehensive studies of Sinhala verbs in published work [1].

2. Morphology of Sinhala verbs

2.1 Types of Sinhala Verbs

Contemporary Sinhala linguistics categorizes verbs into several categories. According to Prof. J B Disanayake [1], verbs can be classified according to six factors. Of these, the following two were chosen as the two main dividing factors for the verbs in the project:
1) Pure (ශුද්ධ) vs. Derived (සාධිත)
2) Finite (අවසාන) vs. Non-Finite (අනවසාන)
Several of the other categories were represented as morphological features. Therefore four types of verbs were considered for the parser: finite-pure, finite-derived, non-finite-pure and non-finite-derived. They were categorized thus because the morphotactics and the orthographies that affect the Sinhala verb are somewhat similar within each of these four classes.

Pure verbs vs. Derived verbs. Pure verbs are formed when a verb root (kriya: prəkurthi) such as balə is combined with a verb suffix (kriya: prathyə). Table 1 lists several examples for these types of verbs.

Verb Root    Verb Suffix                   Pure Verb
balə         mi (1st Person Singular)      baləmi
balə         mu (1st Person Plural)        baləmu
balə         i:mi (1st Person Singular)    bæli:mi
Table 1 : Pure verbs

A specialty of Sinhala is that verbs are formed not just from roots, but also from a certain kind of verbs themselves. These verbs are called Krudanthə verbs. A Krudanthə verb is formed when a verb root is combined with a Krudanthə suffix. For example, nə, və, unə, unu are Krudanthə suffixes. Derived (sa:dhithə) verbs are formed by combining such a Krudanthə verb with a verb suffix. Table 2 lists several examples for these types of verbs [1].

Krudantha Verb    Verb Suffix                   Derived Verb form
baləna            emi (1st Person Singular)     balənnemi
bælu:             emi (1st Person Singular)     bæluvemi
baləna            e:                            balənne:
Table 2 : Derived verbs

From the examples in Table 1 and Table 2 it is clear that the suffixes used in the two verb forms are also different.

Finite verbs vs. Non-finite verbs. In Sinhala, the finite forms of a verb are the forms where the verb shows tense, person, gender or number. The finite form must agree with the sentence’s subject, and these verbs can form independent clauses, which can stand on their own as complete sentences. In contrast, non-finite verb forms have no person, tense, gender or number. Therefore there is no relation between a non-finite verb and the sentence’s subject. For example, the sentence ‘මම බත් කමි - I eat rice’ contains the finite verb කමි (eat) and the sentence ‘මා බත් කද්දී අම්මා මට කතා කළාය - Mom called me when I was eating rice’ contains the non-finite verb කද්දී (eating).

2.2 Phonological & Orthographic Rules

In many languages, how roots and suffixes are fused together to form valid words is governed by a set of elaborate phonological rules [5]. Sinhala verb roots can be categorized into four classes depending on the way the verb roots are inflected [1]. Table 3 shows four examples, one belonging to each of these four classes. The suffix used in each form is given in parentheses.

Root Category    Singular-Present    Plural-Present    Singular-Past    Plural-Past
balə             balayi (yi)         baləthi (athi)    bæli: (i:)       bæli: (i:)
adi              adiyi (yi)          adithi (athi)     ædi: (i:)        ædi: (i:)
æle              æleyi (yi)          ælethi (athi)     æli: (i:)        æli: (i:)
ka               ka: (a:)            kathi (athi)      kæ: (æ:)         kæ: (æ:)
Table 3: Verb forms of Sinhala

2.3 Verb Roots

Categorizing verb roots. Sinhala verb roots can be categorized into four basic groups according to their last vowel: ‘a group’, ‘e group’, ‘i group’ and the ‘irregular group’. The ‘a group’ consists of verb roots that end with the vowel ‘a’, the ‘e group’ of roots ending with ‘e’ and the ‘i group’ of roots ending with ‘i’. The ‘irregular group’ is the collection of renegade roots that do not conform to any of the previous three categories [1]. Although the suffix list is somewhat similar across the groups of verb roots, there are differences in the orthographical rules employed to fuse the roots and the suffixes together.
Therefore the organization of the lexicon in the parser follows this classification of verb roots.
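
The grouping by final vowel described above can be sketched in a few lines of Python. This is only an illustration, not the parser's implementation (which encodes the groups as separate lexc lexicon files). Two assumptions are made here: the romanized final ‘ə’ is treated as belonging to the ‘a group’, and the set of irregular roots is supplied by hand, with ‘ka’ (the fourth example in Table 3) used as a placeholder member.

```python
def classify_root(root, irregular=frozenset({"ka"})):
    """Return the lexicon group of a romanized verb root.

    `irregular` is a hand-maintained set of roots (an assumption of this
    sketch); such roots cannot be detected from their spelling alone.
    """
    if root in irregular:
        return "irregular group"
    last = root[-1]
    if last in ("a", "ə"):  # assumption: romanized 'ə' counts as the vowel 'a'
        return "a group"
    if last == "e":
        return "e group"
    if last == "i":
        return "i group"
    # Anything else does not fit the three regular groups.
    return "irregular group"


for r in ("balə", "æle", "adi", "ka"):
    print(r, "->", classify_root(r))
```

Keeping the irregular roots in an explicit set mirrors the point made in the text: membership of the irregular group is a lexical fact rather than something the final vowel predicts.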

3. Two Level Morphology for Sinhala

It has been shown in [6] that concatenation, composition and iteration are sufficient means for describing the morphology of languages with concatenative morphological processes. Therefore it was decided to employ the TWOL approach to model the Sinhala verb morphology.

3.1 Methodology

The methodology of encoding the Sinhala verb morphology using the two level morphology model is summarized as follows:
The lexicon stores the collection of known Sinhala verb roots such as balə, natə, etc and suffixes such as mi, mu, emi, etc. It also dictates the rules that specify the legal combinations of morphemes (morphotactics) and is encoded as a finite-state network.
The rules that determine the form of each morpheme (orthographical alterations), i.e., spelling rules used to model the changes that occur in the word, usually when two morphemes combine (for example the Sandhi rules concerning Sinhala verbs), are implemented as finite-state transducers.
The lexical network and the rule transducers are then composed together into a single network, a ‘lexical transducer’ that incorporates all the morphological information about the language including the lexicon of morphemes, derivation, inflection, etc.
Figure 1 shows the basic view of the morphological parser.

Figure 1 : Basic view of the parser

3.2 Xerox Finite State Transducer

Xerox has created an integrated set of software tools to facilitate linguistic development. Their tools (xfst, twolc, lexc) are built on top of a software library that provides algorithms for creating automata from regular expressions and equivalent formalisms. The library contains both classical operations such as union and composition and newer algorithms such as replacement and local sequentialisation. Over the years, the products of their research have come to be used all over the world in many linguistic applications such as morphological analysis, tokenization, and shallow parsing of a wide variety of natural languages. The xfst tool has been licensed to over 70 universities world-wide. Many components have been incorporated into commercial software [7]. For this project, the non-commercial version was used.

Corpus. A corpus is important for the training and testing of the parser because a corpus helps identify the “in” forms of the language, i.e. the contemporary language. For example, even though Sinhala text books list certain words such as kərəhi, baləhu (the 2nd person), etc, these words are no longer in everyday usage.
The project used the 7 million word Sinhala corpus, with 312,000 distinct words, put together by the University of Colombo School of Computing language research laboratory. Since the corpus contains a mixture of all kinds of words (nouns, verbs, adjectives, etc), only the distinct verb forms were manually filtered out.
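
To make the idea of a composed ‘lexical transducer’ from section 3.1 concrete, the following Python sketch models it as a finite relation between lexical and surface strings. It is a toy under stated assumptions and does not use xfst: the boundary symbol `^`, the function names, and the exact ordering of tags are hypothetical, and the alternation step is reduced to erasing the boundary. The roots and suffixes are the ones quoted in the text (balə, natə, mi, mu).

```python
from itertools import product

ROOTS = ["balə", "natə"]
# Suffix -> person/number tags (Table 1); the tag order is an assumption.
SUFFIXES = {"mi": "+1P+Sg", "mu": "+1P+Pl"}


def lexicon():
    """Morphotactics: yield (lexical string, intermediate string) pairs."""
    for root, (suffix, tags) in product(ROOTS, SUFFIXES.items()):
        yield f"{root}+VFM+Pure{tags}", f"{root}^{suffix}"


def rules(intermediate):
    """Stand-in for the alternation rules: here they only erase '^'."""
    return intermediate.replace("^", "")


# "Composition": push every lexicon output through the rule layer.
GENERATE = {lexical: rules(mid) for lexical, mid in lexicon()}

# The same relation read in the other direction gives the analyzer.
ANALYZE = {}
for lexical, surface in GENERATE.items():
    ANALYZE.setdefault(surface, []).append(lexical)

print(GENERATE["balə+VFM+Pure+1P+Sg"])
print(ANALYZE["natəmu"])
```

The point of the sketch is that the rules are written once: generation looks the relation up from the lexical side, and analysis looks up the same relation from the surface side.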
4. Implementation of the Parser

4.1 System Architecture

The steps involved in the operation of the parser can be summarized as follows:
1. The input text file is fed to the transliteration module. This file contains the Sinhala words that need to be analyzed or generated, and the text should be in Sinhala Unicode.
2. The transliterator converts the contents of the input text file into Romanized Sinhala text.
3. xfst (Xerox finite state transducer) is invoked and the compiled finite state network is loaded onto the stack. The transliterated input text file is given as the input for xfst’s ‘apply up’ or ‘apply down’ command, depending on analysis or generation mode.
4. xfst analyzes/generates the input strings and the output is written to a text file in ANSI encoding.
5. The output file is processed by the result formatter and a formatted text file in Unicode encoding is produced.
6. The formatted output file is input into the transliterator and the transliterated output file is produced.

4.2 Transliterator

Because the Xerox finite state tool version 8.1.3 does not support Unicode, an alternative was needed to represent the Sinhala script. Therefore a transliteration scheme was designed for the representation of Sinhala characters in Roman notation inside the system. As a standardized transliteration scheme for Sinhala does not exist, the scheme used here is unique to this system. The transliteration program is written in Java; it takes the input text file and converts the Sinhala characters to Romanized script. Figure 2 shows the structure of the transliterator.

Figure 2 : Structure of Transliterator

The transliteration scheme used is roughly based on the system presented for Roman notation for Devanagari in Natural Language Processing - A Paninian Perspective [8]. The complete transliteration scheme used in this project is listed in Appendix B.

4.3 Lexicon

The lexicon comprises several files which are programmed in lexc (lexicon compiler). There is a lexicon file for each class of verb root, and these contain the verb roots and morphemes recorded during implementation and training. As mentioned in section 2.3, four basic verb root classes were identified. However, some of these verb groups could be further divided into sub-groups depending on their orthography. For example, the roots adi අදි and ari අරි both belong to the same ‘i group’. However, they behave differently in their past tense form; adi අදි becomes ædda ඇද්ද while ari අරි becomes æriya ඇරිය. This phenomenon is not isolated, and since there are other roots that display similar behaviour [1], they were put into separate sub-groups. Therefore separate lexicons were created for each of the sub-groups. Altogether there are 11 such lexicon files. The format of the lexicon files programmed in lexc is given in Figure 3.

Figure 3 : Format of the lexicon

The Multichar_Symbols section defines the tag-set used inside that file. LEXICON Root is a reserved name, corresponding to the start of the network. Other LEXICONs are defined as needed and are named according to the requirements of the grammar. The optional END keyword terminates the lexc description [3].
All morphemes (prefixes, roots, suffixes, etc) that are used to build verbs are organized into the sub-lexicons which reside inside the core lexicon. Lexicons are chiefly used for describing non-phonological stem end alternations. Thus the language denoted by the lexicon is typically unfinished orthographically. That is, the lexical strings produced by the lexc grammars are actually intermediate strings and need some degree of modification before they can be candidates for the target language. These modifications may reflect orthographical conventions and/or phonological processes such as deletion, vowel frontation, epenthesis, etc. An example is given in Figure 5. The intermediate string is transformed into a recognizable verb after passing through the phonological/orthographical layer containing these alteration rules. The string further has to go through the filter layer to come out as a valid string in the target language. These two layers are explained in detail later (in sections 4.4 & 4.5). The separately compiled lexicon and rule network layers are subsequently composed together. Figure 4 illustrates this process.
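
The transliteration step of section 4.2 can be illustrated with a small Python sketch. The project's own transliterator is written in Java and is not reproduced in the paper, so everything below is an assumption about how such a mapping might work: consonants carry an inherent ‘a’ (as in the Appendix B entries ක ka, බ ba), a dependent vowel sign replaces that ‘a’, and the hal sign (listed with ‘-’ in Appendix B) removes it. Only a handful of characters are included; the function name and error handling are hypothetical.

```python
# A few Appendix B entries; consonants are listed with their inherent 'a'.
CONSONANTS = {
    "\u0d9a": "ka",  # ක
    "\u0db6": "ba",  # බ
    "\u0dbd": "la",  # ල
    "\u0db1": "na",  # න
    "\u0dbb": "ra",  # ර
    "\u0db8": "ma",  # ම
}
VOWEL_SIGNS = {
    "\u0dcf": "A",  # sign for long a
    "\u0dd0": "æ",  # sign for æ
    "\u0dd4": "u",  # sign for u
}
HAL = "\u0dca"  # hal sign: assumed to cancel the inherent vowel


def transliterate(word):
    """Romanize a Sinhala word using the partial tables above."""
    out = []
    for ch in word:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in VOWEL_SIGNS and out and out[-1].endswith("a"):
            out[-1] = out[-1][:-1] + VOWEL_SIGNS[ch]  # replace inherent 'a'
        elif ch == HAL and out and out[-1].endswith("a"):
            out[-1] = out[-1][:-1]  # drop inherent 'a'
        else:
            raise ValueError(f"character not covered by this sketch: {ch!r}")
    return "".join(out)


# The word written b-l-n-hal-n-aa (balanna: in the paper's phonetic notation).
print(transliterate("\u0db6\u0dbd\u0db1\u0dca\u0db1\u0dcf"))
```

Note that the output uses the internal scheme of Appendix B (for example ‘A’ for the long vowel), which differs from the phonetic romanization used in the running text of the paper.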

Figure 4 : Parser as a transducer

4.4 Orthographical/Phonological Layer

The Phonological/Orthographical layer comprises .regex files that contain all the alternation rules needed to map the intermediate string to the ultimate target string (surface string). These mappings are notated using xfst replace rules, are compiled into transducers, and are composed on the bottom of the lexical transducer.
There are 4 such rule files, one for each verb category, dictating the orthographies related to each category. For example, in Figure 5 the first rule, [n -> {nn} || _ (vowel) %+Kru], dictates that every instance of the symbol ‘n’ that is followed by an optional vowel and then the symbol ‘%+Kru’ must be replaced by the symbols ‘nn’.
The second rule, [%+Kru -> 0], simply states that the intermediate tag ‘+Kru’ is to be replaced by the empty string.
The third rule, [vowel -> 0 || _ vowel], instructs the program to delete any vowel that is followed by another vowel.

Figure 5 : Intermediate strings in lexicon

4.5 Morphotactic Filter Layer

It is impossible to lay down rigid rules declaring the behaviour of a natural language, more so for a morphologically rich language such as Sinhala. The filter layer acts as a cleanup transducer on the surface side. The surface side is filtered to remove certain verbs that comply with the morphotactics and alteration rules but are nevertheless not in the Sinhala language due to some irregularities.
For example, the verb root ‘va’ has a lexical form of va+VFM+Derived+Past+3P+Sg+Mus (the finite derived verb form in past tense 3rd person singular masculine) which corresponds to the surface strings ‘vu:ve:yə’ and ‘vu:ye:yə’. However, there is only one surface form in the feminine version of the form, namely ‘vu:va:yə’. Even though the form ‘vu:ya:yə’ is generated according to the lexical rules, it is not an acceptable verb in the language. Trying to impose this irregularity inside the lexicon would break the design of the lexicon. Such irregularities violate the principle that lexicons describe the rules of morphotactics and two level rules can be used to formalize the distribution and the phonological relations of stem variants [3]. Stem final alternations have often become individual properties of a word and are not predictable by phonological rules.
Therefore a filter layer, filter.regex, is used on the surface side to remove such invalid verb forms.
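
The three replace rules of section 4.4 and the surface filter of section 4.5 can be approximated with ordinary regular expressions. The Python sketch below is not xfst and makes several assumptions: the rules are applied one after another as string rewrites, the vowel inventory is a guess limited to the romanized symbols seen in the examples, and the intermediate strings (such as `baləna+Kruemi`) are hypothetical reconstructions, since Figure 5 is not reproduced here. The expected outputs are the derived verbs of Table 2.

```python
import re

VOWELS = "aeiouəæ"  # assumed vowel inventory for this sketch


def apply_rules(intermediate):
    """Apply the three alternation rules in order to an intermediate string."""
    # Rule 1: n -> nn when followed by an optional vowel and then +Kru.
    s = re.sub(rf"n(?=[{VOWELS}]?\+Kru)", "nn", intermediate)
    # Rule 2: delete the intermediate tag +Kru.
    s = s.replace("+Kru", "")
    # Rule 3: delete any vowel that is directly followed by another vowel.
    s = re.sub(rf"[{VOWELS}](?=[{VOWELS}])", "", s)
    return s


# Surface forms rejected by the filter layer (the example from section 4.5).
REJECTED = {"vu:ya:yə"}


def morphotactic_filter(surface_forms):
    """Drop forms that the rules generate but the language does not accept."""
    return [form for form in surface_forms if form not in REJECTED]


print(apply_rules("baləna+Kruemi"))
print(apply_rules("baləna+Krue:"))
print(morphotactic_filter(["vu:va:yə", "vu:ya:yə"]))
```

Keeping the rejected forms in a separate step, rather than special-casing them inside the rules, follows the design argument of section 4.5: the lexicon and rules stay regular, and the exceptions are removed afterwards.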

4.6 The Tag Set

A set of multi-character symbol tags is used as the morphological and syntactic tags that convey information about part-of-speech, tense, mood, number, gender, etc. Multi-character symbols are treated as atomic entities by the regular expression compiler; that is, the multi-character symbol +VFM is stored and manipulated just like a, b and c. For example, the sigma alphabet of the network compiled from [{bala} "+VFM":0] consists of the symbols a, b, l and +VFM.
Since a standardized tagset for Sinhala does not exist yet, a set of tags was formalized by consulting several Sinhala language textbooks. The complete set of multi-character symbol tags is given in Appendix A.

4.7 Parser Output

The parser is capable of analyzing and generating strings at the word level in both Unicode Sinhala and Romanized notation, using the transliteration format used in the project. Samples of some verbs which have been correctly analyzed are given below:
• කළ (kələ) => කර +Kru+Past - Krudantha Past tense of root කර (kərə).
• කළේ (kəle:) => කර+VNF+Derived+Dec - Non Finite, Derived, Declarative form of root කර (kərə).
• ගියේය (giye:yə) => ය +VFM+Derived+Past+3P+Sg+Mus - Finite, Derived, Past tense, 3rd Person, Singular, Masculine verb of root ය (yə).
Likewise, some examples of verb forms which have been accurately generated are given below:
• ලබ +VFM+Pure+Past+3P+Sg => ලැබීය (læbi:yə) - Finite, Pure, Past tense, 3rd Person, Singular form of root ලබ (labə).
• හිත +VNF+Derived+Nom => හිතනවා (hithənəva:) - Non finite, Derived, Nominative verb of root හිත (hithə).
• කිය+VFM+Derived+Past+3P+Sg+Fem => කීවාය (ki:va:yə) - Finite, Derived, Past tense, 3rd Person, Singular, Feminine verb of root කිය (kiyə).

4.8 Problems & Challenges

Insufficient domain knowledge: The primary difficulty faced was the lack of linguistic expertise. Since the author’s knowledge of Sinhala verb structure was limited, the first task was to learn the language features. As the study and understanding of Sinhala’s linguistic structures was very important in designing a linguistic model, it was crucial that a thorough study was done first before going into implementation.

Lack of standards: There is a marked lack of standards for Sinhala linguistics. While there exist numerous valuable text books, etc, there is no set of agreed rules and conventions for the language. Finally it was decided to adopt Prof. J. B. Disanayake’s [1] structure for the project.

Absence of a tagged-corpus: Because there is not yet a tagged-corpus for the Sinhala language, a considerable effort had to be spent on manually extracting words for the distinct verb list from the corpus. Due to the author’s lack of expertise on the Sinhala verb, some verbs may have been omitted from the verb corpus in the process, and several words that are not verbs may have been added.

Unicode support: The Xerox compiler version 8.1.3 does not have Unicode support. Since it was not possible to acquire a version that does, a way to incorporate the Sinhala language without using Unicode had to be found. A transliteration scheme was designed and implemented as a solution.

5. Experiments & Results

5.1 Training

Methodology. The initial verb roots and rules were acquired from the textbook Kriya Pathaya [1]. These were used during the development of the parser. After completing the implementation, the corpus was used to acquire words for training purposes. First, a list of verbs was manually filtered out of the corpus. Since the corpus itself contained distinct words, the verb list also comprised distinct verbs. Next, 700 verbs out of a list of 1631 were chosen from the verb list for training. The training set was acquired by grouping the corpus into 200 word sets and extracting every other set. Thus, the training set was formed from the verbs 1 - 200, 401 - 600, 701 - 900, and 901 - 1000. This was done because an allowance for different data domains was needed. It was assumed that the corpus contained data from different domains and that words that are spatially close displayed similar morphotactic characteristics.
The training results with respect to the number of errors are shown in Figure 6. Although this helps to give a general picture, it must be noted that the training set itself was not completely accurate, i.e. the verb list contained several manual processing errors such as including non-verbs in the list, including compound words (eg : සිදුකරන sidukərənə), etc.
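
The selection of training and test verbs by position in the verb list (section 5.1, with the test sets taken from the ‘Verb set’ column of Table 4) can be checked with a short Python sketch. The ranges are copied from the text; the helper name `positions` is hypothetical, and the sketch only reproduces the set sizes, not the verbs themselves.

```python
# 1-based, inclusive positions in the verb list, as quoted in the paper.
TRAIN_RANGES = [(1, 200), (401, 600), (701, 900), (901, 1000)]
TEST_RANGES = [(201, 300), (301, 400), (601, 700)]


def positions(ranges):
    """Expand inclusive (low, high) ranges into a flat list of positions."""
    return [i for low, high in ranges for i in range(low, high + 1)]


train = positions(TRAIN_RANGES)
test = positions(TEST_RANGES)

print(len(train), len(test))
print(set(train).isdisjoint(test))
```

Under these ranges the training set has 700 positions and the test set 300, and the two do not overlap, which matches the set sizes reported in the paper.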
Figure 6 : Overall results in training

Corrections. The corrections that were performed were twofold:
• Adding new verb roots
• Adding new rules
The first correction, adding a new root, was done whenever a variation of a verb root that was not in the lexicon was encountered. The new roots that were added are illustrated as a line chart in Figure 7.

Figure 7 : Roots added during training

The second correction, the addition of a new rule, was done when the new rule encountered was another variation of a rule that was already compiled. Otherwise, if the rule was completely novel, it was ignored. For example, the Non finite, Pure, Lakshya verb of the verb root ය ‘ya’ was initially defined as යන්ට ‘yanta’. However, during training several other variations of this rule were found. After the corrections the parser gives four generations for this form:
• යෑමට yæ:mətə
• යාමට ya:mətə
• යන්නට yannətə
• යන්ට yantə
Figure 8 shows the new rules that were added during training. The number of rules encountered decreased as training progressed, while the number of new roots increased. Furthermore, the total number of new roots found far outweighs the total number of new rules added. This is due to the corpus structure, which is frequency based. Thus, the verbs at the top are the highest occurring verbs in the language.

Figure 8 : Rules added during training

5.2 Error Prediction

From a total of 700 words, the parser analyzed 495 words correctly, thus placing the number of errors at 205 excluding non-verbs. Using these statistics the expected error for unseen data was predicted at 29%.

\[
\text{Predicted error} = \frac{\text{Total number of errors}}{\text{Total number of words parsed}} \times 100
\]

5.3 Testing

A set of unseen data was chosen from the remaining 200 word sets that were left after the ones taken for training. Thus, the test data consisted of 300 unseen verbs. Without taking into account the number of non-verbs, which was found to be 6, the testing results are illustrated in Table 4.

Verb set (in corpus)    Total number of errors    Number of errors due to new roots    Number of errors due to new rules
201-300                 24                        14                                   10
301-400                 27                        14                                   13
601-700                 39                        20                                   19
Table 4 : Testing results

According to these figures, out of a set of 300 test words, there were 90 parse failures. After removing the non-verbs (25 in number) from the test set, the input set came down to 275 words. This gives an actual error rate of 32.73% and a parse rate of 67.27%. However, it is clear that a majority of failures were caused by verb roots that were not in the system (see Table 4). Ideally, the verb roots in the language could have been obtained by using a stemmer. However, such a stemmer for Sinhala does not yet exist. Therefore, if the error caused by data sparseness, i.e., new roots, were to be eliminated, an estimate of the true error rate can be derived. This was calculated by removing the failures due to roots that were not in the lexicon, which resulted in an error rate of 15.27%, thus giving a success rate of 84.73%. This is summarized in Figure 9.

Figure 9 : Results of testing

In the testing set too, the number of new roots encountered has increased. In accordance with the training results, this too can be explained by examining the distribution of the corpus (see Figure 10).

Figure 10 : Verb distribution

Because the high frequency verbs share common roots, the actual number of verb roots is less at the top of the corpus than at the bottom.

5.4 Evaluation

The actual error rate of the system was found to be 32.73% (see section 5.3) while the predicted error was 29% (see section 5.2). Thus the actual error differs from the predicted error by 3.73%. Several reasons for this deviation could be deduced from the results.
Firstly, since the corpus is organized in a frequency based structure, the words at the top are the most frequently occurring words in the language. Furthermore, this frequency based structure ensures that the top part of the corpus contains different verb forms sharing the same verb root. Therefore, the number of rules learnt decreases as data from the lower portions of the corpus are trained, while the number of new roots learnt increases. Thus, the further down a word is situated in the corpus, the less likely it is to be included in the lexicon.
Irregular verbs in the language could be another culprit for the deviation. The issue is not so much the existence of irregular verb roots as the fact that even normal verb roots often have irregular forms associated with them. For example the verb root දකි daki is a common and regular root in the language. Yet it too behaves in an irregular fashion in instances such as its Krudantha past tense form. The regular verb of this specification is given as දැක්කා dækka: while it also has another verb form දුටු dutu. The other verb roots in this verb root’s class, such as අදි adi, අමදි amadi, පිරිමදි pirimadi, do not display this behaviour either. Such complicated irregularities are not easy to tackle with linguistic generalizations.
The third cause for the deviation could be the negated and compound verbs. In Sinhala, negation occurs when a ‘no’ prefix is added to the verb root. However, as negation is not handled in the current implementation, negated verbs fail at the analyzer even if the same verb form is accepted without the negation. For example, the verb කියා kiya: is correctly analyzed by the parser as kiyə+VNF+Pure+Pri. Yet the parser fails on its negated verb නොකියා nokiya:. Furthermore, there is a debate whether certain verbs in Sinhala should be treated as two separate words or one word. For example, the verb කරගෙන kərəgenə is sometimes written as one word, while in some texts it is given as two words: කර ගෙන kərə genə. Since the convention used in the parser is to treat it as two words, the parser fails when it encounters instances in the corpus where it is given as one word.
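
The percentages quoted in sections 5.2 to 5.4 follow directly from the reported counts, and the arithmetic can be re-derived with a few lines of Python. The counts are taken from the text and Table 4; the function and variable names are hypothetical.

```python
def error_rate(errors, total):
    """Error rate as a percentage, rounded to two decimal places."""
    return round(100 * errors / total, 2)


# Section 5.2: 205 errors in 700 training words.
predicted = error_rate(205, 700)

# Table 4: (total errors, errors due to new roots, errors due to new rules).
TABLE_4 = {"201-300": (24, 14, 10), "301-400": (27, 14, 13), "601-700": (39, 20, 19)}
total_errors = sum(row[0] for row in TABLE_4.values())
rule_errors = sum(row[2] for row in TABLE_4.values())

# Section 5.3: 300 test words minus 25 non-verbs.
test_verbs = 300 - 25

actual = error_rate(total_errors, test_verbs)
without_new_roots = error_rate(rule_errors, test_verbs)

print(predicted, actual, without_new_roots)
```

The training figure comes out at 29.29%, which the paper reports as 29%. The test figures are 90 errors out of 275 verbs (32.73%) and, once the 48 failures due to unknown roots are set aside, 42 out of 275 (15.27%).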
6. Related work

Several pieces of work have been carried out on morphological parsing in Sinhala and Sanskrit. In Sinhala, a parser has been developed by Herath et al [4] for nouns, which uses stemming for analysis.
For Sanskrit, Goyal et al [9] and Anupam [10] have shown that by employing deterministic finite automata (DFA) to construct the rule base, it is possible to get an efficient parse for Sanskrit text, and claim to parse 10 verb classes. Goyal et al [9] have used a separate ‘Sandhi’ module to break up the sandhis before inputting the text to the DFA. Girish et al [11] have used a strictly rule based database and POS tagging approach for parsing Sanskrit without using DFAs, and claim to cover the verb forms of 450 commonly used verb roots. Goyal et al [9], Anupam [10] and Girish et al [11] have all used the Devanagari format and the Panini framework [8] for their construction of Sanskrit morphology. None, however, claims to be able to function as a two way parser, i.e. to generate as well as analyze.

7. Conclusions

The research work on the project concludes with several points supported by its experimental results. The main conclusion is that the two-level parsing method can successfully be applied to develop a morphological parser for Sinhala verbs. Also, it is evident that the Xerox finite state transducer tool can be used to implement the lexical transducers that encode the language morphology.
At the time of writing this report, a tagged corpus for Sinhala does not exist. For the approach followed in this parser, a tagged corpus is not necessary as it is a rule based method. Therefore, this is a suitable method for languages and applications where a tagged corpus is not available.
Using the two-level morphology approach is beneficial as it supports both analysis and generation without explicitly needing to implement the rules both ways. It is also highly extensible since new roots and new rules can easily be added to the system. In addition, extensive linguistic research and modeling is imperative for the success of this type of project.
Although the primary goal of the research was achieved, there are a number of areas where improvements and extensions can be added:
• The verb rules considered in this project are those that are documented in Prof. J.B.Disanayake’s book Kriya Pathaya [1]. For practical purposes, thorough research needs to be done to derive a complete set of rules for the Sinhala verb.
• Currently, the parser fails for unknown verb roots, i.e. verb roots that are not stored in the lexicon. However it may be possible to incorporate a guesser into the system that could give an approximate result by segmenting the input and trying to locate patterns according to a set of rules and known verb roots.
• Misspelled words and compound words are also not handled in the current implementation. A rule based segmenting algorithm could be developed for the division of such compound words. For misspelled words, the parser could suggest some known verbs which have the minimum edit distance from the misspelled string.
• The current system contains a single finite state network that encodes the morphological and grammatical information of Sinhala without concentrating on a specific domain. However, it would be beneficial to have several networks with a core network which encodes the basic morphology. The other networks would extend the core network, and they could be used to focus on specific domains, multiple orthographies, multiple levels of strictness, etc.

Appendix A

The Tag Set
+VFM = Finite
+VNF = Non Finite
+Pure = Pure
+Derived = Derived
+Kru = Krudantha
+Sg = Singular
+Pl = Plural
+1P = First Person
+2P = Second Person
+3P = Third Person
+Pres = Present Tense
+Past = Past Tense
+Command = Command
+Invol = Involitive
+Mus = Masculine
+Fem = Feminine
+Cont = Continuous
+Mix = Mixed
+Pos = Positive
+Neg = Negative
+Cond = Conditional
+Avas = Avasthika
+Lakshya = Lakshya
+Pri = Prior
+Bhv = Bhavaroopa
+Pra = Prayojya
+Anan = Ananthara
+Dec = Declarative
+Cau = Causative
+Nom = Nominative

Appendix B

The Transliteration Scheme

◌ං H     ◌ඃ M     ◌ෙ e
අ a      ආ A      ◌ෛ å
ඇ æ      ඈ Æ      ◌ෝ O
ඉ I      ඊ I      ◌ෟ -
උ u      ඌ U      ◌ෳ -
ඍ R      ඎ RR     ◌ේ E
එ e      ඒ E      ◌ො o
ඓ ã      ඔ o      ◌ෞ à
ඕ O      ඖ au     ◌ෲ õ
ක ka     ඛ Ka     ග ga
ඝ Ga     ඞ ña     ඟ Ya
ච ca     ඡ Ca     ජ ja
ඣ Ja     ඤ qa     ඥ Qa
ඦ ôa     ට ta     ඨ Ta
ඩ da     ඪ Da     ණ Na
ඬ Fa     ත wa     ථ Wa
ද xa     ධ Xa     න na
ඳ Va     ප pa     ඵ Pa
බ ba     භ Ba     ම ma
ඹ Sa     ය ya     ර ra
ල la     ව va     ශ za
ෂ Za     ස sa     හ ha
ළ La     ෆ fa     ◌් -
◌ා A     ◌ැ æ     ◌ෑ Æ
◌ි I     ◌ී I     ◌ු u
◌ූ U     ◌ෘ ß

Acknowledgment

I am deeply indebted to Dr. A R Weerasinghe for supervising this project and offering me valuable insight from time to time.
Many thanks to Dr. Lalith Premaratne for his guidance throughout the course of this work as the examiner of my project.
I am sincerely grateful to Dr. Chamath Keppetiyagame for co-ordinating the research projects and advising on the Sinhala LaTeX system.
I am also thankful to Mr. Dulip Herath and all the members of the Language Research Laboratory for helping me clarify many linguistic aspects of Sinhala grammar and giving me access to many resources, and also to Mr. Harshula Jayasuriya for explaining the nuances of the LKLUG keyboard layout.
Last but not least, my special thanks to my family and all my colleagues for their encouragement and understanding.

References

[1]. J.B.Disanayake. Basaka Mahima: 11 - Kriya Pathaya. s.l. : S. Godage & Brothers, 2001.
[2]. Kimmo Koskenniemi. A general computational model for word-form recognition and production. Stanford, California : Association for Computational Linguistics, 1984.
[3]. Lauri Karttunen, Kenneth R. Beesley. Finite State Morphology. s.l. : CSLI Publications, 2003.
[4]. Herath D.L, Weerasinghe A.R. A stemming algorithm to analyze inflectional morphology of Sinhala nouns. Unpublished.
[5]. K. Koskenniemi. Two-level morphology: A general computational model for word-form recognition and production. s.l. : Association for Computational Linguistics, 1983. Vol. Publication 11.
[6]. Lauri Karttunen, Kenneth R. Beesley. A short history of two-level morphology. s.l. : COLING 1992: The 15th International Conference on Computational Linguistics, 2001.
[7]. Xerox research centre. [Online] http://www.xrce.xerox.com./.
[8]. A. Bharathi et al. Natural Language Processing : A Paninian Perspective. New Delhi : Prentice-Hall of India, 1996.
[9]. Pawan Goyal, Vipul Arora, Laxmidhar Behera. Analysis of Sanskrit text : Parsing and semantic. Rocquencourt, France : s.n., 2007.
[10]. Anupam. Sanskrit as Indian networking language : A Sanskrit parser. 2004.
[11]. Girish Nath Jha, Muktanand Agrawal, Subash, Sudhir K. Mishra, Diwakar Mani, Diwakar. Inflectional Morphology Analyzer for Sanskrit. Rocquencourt, France : s.n., 2007.
