Example Based Machine Translation For English-Sinhala Translations


Example Based Machine Translation for English-Sinhala Translations

Anne Mindika Silva, Ruvan Weerasinghe


University of Colombo School of Computing
Colombo 07, Sri Lanka.
Contact No: 0714230465
E-mail: anne.mindika@gmail.com

Abstract

This paper presents an Example Based Machine Translation System for English-Sinhala translation, intended mainly for use in the government domain. The System uses a bilingual English-Sinhala corpus aligned at sentence level as its knowledge base. Given a source phrase, the System retrieves the English sentences in which the input phrase is found, together with the corresponding Sinhala sentences (Intra-Language Matching). The System then runs a scoring algorithm on the retrieved Sinhala sentences to find the most frequently occurring Sinhala phrase in the set, which is most likely to be the best candidate translation for the phrase (Inter-Language Matching). The output of the System obtained BLEU scores of 0.17 - 0.26 for 3-gram analysis using one reference translation.

Index Terms - Example Based Machine Translation, Natural Language Processing, Sinhala Language Processing

1. Introduction

Sri Lanka has a multiracial society comprising a 74% Sinhala-speaking population and an 18% Tamil-speaking population. English is spoken competently by about 10% of the population. Translation between the 3 official languages - Sinhala, Tamil and English - is very important in a multi-cultural country like Sri Lanka. Given the ethnic conflict in the country, translation can play a vital role in bringing society together by improving understanding among its peoples.

Translation is by definition the activity of interpreting the meaning of a text in one language - the source text - and producing, in another language, a new, equivalent text - the target text, or translation. Machine Translation can be considered as the attempt to automate all, or part, of the process of translating from one human language to another. The major problem of manual translation is the huge demand for translation combined with too few human translators. The productivity of human translators can be greatly increased with Machine Translation techniques.

1.1 Approach

This paper presents a Machine Translation Tool for English-Sinhala translations, to be used for translating government documents. The approach used in this study is Example Based Machine Translation, which uses a sentence-aligned bilingual corpus and a list of function words of both languages.

Example Based Machine Translation techniques prove ideal for a less-resourced language like Sinhala, since they allow researchers to experiment with the virtues of the technique without waiting for further resources to become available.

One of the advantages of the approach is that the quality of the translation will improve incrementally as the example set becomes more complete, without the need to update and improve detailed grammatical and lexical descriptions. [15]

1.2 Motivation

Although many free and commercial systems exist for the widely used languages (English, French, German, etc.), systems for English-Sinhala and Tamil-Sinhala translations are not available.

There is very little local (Sinhala/Tamil) content available in electronic form even now, so locals in Sri Lanka have to rely mostly on English material. Not only international content but also much academic content is available only in English. This problem is mainly faced by speakers who are only familiar with Sinhala and Tamil. It would be highly
desirable if English content could be translated into Sinhala/Tamil using Machine Translation techniques. Another requirement is that most of the material in Sinhala/Tamil be translated into English, so that local knowledge and culture can be easily disseminated to the global community.

Translation between Sinhala and Tamil is also very desirable, especially in the current context of conflict, to develop a good relationship between the two main cultural groups of Sri Lanka.

In Sri Lanka there are many kinds of material that need to be prepared in all three official languages: Sinhala, Tamil and English. For example, government documents, gazettes, public notices, etc. are issued in all three languages. This is an area where Machine Translation could be put to valuable use, especially when applied within a particular domain.

1.3 Scope

The system is expected to provide possible translations in the target language for chunks of the source text (words, phrases, sentences), so that the user of the System (the human translator) can select the most suitable translation or else insert his own translation.

If the user defines a new translation, it should be saved in the corpus so that it can be re-used in future translations (the functionality of a Translation Memory).

The bilingual corpus can be a shared resource, so that a group of translators working in the same domain can benefit from each other's contributions to the system.

2. Background

2.1 Translation

The translation process can be described simply as [11]:
1. Decoding the meaning of the Source Text, and
2. Re-encoding this meaning in the Target Language.

To decode the meaning of a text the translator must first identify its component "translation units", that is to say the segments of the text to be treated as a cognitive unit. A translation unit may be a word, a phrase or even one or more sentences. This process requires thorough knowledge of the grammar, semantics, syntax, idioms and the like of the source language, as well as the culture of its speakers.

2.2 Problems in translation

Translation in general is a difficult activity, and there are several problems faced even by human translators [11]:
- The source text may be difficult to read, misspelled/misprinted, incomplete or inaccurate.
- Language problems such as dialect terms, unexplained acronyms and abbreviations, proper names, obscure jargon, idioms and slang
- Rhymes, poetic meters, highly specific cultural references and humour
- The problem of untranslatability - untranslatable words
- Words having different meanings in different contexts
- The same word having different meanings depending on the culture
- Words having different levels of precision
- Expressions referring to concepts that do not exist in another language

2.3 Problems in English-Sinhala translations

- Sinhala is an SOV (Subject-Object-Verb) word order language, whereas English is an SVO (Subject-Verb-Object) word order language.
- In Sinhala there are almost no subordinate clauses as in English, but only non-finite clauses that are formed by means of participles and verbal adjectives. E.g.: "The man who writes books" translates to /pot liyana miniha:/, literally "books writing man".
- Sinhala is a left-branching language, which means that determining elements are usually put in front of what they determine. An exception to this is statements of quantity, which usually stand behind what they define. E.g.: "the four books" translates to /pot hatara/, literally "books four".
- There are no prepositions, only postpositions. E.g.: "under the book" translates to /pota jata/, literally "book under".
- Sinhala is a pro-drop language: the subject of a sentence can be omitted when it is redundant because of the context. E.g.: the sentence /koheda gie:/, literally "where went", can mean "where did I/you/he/she/we... go". Also the copula "to be" is generally omitted: "I am rich" translates to /mama po:sat/, literally "I rich".
- There is a four-way deictic system (which is rare): there are four demonstrative stems, /me:/ "here, close to the speaker", /o:/ "there, close to the person addressed", /ara/ "there, close to a third person, visible" and /e:/ "there, close to a third person, not visible".

2.4 Trends in translation

The classical architectures for machine translation are Direct Translation, Transfer Approach and Interlingua Approach. Real systems tend to involve combinations of elements from these three architectures; thus each is best thought of as a point in an algorithmic design space rather than as an actual algorithm.

In direct translation, we proceed word-by-word through the source language text, translating each word as we go. Direct translation uses a large bilingual dictionary, each of whose entries is a small program with the job of translating one word.

In transfer approaches, we first parse the input text, and then apply rules to transform the source language parse structure into a target language parse structure. We then generate the target language sentence from the parse structure.

In interlingua approaches, we analyze the source language text into some abstract meaning representation, called an interlingua. We then generate into the target language from this interlingual representation.

A common way to visualize these three approaches is with the Vauquois triangle shown in Fig. 1 [6]. The triangle shows the increasing depth of analysis required (on both the analysis and generation end) as we move from the direct approach through transfer approaches, to interlingual approaches. In addition, it shows the decreasing amount of transfer knowledge needed as we move up the triangle, from huge amounts of transfer at the direct level (almost all knowledge is transfer knowledge for each word) through transfer (transfer rules only for parse trees or thematic roles) through interlingua (no specific transfer knowledge).

Most Transfer or Interlingual rule-based systems are based on the idea that success in practical MT involves defining a level of representations for texts which is abstract enough to make translation itself straightforward, but which is at the same time superficial enough to permit sentences in the various source and target languages to be successfully mapped into that level of representation. That is, successful MT involves a compromise between depth of analysis or understanding of the source text, and the need to actually compute the abstract representation.

In this sense, Transfer systems are less ambitious than Interlingual systems, because they accept the need for often quite complex mapping rules between the most abstract representations of source and target sentences. As the linguistic knowledge increases, MT systems too should improve based on linguistic rules encoding that knowledge. This position is based on the fundamental assumption that finding a sufficiently abstract level of representation for MT is an attainable goal. However, some researchers have suggested that it is not always the case that the deepest level of representation is necessarily the best level for translation. Also for languages that have similar properties, Shallow transfer methods can be used without going for syntax level transfers.

The currently available MT systems can also be classified as,
1. Machine Translation - where the translator supports the machine, and
2. Computer Assisted Translation - where the computer program supports the translator

2.5 Machine Translation

Machine Translation (MT) is a form of translation where a computer program analyses the source text and produces a target text without human intervention. In
Machine Translation, the translator supports the machine.

At its basic level, MT performs simple substitution of atomic words in one natural language for words in another. Using corpus techniques, more complex translations can be performed, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms, as well as the isolation of anomalies. Current machine translation software often allows for customisation by domain or profession (such as weather reports), improving output by limiting the scope of allowable substitutions.

Most such systems (e.g.: Alta Vista's 'Babel Fish', Google's Translation facility) produce what is called a "gisting translation" - a rough translation that gives the "gist" of the source text. However, in fields with highly limited ranges of vocabulary and simple sentence structure, for example in weather reports, machine translation can deliver very useful results. Improved output quality can also be achieved by human intervention. [10]

There are some identified sub-fields in Machine Translation:
- Dictionary Based Machine Translation: Machine translation can use a method based on dictionary entries, which means that the words will be translated as a dictionary does - word by word, usually without much correlation of meaning between them. [12]
- Statistical Machine Translation (STAT MT or SMT): SMT tries to generate translations using statistical methods based on bilingual text corpora. The document is translated according to the probability that a string e in English is the translation of a string f in French, with the model parameters estimated from the corpora. The statistical translation models can be word based or phrase based, as in many more recent designs. Models based on syntax have also been tried. [13]
- Example Based Machine Translation (EBMT): EBMT is essentially translation by analogy. EBMT is also regarded as a case-based reasoning approach to MT, where previously resolved translation cases are reused to translate new SL text.
- Interlingual Machine Translation: Uses a rule-based machine translation approach. According to this approach, the source language (i.e. the text to be translated) is transformed into an interlingual, language-independent representation. The target language is then generated out of the interlingua. [11]

Both Statistical Machine Translation and Example Based Machine Translation are mainly based on the Direct Translation model. Both use Machine Learning, data-driven approaches in which Pattern Recognition and Data Mining concepts can be put to use. The advantages of these two kinds of system are non-reliance on expert knowledge, learnability and trainability. While EBMT systems place more reliance on the examples, SMT systems place more reliance on statistical techniques.

3. Example Based Machine Translation

The basic assumption of EBMT is: "If a previously translated sentence occurs again, the same translation is likely to be correct again". [1]

This idea is sometimes thought to be reminiscent of how human translators proceed when using a bilingual dictionary: looking at the examples given to find the Source Language (SL) example that best approximates what they are trying to translate, and constructing a translation on the basis of the Target Language (TL) example that is given. [2]

The general architecture of an EBMT system presented by Konstantinidis [4] is given in Fig. 2. The system begins with the input, referred to as the source text. The most similar and analogous examples are retrieved from the source language database. The next step is to retrieve the corresponding translations of the analogous examples. The final step is to recombine the examples into the final translation.

The EBMT approach, which was proposed by Nagao, uses raw, unanalysed, unannotated bilingual data and a set of SL and TL lexical equivalences mainly expressed
in terms of word pairs (with SL and TL verb equivalences expressed in terms of case frames) as the linguistic backbone of the translation process. [7] The translation process is mainly a matching process which aims at locating the best match in terms of semantic similarities between the input sentence and the available examples in the database.

In EBMT, instead of using explicit mapping rules for translating sentences from one language to another, the translation process is basically a procedure of matching the input sentence against the stored example translations.

The basic idea is to collect a bilingual corpus of translation pairs and then use a best match algorithm to find the closest example to the source phrase in question. This gives a translation template, which can then be filled in by word-for-word translation.

The distance calculation, for finding the best match for a source phrase, can involve calculating the closeness of items in a hierarchy of terms and concepts provided by a thesaurus. For a given input, the system will then calculate how close it is to various stored example translations based on the distance of the input from the example in terms of the thesaurus hierarchy and how likely the various translations are on the basis of frequency ratings for elements in the database of examples. In order to do this, it must be assumed that the database of examples is representative of the texts we intend to translate.

Systems using the memory-based approach examine the MT problem from a human learning point of view and exploit language models based on corpora, statistics and examples, applying the analogy principle to translate by making use of past experiences.

Some EBMT systems operate on parse trees, or find the most similar complete sentence and modify its translation based on the differences between the sentence to be translated and the matched example. [8]

It is evident that the feasibility of the approach depends crucially on the collection of good data. However, one of the advantages of the approach is that the quality of translation will improve incrementally as the example set becomes more complete, without the need to update and improve detailed grammatical and lexical descriptions. Moreover, the approach can be (in principle) very efficient, since in the best case there is no complex rule application to perform. All one has to do is find the appropriate example and (sometimes) calculate distances.

However, there are some complications. For example, one problem arises when one has a number of different examples each of which matches part of the string, but where the parts they match overlap, and/or do not cover the whole string. In such cases, calculating the best match can involve considering a large number of possibilities. [15]

4. The proposed system

4.1 Functionality of the system

A prototype has been developed for English to Sinhala translations (using Visual C#.net 2005) which uses a sentence-aligned bilingual corpus as its Knowledge Base. Given a source sentence to translate, it allows the Translator to find the most suitable translations at phrase level and thus provides the facility to quickly arrange the suggested phrases to form the target sentence. Suitable translations for source chunks are found by retrieving the source and target language sentences which match or contain the given input, and then determining the best target language match using a scoring algorithm.

It can also take as input a source file to be translated (a plain text file in which the sentences are separated by line-breaks) and then assist the user in translating the file, one sentence at a time.

Another important feature is the System's ability to learn from past translations. The user can save the newly translated content into the Corpus, so that the new knowledge can be used for subsequent translations. Since the Corpus should be sentence aligned, the System remembers the past translations at sentence level until they are added to the Corpus.

4.2 The Knowledge Base

The current System essentially requires no knowledge of the structure of the languages, grammar rules, morphological analysis or parsing, although these can be used to improve the outcome in future developments. The sources of knowledge that the System uses are a bilingual English-Sinhala corpus and function word lists of the source and target languages.

Thus, the preliminary tasks involved in the experiment included finding pre-translated material for English and Sinhala in electronic form and aligning the text at sentence level to be added to the Corpus.

The sources for the pre-translated material were the Order Books of the Parliament (obtained by courtesy of the Parliament of Sri Lanka) and the Vibhasha Translation Magazine published by the Centre for Policy Alternatives (CPA), which were available in Sinhala, English and Tamil.
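
To illustrate how little linguistic knowledge this design needs, the following Python sketch (not the authors' C# prototype) models the knowledge base as a list of aligned sentence pairs plus an optional function-word set. It retrieves the pairs whose English side contains an input phrase and then counts contiguous target-side word sequences across the retrieved sentences. The names `AlignedPair`, `retrieve` and `score_candidates`, the whitespace tokenisation, the rounding of n/2, counting a phrase once per sentence, and the way function words are skipped are all assumptions made for illustration; the n/2 to 2*n length window is the assumption stated in Section 4.5.

```python
from collections import Counter
from typing import AbstractSet, List, NamedTuple, Set, Tuple


class AlignedPair(NamedTuple):
    """One sentence-aligned corpus entry (hypothetical type)."""
    english: str
    sinhala: str


def contains_phrase(sentence: List[str], phrase: List[str]) -> bool:
    """True if `phrase` occurs in `sentence` as a contiguous word sequence."""
    n = len(phrase)
    return any(sentence[i:i + n] == phrase for i in range(len(sentence) - n + 1))


def retrieve(corpus: List[AlignedPair], phrase: str) -> List[AlignedPair]:
    """Return the aligned pairs whose English side contains the phrase."""
    words = phrase.lower().split()
    return [p for p in corpus if contains_phrase(p.english.lower().split(), words)]


def substrings(words: List[str], min_len: int, max_len: int) -> Set[Tuple[str, ...]]:
    """All distinct contiguous word sequences of length min_len..max_len."""
    return {
        tuple(words[i:i + k])
        for k in range(min_len, max_len + 1)
        for i in range(len(words) - k + 1)
    }


def score_candidates(
    targets: List[str], n: int, stop_words: AbstractSet[str] = frozenset()
) -> Counter:
    """Count, for each candidate phrase, how many target sentences contain it.

    Candidates are n/2 to 2*n words long (n/2 rounded down, at least 1: an
    assumption). Candidates made up only of stop words are skipped, which is
    one possible reading of "disregarding function words".
    """
    scores: Counter = Counter()
    for sentence in targets:
        for cand in substrings(sentence.split(), max(1, n // 2), 2 * n):
            if not all(w in stop_words for w in cand):
                scores[cand] += 1
    return scores


if __name__ == "__main__":
    # Placeholder tokens S1..S6 stand in for Sinhala words.
    corpus = [
        AlignedPair("the government said", "S1 S2 S3"),
        AlignedPair("the government met", "S1 S2 S4"),
        AlignedPair("a party quit", "S5 S6"),
    ]
    hits = retrieve(corpus, "the government")
    scores = score_candidates([p.sinhala for p in hits], n=2)
    print(scores[("S1", "S2")], scores[("S3",)])
```

In this toy corpus the sequence `S1 S2` is shared by both retrieved sentences and so scores higher than `S3`, which appears in only one. Shorter sequences such as `S1` alone tie with it, which is why the System leaves the final choice among the top-scoring candidates to the user.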
The translated texts were then aligned at sentence level using an interface which was developed as part of the System. The aligned sentences were saved as text files in UTF-16 encoding. An initial Corpus consisting of several files, mainly in the Government and Political domain, was thus prepared to be used by the System.

Also, the function word lists were saved in two separate files so that the Translator can edit the content to suit his needs. These function words are used as stop words when looking for previous occurrences and finding best matches in the Corpus.

4.3 Steps of the translation process

Once the user inputs a Source Sentence to be translated, the system checks whether an exact match can be found in the English files of the Corpus. If an exact match is found, the corresponding Sinhala sentence is retrieved from the Corpus and returned as the Output Sentence.

If an exact match is not found, the user can break the source sentence into logical phrases and see whether the System can suggest an acceptable translation for each phrase. Given a source phrase, the System retrieves the English sentences in which the input phrase is found, together with the corresponding Sinhala sentences (Intra-Language Matching).

Then the System runs a scoring algorithm on the retrieved Sinhala sentences to find the most frequently occurring Sinhala phrase in the set, which is most likely to be the best candidate translation for the phrase (Inter-Language Matching). Once the scoring is done, the user can select the best match from the highest scoring outputs and proceed to translate another phrase of the source sentence.

Once the entire sentence is translated, the user can make the necessary modifications (phrase re-ordering, etc.) and proceed to translate the next sentence. When the user has translated the whole file, he can save the entire translation to a file and can also add the new translations to the Corpus.

4.4 Intra-Language matching

The input to this step is an English sentence/phrase which is submitted for translation. The output is a list of the form,
(input phrase S
(english-sentence-1, sinhala-sentence-1)
(english-sentence-2, sinhala-sentence-2)
(english-sentence-3, sinhala-sentence-3)
(english-sentence-4, sinhala-sentence-4)
.........
(english-sentence-n, sinhala-sentence-n)
)
in which the input phrase S occurs in each english-sentence-i, and sinhala-sentence-i is the corresponding translation of english-sentence-i.

The user can set the parameters for the matching process, namely, whether to match the exact input phrase (matching contiguous words), or to find matches where all of the words given as input occur anywhere in english-sentence-i.

Thus, when matching in the second mode (matching all the words regardless of position), the system also retrieves English sentences in which,
- There are gaps between the words matching the input words. E.g.: A X Y B Z C can match input chunk A B C.
- The word order is different from that in the input chunk. E.g.: B C A can match A B C.

The user can also specify whether to drop the function words of the input string when matching words. (This option is not available when matching the exact input string.)

4.5 Inter-Language matching

In this phase, the system calculates a score to find the most frequently occurring phrases in the list of target language sentences (sinhala-sentence-i) retrieved by Intra-Language Matching.

Here, an assumption is made that if the input string is n words long, then the corresponding translation of the string would be n/2 to 2*n words long. The double length assumption may not be suitable for long input strings, but is important when translating 1-2 word strings.

A score is calculated for every contiguous substring of each target string, from n/2 words long up to 2*n words long. E.g.: If the input string is 2 words long, and the current target language sentence is A B C D E F G, then the substrings that would be considered for the scoring process would be,
A, B, C, D, E, F, G
AB, BC, CD, DE, EF, FG
ABC, BCD, CDE, DEF, EFG
ABCD, BCDE, CDEF, DEFG

When only the above-given target string has been scored, all the above listed substrings would get a score of 1. But when the score is calculated for all of the target language sentences, the substring that occurs most often will get the highest score, and thus is most
likely to be the best matching translation for the given input string.

The user has the option to disregard the function words of the target language when scoring, to prevent the function words from getting an unnecessarily high score, which would affect the output.

5. An example translation

An example matching output in the process of translating the following source sentence is described below.

“The Sri Lankan government is in turmoil after one of the constituent parties quit the ruling coalition last week , leaving the Peoples Alliance ( PA ) as a minority in parliament with only 109 out of 225 seats.”

Since it is unlikely that an exact match will be found for the above-given sentence, the user has to translate the input sentence at phrase level.

If the user tries to translate the phrase "The Sri Lankan government", the System will find all the sentences (using Intra-Language Matching) in which the input phrase occurs in the Corpus. (See Fig. 3 for the list of sentences retrieved.)

The System would then perform Inter-Language matching, to find the target string that is most likely to be the translation of the given input phrase. (See Fig. 4)

The user can accept one of the suggested translations and then proceed to translate another phrase.

6. Evaluation

6.1 Evaluation of Machine Translation

One of the most difficult things in machine translation
is the evaluation of a proposed system/algorithm. Given
the ambiguity of natural languages, it is hard to assign
numbers to the output of natural language processing
applications.
When evaluating machine translation, a "good
translation" or "better translation" is hard to define. Also
in machine translation, there may not be one good
translation. Even when a sentence is translated by two
humans, there may be variances in word choice and
word order. Typically, there are many "perfect"
translations for a given source sentence. Even experts
may not agree when coming to a conclusion as to which
translation is better.
Human evaluations of machine translation weigh
many aspects of translation, including adequacy, fidelity
and fluency of the translation. Although human
evaluations are extensive, they are also very expensive
and time consuming. Because of this, the need for
quick, inexpensive automatic machine translation
evaluation has arisen, especially for machine translation
researchers and developers.
It is accepted that the closer a machine translation is to
a professional translation, the better it is. The fluency
and the adequacy of the output sentences can be
checked by n-gram analysis. If there is a reference
translation available, it becomes possible to compare the
output with the references and to put a number to the
notion of "good translation".
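
As a rough illustration of the n-gram comparison mentioned above, the Python sketch below computes a clipped ("modified") n-gram precision of a candidate translation against a single reference, the kind of quantity that n-gram based metrics build on. The function names, the whitespace tokenisation and the toy sentences are assumptions for illustration. This is not the scoring tool used in the paper, and it omits smoothing and any length penalty.

```python
from collections import Counter
from typing import List


def ngrams(words: List[str], n: int) -> Counter:
    """Multiset of contiguous n-grams in a token list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def modified_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram's count is clipped to its count in the reference, so
    repeating a matching word does not inflate the score. Matches are
    position independent.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


if __name__ == "__main__":
    # Same words in a different order: full 1-gram match, no 2-gram match.
    print(modified_precision("a b c", "c b a", 1))
    print(modified_precision("a b c", "c b a", 2))
```

The two calls show why both short and long n-grams are examined: a candidate that reuses the reference's words in the wrong order still matches on 1-grams but not on 2-grams.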
Some automatic machine translation evaluation techniques
include BLEU, NIST, WER (Word Error Rate), PER
(Position-independent word error rate) and GTM
(General Text Matcher). [5]
6.2 The Bleu translation metric

The Bleu metric (sometimes called the Blue metric) is an IBM-developed metric. The central idea is that the closer a machine translation is to a professional human translation, the better it is. To check how close a candidate translation is to a reference translation, an n-gram comparison is done between the two translations.

The closeness metric has been modelled on the highly successful word error rate metric used by the speech recognition community, appropriately modified for multiple reference translations and allowing for legitimate differences in word choice and word order. The main idea is to use a weighted average of variable length phrase matches against the reference translations. [3]

Basically, it compares n-grams of the candidate with the n-grams of the reference translation and counts the number of matches. These matches are position independent. The more matches there are, the better the candidate translation is. This sort of modified n-gram precision scoring captures two aspects of translation: adequacy and fluency. A translation using the same words (1-grams) as in the references tends to satisfy adequacy. The longer n-gram matches account for fluency. [3]

The BLEU metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation. The score improves when there are more reference translations per sentence. A brevity penalty is applied if the length of the result is less than the length of the references.

However, because the evaluation is based on n-gram comparison with reference sentences, it is possible to make sentences with completely different meaning by switching words/n-grams and still get high scores. The opposite can also occur: for example, when the Machine Translation algorithm consistently translates a certain constituent to "New South Wales politics", it is penalized heavily when reference texts mention "politics of New South Wales" and larger n-grams are used. [14]

6.3 Evaluation of the system

Evaluation of Machine Translation should typically be done using texts that have not been seen by the System previously. For this purpose a Training Set and a Testing Set are defined from the available data. The translated output is then compared with one or more reference translations to get the translation score.

The output of the System was evaluated using the Bleu translation metric. Since corpus alignment was a slow and manual process, there existed a lot of pre-translated material which could be used as the Testing Set. The bilingual corpus which was used for the experiment consisted of approximately 3000 sentence pairs. Out of the text that had not been added to the Corpus, a source text consisting of 94 sentences was extracted from the Order Paper of Parliament for Wednesday, August 10, 2005, to be translated for the evaluation.

Using the selected input text, two tests were carried out.
- Test 1: Translation of the file by always selecting the top Sinhala match for the selected input phrase
- Test 2: Translation of the file by allowing the user to select a target phrase out of the top 5 matches produced by inter-language matching

In each test, the translation proceeded by first searching for an exact match for each sentence, and in case an exact match was not found, the sentence was translated at phrase level.

In addition, the user always had to accept the suggestions given by the system, even if the system did not produce a good translation. The user was not allowed to do word/phrase reordering or to use his own translation.

The two Sinhala translations thus produced were evaluated using the Bleu metric with only one reference translation. The reference translation for the input text was prepared from the original translation of the text.

6.4 Evaluation results

The Bleu score for the translation produced by Test 1 using 3-gram analysis is given below:
Precision 1-gram: 0.980392 but used 1.000000 because of smoothing
Precision 2-gram: 0.200000 but used 0.400000 because of smoothing
Precision 3-gram: 0.000000 but used 0.500000 because of smoothing
Weighted Precision: 0.477648
Brevity Penalty: 0.544524
Used "Add one" smoothing
-------------------------
Bleu = 0.260091

The Bleu score for the translation produced by Test 2 using 3-gram analysis is given below:
Precision 1-gram: 0.924242 but used 0.939394 because of smoothing
Precision 2-gram: 0.066667 but used 0.133333 because of smoothing
Precision 3-gram: 0.000000 but used 0.111111 because of smoothing
Weighted Precision: 0.226156
Brevity Penalty: 0.784723
Used "Add one" smoothing
-------------------------
Bleu = 0.177470
The Bleu scores for the evaluation using 1-gram, 2-gram and
3-gram analysis are given in Table 1.

N-gram    Test 1      Test 2
1-gram    0.533847    0.725274
2-gram    0.241119    0.194789
3-gram    0.260091    0.177470

Table 1: Bleu scores for 1-gram, 2-gram and 3-gram analysis

6.5 Discussion of results

Typically, a manual translation gets a Bleu score of 0.4, and
Statistical Machine Translation Systems typically score 0.05 -
0.25. State of the art French-English MT systems have been known
to score 0.25 with 2-4 reference translations.
For English-Sinhala translations BLEU scores of 0.02 - 0.06 have
been obtained, and for Sinhala-Tamil translations BLEU scores of
0.12 - 0.14 have been obtained. Since the number of reference
translations affects the score, the "adjusted" Sinhala-Tamil
score is said to be close to 0.185. [9]
The evaluation results show a higher Bleu score for the result
of Test 2 (when the user can select out of the top 5 matches)
for 1-gram analysis. But in both 2-gram and 3-gram analysis, the
output of Test 1 (when the user has to accept the top match) has
obtained a higher score. The result of Test 1 has also been
penalized more heavily for brevity (a brevity penalty factor of
0.544524 against 0.784723 for Test 2).
When comparing with other translation systems, it should be
noted that the chunking of the source sentence into logical
phrases and the selection of the best translation out of the
suggestions may have been influenced by the user's competence in
translation.
However, the System's ability to suggest acceptable translations
for source language phrases, and the exact matches, may also
have affected the score. How the System would perform when
automatic chunking and alignment are available has yet to be
tested.

7. Conclusion

7.1 Problems encountered

One of the major problems encountered at the initial stages was
the lack of a bilingual Corpus for English-Sinhala. Although
bilingual Corpora existed for the widely used languages,
resources for Machine Translation of Sinhala and Tamil are still
very rare. Therefore much time was initially spent on building
up a sentence-aligned bilingual corpus to be used as the
knowledge base of the system.
The lack of Sinhala documents in electronic form was another
problem encountered in the course of the project. In addition,
both the corresponding English and Sinhala translations were
needed to align the Corpus.
The data required for the alignment of the Corpus were obtained
from the Sri Lanka Parliament and from a magazine published by
the Centre for Policy Alternatives. Also, the alignment had to
be done manually, which was a time consuming task.
In order to cope with the structural differences between the two
languages, a complex alignment algorithm is needed which makes
use of a tagged Corpus and parsing techniques to determine the
parse tree of the source and target language sentences.
The System's inability to automatically break down the input
sentence into logical phrases was another factor that limits the
efficiency of the system. The main problem was that, if the
input phrases were not logical, the suggested translations for
the phrases would also be meaningless.

7.2 Conclusions reached

The use of Example Based Machine Translation for the translation
of government documents proves to be appropriate since such
documents use formal language and follow the same format in most
cases.
In order to increase the probability that a suggested target
string is the translation of the given input string, the System
should increase the number of sentences retrieved from the
Intra-Language Matching process. This could be done by accepting
sentences in which all the input string words occur anywhere in
the sentence, by disregarding function words when matching for
words and by accepting morphological variants of the words when
matching.
Also the best candidate out of the target language phrases can
be given a higher score by disregarding the function words and,
if possible, by giving a penalty for the candidate phrases based
on the difference between the number of words in the candidate
phrase and the input phrase.
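
The two suggestions above (relaxed matching that disregards function words, and a length penalty on candidate phrases) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the System's implementation: the function word list, the helper names content_words, matches_relaxed and length_penalised_score, the sample sentences and the penalty formula 1 / (1 + |difference|) are all hypothetical, and morphological variants are not handled.

```python
# Hypothetical function word list; the System's actual list is not shown here.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for"}


def content_words(text):
    """Lower-case, split on whitespace, and drop function words."""
    return [w for w in text.lower().split() if w not in FUNCTION_WORDS]


def matches_relaxed(input_phrase, sentence):
    """True if every content word of the input occurs anywhere in the sentence."""
    return set(content_words(input_phrase)) <= set(content_words(sentence))


def length_penalised_score(raw_score, candidate_len, input_len):
    """Scale a raw candidate score down as its length departs from the input's.

    The form 1 / (1 + |difference|) is an assumption chosen for simplicity.
    """
    return raw_score / (1 + abs(candidate_len - input_len))


# Made-up English side of a sentence-aligned corpus (no punctuation, so
# whitespace splitting is enough for this sketch).
corpus = [
    "The committee approved the annual budget of the ministry",
    "The budget was presented to the committee",
    "The ministry issued a circular",
]

# Relaxed Intra-Language Matching: word order and function words are ignored.
retrieved = [s for s in corpus if matches_relaxed("budget of the committee", s)]
```

With exact phrase matching none of the three sample sentences would be retrieved for "budget of the committee"; the relaxed test keeps the first two. A candidate whose length equals the input length keeps its raw score, while longer or shorter candidates are scaled down.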

7.3 Future extensions

Suggestions for future extensions for the System are given
below:
- To automate the process of breaking down the input sentence
  into logical phrases
- To introduce an alignment module to automatically align the
  translations of the phrases to form the output sentence
- To incorporate morphological analysis into the matching process
  in order to increase the number of matching sentences in the
  Corpus
- To use a penalty for the phrase length when calculating scores
  for the candidate target language phrases, such that the score
  gets higher when the number of words in the target string
  approximates the number of words in the input string
- To use the System for the translation of other language pairs
  (e.g. Sinhala-Tamil)

Acknowledgments

The authors wish to thank Dr. H. L. Premaratne for reviewing the
paper, and Mr. Dulip Herath of the Language Technology Research
Laboratory (LTRL) and the staff of the LTRL.
Special thanks to Mr. Jagath Gajaweera, Director-Legislative of
the Parliament of Sri Lanka, for the kind cooperation in
obtaining translated material in Sinhala, English and Tamil.

References

[1] Ralf D. Brown, "Example-Based Machine Translation in the
Pangloss System". In Proceedings of the 16th International
Conference on Computational Linguistics (COLING-96), p. 169-174.
Copenhagen, Denmark, August 5-9, 1996.

[2] Nikos Drakos.
http://www.muni.cz/usr/wong/teaching/mt/notes/node24.html.
19/08/2006.

[3] Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu.
Bleu: a method for automatic evaluation of machine translation.
Philadelphia, July 2002. 40th Annual Meeting of the Association
for Computational Linguistics (ACL).

[4] M. Konstantinidis. Example-based machine translation.
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/ralf/pub/WWW/ebmt/ebmt.html,
1999. 24/08/2006.

[5] Chin-Yew Lin, Franz Josef Och. Orange: a method for
evaluating automatic evaluation metrics for machine translation.

[6] Daniel Jurafsky, James Martin. Speech and Language
Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition. Pearson
Education, 2000.

[7] M. Nagao. A framework of a mechanical translation between
Japanese and English by analogy principle, in “Artificial and
Human Intelligence: edited review papers at the International
NATO Symposium on Artificial and Human Intelligence”. Elsevier
Science Publishers, Amsterdam, 1984.

[8] D. Turcato, P. McFetridge, F. Popowich, J. Toole. A unified
example based and lexicalist approach to machine translation.
1999. 8th International Conference on Theoretical and
Methodological Issues in Machine Translation (TMI '99),
http://www.cs.sfu.ca/research/groups/NLL/elib/guide.html.

[9] A.R. Weerasinghe. A statistical machine translation approach
to Sinhala-Tamil language translation. Colombo, 2003.
International Information Technology Conference.

[10] Wikipedia. Machine translation.
http://en.wikipedia.org/wiki/Machine_translation. 18/08/2006

[11] Wikipedia. Translation.
http://en.wikipedia.org/wiki/Translation. 18/08/2006

[12] Wikipedia. Dictionary-based machine translation.
http://en.wikipedia.org/wiki/Dictionarybased_machine_translation.
18/08/2006

[13] Wikipedia. Statistical machine translation.
http://en.wikipedia.org/wiki/Statistical_machine_translation.
18/08/2006

[14] Simon Zwarts. Machine translation evaluation.

[15] D.J. Arnold, Lorna Balkan, Siety Meijer, R. Lee Humphreys
and Louisa Sadler. Machine Translation: an Introductory Guide,
Blackwells-NCC, London, 1994.