Syllabification Algorithm Based On Syllable Rules Matching For Malay Language
Abstract:- In this paper, we present a new syllabification algorithm for the Malay language. Syllabification is the
process of dividing words into syllables. The process is language dependent, since each
language can have its own set of syllable structures. Syllabification is an important component of speech
synthesizers, speech recognition and transliteration systems. Syllabification algorithms have been proposed for
many languages, including English, Spanish, Myanmar, Sinhala and Chinese. Unfortunately, there is not
much information on the evaluation of syllabification schemes for Malay. In this paper, we propose an
efficient algorithm based on syllable rules matching. To evaluate the algorithm, a prototype has been
developed to measure the accuracy of syllabification. We evaluate our method using the Bernama, Kamus Dewan
and Overlap data collections. Syllable rules matching achieved 60.7% accuracy on the BERNAMA collection,
77.4% on the Kamus Dewan collection and 71.6% on the Overlap collection.
1 Introduction
Syllabification is an essential component of a text-to-speech system [1]. For example, the syllable unit
is used to produce natural-sounding and unrestricted speech in a synthesis system for Hindi [2]. In a
unit-selection text-to-speech system, syllables produce acceptable results if they are in the same
context for certain languages. Raghavendra and Yegnanarayana [3] showed that syllable-based synthesizers
produce better-sounding speech than diphone- and phone-based ones. Syllables and other sound units are
also used to exploit acoustic representations of speech for synthesis, together with linguistic analyses
of text, to produce natural-sounding text-to-speech systems.

In other experiments, syllables have been shown to produce promising results in Telugu, Bangla, Romanian
and Turkish text-to-speech systems [2][3][4][5]. In the area of automatic speech recognition,
syllabic-like units have been proposed as a building unit in speech recognition systems, as an
alternative to phoneme units. The idea behind syllabic-like units is that by modeling perceptually more
meaningful units, better modeling of speech can be achieved [7].

1.1 Syllabification Theory
Onset-Rhyme (OR) models of syllable structure were developed by Fudge in 1969. If we look at vowels and
consonants from a phonological perspective, a syllable is often made up of a consonant plus a vowel, or
of a single vowel. This follows the principle of maximal onset and minimal coda. The maximal onset
principle states that the maximum possible number of consonants is attached to a syllable onset [8]. In
Fig.1, the syllable is made up of an onset and a rhyme. Within the rhyme (or core) we find the peak (or
nucleus) and the coda.

[Figure: tree diagram in which Syllable branches into Onset and Rhyme, and Rhyme branches into Peak and Coda]
Fig.1: Syllable Structure

We choose the consonant and vowel (CV) sequence as the phonological unit because almost all languages
have CV or CVC words. Different languages have different syllable structures, such as V, CV, VC, CVC,
etc. A syllable is a unit of speech that consists of a vocalic center (nucleus) surrounded by a
consonantal onset and a consonantal coda,
one or both of which may be empty. If C stands for consonant and V for vowel, the structure of a Malay
syllable is [C]V[C], where the two C's are optional (indicated by square brackets). The consonant C can
be represented by one, two, or three characters. The vowel V can be represented by one or two
characters [10].

In the Malay language, syllable onsets and codas that have more than one consonant are limited to loan
words only, for example the word psikologi, which is a loan word from English. Note also that Malay
syllable structure differs from that of English even for loan words.

1.2 Syllabification Approach
Until now, diverse syllabification approaches have been presented in different papers. Shankar
Ananthakrishnan used a statistical syllabification approach, which applies maximum likelihood estimation
(MLE) and expectation maximization, for the English language [10]; Karin Müller used probabilistic
context-free grammars for multilingual syllabification [5]. Ouellet and Dumouchel introduced heuristic
syllabification using an N-gram statistical method for the English language. Heriberto Cuayáhuitl
developed a Spanish syllabification algorithm based on grammatical rules [11]. George Anton Kiraz and
Bernd Möbius implemented multilingual syllabification using a weighted finite-state model. Other
approaches that deal with syllable structure include that of Maimaitimin Saimaiti and Zhiwei Feng [9],
who applied a rule-based approach that uses the Principle of Maximum Onset for the Uyghur language.
Ruvan Weerasinghe, Asanka Wasala and Kumudu Gamage [11] described a rule-based syllabification algorithm
for Sinhala after studying the syllable structure and linguistic rules for syllabification of Sinhala
words. Yousif A. El-Imam and Zuraidah Md. Don [12] proposed a syllabification algorithm based on the
Maximum Onset principle for a text-to-speech system for Arabic and Malay. Another algorithm was proposed
by Nur-Hana Samsudin and Tang Enya Kong [13, 19]. Their system used four syllable structure clusters,
(CV), (VC), (CVC) and (V), with a few sub-models proposed for exceptions such as loan-word
pronunciation, as in Bali Ranaivo-Malançon [14, 15]. Bali Ranaivo-Malançon proposed a structure for a
Malay pronunciation dictionary and a set of phonological rules for the Malay language.

The methods can in essence be divided into three broad techniques: dictionary based, rule based and
corpus (or statistical) based. These three approaches have their advantages and disadvantages. In the
simplest approach, an empty space is usually used in the phonetic transcription to separate syllables.
Carnegie Mellon University produced a pronunciation dictionary for North American English that contains
over 125,000 words [9]. Pronunciation dictionaries provide accurate syllabification, but adding new
words is a tedious process. Rule-based syllabification usually applies universal principles, e.g. the
sonority sequencing principle and the Maximal Onset principle, or uses template matching and
resyllabification [10]. Statistical approaches, on the other hand, use machine learning techniques, for
example supervised and unsupervised learning, to predict the probability distribution over a parameter
set. Statistical techniques require a large quantity of training data before the predictions give
meaningful results.

In this paper, we use a rule-based approach for Malay syllabification, since no training data are
available. In section 2, we study the morphology and phonology of Malay words. Section 3 explains how we
collected the Malay wordlist and corpus. The syllabification algorithm is elaborated in section 4.
Section 5 describes how we evaluate and observe our algorithm. Finally, we conclude our work in
section 6.

2 Malay Word
The Malay language, or bahasa Melayu, is an official language of Malaysia, Brunei and Singapore. Malay
belongs to the Austronesian (Malayo-Polynesian) language family. Together with the Indonesian language
(the Indonesian form of Malay), the Malay language is spoken by over 300 million people worldwide. Malay
has two writing systems, i.e. Rumi (Latin alphabet) and Jawi (Arabic script). However, Malay is normally
written in Rumi. This paper discusses how to develop a rule-based syllabification algorithm that is
based on Malay phonology and morphology.

2.1 Malay Phonology
The sound system of the Malay language is fairly simple. The phonemes of most words can be determined
from the graphemes. The Malay language has 25 consonants:

/[b],[d],[dʒ],[f],[ɡ],[h],[j],[k],[l],[m],[n],[ŋ],[ɲ],[p],[r],[s],[ʃ],[t],[tʃ],[v],[w],[x],[z],[ʔ]/,

6 vowels

/[a],[e],[i],[o],[u],[ə]/
and 4 diphthongs

/[au],[ai],[oi],[ua]/.

2.2 Malay Morphology
Unlike many languages, Malay is an agglutinative language, meaning that new words are generally formed
by adding a prefix, suffix, circumfix or infix to a root word, as shown in Fig.2.

[Figure: diagram relating circumfix, prefix, infix and suffix to the root]
Fig.2: Malay Affix Word Structure

2.3 Malay Syllable Structure
According to the onset-rhyme model, the general structure of a Malay syllable consists of Onset, Rhyme,
Peak and Coda. These syllable structures are displayed as a tree diagram. For example, the word tenteram
is presented in Fig.3.

[Figure: tree diagram for the word tenteram (peace)]

Table 1
8 Syllables or More Possible Combination

Syllable Structure    Example
V                     Ayam (chicken)
VC                    Anda (you)
CV                    Batu (stone)
CVC                   Hantu (ghost)
CVCC                  Insurans
CCV                   Infrastruktur (infrastructure)
CCVC                  Praktikal (practical)
CCCV                  Strategi (strategy)
CCCVC                 Struktur (structure)

The pattern of Malay syllables is straightforward and is generated by combining the syllable structures
above. We use these seven syllable structures as the base to generate rules for our template-matching
syllable structure. From Table 1, the maximum example of Malay syllable structure is the word struktur
(structure).
MSS=(C)3V(C)
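As a minimal sketch of the consonant/vowel view described above, the following Python snippet maps a word to its CV skeleton and checks whether a candidate syllable has one of the structures listed in Table 1. The names `cv_skeleton`, `is_listed_structure` and `TABLE_1_STRUCTURES` are hypothetical, and the snippet assumes that each letter is exactly one C or V and that only a, e, i, o, u count as vowel letters, so multi-character consonants and vowels are not handled.

```python
# Illustrative sketch only: one letter = one C or V (an assumption; digraphs
# such as "ng" or "ny" are not merged into a single consonant here).
VOWEL_LETTERS = set("aeiou")

# Syllable structures as listed in Table 1.
TABLE_1_STRUCTURES = {
    "V", "VC", "CV", "CVC", "CVCC", "CCV", "CCVC", "CCCV", "CCCVC",
}


def cv_skeleton(text):
    """Return the C/V skeleton of a word or syllable, ignoring non-letters."""
    return "".join(
        "V" if ch in VOWEL_LETTERS else "C"
        for ch in text.lower()
        if ch.isalpha()
    )


def is_listed_structure(syllable):
    """True if the syllable's skeleton is one of the Table 1 structures."""
    return cv_skeleton(syllable) in TABLE_1_STRUCTURES


if __name__ == "__main__":
    for word in ("batu", "hantu", "struktur"):
        print(word, cv_skeleton(word))
    print(is_listed_structure("struk"), is_listed_structure("tur"))
```

Under these assumptions the two syllables of struktur, struk and tur, have the skeletons CCCVC and CVC, both of which appear in Table 1.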
Types                         No. of files   Source                    No. of words
Online Newspaper              4155           Bernama                   18,631
Electronic Dictionary         1              Kamus Dewan Edisi ke-4    67,233
Combination of both sources   1              Overlap                   79,795
Total                                                                  165,659
[Figure: flowchart with the stages Convert Word to CVC, Syllable Rules Matching (using the Syllable Rules), Syllable Extraction and Syllable Dictionary Format]
Fig.5: Syllabification Process

4.1 Syllable Rules Format
In our formulation of a syllable rules description for automatically extracting rules that improve
syllabification accuracy, the user is asked to align the syllable CVC structure with its variable
syllable lengths. The system will try to induce the correct rule for syllable rules matching. The
accuracy of syllabification depends on the number of syllable rules. We collected 188 constructed
syllable rules in this experiment. Table 4 shows the first 10 entries of the syllable rules format used
to match the correct syllable pattern.

Table 4
The First 10 Entries of Syllable Rules Format

Syllable Rules
101 → 1, 2
010 → 3
100 → 1, 1, 1
0110 → 2, 2
1010 → 1, 3
0101 → 2, 2
0010 → 4
0100 → 4
110101 → 1, 3
01001 → 3, 2

sub syllabification_longgest_match {
    Load the set of syllable rules from syllable-file
    Tokenize sentences into words separated by space
    while (space character is found) do
        for i = 1 to length of word
            for each letter
                if consonant letter, convert into 1
                else convert into 0
                end if
            end for
            if not found in syllabification matching rules
                .. add
            else
                match syllable from rules file
            end if
        end for
    end while
    Print syllabified string
}

Sub Syllabification_template_match {
    Load the set of syllable rules from syllable-file
    Tokenize sentences into words separated by space
    while (space character is found) do
        for i = 1 to length of word
            for each letter
                if consonant letter, convert into 1
                else convert into 0
                end if
            end for
            if not found in syllabification matching rules
                .. add
            else
                match syllable from rules file
            end if
        end for
    end while
    Print syllabified string
}

Finally, after matching the syllable rules pattern, the correct and incorrect syllabifications of the
words used in our prototype are produced as output in plain text files in the following format:

akad a-kad
akademi a-ka-de-mi
akah a-kah
akaid a-ka-id
akak a-kak
akal a-kal
akalkan a-kal-kan
akan a-kan
akang a-kang
akanlah a-kan-lah
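The rule lookup described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the prototype itself: the names `encode`, `syllabify` and `RULES` are hypothetical, the whole word is looked up as a single pattern, and vowels are encoded as 1 and consonants as 0. That mapping is an assumption chosen because it makes entries such as 1010 → 1, 3 reproduce the listed output a-kad, whereas the pseudocode describes the opposite mapping. The entry 110101 → 1, 3 is left out because its lengths do not cover the whole pattern.

```python
# Illustrative sketch: whole-word lookup of a binary pattern in a rule table.
# ASSUMPTION: vowel letter -> "1", any other letter -> "0".
VOWEL_LETTERS = set("aeiou")

# Entries of Table 4, read as: pattern -> lengths of consecutive syllables.
RULES = {
    "101": (1, 2),
    "010": (3,),
    "100": (1, 1, 1),
    "0110": (2, 2),
    "1010": (1, 3),
    "0101": (2, 2),
    "0010": (4,),
    "0100": (4,),
    "01001": (3, 2),
}


def encode(word):
    """Convert a word into its binary vowel/consonant pattern."""
    return "".join("1" if ch in VOWEL_LETTERS else "0" for ch in word.lower())


def syllabify(word, rules=RULES):
    """Split a word using the rule for its pattern; None if no rule applies."""
    lengths = rules.get(encode(word))
    if lengths is None or sum(lengths) != len(word):
        return None
    parts, start = [], 0
    for n in lengths:
        parts.append(word[start:start + n])
        start += n
    return "-".join(parts)


if __name__ == "__main__":
    for w in ("akad", "akal", "batu", "hantu", "akaid"):
        print(w, syllabify(w))
```

With only these entries, a word such as akaid (pattern 10110) has no matching rule and is reported as unmatched, which mirrors the "not found in syllabification matching rules" branch of the pseudocode.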
Table 5
Syllabification Results (Percentage Correct) based
on Kamus Dewan and BERNAMA
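A percentage-correct figure of this kind can be computed as in the following sketch. It assumes word-level exact matching against a reference syllabification, which is one plausible reading of "percentage correct" and is not spelled out in this section; the function name `percent_correct` and the toy data are hypothetical.

```python
# Illustrative sketch: word-level accuracy, assuming a word counts as correct
# only when its predicted syllabification equals the reference exactly.
def percent_correct(predicted, reference):
    """Return the percentage of reference words syllabified correctly."""
    if not reference:
        return 0.0
    correct = sum(
        1 for word, gold in reference.items() if predicted.get(word) == gold
    )
    return 100.0 * correct / len(reference)


if __name__ == "__main__":
    # Toy data for illustration only, not taken from the evaluation.
    reference = {"akad": "a-kad", "akan": "a-kan", "batu": "ba-tu", "akaid": "a-ka-id"}
    predicted = {"akad": "a-kad", "akan": "a-kan", "batu": "bat-u"}
    print(percent_correct(predicted, reference))
```

Words with no prediction at all are counted as incorrect in this sketch.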
6 Conclusion
In this paper, a syllable rules matching approach to the
syllabification process for the Malay language was
presented. Even though there are not many