Syllabification Algorithm Based On Syllable Rules Matching For Malay Language


Recent Researches in Applied Computer and Applied Computational Science

Syllabification Algorithm based on Syllable Rules Matching for Malay Language

HAFIZ MUSA, RABIAH A.KADIR, AZREEN AZMAN, M.TAUFIK ABDULLAH


Department of Multimedia, Faculty of Computer Science and Information Technology
Universiti Putra Malaysia
43400 UPM Serdang. Selangor Darul Ehsan, Malaysia
hafizmusa@gmail.com, {rabiah, azreen, taufik}@fsktm.upm.edu.my

Abstract:- In this paper, we present a new syllabification algorithm for the Malay language. Syllabification is the
process of extracting or dividing syllables from words. The syllabification process is language dependent, since each
language can have its own set of syllable structures. Syllabification is an important component in speech
synthesizers, speech recognition and transliteration systems. Syllabification algorithms have been proposed for
many languages, including English, Spanish, Myanmar, Sinhala and Chinese. Unfortunately, there is not
much information regarding the evaluation of syllabification schemes for Malay. In this paper, we propose an
efficient algorithm based on syllable rules matching. In order to evaluate the algorithm, a prototype has been
developed to measure the accuracy of syllabification. We evaluate our method using the BERNAMA, Kamus Dewan
and Overlap data collections. Syllable rules matching achieved 60.7% accuracy on the BERNAMA collection,
77.4% on the Kamus Dewan collection and 71.6% on the Overlap collection.

Key-words: - Syllabification, Text-to-Speech, Syllable Matching, Speech Synthesizer, Elicitation

1 Introduction
Syllabification is an essential component of a text-to-speech system [1]. For example, syllable units are used
to produce a natural-sounding, unrestricted speech synthesis system for Hindi [2]. In unit-selection
text-to-speech systems, syllables produce acceptable results if they appear in the same context for certain
languages. Raghavendra and Yegnanarayana [3] showed that syllable-based synthesizers produce better-sounding
speech than diphone- and phone-based ones. Syllables and other sound units are also used to exploit acoustic
representations of speech for synthesis, together with linguistic analyses of text, to produce
natural-sounding text-to-speech systems.

In other experiments, syllables have shown promising results in Telugu, Bangla, Romanian and Turkish
text-to-speech systems [2][3][4][5]. In the area of automatic speech recognition, syllable-like units have
been proposed as a building unit for speech recognition systems, as an alternative to phoneme units. The
rationale for syllable-like units is that by modeling perceptually more meaningful units, better modeling
of speech can be achieved [7].

1.1 Syllabification Theory
Onset-Rhyme (OR) models of syllable structure were developed by Fudge in 1969. If we look at vowels and
consonants from a phonological perspective, a syllable is often made up of a consonant plus a vowel, or a
single vowel. This follows the principle of maximal onset and minimal coda. The maximal onset principle states
that the maximum possible number of consonants should attach to a syllable onset [8]. In Fig.1, the syllable
is made up of an onset and a rhyme. Within the rhyme (or core) we find the peak (or nucleus) and the coda.

Fig.1: Syllable Structure (tree diagram: Syllable branches into Onset and Rhyme; Rhyme branches into Peak and Coda)

We choose the consonant and vowel (CV) sequence as the phonological unit because almost all languages
have CV or CVC words. Different languages have different syllable structures, such as V, CV, VC,
CVC, etc. A syllable is a unit of speech that consists of a vocalic center (nucleus) surrounded
by a consonantal onset and a consonantal coda,

ISBN: 978-960-474-281-3 279



one or both of which may be empty. If C stands for consonant and V for vowel, the structure of the
Malay syllable is [C]V[C], where the two C's are optional (indicated by square brackets). The
consonant C can be represented by one, two, or three characters. The vowel V can be represented
by one or two characters [10].

In Malay, syllable onsets and codas with more than one consonant are limited to loan words only,
for example the word psikologi, which is a loan word from English. Note also that Malay syllable
structure differs from English even for loan words.

1.2 Syllabification Approach
Diverse syllabification approaches have been presented in the literature. Shankar Ananthakrishnan used
a statistical syllabification approach, which applies maximum likelihood estimation (MLE) and expectation
maximization, for English [10]; Karin Müller used probabilistic context-free grammars for multilingual
syllabification [5]. Ouellet and Dumouchel introduced heuristic syllabification using an N-gram
statistical method for English. Heriberto Cuayáhuitl developed a Spanish syllabification algorithm based
on grammatical rules [11]. George Anton Kiraz and Bernd Möbius implemented multilingual syllabification
using a weighted finite-state model. Among other approaches that deal with syllable structure,
Maimaitimin Saimaiti and Zhiwei Feng [9] applied a rule-based approach that uses the Principle of
Maximum Onset for the Uyghur language. Ruvan Weerasinghe, Asanka Wasala, and Kumudu Gamage [11]
described a rule-based syllabification algorithm for Sinhala after studying the syllable structure and
linguistic rules for syllabification of Sinhala words. Yousif A. El-Imam and Zuraidah Md. Don [12]
proposed a syllabification algorithm based on the Maximum Onset principle for a text-to-speech system
for Arabic and Malay. Another algorithm was proposed by Nur-Hana Samsudin and Tang Enya Kong [13, 19].
Their system used four syllable structures, (CV), (VC), (CVC) and (V), with a few sub-models proposed
for exceptions such as loan-word pronunciation, as in Bali Ranaivo-Malançon [14, 15].
Bali Ranaivo-Malançon proposed a structure for a Malay pronunciation dictionary and a set of
phonological rules for Malay.

The methods can in essence be divided into three broad techniques: dictionary-based, rule-based
and corpus-based (or statistical). Each of the three approaches has its advantages and
disadvantages. In the simplest approach, an empty space is usually used in phonetic transcription to
separate syllables. Carnegie Mellon University produced a pronunciation dictionary for North
American English that contains over 125,000 words [9]. Pronunciation dictionaries provide
accurate syllabification, but adding new words is a tedious process. Rule-based syllabification
usually applies universal principles, e.g. the sonority sequencing principle and the Maximal Onset
principle, or uses template matching and resyllabification [10]. Statistical approaches, on the other
hand, use machine learning techniques, for example supervised and unsupervised learning, to predict the
probability distribution over a parameter set. Statistical techniques require a large quantity of
training data before the predictions give meaningful results.

In this paper, we use a rule-based approach for Malay syllabification since no training data are
available. In section 2, we study the morphology and phonology of Malay words. Section 3 explains how
we collected the Malay wordlist and corpus. The syllabification algorithm is elaborated in section 4.
Section 5 describes how we evaluate and observe our algorithm. Finally, we conclude our work in section 6.

2 Malay Word
Malay, or bahasa Melayu, is an official language of Malaysia, Brunei and Singapore. Malay
belongs to the Austronesian (Malayo-Polynesian) language family. Together with Indonesian
(the Indonesian form of Malay), Malay is spoken by over 300 million people worldwide.
Malay has two writing systems, i.e. Rumi (Latin alphabet) and Jawi (Arabic script).
However, Malay is normally written in Rumi. This paper discusses how to develop a rule-based
syllabification algorithm based on phonological and morphological principles.

2.1 Malay Phonology
The sound system of Malay is fairly simple. The phonemes of most words can be
determined from the graphemes. Malay has 25 consonants:

/[b],[d],[dʒ],[f],[ɡ],[h],[j],[k],[l],[m],[n],[ŋ],[ɲ],[p],[r],[s],[ʃ],[t],[tʃ],[v],[w],[x],[z],[ʔ]/,

6 vowels


/[a],[e],[i],[o],[u],[ə]/

and 4 diphthongs

/[au],[ai],[oi],[ua]/.

2.2 Malay Morphology
Unlike many languages, Malay is an agglutinative language, meaning that new words are generally
formed by adding a prefix, suffix, circumfix or infix to a root word, as shown in Fig.2.

Fig.2: Malay Affix Word Structure (diagram labels: circumfix, prefix, root, infix, suffix)

2.3 Malay Syllable Structure
According to the onset-rhyme model, the general structure of the Malay syllable consists of onset,
rhyme, peak and coda. These syllable structures are displayed as a tree diagram. For example, the word
tenteram (peace) is presented in Fig.3.

Fig.3: Malay Syllable Structure for Word tenteram (peace) (tree diagram: the initial syllable t-e-n has onset, nucleus and coda; the medial syllable t-e has onset and nucleus; the final syllable r-a-m has onset, nucleus and coda)

In the first syllable of the Malay word tenteram (peace), the nucleus is e, the onset is t, the coda
is n, and the rhyme is en. This syllable can be abstracted as a consonant-vowel-consonant syllable,
abbreviated CVC. There are several syllable structures in Malay, such as CV, VC, CVC, CCVC, CVCC,
CCCV and CCCVC, where C stands for consonant and V for vowel. Table 1 shows the 8 or more possible
combinations for Malay syllable structure.

Table 1
8 Syllables or More Possible Combination

Syllable Structure   Example
V                    Ayam (chicken)
VC                   Anda (you)
CV                   Batu (stone)
CVC                  Hantu (ghost)
CVCC                 Insurans
CCV                  Infrastruktur (infrastructure)
CCVC                 Praktikal (practical)
CCCV                 Strategi (strategy)
CCCVC                Struktur (structure)

The pattern of Malay syllables is straightforward and is generated by combining the syllable structures
above. We use these seven syllable structures as our base to generate rules for our template-matching
syllable structure. From Table 1, the maximal example of Malay syllable structure is the word struktur
(structure).

MSS = (C)^3 V (C)

In Malay, the actual spoken syllables are also the basis of syllabification in graphemes. Our
syllabification is based on etymological (i.e. morphological) rather than phonetic principles.
Table 2 presents an example of Malay syllabification for the complete sentence 'Saya suka makan
nasi ayam yang sedap'.

Table 2
Example of Malay Syllabification

Sentence             Saya suka makan nasi ayam yang sedap.
Syllabification      Sa-ya # su-ka # ma-kan # na-si # a-yam # yang # se-dap
Syllable Structure   cv-cv # cv-cv # cv-cvc # cv-cv # v-cvc # cvc # cv-cvc
Translation          I like to eat chicken rice that is delicious.
where - marks syllable boundaries and # marks word boundaries

The frequency of Malay syllable structure is given


in Fig.4, which was obtained from the Malay text prepared in research in this domain.

Fig.4: Number of Syllable in Malay

Most Malay words are disyllabic, for example makan (eat) → CV+CVC. Trisyllabic words are also
common, for example siapa (who) → CV+V+CV. Only 5% of words have four or more syllables, for
example matahari (sun) → CV+CV+CV+CV. The longest word in Malay has twelve syllables:
ketidakberperikemanusiaan (not having humanitarian feelings) →
CV+CV+CVC+CVC+CV+CV+CV+CV+CV+CV+V+VC. We need well-defined syllable rules for
Malay to provide correct syllabification of Malay words. The rules are defined at the syllable level
and the word level.

3 Text Corpus Acquisition
In this study we used the Kamus Dewan (Malay for "The Institute Dictionary") and BERNAMA
(Malaysian National News Agency) corpora to evaluate our parser. Kamus Dewan
(http://prpm.dbp.gov.my) is a Malay language dictionary published by Dewan Bahasa dan
Pustaka and is the most authoritative Malay-to-Malay dictionary. We crawled the website of the
Malaysian National News Agency, BERNAMA (http://www.bernama.com), to collect text; it
provides Malaysian news including politics, economy and sport from newspaper articles. A third data
collection (hereafter Overlap) was derived from both data collections. The different types
of Malay words are presented in Table 3.

Table 3
Malay Corpus Collection

Types                        No. of files   Source                   No. of words
Online Newspaper             4155           Bernama                  18,631
Electronic Dictionary        1              Kamus Dewan Edisi ke-4   67,233
Combination of both source   1              Overlap                  79,795
Total                                                                165,659

3.1 Syllable Dictionary
In addition to the data collection, we require a dictionary with syllable-unit transcriptions. Each
word is manually segmented into its syllables. The syllable dictionary is used as our baseline
syllabification.

4 Syllabification Process
A syllable parser is a program that takes an input text and cuts it into segments called syllables.
Syllabification is language-dependent, since each language has its own syllable structure. Our
approach is divided into four phases. First, the parser reads the text document; it then
normalizes the text and removes punctuation. Each normalized word is then converted into a binary
pattern, with 1 for vowels and 0 for consonants. The parser then tries to match the converted CVC
pattern against the syllable pattern rules.

In our syllable parser, the users are asked to extract the syllables and align them. For example,
the word pandai (smart) is converted to 010011 and aligned to 1,3 (010011 → 1,3); this rule is
stored in the syllable rules so that any word with the same structure as pandai (smart) is
automatically extracted using the rule (010011 → 1,3). Finally, the syllable parser prints the output
for the text document in syllable dictionary format. The process of syllabification is shown in Fig.5.
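
The normalization and conversion phases of Section 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `normalize` and `to_pattern` are hypothetical, and the vowel set is assumed to be the five Rumi vowel letters a, e, i, o, u. Stripping every character in `string.punctuation` is also an assumption; it removes hyphens too, which may not be desirable for hyphenated Malay words.

```python
import string

# Assumption: the five Rumi vowel letters; every other letter counts as a consonant.
VOWELS = set("aeiou")


def normalize(text):
    """Lowercase the text, remove ASCII punctuation and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def to_pattern(word):
    """Convert a word into the binary pattern: 1 for a vowel, 0 for a consonant."""
    return "".join("1" if ch in VOWELS else "0" for ch in word)


if __name__ == "__main__":
    for word in normalize("Saya suka makan nasi ayam yang sedap."):
        print(word, to_pattern(word))
```

With these assumptions, pandai is converted to 010011, which agrees with the example given in the text.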


Fig.5: Syllabification Process (flow diagram: Text Document → Normalization → Convert Word to CVC → Syllable Rules Matching, which uses the Syllable Rules → Syllable Extraction → Syllable Dictionary Format)

4.1 Syllable Rules Format
In our formulation of a syllable rules description to automatically extract rules that improve
syllabification accuracy, the user is asked to align the syllable CVC structure to its variable
syllable lengths. The system then tries to induce the correct rule for syllable rules matching. The
accuracy of syllabification depends on the number of syllable rules. We collected 188 constructed
syllable rules in this experiment. Table 4 shows the first 10 entries of the syllable rules format
used to match the correct syllable pattern.

Table 4
The 10 First Entries of Syllable Rules Format

Syllable Rules
101 → 1, 2
010 → 3
100 → 1, 1, 1
0110 → 2, 2
1010 → 1, 3
0101 → 2, 2
0010 → 4
0100 → 4
110101 → 1, 3
01001 → 3, 2

4.2 Syllable Rules Matching
The syllable rules matching process is part of syllabification. The following is the pseudo code for
syllable rules matching.

sub syllabification_longest_match {
    Load the set of syllable rules from syllable-file
    Tokenize sentences into words separated by space
    while (char space is found) do
        for i = 1 to length of word
            for-each letter
                if vowel letter convert into 1 else convert into 0
            end for
            if not found in syllabification matching rules
                add the pattern to the rules
            else
                match syllable from rules file
            end if
        end for
    end while
    Print syllabified string
}

sub syllabification_template_match {
    Load the set of syllable rules from syllable-file
    Tokenize sentences into words separated by space
    while (char space is found) do
        for i = 1 to length of word
            for-each letter
                if vowel letter convert into 1 else convert into 0
            end for
            if not found in syllabification matching rules
                add the pattern to the rules
            else
                match syllable from rules file
            end if
        end for
    end while
    Print syllabified string
}

Finally, after matching the syllable rule patterns, the correct and incorrect syllabifications of the
words used in our prototype are produced as output in a plain text file in the following format:

akad a-kad
akademi a-ka-de-mi
akah a-kah
akaid a-ka-id
akak a-kak
akal a-kal
akalkan a-kal-kan
akan a-kan
akang a-kang
akanlah a-kan-lah
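
As an illustration of rule lookup and of the output format above, the following Python sketch matches a word's binary pattern against a small rule table. It rests on several assumptions that the paper does not state explicitly: the numbers on the right-hand side of a rule are read as syllable lengths in letters, a rule applies only when the whole word pattern matches exactly, and only a handful of Table 4 entries whose lengths sum to the pattern length are included. The names `RULES`, `syllabify` and `dictionary_line` are hypothetical.

```python
# Assumption: the five Rumi vowel letters; every other letter counts as a consonant.
VOWELS = set("aeiou")

# Assumption: a few entries in the style of Table 4, with the right-hand side
# interpreted as syllable lengths. The full set of 188 rules is not reproduced here.
RULES = {
    "101": [1, 2],
    "010": [3],
    "0110": [2, 2],
    "1010": [1, 3],
    "0101": [2, 2],
    "01001": [3, 2],
}


def to_pattern(word):
    """Convert a word into the binary pattern: 1 for a vowel, 0 for a consonant."""
    return "".join("1" if ch in VOWELS else "0" for ch in word)


def syllabify(word, rules=RULES):
    """Return the list of syllables, or None when no rule matches the pattern."""
    lengths = rules.get(to_pattern(word))
    if lengths is None or sum(lengths) != len(word):
        return None  # unknown pattern: a new rule would have to be added by hand
    syllables, start = [], 0
    for n in lengths:
        syllables.append(word[start:start + n])
        start += n
    return syllables


def dictionary_line(word, rules=RULES):
    """Format one line of the syllable dictionary, e.g. 'akad a-kad'."""
    syllables = syllabify(word, rules)
    return None if syllables is None else f"{word} {'-'.join(syllables)}"


if __name__ == "__main__":
    for w in ["akad", "akal", "suka", "akademi"]:
        print(dictionary_line(w) or f"{w} (no matching rule)")
```

In this sketch a word such as akademi, whose pattern has no entry in the reduced table, is reported as unmatched; in the process described above, that is the point at which a new rule would be added.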


5 Evaluation and Observation
There is no validated reference for Malay syllabification algorithms in the light of maximal
onset theory. Our experiment used longest matching as a benchmark to determine the correctness of
the algorithm. Evaluation of a syllabification algorithm is concerned with the correctness of the
algorithm. We implemented our experiment in two independent phases. In the first phase, we compared
three approaches against the specially collected syllable dictionary discussed in section 3. In the
second phase, we analyzed how our algorithm performs on larger datasets (Kamus Dewan, BERNAMA and the
Overlap data).

5.1 Comparison with a Syllable Dictionary
We propose a method of automatic syllabification of Malay words and compare the syllable extraction
for each word across seven types of syllable dictionaries. These syllable dictionaries contain selected
word collections: Vowel, Diphthong, Vowel Combinations, Consonant, Cluster Consonant, Derive
Words, Loan Words and Schwa rules. Fig.7 shows the recognition rate on the syllable dictionaries for
the longest matching and syllable rules matching approaches.

Fig.7: Recognition Rate using Syllable Dictionary

The results of syllabification based on syllable rules matching and longest matching are shown in
Table 4.

Table 4
Comparison of Syllabification Process of Syllable Rules Matching and Longest Matching

Syllable Dictionary   Longest Matching   Syllable Rules Matching   Improvement Percentage
Vowels                90.5               87.2                      -3.3
Diphthong             99.7               76.9                      -22.8
Vowel Combination     80.0               89.6                      9.6
Consonant             90.2               94.8                      4.6
Cluster Consonant     95.3               62.5                      -32.8
Derive Words          100.00             100.00                    0
Schwa Rules           100.00             100.00                    0

There is an improvement on the vowel combination and consonant syllable dictionaries when using the
syllable rules matching approach, of 9.6% and 4.6% respectively. The other syllable dictionaries show
either a decrement or the same percentage.

Further analysis was then conducted on the results of syllable rules matching to examine the influence
of a larger, random set of words on this technique. This dataset contains more data and a random
wordlist collected from the web. The data were manually collected from an online newspaper, an
electronic dictionary and the overlap of these two sources. We evaluated and observed how the technique
performs on each data set. The results of this analysis are listed in Table 5.

Table 5
Syllabification Results (Percentage Correct) based on Kamus Dewan and BERNAMA

Syllable Rules Matching
Collection (no. of words)   Recognized Words   Accuracy
Bernama (18,631)            11,303             74.96
Kamus Dewan (67,233)        52,020             76.68
Overlap (79,795)            57,155             75.22
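
The arithmetic behind the two tables can be checked with a short Python sketch. The figures are copied from the tables; the helper names `improvement` and `recognized_share` are hypothetical, and reading the improvement column as the rules-matching rate minus the longest-matching rate is an assumption, although it reproduces every row. The paper does not say how the Accuracy column of Table 5 was computed, so it is not reproduced here; the share of recognized words, however, works out to 60.7, 77.4 and 71.6 percent, the same figures quoted in the abstract.

```python
# (longest matching %, syllable rules matching %) per syllable dictionary.
COMPARISON = {
    "Vowels": (90.5, 87.2),
    "Diphthong": (99.7, 76.9),
    "Vowel Combination": (80.0, 89.6),
    "Consonant": (90.2, 94.8),
    "Cluster Consonant": (95.3, 62.5),
    "Derive Words": (100.0, 100.0),
    "Schwa Rules": (100.0, 100.0),
}

# (recognized words, total words) per collection, from Table 5.
COLLECTIONS = {
    "Bernama": (11303, 18631),
    "Kamus Dewan": (52020, 67233),
    "Overlap": (57155, 79795),
}


def improvement(longest, rules):
    """Difference in percentage points: rules matching minus longest matching."""
    return round(rules - longest, 1)


def recognized_share(recognized, total):
    """Percentage of words in a collection for which a rule was found."""
    return round(100.0 * recognized / total, 1)


if __name__ == "__main__":
    for name, (longest, rules) in COMPARISON.items():
        print(f"{name}: {improvement(longest, rules):+.1f}")
    for name, (recognized, total) in COLLECTIONS.items():
        print(f"{name}: {recognized_share(recognized, total)}% recognized")
```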

6 Conclusion
In this paper, syllable rules matching for the syllabification of Malay was presented. Even though
there is not much information regarding the evaluation of syllabification schemes for Malay, we found
that syllabification algorithms have been used for other languages. We decided to implement the
syllabification algorithm for Malay, and the results show that it improved the recognition rate on the
vowel combination and consonant syllable dictionaries. It also achieved above 70% word recognition.

For further research, we will look at the elicitation approach to improve the syllabification
process and the intelligibility of pronunciation for Malay.

References
[1] Gelbukh, Alexander and Heriberto Cuayáhuitl. A Syllabification Algorithm for Spanish. Heidelberg: Springer Berlin, 2004.
[2] Mahar, J.A.; Memon, G.Q.; Shah, S.H.A., WordNet Based Sindhi Text to Speech Synthesis System, In Proceeding of Second Conference on Computer Research and Development 2010, pp. 20-24, 7-10 May 2010.
[3] Raghavendra, E.V.; Yegnanarayana, B.; Prahallad, K., Speech synthesis using approximate matching of syllables, Spoken Language Technology Workshop, 2008 (SLT 2008), pp. 37-40, 15-19 Dec. 2008.
[4] Islam, M.R.; Saha, R.S.; Hossain, A.R., Automatic Reading from Bangla PDF Document Using Rule Based Concatenative Synthesis, 2009 International Conference on Signal Processing Systems, pp. 521-525, 15-17 May 2009.
[5] Karin Müller. Probabilistic Context-Free Grammars for Phonology, Morphological and Phonological Learning. In proceedings of the 6th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON), Philadelphia, pp. 70–80, July 2002.
[6] Ouellet, P., Dumouchel, P. Heuristic Syllabification and Statistical Syllable-Based Modeling for Speech-Input Topic Identification, Workshop on Grammar and NLP. Montreal, Quebec, Canada, 13–14 October 2001.
[7] G. Anton Kiraz, B. Möbius, Multilingual Syllabification Using Weighted Finite-State Transducers, In Proc. 3rd ESCA Workshop on Speech Synthesis, pp. 59-64, 1998.
[8] Ruvan Weerasinghe, Asanka Wasala and Kumudu Gamage, A Rule Based Syllabification Algorithm for Sinhala, Lecture Notes in Computer Science, Volume 3651, Natural Language Processing (IJCNLP 2005), pp. 438-449, 2005.
[9] Maimaitimin Saimaiti, Zhiwei Feng, A Syllabification Algorithm and Syllable Statistics of Written Uyghur, http://corpus.bham.ac.uk/corplingproceedings07/pap
[10] Zhihong Hu; Schalkwyk, J.; Barnard, E.; Cole, R., Speech recognition using syllable-like units, In Proceeding of Fourth International Conference on Spoken Language 1996 (ICSLP96), vol. 2, pp. 1117-1120, 3-6 Oct 1996.
[11] Ruvan Weerasinghe, Asanka Wasala and Kumudu Gamage, A Rule Based Syllabification Algorithm for Sinhala, Lecture Notes in Computer Science of Natural Language Processing (IJCNLP 2005), Volume 3651/2005, pp. 438-449, 2005.
[12] Y.A. El-Imam and Z.M. Don. Text-to-Speech Conversion of Standard Malay. International Journal of Speech Technology 3, Kluwer Academic Publishers, pp. 129-146, 2000.
[13] N.H. Samsudin and T.E. Kong. A Simple Malay Speech Synthesizer Using Syllable Concatenation Approach. In proceeding of MMU International Symposium on Information and Communications Technologies 2004 (M2USIC 2004).
[14] B. Ranaivo-Malançon, Computational Analysis of Affixed Words in Malay Language, In Proceeding of International Symposium on Malay/Indonesian Linguistics, Penang, 2004.
[15] Ranaivo-Malançon Bali. Malay lexical analysis through corpus-based approach, In Proceeding of Persidangan Antarabangsa Leksikologi dan Leksikografi Melayu


(PALMA 2005), Kuala Lumpur, Malaysia, 2005.
[17] Tan Tien Ping, Li HaiZhou, Tang Enya Kong, Xiao Xiong, et al., MASS: A Malay Language LVCSR Corpus Resource. O-Cocosda'09. Beijing, 2009.
[18] T.T. Swee. The Design and Verification of Malay Text To Speech Synthesis System. Master Thesis, Dept. of Engineering (Electrical), University Technology Malaysia.
[19] Kamus Dewan. Dewan Bahasa dan Pustaka, Kuala Lumpur, Malaysia, 2004.
[20] Samsudin, N.H., Tang, E.K., & Chuah, K. Adjacency Analysis for Designing Unit Selection Speech Model on Micro Prosodic Level. In Proceeding National Computer Sciences Postgraduate Colloquium 2005 (NaCSPC'05), Pulau Pinang, Malaysia, 2005.
