Shihadeh Alqrainy
School of Computing
Faculty of Computing Sciences and Engineering
De Montfort University
July, 2008
written text with its grammatical category or part-of-speech, i.e. noun, verb, prepo-
sition, adjective, etc. It is the most common disambiguation process in the field of
Natural Language Processing (NLP). POS tagging systems are often preprocessors in
The Arabic language has a valuable and important feature, called diacritics, which
are marks placed over and below the letters of the word. An Arabic text is partially-
vocalised 1 when a diacritical mark is assigned to one or at most two letters in the
Diacritics in Arabic texts are extremely important, especially at the end of the word.
They help not only in determining the correct POS tag for each word in the sentence,
but also in providing full information regarding the inflectional features, such as tense,
number, gender, etc., of the words in the sentence. They add semantic information to words
which helps with resolving ambiguity in the meaning of words. Furthermore, diacritics
ascribe grammatical functions to the words, differentiating the word from other words,
This thesis presents a rule-based Part-of-Speech tagging system called AMT, short
for Arabic Morphosyntactic Tagger. The main function of the AMT system is to assign
the correct tag to each word in an untagged, raw, partially-vocalised Arabic corpus,
and to produce a POS tagged corpus without using a manually tagged or untagged
lexicon (dictionary) for training. Two different techniques were used in this work, the
The rules in the pattern-based technique are based on the pattern of the
testing word. A novel algorithm, the Pattern-Matching Algorithm (PMA), has been
designed and introduced in this work. The aim of this algorithm is to match the testing
The lexical and contextual technique, on the other hand, is used to assist the
pattern-based technique in assigning the correct tag to those words that do not have a pattern to
follow. The rules in the lexical and contextual technique are based on the character(s),
the last diacritical mark, the word itself, and the tags of the surrounding words.
The importance of utilizing the diacritic feature of the Arabic language to reduce the
lexical ambiguity in POS tagging has been addressed. In addition, a new Arabic tag
set and a new partially-vocalised Arabic corpus to test AMT have been compiled and
presented in this work. The AMT system has achieved an average accuracy of 91%.
Chapter 1
1.1 Overview
Natural Language Processing (NLP) is one of the Artificial Intelligence (AI) fields that
deals with analysing, understanding and generating human languages in order to
interface with computers in both written and spoken contexts using natural human
languages (e.g. English, Arabic, French, etc.) instead of computer languages (e.g. Java,
C++, etc.)1. Understanding human languages is not an easy task for a computer that
lacks the human knowledge of the world and the human experience with linguistic
Multiple levels of knowledge are required to process human language. The list
below summarises some of the different forms of knowledge relevant for natural lan-
• Phonological knowledge: how words are related to the sounds that realise them.
called morphemes. For example, the English word "cats" has two morphemes
(cat and s).
This information is extremely necessary to resolve any type of ambiguity that may
sonably interpreted in more than one way [33]. Ambiguity is arguably the single
most important problem in NLP [66]. Natural language has a huge number of ambiguities
at every level of description, such as lexical (many words tend to have multiple
functions in a sentence) and semantic (some sentences can have multiple interpreta-
The main goal of the NLP field is to resolve the ambiguity that may be found in human
process is a central first step in most NLP tasks, such as machine translation, informa-
The most common disambiguation process, which has received extensive attention from
the NLP research community, is Part-Of-Speech (POS) tagging. POS tagging is the process
of labeling or classifying each word in written text with its part-of-speech, i.e.
noun, verb, preposition, adjective, etc. It is concerned with lexical ambiguity resolution.
For example, the sentence "He will table the motion" is tagged as follows:
2 Also called grammatical class or part-of-speech
The descriptive symbols or notations PPS, MD, VB, AT, NN, BEZ, and JJ are called
POS tags. Each symbol or tag indicates that the word belongs to a particular grammatical
class. For example, PPS = subject pronoun; MD = modal; VB = verb (no inflection);
AT = article; NN = noun; BEZ = present 3rd sg form of "to be"; JJ = adjective.
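The tag glossary above can be represented as a simple mapping from tag to grammatical class. The following is a minimal sketch, not part of the AMT system: the names `TAG_GLOSSARY`, `tagged_sentence` and `describe` are hypothetical, and the tags for the example sentence are assigned by hand from the definitions given in the text.

```python
# Glossary of the tags listed in the text.
TAG_GLOSSARY = {
    "PPS": "subject pronoun",
    "MD": "modal",
    "VB": "verb (no inflection)",
    "AT": "article",
    "NN": "noun",
    "BEZ": 'present 3rd sg form of "to be"',
    "JJ": "adjective",
}

# Hand-assigned (word, tag) pairs for the example sentence.
# This assignment is an assumption made for illustration.
tagged_sentence = [
    ("He", "PPS"),
    ("will", "MD"),
    ("table", "VB"),
    ("the", "AT"),
    ("motion", "NN"),
]


def describe(tagged):
    """Expand each (word, tag) pair into a readable 'word: class' string."""
    return [f"{word}: {TAG_GLOSSARY[tag]}" for word, tag in tagged]


for line in describe(tagged_sentence):
    print(line)
```

Storing a tagged sentence as a list of (word, tag) pairs keeps the word order intact, which matters later when the tags of surrounding words are used as context.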
Many words in languages are ambiguous: they may be assigned more than one POS
tag [114]. For example, the English word round may be a noun, an adjective, a prepo-
The word "table" in the above context is tagged as a verb while it can be a noun in
Resolving these lexical ambiguities constitutes the main challenge and the ultimate
goal of a POS tagging system3. Lexical information includes not only the part-of-speech
of the word but also the inflectional features of the word, such as tense, person, number,
mood, case and gender. In general, this information must be
available to the tagging system. It is encoded in a descriptive symbol called a tag and
POS tagging is a very important intermediate step toward building many NLP applications,
such as text-to-speech synthesis, speech recognition, information retrieval (IR),
spelling correction, and parsing systems. In addition, the most prominent and most
developed field in which POS tagging is used is corpus linguistics [73,114]. NLP applications
that need a POS tagging system as an important intermediate step, and corpus
linguistics, are discussed in more detail in section 2.2 and section 2.3 respectively.
3 Also called tagger system
the size of the text corpus is increasing, it is becoming very difficult for the human tagger
to annotate the text in the corpus accurately. Furthermore, it requires great effort, cost,
and time. So, the development of an automatic POS tagger is highly desirable.
The main task of concern for this thesis is POS tagging for the Arabic language.
The current literature in the field of Arabic NLP shows that little research has been
done on POS tagging for Arabic. Very few attempts were made to develop a POS
tagger for Arabic, such as the work done by Abuleil [15] in 1999. The aim of his tagger
was to serve as a first step in parsing Arabic newspaper text. El-Kareh and
Al-Ansary [54] also presented a semi-automatic POS tagger in 2000. The first tagger for
Arabic whose aim was to produce a tagged corpus was presented by Khoja [87] in 2003.
A few taggers appeared later, such as the work done by Habash and
Rambow [71], Diab et al. [51] and Marsi et al. [102] in 2005. Also, Alshamsi and
Guessom [127] and Harmin [75] presented tagger systems for Arabic in 2006. This
brief review shows that work on POS tagging for Arabic has been done only in recent
years, whereas for English, for example, it was done three decades ago.
Many reasons lie behind the lack of research on the Arabic language. The richly inflected
and complex morphological system that Arabic exhibits on the one hand, and the
lack of resources such as a large manually tagged Arabic corpus on the
other hand, may constitute the main reasons behind the lack of research on the Arabic
language. In addition, the actual deployment of the use of computers and Internet in
The current taggers were built to tag unvocalised Arabic text using a lexicon or dictionary
that was tagged manually and used as a training corpus containing all possible
tags (lexical information) for each word. At this point, the main task of the tagger is
to resolve the lexical ambiguity and to determine the proper tag of ambiguous words
based on the context of the sentence.
The training corpus should be very large for two reasons. The first reason is to achieve
very good accuracy, comparable to that of the taggers used for English (98%-99%), which
were trained on a very large amount of data (e.g., hundreds of millions of words),
while the accuracy of Khoja's tagger, as an example, is 86%, since her tagger was trained
on a very small training corpus (10,000 words) [87]. At the same time, Khoja states that
"Of course, having a tagger that did not require a tagged corpus was valuable to languages
other than English, where there was no tagged corpus available" ([88], p.29).
The second reason is to avoid the most important problem in POS tagging: unknown
words. Unknown words are words not appearing in the training corpus. Neither the
testing corpus nor the training corpus has lexical information and tags for these words.
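The unknown-word problem can be made concrete with a toy lexicon lookup. The sketch below is illustrative only and does not describe any of the cited taggers: the lexicon contents and the function names `lookup_tags` and `unknown_words` are assumptions.

```python
# Toy lexicon: each known word maps to the set of tags it may take.
# The entries are invented for illustration.
toy_lexicon = {
    "the": {"AT"},
    "table": {"NN", "VB"},  # lexically ambiguous: noun or verb
    "is": {"BEZ"},
}


def lookup_tags(tokens, lexicon):
    """Pair each token with its candidate tags; unknown words get an empty set."""
    return [(tok, lexicon.get(tok.lower(), set())) for tok in tokens]


def unknown_words(tokens, lexicon):
    """Return the tokens for which the lexicon holds no lexical information."""
    return [tok for tok, tags in lookup_tags(tokens, lexicon) if not tags]


sentence = ["The", "table", "is", "wobbly"]
print(lookup_tags(sentence, toy_lexicon))
print(unknown_words(sentence, toy_lexicon))
```

A lexicon-driven tagger can only choose among the candidate tags it has stored, so a word such as "wobbly" here leaves it with nothing to choose from. This is why such taggers need a very large lexicon or training corpus.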
In case the tagger system deals with unvocalised Arabic text, a huge lexicon or training
corpus is required to be available to the tagging system. Unlike English, Arabic still
lacks a huge manually tagged corpus from which large amounts of training data can
POS tagger that needs as little training data as possible. Therefore, developing a POS
tagger for unvocalised Arabic text using a statistical approach as one of the two major
approaches (rule-based and statistical) that achieves reasonable accuracy seems very
1.2 Motivation
As mentioned earlier in section 1.1, the taggers built for Arabic are based on a
lexicon or dictionary that was tagged manually for training, and are used to tag unvocalised
Arabic text. However, the Arabic language has a valuable and important feature,
called diacritics, which are marks placed over and below the characters of the word.
An Arabic text may be written with diacritics or without. The text that appears without
diacritics is called unvocalised text. While written Arabic text with full representation
text when a diacritical mark is assigned to one or at most two letters in the word.
In addition, the Arabic language has many signs that indicate the class of the word. Patterns,
grammatical rules, affixes 4, and ending case are examples of these signs. Based
answer regarding the field of Arabic NLP. These questions and the objectives of this
1. Does an automatic POS tagger system deals with partially-vocalised Arabic text
2. Do diacritics play an important role in resolving the lexical ambiguity that may arise
in Arabic text?
4 Affixes in Arabic are those letters which precede the root of the word (prefixes), follow the root
(suffixes) or are placed inside the root with which they are associated (infixes).
The literature on Arabic NLP shows that the previous
questions have not yet been answered. As mentioned above, the current taggers were built to
tag unvocalised Arabic text. A tagger system that deals with partially-vocalised Arabic
text does not yet exist. In addition, the importance of utilising the diacritic feature of
the Arabic language to reduce the ambiguity in POS tagging has not been addressed. A
raw or hand-tagged corpus which contains partially-vocalised Arabic text also does
not exist. Finally, although the current taggers used a number of tag sets, as described
in chapter 2, most of these tag sets were compiled to represent the general tag of the
word (the general part-of-speech) without including more linguistic attributes of the
Arabic word. In addition, these tag sets do not cover most of the grammatical classes
of the Arabic language or the inflectional features of the Arabic word. Therefore, a
• to create a POS tagger system that deals with partially-vocalised Arabic text without
using a lexicon of Arabic words (tagged or untagged), especially for words that belong
to the verb or noun classes, and at the same time achieves very good accuracy.
• to investigate the role of the diacritic feature, especially at the end of the word (ending
case), in reducing the ambiguity and providing semantic information that
helps to determine the correct tag of each word in the testing corpus.
• to explore the possibility of using a novel technique to assign the correct tag to
each word in the testing corpus based on the pattern of the word instead of the word
This research provides new contributions to the field of Arabic NLP in different
The ultimate contribution of this research is the development of the POS tagger system
called AMT (short for Arabic Morphosyntactic Tagger). AMT deals with
partially-vocalised Arabic text. The main aim of AMT is to annotate the testing
corpus, that is, to add a POS tag or label to each word in the testing corpus and
requisite tool for many NLP tasks, such as parsing and information retrieval
systems. Chapter 5 of this thesis shows the design and implementation of AMT.
The fundamental component of any tagger system is the POS tag set that is used
in the tagging process [98]. The development of a tag set is an essential
step in building the tagging system. The need for a tag set comes from
the fact that there is no standardised and comprehensive Arabic tag set that covers
the grammatical classes of the Arabic language. Chapter 4 describes the steps of
designing a new Arabic tag set. The developed tag set follows the Arabic grammatical
system, based upon POS classes and inflectional morphology that Arab
grammarians describe. During the course of developing this tag set, two Arabic
and Mr. Walid Alqrini6, Ministry of Education, Jordan. The consultation was
extended to cover other related issues, such as the rules of the Arabic language
and the testing corpus.
A raw corpus which contains partially-vocalised Arabic text is needed to test the
AMT tagger system. Such a corpus does not exist. This research provides a
domain; it covers a wide range of topics such as scientific and literary topics.
icon of patterns instead of using manually tagged. The rules in this technique
are based on the pattern of the word in the testing corpus instead of the word it-
gorithm (PMA). The aim of this algorithm is to match the inflected word in the
The Lexical and Contextual Technique is used to assist the Pattern-Based Technique
to assign the correct tag to the words not tagged by the Pattern-Based Technique.
The rules in the Lexical and Contextual Technique are based on the character(s),
affixes, the last diacritical mark, the word itself, and the surrounding
5 Prof. Ali Alhamad site can be found at:
6 Walid Alqrini email:
vided into five sections. Section 2.1 introduces the problem of part-of-speech tagging
while some of its applications are introduced in Section 2.2. Section 2.3 discusses
corpus-based linguistics. The most important approaches used to solve the problem of
POS tagging are briefly examined in Section 2.4. The last section (section 2.5) defines
the POS tag set and also describes the previous work on POS tag sets.
Section 3.1 introduces an overview of the Arabic language. A brief history of the Ara-
bic script and the diacritic feature is presented in Section 3.2. The importance of the
diacritic feature in POS tagging for Arabic is discussed in Section 3.3. Section 3.4
Chapter 4 is concerned with the development of the tag set design presented in this
work and contains three main sections. Section 4.1 describes the criteria to take into
account while developing the POS tag set. Arabic inflectional features are explained
in Section 4.2. The last section (section 4.3) introduces the developed Arabic POS tag
This chapter is concerned with an implementation of the AMT system presented in this
work. It contains five main sections. The characteristics of the AMT tagger system are
defined in Section 5.1. The rule-based approach is described in Section 5.2. Section
5.3 explains the pattern-based technique used in this work while the lexical and con-
textual technique is explained in Section 5.4. A description of the tagger system and
Chapter 6 is devoted to the evaluation of results obtained from AMT. It contains three
main sections. Testing data is described in Section 6.1 while the details of each experiment
done to evaluate the AMT tagger are presented in Section 6.2. Finally, experimen-
Chapter 7 : Conclusion
This chapter contains the main conclusion yielded by this work and future research.
Chapter 2
To illustrate what part-of-speech tagging1 is about, let us begin with a simple example
the Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary
The goal of part-of-speech tagging consists of labeling or tagging each word in the text,
including punctuation marks, with its correct part-of-speech. The following results are
For the simple text shown, the words in the sentence are followed by a tag, where the
slash "/" separates the word from the tag or part-of-speech symbol. The tag here is
taken from a predefined inventory of labels called a tag set. The tag AT indicates that
the word belongs to the grammatical class of articles; NP represents proper nouns; VBD
One of the most difficult problems affecting POS tagging is text ambiguity.
ways [43]. Ambiguity is the most significant problem in processing natural language.
interpreted in more than one way [33]. Unlike grammars for computer programming
languages, grammars for natural languages like English, as an example, are usually
1 Also called morphosyntactic categorisation or syntactic word class tagging (see ref [73]).
ambiguous. Figure 2.1 shows the main ambiguity types in a natural language.
[Figure 2.1: The main ambiguity types in a natural language: lexical, syntactic and semantic.]
• Lexical Ambiguity
Lexical ambiguity occurs when a word has several meanings. For instance, the
word "lie" can mean "a statement that you know is not true" or "the present tense of lay".
Words like "light", "note", "bear" and "over" are lexically ambiguous [40].
given more than one grammatical structure, and each has a different meaning. In
other words, when there are different possible syntactic parses for a grammatical
Consider the sentence "Fasten the assembly with the lever". This may be either
assembly, which has a lever attached to it. With the former interpretation, the
prepositional phrase "with the lever" is attached to the verb, and with the latter,
• Semantic Ambiguity
Semantic ambiguity occurs when a sentence has more than one way of read-
ity [40]. Semantic ambiguity refers to the broad category of ambiguity which
arises when the meaning of the sentence must be determined with the help of
ence is an example of semantic ambiguity. In the sentence "Start the engine and
keep it running", the fact that "it" refers to the engine is not inferable from the
POS tagging is the most common type of lexical disambiguation. A POS tagger system
is typically used to resolve lexical ambiguity (ambiguity in a single word) based on
context, using the surrounding words and grammar rules. For example, in the following
English sentence2: "Book that flight", the word "Book", as shown in figure 2.2, is
ambiguous regarding its part-of-speech; it can be a verb [V] or a noun [N]. Similarly,
In Arabic, the same problem is faced. For example, in the Arabic 3 sentence4 shown in
Table 2.1, the word دخل, dkhl is ambiguous with regard to its part-of-speech. It can
Arabic Sentence: دخل رمزي البيت
Transliteration: Albyt rmzy dkhl
Translation: "the house" "Ramzy" "entered", but it really means
"Ramzy entered the house"
Since many words in languages are POS ambiguous, lexical ambiguity becomes
the main problem that a POS tagging system faces. Resolving these ambiguities constitutes
the main challenge in POS tagging. The tagger system should choose the best tag
for each word in the text which has more than one part-of-speech. It is clear and well
known that part-of-speech depends on context [45]. The word "table", as another example,
can be a verb in some contexts (e.g., "He will table the order") and a noun in others
(e.g., "The table is too big"). Therefore, adequate context and/or adequate semantic
information are required to resolve the problem of POS tagging [109].
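The dependence of a tag on its context can be sketched with two hand-written contextual rules. This is a minimal illustration under stated assumptions, not a description of any tagger discussed in this thesis: the rule list `CONTEXT_RULES` and the function `disambiguate` are hypothetical, and the tag names (MD, AT, VB, NN) are the ones introduced in Chapter 1.

```python
# Hypothetical contextual rules of the form
# (tag of the previous word, tag to prefer for the current word).
CONTEXT_RULES = [
    ("MD", "VB"),  # after a modal ("will table"), prefer the verb reading
    ("AT", "NN"),  # after an article ("the table"), prefer the noun reading
]


def disambiguate(candidates, prev_tag):
    """Pick one tag from a set of candidates using the previous word's tag.

    Returns None when the word is ambiguous and no rule applies.
    """
    if len(candidates) == 1:
        return next(iter(candidates))
    for rule_prev, preferred in CONTEXT_RULES:
        if prev_tag == rule_prev and preferred in candidates:
            return preferred
    return None


table_tags = {"VB", "NN"}
print(disambiguate(table_tags, "MD"))  # "He will table the order"
print(disambiguate(table_tags, "AT"))  # "The table is too big"
```

The same candidate set yields different tags in the two sentences because only the preceding tag differs, which is the sense in which part-of-speech depends on context.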
The Arabic language differs from English in terms of characteristics and grammatical
system; for example, (1) the diacritic feature, which is not present in English,
and (2) the root and pattern structure on which the Arabic morphological system is
based. While Arabic has the diacritic feature, Arabic text may be written without
When the text is written in an unvocalised form, resolving the lexical ambiguities in
3 Since a cursive system from right to left is used in written Arabic, the sentence is read from right to
left. More details of the Arabic language are discussed in chapter 3
4Transliterated Arabic words throughout this thesis are in italics while English translations are in
double quotes. All separated by commas.
this case resembles that of English, which is based on the context. But the case is
The testing corpus in this work is a partially-vocalised Arabic text. The diacritical
mark is assigned only to the last letter of each word in the testing corpus. There are
two reasons for choosing a partially-vocalised Arabic text as a testing corpus. The first
one is to investigate the importance of the last diacritical mark in reducing the lexical
ambiguity of the word and helping the POS tagger to resolve this ambiguity and to
assign the correct tag to the words in the testing corpus regardless of the context in
most cases. The second reason is to explore the possibility of applying pattern-based
rules to tag the testing words based on the pattern of the word instead of the word itself.
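The idea of tagging a word by its pattern rather than by the word itself can be sketched as follows. This is not the Pattern-Matching Algorithm (PMA) of this thesis, which is described in Chapter 5; it is a simplified illustration under several assumptions. Words are written in a Buckwalter-style transliteration, the letters f, E and l stand for the three root consonants of a pattern, the pattern lexicon `PATTERNS` and its coarse tags are hypothetical, and diacritics are ignored.

```python
# Placeholders for the three root consonants in a pattern.
# Simplification: a literal "f", "E" or "l" inside a pattern cannot be
# expressed in this sketch.
ROOT_SLOTS = {"f", "E", "l"}

# Hypothetical pattern lexicon: pattern -> coarse tag.
PATTERNS = {
    "fAEl": "NOUN",   # active participle pattern, e.g. kAtb ("writer")
    "mfEwl": "NOUN",  # passive participle pattern, e.g. mktwb ("written")
}


def matches(word, pattern):
    """True if the word fits the pattern: same length, and every
    non-placeholder letter of the pattern equals the word's letter."""
    if len(word) != len(pattern):
        return False
    return all(p in ROOT_SLOTS or p == w for w, p in zip(word, pattern))


def tag_by_pattern(word):
    """Return the tag of the first matching pattern, or None if no pattern fits."""
    for pattern, tag in PATTERNS.items():
        if matches(word, pattern):
            return tag
    return None


for word in ["kAtb", "mktwb", "dkhl"]:
    print(word, tag_by_pattern(word))
```

A small lexicon of patterns can cover many words that share the same morphological shape, so no per-word lexicon is needed for them. Words that fit no pattern, such as dkhl in this toy lexicon, would have to be handled by other rules.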
The importance of the diacritic feature in POS tagging and the pattern-based approach
are described in more detail in Chapter 3 and Chapter 5 respectively. On the other hand,
ambiguity identification is crucial not only for the part-of-speech tagging, but also for
any other text processing dealing with content, such as, speech processing or semantic
annotation [33].
POS tagging is a preliminary stage for many NLP applications. The most prominent
and most developed field in which POS tagging is used is corpus linguistics [73].
Corpus linguistics is described in more detail in Section 2.3. POS tagging is also a useful and
important practical problem with potential NLP applications in many areas [45], such
as :
nouns) [50]. Users of the World Wide Web will appreciate the importance of
accurate information retrieval. POS tagging removes the lexical ambiguity and
identifies the syntactic class of words. For example, the word "cooking" can
be used either as a noun ("Cooking is fun") or a verb ("he is cooking lamb").
By identifying the syntactic role of the word "cooking" within documents,
the results of searching for "Cooking fish", as an example, would not include
• Parsing system
A POS tagging system can be an important first step and an integral part of any
parsing system [50]. Since the parser needs lexical information for each word
before performing the parsing process, such information is usually obtained
• Word Processing
Since most word processors attempt to provide a check not only on spelling, but
• Machine Translation
A tagged version of each corpus in parallel corpora5 (text in different languages)
5 Corpora is the Latin plural of corpus. The next section defines corpus linguistics in more detail.
• Building Dictionaries
A tagged text has a great benefit also in building dictionaries. It has information
which can be of help to users of the dictionary such as language learners and
2.3.1 Introduction
Over the last decade, many efforts have been devoted to compiling large raw text corpora.
Corpus Linguistics is the study of linguistic phenomena through large collections
of machine-readable texts [9]: corpora. A corpus is defined by Leech [92, 140] as "a
large collection of natural language material stored in machine readable form that can be
The usability of a corpus can be greatly enhanced by adding the POS class to every word
in the corpus, or any other relevant linguistic information which may be needed by the
linguist or other developers in NLP. Once the corpus is analysed, it constitutes a kind
of database that contains information about the linguistic structure and statistics of lan-
The fast development of computers with huge memory capabilities and software
on the one hand, and the availability of large documents, books and publications in
machine-readable format on the other, have made the compilation of these corpora
than an aspect of a specific language [86]. Many POS tagging systems built earlier used
different approaches, especially for the English language since it is the first language of corpus
linguistics [30]. The majority of these systems are designed to annotate text
corpora, that is, the corpora contain not only words, but also linguistic information on them,
such as part-of-speech. A tagged corpus has a higher linguistic value: it provides
specific linguistic information which is very useful for developing lexical resources,
The history of corpus linguistics started at the beginning of the sixties when the first
printed American English corpus, known as the Brown Corpus, was compiled.
1. Brown Corpus
The Brown Corpus [60,68] was compiled by W. Nelson Francis and Henry
Kucera at Brown University and contains 500 samples, each about 2,000 words
The original edition of the text corpus was completed in 1964. It was revised
twice in 1971, and then revised and annotated with word tags 6 in 1979.
of the Brown corpus resulting from research collaboration between the Univer-
sity of Lancaster, the University of Oslo, and the Norwegian Computing Centre
for the Humanities. The text corpus was published in 1978, and its tagged edition
in 1986.
about 500,000 words of spoken British English collected from broadcast and
recorded materials. The texts were collected between 1959 and 1975.
sylvania. The Penn Treebank-I project ended in 1992 with 4.5 million words of
text, including the entire Brown corpus text, the Wall Street Journal Corpus, and
some other genres. The texts were tagged with POS tags. The data produced by
researchers) from different English speaking countries, such as USA, UK, Aus-
tralia, and New Zealand. It contains about one million words with regional va-
rieties of English for each component. For example, the ICE-GB 7 consists of
one million words completed in 1998. The texts in the corpus were published or
tains over 100 million words of written and spoken modern British English (90%
written, 10% spoken). The corpus is encoded with SGML to represent POS tags
and was automatically tagged.
7. SUSANNE Corpus
The SUSANNE Corpus [121] was created by Geoffrey Sampson with the spon-
sorship of the Economic and Social Research Council (UK). It contains about
8. TOSCA Corpus
The TOSCA Corpus [12] was compiled at the University of Nijmegen in 1986.
It contains about 1.5 million words of British English and consists of written
texts on education, history, philosophy, etc.
In addition, many other computerised English corpora have been developed, such
as: SEC: Spoken English Corpus [133], PoW: Polytechnic of Wales corpus [128],
SCRIBE: Spoken Corpus Recordings In British English [29], COLT: Corpus of London
Teenager English [25] and IPSM: Industrial Parsing of Software Manuals [130].
Lastly in this list, the multi-tagged corpus known as the AMALGAM corpus [30]
has been developed at Leeds University within the AMALGAM8 project by
Atwell et al. [31]. It contains texts from different genres of English corpora such as,
It becomes clear that English has been a productive field of research in corpus linguistics
and that it stands out as the most computerised language in the world due to the hundreds,
if not thousands, of different corpora which have been developed and are being
The success of the English language in the field of natural language processing and
corpus linguistics encouraged other researchers to build their own corpora, such as:
Chinese (The UCLA Chinese Corpus), Czech (Czech National Corpus), Danish (Danish
Corpus), Spanish (LEXESP corpus), German (NEGRA corpus), French (TLF
corpus), Swedish (Bank of Swedish corpus), Catalan (CTILC corpus), Basque (EEBS
corpus), Bosnian (Oslo corpus of Bosnian Texts), and many other languages [10].
Unlike English, Arabic has been much less fortunate in corpus linguistics research
as well as in POS tagging. A useful survey of existing resources
for Arabic corpora can be found in the work done by Latifa Alsulaiti and Eric Atwell
[18]. However, a number of electronic unvocalised Arabic raw text corpora have been
An-Nahar Corpus [2] comprises articles in written Arabic collected from the
articles published between 1995 and 2000. The total size of the complete files in
2. Al-Hayat Corpus
Al-Hayat Corpus [1] has been compiled at the University of Essex, in collabora-
tion with the Open University. It contains 18,639,264 distinct tokens in 42,591
articles covering several subjects, such as, General, Car, Computer, News, Eco-
nomics, Science, and Sport. The size of the total file is 268 MB.
2003. It contains around three million written Arabic words collected from pub-
4. Nijmegen Corpus
Nijmegen Corpus [6] was compiled at Nijmegen University in 1996. It con-
Dutch/Dutch-Arabic dictionary.
2004. It contains 76 million tokens (869 MB) covering written Arabic texts col-
lected from Agence France Presse, Xinhua News Agency, and Umma Press from
1994 to 2000. The source material in this corpus was tagged using TIPSTER-
ing her MSc research project with Eric Atwell at the University of Leeds in 2004. It
contains around 1M words covering written and spoken Arabic text collected from
websites and online magazines. It is the only corpus freely available to the public.
sylvania to develop an Arabic corpus containing one million words. The project
began with 734 files representing 166K words of written Modern Standard Arabic
news wire from the Agence France Presse corpus, which was released as
Arabic Treebank: Part 1. The second part was released as the 168K word corpus,
Arabic Treebank: Part 2. The Arabic Treebank: Part 3 corpus was released
In addition, other Arabic corpora have been compiled [8], such as CLARA,
Egypt, DINAR, Leuven, and other corpora. Unfortunately, these corpora are not available
to researchers free of charge, except the CCA corpus. However, some of these Arabic
corpora can be acquired from the Linguistic Data Consortium (LDC) and the European
2.4.1 Introduction
The POS tag set is a list of all the word classes that will be used in the tagging process.
It is the fundamental component of any tagger system and the first step for the
annotation of corpora [89]. A tag is a code or descriptive symbol that represents some
feature or set of features attached to the word in a text [73, 105]. Thus, a POS tag set
is an inventory of labels used to classify and mark up words of a target text [74].
A new Arabic tag set called ARBTAGS has been developed. The justification behind
developing a new Arabic tag set is explained in section 2.4.4, while the previous
work on POS tag sets for English and other languages is described in section 2.4.2.
Since English corpora have been tagged by several POS tagging systems, a number of
popular tag sets have also been built to support these POS systems. The list below
summarises some of these tag sets, which can be found at the site of AMALGAM 9.
The Brown tag set started with a set of 77 tags, and was enlarged to about 226 tags
used to tag and enhance the coverage of the Brown corpus. A sample of the Brown tag set
can be seen in Table 2.2
granularity. The tag set contains 135 tags used to tag the LOB corpus. Table 2.3
based on the LOB corpus tag set. In contrast, the LOB tag set differentiates between
relative and interrogative WH-pronouns whereas the SEC tag set does not. For example,
in the SEC tag set, the tag WP is used to cover WH-pronouns, interrogative,
• Penn Treebank tag set10: The Penn Treebank tag set was used to tag the Penn
Treebank corpus. It contains about 36 tags. Sample of Penn Treebank tag
In addition, several English tag sets have been built and used to tag other corpora, such
as the UCREL C7 tag set, SUSANNE corpus tag set, TaSCA corpus tag set and PoW
Other tag sets have been designed for languages other than English, such as Urdu
[74], French [41], African languages [80], Czech [79,83], Hungarian [136], Slovene
[53], German [94], Persian [104], Swedish [115], Hebrew [118], Italian [34], Spanish
[27] and Turkish [112]. However, a useful resource on the comparison of some of the
Since there has not been much work done in POS tagging for Arabic, only a very small
number of tag sets have been built. The list below summarises the well-known tag sets that
Khoja [89] describes an Arabic tag set that is based on POS classes and an
inflectional morphology system, and that is used in her tagger system APT: An
Automatic Arabic Part-of-Speech Tagger. The tag set contains 177 detailed tags.
Each tag represents the name of one of the three main classes (verb, noun, particle) and
their sub-classes, including inflectional features such as gender, number and
person. Her tag set covers 57 types of verbs, 103 types of nouns, 9
types of particles, 7 residuals and 1 punctuation tag. A sample of the Khoja tag set can be
El-Kareh and Al-Ansary [54,87] described an Arabic tag set used for their semi-
automatic tagger system. Their tag set contains 72 tags covering 3 sub-classes
of the main class verb, 46 sub-classes of the main class noun and 23 sub-classes
11 For more : http://www.ling.ohio-state.edulbromberg/postags/posproject.html
The LDC tag set was created by the Linguistic Data Consortium (LDC) team
and contains 24 tags used to tag the Penn Arabic Treebank corpus. It is also used by
other works in POS tagging for Arabic, such as the SVM tagger by Mona
Diab [51] and the Egyptian dialect POS tagger by Duh and Kirchhoff [52]. The
POS tagger system. It contains 55 tags. As Alshamsi and Guessom point out,
since their tagger is mainly intended for Named Entity extraction,
their tag set is not fine-grained. For example, they used the following
tags: NOUN (noun), ADJ (adjective), PNOUN (proper noun), PRON
(pronoun), INDEF (indefinite noun) and DEF (definite noun) to represent the
noun category and its subcategories. On the other hand, PVERB (perfect verb),
(future/Imperative) tags were used to represent the verb category and its sub-
particles. In addition, some inflectional features, such as person, number and
gender, were added to their tag names to show the morphological analysis of the
word. For example, the tag PRON~S means second person, singular number,
feminine/masculine gender pronoun. Table 2.7 shows a sample of their tag set.
In this work, an Arabic tag set called ARBTAGS has been developed. The rationale
behind developing this tag set is that there is no standardised and
comprehensive Arabic tag set covering the most common types (sub-classes) of the
three main Arabic word classes.
The developed tag set differs from the tag sets previously built for Arabic. The
main difference is the tag set hierarchy, which is shown in figure 2.3 and reflects the
way that Arabic words have been classified.
As shown in the tag set hierarchy, the noun class is classified into sixteen sub-classes
(common, proper, adjective, etc.), the verb class into three sub-classes (perfect, imperfect,
imperative), the particle class into seven sub-classes (preposition, vocative, conjunction,
etc.), and one punctuation class. In addition, one more general tag, [Fw], is added to
these general tags to represent foreign (Arabised) words.
These general tags represent the names of the main classes and sub-classes without
inflectional features. The developed tag set hierarchy differs from the hierarchies of the
tag sets previously built for Arabic. For example, Khoja (see figure 2.4, which
is reproduced from the original figure in [89]) classified the noun class into five
adjective), while the particle class was categorised into nine sub-classes (prepositions, ad-
[Figure 2.3: Hierarchy of the developed ARBTAGS tag set]
[Figure 2.4: Hierarchy of the Khoja tag set, reproduced from [89]]
Alshamsi and Guessom [127] classified the noun class into four sub-classes and the
particle class into four sub-classes (see Table 2.7). They point out that there is no need for
a fine-grained tag set, since their tagger was intended for Named Entity extraction
( [127], p.3-+). The LDC tag set, as another example, was mapped from the English tag set
Several subclasses of the developed tag set have not been mentioned in previous tag sets:
verbal, diminutive, instrument, noun of place, noun of time, conditional and interrogative,
which belong to the noun class, and vocative, subjunctive and jussive, which belong to the
particle class. Thus, using one of the previously built tag sets
would not capture all the subclasses shown in the developed tag
set hierarchy. In addition, the testing corpus in this work is partially-vocalised text,
which requires more inflectional features than are described in the other tag sets.
The developed tag set is based upon POS classes and inflectional morphology [24].
The tag names in the developed tag set use terminology from Arabic tradition rather
than English grammar. For example, in the Khoja tag set, the tag [VPPl2M] is verb perfect
plural second-person masculine. As Atwell [30] points out, since the Khoja Arabic tag set
came from the Lancaster UCREL tradition of Corpus Linguistics, she was influenced
by English tag sets, such as the CLAWS heritage of tag sets for the LOB and BNC corpora.
Therefore, she used terminology from English grammar rather than Arabic
tradition in naming categories and features. The author agrees with Atwell. It seems
that not only the Khoja tag set uses terminology from English grammar rather than Arabic
tradition, but also the other tag sets built for Arabic and described above used the
The tag names in the developed tag set use terminology from Arabic tradition rather
verb, masculine gender, plural number, third person, subjunctive mood]. Details about
The ARBTAGS tag set developed in this work contains 161 detailed POS tags (101 for nouns,
50 for verbs, 9 for particles and 1 for punctuation); these tags are enriched with inflectional
feature information. The general and detailed tags with examples have been de-
The usability of ARBTAGS has been tested in manual tagging, which was used to
build a set of tagged text that serves as a goal corpus against which the results
obtained from the AMT tagger are compared. Although Khoja built 177 detailed tags, she
actually used five main general tags (noun, verb, particle, punctuation and residual) and
a simplified version of the tag set (30 detailed tags) to make the training of the POS tagger
computationally feasible ( [88], p.71). In contrast, most of the tags used in the developed
tagger are detailed tags, because the main aim of the developed tagger is to provide
a tagged corpus more useful for linguists and NLP developers to extract more linguistic
The existing literature shows that two main approaches to POS tagging have been
studied so far: the Rule-based Approach 12 and the Statistical Approach 13. Many
POS tagging systems have been implemented using these approaches. The majority of
these systems were used to tag text corpora.
12 also called linguistic approach or Knowledge-Based Approach
13 also called Probabilistic Approach or Stochastic Approach
We categorise these systems according to whether they adopt the
rule-based approach or the statistical approach. Some systems
adopt a hybrid approach (rule-based and statistical), and others use different
approaches, such as neural networks, machine learning algorithms and decision trees;
these are also addressed. The focus here is on the two main approaches, for which more
detail is provided and some well-known systems are described. We also discuss
the advantages and disadvantages of each of the two main approaches. A useful
survey of POS tagging approaches can be found in the work of
Abney [13].
The rule-based approach is based on incorporating a set of linguistic rules in the tagger
[49]. This approach uses a linguist-written language model that contains
from a few hundred to several thousand rules. The approach adopted in this
work is rule-based. The tagger presented in this work has two
main rule components: pattern-based rules, and lexical and contextual rules.
The pattern-based technique is a novel technique presented in this work. It
depends on the patterns of the words in the text. A novel algorithm to match each Arabic
word in the testing corpus with its correct pattern in the patterns lexicon has also been
built. In addition, a small number of hand-written rules and constraint rules (lexical
and contextual rules) are used to assist the main technique in assigning the correct
tag to those words not tagged by the pattern-based technique. The tagger system and the
The rule-based approach was the earliest approach to automated POS tagging. It dates
back to the 1960s and 1970s, when automated POS tagging was initially explored by
Klein and Simmons [90] in 1963 and by Greene and Rubin [63,68] in
1971, whose work is considered the most representative of such pioneering taggers. Afterwards,
a number of rule-based systems were developed, such as the work of Hindle
[77], Brodaa [38], Paulussen and Martin [113], Karlsson [85], Voutilainen [135]
and Brill [36,37]. Some of these systems have been built to tag corpora while other
This rule implies that a word that can be a preposition or a tense marker (i.e. the word
"to") should be tagged with the word TO (tense marker) when it precedes a word that
which is itself a part-of-speech tagger. Their tagger uses several smaller English
verbs, adverbs, etc. It comprises 500 words, all of which have unique grammar
codes (tags). Their CGC program performs several tests, such as a suffix test using
several different types of morphological information, and the context frame
rule test. Garside and Smith [63] define context frame rule as "a rule designed by
tial tag in the context of up to three tags on either side or that the potential tag was
impossible in this context". Furthermore, there are about 1,500 content word dic-
tionaries containing those nouns, verbs, and adjectives that are exceptions to the
science writing and reported that their system correctly tagged 90% of the words.
• TAGGIT system
Greene and Rubin [63,68] developed the first pioneering tagger system for English,
known as TAGGIT. It was the first tagger which introduced the
useful tool for linguistic research. TAGGIT was used to initially tag the one
lexicon containing about 3000 words was used in their TAGGIT program. The
lexicon was tagged manually, that is, each word in the lexicon was assigned its tag(s).
They used 3,000 context frame rules to disambiguate those words that have more
than one tag. Each word is initially checked to see if it is found in the lexicon.
If the word is found in the lexicon and has one tag, this tag is extracted and
assigned to the word. If it has more than one tag, a set of context frame rules
is applied to assign the best tag to the word. In addition, a suffix list of
450 strings is used to tag words not found in the lexicon. If the word is not
found in the suffix list, the NN, JJ, and VB tags are arbitrarily given to the word.
A set of 77 tags was used. The authors reported that the TAGGIT system correctly
tagged 77-78% of the words. Cutting et al. [47] point out that the rest was done
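The lookup order described for TAGGIT (lexicon, then context frame rules for ambiguous words, then the suffix list, then the default tags) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the original program: the lexicon entries, the suffix list, and the single context frame rule are hypothetical stand-ins, and the tag names are merely Brown-style examples.

```python
# Hypothetical stand-ins for TAGGIT's manually tagged lexicon and suffix list.
LEXICON = {"the": ["AT"], "dog": ["NN"], "run": ["NN", "VB"], "to": ["TO", "IN"]}
SUFFIXES = [("ness", ["NN"]), ("ous", ["JJ"]), ("ize", ["VB"])]
DEFAULT_TAGS = ["NN", "JJ", "VB"]  # given when neither lexicon nor suffix list applies


def candidate_tags(word):
    """Return candidate tags: lexicon first, then suffix list, then the defaults."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    for suffix, tags in SUFFIXES:
        if word.endswith(suffix):
            return list(tags)
    return list(DEFAULT_TAGS)


def apply_context_frame(prev_tags, tags):
    """One illustrative (assumed) context frame rule:
    after an unambiguous article (AT), discard the verb reading."""
    if prev_tags == ["AT"] and len(tags) > 1:
        kept = [t for t in tags if t != "VB"]
        return kept or tags
    return tags


def tag_sentence(words):
    """Tag words left to right, letting the previous word's tags act as context."""
    tagged = []
    prev = []
    for w in words:
        tags = apply_context_frame(prev, candidate_tags(w))
        tagged.append((w, tags))
        prev = tags
    return tagged


result = tag_sentence(["the", "run"])
```

In this toy setting the word "run" starts with two candidate tags and the single context rule removes the verb reading after the article; a word that matches neither the lexicon nor the suffix list keeps all three default tags.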
• Fidditch system
a syntactic analysis of text and to build phrase structure trees. Fidditch has the
following components :
- a lexicon of about 100,000 words listing all possible parts of speech for
- a morphological analyzer to assign part of speech and root form for words
a set of about 350 rules to disambiguate lexical category. Fidditch has a set of
Hindle tried to acquire a new set of disambiguation rules automatically from the
tagged text of the Brown corpus. The author claims that the performance of the
acquired rule set is much better than that of the set of rules for lexical disambiguation
written for the parser by hand over a period of several years; the error rate is
approximately half that of the hand-written rules.
which is known as, the Finite-State Intersection Grammar. ENGCG tagger con-
sists of two main rule components. The first component is a grammar specif-
solve the pending part-of-speech ambiguities as a side effect. It uses only lin-
- Tokeniser
heuristics rules.
The morphological analyzer assigns part-of-speech tags by looking each word
up in a lexicon that contains about 80,000 entries and then applying heuristic rules to
still unrecognized words. The default tag is noun when none of the rules apply.
A set of 139 tags was used. The author tested the ENGCG system against a
test corpus of 38,000 words and reported that it correctly tagged 99% of the words.
The most remarkable feature of Brill's tagger system [36,37], which distinguishes it
from other rule-based systems, is that it automatically infers rules from a
acquiring the rules automatically. Rules are learned by iteratively collecting errors
and generating rules to correct them. Figure 2.5, which is reproduced from
the original figure in [37], illustrates the learning process of Brill's tagger
the system assigns to every word its most probable POS tag, as estimated from
the small annotated training corpus. The training set is used here to determine
the most likely (frequent) tag for each word. For unknown words, the most
probable tag is guessed based on information such as an initial capital letter
or suffix analysis. For example, xxxxxxxion (where x represents any letter)
would be tagged as a noun because this is (presumably) the most common tag
for words ending in "ion". The tag in the second process compared to the true
the automatic annotated text to make it better resemble the manual annotation.
The tagger has a small set of rule templates. The templates are of the form:
transformation rule have been considered. The author also considered contex-
Change tag a to tag b when one of the two preceding (following) words is w
as follows:
Change the tag from preposition to adverb when the word two positions to the
right is "as". Based on the remarkable accuracy the system achieved (97%), the
author showed that the rule-based approach can achieve a high accuracy in com-
The statistical approach is based on collecting statistics from existing corpora. Since
it requires much less human effort than the rule-based approach, it is the most popular
approach. As Garside and Smith [63] point out, the general idea of this approach is
that, when a sequence of words is given, each with one or more potential tags, the
most likely sequence of tags can be chosen by calculating the probability of all possible
sequences of tags, and then choosing the sequence with the highest probability.
cessful approach has been to model the sequence of tags in a sentence as a Hidden
Markov Model (HMM). To obtain a statistical language model, one needs to estimate
the model parameters, such as the probability that a certain word appears with a certain
tag (lexical probability) 14, or the probability that a tag is followed by another (contextual
probability) 15. These probabilities are trained on a manually tagged corpus [78].
This estimation is usually done by computing unigram, bigram or trigram (N-
14 probability of a part of speech given the word.
15 probability of a part of speech given k previous parts of speech
16 N-gram model using information about both lexical probabilities and contextual probabilities
17 see ref [98] for more details
In order to define the goal of part-of-speech tagging systems with HMM models in
a little more detail, we consider the problem in its full generality 17. Let
$W_{1 \ldots N} = (w_1, w_2, w_3, w_4, \ldots, w_N)$ be a sequence of words, where $N$ is the length of the word
sequence, and let $C_{1 \ldots N} = (c_1, c_2, c_3, c_4, \ldots, c_N)$ be a sequence of part-of-speech or lexical
categories. When a word sequence is given, the goal of the part-of-speech system is to
find the sequence of part-of-speech or lexical categories that maximizes the probability
where $w_i$ denotes the $i$-th word in the word string and $c_i$ denotes a part-of-speech tag
assigned to the $i$-th word. After applying Bayes' rule approximation technique to ap-
After further simplifications and approximations (see ref [23] for detailed
explanations) to reduce equation 2.2, the final form of the formula becomes as follows:
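Under the usual bigram HMM assumptions, formula 2.3 can be written in the following form, built from the two terms discussed in this section (the lexical probability and the bigram probability). The argmax notation, the index range and the symbol $\hat{C}$ are assumptions made for this sketch rather than notation taken from the original.

```latex
\hat{C}_{1 \ldots N} = \arg\max_{c_1, \ldots, c_N} \prod_{i=1}^{N} P(w_i \mid c_i) \, P(c_i \mid c_{i-1}) \qquad (2.3)
```

Each factor pairs one lexical term with one tag-transition term, so the score of a candidate tag sequence is the product of both over all word positions.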
The term $P(w_i \mid c_i)$ in formula 2.3 is called the lexical probability. It can be estimated
from a corpus of text labeled with part-of-speech tags simply by counting the number
of occurrences of each word by tag. It represents the probability that a given tag is
The term $P(c_i \mid c_{i-1})$ in formula 2.3 is called a bigram probability; it indicates the
likelihood of a tag given only the preceding tag. It can be estimated simply by counting
the number of times each pair of tags occurs and comparing this to the individual tag
counts. For example, the probability that a verb (V) follows a noun (N) can be calculated
as follows:
$$P(c_i = V \mid c_{i-1} = N) \approx \frac{\mathrm{Count}(N \text{ at position } i-1 \text{ and } V \text{ at position } i)}{\mathrm{Count}(N \text{ at position } i-1)}$$
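The counting estimate above can be illustrated with a short sketch. The function name and the toy tag sequences are hypothetical; the denominator counts a tag only where it is followed by another tag, which is one reasonable reading of "Count(N at position i-1)".

```python
from collections import Counter


def bigram_tag_probabilities(tag_sequences):
    """Estimate P(c_i | c_{i-1}) by relative frequency over tagged sentences."""
    pair_counts = Counter()
    prev_counts = Counter()
    for tags in tag_sequences:
        # Walk over adjacent tag pairs (previous tag, current tag).
        for prev, cur in zip(tags, tags[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}


# Hypothetical toy corpus of tag sequences (D = determiner, N = noun, V = verb).
corpus = [["D", "N", "V"], ["N", "V", "D", "N"], ["D", "N", "N"]]
probs = bigram_tag_probabilities(corpus)
p_v_after_n = probs[("N", "V")]  # N is followed by V in 2 of its 3 non-final occurrences
```

A real tagger would also smooth these estimates so that unseen tag pairs do not receive zero probability; that step is omitted here.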
The equation below gives the general formula for the N-gram language model, from which
the bigram and trigram models can be simply derived:
$$P(w_1^n) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1}) \qquad (2.4)$$
where $w_1^{k-1}$ denotes the word sequence $w_1, w_2, w_3, w_4, \ldots, w_{k-1}$.
In an HMM, we directly observe only the sequence of words, while the sequence of tags
is hidden from the observer of the text; hence the term "Hidden Markov Model" is
appropriate [63]. When the estimates used for the tag transition probabilities
are derived from bigrams, that is, when the likelihood of a tag is estimated given the
knowledge that a particular other tag precedes it, the model is called a first-order HMM.
A second-order HMM uses tag transition estimates derived from trigrams, that
is, the likelihood of a particular tag is estimated given the knowledge that two particular
other tags precede it ( [63], p.105). The simplest model would be a most-likely-tag
Although the peak use of the statistical approach in part-of-speech tagging came
in the eighties, the first attempt to use the statistical approach was the work
of Stolz et al. [129] in 1965. Afterwards, many researchers presented valuable
tagging systems using the statistical approach. The seminal work is the CLAWS
system, which uses an HMM [63-65,93]. Merialdo [108] developed a POS tagger for English
based on the probabilistic trigram model. Brants [35] proposed TnT, a statistical
POS tagger. Many other systems were built using the statistical approach, such as those of
DeRose [117], Church [45], Cutting et al. [47], Weischedel et al. [137], Bahl and Mercer
[32], Samuelsson [122,123], and Kupiec [91]. However, a useful resource on the
WISSYN system
The WISSYN grammatical coder, developed by Stolz et al. [129], is the earliest known
POS tagger that uses probabilities to determine the grammatical classes (tags) of words.
It has four component phases: the dictionary, morphology, ad hoc and probability
phases. The first three phases accomplish the identification of the more frequently
occurring words, and the last performs the prediction of the remaining words.
• Dictionary phase : a small dictionary containing 300 words that represent
the most frequent words in English was used. It covers not only the four main classes (noun,
verb, adjective and adverb) but also many further categories, such as pronouns,
prepositions, negatives, determiners and other closed classes. Each word of the
input text is checked against the dictionary entries. If a word is located in the
dictionary, a tag is retrieved and assigned to the word. If a word is not found
verb, adjective and adverb). At this stage the authors reported that an average of
• Morphology Phase: this phase was constructed to deal with those words not
located in the dictionary during the previous phase. A small suffix dictionary containing
63 suffixes was used in this phase to determine the grammatical class of
a word (morphological characteristics). For example, one such suffix test scans
the word for its last four letters to determine if they match the -ship suffix, the
• Ad Hoc Phase : the first two phases of the WISSYN system operate on each word
of the input sequentially as it is isolated. In other words, the context plays no role.
In this phase and the next one, the context of the remaining words is
taken into account. In this phase the WISSYN system attempts to clarify
some of those words identified in either of the first two phases but which remain
ambiguous. For example, the word that, being a function word, is in the initial
contexts (e.g., as in that dog, that the dog jumped, the dog that jumped, etc.).
The Ad Hoc Phase uses rules to determine the most likely tag. This phase can
identify eight ambiguities, including the various forms of that and of the verb
to be. The authors point out that they used the same principle as that employed by
Klein and Simmons [90] more generally, in that a specified set of frames is pro-
prepositions (e.g., in in "in the house") or as adverbs (e.g., in "come in from the cold"). The
authors reported that this stage also identifies 10% of words on average.
• Probability Phase: this phase was constructed to use a set of conditional probability
tables to predict the four main grammatical classes for those words not
tagged by the previous three phases. These probabilities were calculated from a
manually tagged corpus that contained about 28,500 words. In this phase, the
previous three tags and the following three tags of a given word were examined.
The authors state that this phase correctly tagged around 20% of words in texts.
A test set containing 1916 words was used to test the tagger system. The text was
tagged using a tag set consisting of 18 tags. As for the overall accuracy of the WISSYN
system, the authors reported that the system correctly tagged 92.8% of words.
CLAWS system
The original CLAWS (Constituent-Likelihood Automatic Word-Tagging System) sys-
Atwell E over the period 1981 to 1983 at the Unit for Computer Research on the English
Language (UCREL) at the University of Lancaster [111]. It was used to tag about
one million words of the LOB Corpus with 96-97% accuracy [103]. LOB tag set con-
CLAWS system has five phases : pre-editing, tag assignment, idiom-tagging, tag dis-
ated with pre-editing, tag assignment, idiom-tagging, and tag disambiguation phases
• WORDTAG: assigns to each word a list of all possible tags for that word by using
a knowledge base or a set of rules to deduce the candidate word classes. The WORDTAG
program has a knowledge base containing 7200 words, which are stored with
a list of their candidate tags. This program was constructed not to resolve
the lexical ambiguity of the word, but merely to assign a list of all possible tag(s)
to each word. At this stage, if the word has only one tag, then the tag associated
of more than one word). For example, "such that" is assigned one tag.
• CHAINPROBS: designed to choose one of the candidate tags for those words that still
have more than one tag at the end of the WORDTAG execution. The CHAINPROBS
program uses statistical analysis (bigram). Garside [64] estimated that 35% of
LOB corpus words had more than one tag associated with them.
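The division of labour between the two programs (first list every candidate tag, then pick one tag per word using tag-pair statistics) can be sketched as follows. This is a simplified illustration and not the CLAWS implementation: the candidate lists, the tag-pair probabilities and the floor value for unseen pairs are all invented for the example, and the exhaustive enumeration of paths is only practical for short ambiguous spans.

```python
from itertools import product

# Hypothetical candidate tags, as a WORDTAG-like step might assign them.
CANDIDATES = {"the": ["AT"], "can": ["MD", "NN", "VB"], "rusts": ["VBZ", "NNS"]}

# Hypothetical tag-pair probabilities P(next tag | previous tag).
TRANSITIONS = {
    ("AT", "NN"): 0.6, ("AT", "VB"): 0.01, ("AT", "MD"): 0.01,
    ("NN", "VBZ"): 0.3, ("NN", "NNS"): 0.05,
    ("MD", "VBZ"): 0.01, ("VB", "NNS"): 0.1,
}
FLOOR = 0.001  # assumed small probability for tag pairs missing from the table


def best_tag_sequence(words):
    """Score every combination of candidate tags and return the most probable path."""
    options = [CANDIDATES[w] for w in words]
    best_path, best_score = None, -1.0
    for path in product(*options):
        score = 1.0
        # Multiply the probabilities of all adjacent tag pairs on this path.
        for prev, cur in zip(path, path[1:]):
            score *= TRANSITIONS.get((prev, cur), FLOOR)
        if score > best_score:
            best_path, best_score = path, score
    return list(best_path), best_score


path, score = best_tag_sequence(["the", "can", "rusts"])
```

With these invented numbers the article-noun-verb reading wins because both of its tag pairs are far more probable than the alternatives. A production tagger would typically replace the enumeration with dynamic programming (for example the Viterbi algorithm) to avoid the exponential number of paths.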
CLAWS was developed based on TAGGIT, except that CLAWS adopts a statistical
technique for figuring out cases with ambiguous categories. It uses a table of prob-
of all paths for each sequence of ambiguous words and eliminate sequences with low
proportion of the tagged Brown corpus. If tagging fails in ambiguous cases, context-
dependent disambiguation is carried out based on the context frame rules of TAGGIT.
CLAWS version 2, on the other hand, was developed over the period 1983-1986 to
reduce the manual and automated pre-editing required by the system before any text
could be analysed. It differs little from CLAWS version 1. The main difference is the
automation of tag analysis itself. In addition, an extended tag set containing 166 tags
was used in CLAWS 2. Some changes were also made to the WORDTAG program used in
CLAWS 1 as part of the overall goal of removing any manual pre-editing. For example,
p.78). The current version of CLAWS (version 4) was begun in 1988 to undertake the
enormous task of tagging the 100 million word British National Corpus (BNC). In this
version of CLAWS, the authors separated the tagger from the tag set. They used the BNC
tag set 18. In addition, they added a component that enables it to handle SGML tags,
since the BNC was marked up with these tags ( [88], p.25). The authors reported that
seems that CLAWS uses a hybrid technique since it has a rule based and statistical
PARTS system
Church [45] has also implemented a statistical tagger called PARTS. It used the lexical
probability, which is the probability of observing part of speech i given word j, and the
previous parts of speech. The author calculates the product of the lexical probabilities
and the contextual probabilities for each combination of ambiguous word sequences.
The tag sequence that gets the highest probability is selected as the proper tagging
result. PARTS differs from CLAWS in terms of the statistical model used: the
former used a trigram model while the latter used a bigram model. Furthermore, PARTS
does not have a rule-based component. The author reported that PARTS achieved an
1. Rule-based Approach
Rule-based approach has some features and advantages which can be summa-
represents knowledge in form of rules rather than stored data records. The
• The language model is written from a linguistic point of view and explicitly
2. Statistical Approach
The advantages of using a statistical approach [108] can be summarized as fol-
Samuelsson and Voutilainen [58] and Chanod and Tapanainen [42] show that a rule-
based tagger for English and French respectively can achieve better results than a sta-
tistical tagger.
Some implementations combine the statistical approach with the rule-based approach to build a
hybrid POS tagger. Chanod and Tapanainen [42] developed a tagger for French that uses a
combination of statistical and rule-based approaches. Kuba et al. [26] built
a hybrid tagger for Hungarian. Schneider and Volk [126] trained the Brill tagger to
Additionally, some different approaches have been used for building text taggers.
Schmid [125], Marques and Pereira [100], and Antonio et al. [114] developed POS taggers
using Neural Networks. They train a single-layer perceptron to produce the POS
Schmid trained his tagger on 2 million words of the Penn Treebank corpus, and tested it
on 100,000 words of the corpus. Marques and Pereira trained their tagger on a very
small Portuguese training corpus (15,000 words), and tested it on 2229 words.
Antonio et al. trained their tagger on 46,461 words, and tested it on 47,397 words of the Wall
Memory-based systems are basically a form of k-nearest neighbor system in which a set
of cases (the training data) is kept in memory, and each test sample uses a distance
metric to determine which training samples are closest. The test sample is then classified
as the same class as the closest training samples. The set of cases in this approach usually
consists of a word, its preceding and following context, and the POS of that word in
the context. The author trained his tagger using a tagged corpus. To tag a new sentence,
for each word and its context, the most similar case(s) kept in memory
are selected and the POS tags are extracted from these cases. The tagger was trained on
two different training set sizes (two million words and 500,000 words). The author reported an
Decision trees have also been used to implement part-of-speech taggers. A decision tree is
a tree in which each internal node is a feature test and the leaves are the classes to be
assigned to the tested individual. The trees are constructed using statistical information.
Marquiz and Rodriguiz [101] have implemented a POS tagger using decision trees that
has been tested and evaluated on the Wall Street Journal corpus. The authors reported
Maximum Entropy, on the other hand, is another technique used for building text taggers.
This technique uses a statistical model that can be classified as a Maximum Entropy
model. It uses many contextual "features" to predict the POS tag. The Maximum Entropy
model is trained on a corpus annotated with Part-Of-Speech tags and assigns them
trained on 962687 words taken from the Wall Street Journal and it has been tested on
general and specifically POS tagging is growing significantly in recent years. The list
below summarises most of the work done in POS tagging for Arabic.
• Shereen Khoja [87, 88] describes a hybrid tagger system that uses both morphological
rules and statistical techniques in the form of hidden Markov models.
• Abuleil and Evens [15] describe a system for building an Arabic lexicon auto-
matically by tagging Arabic newspaper text using some rules and morphological
• Diab et al. [51] present a Support Vector Machine (SVM) based approach to
All the systems described above were built to tag unvocalised Arabic text. Some of
vocalised Arabic text that waits for the user of the system to either confirm and
accept the output of their system or change it. Their system used statistical techniques
(HMM) and morphological rules. A small set of words was stored in its
lexicon with their classes and subclasses as well as some inflectional features.
Morphological rules were used to remove affixes and particle words from the
testing text. The analysis result of the morphological component represents the main
class or sub-class the word belongs to and the inflectional features already
stored in the lexicon. The analysis result is passed to the user. At this stage, the user
may accept or reject the system result. If the user rejects the result, the
word is analysed once again and passed to the user. When the system completes
its analysis without an accepted result from the user, the user in this case has an
throughout the use of the system and stored later in the lexicon. This component is
used to select the tag with the highest frequency without any intervention from
the user. A testing corpus collected from the Egyptian Al-Ahram newspaper was used
Shereen Khoja [87,88] developed the APT system, which uses statistical and rule-based
approaches. In the authors' point of view, APT is the first tagger
system for Arabic, for two reasons. First, it is the first fully-automatic tagger for
Arabic. Second, the aim of this tagger is to produce a POS tagged
unvocalised Arabic corpus that may be used as a useful tool for linguistic research.
A manually tagged lexicon containing 50,000 words was used to extract several
small lexicons. A training corpus containing about 10,000 words was used to
train her tagger. The APT tagger has two main components: a rule-based component
(stemmer) and a statistical component. Figure 2.6, which is reproduced from the
original figure in ( [88], p.78), illustrates how APT performs tagging.
[Figure 2.6: The APT tagging process, reproduced from ( [88], p.78). Recoverable labels: Arabic Words, Lexicon Lookup, Stemmer, Words with unique tags, Words with multiple tags, Statistical Component]
APT performs the tagging process as follows. Each word is initially looked up
in the lexicon. If the word is found in the lexicon, it is assigned all the
possible POS tags found there. The word is then passed to the stemmer regardless
of whether it was found in the lexicon or not. The main function of the stemmer
is to remove all prefixes, suffixes and infixes to produce the root. The author
does not mention the number of strings that were used in her stemmer affix lists.
If a word could not be stemmed, and was not found in the lexicon, it is given
the main tags (noun, verb, particle, residual and proper noun). As Khoja points
out, at this point each word has at least one tag. If a word has more than one
tag, the word and its neighbors are passed to the statistical component, where
the most likely tag is selected. The APT statistical component uses contextual
and lexical probabilities to determine the most likely tag of the word. A corpus
containing 1700 words was prepared to test her tagger. The author reports that
the APT system correctly tagged 86% of the words.
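The control flow described above can be sketched as follows. This is only an illustrative outline under stated assumptions, not Khoja's implementation: the function names are hypothetical, the stemmer is modelled as a callable `stemmer_tags` that returns whatever tags stemming yields (the text does not say how the stemmer output is mapped to tags), and the statistical component is passed in as `pick_most_likely`.

```python
# Main tags given to a word that is neither in the lexicon nor stemmable.
MAIN_TAGS = ("noun", "verb", "particle", "residual", "proper noun")


def candidate_tags(word, lexicon, stemmer_tags):
    """Collect the candidate tags for one word."""
    tags = set(lexicon.get(word, ()))   # lexicon lookup: every tag listed for the word
    tags |= set(stemmer_tags(word))     # the stemmer runs whether or not the lookup succeeded
    if not tags:                        # not found in the lexicon and not stemmed
        tags = set(MAIN_TAGS)
    return tags


def tag_words(words, lexicon, stemmer_tags, pick_most_likely):
    """Tag each word; words with several candidates go to the statistical component."""
    tagged = []
    for i, word in enumerate(words):
        tags = candidate_tags(word, lexicon, stemmer_tags)
        if len(tags) == 1:
            tagged.append((word, next(iter(tags))))
        else:
            # The statistical component sees the word together with its neighbours.
            tagged.append((word, pick_most_likely(words, i, tags)))
    return tagged
```

Passing the two components in as callables keeps the sketch independent of any particular stemmer or probability model.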
Alshamsi and Guessoum [127] presented a Part-of-Speech (POS) tagger for Arabic.
The POS tagger resolves Arabic text POS tagging ambiguity through the
Markov Model (HMM). The main goal behind the development of their POS
tagger is to use it for Named Entity extraction. The input of the tagger is noun
Like Khoja's work, their system has two main components: a stemmer and a
statistical component. The authors used Buckwalter's stemmer to stem the training
data. A training corpus of Arabic news articles containing 27594 nouns, 23554
verbs, 5722 adjectives and 5384 proper nouns was used. The training corpus
During their tagging process, after the tokenizer converts the original input text
into a list of words using the space as a delimiter, the resulting list is passed to
the stemmer. A trigram language model was constructed, and its trigram
probabilities were used to build their HMM. Each word with more than one tag was
tagged by calculating the lexical and contextual probabilities. A test corpus
containing 944 words was used to test their tagger system. The authors report that
their tagger achieved an accuracy of 97%. This high level of accuracy is
surprising, given that they have a small training corpus. However, as the authors
point out, they are in the process of enlarging the size of their training corpus to
tagger was still in early development. The architecture of the tagger is based on
three tiers: the client tier, the middle tier, and the database tier.
The client tier is a web browser which sends the user's message to the web
server and displays the returned results to the user. The middle tier consists
of a web server, a scripting engine and an NLP module which is responsible for
analysing the Arabic documents. The third tier consists of an SQL server
The tagger used the Buckwalter dictionary and his morphological analyser [11]
These documents were translated into XML format to test the tagger. The user
can write a sentence and pass it to the tagger. Each word in the sentence is looked
up in the dictionary, analysed and segmented into prefix, stem, and suffix. The
result returned to the user contains all possibilities for the word. The authors
did not mention the tag set they used or the accuracy their system achieved.
However, based on their system snapshot, it seems they used
ger developed by Daelemans et al. [48] to produce a POS tagger for Arabic.
Memory-based tagging is based on the idea that words occurring in similar con-
They used the Arabic Treebank-1 corpus and the LDC tag set. Their training corpus
contains 150,966 words. The test set contains 15102 words, with 947 words do
The MBT tagger has three modules. The first is a lexicon module, which stores the
possible tags of every word occurring in the provided training corpus. The second
module generates two distinct taggers: one for known words and the other for
unknown words. The known-word tagger uses a lexicon, while the unknown-word tagger
attempts to derive as much information as possible from the surface form of the
The authors report that the tagger, using the first two modules, assigns 91.9% of
the tags in the test corpus correctly. They state that on the 14155 known words
in the test set the tagger attains an accuracy of 93.1%, while on the 947 unknown
words the accuracy is considerably lower: 73.6%. The third module of their tagger
was designed to improve the precision and recall of their system. The tagger
integrated with morphological analysis was also built
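As a quick consistency check on the reported figures, the overall accuracy should equal the average of the known-word and unknown-word accuracies weighted by the number of words in each group. The short sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# Word counts and accuracies as quoted in the text.
known_words, unknown_words = 14155, 947
acc_known, acc_unknown = 0.931, 0.736

total_words = known_words + unknown_words

# Weighted average of the two accuracies.
overall = (known_words * acc_known + unknown_words * acc_unknown) / total_words

print(total_words)               # size of the test set
print(round(100 * overall, 1))   # overall accuracy in percent
```

The two counts add up to the 15102 words of the test set, and the weighted average rounds to the reported 91.9%.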
duced by LDC and used for POS tagging Arabic text. The author used three
1. Prefixes lexicon contains 299 entries in the first release, 548 entries in the
second release.
2. Suffixes lexicon contains 299 entries in the first release, 906 entries in the
second release.
3. Stems lexicon contains 82,158 entries in the first release, 78,839 entries in
and prefix-suffix combinations. The data is written using his Arabic translitera-
tion system 19 instead of original Arabic script. The author Morphology Analysis
Each input word is segmented into three elements: prefix, stem and suffix. Each
element is looked up in its respective lexicon. If all three word elements (prefix,
stem, suffix) are found in their respective lexicons, then the compatibility
tables are used to determine whether they are compatible or not. Three
logical category of the stem? (i.e., is the combination pair found in the list
2. if so, is the morphological category of the prefix compatible with the
morphological category of the suffix? (i.e., is the combination found in the list
3. if so, is the morphological category of the stem compatible with the
morphological category of the suffix? (i.e., is the combination found in the list
If the answer to the last question is "yes" then the morphological analysis is
valid. The morphological analyser produces all the variations of the input
word, including the short vowels and diacritics. The POS tag (which is stored
in the lexicons) for each variation is also produced. However, to those who are
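The segmentation and compatibility checks described above can be sketched as below. This is a minimal illustration, not Buckwalter's code: the lexicon entries, the category names and the three compatibility sets are toy data invented for the example, and the words are given in transliteration.

```python
# Toy lexicons mapping a string to a (hypothetical) morphological category.
PREFIXES = {"": "NoPrefix", "Al": "DefArticle"}
STEMS = {"ktAb": "Noun", "ktb": "Verb"}
SUFFIXES = {"": "NoSuffix", "a": "PerfectEnding"}

# Toy compatibility tables: the pairs of categories that may combine.
PREFIX_STEM = {("NoPrefix", "Noun"), ("DefArticle", "Noun"), ("NoPrefix", "Verb")}
PREFIX_SUFFIX = {("NoPrefix", "NoSuffix"), ("DefArticle", "NoSuffix"),
                 ("NoPrefix", "PerfectEnding")}
STEM_SUFFIX = {("Noun", "NoSuffix"), ("Verb", "NoSuffix"), ("Verb", "PerfectEnding")}


def segmentations(word):
    """Yield every (prefix, stem, suffix) split of the word with a non-empty stem."""
    n = len(word)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield word[:i], word[i:j], word[j:]


def analyse(word):
    """Return the splits whose parts are all in the lexicons and pass the three checks."""
    valid = []
    for prefix, stem, suffix in segmentations(word):
        if prefix in PREFIXES and stem in STEMS and suffix in SUFFIXES:
            p, s, x = PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
            # Prefix-stem, then prefix-suffix, then stem-suffix compatibility.
            if (p, s) in PREFIX_STEM and (p, x) in PREFIX_SUFFIX and (s, x) in STEM_SUFFIX:
                valid.append((prefix, stem, suffix))
    return valid
```

With this toy data, `analyse("AlktAb")` accepts the split into the article and a noun stem, while `analyse("Alktba")` returns no analysis because the article is not listed as compatible with a verb stem.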
This chapter contained a description of the POS tagging problem and the NLP
applications that may use POS tagger systems as their first stage. We defined the
concepts of corpus linguistics and the POS tag set, and discussed the previous work
on both. In addition, the different approaches used to solve the problem have been
examined, and the previous work on POS tagging for English and Arabic has been
explored.
This work employs the rule-based approach. The AMT tagger presented in this work
has two main rule components: pattern-based rules and lexical and contextual
of patterns instead of using a manually tagged lexicon or a training corpus which
contains a set of Arabic words. The triggers in the pattern-based rules depend on
the patterns of the text words. A novel algorithm to match each Arabic word in the
testing corpus with its correct pattern in the lexicon of patterns has also been
built. In addition, a small number of hand-written rules and constraint rules have
been used to assist the main technique in assigning the correct tag to those words
not tagged by the pattern-based technique.
The next chapter will cover some of the basics of the Arabic language and describe
the diacritics feature in Arabic and its importance in Arabic POS tagging. The POS tagger
Chapter 3
3.1 Introduction
The Arabic language is the most prominent member of the Semitic language family,
which also includes Hebrew, Amharic, Maltese and Syriac. These languages all share
a pattern-based morphological system: words are built from a root, usually
consisting of three consonants, and a pattern structure. The root gives the basic
lexical meaning of the word, while the pattern consists of vowels and signals the
grammatical significance of the word 1. Bar-Haim et al. state that:
"Semitic languages have rich inflectional systems and a template-based derivational
morphology, which are manifested in a large variation of word forms" ([118], p. 4).
Arabic is considered the most widely used member of the Semitic languages. It is
spoken by more than 300 million Arabs around the world, and it is also understood
by more than 1.1 billion other Muslims. It has been a literary language since
the 6th century A.D., and is the liturgical language of Islam in its classical
form [70]. Arabic words, like words in other Semitic languages, are written with
consonants. The Arabic language has several varieties: Classical Arabic, Modern
Standard Arabic (MSA) and Colloquial (spoken) Arabic. Classical Arabic is the
language of the Qur'an
Islamic world. Modern Standard Arabic (MSA) is the language of the media,
education, and formal communication, which is understood by all Arabic speakers.
Colloquial (spoken) Arabic comprises the local dialects of people throughout the
Arab world [134].
The principal script used for writing the Arabic language is the Arabic alphabet,
which is composed of 28 letters. Writing in Arabic is unicase: the distinction
between upper-case and lower-case letters does not exist. Furthermore, a cursive
system running from right to left is used in written Arabic 2. The transliteration
system for the Arabic alphabet and the other diacritical marks used in this thesis
is described in
1 For more information: www.a-z-dictionaries.com/language/Arabic_dictionaries.html
2For more details:!users/sbettlarabic.htm
While the Arabic alphabet was originally used to write the Arabic language, it has
been adopted by other groups to write their own languages, such as Persian, Pashto
and Urdu. A letter in the Arabic language is written in multiple forms, depending
on where in the word it appears. It may appear at the beginning of a word (initial
form), anywhere other than the beginning or the end of a word (medial form), or at
the end of the word (final form) 3. For example, in the Arabic words (مدارسَ, mdArsa,
"schools"), (سمعَ, smEa, "to hear") and (حلمَ, hlma, "to dream"), the letter م, m
(miim), appears in initial, medial and final forms, respectively.
In the Arabic language, a word may be an original word or an Arabized word.
Original words have two subcategories, Derivative Arabic words and Fixed Arabic
words, while Arabized words are nouns borrowed from foreign languages [56].
Derivative Arabic words, which belong to the verb and noun classes, are built from
the same root and obey the Arabic derivation rules [72]. For example, the words
مكتب, mktb, "office", كتاب, ktAb, "book", كتبَ, ktba, "he wrote",
The Arabic script, like the Latin script, was derived from the first alphabet,
which was created by the Phoenicians in 1300 B.C. The Phoenician script comprises
22 letters, as shown in Figure 3.1, and is written from right to left without
capital letters. Since the Phoenicians were living in Lebanon, Palestine and Syria
(the Middle East area), their script was born in Lebanon.
[Figure 3.1. Row labels: Modern Latin; Early Latin; Early Greek; Early Aramaic;
Early Arabic]
Later, the Aramaic alphabet originated from the Phoenician one in 1000 BC. The
Nabataean script was then born in the city of Petra, north of the Red Sea in
Jordan, in 100 BC, and spread all over the Middle East. The early Arabic alphabet
was created in Kufa (Iraq) in the middle of the first century. The old Arabic
alphabet consisted of around 17 letter forms without dots or diacritical marks.
The calligraphic styles for the old
With the birth of Islam, the Quran was written in the Quranic kufi script. Since
dots and vowels are not indicated in the old Arabic script, several letters of the
Arabic alphabet share the same shape; for example, the letters ب, ت and ث have the
same shape (without dots), which led to confusion for Quranic readers. Since the
Quran became the reason to reform all the Arabic scripts found in Arabia on the
one hand, and the number of non-Arab Muslims increased on the other, some reform
was needed to avoid confusion and to facilitate the reading and learning of Arabic.
5 Figure 3.1 taken from: http://29letters.wordpress.com/2007/05/28/arabic-type-history/
The first system for developing the old Arabic script was invented by Abul Aswad
al Duali (688 AD), who placed large colored dots to help with pronunciation.
Later, a uniform system for distinguishing letters by using dots (still in current
usage) was developed by Al Hajjaj ibn Yusuf al Thaqafi. Lastly, Al Khalil ibn
Ahmad al Farahidi
Using the dot system, one, two, or three dots were added to letters with similar
phonetic characteristics. This gives a total of 28 letters, including three long
vowels. This unified, well-structured Arabic script was developed for writing the
holy scripts of the Quran in the 7th century, together with the development of
calligraphic styles. Later the Quran was written in the Quranic Naskh style 6.
On the other hand, the Phoenician alphabet was used as a model by the Greeks, who
added letters for vowels. Afterwards the Greek model became the model
The Arabic language has two types of vowels: long and short. The long vowels are
three letters that form part of the Arabic alphabet 8. The short vowels are three
small vowel marks (see Table 3.1), which do not form part of the Arabic letters.
These marks are placed above or below the Arabic letter.
Fatha represents the sound of /a/ in bag, damma represents the sound of /u/ in put
Moreover, there are five other diacritical marks 9. Three of them, as shown in
Table 3.2, are called nunation (Tanween Fath, pronounced /an/; Tanween Damm,
pronounced /un/; Tanween Kasr, pronounced /in/). Nunation is the doubling of the
short vowels used at
Mark above the letter Mark above the letter Mark below the letter
Finally, the last two marks in use are sukun (absence of a vowel) which means that the
consonant is not followed by a vowel and gemination (Shadda) which means a dupli-
we use the term diacritics to represent all marks (including short vowels marks)
In the Arabic language, diacritics are used in the text of the Qur'an, in other
religious texts, in classical poetry, in textbooks for children and foreign
learners, and in complex texts to avoid ambiguity. The diacritical marks may be
assigned to each character of the Arabic word; in this case, the Arabic word is
called fully-vocalised. When the diacritical marks are assigned to most letters of
the word, but not to each, the Arabic word in this case is
marks assigned to one or maximum two letters in the word [56]. Table 3.4 shows an
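The levels of vocalisation described above can be illustrated with a small sketch that counts how many letters of a word carry a diacritical mark. It assumes the word is stored as Unicode text in which the marks are the combining characters U+064B to U+0652. The function names are illustrative, and the label "mostly vocalised" is only a placeholder for the intermediate level, which this sketch does not attempt to name authoritatively.

```python
# Arabic diacritical marks: the three nunation marks, the three short vowels,
# shadda and sukun (Unicode U+064B .. U+0652).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def count_marked_letters(word):
    """Return (number of letters, number of letters carrying at least one mark)."""
    letters = marked = 0
    previous_is_unmarked_letter = False
    for ch in word:
        if ch in DIACRITICS:
            if previous_is_unmarked_letter:
                marked += 1                          # count each letter once, even with two marks
                previous_is_unmarked_letter = False
        else:
            letters += 1
            previous_is_unmarked_letter = True
    return letters, marked


def vocalisation_level(word):
    """Classify a word by how many of its letters carry a diacritical mark."""
    letters, marked = count_marked_letters(word)
    if marked == 0:
        return "unvocalised"
    if marked == letters:
        return "fully-vocalised"
    if marked <= 2:
        return "partially-vocalised"
    return "mostly vocalised"  # placeholder label for the intermediate level


DHB = "\u0630\u0647\u0628"   # the three letters transliterated dhb
FATHA = "\u064E"
print(vocalisation_level(DHB))          # no marks at all
print(vocalisation_level(DHB + FATHA))  # a single mark on the last letter
```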
The basis of POS tagging is that many words are ambiguous regarding their
grammatical category [109]. For instance, the word ذهب, dhb, in the unvocalised
Arabic sentence presented in Table 3.5 10 (which means either "has gone" or
"gold") can be a verb or a noun. Due to the fact that the sentence is unvocalised,
this lexical ambigu-
10 The tags used in the sentence presented in Table 3.5 are described in more detail in Chapter 4.
task, because it is very difficult to predict the semantic meaning with the missing dia-
critics (at least one diacritical mark) in Arabic text. Furthermore, removing the ambi-
statistical technique, which still suffers from many disadvantages, including the
need for a huge manually tagged lexicon or training corpus, the inability to deal
with unknown words and
Arabic NLP tasks, including parsing [95]. The use of diacritics in Arabic texts is
extremely important. The list below summarises the importance of using diacritics
in the Arabic language:
meaning of words. For example, adding the short vowel (Fatha mark) to the last
letter of the word "ذهب" presented in Table 3.5, so that it becomes "ذهبَ", causes the
2. Determining the correct POS tag to the words in the sentence. For example, the
other words, and determining the syntactic position of the word in the sentence.
For example, short vowels are used to indicate mood, aspect and voice endings for
verbs and case endings for nouns.
4. Indicating the correct pronunciation of words and the correct syntactical
analysis, which leads to reducing problems for NLP applications such as
text-to-speech or
The above list shows that using the diacritics in text is important to differentiate the
word from other words and determine the syntactic position of the word in the sentence
such as nominative, accusative, and genitive.
In addition, these diacritic marks determine the inflectional features 11 of the sentence
words, such as, gender, person, number, noun case, and verb mood.
In the above sentence, it is very difficult to determine the inflectional features for the
word w fa>, "attended" with the diacritics missing, especially the last diacritical mark
(ending case). Neither the context nor the word itself can provide any information on
inflectional features for such a word. Thus, the last diacritical mark helps not only in
determining the correct part-of-speech (general tag) of the words in the sentence, but
11 Arabic Inflectional Features described in chapter 4, Section 4.2.
also in providing full information regarding the inflectional features for the sentence
The possible last diacritical mark ( case ending) of the word w~ and the inflec-
Table 3.6: The possible last diacritical mark (case ending) of the word w~
The correct tags of the sentence presented in Table 3.5, where the suitable
diacritical mark has been added to the last letter of every word in the sentence,
are shown in Table 3.7.
Arabic word    Transliteration    POS tag
ذهبَ            dhba               VePe
الطالبُ         AltAlbu            NuCnNm
مسرعاً          msrEAF             NuAj
Translation: The student has gone quickly
Table 3.7: Partially-vocalised Arabic sentence and its correct POS tag
A word can be defined as something that is uttered, intelligible, and has a full meaning
[69]. According to Arab grammarians, words in Arabic are classified into three Part-
of-Speech categories: Verb, Noun, and Particle. Each category has its meaning and its
3.4.1 Verb
The category of verb is defined as a word denoting an action and may be combined with
the particle [134]. In Arabic, traditionally, two verb forms are recognised; the Perfect
(past) and Imperfect (present). The third form, the Imperative, has been considered as
a variant of Imperfective by Arab grammarians. Each form has its distinguishing signs.
• Perfect Verb
The perfect verb indicates a state or a fact in the past [76]. It follows the
pattern of the root (ground form) 12 فعلَ, fEla, "do". For example, the root كتبَ,
ktba, "wrote", has the basic meaning of writing. It can be suffixed with many
letters, for instance the letter ت, taa. The suffix adds further inflectional
features to the word, such as person, gender, number, and mood. For example, the
words كتبتُ, ktbtu, "I wrote" (first person, masculine), كتبتَ, ktbta, "you wrote"
(second person, masculine), كتبتْ, ktbtx, "she wrote" (third person, feminine) and
The above example shows that adding the diacritical mark on the last letter of
the word helps not only in determining the lexical category of the word, but also
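The point made by the three examples above, that the mark on the final taa selects the person and gender of a perfect verb, can be sketched as a small lookup. This is only an illustration built from those three examples: it assumes Unicode text with the final mark stored as the last character, the function name is hypothetical, and endings not listed in the examples are simply reported as unknown.

```python
DAMMA, FATHA, SUKUN = "\u064F", "\u064E", "\u0652"
TAA = "\u062A"

# Final mark on the suffix taa -> reading, as in ktbtu / ktbta / "she wrote".
TAA_ENDINGS = {
    DAMMA: "first person",
    FATHA: "second person, masculine",
    SUKUN: "third person, feminine",
}


def perfect_taa_reading(word):
    """Return the reading selected by the mark on a final taa, or None if not covered."""
    if len(word) >= 2 and word[-2] == TAA and word[-1] in TAA_ENDINGS:
        return TAA_ENDINGS[word[-1]]
    return None


KTBT = "\u0643\u062A\u0628\u062A"        # the letters k, t, b, t without any marks
print(perfect_taa_reading(KTBT + DAMMA))  # reading of ktbtu
print(perfect_taa_reading(KTBT))          # no final mark: the form stays ambiguous
```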
• Imperfect Verb:
The imperfect verb expresses an action still unfinished at the time to which
reference is being made [76]. It can be prefixed with one of the following four
letters (called the letters of the present): أ, ي, ن, ت. For example, the words
أكتبُ, Aktbu, "I write", يكتبُ, yktbu, "he writes", نكتبُ, nktbu, "we write", and
تكتبُ, tktbu, "she writes". In addition, the imperfect verb can accept a particle. For
12 The ground form and the derived forms are described in more detail in Chapter 5.
The imperative verb indicates an action demanded to be carried out in the future
[76]. It always comes in the second person. For example, the word اكتبْ, 'aktbx,
"write!". Like the perfect and imperfect verbs, the imperative verb can be suffixed
with the letters ي, yaa, ا, Alif, ن, nuun, and و, waa, to represent the
inflectional features of the word (see Table 3.8).
3.4.2 Noun
The category of noun is defined as a word denoting an essence and may be combined
with an article [134]. In Arabic, a noun has no temporal aspect. As Arab grammarians
described, a noun has a set of signs that are used to distinguish it from verbs and
1. Kasra mark
A noun can receive a kasra vowel mark when it is in the genitive case. In Arabic,
words which belong to the verb category never receive a kasra mark. For
example, مكتبِ, mktbi, "office".
2. Nunation mark
In the Arabic language, neither the verb nor the particle receives any nunation
mark. A nunation mark appears only on the final letter of an Arabic word which
belongs to the noun category. These marks indicate that the word is indefinite.
For example, كتابٌ, ktAbun, "book".
3. Vocatives
However, it is important to note that it is not necessary to find one or all of
these signs in order to define a word as a noun. For example, in the sentence
كتب مدرس الحاسوب الدرس, "the computer teacher wrote the lesson", the word مدرس,
"teacher", is a noun, and none of the above signs is used to distinguish this
word. In this case we use the pattern of this word to distinguish it [69].
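The first two signs can be expressed as a simple check on the final mark of a word. The sketch below is an illustration only, not a rule taken from the AMT system: it assumes Unicode text in which the last character of the word is its final diacritical mark, and the function name is hypothetical. As the paragraph above stresses, the absence of a sign does not show that a word is not a noun.

```python
KASRA = "\u0650"
NUNATION = {"\u064B", "\u064C", "\u064D"}  # tanween fath, tanween damm, tanween kasr


def shows_noun_sign(word):
    """True if the word ends in a kasra or a nunation mark.

    A False result is inconclusive: a noun may carry none of these signs,
    in which case another cue, such as its pattern, is needed.
    """
    last = word[-1:]
    return last == KASRA or last in NUNATION


MKTB = "\u0645\u0643\u062A\u0628"        # letters of mktb, "office"
KTAB = "\u0643\u062A\u0627\u0628"        # letters of ktAb, "book"
KTB = "\u0643\u062A\u0628"               # letters of ktb

print(shows_noun_sign(MKTB + KASRA))     # final kasra
print(shows_noun_sign(KTAB + "\u064C"))  # final tanween damm
print(shows_noun_sign(KTB + "\u064E"))   # final fatha: no noun sign found
```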
3.4.3 Particle
The category of particle includes the remaining words. Particles are used to assist
other words in their functions in the sentence [134]. In Arabic, the particle does
not accept any of the signs that distinguish between nouns and verbs [69]. An example
"Hebrew shows similarity to Arabic in terms of its grammatical constituents of verbs, nouns,
and particles. The Hebrew nouns can certainly be in the genitive position, mimated (instead
of nunated), defined by ה (instead of ال), and be predicated in the same way as in Arabic.
Nevertheless, when contrasted to Arabic, Hebrew enjoys a less complicated particle system"
([69], p. 24).
phology and Syntax. The former is the study of the form of the word, while the
latter is the grammatical arrangement of words in the sentence. Arabic morphology
in turn has two subcategories: Derivational, how words are formed, and
Inflectional, how words interact with syntax, such as singular, dual and plural [120].
[Figure. Tree labels: Morphology (Derivational, Inflectional); Syntax]
In Arabic morphology, word formation is based on a root [138]. Many affixes can be
attached to the root to form Arabic words. Arabic morphology consists of a system
of consonant roots which interlock with other consonants and vowels to form word
stems. The stem is formed by substituting the characters of the root into certain
verb forms [120].
A great number of other forms can be derived from the ground form (root) by
inserting a long vowel, lengthening the medial letter of the root, and/or adding
consonantal prefixes to produce a new word with a new meaning that still shares the
basic meaning of the root [138]. For example, the root "كتب", ktb, has the basic
meaning of writing. The root may be conjugated in many forms 13. Samples of the
words that can be formed and derived from the same root "كتب", ktb, are shown in
Tables 3.9, 3.10 and 3.11.
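Substituting the consonants of a root into a pattern can be sketched in a few lines. The notation below is an assumption made for illustration: a pattern is written in transliteration with the digits 1, 2 and 3 marking the slots for the first, second and third consonants of the root. The derived words are the ones quoted in this chapter.

```python
def apply_pattern(root, pattern):
    """Fill the slots 1, 2, 3 of a pattern with the consonants of a three-letter root."""
    if len(root) != 3:
        raise ValueError("expected a root of three consonants")
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in pattern)


# Forms of the root ktb mentioned in the text, in transliteration.
print(apply_pattern("ktb", "123a"))    # ktba, "he wrote"
print(apply_pattern("ktb", "12A3"))    # ktAb, "book"
print(apply_pattern("ktb", "m123"))    # mktb, "office"
print(apply_pattern("ktb", "m12w3p"))  # mktwbp
```

Reading the same table in the other direction, from a word back to its root and pattern, is the harder matching problem.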
Arabic words are modified not only by number, person, gender and tense, but also by
case and mood, definiteness and indefiniteness [22]. According to Arab grammarians,
from every verb, a verbal noun (Infinitive), a noun of time, an adjective noun, a noun
of place, diminutive noun, an instrument noun, a present (active) participle and past
13 For more information: http://wahiduddin.net/words/arabic_glossary.htm
Arabic word    Transliteration    Meaning
مكتوبة          mktwbp             letter
كتابٌ           ktAbun             book
مكتبة           mktbp              office
كتيبٌ           kutybun            booklet
Table 3.11: Samples of additional forms such as verbal, diminutive, Adjective nouns
created from the same simple root كتب
As Arab grammarians described, there are two types of sentences 14: verbal and
nominal sentences.
• Verbal sentence
A verbal sentence is simply one which begins with a verb followed by a subject.
The verb in a verbal sentence is always in the singular form, where the subject may
The above sentences may be translated into English word for word as "wrote the
student(s) the lesson", but they really mean "the student(s) wrote the lesson". The
underlined words in the above sentences represent the subject in each sentence.
The subject in sentences 1, 2, and 3 is singular, dual and plural, respectively,
where the
• Nominal sentence
A nominal sentence is one which begins with a noun or subject. The verb in an
Arabic nominal sentence must agree with the subject in number and gender as
1. الطالب كتب الدرس
2. الطالبان كتبا الدرس
3. الطلاب كتبوا الدرس
The underlined words in the above nominal sentences" the student(s) wrote the lesson
" represent the verb in each sentence. The verb is changed to agree with the subject in
The above two types of sentences, which are VSO 15 and SVO respectively, are viewed
as being independent, and neither of them is derived from the other. However, the
Arab grammarians assumed that the subject never precedes its verb, and take VSO as the
This chapter gave a brief overview of the Arabic language and its script. The
diacritic feature and its importance in reducing lexical ambiguity and providing
more semantic information about the words of a text were also addressed. Arabic,
like other Semitic languages, is based on the fact that words are derived
morphologically from roots. Many words are derived with a new meaning that still
shares the basic meaning of the root. The application to the root of a large number
of morphological patterns determines the categorical
All Arabic words can theoretically be reduced to roots. Deducing a root from the
pattern, and deciding which pattern has been imposed on the root, is a prerequisite
skill for using an Arabic dictionary. According to Arab grammarians, there are
three major parts of speech: verb, noun, and particle. Arabic not only has a
complex morphological system but also exhibits a highly inflectional system. In the
next chapter, the Arabic inflectional features will be described in more detail,
together with the tag set which consider
Chapter 4
Atwell [30] presented a number of criteria to take into account while developing a
POS tag set. These criteria were taken into account when developing the tag set.
be chosen in such a way that makes it easy for the user to remember the classes
of the text. Since producing a tagged corpus where the text has been enriched
anticipated outcome of the AMT tagger presented in this work, the tag names have
been chosen to help linguists and NLP developers to remember the lexical class
of each word. For example, Ve for verb, Nu for noun and Pr for particle.
The tag set developer should take into account that the tag set should cover
aspects of the theory of the language and the characteristics of that language
(i.e., inflectional features). The developed tag set presented in this chapter
follows the Arabic grammatical system and is based upon the three main POS classes
(verb, noun, particle); these tags are enriched with inflectional features [24].
Usually, the lexical classes are defined in terms of paradigmatic forms (a
representative set of the inflections of a noun, verb, etc.) and syntagmatic
functions (the syntactic functions of the words). Since the short vowels and other
diacritical marks are available in our testing corpus, these vowels can encode the
grammatical class or feature information [30]. This criterion is taken into account during
4. Idiosyncratic words
Arabic like any other language has a number of words with special idiosyncratic
behavior. These words do not have patterns to follow, such as words belonging
to a particle class. Similarly, the English language has a number of words with
special idiosyncratic behavior. These words do not fit into traditional parts of
speech. For example, Brown and LOB tag sets analysed "a" as article tag AT,
but UPenn tag set analysed it as determiner DT. Our developed tag set analysed
these words based on their roles in the text as Arab grammarians classified these
5. Categorisation problem
The vowels (last diacritical mark) in our testing corpus add more linguistic in-
formation and reduce the ambiguity in categorising words. Most tags in the
developed tag set are detailed tags. Each tag being defined clearly and unam-
that all the words in the testing corpus can be tagged consistently.
locating an untagged input text and identifying words, punctuation marks, numbers
and other marks. Some words need a combined tag. For example, the word ويكتبُ,
wyktbu, "and he is writing", has the following tag: PrCo+VePiMaSnThDc. This issue
has been taken into account when developing our tag set.
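A combined tag of this kind can be taken apart mechanically. The sketch below assumes, based only on the tags shown in this thesis (for example VePe, NuCnNm and the combined tag above), that the parts of a combined tag are joined with "+" and that every code within a part is exactly two letters long. Only the three main-class codes Ve, Nu and Pr are interpreted; the function names are hypothetical.

```python
# Main-class codes given in the text.
MAIN_CLASS = {"Ve": "verb", "Nu": "noun", "Pr": "particle"}


def split_tag(tag):
    """Split a combined tag on '+' and cut each part into two-letter codes."""
    parts = []
    for part in tag.split("+"):
        if len(part) % 2:
            raise ValueError(f"odd-length tag part: {part!r}")
        parts.append([part[i:i + 2] for i in range(0, len(part), 2)])
    return parts


def main_classes(tag):
    """Return the main POS class of each part of a (possibly combined) tag."""
    return [MAIN_CLASS.get(codes[0], "unknown") for codes in split_tag(tag)]


print(split_tag("PrCo+VePiMaSnThDc"))
print(main_classes("PrCo+VePiMaSnThDc"))
```

For the example tag this yields one part for the attached particle and one for the verb, each beginning with its main-class code.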
It may appear in some proper nouns (e.g, Ala' a Alddin) but is treated as one word
application for teaching purposes, the tags in the developed tag set have been
The main technique in this work is based on the pattern of the word. It is
interesting to note that this technique may also be valid for other Semitic
languages, especially Hebrew, since this language, like Arabic, has a morphological
system based on a root and a pattern structure on the one hand, and has the
diacritic feature on the other. However, the tag set presented in this work is
based upon the three main POS classes, their sub-classes and inflectional
morphology. Thus, the guiding principle was compatibility with the Arabic grammar
tradition [30].
EAGLES guidelines outline a set of features for tagsets 1 ; these guidelines are
designed to help standardise tagsets for what were then the official languages
attribute-value pairs (e.g. Gender is an attribute that can have the values
Masculine, Feminine or Neuter) [74]. Arabic has its own structure, features (e.g.
diacritics) and linguistic attributes (e.g. dual number and jussive mood), which
make this language different from the languages for which EAGLES was designed [89].
In addition, there are other differences in the order of the constituents within
the sentence. For example, in Arabic, adjectives follow the noun which they modify
[58]. Although some classes from traditional Arabic linguistics and grammar are not
compatible with the EAGLES guidelines, some of the English translations of class
and feature names used in the developed tag set were drawn from standard
terminology found in the EAGLES guidelines [30].
oped to cover written Arabic text. The corpus in this work is a partially-vocalised
Arabic corpus that contains written Arabic text. It does not contain spoken text.
The tags were developed with a good level of granularity, leading to cover all the
sub-classes of the three main POS classes used in Arabic grammar. Each tag is
enriched with inflectional features, which seem to help the linguists to develop
developed tag set contains 161 2 detailed tags, 101 nouns, 50 verbs, 9 particles,
inflectional system. Therefore it is natural that the tag set should be a richly
articulated tag set, providing distinct codings for all classes of Arabic words.
At the same time, as Elworthy [59] points out, if all of the syntactic variations
which are realised in the inflectional system for highly inflected languages, such
as Arabic or Hungarian, were represented in the tag set, there would be a huge
simple tagger.
cal information, such as gender, tense, number or person3 • Arabic is a highly inflected
ogy is used to express grammatical relations between words in sentence [16]. The list
4.2.1 Gender
Nouns and verbs in Arabic are morphologically marked for the inflectional feature
"Gender". Arabic has two genders: masculine and feminine. Like English, male per-
sons are masculine, female persons are feminine, but things may be masculine or femi-
nine. For example, in English, gender is indicated in the third person singular personal
pronouns as the feminine "she", the masculine "he" , and the neuter "it". The personal
pronoun "it" can refer to certain creatures of either sex (baby, cat) and to sexless things
(beauty, book) [21].
In Arabic, a word such as فريقٌ, fryqun, "team", may refer to either gender
(masculine or feminine). Nouns in Arabic may be recognised as feminine singular
nouns by their grammatical form, for example nouns ending with ة (Ta Marbota), such
as جنةٌ, jntpun, "garden", or ending with اء, such as صحراء, ShrAp, "desert". Also,
nouns may be recognised as feminine plural nouns, which are formed by adding the
suffix ات, such as
the suffix ونَ or ينَ, such as مدرسونَ, mdrswna, or مدرسينَ, mdrsyna, "teachers".
In terms of Arabic verb, since the verb in Arabic is a combination of a verb and a
such as, gender, number, person and mood marker. In general, gender terms and forms
in Arabic as well as English do not always refer to biological gender [21,61]. How-
ever, the inflectional feature gender in our tag set has been classified into three genders:
4.2.2 Number
In Arabic, number is the inflection feature governing nouns and verbs. Unlike English,
Arabic has three forms of number: singular, dual and plural. Singular denotes only one,
dual denotes two individuals of a class or a pair of anything and plural denotes three
or more [21]. The dual is formed by adding the dual suffix ان or ين. For example, the
words ولد, wldun, ولدان or ولدين, wldAni or wldyni, and أولاد, OwlAduN, which mean
"a boy", "two boys" and "boys", indicate singular, dual, and plural respectively.
4.2.3 Person
In Arabic, only verbs and personal pronouns inflect for three persons: the speaker (first
person), the person spoken to (second person), and the person spoken about (third
person). The first person in the singular denotes the speaker. In the plural it denotes
the speaker plus anybody else, one or more. The second person denotes the person
or persons spoken to. The third person denotes those other than the speaker or those
spoken to [132]. For example, the personal nouns أنا, AnA, "I", أنت, Anta, "you" and
هو, hwa, "he" indicate first, second, and third person respectively.
4.2.4 Mood
Arabic Verbs have three moods: Indicative, Subjunctive and Jussive (Imperative). The
mood markers are often short vowel marks placed at the end of the word (suffixes) such
as fatha, damma and kasra or the sukun mark. For example, damma /u/ for indicative and
fatha /a/ for subjunctive. On the other hand, mood may be determined by particles
which govern or require a certain mood [120]. For example, the negative particle لم, lm,
requires the jussive mood on the following verb, as in the words لم يكتب, lm yktbx,
"does not write". The mood of the verb word يكتب is jussive.
4.2.5 Case
In Arabic the term "case" refers to inflectional marking. Arabic nouns have three cases:
nominative, accusative and genitive. They indicate the syntactic function of the word
and its relationship with other words in the sentence (e.g. singular, dual, masculine
plural, feminine plural forms take special case endings) [120]. These cases are
indicated by short vowel marks placed at the end of the word (suffixes). For example,
the words الدرسَ, Aldrsa, الدرسُ, Aldrsu and الدرسِ, Aldrsi which mean "the lesson",
4.2.6 State
Arabic nouns are marked for definiteness or indefiniteness. In Arabic the definite article
ال, al, is used as a prefix to indicate definiteness. It is not an independent word like "the"
in English. In Arabic, "nunation" (tanween) marks are used as suffixes to indicate
indefiniteness [120]. For example, the words الكتاب, AlktAbu, "the book", and كتاب,
ktAbun, "a book", indicate definiteness and indefiniteness respectively.
The tag set hierarchy presented in this work follows the tradition of Arabic grammar.
Most of the Arabic grammar dictionaries, such as a dictionary of Arabic grammar [57]
As Arab grammarians described, each Arabic word belongs to one of the three main
1. Verb
In Arabic grammar, the main class (verb) comes with three sub-classes shown
in figure 4.1 (see also figure 2.3 in chapter 2). These sub-classes are classified
according to the tenses of verb in Arabic.
[Figure 4.1: sub-classes of the verb: Perfect, Imperfect, Imperative]
Practically all Semitic scholars agree that the tense of the verb does not express
the idea of time, but rather the idea of "finished act" and an "unfinished act". If
the act is incomplete or unfinished, the verb is the imperfect. However, the Arab
looks at these tenses as expressing the idea of time and not the idea of finished
imperfect verb is necessary, because the imperative verb is a form of the
imperfect [61].
2. Noun
A noun in Arabic indicates a meaning by itself without being connected with the
notion of time and refers to a person, place, thing and event [96].
[Figure: noun sub-classes. Recoverable node labels: Uninflected, Derivative, Primitive,
Proper, Verbal Noun, Personal Pronoun, Adjective Noun, Conjunctive Noun, Relative Noun,
Conditional Noun, Diminutive Noun, Demonstrative Noun, Noun of Time, Numeral Noun]
Grammatically, nouns in Arabic are of two kinds: inflected nouns, those nouns
that are affected by the inflectional features, such as Adjective, Verbal, Relative,
etc., and uninflected nouns, those nouns that always appear in one case and cannot be
affected by the inflectional features, such as personal, conjunctive, conditional,
The inflected nouns also come in two kinds: primitive (not derived from verb
or noun) such as رجل, rjlun, "a man", أسد, Osdun, "a lion", and derivative
(derived from verb or noun) such as مكتبة, mktbpun, "library", derived from the
verb كتب [61].
tive. Conditional). The list below summarises these sub-classes in a little more
• Common Noun
The largest sub-class of the main class (noun) in Arabic is the common noun.
These nouns may or may not be derived from the ground verb (root). Com-
may not [120]. For example, the words الشجرة, Alshjrpa, "the tree" and
شجرة, shjrpun, "a tree".
• Proper Noun
come from a variety of sources, many of them are Arabic words, but some
are non-Arabic (foreign words). These nouns may include the definite
article ال or may not [120]. For example, لندن, lndn, "London", القاهرة,
AlqAhrpu, "Cairo".
The verbal nouns are derived from verb forms 4 . They follow a regular
4 Verb forms are described in Chapter 5, Section 5.3.
• Relative Noun
Relative nouns are formed from other nouns by adding the suffix ي (for
masculine) or ية (for feminine) [61]. For example, the words شكلي, shklyun,
"formal" and أردنية, Ardnypun, "Jordanian (fem)".
• Noun of Time
In Arabic, to denote the noun of time, some patterns refer to the time when
the activity specified by the verb occurs or has been used [57]. For ex-
ample, the word ~y, mwEdun, "appointment" follows the patterns J.-.Lo,
• Adjective Noun
An adjective in Arabic is placed after the noun it qualifies, and in most
cases agrees with it in number and gender. On the other hand, the present
participle and past participle are used as adjectives in Arabic language [61].
"glorified". Adjective words like many other words in Arabic are derived
from the ground verb and each adjective word follows a certain pattern. For
example, the word صالح, SAlHun, "good man", follows the pattern فاعل,
• Diminutive Noun
Arabic has a few diminutive forms of nouns which are actually used. They
are formed from trilateral nouns (nouns with three consonants). For example,
the word جبيل, jbylun, "a little mound" follows the pattern فعيل [61].
• Instrument Nouns
Nouns of instrument in Arabic are of two kinds: those which are derived
from the ground verb (root), such as the word مفتاح, mftAHun, "a key", which
follows the pattern مفعال and is derived from the verb فتح, ftHa, "he opened",
and those which are not derived from the ground verb, such as سكين, skynun,
"knife" or جرس, jrsun, "bell" [61].
• Interrogative Noun
Usually, the interrogative words (question words) are used at the beginning
of an Arabic sentence [76]. For example, the words كيف, kyfa, أين, Ayna,
متى, mtY, ماذا, mAdhA and كم, km, which are equivalent to the words "how?",
"where?", "when?", "what?", "how (many/much)?" in English respec-
• Pronoun
The pronoun sub-class in our tag set represents the personal pronouns 5 . They
refer to persons or entities. On the other hand, the pronoun class in Ara-
show more difference in inflectional features, such as, gender, number and
person [120]. Table 4.1 shows the difference in the gender and number
of persons between Arabic and English language. This table shows that
for the Arabic first person there is no gender distinction. For the second
person, there are five forms of "You". For the third person, there are six
verbal distinctions and five pronoun distinctions. Thus, the total number of
                English            Arabic
First Person    I, We              أنا, "AnA"; نحن, "nHn"
Second Person   You(Fe), You(Ma)   أنتَ, "Anta" (Ma/Sn)
                                   أنتِ, "Anti" (Fe/Sn)
                                   أنتما, "AntmA" (Du)
                                   أنتم, "Antm" (Ma/Pl)
                                   أنتن, "Antn" (Fe/Pl)
Third Person    He                 هو, "hwa" (Ma/Sn)
                She                هي, "hy" (Fe/Sn)
                They               هما, "hmA" ((Ma/Fe)/Du)
                                   هم, "hm" (Ma/Pl)
                                   هن, "hn" (Fe/Pl)
• Adverbial Noun
ally, most adverbs in Arabic are words used to answer the questions "when
hand, some adverbs are used as particles such as the words تحت, tHta, "under",
فوق, fwqa, "over" and قبل, qbla, "before". However, in our tag set
we have used one tag to represent the adverbial words which fall in our tag
• Demonstrative Noun
The demonstrative words in Arabic are determiners used with other nouns
the speaker. For example, the words هذا, hdhA, ذلك, dhlka, هؤلاء, hWlA',
أولئك, Awl'ka are equivalent to "this", "that", "these", "those" in English
flect for gender, number and case [120]. However, the demonstrative words
• Conditional Noun
In Arabic, the conditional noun is used between two sentences to show that
the second sentence depends on the first sentence [57]. For instance, the
• Noun of place
The Arabic language has specifically derived patterns which are used to denote
the noun of place. These patterns refer to the place where the activity
specified by the verb occurs [120]. For example, the words مركز, mrkza,
mfElta respectively.
• Conjunctive Noun
ative clause to a noun or a noun phrase in the main clause of the sentence.
They may be definite or indefinite. In addition, they are marked for gender and
number [120]. For example, the words l:?'l}\, .f' t;" ~\ which are equiva-
lent respectively to the words, "who, which", "who, whoever", "that which,
• Numeral Noun
Arabic has a complex numeral system. It is one of the complicated features
numbers, they usually follow the noun that they modify and agree with it in
gender, but sometimes precede it. For example, المؤتمر الثاني, AlmWtmr Althani,
"the second conference" and عشرون يوما, Eshrwna ywmA, "twenty
days". The second type refers to cardinal numbers; these numbers are rather
For example, اثنان, Athnani, "two", إحدى المدن, AhdY Almdn, "one of the
cities". Numeral nouns also inflect for gender, number and case [76].
3. Particle
Particles are of two kinds: Formation and Signification as shown in figure 4.3.
They are one of the three main POS classes in the Arabic language. Formation
particles are particles which constitute the characters of Arabic word, while sig-
nification particles are used with verbs and nouns; they are effective to signal the
mood of the verb or the case of the noun [120]. For example, the particles لم, lm,
"never" ... J" ,A.'!'. "in order" indicate Jussive and Subjunctive respectively.
[Figure 4.3: particle sub-classes. Formation; Signification: Preposition, Vocative,
Conjunction, Exception, Negation, Subjunctive, Jussive/Elision]
ARBTAGS tags have been built based on the following main formula:
out this section, abbreviation symbols representing the name of each main POS class
and sub-class as well as the possible value of the inflectional features which have been
used to represent the tag in our tag set are shown between square brackets in each table.
Table 4.2 shows the abbreviation symbols of the main POS classes in Arabic.
S, represents the sub-classes of each main POS class in Arabic. The abbreviation
symbols of sub-classes of verb, noun, and particle class are shown in Tables 4.3, 4.4
G, represents the inflectional feature (Gender), used to inflect noun and verb sub-
classes. The possible values for the inflectional feature gender are shown in Table 4.6.
N, represents the inflectional feature (Number), used to inflect noun and verb sub-
classes. The possible values of the inflectional feature number can be seen in Table 4.7.
P, represents the inflectional feature (Person), used to inflect the verb sub-classes.
Table 4.8 shows the possible values of the inflectional feature person.
M, represents the inflectional feature (Mood), used to inflect the verb sub-classes.
Table 4.9 shows the possible value of the inflectional feature mood.
C, represents the inflectional feature (Case) and it is used to inflect the noun sub-
classes. The possible value of the inflectional feature case can be seen in Table 4.10.
F, represents the inflectional feature (State) and it is used to inflect the noun
sub-classes. Table 4.11 shows the possible value of the inflectional feature state.
However, in Arabic, the first two main POS classes, verb and noun, can inflect
grammatically in the system of inflectional morphology, while the third one (particle)
can not [16]. For example, verb can inflect for person, number, gender and mood as
shown in figure 4.4, while the inflectional features for the noun class can be seen in
figure 4.5.
[Figure 4.4: inflectional features of the verb class. Verb [Ve]: Perfect [Pe],
Imperfect [Pi], Imperative [Pm]. Gender: Masculine [Ma], Feminine [Fe], Neuter [Ne].
Mood: Indicative [Dc], Subjunctive [Sj], Jussive [Js]. Number: Singular [Sn], Dual [Du],
Plural [Pl]. Person: First [Fs], Second [Sc], Third [Th]]
Before describing the detailed and general tags which have been used in the ARBTAGS
tag set, let us summarise all the abbreviation symbols which have been used in the
[Figure 4.5: inflectional features of the noun class. Noun [Nu]: Common [Cn], Proper [Po],
Verbal [I t], Adjective [Aj], Relative [Re], Diminutive [Dm], Instrument [Is],
Noun of Place [Pn], Noun of Time [Tn], Pronoun [Ps], Conjunctive [Cv], Conditional [Cd],
Demonstrative [De], Interrogative [In], Adverbial [Ad], Numeral [Nn].
Gender: Masculine [Ma], Feminine [Fe], Neuter [Ne].
Case: Nominative [Nm], Accusative [Ac], Genitive [Ge]]
[Figure: Particle [Pr]: Preposition [Pp], Vocative [Vo], Exception [Ex], Jussive/Elision [Jv]]
As mentioned above, ARBTAGS has 28 general tags and 161 detailed tags. The
Category            Abb     Inflectional feature value   Abb
Verb                Ve
Noun                Nu
Particle            Pr
Perfect             Pe
Imperfect           Pi
Imperative          Pm
Adjective           Aj
Verbal              If      Masculine                    Ma
Noun of Place       Pn      Feminine                     Fe
Noun of Time        Tn      Neuter                       Ne
Demonstrative       De      Singular                     Sn
Relative            Re      Plural                       Pl
Pronoun             Ps      Dual                         Du
Diminutive          Dm      First                        Fs
Instrument          Is      Second                       Sc
Proper              Po      Third                        Th
Adverb              Ad      Indicative                   Dc
Common              Cn      Subjunctive                  Sj
Interrogative       In      Jussive                      Js
Conjunctive         Cv      Nominative                   Nm
Conditional         Cd      Accusative                   Ac
Numeral             Nn      Genitive                     Ge
Preposition         Pp      Definite                     Df
Vocative            Vo      Indefinite                   Id
Exception           Ex
Negation            An
Subjunctive         Sb
Jussive/Elision     Jv
Conjunction         Co
Foreign word        Fw
Table 4.12: Abbreviation symbols used in ARBTAGS tag set
detailed tags not only represent the name of the class that the word belongs to, but also
The rationale behind developing detailed tags comes from two reasons. The first
reason is to enrich each word in the testing corpus with more linguistic information for
the word including the inflectional feature of the word. The tagged corpus becomes
more useful for linguists and NLP developers if most words are tagged with detailed
The second reason is that the pattern in the pattern-based technique represents the tem-
plate for the whole word. It not only includes the form of the word but also includes the
prefixes and suffixes attached to the word. The suffixes provide the inflectional feature
of the word. Since this pattern is generated automatically from three lexicons (prefixes
with their tags, forms with their tags and suffixes with their tags), the generated tag with each
pattern is a detailed tag. On the other hand, general tags are also used by applying the
has the following detailed tag VePiMaPlThDc, which means [Imperfect verb, masculine
gender, plural number, third person, indicative mood]. The general tag
NuPo, by contrast, may be assigned to a word such as رمزي, Rmzy, which means [Proper noun].
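Because every abbreviation in Table 4.12 is two letters long, a tag can be read back by slicing it into two-letter codes. The following is a minimal sketch of such a decoder, not part of the AMT system: the function names `decode_tag` and `decode_combined` are hypothetical, the position-based reading (main class, then sub-class, then features) is an assumption drawn from the sample tags, and the Verbal noun code is left out because it is not legible in Table 4.12.

```python
# Abbreviations copied from the legible entries of Table 4.12.
MAIN = {"Ve": "Verb", "Nu": "Noun", "Pr": "Particle"}

SUB = {
    "Ve": {"Pe": "Perfect", "Pi": "Imperfect", "Pm": "Imperative"},
    "Nu": {
        "Cn": "Common", "Po": "Proper", "Aj": "Adjective", "Re": "Relative",
        "Dm": "Diminutive", "Is": "Instrument", "Pn": "Noun of Place",
        "Tn": "Noun of Time", "Ps": "Pronoun", "Cv": "Conjunctive",
        "Cd": "Conditional", "De": "Demonstrative", "In": "Interrogative",
        "Ad": "Adverbial", "Nn": "Numeral",
    },
    "Pr": {
        "Pp": "Preposition", "Vo": "Vocative", "Ex": "Exception",
        "Sb": "Subjunctive", "Jv": "Jussive/Elision", "Co": "Conjunction",
    },
}

FEATURE = {
    "Ma": "Masculine", "Fe": "Feminine", "Ne": "Neuter",
    "Sn": "Singular", "Du": "Dual", "Pl": "Plural",
    "Fs": "First Person", "Sc": "Second Person", "Th": "Third Person",
    "Dc": "Indicative", "Sj": "Subjunctive", "Js": "Jussive",
    "Nm": "Nominative", "Ac": "Accusative", "Ge": "Genitive",
    "Df": "Definite", "Id": "Indefinite",
}


def decode_tag(tag):
    """Expand one tag into a list of readable names (hypothetical helper)."""
    if not tag or len(tag) % 2:
        raise ValueError(f"tag must be a non-empty run of two-letter codes: {tag!r}")
    codes = [tag[i:i + 2] for i in range(0, len(tag), 2)]
    main = codes[0]
    names = [MAIN[main]]                        # first code: main POS class
    if len(codes) > 1:
        names.append(SUB[main][codes[1]])       # second code: sub-class
    names.extend(FEATURE[c] for c in codes[2:])  # remaining codes: features
    return names


def decode_combined(tag):
    """Decode a combined tag such as 'PrCo+VePiMaSnThDc' part by part."""
    return [decode_tag(part) for part in tag.split("+")]


print(decode_tag("VePiMaPlThDc"))
print(decode_tag("NuPo"))
```

A general tag such as `NuPo` and a detailed tag such as `VePiMaPlThDc` go through the same function; only the number of codes differs.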
A POS tag may be very coarse (e.g. Ve, "Verb") or very fine (e.g. VePiMaPlFsJs, "Verb,
Imperfect, Masculine, Plural, First Person, Jussive"), depending on the task or
application [114]. Since the main aim of the AMT system is to produce a tagged corpus,
the tags were developed with a good level of granularity, where each tag is enriched
with inflectional features that meet the needs of linguists and NLP developers. On the
other hand, the cardinality of the POS tag set makes tagging a morphologically
ambiguous inflective language, e.g. Arabic, different from tagging a language with poor
inflection such as English [78]. For example, the number of tags for perfect verbs
in the ARBTAGS tag set presented in this work and in the Penn Treebank tag set
for English is shown in Table 4.13. The numbers 6 vs. 81 shown in Table 4.13 illustrate
the difference very clearly.
Table 4.13: ARBTAGS tag set vs. Penn Treebank tag set
ARBTAGS tag set general tags are shown in Table 4.14, while a sample of detailed
tags can be seen in Table 4.15. However, the general and detailed tags with examples
Tag             Description
VePeMaSnThSj    Verb, Perfect, Masculine, Singular, Third Person, Subjunctive
VePeMaSnFsDc    Verb, Perfect, Masculine, Singular, First Person, Indicative
VePeMaSnScSj    Verb, Perfect, Masculine, Singular, Second Person, Subjunctive
VePeFeSnScJs    Verb, Perfect, Feminine, Singular, Second Person, Jussive
VePeFeSnThJs    Verb, Perfect, Feminine, Singular, Third Person, Jussive
VePiMaPlFsJs    Verb, Imperfect, Masculine, Plural, First Person, Jussive
VePiMaPlFsDc    Verb, Imperfect, Masculine, Plural, First Person, Indicative
VePmMaSnScJs    Verb, Imperative, Masculine, Singular, Second Person, Jussive
VePmFeSnScJs    Verb, Imperative, Feminine, Singular, Second Person, Jussive
NuDeSnAcId      Demonstrative Noun, Singular, Accusative, Indefinite
NuDeDuGeId      Demonstrative Noun, Dual, Genitive, Indefinite
NuInId          Interrogative Noun, Indefinite
NuCvSnId        Conjunctive Noun, Singular, Indefinite
NuAdId          Adverbial Noun, Indefinite
NuNnId          Numeral Noun, Indefinite
NuAjMsSnNmId    Adjective Noun, Masculine, Singular, Nominative, Indefinite
NuAjMsSnNmDf    Adjective Noun, Masculine, Singular, Nominative, Definite
NuAjMsSnAcDf    Adjective Noun, Masculine, Singular, Accusative, Definite
NuAjMsSnGeDf    Adjective Noun, Masculine, Singular, Genitive, Definite
NuIsMaDuGeId    Instrument Noun, Masculine, Dual, Genitive, Indefinite
NuDmSnNmId      Diminutive Noun, Singular, Nominative, Indefinite
NuReMaSnNmId    Relative Noun, Masculine, Singular, Nominative, Indefinite
NuReMaDuGeDf    Relative Noun, Masculine, Dual, Genitive, Definite
NuCnMaSnNmId    Common Noun, Masculine, Singular, Nominative, Indefinite
NuCnFeSnNmId    Common Noun, Feminine, Singular, Nominative, Indefinite
NuCnMaPlGeDf    Common Noun, Masculine, Plural, Genitive, Definite
NuPsMaSnThAc    Personal Noun, Masculine, Singular, Third Person, Accusative
This chapter presented a number of criteria to take into account while developing the
POS tag set. Arabic inflectional features, such as, gender, number, case, mood, person
and state are described. In this chapter, we described the steps of our tag set design.
An Arabic tag set called ARBTAGS, containing 161 detailed tags and 28 general tags
covering the main Arabic POS classes and sub-classes, has been compiled and
introduced in this work. The developed tag set follows the Arabic grammatical system,
based upon POS classes and inflectional morphology that Arab grammarians describe.
The developed tag set differs from the tag sets which have been built for Arabic. The
main difference is the tag set hierarchy introduced and compiled in this chapter. Since
the main aim of AMT system is to produce a tagged corpus, the tags were developed
with a good level of granularity, where each tag is enriched with inflectional features
Chapter 5
The tagger system (Arabic Morphosyntactic Tagger (AMT)) presented in this work has
• Lexicon Free
AMT does not require a manually tagged or untagged lexicon which contains
Arabic words. It requires the testing corpus only. Building a generic POS tagger
system without a lexicon depends on the language and the characteristics of its
grammar, both the morphological and the syntactical systems of that language.
It is possible for the tagger system presented in this work to tag one word regardless
of the context. This possibility comes from (1) the fact that the word in the
testing corpus has a diacritical mark. The diacritical mark provides semantic
information and defines the inflectional features of the word, which helps to resolve
any lexical ambiguity that may arise; and (2) the fact that the main technique used in
this work is based on the pattern of the word instead of the word itself. Since the Arabic
word matches its correct pattern, the correct tag is assigned to the word regardless
to assign the correct tag to each word in the testing corpus. Two different techniques
were used in this work; the pattern-based technique and the lexical and contextual
technique. The rules in the former technique are based on the pattern of the testing
word, while the rules in the latter technique are based on the character(s), affixes, the
last diacritical mark, the word itself, and the surrounding words or on the tags of the
surrounding words.
of patterns instead of using manually tagged or untagged Arabic words lexicon for
training. Section 5.3 describes the pattern-based technique in more detail. The lexical
and contextual technique is used to assist the main technique to assign the correct tag
to those words not tagged by the pattern-based technique. Section 5.4 describes the
As mentioned in chapter 3, Arabic has a set of rules or signs described by Arab gram-
marians for more than 1400 years, such as, rules used to distinguish nouns from verbs
and particles. It also has a set of facts and characteristics, such as that each original
Arabic word has a pattern and many Arabic words follow only one pattern. Additionally,
the diacritic is an important feature (chapter 3). All of these facts and characteristics are taken
into account when the above techniques are built and used in this work.
The AMT system presented in this work is designed to accept any partially-vocalised
Arabic text as an input and produce a tagged text. The signs that indicate the category
of the word in the Arabic language on the one hand, and the existence of the diacritic feature
on the other hand play a great role in reducing the lexical ambiguity of the words and
providing semantic information for the word, leading to assigning the correct tag for
each word in the testing corpus. In addition, due to the fact that Semitic languages in
general have a morphological system based on a root and pattern structure, using the
pattern of the word instead of the word itself can achieve a good result in assigning the
On the other hand, the statistical approach, the second main approach in POS tagging,
requires a huge manually tagged lexicon to calculate statistical information such as
the probability of a particular word and tag co-occurring [73]. This approach may be
useful when dealing with unvocalised Arabic text because, with the diacritical mark
missing in this type of text, a word may have multiple POS tags. But
to achieve remarkable accuracy using the statistical approach, the manually tagged
corpus used for training should be very large. Unlike English, Arabic still lacks a huge
manually tagged corpus from which large amounts of training data can be extracted.
For example, a training corpus of about 10,000 words, which is used by Khoja [87]
in her tagger for Arabic, is definitely not sufficient to cover most words in the Arabic
language. In addition, the small training corpus used in a statistical approach presents the
problem of unknown words.
Unknown words are words not appearing in the training corpus. Neither the testing
corpus nor the training corpus has lexical information or tags for these words.
The statistical model in this case has no role in dealing with unknown words. So, if
the training corpus is very small and most words in the testing corpus are completely
different from those in the training corpus, the accuracy of the POS tagger becomes
very weak.
At the same time, many POS tagger systems have been built for English based on
statistical approach and achieved very high accuracy. The reason behind achieving
this remarkable accuracy is the very large lexicon, containing hundreds of millions
of words, that has been used in these systems. However, as mentioned above, the AMT
system presented in this work did not use a lexicon for training. Thus, the rule-based
approach is the best approach to achieve the above goal due to the fact that the testing
Much computational work on Semitic languages assumed that a word may consist of
the following elements: Prefixes, Stem and Suffixes [45,84,110,118,119]. The Arabic
language has trilateral and quadrilateral verb forms. The great majority of Arabic verbs
are trilateral, containing three letters: the first letter is ف, f, the second is ع, E, while
the third letter is ل, l. The Arab grammarians have used the trilateral verb form فعل,
The ground form of the trilateral and the form of the quadrilateral¹ verbs have derived
a great number of other forms by inserting a long vowel, lengthening the root medial
letter, and/or adding consonantal prefixes to produce a new word with a new meaning
that still shares the basic meaning of the root² [138] [81]. For example, the words
لعب, lEba, and لاعب, lAEbun, mean "he played" and "player" respectively. The former
word represents the root and belongs to the verb class, which has the basic template
form فعل. When adding the long vowel ا, "Alif", to the medial letter
of the root, a new word لاعب belonging to the noun (adjective noun) class is
produced, which has the derived form فاعل, fAElun, and still shares the basic meaning
of the root. The ground form and other forms derived from the ground form are shown
in Table 5.1. However, these derived forms express various modifications of the idea
As Arab grammarians described, each original Arabic word has a pattern. M.Elaffendi
(prefix + stem + suffix), and at the same time, marks the positions of the radicals comprising
It is important to point out here that the pattern is different from the word; it has
¹ The quadrilateral verb form is فعلل, fEll, formed by doubling the third letter of the ground form. In Arabic,
these verbs are rare.
² In English, the produced words are termed stems.
no meaning itself, but it is a template that indicates the positions of the root letters. The
pattern represents the lexical category of the word and indicates the syntactic and se-
In this work the word "pattern" is used to represent the template of the whole word in-
cluding the prefixes, form (root+infixes) and suffixes, which are attached to the word.
The pattern shares with the word the affixes that may be added to the ground form
(root). For example, the word ويصافحون, wySAfHwna, "to shake hands" has the pattern
ويفاعلون, "wyfAElwna", as shown in figure 5.1. The root of the word ويصافحون is
صفح, SfH, which has the form فعل, fEl, while the whole pattern is ويفاعلون, wyfAElwna.
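The relation between a word and its pattern can be illustrated with a short sketch: the root letters are replaced, in order, by the template letters f, E and l, and every other letter (affixes, infixes, the last diacritical mark) is kept. This is only an illustration of the definition above, not the thesis's Pattern-Matching Algorithm, which works without knowing the root. The function name `word_to_pattern` is hypothetical, and the sketch assumes a transliteration in which each Arabic letter or mark is a single character.

```python
def word_to_pattern(word, root, template="fEl"):
    """Replace the root letters of `word`, left to right, by the template letters.

    Assumption: one transliteration character per letter, and the first
    occurrence of each root letter (in order) is the root position. That greedy
    choice can be wrong when an affix repeats a root letter.
    """
    if len(root) != len(template):
        raise ValueError("root and template must have the same length")
    out = []
    i = 0  # index of the next root letter still to be located
    for ch in word:
        if i < len(root) and ch == root[i]:
            out.append(template[i])
            i += 1
        else:
            out.append(ch)  # affix, infix or diacritic: copied unchanged
    if i != len(root):
        raise ValueError(f"root {root!r} not found in word {word!r}")
    return "".join(out)


print(word_to_pattern("wySAfHwna", "SfH"))  # wyfAElwna
print(word_to_pattern("lEba", "lEb"))       # fEla
```

The first call reproduces the example in the text: the word wySAfHwna with root SfH yields the pattern wyfAElwna.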
[Figure 5.1: the word ويصافحون and its pattern ويفاعلون; the root form is (f, E, l) [fEl, فعل]]
The existence of the last diacritical mark in both the pattern and the word is very
important. Without it, it becomes very difficult in most cases to determine the lexical
category and to define the inflectional features of the word. For example, the word غافل,
ghAfl, has the pattern فاعل as shown in figure 5.2, but the word still has an ambiguity
regarding its lexical category and semantic meaning because the last diacritical
mark is missing in both the pattern and the word. It may be غافلَ, ghAfla, if the last
diacritical mark is the fatha mark; in this case, it means "take advantage of someone's
inattention" and it belongs to the verb class. Or it may be غافلٌ, ghAflun, if the last
diacritical mark is the nunation mark (tanween damm), which means "inattentive" and
belongs to the noun class. Thus,
while the last diacritical mark is missing in the pattern as well as the word, the lexical
In the Arabic language, no word has more than one pattern to follow. At
the same time, hundreds of Arabic words may follow one pattern. For
example, the words يشربون, yshrbwna, "to drink", يسمعون, ysmEwna, "to hear",
يضربون, yDrbwna, "to beat", يكتبون, yktbwna, "to write", يفهمون, yfhmwna, "to under-
More than 500 words other than these words follow the above pattern. All the above words
belong to the imperfect verb class. As another example, all of the Arabic words that have
three consonants, end with the fatha mark and follow the pattern فعل, fEla, "do", are
perfect verb words.
The case is also valid for the words belonging to noun class. For example, all the
Arabic words following the pattern فاعل, fAElun, such as قاتل, qAtlun, "killer",
The above examples show that the last diacritical mark plays a great role in determining
the correct tag and adding semantic information to the word. In addition, using the
pattern of the word means that building a pattern lexicon with 100 entries may cover
15,000 words, which constitutes the main advantage of the pattern-based technique.
Since a lexicon of Arabic words or a training corpus is not required in this system, we
instead generated a lexicon of patterns which are associated with the last diacritical
1- A single lexicon of all prefixes, including all valid concatenations. A tag is also
associated with each prefix.
2- A single lexicon of all forms. A tag is also associated with each form.
3- A single lexicon of all suffixes associated with the suitable last diacritical mark. Tag
Table 5.2 shows a simple part of the prefix, form and suffix lexicons for some im-
perfect verb words. The combined pattern lexicon is shown in Table 5.3.
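The way the three lexicons combine into a pattern lexicon can be sketched as a Cartesian product of prefixes, forms and suffixes, with the tags joined in the same order. This is a minimal sketch under stated assumptions, not the thesis's generator: the function name `build_pattern_lexicon` and the tiny lexicons are hypothetical, the entries are illustrative transliterations rather than the contents of Table 5.2, and the `+` separator for a tagged prefix follows the combined tag format `[PrCo+VePiMaSnThDc]` used in this chapter.

```python
from itertools import product

# Hypothetical, illustrative entries (transliterated); not copied from Table 5.2.
PREFIXES = {"": None, "w": "PrCo"}   # an untagged empty prefix and the conjunction w
FORMS = {"yfEl": "VePi"}             # one imperfect verb form
SUFFIXES = {"u": "MaSnThDc", "wna": "MaPlThDc"}  # suffixes incl. the last diacritic


def build_pattern_lexicon(prefixes, forms, suffixes):
    """Combine every prefix, form and suffix into a pattern with a detailed tag.

    Assumption: all forms and suffixes passed in belong to the same group, since
    a form's tag is only valid together with the suffixes of its own group.
    """
    lexicon = {}
    for (p, p_tag), (f, f_tag), (s, s_tag) in product(
        prefixes.items(), forms.items(), suffixes.items()
    ):
        tag = f_tag + s_tag           # form tag followed by inflectional features
        if p_tag:                     # a particle prefix contributes its own tag
            tag = p_tag + "+" + tag
        lexicon[p + f + s] = tag
    return lexicon


lexicon = build_pattern_lexicon(PREFIXES, FORMS, SUFFIXES)
for pattern, tag in lexicon.items():
    print(pattern, tag)
```

Keeping one call per group of compatible forms and suffixes (for example imperfect verbs, then perfect verbs) is one way to respect the constraint that a form's tag changes with the suffixes attached to it.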
Table 5.2: Sample of prefixes, forms, suffixes for some imperfect verb words
There are two important things to point out here. The first is that the tags attached
to forms and suffixes in Table 5.2 are valid tags only if these suffixes are attached to
these forms. In other words, the tag of the form may change depending on the suffixes
attached to this form. For example, the tag [VePi] associated with the form فاعل (second
line in Table 5.2) is valid only when the form فاعل is combined with the suffixes
presented in Table 5.2. If the suffixes change, the tag of the form also needs
to be changed. For example, Table 5.4 shows that the tag of the form فاعل, fAEl, is
changed due to the changes in the suffixes. The combined pattern lexicon can
Table 5.3: Sample of pattern lexicon showing the patterns for some imperfect verb words
Table 5.4: Sample of prefixes, forms, suffixes for some perfect verb words
2 ~~ fAElta VePeMaSnScSj
Table 5.5: Sample of pattern lexicon showing the patterns for some perfect verb words
Usually, the prefixes have no tags 3 unless the prefixes represent a particle, such as
a conjunction particle; in this case a separate tag is associated with this particle
to show that the word has a combined tag. For example, the word ويشرح, wyshrhu,
"and to explain" has the following tag [PrCo+VePiMaSnThDc]. [PrCo] is the tag
of the conjunction particle و, w, "and" which appears in the word as well as the pattern.
The second thing is the tag of each suffix represents the inflectional feature of the
word. Each form should have at least one suffix, that is, the last diacritical mark. The
length of the suffixes ranges between 1 to 4 or 5 letters. The length of the prefixes on
the other hand ranges between 0 to 4 or 5 letters. So, it becomes clear that the tags
The rules in the pattern-based technique can be represented using the following general
form:
Assign the tag (T) to the testing word (W) if the testing word matches the pattern (P),
where T is a variable over the set of tags in the pattern lexicon, W is a variable over
the set of testing words, and P is a variable over the set of patterns in the pattern
lexicon. For example,
lexicon to check for its correct pattern. The correct pattern here is P = yfElu (the
second pattern in Table 5.3). The tag [VePiMaSnThDc], which is associated with this
pattern, is then extracted from the pattern lexicon and assigned to the word.
An important question must be asked here: how is the testing word matched to its
correct pattern?
3 In other POS tagging systems built for Arabic, a separate tag such as [Def] is used to
represent the definite article "Al". In the current system, this tag is included with the
inflectional features of the word, using the symbol [Of].
To answer this question, a novel algorithm has been developed and is described in the
next section. The purpose of this algorithm is to show how the testing word is matched
to its correct pattern in the pattern lexicon.
Since the lexicon in AMT is a pattern lexicon and not a lexicon of Arabic words, an
algorithm is required to match each Arabic word in the testing corpus with its correct
pattern in the pattern lexicon. A novel algorithm has been introduced in this work to
achieve this goal. The pseudo code of the pattern-matching algorithm is given in
Algorithm 1. The steps of the algorithm, with examples, are described below in more
Step - 1 :
The first step of the algorithm returns from the pattern lexicon all the patterns that
have the same length as the testing word. For example, the word fktbtna, "and they
wrote", has a length of 7 (see footnote 4 and figure 5.3). The returned patterns that
have the same length as this word are shown in Table 5.6. The next step (Step - 2) of
the algorithm shows how to calculate the identical letters between the testing word
and the fourth pattern as an example.
Table 5.6: Number of identical letters between the word fktbtna and its patterns
Step - 2 :
The second step of the algorithm calculates the number of identical letters between the
testing word and each of the patterns returned by Step - 1. The aim of this step is to
reduce the number of returned patterns. For example, the identical letters between the
word fktbtna and its fourth pattern are shown in figure 5.3. The number of identical
letters between the word and each returned pattern can be seen in Table 5.6.
Figure 5.3: The identical letters between the word fktbtna and its fourth pattern
Step - 3 :
Choose the pattern(s) which have the maximum number of identical letters. Since the
fourth pattern in Table 5.6 has the maximum number of identical letters with the
testing word, the algorithm chooses this pattern for the word fktbtna.
Step - 4 :
Replace the letters of W which correspond to (mirror) the letters f, E and l (the root
letters) in the pattern(s) (P) that have the maximum number of identical letters with
the word (W). Add the remaining letters of W without change and store the new pattern
in NP. Figure 5.4 describes how to perform this step.
Figure 5.4: Building the new pattern NP from the word W and the pattern P(1), then
testing whether NP = P(1)
Figure 5.4 shows clearly that a new pattern has been created with the same length as
the original pattern (P(1)) and the word (W). The letters which do not correspond to
the root letters are the same in the word, the original pattern and the new pattern.
These letters represent the affixes which are added to the ground form (root). Since
NP = P(1), P(1) is the correct pattern for the word fktbtna.
In most cases the algorithm returns a single pattern with the maximum number of
identical letters with the testing word, as in the above example. Sometimes, however,
more than one pattern is returned, each with the same number of identical letters
Step - 4 is therefore used not only to check that a single pattern with the maximum
number of identical letters is indeed the correct pattern, but also to choose the
correct pattern when the algorithm returns more than one pattern, each with the same
number of identical letters with the testing word.
For example, suppose W is the word ysmEwna, "to hear". Table 5.7 shows the patterns
that have the same number of identical letters with this word.
PNo Pattern Word Identical letters Num
1 ~~ 0.,...0 :II! the letters 4..f' t.: 0, and last mark 4
the letters 4..f' t.: 0, and last mark
~ 0yo :II! 4
Table 5.7: Number of identical letters between the word ysmEwna and its patterns
During this step, the algorithm determines which of the above patterns is the correct
pattern for the word ysmEwna. The first pattern, P(1), is checked first to see whether
it is the correct pattern for the word or not. Figure 5.5 shows the
Figure 5.5: Matching the word ysmEwna with the first pattern P(1)
It is clear from figure 5.5 that NP does not equal the pattern P(1), because the letter
miim in the word (W) differs from its corresponding letter in P(1) (Alif). So the first
pattern is not the correct pattern for the word ysmEwna.
Similarly, the second pattern is not the correct pattern for the word, as shown in
figure 5.6, because the letter miim in the word (W) differs from its corresponding
Figure 5.6: Matching the word ysmEwna with the second pattern P(2)
The last pattern, P(3), is then checked by the algorithm as shown in figure 5.7. Since
NP = P(3), this pattern is the correct pattern for the word ysmEwna, and the algorithm
chooses it.
Figure 5.7: Matching the word ysmEwna with the third pattern P(3)
Step - 5 :
The last step of the algorithm extracts the tag associated with the correct pattern
from the pattern lexicon and assigns this tag to the testing word. For example, the tag
VePiMaPIThSj (see Table 5.3) is extracted and assigned to the word ysmEwna.
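The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch and not the thesis's Algorithm 1: it assumes a one-character-per-letter transliteration in which f, E and l stand for the root letters and the last character is the final diacritical mark, and the pattern strings in `toy_lexicon` (together with the tag `TAG-1`) are hypothetical stand-ins for the entries of Table 5.7.

```python
# Placeholders that stand for the three root letters in a pattern (assumption:
# the toy patterns below do not use these characters inside their affixes).
ROOT_SLOTS = {"f", "E", "l"}


def count_identical(word, pattern):
    """Step 2: count positions where the word and the pattern share a symbol."""
    return sum(1 for w, p in zip(word, pattern) if w == p)


def build_new_pattern(word, pattern):
    """Step 4: replace the word's letters at the pattern's root positions."""
    return "".join(p if p in ROOT_SLOTS else w for w, p in zip(word, pattern))


def match_pattern(word, lexicon):
    """Return (pattern, tag) for the word, or None if no pattern fits."""
    # Step 1: keep only patterns with the same length as the word.
    candidates = [p for p in lexicon if len(p) == len(word)]
    if not candidates:
        return None
    # Step 2: number of identical letters for each candidate.
    scores = {p: count_identical(word, p) for p in candidates}
    # Step 3: only the patterns with the maximum score are considered.
    best = max(scores.values())
    for pattern in candidates:
        # Step 4: the correct pattern is the one for which NP == P.
        if scores[pattern] == best and build_new_pattern(word, pattern) == pattern:
            # Step 5: extract the tag stored with the pattern.
            return pattern, lexicon[pattern]
    return None


# Hypothetical lexicon entries in transliteration.
toy_lexicon = {
    "yfAElna": "TAG-1",
    "yfElwna": "VePiMaPIThSj",
    "yfElu": "VePiMaSnThDc",
}

print(match_pattern("ysmEwna", toy_lexicon))
```

In this toy run both seven-letter patterns share four symbols with ysmEwna, so Step 3 alone cannot decide; Step 4 rejects the first one because the letter m does not agree with the Alif (A) of the pattern, which mirrors the situation described for figure 5.5.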
The pattern-based technique described in section 5.3.2, which depends on the pattern
of each word in the testing corpus, constitutes the main technique in this work. In
fact, it is not easy for one person to generate all the patterns which cover all the
words in the Arabic language. Since most words in Arabic belong to the noun class,
difficulties may arise especially in collecting all the patterns of the words
belonging to this class. For words belonging to the verb class, the case is different:
it is easy to collect the verb forms and all the affixes associated with these forms,
as the pattern lexicon is generated automatically.
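How a pattern lexicon might be generated automatically from prefix, form and suffix lists can be sketched as follows. The data layout and the function name `build_pattern_lexicon` are assumptions made for illustration; the only values taken from the text are the tag parts [VePi], [PrCo] and the combined tag [PrCo+VePiMaSnThDc]. Each form carries its own suffix list, since a form's tag is valid only with the suffixes listed for it.

```python
def build_pattern_lexicon(prefixes, forms):
    """Combine prefixes, forms and suffixes into pattern -> tag entries.

    prefixes: prefix string -> particle tag, or None when the prefix has no tag.
    forms:    form string -> (form tag, {suffix string -> inflectional tag}).
    """
    lexicon = {}
    for prefix, prefix_tag in prefixes.items():
        for form, (form_tag, suffixes) in forms.items():
            for suffix, suffix_tag in suffixes.items():
                # The form tag names the class; the suffix tag adds inflection.
                tag = form_tag + suffix_tag
                # A particle prefix contributes a separate, combined tag.
                if prefix_tag is not None:
                    tag = f"{prefix_tag}+{tag}"
                lexicon[prefix + form + suffix] = tag
    return lexicon


# Hypothetical sample in transliteration (imperfect verb, prefix y, with and
# without the conjunction particle w).
prefixes = {"y": None, "wy": "PrCo"}
forms = {"fEl": ("VePi", {"u": "MaSnThDc"})}

lexicon = build_pattern_lexicon(prefixes, forms)
print(lexicon)
```

Under these assumptions the two generated entries are yfElu with the tag VePiMaSnThDc and wyfElu with the tag PrCo+VePiMaSnThDc, which matches the combined-tag example given for the word wyshrhu.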
The pattern lexicon in this work contains 8718 patterns. Most of these are patterns
for words which belong to the verb class. The tag set hierarchy (see 2.3) covers most
types of sub-classes belonging to the noun class. Some of these sub-classes have
fixed patterns, for example the patterns of adjective nouns, instrument nouns, verbal
nouns and diminutive nouns, which are generated automatically and added to the
pattern lexicon. The difficulties may arise in collecting and generating the patterns for
As mentioned earlier, the pattern lexicon contains 8718 patterns. These patterns are
definitely not sufficient to cover all Arabic words, especially those belonging to
the noun and verb classes. For this reason, the lexical and contextual technique is used
in this work to assist the pattern-based technique in tagging those words that do not
have patterns
On the other hand, all the tags in the pattern-based technique are detailed tags,
because these tags have been generated automatically with the patterns. These tags not
only represent the name of the class (e.g. perfect verb, imperfect verb), but also
include the inflectional features of the word, such as number, gender, person and
mood, derived from the prefixes and suffixes attached to the verb form of the word. In
contrast, the tags of those words which belong to the noun or particle class and are
tagged by the lexical and contex-
As mentioned in section 3.4.2, the Arabic language has a set of rules or signs, which
have been described by Arab grammarians and are used to distinguish nouns from verbs
2. An Arabic word has the genitive case (ends with a kasra mark)
In the Arabic language, neither the words belonging to the verb class nor those
belonging to the particle class can satisfy the above rules. These rules have been
taken into account when applying lexical
Lexical rules are used to analyse words and take advantage of their internal structure.
The triggers in the lexical rules depend on the character(s), affixes and the last
diacritical mark of the word. The names of the rules in the lexical and contextual
technique are written in the same way that Brill [37] represented his rules and
templates. The names of the lexical rules (in parentheses) and the description of each
rule are given below:
• The last two characters of the current word are C (L2CHCWD)
Where X is a variable over the set of diacritic marks, C is a variable over the set of
An example of a lexical rule is shown below. The list of lexical rules with examples
Contextual rules are used to assign the correct tag to a particular word based on the
surrounding words or on the tags of the surrounding words. The triggers in the
contextual module depend on the current word itself, and the tags or words in the context of
given below:
Where Z is a variable over all words in the testing corpus, Y is a variable over the set
of tags.
An example of a contextual rule is shown below. The list of contextual rules with exam-
This rule means: assign the NuCnGeld tag to the current word if the tag of the
One of the most obvious problems in tagging Arabic text is recognising proper nouns.
A proper noun in Arabic may represent the name of a specific person, place,
organization, thing, idea, event, date, time or other entity. Unlike English, Arabic
does not distinguish between lower and upper case letters, which makes locating proper
nouns much harder than in English text. Furthermore, these words may be solid or
derived words, or words borrowed from another language (Arabised words), which adds
another level of complexity to recognising them [15] [14].
Abuleil and Evens [15] presented a technique for tagging proper nouns in Arabic text,
which depends on keywords stored in a lexicon. Table 5.8 shows how they have
In this work, their classification (keywords) has been applied, but by using the
No  Classification                Example
1   Personal names (title)        Mr. Ramzi
2   Personal names (job title)    President Saleh
3   Organization names            De Montfort University
4   Locations (political names)   French Republic
5   Locations (natural names)     Amman City
6   Times                         Month of September
7   Product                       Nikon Camera
8   Events                        Cars Exhibition
This rule means: assign the NuPo tag to the current word if the preceding word is the
keyword named in the rule.
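A keyword-triggered contextual rule of this kind can be sketched as below. The function name, the transliterated keywords and the sample sentence are hypothetical; only the tag NuPo and the idea of testing the preceding word come from the text. The sketch also assumes that tokens tagged earlier in the process are left untouched.

```python
def apply_keyword_rule(tokens, tags, keywords, proper_noun_tag="NuPo"):
    """Tag an untagged token as a proper noun when the preceding word is a keyword."""
    result = list(tags)
    for i in range(1, len(tokens)):
        # Trigger: the preceding word is one of the stored keywords.
        if result[i] is None and tokens[i - 1] in keywords:
            result[i] = proper_noun_tag
    return result


# Hypothetical transliterated keywords (a title and a location word).
keywords = {"Alsyd", "mdynp"}
tokens = ["Alsyd", "rmzy", "fy", "mdynp", "EmAn"]
tags = [None, None, "Pr", None, None]

print(apply_keyword_rule(tokens, tags, keywords))
```

Only the words that follow a keyword receive the proper-noun tag; every other position keeps whatever tag it already had.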
Furthermore, a particle lexicon containing the words that belong to the particle class
has been built in this work. The motivation for building it is that, during the initial
experiments carried out to test the tagger, some words were tagged incorrectly. Since
the pattern-based module was designed for words belonging to the verb or noun class,
some words belonging to the particle class were matched to incorrect patterns by the
pattern-matching algorithm. For example, the word wmnhA, "and from it", matches the
pattern fElhA as shown in figure 5.8 and takes an incorrect tag, because this word
belongs to the particle class while all the words that follow the pattern fElhA belong
to the verb class, such as the word ktbhA, "he wrote it", shown in figure 5.9. Thus, to
reduce the errors in tagging such words and to enhance the performance of the tagger,
a separate particle lexicon was generated.
Figure 5.8: Matching the word wmnhA with the pattern fElhA
Figure 5.9: Matching the word ktbhA with the pattern fElhA
all prefixes including all valid concatenations, a single lexicon of all Arabic words be-
Table 5.10 shows a sample of particles lexicon which generated from Table 5.9 ele-
~ PrPpFeSn ~j PrCo+PrPpFeSn
l? PrPpFeSn ~j PrCo+PrPpFeSn
mark assigned only to the last letter of each word in testing corpus) Arabic text as
input, and to produce a POS tagged partially-vocalised Arabic corpus. AMT as shown
Figure: Architecture of AMT (untagged text, Tokenizer Module, Pattern-based Module,
Lexical and Contextual Module, Patterns Lexicon)
The list below describes the function of each module in more detail.
• Tokeniser Module
A token is not just a word. It is defined as a sequence of characters having a
collective meaning [17]. A token may be a special character, a number or a
word. The main function of the tokeniser module is to convert the untagged input
text into a form that is more manageable by the machine. This conversion is
tagged input text and identifying words, punctuation marks, numbers and other
marks using the space as a delimiter. The tokeniser simply separates the input
text into tokens, including the splitting of punctuation marks (such as full stops
• Pattern-based Module
The main function of this module is to look up each testing word in the patterns
the testing corpus with its correct pattern in the patterns lexicon. If the correct
pattern of the testing word is found in the patterns lexicon, the tag is extracted
from the patterns lexicon and assigned to the testing word. After this module has
finished its task, the remaining words are passed to the lexical and contextual module.
• Lexical and Contextual Module
The lexical and contextual module has been built into this system to assist the
pattern-based module in tagging those words that do not have patterns stored in the
patterns lexicon. This module is responsible for applying the lexical and contextual
rules to assign the correct tag to each word not tagged by the pattern-based module.
AMT performs several steps during the tagging process, as shown in figure 5.11. The
token is first looked up in the particle lexicon. If it is found, its tag is extracted
and assigned to the token. The token is then passed to the
Figure 5.11: The tagging process in AMT (raw text, Tokeniser Module, Pattern-based
Module, Lexical and Contextual Module)
check whether the token has a pattern in the pattern lexicon. If the token matches its
pattern, the tag is extracted from the pattern lexicon and assigned to the token.
If the pattern of the token is not found in the pattern lexicon, the token is passed to the
At this stage, the lexical and contextual module is applied to assign the correct tag
to each token that has not been tagged by the pattern-based module. Finally, for the
very few tokens still untagged by the above modules, a user intervention menu has been
added to the main menu of the system (see figure 5.12). It allows the user to add a new
pattern with its general tag, or at least the simple form of the tag, e.g. (Ve) for
verbs or (Nu) for nouns, if the token belongs to the verb or noun class, or to add
the token itself with its general tag (Pr) if the token belongs to the particle class.
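The order of the stages described above can be summarised in a short sketch. The function names and the stub stages are hypothetical, the sample tags are illustrative, and the sketch assumes that a token found in the particle lexicon keeps that tag and skips the later stages. Contextual rules really need the neighbouring tokens; they are reduced to a single-token callback here to keep the control flow visible.

```python
def tag_token(token, particle_lexicon, match_pattern, apply_rules, ask_user):
    """Return (tag, stage) for one token, trying each stage in turn."""
    # Stage 1: look the token up in the particle lexicon.
    if token in particle_lexicon:
        return particle_lexicon[token], "particle lexicon"
    # Stage 2: pattern-based module (returns None when no pattern matches).
    tag = match_pattern(token)
    if tag is not None:
        return tag, "pattern-based module"
    # Stage 3: lexical and contextual rules (returns None when no rule fires).
    tag = apply_rules(token)
    if tag is not None:
        return tag, "lexical and contextual module"
    # Stage 4: user intervention supplies at least a general tag.
    return ask_user(token), "user intervention"


# Stub stages with illustrative data, for demonstration only.
particle_lexicon = {"wmnhA": "PrCo+PrPpFeSn"}
patterns = {"ktbhA": "Ve"}
rules = {"kmbywtr": "Nu"}

for token in ["wmnhA", "ktbhA", "kmbywtr", "xyz"]:
    print(token, tag_token(token, particle_lexicon, patterns.get, rules.get,
                           lambda t: "Nu"))
```

Checking the particle lexicon first is what prevents a particle such as wmnhA from being matched against a verb pattern.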
Since this tagger has been designed for tagging Arabic text, it is expected to be
easy for an Arabic-speaking user to use his or her knowledge to tag the remaining
untagged words by adding the simple form of the token tag. The main purpose of the user
intervention menu is to enrich the pattern lexicon as well as the particle lexicon with
new entries, leading towards a tagger that can accept any partially-vocalised Arabic
text. It is interesting to point out that adding one pattern by the user means that
many Arabic words in
The main menu of the AMT system, with an example showing how to perform the tagging
process on a very small part of a partially-vocalised Arabic text, can be seen in
figure 5.12.
of the AMT tagger: no manually tagged lexicon or training corpus, word-level tagging,
and tagging of partially-vocalised Arabic text. To the best of our knowledge, no such
tagger exists in the current literature. A new technique with a novel algorithm has
been applied in the AMT system. Since the system needs neither a lexicon of Arabic
words nor a training corpus, we instead generated a lexicon of patterns which are
associated with the last diacritical mark
Figure 5.12: The main menu of the AMT system with a tagging example
In this work the word "pattern" is used to represent the template of the whole word,
including the prefixes, form (root+infixes) and suffixes which are attached to the word.
The main technique is based on the pattern of the word instead of the word itself.
In addition, lexical and contextual rules have been used in this system to assist
the pattern-based technique in tagging those words that do not have a pattern stored
in the pattern lexicon. The AMT system presented in this work deals with
partially-vocalised Arabic text. It is the first POS tagger to use a purely rule-based
approach for such text. A full description of the AMT modules and the function of each
module has also been given in this chapter. Finally, we described the tagging process
that the AMT system carries out.
Chapter 6
A partially-vocalised Arabic text is needed to test the AMT system. The lack of large
tain the testing corpus for the tagger, a new partially-vocalised Arabic corpus has been
compiled. It contains 20,000 words. Since the text in school textbooks contains
diacritics, the corpus was extracted and collected from these textbooks via the official site
The text in the testing corpus was normalised manually; that is, the diacritics other
than the last diacritical mark were removed, and the last diacritical mark was added
to those words that did not have it. Although not all words in school textbooks have
diacritics, especially for the higher level classes, the text in school textbooks
The aim of the normalisation process, which was carried out in consultation and
collaboration with an Arabic linguist (see footnote 2), is to ensure that each word in
the corpus carries only the correct last diacritical mark. Also, the corpus is manually tagged for
The corpus was chosen and extracted from different books for different levels of school
classes. It is not limited to a particular domain; it covers a wide range of topics such
Test data for the experiments was taken from the testing corpus. The data sets consist
of raw original Arabic script words; no further annotations exist for these data sets.
1. Set-1: consists of 3170 words representing several articles extracted from the
book of computer science and other science topics, such as biology for different
2 The author gratefully acknowledges the collaboration of Mr. Walid Alqnm, Ministry of
Education, Jordan. Email:
2. Set-2: consists of 7620 words representing several articles extracted from the
3. Set-3: consists of 9210 words representing several articles extracted from the
Five experiments were carried out to evaluate the AMT tagger. The first three
experiments were performed on set-1, set-2 and set-3 respectively. The fourth
experiment was performed to calculate the share of the pattern-based module and of the
lexical and contextual module in the results of the above three experiments. The last
experiment was done on a different text, the Quran text. A sample of 1016 words was
taken from the chapters Almulk and Alforqan. The diacritics were removed except for the
last diacritical mark. The aim of this experiment is to get a picture of the AMT
performance on a different text. The results of this experiment also de-
Several measures are used to indicate the performance of tagging systems. Success rate
(footnote 3), ambiguity (footnote 4), recall and precision are the most popular
measures of the accuracy of the tagger output ([73], p.82). The success rate measure is
used when the tagger assigns a single tag to each token, as does the tagger presented in
The ambiguity measure is used when the tagger assigns multiple tags per token.
Ambiguity is calculated by dividing the total number of tags by the total number of
tokens. Recall and precision, which originate in information retrieval, are an
alternative pair of measures used in tagging. Recall is the number of correct token-tag
pairs that are produced, divided by the number of correct token-tag pairs that are
possible. Precision is the number of correct token-tag pairs that are produced, divided
by the total number of token-tag pairs that are produced. Like the success rate measure,
ambiguity, recall and precision are expressed as a percentage ([73], p.83).
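The measures defined above translate directly into code. The sketch below is illustrative only; the function names are our own, success rate is taken as the number of correctly tagged tokens divided by the total number of tokens, and ambiguity is returned as a plain tags-per-token ratio. The example uses the Set-1 counts reported for Experiment-1 (2830 correct out of 3170).

```python
def success_rate(correct_tokens, total_tokens):
    """Percentage of tokens that received the correct single tag."""
    return 100.0 * correct_tokens / total_tokens


def ambiguity(total_tags, total_tokens):
    """Average number of tags assigned per token."""
    return total_tags / total_tokens


def recall(correct_pairs_produced, correct_pairs_possible):
    """Percentage of the possible correct token-tag pairs that were produced."""
    return 100.0 * correct_pairs_produced / correct_pairs_possible


def precision(correct_pairs_produced, total_pairs_produced):
    """Percentage of the produced token-tag pairs that are correct."""
    return 100.0 * correct_pairs_produced / total_pairs_produced


# Set-1: 3170 tokens, 2830 of them tagged correctly.
print(round(success_rate(2830, 3170), 1))
```

For a tagger that assigns exactly one tag to every token, ambiguity is 1 and recall and precision both coincide with the success rate, which is why a single measure is sufficient in that case.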
Since the AMT tagger presented in this work produces a single tag for each word in the
testing corpus, the success rate measure was used to indicate the performance of the
AMT system. The success rate for each experiment and the ratio of tag types used in
each experiment were calculated. In addition, the distribution of POS classes for the
first three experiments has also been addressed. The details of the results for each
experiment are described in the following sections (6.2.1 - 6.3.1).
6.2.1 Experiment-1
The first experiment was performed on the first set. AMT correctly tagged 89% of
the first experiment are detailed tags, which include the inflectional features of each
word (see figure 6.2). This ratio indicates that the majority of the correctly tagged
tokens were not tagged at the general level. In addition, the distribution of POS
classes for the text in experiment-1 can be seen in figure 6.3. It is expected that the
tokens belonging to the noun class have the highest ratio, since most Arabic words
belong to the noun class rather than to any other POS class. Usually, but not always
(depending on the testing text), particles in Arabic have the second highest ratio
after nouns.
Figure (success rate for Set-1): Set-1 size 3170; correct 2830 (89%); incorrect 340 (11%)
6.2.2 Experiment-2
The second experiment was performed on the second set, which contains 7620 words.
AMT correctly tagged 94% of the set-2 words, as shown in figure 6.4. Of the correctly
tagged words (figure 6.5), 78% of the tags assigned to tokens in the second experiment
are detailed tags, while 22% are general tags. This ratio varies according to the type
of text. In addition, figure 6.6 shows that 63% of the text tokens in this experiment
belong to the noun class, while 13% belong to the verb class. Tokens belonging to the
particle class and the punctuation class constitute 16% and 8%
6.2.3 Experiment-3
The third experiment was performed on set-3, which contains 9210 words. AMT correctly
tagged 91% of the set-3 words, as shown in figure 6.7. Of the correctly tagged words in
this experiment, 59% of the tags assigned to tokens are detailed tags while 41% are
general tags (see figure 6.8). Furthermore, 67% of the text tokens in this experiment
belong to the noun class, while 10% belong to the verb class. Figure 6.9 also shows
that 17% of the text tokens belong to the particle class and 9% to the punctuation class.
Figure 6.7: Success rate for Set-3 (correct 91%, incorrect 9%)
6.2.4 Experiment-4
This experiment was performed to measure how much the pattern-based module and the
lexical and contextual module each contributed in the above three experiments. Figure
6.10 shows that 91% of the testing corpus tokens are tagged correctly. Of these
correctly tagged tokens, 48% are achieved by applying the pattern-based module while
52% are achieved by applying the lexical and contextual module.
Figure 6.10: Correctly tagged tokens: 18433; achieved by applying pattern-based rules:
8848; achieved by applying lexical and contextual rules: 9585
The results of the first three experiments show that the proportion of correctly tagged
words varies according to the domain of each text. The style and content of the text
are among the main factors that affect the accuracy of the tagger. The text in
experiment-1 relates to a computer science topic, in which some words are Arabised
words; these are not original Arabic words but come from other languages, and they do
not have a root or pattern. An example is the word kmbywtr, "computer". Most of these
and experiment-3. This is an expected result, as the text of experiment-2 relates to an
Arabic language topic at school level, where most of the words are original Arabic
words which have a root and a pattern. In addition,
the percentage of correctly tagged words that belong to verb class and punctuation
The different subject of the text in experiment-3, which relates to literary topics,
is probably the reason why the accuracy of tagging this text is lower than in
experiment-2. Many proper nouns and Arabised words are used in this type of text. Since
recognising proper nouns constitutes the most obvious problem in tagging Arabic text,
most of the errors came from proper and Arabised nouns. These proper and Arabised nouns are
In addition, Arabic has irregular verbs, such as the word Dl=a, "to go astray". Also,
some words in Arabic are considered primitive verbs, such as b'sa, "what a bad ... !".
These words are also tagged incorrectly. For example, one of these words matches a
pattern in the lexicon, and the wrong tag is assigned to it.
The above three experiments show that most words in the testing texts belong to the
noun class, followed by particles, verbs and punctuation marks. The overall accuracy
of the tagger has been calculated by comparing its output with the goal corpus, which
is manually tagged. The tagger achieves 91%. Since there is no
On the other hand, 48% of the accuracy is achieved by applying the pattern-based module
while 52% is achieved by applying the lexical and contextual module (see figure 6.10).
On the one hand, the lexicon contains more patterns for the verb class than for the
noun class; on the other hand, most of the words in the testing corpus belong to the
noun class rather than to any other POS class. It is therefore natural that these words
have been tagged using lexical and contextual rules. For this reason, the accuracy
achieved by applying the pattern-based rules
One of the problems we faced during the experiments is the tag of the passive perfect
verb. A passive perfect verb is assigned the same detailed tag as an active perfect
verb, since both share the same last diacritical mark. For example, of the words katba,
"he wrote", and kutba, "it was written", the former is an active perfect verb while the
latter is a passive perfect verb. Since both words share the same last diacritical mark
and match the same pattern (see footnote 5), AMT extracts and assigns the detailed tag
VePeMaSnThSj to both words, which means Perfect Verb, Masculine gender, Singular
number, Third Person, Subjunctive mood. This detailed tag is correct for the former
word (katba), because this word means "there is a gentleman who wrote" and the
inflectional features MaSnThSj describe that clearly. It is not correct (except for the
mood feature (Sj)) as a detailed tag for the latter word (kutba), since this word
describes something non-human (a book, a lesson) that has been written. At the same
time, the general tag VePe is valid and correct for both words. This example shows that
a smaller (general) tag set may help to increase the performance of the tagger.
5 Ignoring any diacritical mark other than the last one.
The AMT system presented in this work does not differentiate between passive and
active perfect verbs and assigns the same detailed tag to both. This problem affects
only those words that are passive perfect verbs. Although the general tag is valid and
correct for those words, solving the problem would mean adding an additional diacritic
mark to the first letter of each word in the testing corpus and of each pattern in the
pattern lexicon. This requires great effort and time compared with the very small
number of such words in the testing corpus, since most perfect verbs in Arabic are
active. For example, 0.0005% of the words in our testing corpus are passive perfect
verbs. Thus, the general tag is assigned to these words.
Another problem that appeared during the experiments relates to nouns that end with the
long vowel Alif. Some of these nouns are wrongly matched with verb patterns. As an
example, the word nmwhA, "growth", matches a verb pattern as shown in figure 6.11.
Since this is a verb class pattern, an incorrect tag was extracted from the patterns
lexicon and assigned to the word nmwhA, which belongs to the noun class. The main
reason for matching the incorrect pattern is that neither the pattern nor the word ends
with a diacritical mark; instead they end with the long vowel letter Alif, which fills
the place of the last diacritical mark (fatha mark).
Figure 6.11: Matching the word nmwhA with its pattern
Very few words ending with Alif belong to the noun class, but because of them the
pattern-matching algorithm is not 100% accurate. The best solution to this problem is
to compile a lexicon that contains all the Arabic roots. One more step could then be
added to the pattern-matching algorithm. The aim of this new step is to extract from
the testing word the three letters which correspond to the root letters (f, E, l) in
the pattern. The resulting word is then looked up in the root lexicon to check whether
it constitutes a valid Arabic root. If it is found in the root lexicon, the original
testing word belongs to the verb class; otherwise it belongs to the noun class. For
example, the three letters extracted from the word nmwhA (see figure 6.11) do not form
a valid Arabic root, so the original testing word nmwhA belongs to the noun class.
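The proposed extra step can be sketched as follows. This is a sketch of a step the thesis proposes but does not implement; the function names, the transliterated pattern fElhA and the one-entry `toy_roots` lexicon are assumptions made for illustration, and a real root lexicon would have to list all valid Arabic roots.

```python
ROOT_SLOTS = ("f", "E", "l")  # placeholders for the three root letters


def extract_root(word, pattern):
    """Collect the word's letters that sit at the pattern's root positions."""
    return "".join(w for w, p in zip(word, pattern) if p in ROOT_SLOTS)


def classify_by_root(word, pattern, root_lexicon):
    """Verb if the extracted letters form a known root, otherwise noun."""
    return "verb" if extract_root(word, pattern) in root_lexicon else "noun"


# Toy root lexicon containing a single root.
toy_roots = {"ktb"}

print(classify_by_root("ktbhA", "fElhA", toy_roots))
print(classify_by_root("nmwhA", "fElhA", toy_roots))
```

With this toy lexicon, ktbhA keeps its verb reading while nmwhA is redirected to the noun class, which is the behaviour the proposed step is meant to achieve.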
Compiling a lexicon that contains all valid Arabic roots is possible, but it takes
time. This problem did not have a noticeable impact on the effectiveness of our tagger,
because the number of such words in the testing corpus is very small (i.e. 0.0043%). In
addition, the problem emerged in the final stages of our experiments and is outside the
scope of this research.
The size of the tag set for an annotation system has a direct influence on the accuracy
of the tagging system. A smaller tag set may help to increase the performance of the
tagger. However, using a smaller tag set means providing less linguistic information,
making the whole tagging system less useful for linguists and NLP developers (e.g. for
building an educational system), especially if the aim of the tagger is to produce a tagged
Out of all the tagged corpus tokens, 68% of the tags are detailed tags while 32% are
general tags, as shown in figure 6.12. Since all the tags in the pattern lexicon are
detailed tags, each token tagged by applying the pattern-based rules is necessarily
assigned a detailed tag. In addition, most of the tags designed with the lexical and
contextual rules are also detailed tags. Most of the 32% of general tags include one or
sometimes two inflectional features. In other words, most of the tags designed with the
lexical and contextual rules carry inflectional features such as mood, state or case.
An example is NuCnld (Common noun, Indefinite). Such a tag has been counted within the
32% of general tags. However, we tend to enrich each word in the testing
All the POS tagging systems built for Arabic (described in chapter 2) share the
following characteristics: (1) they deal with unvocalised Arabic text; (2) they need a
manually tagged corpus. The current tagger system deals with partially-vocalised
Arabic text without using a manually tagged corpus. In addition, the current tagger
assigns the tag to the testing word based on the pattern of that word instead of the word
itself. Although the current tagger uses a different technique and a different type of
text, we would still like to compare the results obtained from the current tagger (AMT)
with the results of the Khoja tagger (APT). Unfortunately, the source code of the Khoja
tagger is not available on her site. In addition, we had no luck in contacting her to
acquire the source code for her tagger.
A sample of 1500 words was taken randomly from the above three test sets, and
the last diacritical mark was removed from the words; in other words, the text became
unvocalised. An experiment was performed to tag this sample, and the result is shown in
figure 6.13.
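The text preparation used in these experiments (keeping only the last diacritical mark, or removing it as well) can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used: it assumes Unicode input and treats the code points U+064B to U+0652 (tanween, short vowels, Shadda and Sukun) as the diacritical marks. The function names are hypothetical.

```python
# Illustrative sketch only; the function names are hypothetical.
# Assumption: the diacritical marks are the Unicode code points U+064B..U+0652
# (tanween, short vowels, shadda, sukun).
DIACRITICS = frozenset(chr(cp) for cp in range(0x064B, 0x0653))


def strip_all_diacritics(word):
    """Return the word with every diacritical mark removed (unvocalised form)."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def keep_final_diacritics(word):
    """Remove all marks except those attached to the last letter of the word."""
    # Position of the last base letter; marks after it belong to that letter.
    last_letter = max(
        (i for i, ch in enumerate(word) if ch not in DIACRITICS), default=-1
    )
    return "".join(
        ch for i, ch in enumerate(word) if ch not in DIACRITICS or i > last_letter
    )


vocalised = "\u0643\u064E\u062A\u064E\u0628\u064E"  # fully vocalised 'kataba'
partially = keep_final_diacritics(vocalised)    # letters plus the final fatha
unvocalised = strip_all_diacritics(partially)   # letters only
```

Applying `keep_final_diacritics` to every word gives a partially-vocalised text in the sense used here, while `strip_all_diacritics` gives the unvocalised text used for the 1500-word sample.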
Figure 6.12: Detailed and general tag ratio overall in the correctly tagged corpus
Figure 6.13: Success rate for unvocalised sample text which contains 1500 words
The AMT correctly tagged 21% of the unvocalised sample text. Most of the correctly
tagged tokens in this sample are particles and punctuation marks, together with some
proper nouns. This is not a surprising result, since the patterns as well as the lexical
and contextual rules examine the last diacritical mark during the tagging process.
In addition, the result of experiment-4 shows the importance of the last diacritical
mark in reducing the lexical ambiguity and providing semantic information about the
word, which helps the POS tagger to determine the correct tag of each word in the
testing text. AMT correctly tagged 91% of the testing words. Since the majority of Arabic
words are nouns, a default tag, that is, NuCn (Common noun), is assigned to the
remaining words (9%). These words are stored in a special list and were reviewed.
The default tag is correct for most of the remaining words, and it reduced the ratio of
Another experiment was performed to get a picture of the tagger's accuracy on a
different text. A sample of the Quran text was taken from the chapters Almulk and
Alforqan (see figure 6.14); it contains 1016 words. The diacritics were removed except
for the last diacritical mark.
A set containing 1016 words was taken from the Quran. The diacritics
Figure 6.14: Tagged sample of the Quran text
Figure 6.15: Summary of results for the Quran text with total size 1016 word tokens
The AMT system correctly tagged 88% of the Quran sample, as shown in figure 6.15.
Of the correctly tagged tokens, 43% were tagged by applying the pattern-based module,
while 57% were tagged by applying the lexical and contextual module. Some of the sample
words in this experiment (experiment-5) are classical Arabic words. Since Modern
Standard Arabic (MSA) is the form in current usage, Arabic writers use the MSA
equivalents of these words instead of the classical Arabic words. A sample of classical
words which have been used in the Quran text and their meaning in MSA text can be seen in
Table 6.1.
Most of these classical words are tagged incorrectly because the patterns
are valid patterns for MSA text rather than Classical text. On the other hand, the Quran
text shares some of the errors described above with the MSA text. For example, some proper
nouns are also used in the Quran text, such as ~ y, ~!.r. \, (Y and~. None
of these proper nouns has a pattern to follow. Also, some nouns are wrongly
matched with verb patterns, especially the pattern J..;. Furthermore, the same problem
relating to nouns ending with the long vowel Alif appeared during the Quran text experiment.
Table 6.1: Sample of classical words used in the Quran text and their meaning in MSA text
MSA meaning (transliteration)   English
shqwq                           creases
HjArP                           stones
Emyqa                           deeply
Despite the errors described above, the AMT has achieved very good accuracy on the
Quran text. Figure 6.15 shows that the Quran text is similar to MSA text with
regard to the POS classes its words belong to. For example, most of the Quran words
are nouns and particles (72%) rather than verbs. In addition, 43% of the
Quran sample words have been tagged using pattern rules. This is a natural ratio since
the Quran words are derived from roots and most of the Quran words have
patterns to follow. At the same time, 69% of the sample words are tagged by detailed
The summary of all the results obtained from the AMT system for all the experiments
described above shows that the proportion of correctly tagged words varies according to
the domain of each text. The AMT system achieved very good accuracy due to the fact that it does
Since no large tagged corpus was available to the tagger system presented in this
work, this accuracy enables us to point out that it is possible to build a tagger system
for Arabic that does not require a large tagged corpus. Such a tagger helps to solve
the problem of the lack of a large tagged corpus for Arabic in the current literature.
In addition, the diacritical mark, especially on the last letter of the word, plays a great
role in reducing the lexical ambiguity and determining the correct POS tag for each
word in the testing corpus. Although the tagger system presented in this work has many
strengths (described in the next chapter, section 7.4), the problems that have been
• The system is not accurate in tagging proper nouns and Arabised words.
• The system does not differentiate between the passive and active perfect verb and
As mentioned earlier in section 6.3, the above shortcomings did not have a noticeable
impact on the effectiveness of the tagger because the number of such words
found in the testing corpus is very small. For example, 0.0005% of the testing words
are passive perfect verbs and 0.0043% are nouns ending with the long vowel Alif.
However, solving these shortcomings to enhance the performance of the AMT tagger are
tem using a new partially-vocalised Arabic testing corpus. The data sets used during
the experiments are described. The results of the experiments and the
analysis of these results are also explained. The results show that AMT achieved an
average accuracy of 91% on the testing corpus, which contains 20,000 words. The
shortcomings of the AMT system are also mentioned. The main conclusions reached during
the course of this research, the strengths of the tagger system, and future
Chapter 7
Several Part-of-Speech tagging systems with high tagging accuracy have been devel-
oped, especially for English based on text statistics or on grammar rules. Unlike En-
glish, the Part-of-Speech tagging systems for Arabic as a research field in Arabic NLP
for Arabic. These systems were built to tag unvocalised Arabic text using a lexicon
or dictionary that was tagged manually and used as a training corpus containing all
The Arabic language has a valuable and an important feature, called diacritics, which
are marks placed over and below the characters of the word. An Arabic text may be
written with diacritics or without. An Arabic text that appears without short vowels
and diacritics is called unvocalised text, while written Arabic text with full
representation of short vowels and other diacritical marks is called fully-vocalised text.
An Arabic text is a partially-vocalised text when the diacritical marks are assigned to one or max-
This thesis represents a substantial starting point for developing a rule-based
part-of-speech tagging system that deals with partially-vocalised Arabic text. It is the
first tagger that (1) uses only linguistic rules and (2) investigates the role of the last
diacritical mark in helping to determine the correct POS tag for each word in the testing
corpus. The main function of
A novel technique, the pattern-based technique, has been explored using a novel algorithm
(the pattern-matching algorithm). In this technique, the Arabic word is tagged based on its
pattern. A lexicon of patterns associated with the last diacritical mark was
generated automatically and used instead of a huge Arabic word lexicon. The advantages
of this technique are twofold. First, it does not need a lexicon or training corpus.
Second, it reduces the storage space, since hundreds of Arabic words may follow one
pattern. Additionally, a set of linguistic rules (the lexical and contextual technique) based
on the character(s), affixes, the last diacritical mark, the word itself, and the surrounding
words or the tags of the surrounding words was used to tag those words not tagged
by the pattern-based technique.
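The two-stage flow summarised above (pattern lexicon first, then lexical and contextual rules, then the default tag NuCn) can be illustrated with a small sketch. It is a simplification under stated assumptions, not the AMT implementation or the pattern-matching algorithm of chapter 5: words and patterns are assumed to be transliterated and aligned letter by letter, the lexicon is assumed to be keyed by a (pattern, last diacritical mark) pair, and the rules are assumed to be plain functions. All names, lexicon entries and tag strings other than NuCn are hypothetical placeholders.

```python
# Hypothetical sketch of the two-stage tagging flow; not the AMT implementation.
ROOT_SLOTS = frozenset("fEl")  # root-letter markers in transliterated patterns


def fits_pattern(word, pattern):
    """Root slots accept any letter; every other pattern letter must match."""
    return len(word) == len(pattern) and all(
        p in ROOT_SLOTS or p == w for w, p in zip(word, pattern)
    )


def tag_word(word, last_mark, pattern_lexicon, fallback_rules, default_tag="NuCn"):
    """Try the pattern lexicon first, then fallback rules, then the default tag."""
    for (pattern, mark), tag in pattern_lexicon.items():
        if mark == last_mark and fits_pattern(word, pattern):
            return tag
    for rule in fallback_rules:
        tag = rule(word, last_mark)
        if tag is not None:
            return tag
    return default_tag


# Toy data: entries and tag names below are placeholders, not thesis data.
toy_lexicon = {("mfEwl", "kasra"): "TAG_FROM_PATTERN"}
toy_rules = [lambda w, m: "TAG_FROM_RULE" if w.startswith("Al") else None]

print(tag_word("mktwb", "kasra", toy_lexicon, toy_rules))  # TAG_FROM_PATTERN
print(tag_word("Alktb", "damma", toy_lexicon, toy_rules))  # TAG_FROM_RULE
print(tag_word("xyz", "fatha", toy_lexicon, toy_rules))    # NuCn
```

Because one lexicon entry covers every word that follows its pattern, the lexicon stays small compared with a word list, which is the space advantage noted above.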
The system was developed to answer the hypothesis and research questions mentioned in
chapter 1. The accuracy achieved by the AMT system is 91%. This enables
1. it is possible to build a tagger system for Arabic without needing a huge lexicon for
2. the diacritical mark, especially on the last letter of the word, plays a great role in
reducing the lexical ambiguity and determining the correct POS tag for each word
in the testing corpus.
tagged lexicon which is still not available in the current literature and most of the
Section 7.1 summarises the importance of the diacritic feature. Section 7.2 describes the
contributions of this research, while section 7.3 points out directions for future work.
The lack of diacritics in Arabic texts presents a major challenge to most Arabic
NLP tasks. The use of diacritics in Arabic texts is extremely important. The list
1. They add semantic information to words, which helps with resolving ambiguity
2. They help in determining the correct POS tag for the words in the sentence.
3. They ascribe grammatical functions to the words, differentiating the word from
other words, and determining the syntactic position of the word in the sentence.
In addition, the last diacritical mark helps not only in determining the correct
part-of-speech of the words in the sentence, but also in providing full information regarding
7.2 Contributions
The contributions of this research to the field of NLP can be summarised as follows:
This research has developed a POS tagger system called AMT (short for Arabic
Morphosyntactic Tagger). AMT deals for the first time with partially-vocalised
Arabic text. The main aim of AMT is to annotate the testing corpus by adding a
POS tag or label to each word in the testing corpus and to produce a POS tagged
partially-vocalised Arabic text. It can also be used as a prerequisite tool for many
A new morphosyntactic tag set that is derived from the ancient Arabic grammar
has been developed, which is based on the Arabic system of inflectional
morphology. The tag set does not follow the traditional Indo-European tag set that
is based on Latin; instead it is based on the Semitic tradition of analysing
language. These tags contain a large amount of information and add more linguistic
attributes to the word. The Arabic tag set contains 161 detailed tags and 28
general tags covering the major Arabic POS classes and sub-classes which have been
sen and extracted from different books for different levels of school classes has
been compiled in this work and introduced in chapter 6. The corpus was tagged
using the AMT system presented in this research. It will be available (both raw and
AMT is a rule-based system. It has two rule components. The first component
pattern of the testing word. These patterns are associated with the last diacritical
rithm) has been designed and built in this work and introduced in chapter 5. The
aim of this algorithm is to match the inflected word in the testing corpus with its
pattern in the pattern lexicon. The second component is the lexical and
contextual rule. The trigger in the contextual rules depends on the current word itself
and on the tags or words in the context of the current word, while the trigger in the
During the course of this research, many areas emerged that deserve more study, these
• Further research into the expansion of the pattern lexicon to contain all Arabic pat-
• Improving the tagger by defining and encoding an additional set of Arabic
tagging rules.
• Evaluating the tagger and comparing its results with other tagger(s) that deal with
• Building a lexicon containing all Arabic roots to enhance the performance of the
tagger system and of the pattern-matching algorithm as well.
7.4 Summary
In conclusion, all the POS tagging systems for Arabic described in this work (see
chapter 2) were built to tag unvocalised Arabic text. The AMT system presented in this
• It is the first tagger that uses a purely rule-based approach and applies a novel
technique, that is, the pattern-based technique. The tag is assigned to the word based on the pattern
• It is the first tagger to investigate the role of the diacritic feature in the Arabic language.
text. The last diacritical mark plays a great role in removing a great deal of lexical
Appendix A
Tagset Appendices
Person, Subjunctive
Second Person, Jussive
Tag            Description                                                       Transliteration   English
VePeFePlThJs   Verb, Perfect, Feminine, Plural,                                  Ktbna             They(Pl,Fe)
VePeMaPlThDc   Verb, Perfect, Masculine, Plural, Third Person, Indicative        ktbwA
VePeNeSnFsJs   Verb, Perfect, Neuter, Singular, First Person, Jussive            Elmthmx           I teach them
VePeNePlSeJs   Verb, Perfect, Neuter, Plural, Sec-                               Elmtwhmx          You teach them
               lar, Third Person, Indicative
VePiFeSnThJs   Verb, Imperfect, Feminine, Singular, Third Person, Subjunctive    tElmhmx           You(Sn) teach them
VePmMaSnSeJs   Verb, Imperative, Masculine, Singular, Second Person, Jussive     Ouktbx            You(Sn,Ma) Write
VePmFeSnSeJs   Verb, Imperative, Feminine, Singular, Second Person, Jussive      Ouktbyx           You(Sn,Fe) Write
VePmNeDuSeSj   Verb, Imperative, Neuter, Dual, Second Person, Subjunctive        OuktbA            You(Du) Write
PrSb           Subjunctive Particle                                              ln                Never
PrJs           Jussive/Elision Particle                                          lm
Pr             Particle                                                          IdhA              If
N uPsMaSnSeAcId "
Personal Noun, Masculine, Singu- ~I Onta You(Sn,Ma)
lar, Third Person, Accusative, In-
tive, Indefinite
NuDeSnGeId     Demonstrative Noun, Singular, Genitive, Indefinite                h*h               This(Sn)
Genetive, Indefinite
Genetive, Definite
~ . Instructor(Fe,Sn)
Adjective Noun, Feminine, Singu- WaA mElmPN
lar, Nominative, Indefinite
Genetive, Indefinite
Genetive, Definite
Nominative, Indefinite
Genetive, Indefinite
Nominative, Definite
Accusative, Definite
Genetive, Definite
Accusative, Indefinite
Accusative, Definite
native, Indefinite
Nominative, Indefinite
Accusative, Indefinite
Nominative, Definite
Genitive, Indefinite
Genitive, Definite
Genitive, Definite
NuCnFePlGeDf   Common Noun, Feminine, Plural, Genitive, Definite                 AlmdArsi          Schools (Pl)
Genitive, Definite
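The tag names in this appendix appear to be built by concatenating two-letter codes, one per attribute. The sketch below decodes a tag on that assumption. The code table is inferred only from the rows that are legible in this appendix, so it is incomplete and may not match the full tag set; the function name is hypothetical, and compound tags joined with "+" are not handled.

```python
# Code table inferred from legible rows of this appendix; incomplete (assumption).
TAG_CODES = {
    "Ve": "Verb", "Nu": "Noun", "Pr": "Particle",
    "Pe": "Perfect", "Pi": "Imperfect", "Pm": "Imperative",
    "Ma": "Masculine", "Fe": "Feminine", "Ne": "Neuter",
    "Sn": "Singular", "Du": "Dual", "Pl": "Plural",
    "Fs": "First Person", "Se": "Second Person", "Th": "Third Person",
    "Dc": "Indicative", "Sj": "Subjunctive", "Js": "Jussive",
    "Cn": "Common", "De": "Demonstrative", "Ps": "Personal",
    "Ac": "Accusative", "Ge": "Genitive",
    "Df": "Definite", "Id": "Indefinite",
}


def decode_tag(tag):
    """Split a tag into two-letter codes and describe each one."""
    if len(tag) % 2 != 0:
        raise ValueError("tag length must be a multiple of two: " + tag)
    codes = [tag[i:i + 2] for i in range(0, len(tag), 2)]
    return [TAG_CODES.get(code, "unknown code " + code) for code in codes]


print(", ".join(decode_tag("VePmNeDuSeSj")))
# Verb, Imperative, Neuter, Dual, Second Person, Subjunctive
```

Unknown codes are reported rather than rejected, so tags that use attributes missing from the inferred table can still be inspected.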
Appendix B
alphabet to give a reader unfamiliar with the language sufficient information for accurate pronunciation
still presented. Marshall Hodgson ([70], p. 4) defines transliteration as "the rendering of the spelling
of a word from the script of one language into another language", and transcription as "the
rendering of the sound of a word so that a reader can pronounce". For example, the transliteration of
the Arabic word كتب may be 'ktb', while one transcription is "kataba" and another may be "kutub".
Many different approaches and a variety of ways of transliterating (romanizing) the Arabic language have
been developed. Some of these transliteration systems are listed below:
• Romanization Tables adopted by the US Library of Congress and the American Library
• DIN 31635 developed by the Deutsches Institut für Normung (German Institute for
Unfortunately, none of the systems described above is a universal standard for the transliteration and
transcription of the Arabic language. All the systems described above suffer from difficulties: for example,
they use special characters or add special marks to normal characters, which makes these systems
difficult to memorise, and most of these systems cannot be used easily with a standard computer
keyboard. On the other hand, a few Arabic letters have a clear equivalent in the Roman
Because of the difficulties described above, each person uses their own standard. Throughout this thesis
we use a transliteration system combining the Buckwalter and Al-kitaab transliteration systems, with
a few modifications of my own to transliterate the diacritical marks described in table B.4.
form of the letter. Arabic script is cursive and written from right to left [76].
Table B.1 shows the various forms of Arabic letters and transliteration of each letter which has been
26   haa    ه   h   hassel
27   waaw   و   w   welcome
28   yaay   ي   y   young
Table B.2: Hamza (glottal stop) with Alif, waaw, and yaay consonants
Furthermore, table B.3 shows the transliteration system for short vowels diacritical marks, while the
Sukun ~ x
Shadda w"". -
Appendix C
JIj F3CHCWD and Kasra mark CWDLM PrCo+NuCnGeDf
NuPo PWD ~~ or ~~ or ~~
Appendix D
