
A Morphological - Syntactical Analysis

Approach For Arabic Textual Tagging

Shihadeh Alqrainy

Thesis submitted in partial fulfillment


of the requirements for the Degree of
Doctor of Philosophy
in Computer Science

School of Computing
Faculty of Computing Sciences and Engineering
De Montfort University

July, 2008
Abstract

Part-of-Speech (POS) tagging is the process of labeling or classifying each word in

written text with its grammatical category or part-of-speech, i.e. noun, verb, prepo-

sition, adjective, etc. It is the most common disambiguation process in the field of

Natural Language Processing (NLP). POS tagging systems are often preprocessors in

many NLP applications.

The Arabic language has a valuable and important feature, called diacritics, which are marks placed over and below the letters of the word. An Arabic text is partially-vocalised1 when a diacritical mark is assigned to one or at most two letters in the word.

Diacritics in Arabic texts are extremely important, especially at the end of the word. They help not only in determining the correct POS tag for each word in the sentence, but also in providing full information regarding the inflectional features, such as tense, number, gender, etc., of the sentence words. They add semantic information to words which helps with resolving ambiguity in the meaning of words. Furthermore, diacritics ascribe grammatical functions to the words, differentiating the word from other words, and determining the syntactic position of the word in the sentence.


1 Vocalisation (also referred to as diacritisation or vowelisation).

This thesis presents a rule-based Part-of-Speech tagging system called AMT, short for Arabic Morphosyntactic Tagger. The main function of the AMT system is to assign the correct tag to each word in an untagged raw partially-vocalised Arabic corpus, and to produce a POS tagged corpus without using a manually tagged or untagged lexicon (dictionary) for training. Two different techniques were used in this work: the pattern-based technique and the lexical and contextual technique.

The rules in the pattern-based technique are based on the pattern of the testing word. A novel algorithm, the Pattern-Matching Algorithm (PMA), has been designed and introduced in this work. The aim of this algorithm is to match the testing word with its correct pattern in the pattern lexicon.

The lexical and contextual technique, on the other hand, is used to assist the pattern-based technique by assigning the correct tag to those words that do not have a pattern to follow. The rules in the lexical and contextual technique are based on the character(s), the last diacritical mark, the word itself, and the tags of the surrounding words.

The importance of utilizing the diacritic feature of the Arabic language to reduce the lexical ambiguity in POS tagging has been addressed. In addition, a new Arabic tag set and a new partially-vocalised Arabic corpus to test AMT have been compiled and presented in this work. The AMT system has achieved an average accuracy of 91%.

Contents

Abstract i

Dedication xii

Acknowledgments xiii

Publications xv

Index of Transliteration xvi

1 Introduction 1

1.1 Overview 1

1.2 Motivation . 6

1.3 Research Hypothesis, Aims and Objectives 6

1.4 Significant Research Contributions 8

1.5 Outline of the thesis . . . . . . . . 10

2 Related Concepts and Literature Review 12


2.1 Part-of-Speech (POS) Tagging Problem 13

2.2 Applications of Part-of-Speech Tagging 17

2.3 Corpus-based Linguistics . . . . . . . . 19

2.3.1 Introduction. . . . . . . . . . . . . . . . . . . 19
2.3.2 Existing Corpora : English and other languages 20
2.3.3 Arabic language corpora 23
2.4 Part-of-Speech tag set. 25
2.4.1 Introduction. . 25
2.4.2 English and other languages POS tag sets 26
2.4.3 Arabic POS tag sets ......... 29
2.4.4 Justification for a new Arabic tag set. 32
2.5 Part-of-Speech Tagging Approaches 36
2.5.1 Rule-based Approach. 37
2.5.2 Statistical Approach 44
2.5.3 Advantages and disadvantages of rule-based and statistical ap-

proaches ........... 52
2.5.4 Hybrid and Other approaches 53
2.5.5 Arabic POS Tagging Systems 55
2.6 Chapter Summary . . . . . . . . 63

3 Arabic Language and POS tagging 65

3.1 Introduction . . . . . . . . . . . 65
3.2 Arabic script and diacritics feature 68
3.2.1 Brief history ...... 68
3.2.2 Arabic Diacritical Marks 70
3.3 Importance of the diacritic feature in Arabic POS tagging 71
3.4 Arabic Major grammatical part-of-Speech 74
3.4.1 Verb. 75
3.4.2 Noun 76

3.4.3 Particle ....... 77
3.5 Arabic Grammatical System 78
3.5.1 Morphology System 79
3.5.2 Syntax System 80
3.6 Chapter Summary . 82

4 Tag set Design 83

4.1 Tag set design criteria . 83


4.2 Arabic Inflectional Features . 87
4.2.1 Gender 88
4.2.2 Number. 89
4.2.3 Person. 89
4.2.4 Mood 89
4.2.5 Case. 90
4.2.6 State 90
4.3 ARBTAGS-The developed Tag set 90
4.3.1 ARBTAGS Hierarchy .. 90
4.3.2 Tag design of ARBTAGS . 99
4.3.3 Detailed and general tags in ARBTAGS tag set 102
4.4 Chapter Summary . . . . . . . . . ........... 107

5 Design and Implementation of AMT 109


5.1 AMT Characteristics . . . . . . . 109

5.2 Rule-based - the developed approach. 110

5.2.1 Justification for using the rule-based approach. 111

5.3 Pattern-based technique - A novel technique. 112

5.3.1 Pattern-based Rules .. 116

5.3.2 Pattern-matching algorithm 120
5.4 Lexical and Contextual technique . 126
5.4.1 Lexical Rules . . 127
5.4.2 Contextual Rules 128
5.5 A description of the tagger system 132
5.5.1 Tagger Modules 132
5.5.2 Tagging Process 133
5.6 Chapter Summary . . . . 135

6 Evaluation of Results obtained from AMT 137


6.1 Testing Data sets . . . . . . . . . . . . .... 137

6.2 AMT Experiments and accuracy measurement . 139

6.2.1 Experiment-1 . 140

6.2.2 Experiment-2 141

6.2.3 Experiment-3 142

6.2.4 Experiment-4 . 143

6.3 Experimental results Analysis 145

6.3.1 The Quran text experiment . 151

6.4 Summary of results obtained from the AMT system . 153

6.5 Chapter Summary . .................. 154

7 Conclusion 156
7.1 Importance of diacritic feature 158

7.2 Contributions 159

7.3 Future Works 160

7.4 Summary .. 161

Bibliography 161

Appendices 178

A Tagset Appendices 179


A.1 General Tags 179
A.2 Detailed Tags 180

B The Arabic Language Orthography 191


B.1 Arabic words and the Roman alphabet 191
B.2 Arabic alphabet and other diacritical marks 192

C Lexical and Contextual Rules 195


C.1 Names and description of lexical rules 195
C.2 Lexical Rule Examples . . . . . . . . 195
C.3 Names and description of contextual rules 196
C.4 Examples used contextual rules . . . . 196

D Permission for Collecting Testing Corpus 197

List of Figures

2.1 Ambiguity types ..... 14

2.2 The possible values of tag 15

2.3 ARBTAGS tag set hierarchy 33

2.4 Khoja tag set hierarchy [89] 34

2.5 Transformation-Based Error-Driven Learning. 43

2.6 How APT performs tagging. . ........ 57

3.1 The origin of the Arabic script 68

3.2 The Arabic grammatical system 78

4.1 Categories of Arabic verb 91

4.2 Categories of Arabic noun 92


4.3 Categories of Arabic particle 99

4.4 Verb sub-classes and their inflectional features 102

4.5 Noun sub-classes and their inflectional features 103

4.6 Particle sub-classes . . . . . . . . . . .. 103

5.1 the word 0yi1.A!j and its pattern 0~~j 115

5.2 the word jjtl. and its pattern J.'-~ ..... 116

5.3 The identical letters between the word ~ and the pattern ~ 122

5.4 Matching the word ~ with the pattern ~ . . . . . . . . .. 123

5.5 Matching the word 0~·0 '"! with the pattern ~~ 124
5.6 Matching the word 0-"0 . ,! with the pattern ~ 125
5.7 Matching the word 0~·0 '''! with the pattern 0~ 125
5.8 Matching the word ~j with the patternIf.W 131
5.9 Matching the word ~ with the pattern If.W 131
5.10 An overview of AMT . . . . 132
5.11 How AMT performs tagging 134
5.12 Tagging process for simple part of text . 136

6.1 Success rate of experiment-1 . . . . . . 141


6.2 Detailed and general tags ratio in experiment-1 141
6.3 Distribution of POS classes in experiment-1 141
6.4 Success rate of experiment-2 . . . . . . . . 142

6.5 Detailed and general tags ratio in experiment-2 143

6.6 Distribution of POS classes in experiment-2 143

6.7 Success rate of experiment-3 . . . . . . . . 143

6.8 Detailed and general tags ratio in experiment-3 144

6.9 Distribution of POS classes in experiment-3 .. 144

6.10 Percentage of rules applicability based on type. 144

6.11 Matching the word tAy: with the pattern If.W 148

6.12 Detailed and general tag ratio overall in the correctly tagged corpus 150

6.13 Success rate for unvocalised sample text which contains 1500 words 150

6.14 A sample of Quran text . . . 151

6.15 The result of the Quran text . 152

List of Tables

2.1 Ambiguous words in Arabic sentence 16

2.2 Sample of Brown tag set 26


2.3 Sample of LOB tag set 27
2.4 Sample of Penn Treebank tag set 28

2.5 Sample of Khoja tag set . 30


2.6 The LDC POS tagset . . 31
2.7 Sample of Alshamsi and Guessoum tag set . 31

3.1 Arabic short vowels diacritics . . . . . 70


3.2 Nunation (Tanween) Vowels diacritics 70
3.3 Sukun and Shadda vowels ..... . 71

3.4 Vocalisation state of the Arabic word . 71

3.5 Unvocalized Arabic sentence and its POS tags. 72

3.6 The possible last diacritical mark (case ending) of the word w~ . 74

3.7 Partially-vocalised Arabic sentence and its correct POS tag 74

3.8 Samples of imperative verbs and their inflectional features 76

3.9 Samples of past tense (perfect) verb forms . . . . . . . . . 79

3.10 Samples of present (imperfect) and imperative verb forms. 80

3.11 Samples of additional forms such as verbal, diminutive, Adjective

nouns created from the same simple root ~ . ............ 80

4.1 Personal pronouns between Arabic and English 96


4.2 Abbreviation symbols of the main POS classes 100
4.3 Abbreviation symbols of the sub-classes of class verb 100
4.4 Abbreviation symbols of the sub-classes of class noun . 100
4.5 Abbreviation symbols of the sub-classes of class particle 100
4.6 The possible value of the inflectional feature (Gender) . 101

4.7 The possible value of the inflectional feature (Number). 101

4.8 The possible value of the inflectional feature (Person) .. 101

4.9 The possible value of the inflectional feature (Mood). 101

4.10 The possible value of the inflectional feature (Case) .. 101

4.11 The possible value of the inflectional feature (State). 102

4.12 Abbreviation symbols used in ARBTAGS tag set 104

4.13 ARBTAGS tag set vs. Penn Treebank tag set. 106

4.14 ARBTAGS general tags . . . . . . . . 106

4.15 Sample of detailed tags in ARBTAGS 107

5.1 Derived forms from the ground form (root) 114

5.2 Sample of prefixes, forms, suffixes for some imperfect verb words 117

5.3 Sample of pattern lexicon showing the pattern for some imperfect verb words 118

5.4 Sample of prefixes, forms, suffixes for some perfect verb words . 118

5.5 Sample of pattern lexicon showing the pattern for some perfect verb words 118

5.6 Number of identical letters between the word ~ and its patterns 120

5.7 Number of identical letters between the word 0..,.-0 fill and its patterns 124

5.8 Classification of Proper noun . . . . . . . . . . . . . . . . . . . .. 130

5.9 Sample of prefixes, particle word, suffixes for some particles words. 131

5.10 Sample of particles lexicon . . . . . . . . . . . . . . . . . . . . .. 131

6.1 Some Quran words vs. MSA words 153

B.l Arabic Alphabet. . . . . . . . . . . . 193

B.2 Hamza (glottal stop) with Alif, waaw, and yaay consonants 194

B.3 Arabic short vowels. . . . . . . . . . . . . . . . . . . . . 194

B.4 Other diacritical marks (Nunation,Sukun,gemination) in Arabic. 194

Dedication

To my lovely wife, who gave me unchanged affection, endless love, and constant en-

couragement over the years.

To my children, Ramzi, Dou' a, Iman, Ala'a and Malak for their patience, love, and

for enduring the ups and downs during the completion of this thesis.

This thesis is dedicated to them.

Acknowledgments

First of all, I would like to thank my God, who gave me the strength to finish this the-

sis. This thesis would not have materialised without the aid and collaboration of many

people whom I wish to thank.

I would like to start by acknowledging my first supervisor, Dr. Aladdin Ayesh. He was always available with accurate advice, an interesting suggestion or an encouraging word and a listening ear. His patience, guidance and help, both personally and

professionally, have been greatly appreciated. I also deeply appreciate the dedicated

support of the members of my supervision team, Prof. Robert John and Dr. John Cow-

ell who guided the completion of this thesis with many helpful insights and valuable

comments.

I am thankful to the members of my family, my mother, brothers and sisters for their long support. I am also thankful to my brothers-in-law. I would like to acknowledge the financial support afforded me by my brothers, Fathi Alqrainy, Dr. Saleh Abu-Soud and

Husain Dolat during a very tough period of my life. The only word I remember at this

point is gratitude.

I would also like to express my best thanks to my brothers, Walid Alqrainy and Ibrabiem Abu farab, for their continuing support. I would like to express my deepest gratitude to my close friend, Hasan Alserhan, for his kind assistance with so many

things through all the good times and especially through the bad times that we have

spent together during our trip.

I am also thankful to all my friends in the UK and Jordan. Last but not least, without the financial support of Albalqa'a Applied University in Jordan, my stay in the United

Kingdom would have been even harder and this research would not have happened.

Their contribution was certainly very helpful and to them I wish to extend my special

thanks.

List of Publications

1. Shihadeh Alqrainy and Aladdin Ayesh. Developing a Tagset for Automated POS

Tagging in Arabic. WSEAS TRANSACTIONS on COMPUTERS, 5(11):2787-2792, 2006.

2. Shihadeh Alqrainy and Aladdin Ayesh. Word Class Tagger and Tagset design

for partial-vocalized Arabic Text. In proceedings of 2nd Jordan International

Conference on Computer Science and Engineering (JICCSE 2006), Albalqa'a

Applied University, JORDAN, December 2006.

3. Shihadeh Alqrainy and Aladdin Ayesh. Rule-based Part-of-Speech Tagger for

Arabic. Submitted to (ACM) Transactions on Asian Language Information Processing.

Index of Transliteration

Arabic Alphabets 2

No | Name  | Con | Trans        No | Name  | Con | Trans
 1 | Alif  | ا   | A            15 | Daad  | ض   | D
 2 | baa   | ب   | b            16 | Taa   | ط   | T
 3 | taa   | ت   | t            17 | DHaa  | ظ   | DH
 4 | thaa  | ث   | th           18 | ayn   | ع   | E
 5 | jiim  | ج   | j            19 | ghayn | غ   | gh
 6 | Haa   | ح   | H            20 | faa   | ف   | f
 7 | khaa  | خ   | kh           21 | qaaf  | ق   | q
 8 | daal  | د   | d            22 | kaaf  | ك   | k
 9 | dhaal | ذ   | dh           23 | laam  | ل   | l
10 | raa   | ر   | r            24 | miim  | م   | m
11 | zaay  | ز   | z            25 | nuun  | ن   | n
12 | siin  | س   | s            26 | haa   | ه   | h
13 | shiin | ش   | sh           27 | waaw  | و   | w
14 | Saad  | ص   | S            28 | yaay  | ي   | y

continued
2 In the Arabic Alphabets table: Con = Consonant, Trans = Transliteration
Hamza (glottal stop) and Ta Marboota Consonants

Name             | Consonant | Transliteration
Hamza            | ء         | '
hamza above Alif | أ         | O
hamza below Alif | إ         | I
hamza above waaw | ؤ         | W
hamza above yaay | ئ         | }
Ta Marboota      | ة         | p
Alif Maqsoura    | ى         | Y

Short Vowels Marks

Name       | Mark on consonant | Transliteration | Pronunciation
Fatha sign | ـَ                | a               | /a/
damma sign | ـُ                | u               | /u/
kasra sign | ـِ                | i               | /i/

Other diacritical marks (Nunation, Sukun, gemination)

Name         | Mark on consonant | Transliteration | Pronunciation
Tanween fath | ـً                | an              | /an/
Tanween damm | ـٌ                | un              | /un/
Tanween kasr | ـٍ                | in              | /in/
Sukun        | ـْ                | x               |
Shadda       | ـّ                | -               |

Chapter 1

Introduction

1.1 Overview

Natural Language Processing (NLP) is one of the Artificial Intelligence (AI) fields that deals with analysing, understanding and generating human languages in order to interface with computers in both written and spoken contexts using natural human languages (e.g. English, Arabic, French, etc.) instead of computer languages (e.g. Java, C++, etc.)1. Understanding human languages is not an easy task for a computer that lacks the human knowledge of the world and the human experience with linguistic structures.

Multiple levels of knowledge are required to process human language. The list below summarises some of the different forms of knowledge relevant for natural language understanding ([23], p.10):

• Phonological knowledge: how words are related to the sounds that realise them.

• Syntactic knowledge: how words can be put together to form sentences.


1 For more: http://www.webopedia.com/TERM/N/NLP.html


• Semantic knowledge: the assignment of meaning to words in a sentence.

• Morphological knowledge: how words are constructed from the smallest meaning units, called morphemes. For example, the English word "cats" has two morphemes (cat and s).

• Pragmatic knowledge: how sentences are used in different situations.

This information is extremely necessary to resolve any type of ambiguity that may arise. In NLP a word, a phrase, or a sentence is called ambiguous if it can be reasonably interpreted in more than one way [33]. Ambiguity is arguably the single most important problem in NLP [66]. Natural language has a huge number of ambiguities at every level of description, such as lexical (many words tend to have multiple lexical categories2 or senses), syntactic or structural (words having different structural functions in a sentence), and semantic (some sentences can have multiple interpretations) [40]. Ambiguity types in NLP are discussed in more detail in chapter 2.

The main goal of the NLP field is to resolve the ambiguity that may be found in human language. Whether the ambiguity is lexical, syntactic or semantic, the disambiguation process is a central first step in most NLP tasks, such as machine translation, information retrieval, etc. [43].

The most common disambiguation process, which has received extensive attention from the NLP research community, is Part-Of-Speech (POS) tagging. POS tagging is the process of labeling or classifying each word in written text with its part-of-speech, i.e. noun, verb, preposition, adjective, etc. It is concerned with lexical ambiguity resolution. For example, the sentence "He will table the motion" is tagged as follows:
2 Also called grammatical class or part-of-speech


He/PPS will/MD table/VB the/AT motion/NN ./.

The descriptive symbols or notations PPS, MD, VB, AT, NN, BEZ, and JJ are called POS tags. Each symbol or tag indicates that the word belongs to a particular grammatical class. For example, PPS = subject pronoun; MD = modal; VB = verb (no inflection); AT = article; NN = noun; BEZ = present 3rd sg form of "to be"; JJ = adjective.
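As a small, purely illustrative sketch of this notation (not taken from the thesis), the slash-separated output above can be read into (word, tag) pairs; the function and variable names below are hypothetical:

```python
# Illustrative only: reading a slash-tagged sentence into (word, tag) pairs.
tagged_sentence = "He/PPS will/MD table/VB the/AT motion/NN ./."

def split_tagged(sentence):
    """Split each 'word/TAG' token on its last slash and return (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        word, _, tag = token.rpartition("/")  # rpartition tolerates words containing '/'
        pairs.append((word, tag))
    return pairs

print(split_tagged(tagged_sentence))
# [('He', 'PPS'), ('will', 'MD'), ('table', 'VB'), ('the', 'AT'), ('motion', 'NN'), ('.', '.')]
```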

Many words in languages are ambiguous: they may be assigned more than one POS tag [114]. For example, the English word round may be a noun, an adjective, a preposition, an adverb, or a verb. It is well known that part-of-speech depends on context. The word "table" in the above context is tagged as a verb while it can be a noun in other contexts (e.g., "The table is ready") [44].

Resolving these lexical ambiguities constitutes the main challenge and the ultimate goal of a POS tagging system3. Lexical information includes not only the part-of-speech of the word but also its inflectional features, such as tense, person, number, mood, case and gender. In general, this information needs to be available to the tagging system. It is encoded in a descriptive symbol called a tag and typically stored in a lexicon or a dictionary [73].
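As a rough sketch of how such lexical information might be organised (the structure and field names are assumptions for illustration, not the thesis's lexicon format), an ambiguous word simply carries more than one entry:

```python
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    pos: str                                       # part-of-speech tag, e.g. "VB" or "NN"
    features: dict = field(default_factory=dict)   # inflectional features (tense, number, ...)

# Hypothetical lexicon: the ambiguous word "table" has two candidate entries.
lexicon = {
    "table": [
        LexicalEntry(pos="VB", features={"tense": "base"}),
        LexicalEntry(pos="NN", features={"number": "singular"}),
    ],
}

print([entry.pos for entry in lexicon["table"]])   # ['VB', 'NN'] -- the tagger must pick one
```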

POS tagging is a very important intermediate step toward building many NLP applications, such as text-to-speech synthesis, speech recognition, information retrieval (IR), spelling correction, and parsing systems. In addition, the most prominent and largely developed field where POS tagging is used is corpus linguistics [73,114]. NLP applications which need a POS tagging system as an important intermediate step, and corpus linguistics, are discussed in more detail in section 2.2 and section 2.3 respectively.
3 Also called tagger system


POS tagging can be done manually by linguists or automatically by computer. Since the size of text corpora is increasing, it is becoming very difficult for a human tagger to annotate the text in a corpus accurately. Furthermore, it requires great effort, cost, and time. So, the development of an automatic POS tagger is highly desirable.

The main task of concern for this thesis is POS tagging of the Arabic language. The current literature in the field of Arabic NLP shows that little research has been done in POS tagging for Arabic. Very few attempts were made to develop a POS tagger for Arabic, such as the work done by Abuleil [15] in 1999, whose aim was to use the tagger as a first step in parsing Arabic newspaper text. El-Kareh and Al-Ansary [54] presented a semi-automatic POS tagger in 2000. The first tagger for Arabic appeared in 2003 by Khoja [87]; the aim of this tagger was to produce a tagged corpus. A few taggers appeared later, such as the work done by Habash and Rambow [71], Diab et al. [51] and Marsi et al. [102] in 2005. Also, Alshamsi and Guessoum [127] and Harmin [75] presented a tagger system for Arabic in 2006. This brief literature shows that the work in POS tagging for Arabic has been done in recent years, whereas for English, as an example, it was done three decades ago.

Many reasons lie behind the lack of research on the Arabic language. The richly inflected and complex morphological system that Arabic exhibits on the one hand, and the lack of resources, such as a large manually tagged Arabic corpus, on the other hand may constitute the main reasons behind this lack of research. In addition, the actual deployment of computers and the Internet in the Arab world began in the mid-nineties and has grown continuously since.


The current taggers were built to tag unvocalised Arabic text using a lexicon or dictionary that was tagged manually and used as a training corpus containing all possible tags (lexical information) for each word. At this point, the main task of the tagger is to resolve the lexical ambiguity and to determine the proper tag of ambiguous words based on the context of the sentence.

The training corpus needs to be very large for two reasons. The first reason is to achieve very good accuracy, like the accuracy of the taggers used for English (98%-99%), which were trained on a very large amount of data (e.g., hundreds of millions of words), whereas the accuracy of the Khoja tagger, as an example, is 86% since her tagger was trained on a very small training corpus (10,000 words) [87]. At the same time, Khoja states that "Of course, having a tagger that did not require a tagged corpus was valuable to languages other than English, where there was no tagged corpus available" ([88], p.29).

The second reason is to avoid the most important problem in POS tagging: unknown words. Unknown words are words not appearing in the training corpus. Neither the testing corpus nor the training corpus has lexical information and tags for these words. If the tagger system deals with unvocalised Arabic text, a huge lexicon or training corpus is required to be available to the tagging system. Unlike English, Arabic still lacks a huge manually tagged corpus from which large amounts of training data can be extracted. At the same time, it is desirable in the author's opinion to construct a POS tagger that needs as little training data as possible. Therefore, developing a POS tagger for unvocalised Arabic text that achieves reasonable accuracy using a statistical approach, one of the two major approaches (rule-based and statistical), seems very difficult at the present time.


1.2 Motivation

As mentioned earlier in section 1.1, the taggers built for Arabic are based on a lexicon or dictionary that was tagged manually for training and used to tag unvocalised Arabic text. However, the Arabic language has a valuable and important feature, called diacritics, which are marks placed over and below the characters of the word. An Arabic text may be written with diacritics or without. Text that appears without diacritics is called unvocalised text, while Arabic text written with a full representation of diacritical marks is called fully-vocalised text. An Arabic text is a partially-vocalised text when a diacritical mark is assigned to one or at most two letters in the word.
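A minimal sketch of these three vocalisation states, assuming the standard Unicode code points for the Arabic diacritics (U+064B to U+0652); the function name and the rough thresholds are illustrative and simply mirror the definition given above:

```python
# Arabic diacritics (tanween, fatha, damma, kasra, shadda, sukun): U+064B .. U+0652
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def vocalisation_state(word):
    marks = sum(1 for ch in word if ch in DIACRITICS)
    letters = len(word) - marks
    if marks == 0:
        return "unvocalised"
    if marks <= 2:                 # one or at most two diacritics in the word
        return "partially-vocalised"
    if marks >= letters:           # roughly every letter carries a mark
        return "fully-vocalised"
    return "partially-vocalised"

print(vocalisation_state("كتب"))   # unvocalised
print(vocalisation_state("كتبَ"))  # partially-vocalised (a single mark, here on the last letter)
```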

In addition, the Arabic language has many signs that indicate the class of the word. Patterns, grammatical rules, affixes4, and case ending are examples of these signs. Based on these distinctive characteristics of the Arabic language, a set of questions deserves an answer regarding the field of Arabic NLP. These questions and the objectives of this

research are described in more detail in the following section.

1.3 Research Hypothesis, Aims and Objectives

This research begins with the following four questions :

1. Does an automatic POS tagger system that deals with partially-vocalised Arabic text exist?

2. Do diacritics play an important role in resolving the lexical ambiguity that may arise in Arabic text?
4 Affixes in Arabic are those letters which precede the root of the word (prefixes), follow the root (suffixes) or are placed inside the root with which they are associated (infixes).


3. Does a standardised and comprehensive Arabic tag set exist?

4. Does an Arabic corpus which contains partially-vocalised Arabic text exist?

The review of the literature on Arabic NLP shows that the answers to the previous questions have not yet been provided. As mentioned above, the current taggers were built to tag unvocalised Arabic text. A tagger system that deals with partially-vocalised Arabic text does not yet exist. In addition, the importance of utilising the diacritic feature of the Arabic language to reduce the ambiguity in POS tagging has not been addressed. A raw or hand-tagged corpus which contains partially-vocalised Arabic text also does not exist. Finally, although the current taggers used a set of tag sets as described in chapter 2, most of these tag sets were compiled to represent the general tag of the word (the general part-of-speech) without including more linguistic attributes of the Arabic word. In addition, these tag sets do not cover most of the grammatical classes of the Arabic language or the inflectional features of the Arabic word. Therefore, a standardised and comprehensive Arabic tag set does not exist.

The aim of this research is multifaceted :

• to create a POS tagger system that deals with partially-vocalised Arabic text without using a lexicon of Arabic words (tagged or untagged), especially for words belonging to the verb or noun classes, and at the same time achieves very good accuracy.

• to investigate the role of the diacritic feature, especially at the end of the word (case ending), in reducing the ambiguity and providing semantic information that helps to determine the correct tag of each word in the testing corpus.

• to explore the possibility of using a novel technique to assign the correct tag to each word in the testing corpus based on the pattern of the word instead of the word itself.

• to present a comprehensive theoretical study of the diacritic and inflectional features of the Arabic language.

1.4 Significant Research Contributions

This research provides new contributions to the field of Arabic NLP in different ways; these contributions can be summarised as follows.

• AMT: Arabic Morphosyntactic Tagger

The ultimate contribution of this research is the development of the POS tagger system called AMT (short for Arabic Morphosyntactic Tagger). AMT deals with partially-vocalised Arabic text. The main aim of AMT is to annotate the testing corpus, that is, adding a POS tag or label to each word in the testing corpus and producing a POS tagged partially-vocalised Arabic text. It can also be used as a prerequisite tool for many NLP tasks, such as parsing and information retrieval systems. Chapter 5 of this thesis shows the design and implementation of AMT.

• A new Tag set for Arabic

The fundamental component of any tagger system is the POS tag set that is used in the tagging process [98]. The development of a tag set is an extremely necessary step in building the tagging system. The need for a tag set comes from the fact that there is no standardised and comprehensive Arabic tag set that covers the grammatical classes of the Arabic language. Chapter 4 describes the steps of designing a new Arabic tag set. The developed tag set follows the Arabic grammatical system, based upon the POS classes and inflectional morphology that Arab grammarians describe. During the course of developing this tag set, two Arabic linguists were consulted: Prof. Ali Alhamad5 [20], Yarmouk University, Jordan, and Mr. Walid Alqrini6, Ministry of Education, Jordan. The consultation was extended to cover other related issues such as the rules of the Arabic language and the testing corpus.

5 Prof. Ali Alhamad's site can be found at: http://www.yu.edu.jo/ArtsArabicDeptStaff/tabid/56/Default.aspx
6 Walid Alqrini email: walidalqrini123@yahoo.com

• Partially-Vocalised Arabic Corpus

A raw corpus which contains partially-vocalised Arabic text is needed to test the AMT tagger system. Such a corpus did not exist. This research provides a new partially-vocalised Arabic corpus. The corpus is not limited to a particular domain; it covers a wide range of topics such as scientific and literary topics. The details of the corpus can be seen in chapter 6.

• Pattern-Based Technique - A novel technique

This thesis represents a substantial starting point for developing a rule-based part-of-speech tagging system. The research presents two different techniques: the Pattern-Based Technique and the Lexical and Contextual Technique.

The basic idea of the Pattern-Based Technique is to generate a lexicon of patterns automatically instead of using a manually tagged lexicon. The rules in this technique are based on the pattern of the word in the testing corpus instead of the word itself. In addition, this research introduces a novel algorithm, the Pattern-Matching Algorithm (PMA). The aim of this algorithm is to match the inflected word in the testing corpus with its correct pattern in the pattern lexicon (see the sketch after this list).

The Lexical and Contextual Technique is used to assist the Pattern-Based Technique by assigning the correct tag to the words not tagged by the Pattern-Based Technique. The rules in the Lexical and Contextual Technique are based on the character(s), affixes, the last diacritical mark, the word itself, and the surrounding words or the tags of the surrounding words. Chapter 5 describes these techniques in more detail.
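The PMA itself is specified in Chapter 5; purely as a hypothetical illustration of the general idea of scoring a test word against candidate patterns, the sketch below counts position-wise identical characters and keeps the best-scoring pattern. The pattern strings, the class labels and the scoring rule are placeholders, not the actual algorithm:

```python
# Hypothetical sketch only: score a word against each candidate pattern by the
# number of position-wise identical characters, and return the label of the winner.
def score(word, pattern):
    return sum(1 for w, p in zip(word, pattern) if w == p)

def best_pattern_label(word, pattern_lexicon):
    best = max(pattern_lexicon, key=lambda pattern: score(word, pattern))
    return pattern_lexicon[best]

# Placeholder pattern lexicon (romanised templates and illustrative class labels).
patterns = {"yaCCuC": "imperfect verb", "maCCuuC": "passive-participle noun"}
print(best_pattern_label("yaktub", patterns))   # imperfect verb
```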

1.5 Outline of the thesis

The rest of the thesis is organised as follows :

Chapter 2 : Related Concepts and Literature Review

The necessary background material of this research is presented in chapter 2. It is divided into five main sections. Section 2.1 introduces the problem of part-of-speech tagging, while some of its applications are introduced in Section 2.2. Section 2.3 discusses corpus-based linguistics. Section 2.4 defines the POS tag set and describes the previous work on POS tag sets, while the most important approaches used to solve the problem of POS tagging are briefly examined in Section 2.5.

Chapter 3 : Arabic Language and POS tagging

Section 3.1 introduces an overview of the Arabic language. A brief history of the Ara-

bic script and the diacritic feature is presented in Section 3.2. The importance of the

diacritic feature in POS tagging for Arabic is discussed in Section 3.3. Section 3.4

briefly defines the Arabic grammatical system.

Chapter 4 : Tag set Design

Chapter 4 is concerned with the development of the tag set design presented in this

work and contains three main sections. Section 4.1 describes the criteria to take into

account while developing the POS tag set. Arabic inflectional features are explained


in Section 4.2. The last section (section 4.3) introduces the developed Arabic POS tag

set hierarchy and design.

Chapter 5 : Design and Implementation of AMT

This chapter is concerned with an implementation of the AMT system presented in this

work. It contains five main sections. The characteristics of the AMT tagger system are

defined in Section 5.1. The rule-based approach is described in Section 5.2. Section

5.3 explains the pattern-based technique used in this work while the lexical and con-

textual technique is explained in Section 5.4. The tagger system and the tagging process are described in Section 5.5.

Chapter 6 : Evaluation of Results obtained from AMT

Chapter 6 is devoted to the evaluation of results obtained from AMT. It contains three

main sections. The testing data is described in Section 6.1, while the details of each experiment done to evaluate the AMT tagger are presented in Section 6.2.

tal results analysis is introduced in Section 6.3.

Chapter 7 : Conclusion

This chapter contains the main conclusion yielded by this work and future research.

Chapter 2

Related Concepts and Literature

Review

Objectives

• To present the problem of part-of-speech tagging.

• To discuss some of its applications.

• To define corpus linguistics in NLP.

• To define Part-of-Speech tag set.

• To describe the previous work on POS tag sets.

• To justify the need for a new Arabic tag set.

• To briefly examine the different approaches used to solve the

problem.


2.1 Part-of-Speech (POS) Tagging Problem

To illustrate what part-of-speech tagging1 is about, let us begin with a simple example representing an English text ([73], p.4):

the Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary

election produced no evidence that any irregularities took place.

The goal of part-of-speech tagging consists of labeling or tagging each word in the text,

including punctuation marks, with its correct part-of-speech. The following results are

expected as the output of the tagging process.

the/AT Fulton/NP County/NP Grand/NP Jury/NP said/VBD Friday/NR an/AT investigation/NN of/IN Atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD no/AT evidence/NN that/CS any/DTI irregularities/NNS took/VBD place/NN ./.

For the simple text shown, the words in the sentence are followed by a tag, where the slash "/" separates the word from the tag or part-of-speech symbol. The tag here is

taken from a predefined inventory of labels called a tag set. The tag AT indicates that

the word belongs to the grammatical class of articles; NP represents proper nouns; VBD

for verbs; IN for prepositions; and so on.
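As a small illustration (not from [73]) of how the tag-set inventory can be read off such slash-tagged text, the sketch below collects the tags used in an abbreviated version of the sample above:

```python
from collections import Counter

# Abbreviated, cleaned-up version of the tagged sample above (illustration only).
sample = ("the/AT Fulton/NP County/NP Grand/NP Jury/NP said/VBD "
          "Friday/NR an/AT investigation/NN")

tag_counts = Counter(token.rpartition("/")[2] for token in sample.split())
print(sorted(tag_counts))    # ['AT', 'NN', 'NP', 'NR', 'VBD']
print(tag_counts["NP"])      # 4
```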

One of the most difficult problems which affect POS tagging is text ambiguity.

Something is ambiguous when it can be understood in two or more possible senses or

ways [43]. Ambiguity is the most significant problem in processing natural language.

In NLP a word, a phrase, or a sentence is called ambiguous if it can be reasonably

interpreted in more than one way [33]. Unlike grammars for computer programming

languages, grammars for natural languages like English, as an example, are usually ambiguous. Figure 2.1 shows the main ambiguity types in a natural language.

1 Also called morphosyntactic categorisation or syntactic wordclass tagging (see ref [73]).

Figure 2.1: Ambiguity types (lexical, syntactic, semantic)

The list below describes the ambiguity types in more detail.

• Lexical Ambiguity
Lexical ambiguity occurs when a word has several meanings. For instance, the

word "Lie" = "Statement that you know it is not true" or "present tense of lay".

Words like "light", "note", "bear" and "over" are lexically ambiguous [40].

• Syntactic (structural) Ambiguity


Syntactic (structural) ambiguity occurs when a given sequence of words can be

given more than one grammatical structure, and each has a different meaning. In

other words, it occurs when there are different possible syntactic parses for a grammatical

sentence. For example, the sentence "Visiting relatives can be so boring" is

structurally ambiguous (Who is doing the visiting?). Another example of such

ambiguity is the problem of attachment of modifiers to the proper constituents.

Consider the sentence "Fasten the assembly with the lever". This may be either

an instruction to fasten the assembly using a lever, or an instruction to fasten the

assembly, which has a lever attached to it. With the former interpretation, the

prepositional phrase "with the lever" is attached to the verb, and with the latter,

it is attached to the noun phrase object [33].


• Semantic Ambiguity

Semantic ambiguity occurs when a sentence has more than one way of read-

ing it within its context although it contains no lexical or structural ambigu-

ity [40]. Semantic ambiguity refers to the broad category of ambiguity which

arises when the meaning of the sentence must be determined with the help of

greater knowledge sources. The problem of resolving simple pronominal refer-

ence is an example of semantic ambiguity. In the sentence "Start the engine and

keep it running", the fact that it refers to the engine is not inferable from the

single clause "keep it running". Knowledge of the prior clause is necessary to

resolve the pronoun [33].

POS tagging is the most common type of lexical disambiguation. A POS tagger system is typically used to resolve the lexical ambiguity (ambiguity in a single word) based on context, using the surrounding words and grammar rules. For example, in the following English sentence2, "Book that flight", the word "Book", as shown in figure 2.2, is ambiguous regarding its part-of-speech: it can be a verb [V] or a noun [N]. Similarly, the word "that" can be a determiner [DET] or a complementiser [C].

Figure 2.2: The possible values of tag


2from: www.cse.ttu.edu.tw/chingyeh/courses/nlp/slides/ch8WordClassesAndPOSTagging.ppt


In Arabic, the same problem is faced. For example, in the Arabic3 sentence4 shown in Table 2.1, the word دخل, dkhl, is ambiguous with regard to its part-of-speech. It can be a verb if it means "entered", or a noun meaning "income".

Arabic Sentence:  دخل رمزي البيت
Transliteration:  Albyt rmzy dkhl
Translation:      "the house" "Ramzy" "entered", but it really means "Ramzy entered the house"

Table 2.1: Ambiguous words in an Arabic sentence

Since many words in languages are POS ambiguous, these lexical ambiguities become the main problem that a POS tagging system faces. Resolving these ambiguities constitutes the main challenge in POS tagging. The tagger system should choose the best tag for each word in the text which has more than one part-of-speech. It is clear and well known that part-of-speech depends on context [45]. The word "table", as another example, can be a verb in some contexts (e.g., "He will table the order") and a noun in others (e.g., "The table is too big"). Therefore, adequate context and/or adequate semantic knowledge are required to resolve the problem of POS tagging [109].
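A toy sketch of this kind of context-based choice (the tiny lexicon and the two rules are illustrative only, not an actual tagger): the ambiguous word "table" is resolved by looking at the tag chosen for the preceding word:

```python
# Toy contextual disambiguation: prefer the verb reading after a modal (MD)
# and the noun reading after an article (AT). Lexicon and rules are illustrative.
LEXICON = {"he": ["PPS"], "will": ["MD"], "table": ["VB", "NN"], "the": ["AT"]}

def choose_tag(word, previous_tag):
    candidates = LEXICON.get(word.lower(), ["NN"])   # default unknown words to NN
    if len(candidates) == 1:
        return candidates[0]
    if previous_tag == "MD" and "VB" in candidates:
        return "VB"
    if previous_tag == "AT" and "NN" in candidates:
        return "NN"
    return candidates[0]

def tag_sentence(words):
    tagged, prev = [], ""
    for w in words:
        prev = choose_tag(w, prev)
        tagged.append((w, prev))
    return tagged

print(tag_sentence(["He", "will", "table", "the", "order"]))
# [('He', 'PPS'), ('will', 'MD'), ('table', 'VB'), ('the', 'AT'), ('order', 'NN')]
```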

The Arabic language differs from English in terms of characteristics and grammatical system as well, for example, (1) the diacritic feature, which is not present in English, and (2) the root-and-pattern structure on which the Arabic morphological system is based. While Arabic has the diacritic feature, Arabic text may be written without diacritics (unvocalised) or with them (partially or fully-vocalised).

When the text is written in an unvocalised form, resolving the lexical ambiguities in
3 Since a cursive system from right to left is used in written Arabic, the sentence is read from right to
left. More details of the Arabic language are discussed in chapter 3
4Transliterated Arabic words throughout this thesis are in italics while English translations are in
double quotes. All separated by commas.


this case resembles the English language, which is based on the context. But the case is

different if the text is partially or fully-vocalised.

The testing corpus in this work is a partially-vocalized Arabic text. The diacritical

mark is assigned only to the last letter of each word in the testing corpus. There are

two reasons for choosing a partially-vocalised Arabic text as a testing corpus. The first

one is to investigate the importance of the last diacritical mark in reducing the lexical

ambiguity of the word and helping the POS tagger to resolve this ambiguity and to

assign the correct tag to the words in the testing corpus regardless of the context in

most cases. The second reason is to explore the possibility of applying pattern-based

rules to tag the testing words based on the pattern of the word instead of the word itself.

The importance of the diacritic feature in POS tagging and the pattern-based approach are described in more detail in Chapter 3 and Chapter 5 respectively. On the other hand, ambiguity identification is crucial not only for part-of-speech tagging, but also for any other text processing dealing with content, such as speech processing or semantic annotation [33].

2.2 Applications of Part-of-Speech Tagging

POS tagging is a preliminary stage for many NLP applications. The most prominent and largely developed field where POS tagging is used is corpus linguistics [73]. Corpus linguistics is described in more detail in Section 2.3. POS tagging is also a useful and important practical problem with potential NLP applications in many areas [45], such as:


• IR: Information Retrieval

A POS tagging system can enhance an IR application by selecting nouns or other important words from a document (e.g. sequences of proper nouns or common nouns) [50]. Users of the World Wide Web will appreciate the importance of accurate information retrieval. POS tagging removes the lexical ambiguity and identifies the syntactic class of words. For example, the word "Cooking" can be used either as a noun ("Cooking is fun") or a verb ("he is cooking lamb"). By identifying the syntactic role of the word "Cooking" within documents, the results of searching for "Cooking fish", as an example, would not include documents where the word is used as a noun [43].

• Parsing system
A POS tagging system can be an important first step and an integral part of any parsing system [50]. Since the parser needs lexical information for each word before performing the parsing process, such information is usually obtained from the output of a POS tagger.

• Word Processing
Since most word processors attempt to provide a check not only on spelling, but also on grammar, knowing the category of a misspelled word helps in reducing the number of corrections [73].

• Speech synthesis system


Knowing the POS can produce more natural pronunciations in the speech synthesis system and more accuracy in the speech recognition system [50].

• Machine Translation
A tagged version of each corpus in parallel corpora5 (texts in different languages) in machine translation research facilitates the automatic identification of translation equivalents at the word and phrase level [73].

5 Corpora is the Latin plural of corpus. The next section defines corpus linguistics in more detail.

• Building Dictionaries

A tagged text is also of great benefit in building dictionaries. It contains information which can be of help to users of the dictionary, such as language learners and teachers, in acquiring or identifying a core vocabulary [73].

2.3 Corpus-based Linguistics

2.3.1 Introduction

Over the last decade, many efforts have been devoted to compiling large raw text corpora. Corpus linguistics is the study of linguistic phenomena through large collections of machine-readable texts [9]: corpora. A corpus is defined by Leech [92, 140] as "a large collection of natural language material stored in machine readable form that can be easily accessed, automatically searched, manipulated, copied and transferred".

The usability of a corpus can be greatly enhanced by adding the POS class to every word in the corpus, or any other relevant linguistic information which may be needed by linguists or other developers in NLP. Once the corpus is analysed, it constitutes a kind of database that contains information about the linguistic structure and statistics of language usage [140].

The fast development of computers with huge memory capabilities and software on the one hand, and the availability of large documents, books and publications in machine-readable format on the other, have made the compilation of these corpora no longer a difficult problem.

Corpus linguistics can be considered an independent field; it is a methodology rather than an aspect of a specific language [86]. Many POS tagging systems built earlier used different approaches, especially for the English language, since it is the first language of corpus linguistics [30]. The majority of these systems are designed to annotate text corpora, that is, the corpora contain not only words, but also linguistic information on them, such as part-of-speech. A tagged corpus has a higher linguistic value; it provides specific linguistic information which is very useful for developing lexical resources, inducing grammatical structure and estimating the parameters of statistical models [101].

2.3.2 Existing Corpora : English and other languages

The history of corpus linguistics started at the beginning of the sixties when the first printed American English corpus was compiled, which is known as the Brown corpus. Summarised below is a list, although not exhaustive, of some well-known computerised English corpora:

1. Brown Corpus
The Brown Corpus [60,68] was compiled by W. Nelson Francis and Henry Kucera at Brown University and contains 500 samples, each about 2,000 words

of continuous written American English, from texts published in the US in 1961.

The original edition of the text corpus was completed in 1964. It was revised

twice in 1971, and then revised and annotated with word tags 6 in 1979.

2. LOB: Lancaster-Oslo-Bergen corpus


The LOB Corpus [28, 82] also contains approximately one million words of

6pOS tag sets are described in chapter 4


British English from publications of the year 1961. It is a British counterpart

of the Brown corpus resulting from research collaboration between the Univer-

sity of Lancaster, the University of Oslo, and the Norwegian Computing Centre

for the Humanities. The text corpus was published in 1978, and its tagged edition
in 1986.

3. LLC: London-Lund Corpus

The London-Lund Corpus [131] was compiled at Lund University. It contains

about 500,000 words of spoken British English collected from broadcast and

recorded materials. The texts were collected between 1959 and 1975.

4. Penn Treebank Corpus


The Penn Treebank corpus [99, 124] was developed at the University of Penn-

sylvania. The Penn Treebank-I project ended in 1992 with 4.5 million words of

text, including the entire Brown corpus text, the Wall Street Journal Corpus, and

some other genres. The texts were tagged with POS tags. The data produced by

the Treebank is released through the Linguistic Data Consortium (LDC).

5. ICE: International Corpus of English


The International Corpus of English [67] was compiled by research teams (15

researchers) from different English speaking countries, such as USA, UK, Aus-

tralia, and New Zealand. It contains about one million words with regional va-

rieties of English for each component. For example, the ICE-GB 7 consists of

one million words completed in 1998. The texts in the corpus were published or

recorded between 1990-1996.

6. BNC: British National Corpus


The British National Corpus [65] began in 1991 and was published in 1994. It contains over 100 million words of written and spoken modern British English (90% written, 10% spoken). The corpus is encoded with SGML to represent POS tags and was automatically tagged.

7 For more information: http://www.ucl.ac.uk/english-usage/projects/ice-gb/index.htm

7. SUSANNE Corpus

The SUSANNE Corpus [121] was created by Geoffrey Sampson with the spon-

sorship of the Economic and Social Research Council (UK). It contains about

130,000 words of American English based on a subset of the million-word

Brown Corpus. It is a modification of the Gothenburg Corpus and is freely avail-


able without formalities for use by researchers.

8. TOSCA Corpus

The TOSCA Corpus [12] was compiled at the University of Nijmegen in 1986.

It contains about 1.5 million words of British English and consists of written
texts on education, history, philosophy, etc.

In addition, many other computerised English corpora have been developed, such

as: SEC: Spoken English Corpus [133], PoW: Polytechnic of Wales corpus [128],

SCRIBE: Spoken Corpus Recordings In British English [29], COLT: Corpus of Lon-

don Teenager English [25] and IPSM: Industrial Parsing of Software Manuals [130].

Lastly in this list, the multi-tagged corpus, which is known as AMALGAM corpus [30]

(short for Automatic Mapping Among Lexico-Grammatical Annotation Models). This

corpus was developed at Leeds University within the AMALGAM8 project by

Atwell et al. [31]. It contains texts from different genres of English corpora such as,

COLT, SEC and IPSM.

8 http://www.comp.leeds.ac.uk/amalgam/amalgam/amalgover.html


It is clear that English has been the most productive field of research in corpus linguistics, and it stands out as the most computerised language in the world due to the hundreds, if not thousands, of different corpora which have been developed and are being developed.

The success of the English language in the fields of natural language processing and corpus linguistics encouraged other researchers to build their own corpora, such as: Chinese (The UCLA Chinese Corpus), Czech (Czech National Corpus), Danish (Danish Corpus), Spanish (LEXESP corpus), German (NEGRA corpus), French (TLF corpus), Swedish (Bank of Swedish corpus), Catalan (CTILC corpus), Basque (EEBS corpus), Bosnian (Oslo corpus of Bosnian Texts), and many other languages [10].

2.3.3 Arabic language corpora

Unlike English, Arabic has been much less fortunate in the field of research in corpus

linguistics as well as POS tagging for Arabic. A useful survey on existing resources

for Arabic corpora can be found in the work done by Latifa Alsulaiti and Eric Atwell

[18]. However, a number of electronic unvocalised Arabic text raw corpora have been

compiled, such as:

1. An-Nahar Newspaper Text Corpus

The An-Nahar Corpus [2] comprises written Arabic articles published between 1995 and 2000. The total size of the complete files in

this corpus is 806 MB.

2. AI-Hayat Corpus
Al-Hayat Corpus [1] has been compiled at the University of Essex, in collabora-

tion with the Open University. It contains 18,639,264 distinct tokens in 42,591


articles covering several subjects, such as, General, Car, Computer, News, Eco-

nomics, Science, and Sport. The size of the total file is 268 MB.

3. Buckwalter Arabic Corpus


The Buckwalter Corpus [4] was compiled by Tim Buckwalter between 1986-

2003. It contains around three million written Arabic words collected from pub-

lic resources on the Web.

4. Nijmegen Corpus
The Nijmegen Corpus [6] was compiled at Nijmegen University in 1996. It contains over 2M written Arabic words collected to build an Arabic-Dutch/Dutch-Arabic dictionary.

5. Arabic Newswire corpus


The Arabic Newswire corpus [3] was compiled by David Graff and Kevin

Walker at University of Pennsylvania (Linguistic Data Consortium (LDC)) in

2004. It contains 76 million tokens (869 MB) covering written Arabic texts col-

lected from Agence France Presse, Xinhua News Agency, and Umma Press from

1994 to 2000. The source material in this corpus was tagged using TIPSTER-

style SGML and was transcoded to Unicode (UTF-8).

6. CCA: Corpus of Contemporary Arabic


The Corpus of Contemporary Arabic [18] was compiled by Latifa Alsulaiti dur-

ing her MSc research project with Eric Atwell at the University of Leeds in 2004. It contains around 1M words covering written and spoken Arabic text collected from websites and online magazines. It is the only corpus freely available to the public.

7. Penn Arabic Treebank corpus


The Penn Arabic Treebank [7] project started in 2001 at the University of Penn-


sylvania to develop an Arabic corpus containing one million words. The project

began with 734 files representing 166K words of written Modem Standard Ara-

bic news wire from the Agence France Presse corpus, which was released as

Arabic Treebank: Part 1. The second part was released as the 168K word cor-

pus, Arabic Treebank: Part 2. The Arabic Treebank: Part 3 corpus was released

in 2005 and it consists of 600 stories from the An-Nahar corpus.

In addition, other Arabic corpora have been compiled [8], such as CLARA, Egypt, DINAR, Leuven, and other corpora. Unfortunately, these corpora are not available to researchers free of charge, except the CCA corpus. However, some of these Arabic corpora can be acquired from the Linguistic Data Consortium (LDC) and the European Language Resources Association (ELRA).

2.4 Part-of-Speech tag set

2.4.1 Introduction

The POS tag set is a list of all the word classes that will be used in the tagging process. It is the fundamental component of any tagger system and the first step for the annotation of corpora [89]. A tag is a code or descriptive symbol that represents some feature or set of features attached to the word in a text [73, 105]. Thus, a POS tag set is an inventory of labels used to classify and mark up words of a target text [74].
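As a minimal sketch of the "inventory of labels" idea (the handful of Brown-style codes is taken from the samples later in this section; the checking function itself is illustrative):

```python
# A tag set declared as an inventory of labels, plus a check that annotated
# tokens only use declared tags (illustrative sketch).
TAG_SET = {
    "AT": "article",
    "NN": "noun, singular, common",
    "VB": "verb, base form",
    "JJ": "adjective",
    "IN": "preposition",
}

def undeclared_tags(tagged_tokens):
    return {tag for _, tag in tagged_tokens if tag not in TAG_SET}

print(undeclared_tags([("the", "AT"), ("book", "NN"), ("flight", "XYZ")]))   # {'XYZ'}
```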

A new Arabic tag set called ARBTAGS has been developed. The justification behind developing a new Arabic tag set is explained in section 2.4.4, while the previous work on POS tag sets for English and other languages is described in section 2.4.2. Previous work on Arabic POS tag sets is introduced in section 2.4.3.


2.4.2 English and other languages POS tag sets

Since English corpora have been tagged by several POS tagging systems, a number of popular tag sets have also been built to support these POS systems. The list below summarises some of these tag sets, which can be found at the AMALGAM site9.

• Brown tag set

The Brown tag set started with a set of 77 tags, and was enlarged to about 226 tags used to tag and enhance the coverage of the Brown corpus. A sample of the Brown tag set can be seen in Table 2.2.

Tag | Description                                                         | Example(s)
ABN | determiner/pronoun, pre-quantifier                                  | all, half, many, nary
ABX | determiner/pronoun, double conjunction or pre-quantifier            | both
BED | verb "to be", past tense, 2nd person singular or all persons plural | were
CD  | numeral, cardinal                                                   | two, one, 1
CS  | conjunction, subordinating                                          | that, as, after, whether
DOD | verb "to do", past tense                                            | did, done
IN  | preposition                                                         | of, in, for, by, at
MD  | modal auxiliary                                                     | should, may, might, will
HVN | verb "to have", past participle                                     | had
JJ  | adjective                                                           |
NN  | noun, singular, common                                              | failure, burden, court
JJS | adjective, semantically superlative                                 | top, chief, principal
NPS | noun, plural, proper                                                | hases, Aderholds, Chapelles

Table 2.2: Sample of Brown tag set

• LOB tag set


The LOB tag set was based on the Brown corpus tag set, but revised for finer granularity. The tag set contains 135 tags used to tag the LOB corpus. Table 2.3 shows a sample of the LOB tag set.

9 AMALGAM project: http://www.comp.leeds.ac.uk/amalgam/amalgam/amalghome.htm

Tag | Description                                       | Tag  | Description
CS  | subordinating conjunction                         | NN   | singular common noun
CD  | cardinal number                                   | NNP  | singular common noun with word initial capital
NP  | singular proper noun                              | JNP  | adjective with word initial capital
MD  | modal verb                                        | OD   | ordinal number
NPS | plural proper noun                                | PPL  | singular reflexive personal pronoun
NR  | singular adverbial noun                           | QL   | qualifier
VB  | base form of lexical verb                         | VBD  | past tense of lexical verb
VBG | present participle of lexical verb                | VBN  | past participle of lexical verb
WPA | nominative wh-pronoun                             | ZZ   | letter(s) of the alphabet
TO  | infinitival TO                                    | NPLS | plural locative noun with word initial capital
NPL | singular locative noun with word initial capital  | JJ   | general adjective

Table 2.3: Sample of LOB tag set

• LLC tag set


The LLC tag set was used to tag London Lund Corpus. It contains about 210

tags.

• ICE tag set


The ICE tag set was used to tag the International Corpus of English. It contains

about 205 tags.

• SEC tag set


The SEC tag set was used to tag the Lancaster/IBM Spoken English Corpus. It is

based on the LOB corpus tag set; however, the LOB tag set differentiates between

relative and interrogative WH-pronouns whereas the SEC tag set does not. For ex-

ample, in the SEC tag set the tag WP is used to cover both WH-pronouns, interrogative,

nominative or accusative, and WH-pronouns, relative, nominative or accusative,

whereas LOB uses separate tags [30].

• Penn Treebank tag set 10: The Penn Treebank tag set was used to tag the Penn

Treebank corpus. It contains about 36 tags. A sample of the Penn Treebank tag

set can be seen in Table 2.4.

Tag | Description | Tag | Description

CC | Coordinating conjunction | NNS | Noun, plural
CD | Cardinal number | NP | Proper noun, singular
EX | Existential "there" | NPS | Proper noun, plural
FW | Foreign word | PDT | Predeterminer
IN | Preposition or subordinating conjunction | POS | Possessive ending
JJ | Adjective | VB | Verb, base form
JJR | Adjective, comparative | VBD | Verb, past tense
JJS | Adjective, superlative | VBN | Verb, past participle
LS | List item marker | VBG | Verb, present participle
MD | Modal | VBP | Verb, non-3rd person singular present
NN | Noun, singular or mass | VBZ | Verb, 3rd person singular present

Table 2.4: Sample of Penn Treebank tag set

In addition, several English tag sets have been built and used to tag other corpora, such

as the UCREL C7 tag set, the SUSANNE corpus tag set, the TOSCA corpus tag set and the PoW

corpus tag set [50,74].

Other tag sets have been designed for languages other than English, such as Urdu

[74], French [41], African languages [80], Czech [79,83], Hungarian [136], Slovene

[53], German [94], Persian [104], Swedish [115], Hebrew [118], Italian [34], Spanish [27]

and Turkish [112]. Useful resources comparing some of the above tag sets can be found in [30,46].

10 http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html

2.4.3 Arabic POS tag sets

Since there has not been much work done on POS tagging for Arabic, only a very small

number of tag sets has been built. The list below summarises well-known tag sets that

have been built for Arabic.

• Khoja tag set

Khoja [89] describes an Arabic tag set that has been built based on POS classes

and the inflectional morphology system, and used for her tagger system APT: An

Automatic Arabic Part-of-Speech Tagger. The tag set contains 177 detailed tags.

Each tag represents the name of one of the three main classes (verb, noun, particle) and

their sub-classes, including inflectional features such as gender, number and

person. For example, her tag set covers 57 types of verbs, 103 types of nouns, 9

types of particles, 7 residuals and 1 punctuation. A sample of the Khoja tag set can be

seen in Table 2.5.

• El-Kareh and Al-Ansary tag set

El-Kareh and Al-Ansary [54,87] described an Arabic tag set used for their semi-

automatic tagger system. Their tag set contains 72 tags covering 3 sub-classes

of the main class verb, 46 sub-classes of the main class noun and 23 sub-classes

of the main class particle.

• Linguistic Data Consortium (LDC) tag set 11:

The LDC tag set was created by the Linguistic Data Consortium (LDC) team

and contains 24 tags used to tag Penn Arabic Treebank corpus. It is also used by
11 For more: http://www.ling.ohio-state.edu/bromberg/postags/posproject.html


Tag | Description | Tag | Description

NCSgMNI | Sing. Masc. Nom. Indef. common noun | NCSgMND | Sing. Masc. Nom. Def. common noun
NCSgFND | Sing. Fem. Nom. Def. common noun | NCPlMNI | Plu. Masc. Nom. Indef. common noun
NP | Proper noun | NPrRSDuM | Dual Masc. Spec. relative pronoun
NNuCaSgM | Sing. Masc. cardinal number | VPSg2M | 2nd, Sing. Masc. perfect verb
VPSg3M | 3rd, Sing. Masc. perfect verb | VPDu3M | 3rd, Dual Masc. perfect verb
VPPl3F | 3rd, Plu. Fem. perfect verb | VISg3MJ | 3rd, Sing. Masc. Juss. imperfect verb
VIDu3MI | 3rd, Dual Masc. Ind. imperfect verb | VIDu3MJ | 3rd, Dual Masc. Juss. imperfect verb
VIPl2MI | 2nd, Plu. Masc. Ind. imperfect verb | VIPl3MJ | 3rd, Plu. Masc. Juss. imperfect verb
VIvSg2M | 2nd, Sing. Masc. imperative verb | VIvPl2M | 2nd, Plu. Masc. imperative verb
PPr | Prepositions | PE | Exceptions
RF | Residual, foreign | PU | Punctuation

Table 2.5: Sample of Khoja tag set

other works on POS tagging for Arabic, such as the SVM tagger by Mona

Diab [51] and the Egyptian dialect POS tagger by Duh and Kirchhoff [52]. The

LDC tag set can be seen in Table 2.6.

• Alshamsi and Guessoum tag set

Alshamsi and Guessoum [127] described an Arabic tag set used for their HMM

POS tagger system. It contains 55 tags. As Alshamsi and Guessoum point out,

since the main use of their tagger is intended to be Named Entity extrac-

tion, their tag set is not a fine-grained tag set. For example, they used the fol-

lowing tags: NOUN (noun), ADJ (adjective), PNOUN (proper noun), PRON

(pronoun), INDEF (indefinite noun) and DEF (definite noun) to represent the

noun category and its subcategories. On the other hand, PVERB (perfect verb),


Tag | Description | Tag | Description

CC | Coordinating conjunction | DT | Determiner
CD | Cardinal number | CONJ+NEG_PART | Conjunction + Negation Particle
FW | Foreign word | NPS | Noun, plural
NN | Noun, singular or mass | IN | Preposition or subordinating conjunction
JJ | Adjective | NNP | Proper noun, singular
NNPS | Proper noun, plural | PRP | Personal pronoun
PRP$ | Possessive pronoun | PUNC | Punctuation
RB | Adverb | RP | Particle
UH | Interjection | VBD | Verb, past tense
VBN | Verb, past participle | VBP | Verb, present
WP | Wh-pronoun | WRB | Wh-adverb
NO_FUNC | — | NUMERIC_COMMA | —

Table 2.6: The LDC POS tagset

IVERB (imperfect verb), CVERB (imperative verb), MOOD_SJ (subjunctive

or jussive), MOOD_I (indicative), SUFF_SUBJ (suffix subject) and FUTURE

(future/imperative) tags were used to represent the verb category and its sub-

categories. For particles, the INTERROGATE, NEGATION, CONJ and PREP

tags were used to represent interrogation, negation, conjunction and preposition

particles. In addition, some inflectional features, such as person, number and

gender, were added to the tag names to show the morphological analysis of the

word. For example, the PRON_2S tag means a second person, singular number, fem-

inine/masculine gender pronoun. Table 2.7 shows a sample of their tag set.

CONJ | DPRONMP | NOUN | PRON_2MP

CVERB | DPRONMS | PNOUN | PRON_2S

IV3 | PRON_1S | PPRON_2FP | PPRON_3FP
PVERB | CVERB | PREP | PRON_3FS
DEF | INTERROGATE | PRON | PRON_3MP

Table 2.7: Sample of Alshamsi and Guessoum tag set


2.4.4 Justification for a new Arabic tag set

In this work, an Arabic tag set called (ARBTAGS) has been developed. The rationale

behind developing our tag set comes from the fact that there is no standardised and

comprehensive Arabic tag set covering the most common types (sub-classes) of the
three main Arabic word classes.

The developed tag set differs from the tag sets that have previously been built for Arabic. The

main difference is the tag set hierarchy, which is shown in Figure 2.3 and illustrates the
way the Arabic word has been classified.

As shown in the tag set hierarchy, the noun class is classified into sixteen sub-classes

(common, proper, adjective, etc.), the verb class into three sub-classes (perfect, imper-

fect, imperative), the particle class into seven sub-classes (preposition, vocative, conjunc-

tion, etc.), plus one punctuation class. In addition, one more general tag, [Fw], is added to the above

general tags to represent the foreign (Arabised) word.

Thus, the total number of general tags becomes 28.

These general tags represent the names of the main classes and the sub-classes without

inflectional features. The developed tag set hierarchy differs from the hierarchies of the

tag sets that have been built for Arabic. For example, Khoja (see Figure 2.4, which

is reproduced from the original figure in [89]) classified the noun class into five

sub-classes (common, proper, pronoun (personal, relative, demonstrative), numeral,

adjective), while the particle class was categorised into nine sub-classes (prepositions, ad-

verbial, conjunctions, interjections, exceptions, negatives, subordinates, answers, expla-

nations).

[Figure 2.3: The ARBTAGS tag set hierarchy developed in this work, showing the classification of the Arabic word into noun, verb, particle, residual (Arabised) and punctuation classes and their sub-classes.]

[Figure 2.4: The Khoja tag set hierarchy (reproduced from [89]), showing the classification of the word into its main classes and sub-classes.]

Alshamsi and Guessoum [127] classified the noun class into four sub-classes and the par-

ticle class into four sub-classes (see Table 2.7). They point out that there is no need for

a fine-grained tag set, since their tagger was intended to be for Named Entity extrac-

tion ([127], p.34). The LDC tag set, as another example, was mapped from an English tag set

and is not rich enough to cover Arabic POS classes [102].

The developed tag set includes subclasses such as verbal, diminutive, instrument, noun

of place, noun of time, conditional and interrogative, which belong to the noun class,

as well as the vocative, subjunctive and jussive subclasses, which belong to the particle

class. These subclasses have not been covered before. Thus, using one of the tag

sets built previously would not capture all the subclasses shown in the developed tag

set hierarchy. In addition, the testing corpus in this work is a partially-vocalised text,

which leads to the use of more inflectional features than described in the other tag sets.

The developed tag set is based upon POS classes and inflectional morphology [24].

The tag names in the developed tag set use terminology from the Arabic tradition rather

than English grammar. In the Khoja tag set, by contrast, the tag [VPPl2M] stands for verb, perfect,

plural, second-person, masculine. As Atwell [30] points out, since the Khoja Arabic tag set

came from the Lancaster UCREL tradition of corpus linguistics, she was influenced

by the English tag sets, such as the CLAWS heritage of tag sets for the LOB and BNC cor-

pora. Therefore, she used terminology from English grammar rather than the Arabic

tradition in naming categories and features. The author agrees with Atwell. It seems

that not only does the Khoja tag set use terminology from English grammar rather than the Arabic

tradition, but the other tag sets built for Arabic and described above also use the

same terminology in naming categories and features.


The tag names in the developed tag set use terminology from the Arabic tradition rather

than English grammar. For example, the tag VePiMaPlThDc means [imperative

verb, masculine gender, plural number, third person, subjunctive mood]. Details about

the developed tag set design are provided in chapter 4.

The ARBTAGS tag set developed in this work contains 161 detailed POS tags: 101 noun tags,

50 verb tags, 9 particle tags and 1 punctuation tag; these tags are enriched with inflectional feature

information. The general and detailed tags, with examples, are de-

scribed in full in Appendix A.1 and Appendix A.2.

On the other hand, the usability of ARBTAGS has been tested through manual tagging, and

a set of tagged text has been built up to serve as a goal corpus against which the results

obtained from the AMT tagger are compared. Although Khoja built 177 detailed tags, she ac-

tually used five main general tags (noun, verb, particle, punctuation and residual) and

a simplified version of the tag set (30 detailed tags) to make the training of the POS tagger

computationally feasible ([88], p.71). In contrast, most of the tags used in the developed

tagger are detailed tags, in keeping with its main aim: to provide

a tagged corpus that is more useful for linguists and NLP developers to extract linguistic

information from.

2.5 Part-of-Speech Tagging Approaches

The existing literature shows that there are two main approaches to POS tagging stud-

ied so far: the Rule-based Approach 12 and the Statistical Approach 13. Many

POS tagging systems have been implemented using these approaches. The majority of
these systems were used to tag text corpora.

We categorise these systems based on whether they adopt the

rule-based approach or the statistical approach. Some systems

adopt a hybrid approach (rule-based and statistical), and other systems use further

approaches, such as neural networks, machine learning algorithms and decision trees,

which are also addressed. Here the focus is on the two main techniques; more

detail is provided and some well-known systems are described. In addition, we also dis-

cuss the advantages and disadvantages of each of the two main approaches. A useful

survey of POS tagging approaches can be found in the work done by
Abney [13].

12 also called the Linguistic Approach or Knowledge-Based Approach
13 also called the Probabilistic Approach or Stochastic Approach

2.5.1 Rule-based Approach

The rule-based approach is based on incorporating a set of linguistic rules in the tag-

ger [49]. This approach uses a linguist-written language model that contains rules

ranging from a few hundred to several thousand. The approach adopted in this

work is rule-based. The tagger presented in this work has two

main rule components: pattern-based rules and lexical and contextual rules.

The pattern-based technique is a novel technique presented in this work. The basic

idea of this technique is to automatically generate a lexicon of patterns instead of using a

manually tagged or untagged lexicon or training corpus. The triggers in the pattern-based

technique depend on the patterns of text words. A novel algorithm to match the Arabic

word in the testing corpus with its correct pattern in the pattern lexicon has also been


built. In addition, a small number of hand-written rules and constraint rules (lexical

and contextual rules) have been used to assist the main technique in assigning the correct

tag to those words not tagged by the pattern-based technique. The tagger system and the

proposed approach are fully described in chapter 5.

The rule-based approach was the earliest approach to automated POS tagging. It dates

back to the 1960s and 1970s, when automated POS tagging was initially explored by

Klein and Simmons [90] in 1963, and to the work done by Greene and Rubin [63,68] in

1971, which is considered the most representative of such pioneer taggers. Afterwards,

a number of rule-based systems were developed, such as the work done by Hin-

dle [77], Brodaa [38], Paulussen and Martin [113], Karlsson [85], Voutilainen [135]

and Brill [36,37]. Some of these systems were built to tag corpora, while other

systems were built for developing parsing systems.

A rule in the rule-based approach may be represented as follows [77]:

[ PREP + TNS ] → TNS [ N + V ]

where PREP = preposition, TNS = tense, N = noun, and V = verb.

This rule implies that a word that can be either a preposition or a tense marker (i.e. the word

"to") should be tagged as a tense marker (TNS) when it precedes a word that

can be a noun or a verb.
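To make the flavour of such contextual rules concrete, the following minimal sketch shows one way a PREP/TNS disambiguation rule of this kind might be expressed in code. The data structures, tag names and toy lexicon are illustrative assumptions, not Hindle's implementation.

```python
# Illustrative sketch only: tag names and data structures are assumptions,
# not a reproduction of any existing tagger's rule engine.

def resolve_prep_tns(tokens, candidate_tags):
    """Apply the rule [PREP + TNS] -> TNS / [N + V]: a PREP/TNS-ambiguous
    word becomes TNS when the following word can be a noun or a verb."""
    output = []
    for i, word in enumerate(tokens):
        tags = set(candidate_tags.get(word, {"UNKNOWN"}))
        if tags == {"PREP", "TNS"} and i + 1 < len(tokens):
            next_tags = set(candidate_tags.get(tokens[i + 1], set()))
            tags = {"TNS"} if next_tags & {"N", "V"} else {"PREP"}
        output.append((word, tags))
    return output

# Hypothetical mini-lexicon of candidate tags per word.
lexicon = {"to": {"PREP", "TNS"}, "run": {"N", "V"}, "school": {"N"}}
print(resolve_prep_tns(["to", "run"], lexicon))
# [('to', {'TNS'}), ('run', {'N', 'V'})]
```

In this toy version the rule only resolves the PREP/TNS ambiguity; other ambiguous words keep their full candidate sets for later rules to handle.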


Some of the well-known rule-based systems are now briefly discussed.

• CGC: Computational Grammar Coder system


Klein and Simmons [90] developed a larger question-answering system which

contains a syntactic analysis program that needs a part-of-speech tagger as a

necessary component. They developed the Computational Grammar Coder (CGC),

which is itself a part-of-speech tagger. Their tagger uses several smaller English

dictionaries with a total of 20,000 words, such as a function-word dictionary. This

dictionary contains articles, prepositions, pronouns, conjunctions, auxiliary

verbs, adverbs, etc. It comprises 500 words, all of which have unique grammar

codes (tags). The CGC program performs several tests, such as a suffix test us-

ing several different types of morphological information, and a context frame

rule test. Garside and Smith [63] define a context frame rule as "a rule designed by

a linguist based on observation of data, which specified some information on a poten-

tial tag in the context of up to three tags on either side or that the potential tag was

impossible in this context". Furthermore, there is a content-word dictionary of about 1,500

words containing those nouns, verbs, and adjectives that are exceptions to the

computational rules used in the suffix tests. They ran an experiment on samples of

science writing and reported that their system correctly tagged 90% of the words.

• TAGGIT system
Greene and Rubin [63,68] developed the first pioneering tagger system for En-

glish, known as TAGGIT. It was the first tagger to introduce the

idea of providing a text corpus annotated with part-of-speech information as a

useful tool for linguistic research. TAGGIT was used to initially tag the one

million words of the Brown Corpus grammatically [63]. A small dictionary or

lexicon containing about 3,000 words was used in the TAGGIT program. The

lexicon was tagged manually, that is, each word in the lexicon was assigned its tag(s).

They used 3,000 context frame rules to disambiguate those words that have more

than one tag. Each word is initially checked to see if it is found in the lexicon.

If the word is found in the lexicon and has one tag, this tag is extracted and

assigned to the word. If it has more than one tag, a set of context frame rules

is applied to assign the best tag to the word. In addition, a suffix list of

450 strings was used to tag words not found in the lexicon. If the word is not

found in the suffix list, the NN, JJ, and VB tags are arbitrarily given to the word.

A set of 77 tags was used. The authors reported that the TAGGIT system correctly

tagged 77-78% of the words. Cutting et al. [47] point out that the rest was done

manually over a period of several years.
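To illustrate the lexicon-plus-suffix-list strategy just described, the sketch below gives a toy first pass over a word. The word lists, suffix pairs and default tags are assumptions made for illustration, not TAGGIT's actual resources.

```python
# Illustrative sketch of a lexicon + suffix-list lookup as a first tagging pass;
# the tiny lexicon, suffix list and default tags are assumptions, not TAGGIT's
# actual word lists.

LEXICON = {"the": ["AT"], "lead": ["NN", "VB"], "quickly": ["RB"]}
SUFFIX_TAGS = [("ness", "NN"), ("ly", "RB"), ("ing", "VBG")]
DEFAULT_TAGS = ["NN", "JJ", "VB"]  # arbitrary fallback, as in the description

def candidate_tags(word):
    """Return the candidate tag list for a word in a TAGGIT-style first pass."""
    if word in LEXICON:
        return LEXICON[word]           # may still be ambiguous (more than one tag)
    for suffix, tag in SUFFIX_TAGS:
        if word.endswith(suffix):
            return [tag]               # guessed from the suffix list
    return DEFAULT_TAGS                # neither in the lexicon nor the suffix list

print(candidate_tags("lead"))      # ['NN', 'VB'] -> left for context frame rules
print(candidate_tags("darkness"))  # ['NN']
print(candidate_tags("zorp"))      # ['NN', 'JJ', 'VB']
```

Words that come out of this pass with more than one candidate tag would then be handed to the context frame rules, as described above.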

• Fidditch system

Hindle [77] developed a tagger system to resolve the lexical disambiguation

problem within a deterministic parser called Fidditch. It is designed to provide

a syntactic analysis of text and to build phrase structure trees. Fidditch has the

following components:

- a lexicon of about 100,000 words listing all possible parts of speech for

each word, along with root forms for inflected words.

- a morphological analyzer to assign part of speech and root form for words

not in the lexicon.

- a complementation lexicon for about 4,000 words.

- a list of about 300 compound words, such as "of course".

- a set of about 350 regular grammar rules to build phrase structure.

- a set of about 350 rules to disambiguate lexical category.

Fidditch has a set of 46 tags (including 8 punctuation tags), mostly enriched with inflectional features.

Hindle tried to acquire a new set of disambiguation rules automatically from the

tagged text of the Brown corpus. The author claims that the performance of the

acquired rule set is much better than that of the rules for lexical disambiguation

written for the parser by hand over a period of several years; the error rate is

approximately half that of the hand-written rules.

• ENGCG: ENGlish Constraint Grammar system


Voutilainen [135] developed a tagger system called ENGCG (ENGlish Con-

straint Grammar) for ambiguity resolution and a finite-state syntactic parser,

known as the Finite-State Intersection Grammar. The ENGCG tagger con-

sists of two main rule components. The first component is a grammar specif-

ically developed for the resolution of part-of-speech ambiguities, while the second

rule component is a syntactic grammar. This syntactic grammar is able to re-

solve the pending part-of-speech ambiguities as a side effect. It uses only lin-

guistic distributional rules. The tagger consists of the following sequential

components:

- Tokeniser

- ENGCG morphological analyser, consisting of a lexicon and morphological

heuristic rules.

- ENGCG morphological disambiguator

- Lookup of alternative syntactic tags

- Finite-state syntactic disambiguator

The morphological analyzer assigns part-of-speech tags by looking each word

up in the lexicon, which contains about 80,000 entries, and then applying heuristic rules for still

unrecognized words. The default tag is noun when none of the rules apply.

A set of 139 tags was used. The author tested the ENGCG system against a

test corpus of 38,000 words and reported that it correctly tagged 99% of the words.

• TBL: Transformation-Based error-driven Learning

The most remarkable feature of Brill's tagger system [36,37], which makes it

differ from other rule-based systems, is that it automatically infers rules from a

training corpus [106]. Brill's rule-based tagger is based on a learning algorithm

called Transformation-based error-driven learning (TBL). It is a technique for

acquiring the rules automatically. Rules are learned by iteratively collecting er-

rors and generating rules to correct them. Figure 2.5, reproduced from

the original figure in [37], illustrates the learning process of Brill's tagger
system.

First, unannotated text is passed through an initial-state annotator. In this step,

the system assigns to every word its most probable POS tag, as estimated from

the small annotated training corpus. The training set is used here to determine

the most likely (frequent) tag for each word. For unknown words, the most

probable tag is guessed based on information such as an initial capital let-

ter or suffix analysis. For example, xxxxxxxion (where x represents any letter)

would be tagged as a noun because this is (presumably) the most common tag

for words ending in "ion". The tag in the second step is compared to the true

annotation as indicated by the annotations assigned in the manually annotated

training corpus. A transformation can then be learned, which can be applied to

the automatically annotated text to make it better resemble the manual annotation.


[Figure 2.5: Transformation-Based Error-Driven Learning (reproduced from the original figure in [37]).]

The tagger has a small set of rule templates. The templates are of the form:

Change tag a to tag b when the preceding (following) word is tagged z

A maximum of three words preceding or following the inflected word is con-

sidered in his transformation rules. The author also considered contex-

tual transformation templates, used to capture the relationships

between words. These templates are of the form:

Change tag a to tag b when one of the two preceding (following) words is w

For example, one automatically acquired contextual transformation is
as follows:

Change the tag from preposition to adverb when the word two positions to the

right is "as". Based on the remarkable accuracy the system achieved (97%), the


author showed that a rule-based approach can achieve high accuracy in com-

parison to systems that are based on a statistical approach.
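As a concrete illustration of how a learned transformation of this kind could be applied to already-tagged text, the minimal sketch below applies a single rule of the form "change tag a to tag b when the preceding tag is z". The rule tuple format and toy tags are assumptions for illustration, not Brill's actual data structures.

```python
# Illustrative sketch of applying one TBL transformation of the form
# "change tag a to tag b when the preceding tag is z"; the rule tuple
# format is an assumption, not Brill's implementation.

def apply_transformation(tagged, rule):
    """tagged: list of (word, tag); rule: (from_tag, to_tag, prev_tag)."""
    from_tag, to_tag, prev_tag = rule
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sentence = [("the", "DT"), ("can", "MD"), ("rusted", "VBD")]
# Hypothetical learned rule: change MD to NN when the preceding tag is DT.
print(apply_transformation(sentence, ("MD", "NN", "DT")))
# [('the', 'DT'), ('can', 'NN'), ('rusted', 'VBD')]
```

During learning, candidate rules instantiated from the templates are scored by how many tagging errors they fix on the training corpus, and the best one is kept at each iteration.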

2.5.2 Statistical Approach

The statistical approach is based on collecting statistics from existing corpora. Since

it requires much less human effort than the rule-based approach, it is the most popu-

lar approach. Garside and Smith [63] point out that the general idea of this approach is

that, when a sequence of words, each with one or more potential tags, is given, the

most likely sequence of tags can be chosen by calculating the probability of all pos-

sible sequences of tags, and then choosing the sequence with the highest probability.

A statistical model of language is used to disambiguate the word sequence. A suc-

cessful approach has been to model the sequence of tags in a sentence as a Hidden

Markov Model (HMM). To obtain a statistical language model, one needs to estimate

the model parameters, such as the probability that a certain word appears with a certain

tag (lexical probability) 14, or the probability that a tag is followed by another (contex-

tual probability) 15. These probabilities are trained on a manually tagged corpus [78].

Also, this estimation is usually done by computing unigram, bigram or trigram (N-

gram model) 16 frequencies on tagged corpora.

In order to define the goal of part-of-speech tagging systems with HMM models in

a little more detail, we consider the problem in its full generality 17. Let $w_{1 \ldots N} = (w_1, w_2, w_3, \ldots, w_N)$ be a sequence of words, where N is the length of the word se-

quence, and let $c_{1 \ldots N} = (c_1, c_2, c_3, \ldots, c_N)$ be a sequence of part-of-speech or lexical

14 probability of a part of speech given the word.
15 probability of a part of speech given k previous parts of speech.
16 an N-gram model uses information about both lexical probabilities and contextual probabilities.
17 see ref [98] for more details.


categories. When a word sequence is given, the goal of the part-of-speech tagging system is to

find the sequence of part-of-speech or lexical categories that maximizes the probability

of a sequence of tags $c_{1 \ldots N}$ given a sequence of words $w_{1 \ldots N}$, that is:

$T(w_{1 \ldots N}) = \arg\max_{c_{1 \ldots N}} P(c_{1 \ldots N} \mid w_{1 \ldots N})$   (2.1)

where $w_i$ denotes the i-th word in the word string and $c_i$ denotes the part-of-speech tag

assigned to the i-th word. After applying Bayes' rule to equation 2.1, it becomes:

$P(c_{1 \ldots N} \mid w_{1 \ldots N}) = P(c_{1 \ldots N}) \, P(w_{1 \ldots N} \mid c_{1 \ldots N}) \, / \, P(w_{1 \ldots N})$   (2.2)

After using further simplifying methods and approximations (see ref [23] for a detailed

explanation) to reduce equation 2.2, the final form of the formula becomes:

$P(c_{1 \ldots N} \mid w_{1 \ldots N}) \approx \prod_{i=1}^{N} P(w_i \mid c_i) \, P(c_i \mid c_{i-1})$   (2.3)

The term $P(w_i \mid c_i)$ in formula 2.3 is called the lexical probability. It can be estimated

from a corpus of text labeled with part-of-speech tags simply by counting the number

of occurrences of each word with each tag. It represents the probability that a given tag is

realised by a specific word.

The term $P(c_i \mid c_{i-1})$ in formula 2.3 is called the bigram probability; it indicates the like-

lihood of a tag given only the preceding tag. It can be estimated simply by counting

the number of times each pair of tags occurs and relating this to the individual tag

counts. For example, the probability that a verb (V) follows a noun (N) can be calcu-

lated as follows:

$P(c_i = V \mid c_{i-1} = N) \approx \dfrac{\mathrm{Count}(N \text{ at position } i-1 \text{ and } V \text{ at position } i)}{\mathrm{Count}(N \text{ at position } i-1)}$

The equation below gives the general formula for the N-gram language model, from which the

bigram and trigram models can be simply derived:

$P(w_1^N) = \prod_{k=1}^{N} P(w_k \mid w_1^{k-1})$   (2.4)

where $w_1^{k-1}$ denotes the word sequence $w_1, w_2, w_3, \ldots, w_{k-1}$.

• The bigram model can be derived as follows: $P(w_1^N) \approx \prod_{k} P(w_k \mid w_{k-1})$
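To make the estimation concrete, the following minimal sketch computes the lexical and bigram (transition) probabilities of formula 2.3 by counting, exactly as described above. The toy hand-tagged corpus and tag names are assumptions used purely for illustration.

```python
# A minimal sketch of estimating the lexical and bigram (transition)
# probabilities of formula 2.3 from a tiny hand-tagged corpus by counting.
# The toy corpus and tag names are illustrative assumptions only.
from collections import Counter

tagged_corpus = [
    [("the", "DET"), ("dog", "N"), ("barks", "V")],
    [("a", "DET"), ("cat", "N"), ("sleeps", "V")],
]

word_tag = Counter()    # Count(word tagged with tag)
tag_count = Counter()   # Count(tag)
tag_bigram = Counter()  # Count(tag at i-1 followed by tag at i)

for sentence in tagged_corpus:
    prev = "<s>"  # sentence-start pseudo-tag
    for word, tag in sentence:
        word_tag[(word, tag)] += 1
        tag_count[tag] += 1
        tag_bigram[(prev, tag)] += 1
        prev = tag

def lexical_prob(word, tag):
    # P(word | tag) = Count(word, tag) / Count(tag)
    return word_tag[(word, tag)] / tag_count[tag]

def transition_prob(prev_tag, tag):
    # P(tag | prev_tag) = Count(prev_tag, tag) / Count(prev_tag as a predecessor)
    total = sum(c for (p, _), c in tag_bigram.items() if p == prev_tag)
    return tag_bigram[(prev_tag, tag)] / total if total else 0.0

print(lexical_prob("dog", "N"))    # 0.5
print(transition_prob("N", "V"))   # 1.0
```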

In an HMM, we directly observe only the sequence of words, while the sequence of tags

is hidden from the observer of the text; hence the term "Hidden Markov Model" is ap-

propriate [63]. In addition, when the estimates used for the tag transition probabilities

are derived from bigrams, that is, we have estimated the likelihood of a tag given the

knowledge that a particular other tag precedes it, the model is called a first-order HMM.

A second-order HMM uses tag transition estimates derived from trigrams, that

is, we estimate the likelihood of a particular tag given the knowledge that two partic-

ular other tags precede it ([63], p.105). The simplest model would be a most-likely-tag

choice for each word.
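Given such estimates, a first-order HMM tagger typically recovers the hidden tag sequence with the Viterbi algorithm, which finds the sequence maximising the product in formula 2.3. The sketch below is illustrative only; the toy probability tables are assumptions, and a real system would estimate them from a large tagged corpus as described above.

```python
# Illustrative Viterbi decoding for a first-order HMM tagger: it selects the
# tag sequence maximising prod_i P(w_i|c_i) * P(c_i|c_{i-1}) of formula 2.3.
# The probability tables below are toy assumptions, not real estimates.

def viterbi(words, tags, trans, emit, start="<s>"):
    # best[t] = probability of the best tag path ending in tag t
    best = {t: trans.get((start, t), 0.0) * emit.get((words[0], t), 0.0) for t in tags}
    back = [{t: None for t in tags}]
    for w in words[1:]:
        new_best, ptr = {}, {}
        for t in tags:
            prev, score = max(
                ((p, best[p] * trans.get((p, t), 0.0)) for p in tags),
                key=lambda x: x[1],
            )
            new_best[t] = score * emit.get((w, t), 0.0)
            ptr[t] = prev
        best, back = new_best, back + [ptr]
    # Follow the back-pointers from the best final tag
    tag = max(best, key=best.get)
    path = [tag]
    for ptr in reversed(back[1:]):
        tag = ptr[tag]
        path.append(tag)
    return list(reversed(path))

tags = ["DET", "N", "V"]
trans = {("<s>", "DET"): 0.8, ("DET", "N"): 0.9, ("N", "V"): 0.7}
emit = {("the", "DET"): 1.0, ("dog", "N"): 0.5, ("barks", "V"): 0.5, ("barks", "N"): 0.1}
print(viterbi(["the", "dog", "barks"], tags, trans, emit))  # ['DET', 'N', 'V']
```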


Although the peak use of the statistical approach in part-of-speech tagging came

in the eighties, the first attempt to use the statistical approach started with the work

done by Stolz et al. [129] in 1965. Afterwards, many researchers presented valuable

tagging systems using the statistical approach. The seminal work is the CLAWS

system using an HMM [63-65,93]. Merialdo [108] developed a POS tagger for En-

glish based on the probabilistic trigram model. Brants [35] proposed TnT, a statis-

tical POS tagger. Many other systems were built using the statistical approach, such as those of

DeRose [117], Church [45], Cutting et al. [47], Weischedel et al. [137], Bahl and Mer-

cer [32], Samuelsson [122,123], and Kupiec [91]. A useful resource on the

statistical approach can be found in the work done by Merialdo [108].

Some of the well-known statistical systems are now briefly discussed.

WISSYN system

The WISSYN grammatical coder, developed by Stolz et al. [129], is the earliest known POS tagger

that uses probabilities to determine the grammatical classes (tags) of words.

It has four component phases: the dictionary, morphology, ad hoc and probability

phases. The first three phases accomplish the identification of the more frequently

occurring words, and the last performs the prediction of the remaining words.

• Dictionary phase: a small dictionary was used containing 300 words representing

the most frequent words in English. It covers not only the four main classes (noun,

verb, adjective and adverb) but also many further categories, such as pronouns,

prepositions, negatives, determiners and other closed classes. Each word of the

input text is checked against the dictionary entries. If a word is located in the

dictionary, a tag is retrieved and assigned to the word. If a word is not found

in this dictionary it is considered to be of one of the four main classes (noun,

verb, adjective and adverb). At this stage the authors reported that an average of

60%-70% of the words in a passage had been identified by this phase.

• Morphology Phase: this phase was constructed to deal with those words not

located in the dictionary during the previous phase. A small suffix dictionary con-

taining 63 suffixes was used in this phase to determine the grammatical class of

a word (morphological characteristics). For example, one such suffix test scans

the last four letters of the word to determine if they match the -ship suffix, the

-ment suffix, or any of a number of other four-letter endings. When a match is

found, the word is assigned its appropriate tag.

• Ad Hoc Phase: the first two phases of the WISSYN system operate on each word

of the input sequentially as it is isolated. In other words, the context plays no role.

In this phase and the next one, the context of the remaining words is

taken into account. At this phase the WISSYN system attempts to clarify

some of those words identified in either of the first two phases which remain

ambiguous. For example, the word "that", being a function word, is handled in the initial

dictionary phase, but happens to have multiple class membership in different

contexts (e.g., as in "that dog", "that the dog jumped", "the dog that jumped", etc.).

The Ad Hoc Phase uses rules to determine the most likely tag. This phase can

resolve eight ambiguities, including the various forms of "that" and the verb

"to be". The authors point out that they used a similar principle to that employed by

Klein and Simmons [90], in that a specified set of frames is pro-

vided as diagnostic for particular identifications. For example, a routine

processes certain preposition-adverbs to determine their exact usage, either as

prepositions (e.g., "in" in "in the house") or as adverbs (e.g., "come in from the cold"). The

authors reported that this stage identifies a further 10% of words on average.

• Probability Phase: this phase was constructed to use a set of conditional prob-

ability tables to predict the four main grammatical classes for those words not

tagged by the previous three phases. These probabilities were calculated from a

manually tagged corpus that contained about 28,500 words. At this phase, the

three preceding tags and the three following tags of a given word are examined.

The authors state that this phase correctly tagged around 20% of words in texts.

A test set containing 1,916 words was used to test the tagger system. The tagging

used a tag set consisting of 18 tags. For the overall accuracy of the WISSYN

system, the authors reported that it correctly tagged 92.8% of words.

CLAWS system

The original CLAWS (Constituent-Likelihood Automatic Word-tagging System) sys-

tem (version 1) [63-65,93] was developed by Marshall I., Garside R., Leech G., and

Atwell E. over the period 1981 to 1983 at the Unit for Computer Research on the En-

glish Language (UCREL) at the University of Lancaster [111]. It was used to tag about

one million words of the (LOB) Corpus with 96-97% accuracy [103]. The LOB tag set, con-

taining 133 tags, was used with CLAWS version 1.

The CLAWS system has five phases: pre-editing, tag assignment, idiom-tagging, tag dis-

ambiguation, and post-editing. Furthermore, it is composed of four separate programs:

PREEDIT, WORDTAG, IDIOMTAG, and CHAINPROBS. These programs are associ-

ated with the pre-editing, tag assignment, idiom-tagging, and tag disambiguation phases

respectively ([103], p.64).

• PREEDIT: concerned with the preparation of text for processing by the system.


• WORDTAG: assigns to each word a list of all possible tags for that word, using a

knowledge base or a set of rules to deduce the candidate word classes. The WORD-

TAG program has a knowledge base containing 7,200 words which are stored with

a list of their candidate tags. This program was constructed not to resolve

the lexical ambiguity of the word, but merely to assign a list of all possible tag(s)

to each word. At this stage, if the word has only one tag, then that tag is associated

with the word and the word is assumed to be correctly tagged.

• IDIOMTAG: designed to assign a single tag to compound symbols (composed

of more than one word). For example, "such that" is assigned one tag.

• CHAINPROBS: designed to choose one of the candidate tags for those words that still

have more than one tag at the end of the WORDTAG execution. The CHAINPROBS

program uses statistical analysis (bigrams). Garside [64] estimated that 35% of

the LOB corpus words had more than one tag associated with them.

CLAWS was developed based on TAGGIT, except that CLAWS adopts a statistical

technique for resolving cases with ambiguous categories. It uses a table of prob-

abilities of predecessor and successor tags to calculate the likelihood (probability)

of all paths for each sequence of ambiguous words and eliminates sequences with low

probability. The predecessor/successor probabilities of tags were extracted from a large

proportion of the tagged Brown corpus. If tagging fails in ambiguous cases, context-

dependent disambiguation is carried out based on the context frame rules of TAGGIT.

CLAWS version 2, on the other hand, was developed over the period 1983-1986 to

reduce the manual and automated pre-editing required by the system before any text

could be analysed. It differs little from CLAWS version 1. The main difference is the

automation of tag analysis itself. In addition, an extended tag set was used in CLAWS


2, containing 166 tags. Also, some changes were made in the WORDTAG program used in

CLAWS 1 as part of the overall goal of removing any manual pre-editing. For example,

the WORDTAG program in CLAWS 2 dealt with capitalisation and abbreviation ([103],

p.78). Work on the current version of CLAWS (version 4) began in 1988 to undertake the

enormous task of tagging the 100 million word British National Corpus (BNC). In this

version of CLAWS, the authors separated the tagger from the tag set. They used the BNC

tag set 18. In addition, they added a component to handle SGML tags,

since the BNC was marked up with these tags ([88], p.25). The authors reported that

CLAWS 4 achieved an overall accuracy of 96-97% on the BNC corpus words. However, it

seems that CLAWS uses a hybrid technique since it has rule-based and statistical

components.

PARTS system

Church [45] also implemented a statistical tagger called PARTS. It used the lexical

probability, which is the probability of observing part of speech i given word j, and the

contextual probability, which is the probability of observing part of speech i given k

previous parts of speech. The author calculates the product of the lexical probabilities

and the contextual probabilities for each combination of ambiguous word sequences.

The tag sequence that gets the highest probability is selected as the proper tagging

result. PARTS differs from CLAWS in terms of the statistical model used: the

former uses a trigram model while the latter uses a bigram model. Furthermore, PARTS

does not have a rule-based component. The author reported that PARTS achieved an

overall accuracy of 95-99%.


18 For more: http://www.scs.leeds.ac.uk/amalgam/amalgam/corpus/tagged/edited/ipsm_hncc5.prf.html
(see also the UCREL C7 tag set)


2.5.3 Advantages and disadvantages of rule-based and statistical

approaches

1. Rule-based Approach

The rule-based approach has some features and advantages, which can be summa-

rized as follows [36]:

• A vast reduction in the amount of stored information, because this approach

represents knowledge in the form of rules rather than stored data records. The

model does not need a huge manually tagged corpus to calculate probabilities.

• The language model is written from a linguistic point of view and explicitly

describes linguistic phenomena.

• The model may contain many complex kinds of knowledge.

• The written rules are easy to understand and to maintain.

• High portability from one kind of text corpus to another.

• It allows the construction of an extremely accurate system.

The disadvantages of the rule-based approach include:

• Limited portability of the language model to other languages.

• The language model requires a great deal of labour and cost.

• Usually the language models do not consider frequency information.

2. Statistical Approach
The advantages of using a statistical approach [108] can be summarized as fol-

lows:

• When a huge manually tagged corpus in the desired language is available,

transporting the model from another language becomes much easier.


• Language models consider frequency information.

• The probabilities can be estimated automatically from data.

The disadvantages of the statistical approach include:

• It cannot deal well with unknown words.

• The model needs a huge manually tagged corpus to calculate probabilities.

• The model needs a huge matrix to represent the information.

Samuelsson and Voutilainen [58] and Chanod and Tapanainen [42] show that a rule-

based tagger for English and French respectively can achieve better results than a sta-
tistical tagger.

2.5.4 Hybrid and Other approaches

Some implementations combine the statistical approach with the rule-based approach to build a

hybrid POS tagger. Chanod and Tapanainen [42] developed a tagger that uses a combi-

nation of both statistical and rule-based approaches for French. Kuba et al. [26] built

a hybrid tagger for Hungarian. Schneider and Volk [126] trained the Brill tagger for

German and French respectively.

Additionally, some different approaches have been used for building text taggers.

Schmid [125], Marques and Pereira [100], and Antonio et al. [114] developed POS tag-

gers using Neural Networks. They train a single-layer perceptron to produce the POS

tag of a word. They reported overall accuracies of 96.2%, 92.7% and 92% respectively.

Schmid trained his tagger on 2 million words of the Penn Treebank corpus, and tested it

on 100,000 words of the corpus. Marques and Pereira trained their tagger on a very

small Portuguese training corpus (15,000 words), and tested it on 2,229 words, while


Antonio et al. trained their tagger on 46,461 words, and tested it on 47,397 words of the Wall

Street Journal corpus.

Daelemans et al. [48] developed a memory-based part-of-speech tagger-generator.

Memory-based systems are basically a form of k-nearest-neighbour systems where a set

of cases (the training data) is kept in memory, and each test sample uses a distance

metric to determine which training samples are closest. Then, the test sample is classi-

fied as having the same class as those training samples. The set of cases in this approach usually

consists of a word, its preceding and following context, and the POS of that word in

that context. The authors trained their tagger using a tagged corpus. To tag a new sen-

tence, for each word and its context, the most similar case(s) kept in memory

are selected and the POS tags are extracted from those cases. The tagger was trained on

two different set sizes (two million words and 500,000 words). The authors reported an

average accuracy of 96.4%.
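To illustrate the memory-based idea (not the MBT implementation itself), the sketch below stores (previous word, word, next word) cases and tags a new word by the most similar stored case, using simple feature overlap as the similarity metric. The feature choice, data and similarity function are simplifying assumptions.

```python
# Illustrative sketch of memory-based tagging as 1-nearest-neighbour lookup:
# each stored case is (previous word, word, next word) -> tag, and a simple
# feature-overlap count serves as the similarity metric. A toy analogue of
# the idea, not the MBT tagger-generator itself.

def make_cases(tagged_sentence):
    cases = []
    words = [w for w, _ in tagged_sentence]
    for i, (w, tag) in enumerate(tagged_sentence):
        prev_w = words[i - 1] if i > 0 else "<s>"
        next_w = words[i + 1] if i + 1 < len(words) else "</s>"
        cases.append(((prev_w, w, next_w), tag))
    return cases

def tag_word(case_features, memory):
    # Pick the stored case sharing the most features with the test case
    # and return its tag (ties resolved arbitrarily in this toy version).
    best = max(memory, key=lambda m: sum(a == b for a, b in zip(case_features, m[0])))
    return best[1]

memory = make_cases([("the", "DET"), ("plan", "N"), ("works", "V")])
print(tag_word(("the", "plan", "failed"), memory))  # 'N'
```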

Decision trees have also been used to implement part-of-speech taggers. A decision tree is

a tree in which each internal node is a feature test and the leaves are classes to be as-

signed to the tested individual. The trees are constructed using statistical information.

Marquez and Rodriguez [101] implemented a POS tagger using decision trees that

has been tested and evaluated on the Wall Street Journal corpus. The authors reported

an overall accuracy of 96.16%.

Maximum Entropy, on the other hand, is another technique used for building text tag-

gers. This technique uses a statistical model that can be classified as a Maximum Entropy

model. It uses many contextual "features" to predict the POS tag. The Maximum En-

tropy model is trained on a corpus annotated with part-of-speech tags and assigns them

to previously unseen text.

Ratnaparkhi [116] developed a Maximum Entropy POS tagger. The tagger was

trained on 962,687 words taken from the Wall Street Journal and it has been tested on

133,805 words. The author reported an overall accuracy of 96.6%.

2.5.5 Arabic POS Tagging Systems

As mentioned in chapter 1, research in Arabic computational linguistics in

general, and POS tagging specifically, has been growing significantly in recent years. The list

below summarises most of the work done on POS tagging for Arabic.

• El-Kareh and Al-Ansary [54] used a statistical approach in a semi-

automatic POS tagger for Arabic.

• Shereen Khoja [87,88] describes a hybrid tagger system that uses both morpho-

logical rules and statistical techniques in the form of hidden Markov models.

• Abuleil and Evens [15] describe a system for building an Arabic lexicon auto-

matically by tagging Arabic newspaper text using some rules and morphological

analysis.

• Andrew Freedman [62] implemented Brill's POS tagger for Arabic.

• Diab et al. [51] present a Support Vector Machine (SVM) based approach to

automatically tokenize and part-of-speech tag Arabic text.

• Habash and Rambow [71] presented an approach based on a morphological analyser and

Support Vector Machines (SVM) for tokenisation, part-of-speech tagging,

and morphological disambiguation in Arabic.

• Alshamsi and Guessoum [127] described an HMM POS tagger system.


• Marsi et al. [102] employed MBT, a memory-based tagger-generator and tagger

developed by Daelemans et al. [48], to produce a POS tagger for Arabic.

• Harmin [75] described a web-based Arabic tagger based on Buckwalter morpho-


logical analyser [39].

• Buckwalter [11] presented a morphological analyser for Arabic using lexicon


rules.

All the systems described above were built to tag unvocalised Arabic text. Some of

these systems are discussed below in more detail.

• Semi-automatic Arabic tagger system

El-Kareh and Al-Ansary [54,87] described a semi-automatic tagger to tag un-

vocalised Arabic text that waits for the user of the system either to confirm and

accept the output of the system or to change it. Their system used statistical tech-

niques (HMM) and morphological rules. A small set of words was stored in its

lexicon with their class and sub-class as well as some inflectional features.

Morphological rules were used to remove affixes and particle words from the

testing text. The analysis result of the morphological component represents the main

class the word belongs to, or its sub-class and inflectional features, which are already

stored in the lexicon. The analysis result is passed to the user. At this stage, the user

may accept or reject the system's result. In case the user rejects the result, the

word is analysed once again and passed back to the user. If the system completes

its analysis without an accepted result from the user, the user has the

option to store his own correct analysis.

The statistical component in their system is used to calculate statistics collected

throughout the use of the system and stored later in the lexicon. This component

is used to select the tag with the highest frequency without any intervention from

the user. A testing corpus collected from the Egyptian Al-Ahram newspaper was used

to test their tagger. The authors report an accuracy of 90%.

• APT: Arabic part-of-speech tagger system

Shereen Khoja [87,88] developed the APT system, which uses statistical and rule-

based approaches. In the author's view, APT is the first tagger

system for Arabic, for two reasons. First, it is the first fully-automatic tagger for

Arabic. Second, the aim of this tagger is to produce a POS tagged

unvocalised Arabic corpus that may be used as a useful tool for linguistic research.

A manually tagged lexicon containing 50,000 words was used to extract several

small lexicons. A training corpus containing about 10,000 words was used to

train her tagger. The APT tagger has two main components: a rule-based component

(stemmer) and a statistical component. Figure 2.6, reproduced from the

original figure in ([88], p.78), illustrates how APT performs tagging.

[Figure 2.6: How APT performs tagging (reproduced from ([88], p.78)): Arabic words pass through lexicon lookup and the stemmer; words with multiple tags go to the statistical component, while words with unique tags are output directly.]


APT performs the tagging process as follows. Each word is initially looked up

in the lexicon. If the word is found in the lexicon, then it is assigned all the pos-

sible POS tags found in the lexicon. The word is then passed to the stemmer

regardless of whether it was found in the lexicon or not. The main function of

the stemmer is to remove all prefixes, suffixes and infixes to produce the root. The

author does not mention the number of strings that were used in her stemmer
affix lists.

If a word could not be stemmed, and was not found in the lexicon, then it is

given the main tags (noun, verb, particle, residual and proper noun). As Khoja

points out, at this point each word has at least one tag. If a word has

more than one tag, then this word (and its neighbours) are passed to the statisti-

cal component where the most likely tag is selected. The APT statistical component

uses contextual and lexical probabilities to determine the most likely tag of

the word. A corpus containing 1,700 words was prepared to test her tagger. The

author reports that the APT system correctly tagged 86% of the words.

• HMM Part-of-Speech Tagger for Arabic system

Alshamsi and Guessoum [127] presented a Part-of-Speech (POS) tagger for Ara-

bic. The POS tagger resolves Arabic text POS tagging ambiguity through the

use of a statistical language model developed from an Arabic corpus as a Hidden

Markov Model (HMM). The main goal behind the development of their POS

tagger is to use it for Named Entity extraction. The input of the tagger is noun-

phrase and verb-phrase Arabic sentences.

Like Khoja's work, their system has two main components: a stemmer and a statis-

tical component. The authors used Buckwalter's stemmer to stem the training

data. A training corpus containing 27,594 nouns, 23,554 verbs, 5,722 adjectives and

5,384 proper nouns from Arabic news articles was used. The training corpus was

tagged manually with the 55-tag POS tag set developed by the authors.

During the tagging process, after the tokenizer converts the original input text

into a list of words using the space as a delimiter, the resulting list is passed to

the stemmer. A trigram language model was constructed and the tri-

gram probabilities were used in building their HMM model. Each word with more than one

tag is tagged by calculating the lexical and contextual probabilities. A test

corpus containing 944 words was used to test their tagger system. The authors

report that their tagger achieved 97%. This high level of accuracy is surprising

given the small size of their training corpus. However, as the authors

point out, they are in the process of enlarging their training corpus to

reach one million words.

• Web-based Arabic tagger system

Harmin [75] described a web-based Arabic tagger. As the author points out, this

tagger was still in early development. The architecture of the tagger is based on

three tiers: the client tier, the middle tier, and the database tier.

The client tier is a web browser which sends the user's message to the web

server and displays the returned results back to the user. The middle tier consists

of a web server, a scripting engine and an NLP module which is responsible for

analysing the Arabic documents. The third tier consists of an SQL server

and the database used in the tagger system.

The tagger uses the Buckwalter dictionary and his morphological analyser [11]

distributed by the Linguistic Data Consortium (LDC). The author collected about

42,000 HTML Arabic documents, mostly from the Al-Hayat Arabic newspaper.

These documents were translated into XML format to test the tagger. The user

can write a sentence and pass it to the tagger. Each word in the sentence is looked

up in the dictionary, analysed and segmented into prefix, stem, and suffix. The

result returned to the user contains all possibilities for the word. The author did

not mention any information about the tag set used or the accuracy the

system achieved. However, based on their system snapshot, it seems they used

the LDC tag set.

• MBT: Memory-Based Tagger for Arabic

Marsi et al. [102] employed MBT, a memory-based tagger-generator and tag-

ger developed by Daelemans et al. [48], to produce a POS tagger for Arabic.

Memory-based tagging is based on the idea that words occurring in similar con-

texts will have the same POS tag.

They used the Arabic Treebank-1 corpus and the LDC tag set. Their training corpus

contains 150,966 words. The test set contains 15,102 words, of which 947 words do

not appear in the training corpus (unknown words).

The MBT tagger has three modules: a lexicon module, which stores for all words

occurring in the provided training corpus their possible tags; and a second module, which

generates two distinct taggers, one for known words and the other for unknown

words. The known-word tagger uses a lexicon, while the unknown-word tagger

attempts to derive as much information as possible from the surface form of the

word, by using its suffix and prefix letters as features.

The authors report that the accuracy of the tagger using the first two modules on

the test corpus is 91.9% correctly assigned tags. They state that on the 14,155

known words in the test set the tagger attains an accuracy of 93.1%, while on

the 947 unknown words the accuracy is considerably lower: 73.6%. The third

module of their tagger has been designed to improve the precision and recall of

their system. A tagger integrated with morphological analysis was also built

as a separate part of their work to enhance the accuracy.

• Buckwalter Arabic Morphological Analyser


Buckwalter [11] developed a morphological analyser for Arabic. It was pro-

duced by LDC and used for pas tagging Arabic text. The author used three

lexicons:

1. Prefixes lexicon contains 299 entries in the first release, 548 entries in the

second release.

2. Suffixes lexicon contains 299 entries in the first release, 906 entries in the

second release.

3. Stems lexicon contains 82,158 entries in the first release, 78,839 entries in

the second release.

61
2.5. PART-OF-SPEECH TAGGING APPROACHES

In addition, the lexicons are supplemented by three morphological compatibility

tables used for controlling prefix-stem combinations, stem-suffix combinations,

and prefix-suffix combinations. The data is written using his Arabic translitera-

tion system 19 instead of original Arabic script. The author Morphology Analysis

Algorithm (MAA) is based on four assumptions:

- Words are composed of three elements: prefix, stem, and suffix.

- The prefix can have 0-4 characters.

- The stem can have I-infinite characters.

- The suffix can have 0-6 characters.

Each input word is segmented into three elements: prefix, stem and suffix. Each

element is looked up in its respective lexicon. If all three word elements (prefix,

stem, suffix) are found in their respective lexicons, then their respective com-

patibility tables used to determine whether they are compatible or not. Three

questions are asked here :

1. Is the morphological category of the prefix compatible with the morpho-

logical category of the stem? (Le., is the combination pair found in the list

of compatible prefix-stem morphological categories?)

2. if so, is the morphological category of the prefix compatible with the mor-

phological category of the suffix? (Le., is the combination found in the list

of compatible prefix-suffix morphological categories?)

3. if so, is the morphological category of the stem compatible with the mor-

phological category of the suffix? (i.e., is the combination found in the list

of compatible stem-suffix morphological categories?)


19 For more: http://www.qamus.org/transliteration.htm


If the answer to the last question is "yes", then the morphological analysis is

valid. The morphological analyser produces all the variations of the input

word, including the short vowels and diacritics. The POS tag (which is stored

in the lexicons) for each variation is also produced. For those who are

interested, the Buckwalter Arabic Morphological Analyser can be found in [5].
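The sketch below illustrates the general shape of this segment-and-check procedure; the toy lexicons, morphological category names and compatibility pairs are assumptions made for illustration and are not the actual Buckwalter data files (which are distributed by the LDC).

```python
# Illustrative sketch of the segment-and-check idea described above; the toy
# lexicons, category names and compatibility pairs are assumptions, not the
# real Buckwalter lexicons or tables.

PREFIXES = {"": "Pref-0", "wa": "Pref-Conj"}        # prefix -> morphological category
SUFFIXES = {"": "Suff-0", "At": "NSuff-At"}         # suffix -> morphological category
STEMS = {"kitAb": ("N", "NN")}                      # stem -> (category, POS tag)
PREF_STEM = {("Pref-0", "N"), ("Pref-Conj", "N")}   # compatible prefix-stem categories
PREF_SUFF = {("Pref-0", "Suff-0"), ("Pref-Conj", "NSuff-At")}
STEM_SUFF = {("N", "Suff-0"), ("N", "NSuff-At")}

def analyse(word):
    """Return all (prefix, stem, suffix, POS) analyses passing the three checks."""
    analyses = []
    for i in range(0, min(4, len(word)) + 1):           # prefix: 0-4 characters
        for j in range(len(word), i, -1):               # stem: at least 1 character
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix in PREFIXES and stem in STEMS and suffix in SUFFIXES and len(suffix) <= 6:
                pc, (sc, pos), fc = PREFIXES[prefix], STEMS[stem], SUFFIXES[suffix]
                if (pc, sc) in PREF_STEM and (pc, fc) in PREF_SUFF and (sc, fc) in STEM_SUFF:
                    analyses.append((prefix, stem, suffix, pos))
    return analyses

# Toy example in Buckwalter-style transliteration (assumed, for illustration).
print(analyse("wakitAbAt"))  # [('wa', 'kitAb', 'At', 'NN')]
```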

2.6 Chapter Summary

This chapter contained a description of the POS tagging problem and the NLP applications

that may use POS tagger systems as their first stage. We defined the concepts of corpus

linguistics and the POS tag set. Previous work on corpus linguistics and POS tag sets

has been discussed. In addition, the different approaches used to solve the problem

have been examined and previous work on POS tagging for English and Arabic has
been explored.

This work employs the rule-based approach. The AMT tagger presented in this work has

two main rule components: pattern-based rules and lexical and contextual

rules. The basic idea of the pattern-based technique is to automatically generate a lexicon

of patterns instead of using a manually tagged lexicon or training corpus containing

a set of Arabic words. The triggers in the pattern-based rules depend on the patterns of

text words. A novel algorithm to match the Arabic word in the testing corpus with its cor-

rect pattern in the pattern lexicon has also been built. In addition, a small number of

hand-written rules and constraint rules have been used to assist the main technique in

assigning the correct tag to those words not tagged by the pattern-based technique.

The next chapter covers some of the basics of the Arabic language and describes the diacritic feature in Arabic and its importance in Arabic POS tagging. The POS tagger design and the main technique are described in Chapter 5.

Chapter 3

Arabic Language and POS tagging

Objectives

• To present an overview of Arabic language and its script.

• To describe the diacritic feature in Arabic.

• To explain the importance of the diacritic feature in Arabic POS tagging.

• To briefly define the Arabic grammatical system.

3.1 Introduction

The most prominent member of the Semitic language family is Arabic. This family also includes Hebrew, Amharic, Maltese and Syriac, all of which share a pattern-based morphological system: words are built from a root, usually consisting of three consonants, and a pattern structure. The root gives the basic lexical meaning of the word, while the pattern consists of vowels and signals the grammatical significance of the word 1. Bar-Haim et al. state that:

"Semitic languages have rich inflectional systems and a template-based derivational morphology, which are manifested in a large variation of word forms" ([118], p. 4).

Arabic is considered the most widely used member of the Semitic languages. It is spoken by more than 300 million Arabs around the world and is also understood by more than 1.1 billion other Muslims. It has been a literary language since the 6th century A.D. and is the liturgical language of Islam in its classical form [70]. It exhibits a rich inflectional and morphological system.

Arabic words, like words in other Semitic languages, are written with consonants. The Arabic language has several varieties: Classical Arabic, Modern Standard Arabic (MSA) and Colloquial (spoken) Arabic. Classical Arabic is the language of the Qur'an and classical literature; it is used as the language of religious practice throughout the Islamic world. Modern Standard Arabic (MSA) is the language of the media, education and formal communication, and is understood by all Arabic speakers. Colloquial (spoken) Arabic comprises the local dialects of people throughout the Arab world [134].

The principal script used for writing the Arabic language is the Arabic alphabet, which is composed of 28 letters. Writing in Arabic is unicase; the concept of distinguishing between upper and lower case letters does not exist. Furthermore, written Arabic uses a cursive system running from right to left 2. The transliteration system for the Arabic alphabet and the diacritical marks used in this thesis is described in Appendix B on page 191.

1 For more information: www.a-z-dictionaries.com/language/Arabic_dictionaries.html
2 For more details: http://foolswisdom.com/users/sbett/arabic.htm

While the Arabic alphabet was originally used to write the Arabic language, it has been adopted by other groups to write their own languages, such as Persian, Pashto and Urdu. A letter in the Arabic language is written in multiple forms, depending on where in a word the letter appears: at the beginning of a word (initial form), anywhere other than the beginning or the end of a word (medial form), or at the end of the word (final form) 3. For example, in the Arabic words (مدارس 4, mdArsa, "schools"), (سمع, smEa, "to hear") and (حلم, hlma, "to dream"), the letter م m (miim) appears in initial, medial and final form, respectively.

In the Arabic language, a word may be an original word or an Arabized word. The original words have two subcategories, derivative Arabic words and fixed Arabic words, while the Arabized words are nouns borrowed from foreign languages [56].

Derivative Arabic words, which are words belonging to the verb and noun classes, are built from the same root and obey the Arabic derivation rules [72]. For example, the words مكتب, mktb, "office", كتاب, ktAb, "book", and كتب, ktba, "he wrote", are derived from the root كتب, ktb, which carries the meaning of writing.

Fixed Arabic words are words which do not obey the Arabic derivation rules, for example the particles في, fy, "in", and من, mn, "from".

3 For more detail: http://www.ancientscripts.com/arabic.html
4 Since this thesis is written using LaTeX, an additional diacritical mark may sometimes be added by the system automatically and appear above a letter other than the last letter of an Arabic word, such as a fatha mark above the second letter of a word. These marks have been ignored when dealing with the word or the pattern of the word; in this work we are concerned only with the last diacritical mark.


3.2 Arabic script and diacritics feature

3.2.1 Brief history

The Arabic script, like the Latin script, was derived from the first alphabet, which was created by the Phoenicians around 1300 B.C. The Phoenician script comprises 22 letters, as shown in Figure 3.1, and was written from right to left without capital letters. Since the Phoenicians lived in Lebanon, Palestine and Syria (the Middle East), their script was born in Lebanon.

(Figure 3.1 is a diagram relating the following scripts: Modern Latin, Early Latin, Early Greek, Phoenician, Early Aramaic, Nabataean and Early Arabic.)

Figure 3.1: The origin of the Arabic script

Later, the Aramaic alphabet originated from the Phoenician in about 1000 BC. The Nabataean script was then born in the city of Petra, north of the Red Sea in Jordan, in about 100 BC, and spread all over the Middle East. The early Arabic alphabet was created in Kufa (Iraq) in the middle of the first century. The old Arabic alphabet consisted of around 17 letter forms without dots or diacritical marks, and its calligraphic style was the Kufi style. With the birth of Islam, the Quran was written in the Quranic Kufi script.

5 Figure 3.1 taken from: http://29letters.wordpress.com/2007/05/28/arabic-type-history/


Since the missing dots and vowels in the old Arabic script were not clearly indicated, several letters of the Arabic alphabet shared the same shape; for example, the letters ب, ت and ث have the same shape (without dots), which certainly led to confusion for Quranic readers. Since the Quran became the reason to reform all the Arabic scripts found in Arabia on the one hand, and the number of non-Arab Muslims increased on the other, some reform was needed to avoid confusion and to facilitate the reading and learning of Arabic as well.

The first system for developing the old Arabic script was invented by Abul Aswad al Duali (688 AD), who placed large coloured dots on the text in order to help with pronunciation. Later, a uniform system to distinguish letters by using dots (the one in current usage) was developed by Al Hajjaj ibn Yusuf al Thaqafi. Lastly, Al Khalil ibn Ahmad al Farahidi (786 AD) devised a diacritical system to replace Abu al Aswad's system.

Using the dot system, one, two or three dots were added to letters with similar phonetic characteristics, giving a total of 28 letters, including three long vowels. This unified, well-structured Arabic script was developed for writing the holy script of the Quran in the 7th century, together with the development of calligraphic styles. Later the Quran was written in the Quranic Naskh style 6.

The Phoenician alphabet, on the other hand, was used as a model by the Greeks, who added letters for vowels. Afterwards the Greek model became the model for early Latin, and ultimately for all Western alphabets 7.


6 For more: http://sakkal.com/ArtArabicCalligraphy.html
7 http://www.answering-islam.org/Green/seven.htm


3.2.2 Arabic Diacritical Marks

The Arabic language has two types of vowels: long and short. The long vowels are three letters that form part of the Arabic alphabet 8. The short vowels are three small vowel marks (see Table 3.1) which do not form part of the Arabic letters; these marks are placed above or below a letter.

Fatha   /a/   mark above the letter
Damma   /u/   mark above the letter
Kasra   /i/   mark below the letter

Table 3.1: Arabic short vowel diacritics

Fatha represents the sound of /a/ in bag, damma represents the sound of /u/ in put and, finally, kasra represents the sound of /i/ in sit.

Moreover, there are five other diacritical marks 9. Three of them, shown in Table 3.2, are called nunation (Tanween Fath, pronounced /an/; Tanween Damm, pronounced /un/; and Tanween Kasr, pronounced /in/). Nunation is the doubling of the short vowels, used at the end of indefinite nouns.

Tanween Fath   /an/   mark above the letter
Tanween Damm   /un/   mark above the letter
Tanween Kasr   /in/   mark below the letter

Table 3.2: Nunation (Tanween) diacritics

Finally, the last two marks in use are sukun (absence of a vowel), which means that the consonant is not followed by a vowel, and gemination (shadda), which means a duplication of the consonant; these marks are shown in Table 3.3.


8 The three long vowel letters are: ا Alif, و waaw, ي yaa.
9 Sometimes researchers distinguish between short vowel marks and diacritical marks. In this thesis, we use the term diacritics to represent all marks (including the short vowel marks).


Sukun    mark above the letter
Shadda   mark above the letter

Table 3.3: The Sukun and Shadda marks

In the Arabic language, diacritics are used in the Qur'an, in other religious texts, in classical poetry, in textbooks for children and foreign learners, and in complex texts to avoid ambiguity. Diacritical marks may be assigned to every character of an Arabic word; in this case the word is called fully-vocalised. When the diacritical marks are assigned to most letters of the word, but not all, the word is called half-vocalised. An Arabic word is partially-vocalised when diacritical marks are assigned to one or at most two letters of the word [56]. Table 3.4 shows an example of each vocalisation state of an Arabic word.

Translation : "I wrote the lecture today evening"


Transliteration : ktbtu AlmHADrpa msA'a Alywmi
Arabic Sentence in :
.. ..
Cy-\ g~~\ ~
Full-vocalized: ,~
..
Half-vocalized: i y-\
.. ,~ o~~\ $.. .. ..
g.?W\ ~.. .. ..
Partial-vocalized : C~\ ,l;...o

Table 3.4: Vocalisation state of the Arabic word
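As a concrete illustration of these three states, the following minimal Python sketch (not part of the AMT system) counts the diacritical marks described above, using their standard Unicode code points, and labels a word accordingly; the thresholds are a simplification of the definitions given in the text.

# The eight diacritical marks of Section 3.2.2, by Unicode code point.
DIACRITICS = {
    "\u064E": "Fatha", "\u064F": "Damma", "\u0650": "Kasra",
    "\u064B": "Tanween Fath", "\u064C": "Tanween Damm", "\u064D": "Tanween Kasr",
    "\u0652": "Sukun", "\u0651": "Shadda",
}

def vocalisation_state(word):
    letters = [c for c in word if c not in DIACRITICS]
    marks   = [c for c in word if c in DIACRITICS]
    if not marks:
        return "unvocalised"
    if len(marks) >= len(letters):     # roughly one mark per letter
        return "fully-vocalised"
    if len(marks) <= 2:                # one or at most two marked letters
        return "partially-vocalised"
    return "half-vocalised"

print(vocalisation_state("كَتَبَ"))     # fully-vocalised
print(vocalisation_state("كتبَ"))      # partially-vocalised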

3.3 Importance of the diacritic feature in Arabic POS tagging

The basis of POS tagging is that many words are ambiguous regarding their grammatical category [109]. For instance, the word ذهب, dhb, in the unvocalised Arabic sentence presented in Table 3.5 10, which means either "has gone" or "gold", can be a verb or a noun.

10 The tags used in the sentence presented in Table 3.5 are described in more detail in Chapter 4.


Because the sentence is unvocalised, this lexical ambiguity is predictable; resolving it requires an adequate context and/or adequate knowledge of the semantic information [109].

Adding semantic knowledge to an unvocalised Arabic text is not an easy task, because it is very difficult to predict the semantic meaning when the diacritics (at least one diacritical mark) are missing from the text. Furthermore, removing the ambiguity on the basis of an adequate context requires a more sophisticated technique, such as a statistical technique, which still suffers from several disadvantages: it needs a huge manually tagged lexicon or training corpus, it cannot deal with unknown words, and it needs a huge matrix to represent the statistical information.

Transliteration:   msrEA        AltAlb          dhb
Possible POS tags: NuAj/NuCn    NuCnNm/NuCnAc   VePe/NuCn
Translation: "The student has gone quickly"

Table 3.5: Unvocalised Arabic sentence and its POS tags

The lack of diacritics in Arabic texts presents a major challenge to most Arabic NLP tasks, including parsing [95]. The use of diacritics in Arabic texts is extremely important. The list below summarises the importance of using diacritics in the Arabic language:

1. Adding semantic information to words helps resolve ambiguity in their meaning. For example, adding the short vowel (fatha mark) to the last letter of the word ذهب presented in Table 3.5, giving ذهبَ, removes the ambiguity in the meaning of the word ("has gone").

2. Determining the correct POS tag for the words in the sentence. For example, the word ذهبَ definitely belongs to the verb class.

3. Indicating the grammatical functions of words, differentiating a word from other words, and determining the syntactic position of the word in the sentence. For example, short vowels are used to indicate mood, aspect and voice endings for verbs and case endings for nouns.

4. Indicating the correct pronunciation of words and the correct syntactic analysis, which reduces problems for NLP applications such as text-to-speech or speech-to-text, and removes semantic confusion for Arabic readers [139] [95] [55].

The above list shows that using diacritics in a text is important for differentiating a word from other words and determining its syntactic position in the sentence, such as nominative, accusative or genitive. In addition, these diacritical marks determine the inflectional features 11 of the sentence words, such as gender, person, number, noun case and verb mood.

For example, consider the following Arabic sentence:

Arabic Sentence: حضرت الدرس كله
Transliteration: HDrt Aldrs klh
Translation: "(I, she) attended all the lesson"

In the above sentence, it is very difficult to determine the inflectional features of the word حضرت, HDrt, "attended", with the diacritics missing, especially the last diacritical mark (case ending). Neither the context nor the word itself can provide any information on the inflectional features of such a word. Thus, the last diacritical mark helps not only in determining the correct part-of-speech (general tag) of the words in the sentence, but also in providing full information regarding the inflectional features of the sentence words (detailed tag).

11 Arabic inflectional features are described in Chapter 4, Section 4.2.

The possible last diacritical marks (case endings) of the word حضرت, HDrt, and the inflectional features for each case can be seen in Table 3.6.

Case ending         Inflectional features
Damma  (HDrtu)      first person, singular number, masculine gender, indicative mood
Sukun  (HDrtx)      third person, singular number, feminine gender, jussive mood
Fatha  (HDrta)      second person, singular number, masculine gender, subjunctive mood
Kasra  (HDrti)      second person, singular number, feminine gender, jussive mood

Table 3.6: The possible last diacritical marks (case endings) of the word HDrt

The correct tags for the sentence presented in Table 3.5, once the appropriate diacritical mark has been added to the last letter of every word in the sentence, are shown in Table 3.7.

Transliteration:  msrEAF   AltAlbu   dhba
POS tags:         NuAj     NuCnNm    VePe
Translation: "The student has gone quickly"

Table 3.7: Partially-vocalised Arabic sentence and its correct POS tags
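The following minimal Python sketch illustrates how the last diacritical mark can prune the candidate tags of an ambiguous word such as dhb. It is an illustration only, not the AMT rules themselves (those are described in Chapter 5): the candidate list is hypothetical, and the two pruning conditions simply restate the observations made in this chapter (a final kasra or nunation mark never appears on a verb, while the final fatha on dhb selects the perfect-verb reading).

FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
TANWEEN = "\u064B\u064C\u064D"
MARKS = FATHA + DAMMA + KASRA + TANWEEN

# Hypothetical ambiguity-class lexicon: dhb is either "has gone" (VePe) or "gold" (NuCn).
CANDIDATES = {"ذهب": ["VePe", "NuCn"]}

def prune_by_last_mark(word):
    base = word.rstrip(MARKS)              # strip the final diacritic(s)
    last = word[-1]
    tags = list(CANDIDATES.get(base, []))
    if last == FATHA:                      # fatha on dhb selects the verb reading
        return [t for t in tags if t.startswith("Ve")]
    if last == KASRA or last in TANWEEN:   # kasra / nunation never occur on verbs
        return [t for t in tags if t.startswith("Nu")]
    return tags

print(prune_by_last_mark("ذهب" + FATHA))   # ['VePe']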

3.4 Arabic Major grammatical part-of-Speech

A word can be defined as something that is uttered, intelligible, and has a full meaning [69]. According to Arab grammarians, words in Arabic are classified into three part-of-speech categories: verb, noun and particle. Each category has its meaning and its recognisable signs, as described below.

3.4.1 Verb

The category of verb is defined as a word denoting an action, and it may be combined with a particle [134]. In Arabic, traditionally, two verb forms are recognised: the perfect (past) and the imperfect (present). The third form, the imperative, has been considered a variant of the imperfective by Arab grammarians. Each form has its distinguishing signs. Furthermore, an Arabic verb has a temporal aspect inherent in it [69].

• Perfect Verb
The perfect verb indicates a state or a fact in the past [76]. It follows the pattern of the root (ground form) 12 fEla, "do". For example, the root كتب, ktba, "wrote", has the basic meaning of writing. It can be suffixed with many letters, for instance the letter ت, taa. The suffix adds inflectional features to the word, such as person, gender, number and mood. For example, the words ktbtu, "I wrote" (first person, masculine), ktbta, "you wrote" (second person, masculine), ktbtx, "she wrote" (third person, feminine) and ktbti, "you wrote" (second person, feminine).

The above example shows that adding the diacritical mark to the last letter of the word helps not only in determining the lexical category of the word, but also in defining its inflectional features.

12 The ground form and the derived forms are described in more detail in Chapter 5.

• Imperfect Verb
The imperfect verb expresses an action still unfinished at the time to which reference is being made [76]. It can be prefixed with one of the following four letters (called the letters of the present): أ, ن, ي, ت. For example, the words Aktbu, "I write", yktbu, "he writes", nktbu, "we write", and tktbu, "she writes". In addition, the imperfect verb can be preceded by a particle; for example, ln yktba, "he will not write".


• Imperative Verb
The imperative verb indicates an action demanded to be carried out in the future [76]. It always comes in the second person; for example, the word 'aktbx, "write!". Like the perfect and imperfect verbs, the imperative verb can be suffixed with the letters yaa, Alif, nuun and waaw to represent the inflectional features of the word (see Table 3.8).

Arabic Word                     Inflectional features
'aktbx,  "you (write!)"         second person, singular, masculine
'aktbyx, "you (write!)"         second person, singular, feminine
'aktbA,  "you (write!)"         second person, dual, masculine/feminine
'aktbwA, "you (write!)"         second person, plural, masculine
'aktbna, "you (write!)"         second person, plural, feminine

Table 3.8: Samples of imperative verbs and their inflectional features
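The suffix-to-feature mapping of Table 3.8 can be written down directly, as in the minimal Python sketch below. It is an illustration, not the AMT implementation; the suffixes are given in transliteration, and the dual form, which Table 3.8 lists as masculine/feminine, is labelled as such.

# Imperative suffixes (transliterated) and the features they signal, per Table 3.8.
IMPERATIVE_SUFFIXES = {
    "":   ("masculine", "singular"),          # 'aktbx
    "y":  ("feminine", "singular"),           # 'aktbyx
    "A":  ("masculine/feminine", "dual"),     # 'aktbA
    "wA": ("masculine", "plural"),            # 'aktbwA
    "n":  ("feminine", "plural"),             # 'aktbna
}

def imperative_features(suffix):
    gender, number = IMPERATIVE_SUFFIXES[suffix]
    return {"person": "second", "gender": gender, "number": number}

print(imperative_features("wA"))
# {'person': 'second', 'gender': 'masculine', 'number': 'plural'}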

3.4.2 Noun

The category of noun is defined as a word denoting an essence and may be combined

with an article [134]. In Arabic, a noun has no temporal aspect. As Arab grammarians

described, a noun has a set of signs that are used to distinguish it from verbs and

particles [69]. The list below describes these signs:


1. Kasra mark
A noun can receive a kasra vowel mark when it is in the genitive case. In Arabic, words which belong to the verb category never receive a kasra mark. For example, mktbi, "office".

2. Nunation mark
In Arabic, neither the verb nor the particle receives any nunation mark. A nunation mark appears only on the final letter of an Arabic word which belongs to the noun category, and it indicates that the word is indefinite. For example, ktAbun, "a book".

3. Vocatives
A noun in Arabic may be placed in the vocative position if it follows a vocative particle. For example, yA Hasan.

4. Definition by the article Al ("the" in English)
A noun in Arabic is definite when it begins with the article Al. For example, AlkAtbu, "the writer".

However, it is important to draw attention here to the fact that it is not necessary to find one or all of these signs to define a word as a noun. For example, in the sentence meaning "the computer teacher wrote the lesson", the word for "teacher" is a noun, yet none of the above signs is present; in this case the pattern of the word is used to distinguish it [69].
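As a simple illustration of how these surface signs could be checked, the following minimal Python sketch applies them to a transliterated word and the word preceding it. It is not part of the AMT tagger; the transliteration conventions follow the examples above, and an empty result simply means falling back to the word pattern, as the text explains.

def noun_signs(word, previous_word=None):
    """Collect the noun signs listed above that are visible on a transliterated word."""
    signs = []
    if word.endswith(("un", "an", "in")):       # nunation, e.g. ktAbun
        signs.append("nunation (indefinite)")
    elif word.endswith("i"):                    # final kasra, e.g. mktbi
        signs.append("kasra (genitive case)")
    if previous_word == "yA":                   # vocative particle yA
        signs.append("vocative position")
    if word.startswith("Al"):                   # definite article, e.g. AlkAtbu
        signs.append("definite article Al")
    return signs                                # empty: fall back to the word pattern

print(noun_signs("AlkAtbu"))                    # ['definite article Al']
print(noun_signs("ktAbun"))                     # ['nunation (indefinite)']
print(noun_signs("Hsn", previous_word="yA"))    # ['vocative position']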

3.4.3 Particle

The category of particle includes the remaining words. Particles are used to assist other words in their functions in the sentence [134]. In Arabic, a particle does not have a meaning without being attached to a noun or a verb. Furthermore, particles do not accept any of the signs that distinguish nouns and verbs [69]. Examples of Arabic particles are fy, "in", mn, "from", and En, "about".

Regarding Hebrew, another member of the Semitic language family, Griess states that:

"Hebrew shows similarity to Arabic in terms of its grammatical constituents of verbs, nouns, and particles. The Hebrew nouns can certainly be in the genitive position, mimated (instead of nunated), defined by ה (instead of Al), and be predicated in the same way as in Arabic. Nevertheless, when contrasted to Arabic, Hebrew enjoys a less complicated particle system" ([69], p. 24).

3.5 Arabic Grammatical System


There are two main categories of grammatical analysis in Arabic (see Figure 3.2): morphology and syntax. The former is the study of the form of the word, while the latter is the grammatical arrangement of words in the sentence. Arabic morphology, in turn, has two subcategories: derivational (how words are formed) and inflectional (how words interact with syntax, e.g. singular, dual and plural) [120].

Figure 3.2: The Arabic grammatical system (morphology, with its derivational and inflectional subcategories, and syntax)


3.5.1 Morphology System

In Arabic morphology, word formation is based on a root [138]. Many affixes can be attached to the root to form Arabic words. Arabic morphology consists of a system of consonantal roots which interlock with other consonants and vowels to form word stems. The stem is formed by substituting the characters of the root into certain verb forms [120].

A great number of other forms can be derived from the ground form (root) by inserting a long vowel, lengthening the medial letter of the root, and/or adding consonantal prefixes, to produce a new word with a new meaning that still shares the basic meaning of the root [138]. For example, the root ktb has the basic meaning of writing. The root may be conjugated in many forms 13. Samples of the words that can be formed and derived from the root ktb are shown in Tables 3.9, 3.10 and 3.11.

Transliteration   Translation
ktba              he wrote
ktbwA             they wrote
ktbtp             she wrote
ktbnA             we wrote
ktbtu             I wrote
ktbta             you wrote

Table 3.9: Samples of past tense (perfect) verb forms

Arabic words are modified not only by number, person, gender and tense, but also by case and mood, and definiteness and indefiniteness [22]. According to Arab grammarians, from every verb a verbal noun (infinitive), a noun of time, an adjective noun, a noun of place, a diminutive noun, an instrument noun, a present (active) participle and a past (passive) participle may be derived [120].

13 For more information: http://wahiduddin.net/words/arabic_glossary.htm


Transliteration   Translation
yktbu             he writes
yktbwna           they write
tktbu             she writes
nktbu             we write
Ouktubx           write!

Table 3.10: Samples of present (imperfect) and imperative verb forms

Transliteration   Translation
kAtbp             writer
mktwbp            letter
ktAbun            book
mktbp             office
kutybun           booklet

Table 3.11: Samples of additional forms, such as verbal, diminutive and adjective nouns, created from the same simple root ktb
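The root-and-pattern derivation illustrated in Tables 3.9-3.11 can be made concrete with the following minimal Python sketch (not the AMT pattern lexicon): the letters f, E and l in a pattern stand for the three root consonants, and every other letter of the pattern is copied through. The patterns and glosses shown are taken from the examples in this chapter, while the substitution scheme itself is a simplification that ignores short vowels.

def apply_pattern(root, pattern):
    """Interleave the radicals of a triliteral root into a pattern template."""
    assert len(root) == 3
    mapping = {"f": root[0], "E": root[1], "l": root[2]}
    return "".join(mapping.get(ch, ch) for ch in pattern)

ROOT = "ktb"    # basic meaning: writing
for pattern, example in [("fAEl",  "kAtb, cf. kAtbp 'writer'"),
                         ("mfEl",  "mktb, cf. mktbp 'office'"),
                         ("mfEwl", "mktwb, cf. mktwbp 'letter'")]:
    print(apply_pattern(ROOT, pattern), "-", example)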

3.5.2 Syntax System

The syntax system in Arabic refers to the grammatical arrangement of words. As Arab grammarians describe, there are two types of sentences 14: verbal and nominal sentences.

• Verbal sentence
A verbal sentence is simply one which begins with a verb followed by a subject. The verb in a verbal sentence is always in the singular form, while the subject may be singular, dual or plural, as in the following sentences:

14 For more information: http://www.multimediaquran.com/quran/arabic/grammar/sentence.html


1. كتب الطالب الدرس (ktb AlTAlb Aldrs)
2. كتب الطالبان الدرس (ktb AlTAlbAn Aldrs)
3. كتب الطلاب الدرس (ktb AlTlAb Aldrs)

The above sentences may be translated literally into English as "wrote the student(s) the lesson", but they really mean "the student(s) wrote the lesson". The subject in sentences 1, 2 and 3 (AlTAlb, AlTAlbAn, AlTlAb) is singular, dual and plural respectively, while the verb ktb, "wrote", is always in the singular form.

• Nominal sentence
A nominal sentence is one which begins with a noun or subject. The verb in an Arabic nominal sentence must agree with the subject in number and gender, as shown in the following sentences:

1. الطالب كتب الدرس (AlTAlb ktb Aldrs)
2. الطالبان كتبا الدرس (AlTAlbAn ktbA Aldrs)
3. الطلاب كتبوا الدرس (AlTlAb ktbwA Aldrs)

In the above nominal sentences, "the student(s) wrote the lesson", the verb (ktb, ktbA, ktbwA) changes to agree with the subject in number and gender.

The above two types of sentences, VSO 15 and SVO respectively, are viewed as being independent, and neither of them is derived from the other. However, the Arab grammarians assumed that the subject never precedes its verb, and took VSO as the underlying word order for Arabic [19].

15 V = Verb, S = Subject, O = Object


3.6 Chapter Summary

This chapter briefly presented an overview of the Arabic language and its script. The diacritic feature and its importance in reducing lexical ambiguity and providing more semantic information for the words of a text were also addressed. Arabic, like the other Semitic languages, is based on the fact that words are derived morphologically from roots. Many words are derived with a new meaning that still shares the basic meaning of the root. The application to the root of a large number of morphological patterns determines the categorical status of the resulting word.

All Arabic words can theoretically be reduced to roots. To deduce a root from a pattern, and to decide which pattern has been imposed on the root, is a prerequisite skill for using an Arabic dictionary. According to Arab grammarians, there are three major parts-of-speech: verb, noun and particle. Arabic not only has a complex morphological system but also exhibits a highly inflectional system. In the next chapter, the Arabic inflectional features are described in more detail, alongside the tag set, which is considered a prerequisite step toward developing a tagger system.

Chapter 4

Tag set Design

Objectives

• To define the tag set design criteria.

• To describe the Arabic inflectional features.

• To explain the developed Arabic tag set hierarchy and design.

4.1 Tag set design criteria

Atwell [30] presented a number of criteria to take into account when developing a POS tag set. These criteria have been taken into account in developing our tag set. The list below summarises these criteria in more detail:

1. Mnemonic tag names
This criterion is concerned with the name of the tag. The tag name must be chosen in such a way that it is easy for the user to remember the classes of the text. Since producing a tagged corpus in which the text has been enriched with linguistic information for use in many NLP applications is the main anticipated outcome of the AMT tagger presented in this work, the tag names have been chosen to help linguists and NLP developers remember the lexical class of each word. For example, Ve for verb, Nu for noun and Pr for particle.

2. Underlying linguistic theory
The tag set developer should take into account that the tag set should cover aspects of the theory of the language and its characteristics (e.g. inflectional features). The tag set presented in this chapter follows the Arabic grammatical system and is based upon the three main POS classes (verb, noun, particle); these tags are enriched with inflectional features [24].

3. Classification by form or function
Usually, the lexical classes are defined in terms of paradigmatic forms (a representative set of the inflections of a noun, verb, etc.) and syntagmatic functions (the syntactic function of the words). Since the short vowels and other diacritical marks are available in our testing corpus, these vowels can encode the grammatical class or feature information [30]. This criterion was taken into account during the course of developing our tag set.

4. Idiosyncratic words
Arabic, like any other language, has a number of words with special idiosyncratic behaviour. These words do not have patterns to follow, such as the words belonging to the particle class. Similarly, the English language has a number of words with special idiosyncratic behaviour which do not fit into traditional parts of speech. For example, the Brown and LOB tag sets analysed "a" with the article tag AT, but the UPenn tag set analysed it as a determiner, DT. Our tag set analyses such words based on their roles in the text, following the way Arab grammarians classified them.


5. Categorisation problem
The vowels (last diacritical mark) in our testing corpus add more linguistic information and reduce the ambiguity in categorising words. Most tags in the developed tag set are detailed tags, and each tag is defined clearly and unambiguously. We considered each main POS class or subclass as a unique tag, so that all the words in the testing corpus can be tagged consistently.

6. Tokenisation issues: what counts as a word?
Arabic text, like English text, needs a tokenisation process, which is responsible for scanning an untagged input text and identifying words, punctuation marks, numbers and other marks. Some words need a combined tag. For example, the word wyktbu, "and he writes", has the tag PrCo+VePiMaSnThDc. This issue has been taken into account when developing our tag set.

7. Multi-word lexical items
The Arabic language has very few idiomatic phrases (multi-word lexical items). They may appear in some proper nouns (e.g. Ala'a Alddin), but each is treated as one word and has one tag.

8. Target users and/or application
Since one of the aims of the tagged corpus is its use in developing educational applications for teaching purposes, the tags in the developed tag set have been designed to achieve maximum target use and customer satisfaction as well.

9. Availability and/or adaptability of tagger software
Arabic, like other Semitic languages, has a morphological system based on a root (usually consisting of three consonants or letters) and a pattern structure. The main technique in this work is based on the pattern of the word. It is interesting to note that this technique may also be valid for other Semitic languages, especially Hebrew, since, like Arabic, Hebrew has a morphological system based on a root and a pattern structure on the one hand, and the diacritic feature on the other. However, the tag set presented in this work is based upon the three main POS classes, their sub-classes and inflectional morphology. Thus, the guiding principle was compatibility with the Arabic grammar tradition [30].

10. Adherence to standards
The EAGLES guidelines outline a set of features for tag sets 1; these guidelines were designed to help standardise tag sets for what were then the official languages of the European Union. EAGLES tags are defined as sets of morpho-syntactic attribute-value pairs (e.g. Gender is an attribute that can have the values Masculine, Feminine or Neuter) [74]. Arabic has its own structure, features (e.g. diacritics) and linguistic attributes (e.g. dual number and jussive mood), which make this language different from the languages for which EAGLES was designed [89]. In addition, there are other differences in the order of the constituents within the sentence; for example, in Arabic, adjectives follow the noun which they modify [58]. Despite the fact that some classes from traditional Arabic linguistics and grammar are not compatible with the EAGLES guidelines, some of the English translations of class and feature names used in the developed tag set were drawn from standard terminology found in the EAGLES guidelines [30].

11. Genre, register or type of language
This criterion is not fully applied in our developed tag set. The tags were developed to cover written Arabic text. The corpus in this work is a partially-vocalised Arabic corpus containing written Arabic text; it does not contain spoken text.

1 http://www.ilc.cnr.it/EAGLES96/annotate/

12. Degree of delicacy of the tag set
The tags were developed with a good level of granularity, covering all the sub-classes of the three main POS classes used in Arabic grammar. Each tag is enriched with inflectional features, which should help linguists to develop, for example, a robust educational system for learning Arabic. The developed tag set contains 161 2 detailed tags (101 nouns, 50 verbs, 9 particles, 1 punctuation), as well as 28 different general POS tags. The Arabic language is characterised by a rich and extensive morphological system as well as an inflectional system, so it is natural that the tag set should be richly articulated, providing distinct codings for all classes of Arabic words. At the same time, as Elworthy [59] points out, if all of the syntactic variations which are realised in the inflectional system of a highly inflected language such as Arabic or Hungarian were represented in the tag set, there would be a huge number of tags, and it would be practically impossible to implement or train a simple tagger.

4.2 Arabic Inflectional Features

Grammatically, inflection is the marking of a word in written text to reflect grammatical information, such as gender, tense, number or person 3. Arabic is a highly inflected language [97] and exhibits a rich inflectional morphology system. Inflectional morphology is used to express grammatical relations between the words in a sentence [16]. The list below describes the Arabic inflectional features in more detail:

2 Detailed tags include inflectional features, while general tags represent the name of the main class and its sub-class without inflectional features.
3 For more information: http://en.wikipedia.org/wiki/Inflection


4.2.1 Gender

Nouns and verbs in Arabic are morphologically marked for the inflectional feature gender. Arabic has two genders: masculine and feminine. As in English, male persons are masculine and female persons are feminine, but things may be either masculine or feminine. In English, gender is indicated in the third person singular personal pronouns: the feminine "she", the masculine "he" and the neuter "it". The personal pronoun "it" can refer to certain creatures of either sex (baby, cat) and to sexless things (beauty, book) [21].

In Arabic, a word such as fryqun, "team", may refer to either (masculine or feminine) gender. Nouns in Arabic may be recognised as feminine singular nouns by their grammatical form; for example, nouns ending in ة (taa marbuta), such as jntpun, "garden", or ending in اء, such as SHrA', "desert". Nouns may also be recognised as feminine plural nouns, which are formed by adding the suffix ات, such as jmylAtun, "beautiful women". Masculine plural nouns may be recognised by the suffix ون or ين, such as mdrswna or mdrsyna, "teachers".

As for the Arabic verb, since a verb in Arabic is a combination of a verb and a pronominal suffix or prefix, these pronominal affixes represent inflectional features such as gender, number, person and mood. In general, gender terms and forms in Arabic, as in English, do not always refer to biological gender [21, 61]. However, the inflectional feature gender in our tag set has been classified into three genders: masculine, feminine and neuter.


4.2.2 Number

In Arabic, number is an inflectional feature governing nouns and verbs. Unlike English, Arabic has three forms of number: singular, dual and plural. Singular denotes only one, dual denotes two individuals of a class or a pair of anything, and plural denotes three or more [21]. The dual is formed by adding the dual suffix ان or ين. For example, the words wldun, wldAni or wldyni, and OwlAdun, which mean "a boy", "two boys" and "boys", indicate singular, dual and plural respectively.

4.2.3 Person

In Arabic, verbs and personal pronouns (only) inflect for three persons: the speaker (first person), the person spoken to (second person), and the person spoken about (third person). The first person in the singular denotes the speaker; in the plural it denotes the speaker plus anybody else, one or more. The second person denotes the person or persons spoken to, and the third person denotes those other than the speaker or those spoken to [132]. For example, the personal pronouns AnA, "I", Anta, "you", and hwa, "he", indicate the first, second and third person respectively.

4.2.4 Mood

Arabic verbs have three moods: indicative, subjunctive and jussive (imperative). The mood markers are often short vowel marks placed at the end of the word, such as the fatha, damma and kasra or the sukun mark; for example, damma /u/ for the indicative and fatha /a/ for the subjunctive. On the other hand, the mood may be determined by particles which govern or require a certain mood [120]. For example, the negative particle lm requires the jussive mood on the following verb, as in lm yktbx, "does not write"; the mood of the verb yktbx is jussive.


4.2.5 Case

In Arabic the term "case" refers to inflectional marking. Arabic nouns have three cases: nominative, accusative and genitive. They indicate the syntactic function of the word and its relationship with other words in the sentence (e.g. singular, dual, masculine plural and feminine plural forms take special case endings) [120]. These cases are indicated by short vowel marks placed at the end of the word (suffixes). For example, the words Aldrsu, Aldrsa and Aldrsi, which all mean "the lesson", indicate the nominative, accusative and genitive cases respectively.

4.2.6 State

Arabic nouns are marked for definiteness or indefiniteness. In Arabic the definite article Al is used as a prefix to indicate definiteness; it is not an independent word like "the" in English. Nunation (tanween) marks are used as suffixes to indicate indefiniteness [120]. For example, the words AlktAbu, "the book", and ktAbun, "a book", indicate definiteness and indefiniteness respectively.

4.3 ARBTAGS-The developed Tag set

4.3.1 ARBTAGS Hierarchy

The tag set hierarchy presented in this work follows the tradition of Arabic grammar. Most Arabic grammar dictionaries, such as the dictionary of Arabic grammar [57], classify Arabic words as shown in Chapter 2, Figure 2.3.

As Arab grammarians have described, each Arabic word belongs to one of the three main classes: verb, noun or particle.

1. Verb

In Arabic grammar, the main class (verb) has three sub-classes, shown in Figure 4.1 (see also Figure 2.3 in Chapter 2). These sub-classes are classified according to the tenses of the verb in Arabic.

Figure 4.1: Categories of the Arabic verb (Perfect, Imperfect, Imperative)

• The perfect (past), known in Arabic as Almadi.

• The imperfect (present), known as AlmDArE.

• The imperative (future), known as AlAmr.

Practically all Semitic scholars agree that the tense of the verb does not express the idea of time, but rather the idea of a "finished act" or an "unfinished act"; if the act is incomplete or unfinished, the verb is imperfect. However, Arab grammarians look at these tenses as expressing the idea of time and not the idea of finished or unfinished acts. In Arabic, to form the imperative verb, knowledge of the imperfect verb is necessary, because the imperative verb is a form of the imperfect [61].

2. Noun
A noun in Arabic indicates a meaning by itself without being connected with the

notion of time and refers to a person, place, thing and event [96].


Figure 4.2: Categories of the Arabic noun (inflected nouns, divided into derivative and primitive, and uninflected nouns; the sub-classes shown include Proper, Common, Verbal, Adjective, Relative, Diminutive, Instrument, Noun of Place, Noun of Time, Personal Pronoun, Conjunctive, Conditional, Demonstrative, Interrogative, Adverbial and Numeral nouns)

Grammatically, nouns in Arabic are of two kinds: inflected nouns, which are affected by the inflectional features (such as adjective, verbal and relative nouns), and uninflected nouns, which always appear in one case and cannot be affected by the inflectional features (such as personal, conjunctive and conditional nouns).

The inflected nouns are also of two kinds: primitive (not derived from a verb or noun), such as rjlun, "a man", and Osdun, "a lion", and derivative (derived from a verb or noun), such as mktbpun, "library", derived from the verb ktb [61].

In Arabic, nouns can be categorised into the following sub-classes: Common, Proper, Verbal, Relative, Noun of Time, Adjective, Diminutive, Instrument, Noun of Place, Conjunctive, Interrogative, Pronoun, Adverbial, Numeral, Demonstrative and Conditional. The list below summarises these sub-classes in a little more detail:

• Common Noun
The largest sub-class of the main class (noun) in Arabic is the common noun. These nouns may or may not be derived from the ground verb (root), and may or may not include the definite article Al to indicate definiteness [120]. For example, the words Alshjrpa, "the tree", and shjrpun, "a tree".

• Proper Noun
As in English, Arabic proper nouns include the names of people, places, cities, countries and geographical features. These nouns come from a variety of sources; many of them are Arabic words, but some are non-Arabic (foreign words). They may or may not include the definite article Al [120]. For example, lndn, "London", and AlqAhrpu, "Cairo".

• Verbal (Infinitive) Noun
The verbal nouns are derived from verb forms 4 and follow regular patterns. For example, the words tdrysun, "instruction", tsAmHun, "tolerance", and mEtqdun, "belief", follow the patterns tfEylun, tfAElun and mftElun respectively [61].

4 Verb forms are described in Chapter 5, Section 5.3.

• Relative Noun
Relative nouns are formed from other nouns by adding the suffix ي (for the masculine) or ية (for the feminine) [61]. For example, the words shklyun, "formal", and Ardnypun, "Jordanian (fem.)".

• Noun of Time
In Arabic, certain patterns are used to denote the noun of time; they refer to the time at which the activity specified by the verb occurs [57]. For example, the word mwEdun, "appointment", follows the pattern mfElun.

• Adjective Noun
An adjective in Arabic is placed after the noun it qualifies and in most cases agrees with it in number and gender. The present participle and the past participle are also used as adjectives in the Arabic language [61]; for example, the words mtkb~run, "haughty", and mED~mun, "glorified". Adjective words, like many other words in Arabic, are derived from the ground verb, and each adjective word follows a certain pattern. For example, the word SAlHun, "good man", follows the pattern fAElun.


• Diminutive Noun
Arabic has a few diminutive forms of nouns which are actually used. They are formed from triliteral nouns (nouns with three consonants). For example, the word jbylun, "a little mound", follows the diminutive pattern fEylun [61].

• Instrument Nouns
Nouns of instrument in Arabic are of two kinds: those derived from a ground verb (root), such as the word mftAHun, "a key", which follows the pattern mfEAl and is derived from the verb ftHa, "he opened"; and those not derived from a ground verb, such as skynun, "knife", or jrsun, "bell" [61].

• Interrogative Noun
Usually, the interrogative words (question words) are used at the beginning of an Arabic sentence [76]. For example, the words kyfa, Ayna, mtY, mAdhA and km are equivalent to "how?", "where?", "when?", "what?" and "how many/much?" in English, respectively.

• Pronoun
The pronoun sub-class in our tag set represents the personal pronouns 5. They refer to persons or entities. The pronoun class in Arabic may appear as separate, independent words (subject pronouns) or take the form of suffixes (object and possessive pronouns).

5 In some Arabic grammar dictionaries, the demonstrative, conjunctive and interrogative nouns also fall under the pronoun class. However, in our tag set each sub-class has a different tag, to distinguish the words belonging to these classes more precisely (see [57]).


Examples of separate words are AnA, Anta, nHn, hwa, hya and hm, which are equivalent to "I", "you", "we", "he", "she" and "they" in English respectively. In contrast, English has fewer classes of personal pronouns than Arabic, because the personal pronouns in Arabic show more distinctions in inflectional features, such as gender, number and person [120]. Table 4.1 shows the differences in the gender and number of the persons between Arabic and English. The table shows that for the Arabic first person there is no gender distinction; for the second person there are five forms of "you"; and for the third person there are six verbal distinctions and five pronoun distinctions. Thus, the total number of personal pronouns in Arabic is twelve, as opposed to the eight of English.

                 English               Arabic
First person     I, we                 AnA "I"; nHn "we" (no gender distinction)
Second person    you                   Anta (masc. sing.), Anti (fem. sing.), AntmA (dual), Antm (masc. pl.), Antn (fem. pl.)
Third person     he, she, it, they     hwa (masc. sing.), hy (fem. sing.), hmA (dual), hm (masc. pl.), hn (fem. pl.)

Table 4.1: Personal pronouns in Arabic and English


• Adverbial Noun
Grammatically, adverbs may belong to the noun class or the particle class. Most adverbs in Arabic are words used to answer the questions "when?", "where?" and "how?", such as the words Amsi, "yesterday", shrqAan, "eastward", and DAHkAan, "laughingly". On the other hand, some adverbs are used as particles, such as the words tHta, "under", fwqa, "over", and qbla, "before". However, in our tag set we use one tag to represent the adverbial words, which fall under the noun sub-classes [61].

• Demonstrative Noun
The demonstrative words in Arabic are determiners used with other nouns, or sometimes instead of nouns, to show either distance from or proximity to the speaker. For example, the words hdhA, dhlka, hWlA' and Awl'ka are equivalent to "this", "that", "these" and "those" in English respectively. Arabic has a rich variety of demonstrative words, which inflect for gender, number and case [120]. However, the demonstrative words in Arabic do not have a pattern to follow.

• Conditional Noun
In Arabic, the conditional noun is used between two sentences to show that the second sentence depends on the first [57]. For instance, the words mhma, klma and lmma mean "whatever", "whenever" and "when" respectively.


• Noun of place
The Arabic language has specific derived patterns which are used to denote the noun of place; these patterns refer to the place where the activity specified by the verb occurs [120]. For example, the words mrkza, "center", and mdrspa, "school", follow the patterns mfEla and mfElpa respectively.

• Conjunctive Noun
The conjunctive words in Arabic relate an element in a subordinate relative clause to a noun or a noun phrase in the main clause of the sentence. They may be definite or indefinite, and they are marked for gender and number [120]. Examples are the words equivalent to "who, which", "who, whoever", "that which, whatever" and "(he) who, whoever" in English.

• Numeral Noun
Arabic has a complex numeral system; it is one of the complicated features of written Arabic. Numeral nouns in Arabic are of two types. The first type is the ordinal numbers, which usually follow the noun that they modify and agree with it in gender, but sometimes precede it; for example, AlmWtmr AlthAny, "the second conference", and Eshrwna ywmA, "twenty days". The second type is the cardinal numbers, which are rather difficult to categorise due to some characteristics of the Arabic language [120]; for example, Athnani, "two", and AHdY Almdn, "one of the cities". Numeral nouns also inflect for gender, number and case [76].


3. Particle
Particles are one of the three main POS classes in the Arabic language. They are of two kinds, formation and signification, as shown in Figure 4.3. Formation particles are particles which constitute the characters of an Arabic word, while signification particles are used with verbs and nouns; they signal the mood of a verb or the case of a noun [120]. For example, the particle lm indicates the jussive mood, while the particle meaning "in order to" indicates the subjunctive.
Figure 4.3: Categories of the Arabic particle (Formation; Signification: Preposition, Vocative, Conjunction, Exception, Negation, Subjunctive, Jussive/Elision)

4.3.2 Tag design of ARBTAGS

ARBTAGS tags have been built based on the following main formula:

[T, S, G, N, P, M, C, F], where:

T, represents the name of each main POS class in Arabic. Throughout this section, the abbreviation symbols representing the name of each main POS class and sub-class, as well as the possible values of the inflectional features used to represent the tags in our tag set, are shown between square brackets in each table. Table 4.2 shows the abbreviation symbols of the main POS classes in Arabic.

Verb [Ve]   |   Noun [Nu]   |   Particle [Pr]

Table 4.2: Abbreviation symbols of the main POS classes

S, represents the sub-classes of each main POS class in Arabic. The abbreviation symbols of the sub-classes of the verb, noun and particle classes are shown in Tables 4.3, 4.4 and 4.5 respectively.

Perfect [Pe]   |   Imperfect [Pi]   |   Imperative [Pm]

Table 4.3: Abbreviation symbols of the sub-classes of the class verb

Proper [Po]               Common [Cn]           Adjective [Aj]
Verbal (Infinitive) [If]  Relative [Re]         Diminutive [Dm]
Instrument [Is]           Noun of Place [Pn]    Noun of Time [Tn]
Pronoun [Ps]              Conjunctive [Cv]      Conditional [Cd]
Demonstrative [De]        Interrogative [In]    Adverb [Ad]
Numeral noun [Nn]

Table 4.4: Abbreviation symbols of the sub-classes of the class noun

Preposition [Pp]      Vocative [Vo]     Exception [Ex]
Conjunction [Co]      Negation [An]     Subjunctive [Sb]
Jussive/Elision [Jv]

Table 4.5: Abbreviation symbols of the sub-classes of the class particle

G, represents the inflectional feature (Gender), used to inflect noun and verb sub-

classes. The possible values for the inflectional feature gender are shown in Table 4.6.


Masculine [Ma] Feminine [Fe] Neuter [Ne]

Table 4.6: The possible value of the inflectional feature (Gender)

N, represents the inflectional feature (Number), used to inflect noun and verb sub-

classes. The possible values of the inflectional feature number can be seen in Table 4.7.

Singular [Sn]   |   Dual [Du]   |   Plural [Pl]

Table 4.7: The possible values of the inflectional feature (Number).

P, represents the inflectional feature (Person), used to inflect the verb sub-classes.

Table 4.8 shows the possible values of the inflectional feature person.

First [Fs]   |   Second [Sc]   |   Third [Th]

Table 4.8: The possible values of the inflectional feature (Person).

M, represents the inflectional feature (Mood), used to inflect the verb sub-classes. Table 4.9 shows the possible values of the inflectional feature mood.

Indicative [Dc]   |   Subjunctive [Sj]   |   Jussive [Js]

Table 4.9: The possible values of the inflectional feature (Mood).

C, represents the inflectional feature (Case) and it is used to inflect the noun sub-

classes. The possible value of the inflectional feature case can be seen in Table 4.10.

Nominative [Nm]   |   Accusative [Ac]   |   Genitive [Ge]

Table 4.10: The possible values of the inflectional feature (Case).


F, represents the inflectional feature (State), used to inflect the noun sub-classes. Table 4.11 shows the possible values of the inflectional feature state.

Definite [Df]   |   Indefinite [Id]

Table 4.11: The possible values of the inflectional feature (State).

However, in Arabic, the first two main POS classes, verb and noun, can inflect grammatically in the system of inflectional morphology, while the third one (particle) cannot [16]. For example, a verb can inflect for person, number, gender and mood, as shown in Figure 4.4, while the inflectional features for the noun class can be seen in Figure 4.5.

Figure 4.4: Verb sub-classes (Perfect, Imperfect, Imperative) and their inflectional features (gender, number, person, mood)

4.3.3 Detailed and general tags in ARBTAGS tag set

Before describing the detailed and general tags used in the ARBTAGS tag set, let us summarise all the abbreviation symbols used in the developed tag set; these symbols can be seen in Table 4.12.

Figure 4.5: Noun sub-classes and their inflectional features (gender, number, case, state)

Figure 4.6: Particle sub-classes (Preposition, Vocative, Exception, Conjunction, Negation, Subjunctive, Jussive/Elision)

As mentioned above, ARBTAGS has 28 general tags and 161 detailed tags.

Category          Abb        Inflectional feature value   Abb
Verb              Ve         Masculine                    Ma
Noun              Nu         Feminine                     Fe
Particle          Pr         Neuter                       Ne
Perfect           Pe         Singular                     Sn
Imperfect         Pi         Plural                       Pl
Imperative        Pm         Dual                         Du
Adjective         Aj         First                        Fs
Verbal            If         Second                       Sc
Noun of Place     Pn         Third                        Th
Noun of Time      Tn         Indicative                   Dc
Demonstrative     De         Subjunctive                  Sj
Relative          Re         Jussive                      Js
Pronoun           Ps         Nominative                   Nm
Diminutive        Dm         Accusative                   Ac
Instrument        Is         Genitive                     Ge
Proper            Po         Definite                     Df
Adverb            Ad         Indefinite                   Id
Common            Cn
Interrogative     In
Conjunctive       Cv
Conditional       Cd
Numeral           Nn
Preposition       Pp
Vocative          Vo
Exception         Ex
Negation          An
Subjunctive       Sb
Jussive/Elision   Jv
Conjunction       Co
Foreign word      Fw

Table 4.12: Abbreviation symbols used in the ARBTAGS tag set

detailed tags not only represent the name of the class that the word belongs to, but also represent the inflectional features of the word.

The rationale behind developing detailed tags is twofold. The first reason is to enrich each word in the testing corpus with more linguistic information, including its inflectional features. The tagged corpus becomes more useful for linguists and NLP developers if most words are tagged with detailed tags.

The second reason is that the pattern in the pattern-based technique represents the template of the whole word: it includes not only the form of the word but also the prefixes and suffixes attached to it, and the suffixes provide the inflectional features of the word. Since each pattern is generated automatically from three lexicons (prefixes with their tags, forms with their tags and suffixes with their tags), the tag generated with each pattern is a detailed tag. General tags, on the other hand, are used when applying the lexical and contextual rules, the second technique in this work.

As an example of a detailed tag, the word yshAhdwna, "they are watching", has the detailed tag VePiMaPlThDc, which means [Imperfect verb, masculine gender, plural number, third person, indicative mood]. A general tag such as NuPo may be assigned to a word such as Rmzy, which means [Proper noun].
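To make the composition of such detailed tags concrete, the following is a minimal Python sketch (not part of the AMT implementation) that decodes a detailed tag string into its components using the two-letter abbreviations of Table 4.12; only a subset of the abbreviations is listed, and combined tags such as PrCo+... would first need the "+" handled.

    # Minimal sketch: decode an ARBTAGS detailed tag into its components.
    ABBREVIATIONS = {
        "Ve": "Verb", "Pe": "Perfect", "Pi": "Imperfect",
        "Ma": "Masculine", "Fe": "Feminine",
        "Sn": "Singular", "Pl": "Plural",
        "Fs": "First", "Th": "Third",
        "Dc": "Indicative", "Sj": "Subjunctive", "Js": "Jussive",
    }

    def decode(tag):
        # A detailed tag is a sequence of two-letter symbols: class, sub-class,
        # then the inflectional feature values.
        return [ABBREVIATIONS.get(tag[i:i + 2], tag[i:i + 2]) for i in range(0, len(tag), 2)]

    print(decode("VePiMaPlThDc"))
    # -> ['Verb', 'Imperfect', 'Masculine', 'Plural', 'Third', 'Indicative']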

A POS tag may be very coarse (e.g. Ve, "Verb") or very fine (e.g. VePiMaPlFsJs, "Verb, Imperfect, Masculine, Plural, First Person, Jussive"), depending on the task or application [114]. Since the main aim of the AMT system is to produce a tagged corpus, the tags were developed with a good level of granularity, where each tag is enriched with inflectional features that meet the needs of linguists and NLP developers. On the other hand, the cardinality of the POS tag set makes tagging a morphologically ambiguous, inflective language such as Arabic different from tagging a language with poor inflection such as English [78]. For example, the number of tags for perfect verbs in the ARBTAGS tag set presented in this work and in the Penn Treebank tag set for English is shown in Table 4.13. The numbers 6 vs. 81 shown in Table 4.13 illustrate the difference very clearly.

Penn Treebank tag set (English)            Arabic tag set (ARBTAGS)
Verbs: VB, VBD, VBG, VBN, VBP, VBZ         For perfect verbs only [VePe]: [MaFeNe] [SnDuPl] [FsScTh] [DcSjJs]
6 tags                                     3 x 3 x 3 x 3 = 81 tags

Table 4.13: ARBTAGS tag set vs. Penn Treebank tag set
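As an illustration of the 3 x 3 x 3 x 3 = 81 figure in Table 4.13, the following short Python sketch (illustrative only, not part of the AMT system) enumerates the detailed perfect-verb tags from the four sets of inflectional feature values.

    from itertools import product

    # Enumerate the detailed perfect-verb tags of ARBTAGS from Table 4.13.
    genders = ["Ma", "Fe", "Ne"]
    numbers = ["Sn", "Du", "Pl"]
    persons = ["Fs", "Sc", "Th"]
    moods   = ["Dc", "Sj", "Js"]

    perfect_verb_tags = ["VePe" + g + n + p + m
                         for g, n, p, m in product(genders, numbers, persons, moods)]

    print(len(perfect_verb_tags))   # -> 81
    print(perfect_verb_tags[0])     # -> VePeMaSnFsDc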

The ARBTAGS general tags are shown in Table 4.14, while a sample of detailed tags can be seen in Table 4.15. The general and detailed tags are described in full, with examples, in Appendix A.1 and Appendix A.2.

Tag    Description        Tag    Description
VePe   Perfect verb       NuCd   Conditional noun
VePi   Imperfect verb     NuDe   Demonstrative noun
VePm   Imperative verb    NuIn   Interrogative noun
NuPo   Proper noun        NuAd   Adverbial noun
NuCn   Common noun        NuNn   Numeral noun
NuAj   Adjective noun     Pun    Punctuation mark
NuIf   Verbal noun        PrPp   Preposition
NuRe   Relative noun      PrVo   Vocative Particle
NuDm   Diminutive noun    PrCo   Conjunction Particle
NuIs   Instrument noun    PrEx   Exception Particle
NuPn   Noun of Place      PrAn   Negation Particle
NuTn   Noun of Time       PrSb   Subjunctive Particle
NuPs   Pronoun            PrJv   Jussive Particle
NuCv   Conjunctive noun   Fw     Foreign word

Table 4.14: ARBTAGS general tags

Tag               Description
VePeMaSnThSj      Verb, Perfect, Masculine, Singular, Third Person, Subjunctive
VePeMaSnFsDc      Verb, Perfect, Masculine, Singular, First Person, Indicative
VePeMaSnScSj      Verb, Perfect, Masculine, Singular, Second Person, Subjunctive
VePeFeSnScJs      Verb, Perfect, Feminine, Singular, Second Person, Jussive
VePeFeSnThJs      Verb, Perfect, Feminine, Singular, Third Person, Jussive
VePiMaPlFsJs      Verb, Imperfect, Masculine, Plural, First Person, Jussive
VePiMaPlFsDc      Verb, Imperfect, Masculine, Plural, First Person, Indicative
VePmMaSnScJs      Verb, Imperative, Masculine, Singular, Second Person, Jussive
VePmFeSnScJs      Verb, Imperative, Feminine, Singular, Second Person, Jussive
NuDeSnAcId        Demonstrative Noun, Singular, Accusative, Indefinite
NuDeDuGeId        Demonstrative Noun, Dual, Genitive, Indefinite
NuInId            Interrogative Noun, Indefinite
NuCvSnId          Conjunctive Noun, Singular, Indefinite
NuAdId            Adverbial Noun, Indefinite
NuNnId            Numeral Noun, Indefinite
NuAjMaSnNmId      Adjective Noun, Masculine, Singular, Nominative, Indefinite
NuAjMaSnNmDf      Adjective Noun, Masculine, Singular, Nominative, Definite
NuAjMaSnAcDf      Adjective Noun, Masculine, Singular, Accusative, Definite
NuAjMaSnGeDf      Adjective Noun, Masculine, Singular, Genitive, Definite
NuIsMaDuGeId      Instrument Noun, Masculine, Dual, Genitive, Indefinite
NuDmSnNmId        Diminutive Noun, Singular, Nominative, Indefinite
NuReMaSnNmId      Relative Noun, Masculine, Singular, Nominative, Indefinite
NuReMaDuGeDf      Relative Noun, Masculine, Dual, Genitive, Definite
NuCnMaSnNmId      Common Noun, Masculine, Singular, Nominative, Indefinite
NuCnFeSnNmId      Common Noun, Feminine, Singular, Nominative, Indefinite
NuCnMaPlGeDf      Common Noun, Masculine, Plural, Genitive, Definite
NuPsMaSnThAc      Pronoun, Masculine, Singular, Third Person, Accusative

Table 4.15: Sample of detailed tags in ARBTAGS

4.4 Chapter Summary

This chapter presented a number of criteria to take into account while developing a POS tag set. The Arabic inflectional features, such as gender, number, case, mood, person and state, were described, and the steps of our tag set design were presented. An Arabic tag set called ARBTAGS, containing 161 detailed tags and 28 general tags covering the main Arabic POS classes and sub-classes, has been compiled and introduced in this work. The developed tag set follows the Arabic grammatical system, based upon the POS classes and inflectional morphology that Arab grammarians describe.

The developed tag set differs from the tag sets previously built for Arabic; the main difference is the tag set hierarchy introduced and compiled in this chapter. Since the main aim of the AMT system is to produce a tagged corpus, the tags were developed with a good level of granularity, where each tag is enriched with inflectional features that meet the needs of linguists and NLP developers.

Chapter 5

Design and Implementation of AMT

Objectives

• To define the characteristics of the AMT tagger.

• To define the proposed approach.

• To present a description of the tagger system.

• To describe the tagging process.

5.1 AMT Characteristics

The tagger system (Arabic Morphosyntactic Tagger, AMT) presented in this work has the following characteristics:

• Lexicon Free
AMT does not require a manually tagged or untagged lexicon of Arabic words; it requires only the testing corpus. Building a generic POS tagger system without a lexicon depends on the language and the characteristics of its grammar, both the morphological and the syntactical systems of that language.


• Word Level Tagging
It is possible for the tagger system presented in this work to tag a single word regardless of its context. This is possible because (1) each word in the testing corpus carries a diacritical mark, which provides semantic information and defines the inflectional features of the word, helping to resolve any lexical ambiguity that may arise; and (2) the main technique used in this work is based on the pattern of the word instead of the word itself. Once an Arabic word is matched to its correct pattern, the correct tag is assigned to the word regardless of the context in most cases, as described in the next section.

5.2 Rule-based - the developed approach

The approach used here is rule-based. It is based on incorporating a set of linguistic rules to assign the correct tag to each word in the testing corpus. Two different techniques were used in this work: the pattern-based technique and the lexical and contextual technique. The rules in the former technique are based on the pattern of the testing word, while the rules in the latter technique are based on the character(s), affixes, the last diacritical mark, the word itself, and the surrounding words or the tags of the surrounding words.

The basic idea of the pattern-based technique is to automatically generate a lexicon of patterns instead of using a manually tagged or untagged lexicon of Arabic words for training. Section 5.3 describes the pattern-based technique in more detail. The lexical and contextual technique is used to assist the main technique by assigning the correct tag to those words not tagged by the pattern-based technique. Section 5.4 describes the lexical and contextual technique in more detail.

As mentioned in chapter 3, Arabic has a set of rules or signs described by Arab grammarians for more than 1400 years, such as the rules used to distinguish nouns from verbs and particles. It also has a set of facts and characteristics, such as the fact that each original Arabic word has a pattern and that many Arabic words follow a single pattern. Additionally, the diacritic is an important feature (chapter 3). All of these facts and characteristics were taken into account when the above techniques were built and used in this work.

5.2.1 Justification for using the rule-based approach

The AMT system presented in this work is designed to accept any partially-vocalised Arabic text as input and to produce a tagged text. The signs that indicate the category of a word in the Arabic language on the one hand, and the existence of the diacritic feature on the other hand, play a great role in reducing the lexical ambiguity of words and providing semantic information, leading to the assignment of the correct tag to each word in the testing corpus. In addition, because Semitic languages in general have a morphological system based on a root-and-pattern structure, using the pattern of the word instead of the word itself can achieve a good result in assigning the correct tag to each word in the testing corpus.

On the other hand, the statistical approach, the second main approach to POS tagging, requires a huge manually tagged corpus to calculate statistical information such as the probability of a particular word and tag co-occurring [73]. This approach may be useful when dealing with unvocalised Arabic text because, with the diacritical marks missing in this type of text, a word may have multiple POS tags. However, to achieve remarkable accuracy using a statistical approach, the manually tagged corpus used for training must be very large. Unlike English, Arabic still lacks a huge manually tagged corpus from which large amounts of training data can be extracted. For example, a training corpus of about 10,000 words, as used by Khoja [87] in her tagger for Arabic, is definitely not sufficient to cover most words in the Arabic language. In addition, the small training corpora used in statistical approaches present the problem of unknown words.

Unknown words are words that do not appear in the training corpus. Neither the testing corpus nor the training corpus has lexical information or tags for these words, so the statistical model has no way of dealing with them. If the training corpus is very small and most words in the testing corpus differ from those in the training corpus, the accuracy of the POS tagger becomes very weak.

At the same time, many POS tagger systems built for English based on the statistical approach have achieved very high accuracy. The reason behind this remarkable accuracy is the very large lexicons, containing hundreds of millions of words, used in these systems. However, as mentioned above, the AMT system presented in this work does not use a lexicon for training. Thus, the rule-based approach is the best approach to achieve the above goal, given that the testing corpus in this work is a partially-vocalised Arabic text.

5.3 Pattern-based technique - A novel technique

Much computational work on Semitic languages assumes that a word may consist of the following elements: prefixes, stem and suffixes [45,84,110,118,119]. The Arabic language has trilateral and quadrilateral verb forms. The great majority of Arabic verbs are trilateral, containing three letters: the first letter is f (faa), the second is E (ayn), and the third is l (laam). The Arab grammarians have used the trilateral verb form fEla, "do", as a paradigm (called the ground form) to discuss word formation.

From the ground form of the trilateral verb and the form of the quadrilateral verb (the quadrilateral form, fEll, is obtained by doubling the third letter of the ground form; such verbs are rare in Arabic), a great number of other forms are derived by inserting a long vowel, lengthening the medial root letter, and/or adding consonantal prefixes, producing new words with new meanings that still share the basic meaning of the root (in English, such derived words are termed stems) [138] [81]. For example, the words lEba and lAEbun mean "he played" and "player" respectively. The former represents the root and belongs to the verb class, which has the basic template form fEla. When the long vowel consonant Alif is added to the medial position of the root, a new word belonging to the noun (adjective noun) class is produced; it has the derived form fAElun and still shares the basic meaning of the root. The ground form and the other forms derived from it are shown in Table 5.1. These derived forms express various modifications of the idea conveyed by the ground form.

As Arab grammarians described, each original Arabic word has a pattern. M. Elaffendi defined the morphological pattern as:

"a template that shows how the word should be decomposed into its constituent morphemes (prefix + stem + suffix), and at the same time, marks the positions of the radicals comprising the root of the word" [107].

Form no   Transliteration   Modification of the ground form
1         fEla              The ground form (no modification)
2         fEEla             Doubling the second letter
3         fAEla             Infixing the letter A (Alif)
4         AfEla             Prefixing the letter A (Alif)
5         tfEla             Prefixing the letter t
6         tfAEla            Prefixing the letter t and infixing the letter A
7         AnfEla            Prefixing the letters A and n
8         AftEla            Prefixing the letter A and infixing the letter t
9         AfElla            Prefixing the letter A and doubling the third letter
10        AstfEla           Prefixing the letters A, s and t

Table 5.1: Derived forms from the ground form (root)

It is important to point out here that the pattern is different from the word: it has no meaning in itself, but is a template that indicates the positions of the root letters. The pattern represents the lexical category of the word and indicates its syntactic and semantic roles [107].

In this work the word "pattern" is used to represent the template of the whole word in-

cluding the prefixes, form (root+infixes) and suffixes, which are attached to the word.

The pattern in Arabic shares the word on the affixes may be added to the ground form

(root). For example, the word 0~~-" wySAjhHwna, "to shake hands" has the pat-

tern 0~ ~-', "wyfAElwna" as shown in figure 5 .1. The root of the word 0~-' is
c!'-, SjH which has the fonn Jd,fEl, while the whole pattern is 0~~~' wyfAE1-
wna.

[Figure: decomposition of the word wySAfHwna and its pattern wyfAElwna into prefixes, root/form letters, suffixes and the last diacritical mark]

Figure 5.1: the word wySAfHwna and its pattern wyfAElwna

The existence of the last diacritical mark in both the pattern and the word is very important. Without it, it becomes very difficult in most cases to determine the lexical category and to define the inflectional features of the word. For example, the word ghAfl has the pattern fAEl, as shown in Figure 5.2, but the word still has ambiguity regarding its lexical category and semantic meaning because the last diacritical mark is missing from both the pattern and the word. It may be ghAfla, if the last diacritical mark is the fatha mark, in which case it means "take advantage of someone's inattention" and belongs to the verb class; or ghAflun, if the last diacritical mark is the nunation mark (tanween damm), which means "inattentive" and belongs to the noun class. Thus, as long as the last diacritical mark is missing from the pattern as well as the word, the lexical ambiguity remains apparent.

[Figure: the word ghAfl aligned with its pattern fAEl]

Figure 5.2: the word ghAfl and its pattern fAEl

In the Arabic language, no word has more than one pattern to follow. At the same time, hundreds of Arabic words may follow a single pattern. For example, the words yshrbwna, "to drink", ysmEwna, "to hear", yDrbwna, "to beat", yktbwna, "to write", yfhmwna, "to understand", yksrwna, "to break", ymsHwna, "to wipe", yHmlwna, "to carry", and yqfzwna, "to jump", all follow the same pattern yfElwna. More than 500 other words follow this pattern, and all of the above words belong to the imperfect verb class. As another example, all Arabic words with three consonants that end with the fatha mark and follow the pattern fEla, "do", are perfect verb words.

The same holds for words belonging to the noun class. For example, all Arabic words following the pattern fAElun, such as qAtlun, "killer", sAhrun, "magician", and kAtbun, "writer", can be categorised as adjective nouns. The above examples show that the last diacritical mark plays a great role in determining the correct tag and adding semantic information to the word. In addition, using the pattern of the word means that building a pattern lexicon with 100 entries may cover 15,000 words, which constitutes the main advantage of the pattern-based technique.

5.3.1 Pattern-based Rules

Since a lexicon of Arabic words or a training corpus is not required in this system, we instead generated a lexicon of patterns, each associated with its last diacritical mark. This lexicon is generated automatically by combining the following three lexicons (a minimal generation sketch is given after the list):

1. A single lexicon of all prefixes, including all valid concatenations. A tag is associated with each prefix.

2. A single lexicon of all forms. A tag is associated with each form.

3. A single lexicon of all suffixes, each associated with the suitable last diacritical mark. A tag is associated with each suffix.
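The following is a minimal Python sketch of this combination step, using a few illustrative entries in the spirit of Tables 5.2 and 5.3 (the data shown is an assumption, not the full lexicons of the AMT system).

    # Build a pattern lexicon by combining prefix, form and suffix lexicons:
    # both the strings and the tags are concatenated.
    prefixes = [("y", ""), ("wy", "PrCo+")]                   # prefix, tag
    forms    = [("fEl", "VePi"), ("fAEl", "VePi")]            # form, tag
    suffixes = [("wna", "MaPlThSj"), ("na", "FePlThDc")]      # suffix, tag

    pattern_lexicon = {}
    for pre, pre_tag in prefixes:
        for form, form_tag in forms:
            for suf, suf_tag in suffixes:
                pattern_lexicon[pre + form + suf] = pre_tag + form_tag + suf_tag

    print(pattern_lexicon["wyfElwna"])   # -> PrCo+VePiMaPlThSj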

Table 5.2 shows a simple part of the prefix, form and suffix lexicons for some im-

perfect verb words. The combined pattern lexicon is shown in Table 5.3.

Prefixes   Tag      Forms   Tag    Suffixes      Tag
y                   fEl     VePi   wna           MaPlThSj
wy         PrCo+    fAEl    VePi   damma mark    MaSnThDc
                                   na            FePlThDc

Table 5.2: Sample of prefixes, forms and suffixes for some imperfect verb words

There are two important things to point out here. The first is that the tags attached to forms and suffixes in Table 5.2 are valid only if those suffixes are attached to those forms. In other words, the tag of a form may change depending on the suffixes attached to it. For example, the tag [VePi] associated with the form fAEl (second line in Table 5.2) is valid only when the form fAEl is combined with the suffixes presented in Table 5.2. If the suffixes change, the tag of the form also needs to change. For example, Table 5.4 shows that the tag of the form fAEl changes due to changes in the suffixes. The combined pattern lexicon can be seen in Table 5.5.

PNo   Pattern (transliteration)   Tag
1     yfElwna                     VePiMaPlThSj
2     yfElu                       VePiMaSnThDc
3     yfElna                      VePiFePlThDc
4     yfAElwna                    VePiMaPlThSj
5     yfAElu                      VePiMaSnThDc
6     yfAElna                     VePiFePlThSj
7     wyfElwna                    PrCo+VePiMaPlThSj
8     wyfElu                      PrCo+VePiMaSnThDc
9     wyfElna                     PrCo+VePiFePlThSj
10    wyfAElwna                   PrCo+VePiMaPlThSj
11    wyfAElu                     PrCo+VePiMaSnThDc
12    wyfAElna                    PrCo+VePiFePlThDc

Table 5.3: Sample of the pattern lexicon showing the patterns for some imperfect verb words

Prefixes   Tag   Forms   Tag    Suffixes     Tag
                 fAEl    VePe   fatha mark   MaSnThSj
                                ta           MaSnScSj
                                tm           MaPlScJs
                                na           FePlThSj

Table 5.4: Sample of prefixes, forms and suffixes for some perfect verb words

PNo   Pattern (transliteration)   Tag
1     fAEla                       VePeMaSnThSj
2     fAElta                      VePeMaSnScSj
3     fAEltm                      VePeMaPlScJs
4     fAElna                      VePeFePlThSj

Table 5.5: Sample of the pattern lexicon showing the patterns for some perfect verb words

Usually, the prefixes have no tags unless a prefix represents a particle, such as a conjunction particle; in this case a separate tag is associated with the particle to show that the word has a combined tag. (In other POS tagger systems built for Arabic, a separate tag such as [Def] is used to represent the definite article Al; in the current system, this information is included in the inflectional features of the word with the symbol [Df].) For example, the word wyshrHu, "and to explain", has the tag [PrCo+VePiMaSnThDc]: [PrCo] is the tag of the conjunction particle w, "and", which appears in the word as well as in the pattern, and [VePiMaSnThDc] is the tag of the rest of the word.

The second thing is that the tag of each suffix represents the inflectional features of the word. Each form has at least one suffix, namely the last diacritical mark. The length of the suffixes ranges from 1 to 4 or 5 letters, while the length of the prefixes ranges from 0 to 4 or 5 letters. So it becomes clear that the tags generated with the patterns are detailed tags.

The rules in the pattern-based technique can be represented using the following general rule:

Assign the tag (T) to the testing word (W) if the testing word matches the pattern (P)

where T is a variable over the set of tags in the pattern lexicon, W is a variable over the set of testing words, and P is a variable over the set of patterns in the pattern lexicon. For example, suppose the testing word is W = yktbu, "he writes". W is looked up in the pattern lexicon to check for its correct pattern. The correct pattern here is P = yfElu (the second pattern in Table 5.3). The tag [VePiMaSnThDc] associated with the pattern yfElu is then extracted from the pattern lexicon and assigned to the word yktbu.

An important question must be asked here: how is the testing word matched with its correct pattern?

To answer this question, a novel algorithm has been developed; it is described in the next section. The purpose of this algorithm is to show how the testing word is matched with its correct pattern in the pattern lexicon.

5.3.2 Pattern-matching algorithm

Since the lexicon in AMT is a pattern lexicon rather than a lexicon of Arabic words, an algorithm is required to match each Arabic word in the testing corpus with its correct pattern in the pattern lexicon. A novel algorithm has been introduced in this work to achieve this goal. The pseudo-code of the pattern-matching algorithm is given in Algorithm 1, and its steps are described below in more detail with examples.

Step 1:

The first step of the algorithm returns from the pattern lexicon all the patterns that have the same length as the testing word. For example, the word fktbtna, "and they wrote", has length 7 (the last diacritical mark is counted as a letter of the word; see Figure 5.3). The returned patterns that have the same length as this word are shown in Table 5.6. The next step (Step 2) of the algorithm shows how to calculate the identical letters between the testing word and the fourth pattern as an example.

PNo   Identical letters with the word    Num
1     the last mark (fatha) only         1
2     one letter and the last mark       2
3     two letters and the last mark      3
4     three letters and the last mark    4

Table 5.6: Number of identical letters between the testing word and its candidate patterns

Let W = the inflected word, P(i) = pattern i in the lexicon, T(i) = tag of P(i),
L(x) = length of x, R = total number of patterns in the lexicon,
D = number of patterns that have the same length as the word,
IL(j) = number of identical letters between pattern j and the word,
M = number of patterns that share the maximum number of identical letters with the word.

begin
    Get the word W;
    D, M = 0;
    for i <- 1 to R do
        if L(P(i)) = L(W) then
            Return P(i), T(i);
            D = D + 1;
        end
    end
    for j <- 1 to D do
        Count the number of identical letters between P(j) and W;
        Store the result in IL(j);
    end
    for j <- 1 to D do
        Return the P(j), T(j) which have the maximum number of IL(j);
        M = M + 1;
    end
    for k <- 1 to M do
        Create a new pattern NP from W, with L(NP) = L(W), by changing the letters of W
            that correspond (mirror) only to the f, E, l letters in P(k);
        if NP = P(k) then
            Return P(k) and T(k);
            Exit the loop;
        else
            Next k;
        end
    end
end

Algorithm 1: Pattern-matching algorithm
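The following is a minimal Python sketch of Algorithm 1, under the simplifying assumption (for illustration only, not the thesis implementation) that patterns are stored as transliterated strings in which the characters f, E and l mark the root positions, the last diacritical mark is part of the string, and the lexicon maps each pattern to its tag.

    # Minimal sketch of the pattern-matching algorithm described above.
    ROOT_MARKERS = {"f", "E", "l"}

    def match_pattern(word, pattern_lexicon):
        """Return (pattern, tag) for the pattern that matches `word`, else None."""
        # Step 1: keep only patterns with the same length as the word.
        candidates = [p for p in pattern_lexicon if len(p) == len(word)]
        if not candidates:
            return None

        # Step 2: count position-wise identical letters between word and candidate.
        def identical_letters(p):
            return sum(1 for wc, pc in zip(word, p) if wc == pc)

        # Step 3: keep the candidate(s) with the maximum number of identical letters.
        best = max(identical_letters(p) for p in candidates)
        best_candidates = [p for p in candidates if identical_letters(p) == best]

        # Step 4: rebuild the word as a pattern (NP): copy the pattern letter wherever
        # the pattern has a root marker (f, E, l), keep the word's letter elsewhere,
        # and accept the candidate only if the rebuilt string equals the pattern.
        for p in best_candidates:
            np = "".join(pc if pc in ROOT_MARKERS else wc for wc, pc in zip(word, p))
            if np == p:
                # Step 5: the tag associated with the correct pattern is returned.
                return p, pattern_lexicon[p]
        return None

    # Example with illustrative transliterated entries:
    lexicon = {"yfElwna": "VePiMaPlThSj", "yfAElwna": "VePiMaPlThSj"}
    print(match_pattern("yktbwna", lexicon))   # -> ('yfElwna', 'VePiMaPlThSj')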


Step 2:

The second step of the algorithm calculates the number of identical letters between the testing word and each of the patterns returned by Step 1. The aim of this step is to reduce the number of returned patterns. For example, the identical letters between the testing word and the fourth pattern are shown in Figure 5.3, and the number of identical letters between the word and each returned pattern can be seen in Table 5.6.

[Figure: position-by-position alignment of the testing word with the fourth pattern, with the identical letters marked]

Figure 5.3: The identical letters between the testing word and its pattern

Step 3:
Choose the pattern(s) which have the maximum number of identical letters. Since the fourth pattern in Table 5.6 has the maximum number of identical letters with the testing word, the algorithm chooses this pattern for the word; in this case W is the testing word and P(1) is the chosen pattern.

Step 4:
Replace the letters of W which correspond (mirror) to the letters f, E and l (the letters that mark the root positions) in the pattern(s) P that have the maximum number of identical letters with the word W. Add the remaining letters of W without change and store the new pattern in NP. Figure 5.4 describes how to perform this step.

[Figure: building the new pattern NP from the word W and the candidate pattern P(1); NP equals P(1), so the pattern matches]

Figure 5.4: Matching the testing word with its candidate pattern

Figure 5.4 shows clearly that a new pattern NP has been created with the same length as the original pattern P(1) and the word W. The letters which do not correspond to the root positions are the same in the word, the original pattern and the new pattern; these letters represent the affixes added to the ground form (root). Since NP = P(1), the chosen pattern is the correct pattern for the testing word.

In most cases the algorithm returns a single pattern with the maximum number of identical letters with the testing word, as in the above example. Sometimes, however, more than one pattern is returned, each with the same number of identical letters with the testing word.

Step 4 is therefore used not only to check that the single pattern with the maximum number of identical letters is the correct pattern, but also to choose the correct pattern when the algorithm returns more than one candidate with the same number of identical letters.

For example, suppose W is the word ysmEwna, "to hear". Table 5.7 shows the patterns that have the same number of identical letters with this word.

PNo   Identical letters with the word    Num
1     three letters and the last mark    4
2     three letters and the last mark    4
3     three letters and the last mark    4

Table 5.7: Number of identical letters between the word ysmEwna and its candidate patterns

During this step, the algorithm determines which one of the above patterns is the correct pattern for the word ysmEwna. The first pattern, P(1), is checked first; Figure 5.5 shows the result.

[Figure: building NP from the word ysmEwna and the first candidate pattern; NP does not equal P(1)]

Figure 5.5: Matching the word ysmEwna with the first candidate pattern

It is clear from Figure 5.5 that NP does not equal the pattern P(1), because the letter miim in the word W differs from its corresponding letter, Alif, in P(1). So this pattern is not the correct pattern for the word ysmEwna. Similarly, the second pattern is not the correct pattern for the word ysmEwna, as shown in Figure 5.6, because the letter miim in the word W differs from its corresponding letter, taa, in P(2).

[Figure: building NP from the word ysmEwna and the second candidate pattern; NP does not equal P(2)]

Figure 5.6: Matching the word ysmEwna with the second candidate pattern

The last pattern, yfElwna, is checked by the algorithm as shown in Figure 5.7. Since NP = P(3), the pattern yfElwna is the correct pattern of the word ysmEwna, and the algorithm therefore chooses it as the correct pattern for this word.

[Figure: building NP from the word ysmEwna and the pattern yfElwna; NP equals P(3)]

Figure 5.7: Matching the word ysmEwna with the pattern yfElwna

Step 5:
The last step of the algorithm extracts the tag associated with the correct pattern from the pattern lexicon and assigns this tag to the testing word. For example, the tag VePiMaPlThSj (see Table 5.3) is extracted and assigned to the word ysmEwna.


5.4 Lexical and Contextual technique

The pattern-based technique described in Section 5.3.2, which depends on the pattern of each word in the testing corpus, constitutes the main technique in this work. In fact, it is not easy for one person to generate all the patterns that cover all the words in the Arabic language. Since most words in Arabic belong to the noun class, difficulties appear especially in collecting all the patterns of the words belonging to this class. For words belonging to the verb class, the case is different: it is easy to collect the verb forms and all the affixes associated with these forms, as the pattern lexicon is generated automatically.

The pattern lexicon in this work contains 8718 patterns, most of which are patterns for words belonging to the verb class. The tag set hierarchy (see 2.3) covers most types of sub-classes belonging to the noun class. Some of these sub-classes have specific patterns, for example adjective nouns, instrument nouns, verbal nouns and diminutive nouns, which are generated automatically and added to the pattern lexicon. Difficulties appear in collecting and generating the patterns for other sub-classes, especially common nouns.

As mentioned earlier, the pattern lexicon contains 8718 patterns; these are definitely not sufficient to cover all Arabic words, especially those belonging to the noun and verb classes. For this reason, the lexical and contextual technique is used in this work to assist the pattern-based technique in tagging those words that do not have patterns in the lexicon, especially words belonging to the common noun sub-class.

On the other hand, all the tags in the pattern-based technique are detailed tags, because these tags have been generated automatically with the patterns. They not only represent the name of the class (e.g. perfect verb, imperfect verb), but also include the inflectional features of the word, such as number, gender, person and mood, derived from the prefixes and suffixes attached to the verb form of the word. In contrast, the tags of the words that belong to the noun or particle class and are tagged by the lexical and contextual technique vary from general to detailed.

As mentioned in section 3.4.2, the Arabic language has a set of rules or signs, described by Arab grammarians, that are used to distinguish nouns from verbs and particles. For example, a word is a noun if:

1. it ends with nunation (tanween);

2. it has the genitive case (ends with the kasra mark);

3. it begins with the definite article Al, "the";

4. it follows the particle yA.

In the Arabic language, neither the words belonging to the verb class nor those belonging to the particle class can share the above properties. These rules have been taken into account when applying the lexical and contextual rules.

5.4.1 Lexical Rules

Lexical rules are used to analyse words and take advantage of their internal structure. The triggers in the lexical rules depend on the character(s), the affixes and the last diacritical mark of the word. The names of the rules in the lexical and contextual technique are written in the same way that Brill [37] represented his rules and templates.

The names of the lexical rules (in parentheses) and the description of each rule are given below:

Assign tag T to the current word if:

1. The last mark of the current word is X. (CWDLM)

2. The first character of the current word is C. (F1CHCWD)

3. The first two characters of the current word are C. (F2CHCWD)

4. The last two characters of the current word are C. (L2CHCWD)

5. The first three characters of the current word are C. (F3CHCWD)

where X is a variable over the set of diacritic marks and C is a variable over the set of characters of the current word.

An example of a lexical rule is shown below. The list of lexical rules with examples can be seen in Appendix C.

• Tanween Damm CWDLM NuCnNmId

This rule means: assign the NuCnNmId tag to the current word if the last diacritical mark of the current word is tanween damm. For instance, the word rjlun, "man".

5.4.2 Contextual Rules

Contextual rules are used to assign the correct tag to a particular word based on the surrounding words or on the tags of the surrounding words. The triggers in the contextual module depend on the current word itself and on the tags or words in the context of the current word.

The names of the contextual rules (in parentheses) and the description of each rule are given below:

Assign tag T to the current word if:

1. The preceding word is Z. (PWD)

2. The preceding tag is Y. (PWDTAG)

where Z is a variable over all words in the testing corpus and Y is a variable over the set of tags.

An example of a contextual rule is shown below; a minimal sketch of how such lexical and contextual rules might be applied follows the example. The list of contextual rules with examples can be seen in Appendix C.

• NuCnGeId PWDTAG PrPp

This rule means: assign the NuCnGeId tag to the current word if the tag of the preceding word is PrPp. For instance, mna Albyti, "from the house".
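The following is a minimal Python sketch (illustrative only, not the AMT implementation) of how lexical and contextual rules of the kinds described above could be represented and applied; the token fields and the transliterated example data are assumptions.

    def apply_rules(tokens):
        """tokens: list of dicts with 'word', 'last_mark' and (possibly empty) 'tag'."""
        for i, tok in enumerate(tokens):
            if tok["tag"]:
                continue  # already tagged, e.g. by the pattern-based module

            # Lexical rule (CWDLM): the last diacritical mark of the current word is X.
            if tok["last_mark"] == "tanween_damm":
                tok["tag"] = "NuCnNmId"
                continue

            # Contextual rule (PWDTAG): the tag of the preceding word is Y.
            if i > 0 and tokens[i - 1]["tag"] == "PrPp":
                tok["tag"] = "NuCnGeId"
        return tokens

    # Example: "mna Albyti" ("from the house") - the preposition is already tagged,
    # so the following word receives NuCnGeId by the contextual rule.
    sentence = [
        {"word": "mna", "last_mark": "fatha", "tag": "PrPp"},
        {"word": "Albyti", "last_mark": "kasra", "tag": ""},
    ]
    print(apply_rules(sentence))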

One of the most obvious problems in tagging Arabic text is recognising proper nouns. A proper noun in Arabic may represent the name of a specific person, place, organization, thing, idea, event, date, time, or other entity. Unlike English, Arabic does not distinguish between lower and upper case letters, which makes it much harder to locate proper nouns than in English text. Furthermore, these words may be solid or derived, or words borrowed from another language (Arabised words), which adds another level of complexity to recognising them [15] [14].

Abuleil and Evens [15] presented a technique for tagging proper nouns in Arabic text which depends on keywords stored in a lexicon. Table 5.8 shows how they classified these keywords. In this work, their classification (keywords) has been applied, but by using the lexical and contextual technique instead of a lexicon.

For example: NuPo PWD "city" (the keyword "city" in its various Arabic written forms).

No   Classification                 Example
1    Personal names (title)         Mr. Ramzi
2    Personal names (job title)     President Saleh
3    Organization names             De Montfort University
4    Locations (political names)    French Republic
5    Locations (natural names)      Amman City
6    Times                          Month of September
7    Products                       Nikon Camera
8    Events                         Cars Exhibition

Table 5.8: Classification of proper nouns

This rule means: assign the NuPo tag to the current word if the preceding word is the keyword "city" (in any of its written forms). For instance, "London City".

Furthermore, a particle lexicon containing the words that belong to the particle class has been built in this work. The motivation behind building the particle lexicon comes from the fact that, during the initial experiments done to test the tagger, some words were tagged incorrectly. Since the pattern-based module has been designed for words belonging to the verb or noun class, some words belonging to the particle class were matched to incorrect patterns when the pattern-matching algorithm was applied to them. For example, the word wmnhA, "and from it", matches the pattern fElhA, as shown in Figure 5.8, and takes an incorrect tag, because this word belongs to the particle class while all the words that follow the pattern fElhA belong to the verb class, such as the word ktbhA, "he wrote it", shown in Figure 5.9. Thus, to reduce the errors in tagging such words and enhance the performance of the tagger system, the decision was taken to generate a separate particle lexicon.

[Figure: the particle word wmnhA incorrectly matching the pattern fElhA]

Figure 5.8: Matching the word wmnhA with the pattern fElhA

[Figure: the verb word ktbhA correctly matching the pattern fElhA]

Figure 5.9: Matching the word ktbhA with the pattern fElhA

The particle lexicon is generated automatically by combining: a single lexicon of all prefixes (including all valid concatenations), a single lexicon of all Arabic words belonging to the particle class, and a single lexicon of all suffixes. Table 5.10 shows a sample of the particle lexicon generated from the elements in Table 5.9.

Prefixes    Tag     Particle word   Tag    Suffixes   Tag
w, "and"    PrCo+   fy, "in"        PrPp   hA         FeSn
                    mn, "from"      PrPp   km         MaPl
                    In, "if"

Table 5.9: Sample of prefixes, particle words and suffixes for some particle words

Particle   Tag          Particle   Tag
fy         PrPp         wfy        PrCo+PrPp
mn         PrPp         wmn        PrCo+PrPp
In         PrAn         wIn        PrCo+PrAn
fyhA       PrPpFeSn     wfyhA      PrCo+PrPpFeSn
fykm       PrPpMaPl     wfykm      PrCo+PrPpMaPl
mnhA       PrPpFeSn     wmnhA      PrCo+PrPpFeSn
mnkm       PrPpMaPl     wmnkm      PrCo+PrPpMaPl
InhA       PrAnFe       wInhA      PrCo+PrAnFe
Inkm       PrAnMaPl     wInkm      PrCo+PrAnMaPl

Table 5.10: Sample of the particle lexicon


5.5 A description of the tagger system

5.5.1 Tagger Modules

The main function of AMT is to take an untagged, partially-vocalised Arabic text (where the diacritical mark is assigned only to the last letter of each word in the testing corpus) as input, and to produce a POS tagged partially-vocalised Arabic corpus. As shown in Figure 5.10, AMT is composed of three main modules: the Tokeniser Module, the Pattern-based Module, and the Lexical and Contextual Module.

[Figure: architecture of AMT - untagged text passes through the Tokeniser Module, then the Pattern-based Module (backed by the patterns lexicon), then the Lexical and Contextual Module, producing tagged text]

Figure 5.10: An overview of AMT

The list below describes the function of each module in more detail.

• Tokeniser Module
A token is not just a word; it is defined as a sequence of characters having a collective meaning [17]. A token may be any special character, number or word. The main function of the tokeniser module is to convert the untagged input text into a form that is more manageable by the machine. This conversion is called tokenisation. The tokenisation process is responsible for reading an untagged input text and identifying words, punctuation marks, numbers and other marks, using the space as a delimiter. The tokeniser simply separates the input text into tokens, including splitting punctuation marks (such as full stops and commas) from their preceding words (a minimal tokenisation sketch is given after this list).

• Pattern-based Module
The main function of this module is to look up each testing word in the patterns lexicon. It performs the steps of the pattern-matching algorithm to match each word in the testing corpus with its correct pattern in the patterns lexicon. If the correct pattern of the testing word is found, the tag is extracted from the patterns lexicon and assigned to the testing word. After this module has finished its task, the remaining words are passed to the Lexical and Contextual Module.

• Lexical and Contextual Module

The lexical and contextual module has been built in this system to assist the

pattern-based module to tag those words not having patterns stored in the pat-

terns lexicon. This module is responsible for applying the lexical and contextual

rules to assign the correct tag to each word not tagged by the pattern-based mod-

ule.
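As a concrete illustration of the tokenisation step described for the Tokeniser Module above, here is a minimal Python sketch; it is illustrative only, and the particular punctuation set shown is an assumption rather than the AMT configuration.

    import re

    # Split on whitespace and detach punctuation marks from the words they follow.
    PUNCT = ".,;:!?\u060C\u061B\u061F"   # includes Arabic comma, semicolon, question mark

    def tokenise(text):
        tokens = []
        for chunk in text.split():
            # split trailing/embedded punctuation (e.g. a full stop or comma) into its own token
            parts = re.findall(rf"[^{re.escape(PUNCT)}]+|[{re.escape(PUNCT)}]", chunk)
            tokens.extend(parts)
        return tokens

    print(tokenise("word1 word2, word3."))   # -> ['word1', 'word2', ',', 'word3', '.']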

5.5.2 Tagging Process

AMT performs several steps during the tagging process, as shown in Figure 5.11. Each token is first looked up in the particle lexicon. If it is found, the tag is extracted and associated with the token; otherwise, the token is passed to the pattern-based module, where the pattern-matching algorithm is applied to check whether the token has a pattern in the pattern lexicon. If the token matches its pattern, the tag is extracted from the pattern lexicon and assigned to the token. If the pattern of the token is not found in the pattern lexicon, the token is passed to the lexical and contextual module.

[Figure: flow of the tagging process - each token passes through the Tokeniser Module, the Pattern-based Module and, if still untagged, the Lexical and Contextual Module, before the next token is processed]

Figure 5.11: How AMT performs tagging
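A minimal sketch of this flow is given below (illustrative only, not the thesis code); the matcher sketched in Section 5.3.2 is passed in as a parameter, and tokens left untagged are those that would be handled by the lexical and contextual rules or by the user.

    def tag_tokens(tokens, particle_lexicon, pattern_lexicon, match_pattern):
        """Tag a list of token strings following the flow described above."""
        tagged = []
        for tok in tokens:
            if tok in particle_lexicon:                  # 1. particle lexicon lookup
                tagged.append((tok, particle_lexicon[tok]))
                continue
            hit = match_pattern(tok, pattern_lexicon)    # 2. pattern-based module
            if hit is not None:
                tagged.append((tok, hit[1]))
                continue
            tagged.append((tok, None))                   # 3. left to the lexical and contextual rules
        return tagged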

At this stage, the lexical and contextual module is applied to assign the correct tag to each token that has not been tagged by the pattern-based module. Finally, for the very few tokens still untagged by the above modules, a user intervention menu has been designed in the main menu of the system (see Figure 5.12) to allow the user to add a new pattern and its general tag, or at least the simple form of the tag (e.g. Ve for verb words or Nu for noun words) if the token belongs to the verb or noun class, or the token itself and its general tag (Pr) if the token belongs to the particle class.

Since this tagger system has been designed to tag Arabic text, it is expected that it is easy for an Arabic-speaking user to use his or her knowledge to tag the words still untagged by adding the simple form of the token tag. The main purpose of the user intervention menu is to enrich the pattern lexicon as well as the particle lexicon with new entries, which leads to a tagger system that can accept any partially-vocalised Arabic text. It is worth pointing out that adding one pattern means that many Arabic words may match this pattern.

The main menu of the AMT system, with an example showing how the tagging process is performed for a small piece of partially-vocalised Arabic text, can be seen in Figure 5.12.

5.6 Chapter Summary


This chapter presented the design and implementation steps for a new rule-based POS tagger called AMT: Arabic Morphosyntactic Tagger. We defined the characteristics of the AMT tagger: no manually tagged lexicon or training corpus, word-level tagging, and tagging of partially-vocalised Arabic text. In the current literature such a tagger does not exist. A new technique with a novel algorithm has been applied in the AMT system.

we generated a lexicon of patterns which are associated with the last diacritical mark

and generated automatically.

In this work the word "pattern" is used to represent the template of the whole word in-

135
5.6. CHAPTER SUMMARY

1fIIII . • •

• \ ~~J' \.e! ~\ ~~\ ::,. "lI.>'o'4 )J~\ ~~ ~ \~~j. )J~'I ~ \~\J l.l~ ~~1 ~
~ .) ~'J:. ~l Js- ""~\-j ~I ~\ ~ lA;WJ l.l.JJ~ ~;.w -..I~\ ~ ~i #-
I~~\JW~\J ..J-j~'tJ.:,.a~ ~. ~ ~I"W~~. ~ Ja'I;4 ~1.o..,j~1 ~
lJ,jj.i .j\;' wJ r ~~ ~;yo ~..6.Sj r .~I ~I"W~I ~~ ~ ~\ :"I >,,~\;- ,j1".

-
1 hi. 'l',1 hn: lf~. t'Jh'n~

System Output -Tagged Text ~


• <NuCnGtOf> ;,..:l <PrPp>..J'" <NuCnAc> I)lJ <NuCnAC> I~ <NuCnNmDf> ~~~I <VePiFeSnThOc> ~~
... ~ <NuCnNmDf> J~ <NuDeFeSnGe> ~ <VePeFeSnThJs> ~ <PrCo+pr> I~!,J <Pun>
.... ~J <NuPs> • <Pun> '+ <HuCnGeDf> ~ <NuCnGeOt> ..:."'~ <PrPp> ~ <PrCo+NuCnGeDf>
~~ <PrCo+NUCnGeDf> . <Pun> u~ <NuCnAc> .4-0 <NuCnNmDf> ~J)a:J <PrCo+VePeMiSnSeJs>
t.......aJ <NUCnAc> ~~ <VePiFeSnThDc> ~~ <NuCnAcDf> '~Li2 <NuCnGeDf> ..:.1I'LIJ <PICo+PrAn>
I,.l,... <VePtM.lSnThOc> ~~ <PrCo> ; <VtPiMiSnThDc> ~ <PrCo+Pr> .is, <NuCnAc> • <Pun>
<PrAn> J <Pr> )IP <Pt.ln> ! <NuOeMaSnAc> ./.W <VePiFeSnThDc> >- <Nuln> <..6iS <VtPeMiSnThSj> • <PUn)

"';I..w, <vePeMaSnThSj> "';J.I.I <VePeFeSnThJs> ::';... <NuCnGeDf> ..:.1I~ <VePeMaSnThSj> ..,-

Compare Rt.suls Sare Results ~

Figure 5.12: Tagging process for simple part of text

eluding the prefixes, form (root+infixes) and suffixes, which are attached to the word.

The main technique based on the pattern of the word instead of the word itself.

In addition, the lexical and contextual rules have been used in this system to assist the pattern-based technique in tagging those words that do not have a pattern stored in the pattern lexicon. The AMT system presented in this work deals with partially-vocalised Arabic text, and it is the first Arabic POS tagger to use a purely rule-based approach. A full description of the AMT tagger system modules and the function of each module has also been given in this chapter. Finally, we described the tagging process that the AMT system carries out.

Chapter 6

Evaluation of Results obtained from

AMT
Objectives

• To present the testing data sets.

• To define the measure used to calculate system performance.

• To describe the experiments done to evaluate the AMT system.

• To explain the analysis of the results.

• To present the AMT system's shortcomings.

6.1 Testing Data sets

A partially-vocalised Arabic text is needed to test the AMT system. The lack of a large partially-vocalised Arabic corpus is one of the problems we faced. In order to obtain a testing corpus for the tagger, a new partially-vocalised Arabic corpus of 20,000 words has been compiled. Since the text in school textbooks contains diacritics, the corpus was extracted and collected from these textbooks via the official site of the Ministry of Education, Jordan (http://www.elearning.jo/eduwave/elearningme.aspx), with permission and authorization from the department of curricula and textbooks management (see Appendix D).

The text in the testing corpus was normalised manually; that is, the diacritics other than the last diacritical mark were removed, and the last diacritical mark was added to those words that did not have it. Although not all words in school textbooks have diacritics, especially for the higher level classes, the text in school textbooks is still the closest available.

The aim of the normalisation process, which was carried out in consultation and collaboration with an Arabic linguist (the author gratefully acknowledges the collaboration of Mr. Walid Alqrini, Ministry of Education, Jordan; email: walidalqriniI23@yahoo.com), is to ensure that each word in the corpus carries only the correct last diacritical mark. The corpus was also manually tagged for comparison with the system-tagged texts.

The corpus was chosen and extracted from different books for different levels of school classes. It is not limited to a particular domain; it covers a wide range of topics, such as scientific topics and literary topics.

Test data for the experiments was taken from the testing corpus. The data set consists of raw original Arabic script words; no further annotations exist for this data set. The data is spread across three sets:

1. Set-1: consists of 3170 words representing several articles extracted from the book of computer science and other science topics, such as biology, for different secondary school classes.

2. Set-2: consists of 7620 words representing several articles extracted from the

book of Arabic language topics for classes 7, 8, and 9 of elementary level.

3. Set-3: consists of 9210 words representing several articles extracted from the

books of literary topics for classes 10, 11 and 12 of secondary level.

6.2 AMT Experiments and accuracy measurement

Five experiments were done to evaluate the AMT tagger system. The first three experiments were performed on Set-1, Set-2 and Set-3 respectively. The fourth experiment was performed to calculate the proportions of the pattern-based module and the lexical and contextual module applied in the first three experiments. The last experiment was done on a different text, namely Quranic text: a sample of 1016 words was taken from the chapters Almulk and Alforqan, and the diacritics were removed except for the last diacritical mark. The aim of this experiment is to get a picture of AMT performance on a different type of text. The results of this experiment are also described in this chapter in more detail.

There are several measurements used to indicate the performance of tagger systems. Success rate (also called correctness or score), ambiguity (also called the average number of tags per token), recall and precision are the most popular measures used to indicate the accuracy of tagger output ([73], p.82). The success rate measure is used when the tagger assigns a single tag to each token, as the tagger presented in this work does. It is expressed as a percentage and defined as follows:

Success rate = (Number of correctly tagged tokens / Total number of tokens) x 100

The ambiguity measure is used when the tagger assigns multiple tags per token; ambiguity is calculated by dividing the total number of tags by the total number of tokens. Recall and precision, which have their origin in information retrieval, are an alternative pair of measures used in tagging. Recall is calculated by dividing the number of correct token-tag pairs produced by the number of correct token-tag pairs possible. Precision is the number of correct token-tag pairs produced divided by the total number of token-tag pairs produced. Like the success rate measure, ambiguity, recall and precision are expressed as percentages ([73], p.83).
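The following is a minimal Python sketch of these evaluation measures (illustrative only); the final line uses the experiment-1 figures reported later in this chapter (2830 correctly tagged tokens out of 3170).

    def success_rate(correct_tokens, total_tokens):
        return 100.0 * correct_tokens / total_tokens

    def ambiguity(total_tags_assigned, total_tokens):
        return total_tags_assigned / total_tokens      # average number of tags per token

    def recall(correct_pairs_produced, correct_pairs_possible):
        return 100.0 * correct_pairs_produced / correct_pairs_possible

    def precision(correct_pairs_produced, total_pairs_produced):
        return 100.0 * correct_pairs_produced / total_pairs_produced

    print(round(success_rate(2830, 3170)))   # -> 89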

Since the AMT tagger presented in this work produces a single tag for each word in the testing corpus, the success rate measure was used to indicate the performance of the AMT system. The success rate for each experiment and the ratio of tag types used in each experiment were calculated. In addition, the distribution of POS classes for the first three experiments has also been addressed. The details of the results of each experiment are described in the following sections (6.2.1 - 6.3.1), and Section 6.3 describes the analysis of all the experimental results.

6.2.1 Experiment-1

The first experiment was performed on the first set. AMT correctly tagged 89% of the Set-1 words, as shown in Figure 6.1.

Of the correctly tagged words, 66% of the tags assigned to tokens in the first experiment are detailed tags, which include the inflectional features of each word (see Figure 6.2). This ratio indicates that the majority of the correctly tagged tokens were not tagged at the general level only. In addition, the distribution of POS classes for the text in experiment-1 can be seen in Figure 6.3. It is expected that the ratio of tokens belonging to the noun class is the highest, since most Arabic words belong to the noun class rather than to any other POS class. Usually, but not always (depending on the testing text), particles in Arabic have the second highest ratio after nouns.

[Figure: bar chart of experiment-1 results - Set-1 size 3170 tokens, 2830 (89%) tagged correctly, 340 (11%) incorrectly]

Figure 6.1: Success rate of experiment-1

[Figure: pie charts - 66% of the correctly assigned tags in experiment-1 are detailed tags and 34% are general tags; distribution of POS classes across nouns, verbs, particles and punctuation]

Figure 6.2: Detailed and general tags ratio in experiment-1
Figure 6.3: Distribution of POS classes in experiment-1

6.2.2 Experiment-2

The second experiment was performed on the second set, which contains 7620 words. AMT correctly tagged 94% of the Set-2 words, as shown in Figure 6.4. In this experiment (Figure 6.5), out of the correctly tagged words, 78% of the tags assigned to tokens are detailed tags, while 22% are general tags; this ratio varies according to the type of text. In addition, Figure 6.6 shows that 63% of the text tokens in this experiment belong to the noun class, while 13% belong to the verb class. Tokens belonging to the particle class and the punctuation class constitute 16% and 8% respectively.

[Figure: bar chart of experiment-2 results]

Figure 6.4: Success rate of experiment-2

[Figure: pie charts - 78% of the correctly assigned tags are detailed tags and 22% are general tags; distribution of POS classes: nouns 63%, verbs 13%, particles 16%, punctuation 8%]

Figure 6.5: Detailed and general tags ratio in experiment-2
Figure 6.6: Distribution of POS classes in experiment-2

6.2.3 Experiment-3

The third experiment was performed on Set-3, which contains 9210 words. AMT correctly tagged 91% of the Set-3 words, as shown in Figure 6.7. Out of the correctly tagged words in this experiment, the ratio of detailed tags assigned to tokens is 59%, while 41% are general tags (see Figure 6.8). On the other hand, 67% of the text tokens in this experiment belong to the noun class, while 10% belong to the verb class. Also, Figure 6.9 shows that 17% of the text tokens in this experiment belong to the particle class, and the ratio of tokens belonging to the punctuation class is 9%.

[Figure: bar chart of experiment-3 results - Set-3 size 9210 tokens, 8418 (91%) tagged correctly, 792 (9%) incorrectly]

Figure 6.7: Success rate of experiment-3

[Figure: pie charts - 59% of the correctly assigned tags are detailed tags and 41% are general tags; distribution of POS classes across nouns, verbs, particles and punctuation]

Figure 6.8: Detailed and general tags ratio in experiment-3
Figure 6.9: Distribution of POS classes in experiment-3

6.2.4 Experiment-4

This experiment was performed to get a picture of the proportions of tokens tagged by the pattern-based module and by the lexical and contextual module in the above three experiments. Figure 6.10 shows that 91% of the testing corpus tokens were tagged correctly. Of the tokens tagged correctly, 48% were tagged by applying the pattern-based module, while 52% were tagged by applying the lexical and contextual module.

Figure 6.10: Percentage of rules applicability based on type (correctly tagged tokens: 18433; tagged by applying pattern-based rules: 8848; tagged by applying lexical and contextual rules: 9585)
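As a quick arithmetic check of the split reported above (my own computation from the counts given in figure 6.10, not part of the original analysis):

\[
\frac{8848}{18433} \approx 0.48, \qquad \frac{9585}{18433} \approx 0.52, \qquad 8848 + 9585 = 18433 .
\]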


6.3 Experimental Results Analysis

The results in the first three experiments show that the proportion of correctly tagged words varies according to the domain of each text. The style and content of the text is one of the main factors that affect the accuracy of the tagger. The text in experiment-1 is related to a computer science topic, where some words are Arabised words; these are not original Arabic words but come from other languages and do not have a root or a pattern, for example the word kmbywtr, "computer". Most of these words are tagged incorrectly.

The percentage of correctly tagged words in experiment-2 is higher than in experiment-1 and experiment-3. This is an expected result, as the text of experiment-2 is about the Arabic language and is written for school level, so most of its words are original Arabic words that have a root and a pattern. In addition, the percentage of correctly tagged words belonging to the verb class and the punctuation class is higher than in experiments 1 and 3.

The different subject of the text in experiment-3, which is related to literary topics, is probably the reason why the tagging accuracy for this text is lower. Many proper nouns and Arabised words are used in this type of text. Since recognising proper nouns constitutes the most obvious problem in tagging Arabic text, most of the errors came from proper and Arabised nouns. Such words are very difficult to recognise and are tagged incorrectly.

In addition, Arabic has irregular verbs, such as the word Dl=a, "to go astray". Also, some words in the Arabic language are considered primitive verbs, such as b'sa, "what a bad ... !". These words are also tagged incorrectly; for example, the word b'sa matches a verb pattern in the pattern lexicon and is therefore assigned the wrong tag.

The above three experiments show that most words in the Arabic language belong to the noun class, followed by particles, verbs and punctuation marks. The overall accuracy of the tagger has been calculated by comparing the tagger output with the goal corpus, which is manually tagged. The tagger achieves 91%. Since there is no training corpus in this system, this accuracy is very good.
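For clarity, the overall accuracy referred to here is the usual token-level measure, stated below for completeness (a standard definition rather than a formula quoted from the thesis):

\[
\text{Accuracy} = \frac{\#\{\text{tokens whose AMT tag matches the goal-corpus tag}\}}{\#\{\text{tokens in the testing corpus}\}} \times 100\% .
\]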

On the other hand, 48% of the accuracy is achieved by applying the pattern-based module, while 52% is achieved by applying the lexical and contextual module (see figure 6.10). Since the ratio of patterns in the lexicon belonging to the verb class is higher than that of noun-class patterns on the one hand, and most of the words in the testing corpus belong to the noun class rather than to any other POS class on the other hand, it is natural that these words are tagged using the lexical and contextual rules. For this reason, the proportion of the accuracy achieved by applying the pattern-based rules is lower than that achieved by applying the lexical and contextual rules.

One of the problems we faced during the experiments is the tagging of the passive perfect verb. A passive perfect verb is assigned the same detailed tag as an active perfect verb, since both share the same last diacritical mark (recall that AMT ignores any diacritical mark other than the last one). For example, consider the words katba, "he wrote", and kutba, "it was written": the former is an active perfect verb while the latter is a passive perfect verb. Since both words share the same last diacritical mark and match the same pattern, AMT extracts and assigns the detailed tag


VePeMaSnThSj to both words, which means Perfect Verb, Masculine gender, Singular number, Third Person, Subjunctive mood. This detailed tag is correct for the former word (katba), because that word means "there is a gentleman who wrote" and the inflectional features MaSnThSj describe that clearly. It is not correct (except for the mood feature Sj) as a detailed tag for the latter word (kutba), since that word describes something other than a human (a book, a lesson) having been written. At the same time, the general tag VePe is valid and correct for both words. This example shows that a smaller (general) tag set may contribute to increasing the performance of the tagger.
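To make the structure of such detailed tags concrete, the sketch below decodes a tag like VePeMaSnThSj into its component features. It is only an illustrative reading of the abbreviations listed in Appendix A (Ve = verb, Pe = perfect, Ma = masculine, Sn = singular, Th = third person, Sj = subjunctive), not code taken from the AMT implementation.

    # Minimal sketch: decode an AMT-style detailed tag into its features.
    # The two-letter codes follow the abbreviations used in Appendix A; the
    # dictionary covers only the codes needed for this example.
    FEATURE_CODES = {
        "Ve": "verb", "Pe": "perfect",
        "Ma": "masculine", "Fe": "feminine",
        "Sn": "singular", "Du": "dual", "Pl": "plural",
        "Fs": "first person", "Se": "second person", "Th": "third person",
        "Sj": "subjunctive", "Js": "jussive", "Dc": "indicative",
    }

    def decode_tag(tag):
        """Split a detailed tag into two-letter codes and translate each one."""
        codes = [tag[i:i + 2] for i in range(0, len(tag), 2)]
        return [FEATURE_CODES.get(code, f"unknown({code})") for code in codes]

    print(decode_tag("VePeMaSnThSj"))
    # ['verb', 'perfect', 'masculine', 'singular', 'third person', 'subjunctive']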

The AMT system presented in this work does not differentiate between the passive and the active perfect verb and assigns a detailed tag to both. This problem affects only those words that represent a passive perfect verb. Although the general tag is valid and correct for those words, solving the problem would mean adding an additional diacritical mark to the first letter of each word in the testing corpus and of each pattern in the pattern lexicon, which requires great effort and time compared with the very small number of such words in the testing corpus, since most perfect verbs in Arabic are active. For example, only 0.0005% of the testing corpus words are passive perfect verbs. Thus, the general tag is assigned to these words.

Another problem that appeared during the experiments relates to nouns ending with the long vowel Alif: some nouns are wrongly matched with verb patterns. As an example, the word nmwhA, "growth", matches a verb-class pattern as shown in figure 6.11, so an incorrect tag was extracted from the pattern lexicon and assigned to it, even though this word belongs to the noun class. The main reason behind the error in matching the incorrect pattern is that the pattern as


Figure 6.11: Matching the word nmwhA, "growth", with a verb-class pattern

well as the word does not end with a diacritical mark; instead, both end with the long vowel letter Alif, which fills the place of the last diacritical mark (the fatha). Only a very few words ending with Alif belong to the noun class, but as long as such words remain, the pattern-matching algorithm is not 100% accurate.

The best solution to this problem is to compile a lexicon containing all the Arabic roots. One more step may then be added to the pattern-matching algorithm. The aim of this new step is to extract from the testing word the three letters that correspond to the root letters (f, E, l) in the pattern. The extracted letters are then looked up in the root lexicon to check whether they constitute a valid Arabic root or not. If they are found in the root lexicon, the original testing word belongs to the verb class; otherwise it belongs to the noun class. For example, the letters extracted from the word nmwhA (see figure 6.11) do not constitute a valid Arabic root, so the original testing word nmwhA belongs to the noun class.
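The following is a minimal sketch of how such an additional verification step might look. The function name, the toy transliterated root lexicon and the way the candidate letters are passed in are all illustrative assumptions of mine; the actual extraction of the letters aligned with the f, E, l positions is the job of the pattern-matching algorithm described in Chapter 5.

    # Illustrative sketch (not the AMT implementation): given the letters of the
    # testing word that align with the root positions (f, E, l) of the matched
    # pattern, decide between verb class and noun class via a root lexicon.

    # Toy root lexicon in transliteration; a real lexicon would contain all
    # valid Arabic roots.
    ROOT_LEXICON = {"ktb", "drs", "Elm"}

    def classify_by_root(candidate_root):
        """Return 'verb' if the extracted letters form a valid Arabic root,
        otherwise 'noun', as proposed in the text above."""
        return "verb" if candidate_root in ROOT_LEXICON else "noun"

    print(classify_by_root("ktb"))   # 'verb'  -> found in the toy root lexicon
    print(classify_by_root("qqq"))   # 'noun'  -> not a valid root in the toy lexicon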

Compiling a lexicon containing all valid Arabic roots is possible, but it takes time to compile. Because this problem did not have a noticeable impact on the effectiveness of the tagger (the number of affected words in the testing corpus, about 0.0043%, is scarce), and because it emerged only in the final stages of our experiments and lies outside the scope of this research, it has been left as future work.

The size of the tag set of an annotation system has a direct influence on the accuracy of the tagging system. A smaller tag set may contribute to increasing the performance of the tagger, but it also provides less linguistic information, making the whole tagging system less useful for linguists and NLP developers (e.g. for building an educational system), especially if the aim of the tagger is to produce a tagged corpus.

Out of the overall tagged corpus tokens, 68% of the tags were detailed tags while 32% were general tags, as shown in figure 6.12. Since all the tags in the pattern lexicon are detailed tags, every token tagged by the pattern-based rules is necessarily assigned a detailed tag. In addition, most of the tags assigned by the lexical and contextual rules are also detailed tags. Most of the 32% of general tags include one or sometimes two inflectional features; in other words, most of the tags assigned by the lexical and contextual rules carry inflectional features such as mood, state or case. For example, NuCnId (Common noun, Indefinite): such a tag has been counted within the 32% as a general tag. However, we aim to enrich each word in the testing corpus with a detailed tag.

All of the POS tagging systems built for Arabic (described in chapter 2) share the following characteristics: (1) they deal with unvocalised Arabic text; (2) they need a manually tagged corpus. The current tagger deals with partially-vocalised Arabic text without using a manually tagged corpus. In addition, the current tagger assigns the tag to the testing word based on the pattern of that word instead of the word itself. Although the current tagger uses a different technique and a different type of text,


we would still like to compare the results obtained from the current tagger (AMT) with the results of the Khoja tagger (APT). Unfortunately, the source code of the Khoja tagger is not available on her site (http://zeus.cs.pacificu.edu/shereenl). In addition, we had no luck in contacting her to acquire the source code for her tagger.

A sample of 1500 words was taken randomly from the above three test sets and the last diacritical mark was removed from the words; in other words, the text became unvocalised. An experiment was performed to tag this sample, and the result is shown in figure 6.13.
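A minimal sketch of how the last diacritical mark might be stripped from a word to produce such an unvocalised sample is given below. It assumes Unicode input and uses the standard Arabic diacritic codepoints (fathatan through sukun, U+064B to U+0652); this is my own illustration, not the preprocessing code used in the thesis.

    # Sketch: remove a word-final Arabic diacritic (tanween, short vowel,
    # shadda or sukun) to turn a partially-vocalised word into an unvocalised one.
    ARABIC_DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}  # U+064B .. U+0652

    def strip_final_diacritic(word):
        """Drop any trailing diacritical marks from the end of the word."""
        while word and word[-1] in ARABIC_DIACRITICS:
            word = word[:-1]
        return word

    # Example with katba ("he wrote"): the final fatha (U+064E) is removed,
    # leaving only the bare letters ktb.
    print(strip_final_diacritic("\u0643\u062A\u0628\u064E"))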

Figure 6.12: Detailed and general tag ratio overall in the correctly tagged corpus (detailed: 68%, general: 32%)

Figure 6.13: Success rate for the unvocalised sample text which contains 1500 words

AMT correctly tagged 21% of the unvocalised sample text. Most of the correctly tagged tokens in this sample belong to particles and punctuation marks, plus some proper nouns. This is not a surprising result, since the patterns as well as the lexical and contextual rules examine the last diacritical mark during the tagging process.

In addition, the result of experiment-4 shows the importance of the last diacritical mark in reducing lexical ambiguity and providing semantic information to the word, which helps the POS tagger to determine the correct tag for each word in the testing text. AMT correctly tagged 91% of the testing words. Since the majority of Arabic words are nouns, a default tag, NuCn (Common noun), is assigned to the remaining words (9%). These words are stored in a special list and were reviewed. The default tag is correct for most of the remaining words, which reduced the ratio of these words to 3%; those were tagged manually.
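A compact sketch of this fallback step is shown below: assign the default tag NuCn when neither the pattern-based nor the lexical and contextual rules produce a tag, and keep the word aside for later review. The function and variable names are illustrative assumptions, not the AMT code.

    # Sketch of the default-tag fallback described above (illustrative only).
    DEFAULT_TAG = "NuCn"      # Common noun: the most likely class for an Arabic word
    words_for_review = []     # words that received only the default tag

    def tag_with_fallback(word, rule_based_tag):
        """Use the tag produced by the pattern-based or lexical and contextual
        rules when available; otherwise fall back to the default tag and keep
        the word in a special list for manual review."""
        if rule_based_tag is not None:
            return rule_based_tag
        words_for_review.append(word)
        return DEFAULT_TAG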

6.3.1 The Quran text experiment

Another experiment was performed to get a picture of the tagger's accuracy on a different type of text. A sample of the Quran text containing 1016 words was taken from the chapters Almulk and Alforqan (see figure 6.14). The diacritics were removed except for the last diacritical mark.

~ " ~, ~ ~\ ~i ~~ ~'--'l, _"':I "i-io .;~, _~~ .. ..,.....;J';" y.,::::"':1~oIooo! .;~I J~


~~,., "
t~ ~~ ..;....;.'0,,)., ,,~'I ~oJ'~i ~i . ~,....~ I ~~J~ ' ~~ ') "~I ~~~I\
.~, ~J ~~ -', ~i )·I :.:.a _ ,j~~ '\oi ~ ..;~\ \ ~ ~J \ j~ -:,.:a.:1 ;1+):""" ~j ;Ji; ~
I I..,.. o,,)~ 'jJ ~ 'J, :..... ~~ ,j ~ 'jJ ,j,w.. ~J"""' ,j~ OJ ~- ~J~':';' I J~\ J .
I.) ~.::.;J..;A I J$-~ .jM'~ ~ ~ """" \ ~i '~: J ° 1;006jg ww.. ~ '.......... ~,ju...;... ;""i) \~) I

~ TI, "."~,,. 1111 to ll1h no.

System Output Tagged Text ~


... ~ <PrCo+NuPsFePISeAc> ,. j <NUCnNII1Of> ::t..:I <NuCnGeld> ..... <NueVMaSn> .$~ <vePIFe$n~h$l> od
... ,..:1 <vep,MasnThSj> ~ <NuCVMaSn> ",..:I <NucnNm> <Pun:> ~~ <NuCnGe> ~ ,_ <NuCnGeld> ,,is <prPp:»
~ <.Vc .. IMaSn'!>O,,, ~ <.Vc"cMaSnThJv ~ "'Nucn.>~':"'; ~"rCo+NuCnA'O'.> .~, <.NuCnA,Of.>
~ <NuCvMaSn> .J~ <NuCnHIIIOf> <Pun> j,..:. <t4uCnNmOf> ~;:,I <PrCo-+NuP,FcPlseAP ,. J <NuCnAc>
~ -:V.P.MaSnThSj!Oo ~J' ;;Pr/ln" \.. -:NuCnAC" ,,\..ia -:NuCnG." -1,'-- -:V.PeMaSnThSJ" ¢- -:V.PeMaSilnThSil)"

;...:1 <NuCnGeld> ~""JIA <NuCnGe> <Pun> '-')1&.


' . <N\I Cn.
<NuCn> ..r <NuCnGeDf> ~u>-I (;elcf>
. • <PrPr>
~

~) <VePeMaSnThSj> ~ <NuCnGe> <Pun> J~ <Nuen> ..>'" <VePeMaSnThSJ> J.J-' cPr> ..:.- <NuCnAcDf> ,
~Ia. <NuenNmOf> .~ c:VePeMa$nThSj> ~ <Nuen>:":"" <NuCnGeld> ...:~ <NuenAeDI> ~
y

Figure 6.14: A sample of Quran text


Figure 6.15: The result of the Quran text (summary of results for the Quran text with a total size of 1016 word tokens, compared with the manually tagged goal text: 898 tokens tagged correctly, overall accuracy 88%; 43% of tokens tagged by the pattern rules and 57% by the lexical and contextual rules; 69% detailed tags and 31% general tags; distribution across POS categories: nouns 439 (49%), verbs 168 (19%), particles 207 (23%), punctuation 84 (9%))

The AMT system correctly tagged 88% of the Quran sample, as shown in figure 6.15. Out of the tokens tagged correctly, 43% were tagged by applying the pattern-based module, while 57% were tagged by applying the lexical and contextual module. Some of the sample words in this experiment (experiment-5) are classical Arabic words. Since Modern Standard Arabic (MSA) is the form in current usage, Arabic writers today use the MSA equivalents of these words instead of the classical Arabic words. A sample of classical words used in the Quran text and their MSA meanings can be seen in Table 6.1.

Most of these classical words are tagged incorrectly, because the patterns in the lexicon are valid patterns for MSA text rather than for classical text. On the other hand, the Quran text shares some of the errors described above for MSA text. For example, some proper nouns also occur in the Quran text, and none of these proper nouns has a pattern to follow. Also, some nouns are wrongly matched with verb patterns.


Transliteration   Translation
tDTrb             disturbed
shqwq             creases
HjArP             stones
tmAdwA            gone
khlqkm            created you
qrybA             soon
Emyqa             deeply

Table 6.1: Some of the Quran words vs. MSA words (transliteration and English translation)

Furthermore, the same problem relating to nouns ending with the long vowel Alif also appeared during the Quran text experiment. Despite the errors described above, AMT achieved very good accuracy on the Quran text. Figure 6.15 shows that the Quran text is similar to MSA text with regard to the POS classes its words belong to; for example, most of the Quran words belong to nouns and particles (72%) rather than to verbs. In addition, 43% of the Quran sample words were tagged using the pattern rules. This is a natural ratio, since the Quran words are derived from roots and most of them have patterns to follow. At the same time, 69% of the sample words were tagged with detailed tags (see figure 6.15).

6.4 Summary of results obtained from the AMT system

The summary of all the results obtained from the AMT system for the experiments described above shows that the proportion of correctly tagged words varies according to the domain of each text. The AMT system achieved very good accuracy (91%), given that it does not use a lexicon for training.


Since no huge tagged corpus was available to the tagger system presented in this work, this accuracy allows us to point out that it is possible to build a tagger system for Arabic that does not require a huge tagged corpus. Such a tagger helps to address the lack of a huge tagged corpus for Arabic in the current literature. In addition, the diacritical mark, especially at the last letter of the word, plays a great role in reducing lexical ambiguity and determining the correct POS tag for each word in the testing corpus. Although the tagger system presented in this work has many strong points (described in the next chapter, section 7.4), the problems faced during the experiments can be summarised as follows:

• The system is not accurate in tagging proper nouns and Arabised words.

• The system does not differentiate between the passive and the active perfect verb and assigns a detailed tag to both.

• Some nouns are wrongly matched with verb patterns.

As mentioned earlier in section 6.3, the above shortcomings did not have a noticeable impact on the effectiveness of the tagger, because the number of affected words in the testing corpus is scarce: for example, 0.0005% of the testing words are passive perfect verbs and 0.0043% are nouns ending with the long vowel Alif. Nevertheless, solving these shortcomings to enhance the performance of the AMT tagger is taken into account and described in the next chapter (section 7.3).

6.5 Chapter Summary


This chapter presented several experiments carried out to evaluate the AMT system using a new partially-vocalised Arabic testing corpus. The data sets used during the experiments were described, and the results of the experiments and their analysis were explained. The results show that AMT achieved an average accuracy of 91% on the testing corpus, which contains 20,000 words. The shortcomings of the AMT system were also mentioned. The main conclusions yielded during the course of this research, the strong points of the tagger system, and future work are described in the next chapter.

Chapter 7

Conclusion

Several Part-of-Speech tagging systems with high tagging accuracy have been developed, especially for English, based on text statistics or on grammar rules. Unlike English, Part-of-Speech tagging for Arabic as a research field in Arabic NLP has received relatively little attention, and only a few systems have been developed. These systems were built to tag unvocalised Arabic text using a lexicon or dictionary that was tagged manually and used as a training corpus containing all possible tags (lexical information) for each word.

The Arabic language has a valuable and important feature, called diacritics, which are marks placed above and below the letters of a word. An Arabic text may be written with or without diacritics. An Arabic text that appears without short vowels and diacritics is called unvocalised text, while Arabic text written with a full representation of short vowels and other diacritical marks is called fully-vocalised text. An Arabic text is partially-vocalised when the diacritical mark is assigned to one or at most two letters in the word.

This thesis represents a substantial starting point for developing a rule-based part-of-speech tagging system that deals with partially-vocalised Arabic text. It is the first tagger that (1) uses only linguistic rules, and (2) investigates the role of the last diacritical mark in helping to determine the correct POS tag for each word in the testing corpus. The main function of the tagger system is to produce a POS tagged corpus.

A novel pattern-based technique has been explored using a novel algorithm (the pattern-matching algorithm). In this technique, an Arabic word is tagged based on its pattern. A lexicon of patterns, each associated with the last diacritical mark, was generated automatically and used instead of a huge Arabic word lexicon. The advantages of this technique are twofold: first, it does not need a word lexicon or training corpus; second, it reduces the storage space, since hundreds of Arabic words may follow one pattern. Additionally, a set of linguistic rules (the lexical and contextual technique) based on the character(s), affixes, the last diacritical mark, the word itself, and the surrounding words or their tags was used to tag those words not tagged by the pattern-based technique.
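To illustrate the pattern-based idea (only as a simplified sketch, not the pattern-matching algorithm of Chapter 5): if patterns are written in transliteration with f, E and l marking the root positions, a word can be matched against a pattern of the same length by comparing the non-root positions. The toy lexicon, the placeholder tags and the omission of the associated last diacritical mark are all simplifications of my own.

    # Simplified illustration of pattern-based tagging (not the Chapter 5 PMA).
    # A pattern is written in transliteration with f, E, l standing for the three
    # root consonants; a word matches if it has the same length and agrees with
    # the pattern at every non-root position.
    PATTERN_LEXICON = {        # toy lexicon: pattern -> tag (placeholder tags)
        "fAEl": "tag-A",
        "mfEwl": "tag-B",
    }
    ROOT_PLACEHOLDERS = {"f", "E", "l"}

    def matches(word, pattern):
        if len(word) != len(pattern):
            return False
        return all(p in ROOT_PLACEHOLDERS or w == p for w, p in zip(word, pattern))

    def tag_by_pattern(word):
        for pattern, tag in PATTERN_LEXICON.items():
            if matches(word, pattern):
                return tag
        return None  # fall through to the lexical and contextual rules

    print(tag_by_pattern("kAtb"))   # 'tag-A'  (kAtb matches the pattern fAEl)
    print(tag_by_pattern("mktwb"))  # 'tag-B'  (mktwb matches the pattern mfEwl)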

The system was developed to answer the hypothesis and research questions mentioned in chapter 1. Since the accuracy achieved by the AMT system is 91%, we can make the following assertions:

1. it is possible to build a tagger system for Arabic without needing a huge lexicon for training.

2. the diacritical mark, especially at the last letter of the word, plays a great role in reducing lexical ambiguity and determining the correct POS tag for each word in the testing corpus.


3. the accuracy is comparable to that of the statistics-based tagging systems built for Arabic. However, those systems deal with unvocalised text and need a huge manually tagged lexicon, which is still not available in the current literature; most of the current taggers used only a small training corpus.

Section 7.1 summarises the importance of the diacritic feature. Section 7.2 describes the contributions of this research, while section 7.3 points out directions for future work.

7.1 Importance of diacritic feature

The lack of diacritics in Arabic texts is a major challenge for most Arabic NLP tasks. The use of diacritics in Arabic texts is extremely important. The list below summarises the importance of using diacritics in the Arabic language:

1. They add semantic information to words, which helps with resolving ambiguity in the meaning of words.

2. They help in determining the correct POS tag for the words in the sentence.

3. They ascribe grammatical functions to the words, differentiating a word from other words and determining the syntactic position of the word in the sentence.

4. They indicate the correct pronunciation of words and support correct syntactic analysis, which reduces problems for NLP applications such as text-to-speech or speech-to-text and removes semantic confusion for Arabic readers.

In addition, the last diacritical mark helps not only in determining the correct part-of-speech of the words in the sentence, but also in providing full information regarding the inflectional features of the sentence words.


7.2 Contributions

The contributions of this research to the field of NLP can be summarised as follows:

1. AMT: Arabic Morphosyntactic Tagger

This research has developed a POS tagger system called AMT (short for Arabic Morphosyntactic Tagger). AMT deals for the first time with partially-vocalised Arabic text. The main aim of AMT is to annotate the testing corpus by adding a POS tag or label to each word and to produce a POS tagged partially-vocalised Arabic text. It can also be used as a prerequisite tool for many NLP tasks, such as parsing and information retrieval systems.

2. A new tag set for Arabic

A new morphosyntactic tag set derived from the ancient Arabic grammar has been developed, based on the Arabic system of inflectional morphology. The tag set does not follow the traditional Indo-European tag sets based on Latin; instead it is based on the Semitic tradition of analysing language. These tags contain a large amount of information and add more linguistic attributes to the word. The Arabic tag set contains 161 detailed tags and 28 general tags covering the major Arabic POS classes and sub-classes, and was compiled and introduced in Chapter 4 of this work.

3. Partially-vocalised Arabic corpus

A new partially-vocalised Arabic corpus containing 20,000 Arabic words, chosen and extracted from different books for different school levels, has been compiled in this work and introduced in chapter 6. The corpus was tagged using the AMT system presented in this research. It will be made freely available to the public, both raw and tagged.


4. Pattern-based technique

AMT is a rule-based system with two rule components. The first component is the pattern-based rules: the trigger in the pattern-based technique depends on the pattern of the testing word. These patterns are associated with the last diacritical mark and are generated automatically. A novel algorithm (the Pattern-Matching Algorithm) has been designed and built in this work and introduced in chapter 5. The aim of this algorithm is to match the inflected word in the testing corpus with its pattern in the pattern lexicon. The second component is the lexical and contextual rules: the trigger in the contextual rules depends on the current word itself and the tags or words in the context of the current word, while the trigger in the lexical rules depends on the character(s), affixes, and diacritics of a word.

7.3 Future Works

During the course of this research, many areas that deserve further study were identified. These can be summarised as follows:

• Further research into expanding the pattern lexicon to contain all Arabic patterns.

• Improving the tagger by defining and encoding an additional set of Arabic tagging rules.

• Encoding the testing corpus with SGML markup.

• Producing the output in a standard format (e.g. XML); a small sketch of a possible output format follows this list.

• Evaluating the tagger and comparing its results with other tagger(s) that deal with partially-vocalised Arabic text.

• Building a lexicon containing all Arabic roots to enhance the performance of the tagger system and of the pattern-matching algorithm.
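As a hypothetical sketch of the XML output item above (the element and attribute names are invented for illustration and are not an existing AMT output schema):

    # Hypothetical sketch of XML output for (word, tag) pairs; element and
    # attribute names are invented for illustration.
    import xml.etree.ElementTree as ET

    def to_xml(tagged_tokens):
        """Serialise (word, tag) pairs as a small XML document."""
        root = ET.Element("taggedText")
        for word, tag in tagged_tokens:
            token = ET.SubElement(root, "token", attrib={"tag": tag})
            token.text = word
        return ET.tostring(root, encoding="unicode")

    print(to_xml([("ktba", "VePeMaSnThSj"), ("fy", "PrPp")]))
    # <taggedText><token tag="VePeMaSnThSj">ktba</token><token tag="PrPp">fy</token></taggedText>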

7.4 Summary

In conclusion, all the POS tagging systems for Arabic described in this work (see chapter 2) were built to tag unvocalised Arabic text. The AMT system presented in this work differs from the described systems in the following aspects:

• It is the first tagger that deals with partially-vocalised Arabic text.

• It is the first tagger that uses a purely rule-based approach and applies a novel technique, the pattern-based technique: the tag assigned to a word is based on the pattern of that word instead of the word itself.

• It does not need a lexicon (manually tagged or untagged) for training.

• It is the first tagger to investigate the role of the diacritic feature in the Arabic language.

The overall ambiguity in vocalised Arabic text appears to be lower than in unvocalised text. The last diacritical mark plays a great role in removing a great deal of lexical ambiguity when the text is at least partially-vocalised.

Appendix A

Tagset Appendices

A.1 General Tags

Tag   | Description       | Tag   | Description
VePe  | Perfect verb      | NuCd  | Conditional noun
VePi  | Imperfect verb    | NuDe  | Demonstrative noun
VePm  | Imperative verb   | NuIn  | Interrogative noun
NuPo  | Proper noun       | NuAd  | Adverb
NuCn  | Common noun       | NuNn  | Numeral noun
NuAj  | Adjective noun    | Fw    | Foreign noun
NuIf  | Infinitive noun   | Pun   | Punctuation mark
NuRe  | Relative noun     | PrPp  | Preposition
NuDm  | Diminutive noun   | PrVo  | Vocative Particle
NuIs  | Instrument noun   | PrCo  | Conjunction Particle
NuPn  | Noun of Place     | PrEx  | Exception Particle
NuTn  | Noun of Time      | PrAn  | Annulment Particle
NuPs  | Pronoun           | PrSb  | Subjunctive Particle
NuCv  | Conjunctive noun  | PrJs  | Jussive Particle


A.2 Detailed Tags

Tag Dsecription Arabic Transliteratior Translation


Example
VePeMaSnThSj Verb, Perfect, Masculine, Singular, y:5' ktba He Wrote
Third Person, Subjunctive

VePeMaSnFsDc Verb, Perfect, Masculine, Singular, W ktbtu I Wrote


First Person, Indicative

VePeMaSnSeSj Verb, Perfect, Masculine, Singular, ~ ktbta You(Sn,Ma)


First Person, Subjunctive Wrote
VePeFeSnSeJs Verb, Perfect, Feminine, Singular, ~
, . ktbti You(Sn,Fe)
Second Person, Jussive Wrote
VePeFeSnThls Verb, Perfect, Feminine, Singular, ~ Ktbtx She Wrote

Third Person, Jussive

VePeNeDuSeSj Verb, Perfect, Neuter, Dual, Second ~ ktbtmA You(Du) Wrote

Person, Subjunctive

VePeMaDuThSj Verb, Perfect, Masculine, Dual, t:r ktbA They (Du,Ma)

Third Person, Subjunctive Wrote

VePeFeDuThSj Verb, Perfect, Feminine, Dual, L:.O ktbtA They(Du,Fe)

Third Person, Subjunctive Wrote

VePeMaPlFsSj Verb, Perfect, Masculine, Plural, ~ ktbnA We Wrote

First Person, Subjunctive

VePeMaPISeJs Verb, Perfect, Masculine, Plural, or Ktbtmx You(PI,Ma)

Wrote
Second Person, Jussive

Verb, Perfect, Feminine, Plural, J.;:j Ktbtna You(PI,Fe) Wrote


VePeFePISeJ s
Second Person, Subjunctive

09 Ktbna They(pI,Fe)
VePeFePIThl s Verb, Perfect, Feminine, Plural,

Third Person, Subjunctive Wrote

They(PI,Ma)
VePeMaPIThDc Verb, Perfect, Masculine, Plural, ~ ktbwA
Wrote
Third Person, Indicative

180
A.2. DETAILED TAGS

VePeMaSnThDc Verb, Perfect, Masculine, Singular, 4..::5" Ktbhu He Wrote It


Third Person, Indicative

~
VePeNeSnFsls Verb, Perfect, Neuter, Singular, El mthmx I teach them
First Person, lussive

VePeMaPlThSj Verb, Perfect, Masculine, Plural, ~~ ElmwnA They teach us


Third Person, Subjunctive

VePeMaPlFsls Verb, Perfect, Masculine, Plural, ~~ ElmnAhmx We teach them


First Person, lussive
...
VePeMaSnThl s Verb, Perfect, Masculine, Plural,
~ El mnyx He teach me
Third Person, lussive
...
VePeMaPlThJ s Verb, Perfect, Masculine, Plural, j~ El mwnyx They teach me
-
Third Person, lussive

VePeFePlThSj Verb, Perfect, Feminine, Plural, ~ Elmkna They teach you

Third Person, Subjunctive

VePeNePISel s Verb, Perfect, Neuter, Plural, Sec- ~~ Elmtwhrnx You teach them

ond Person, lussive

VePeMaSnFsSj Verb, Perfect, Masculine, Singular, ~ El mthA You (Sn) teach

First Person, Subjunctive her


...
VePeMaPlSeSj Verb, Perfect, Masculine, Plural, lA~ ElmtwhA You (PI) teach her

Second Person, Subjunctive

VePeFePlThDc Verb, Perfect, Feminine, Plural, ~ Elmnhu They (Fe) teach

Third Person, Indicative him


...
Verb, Perfect, Feminine, Plural, lA~ ElmtmAhA You teach her
VePeFePlSeSj
Second Person, Subjunctive (Du)

VePeMaPlSeDc Verb, Perfect, Masculine, Plural, 0# Elmtmwhu You teach him

Second Person, Indicative


...
Verb, Perfect, Neuter, Plural, Third lA~ ElmtmwhA You teach her
VePeNePlThSj
Person, Subjunctive
...
They teach you
VePeMaPlThJ s Verb, Perfect, Masculine, Plural, tr~ Elmwkrnx

Third Person, Subjunctive

181
A.2. DETAILED TAGS

VePiMaSnFsSj Verb, Imperfect, Masculine, Singu-


. ~

~I OElmhA I teach her


lar, First Person, Subjunctive

VePiMaSnFsl s Verb, Imperfect, Masculine, Singu-


.. ~

~I OElmhrnx I teach them


lar, First Person, Subjunctive

VePiMaSnFsDc
. ~

Verb, Imperfect, Masculine, Singu- ~rl OElmu I teach


lar, First Person, Indicative

VePiMaPlFsSj Verb, Imperfect, Masculine, Plural, ~ nElmhA We teach her


First Person, Subjunctive

VePiMaPlFsl s Verb, Imperfect, Masculine, Plural,


~ nElmhrnx We teach them
First Person, Subjunctive

VePIMaPIFsDc Verb, Imperfect, Masculine, Plural, ~~ nElmu We teach


First Person, Indicative
. ~_
.
VePiMaDuThJs Verb, Imperfect, Masculine, Dual, U yElmAni They(Ma,Du)
~

Third Person, Subjunctive teach


.
VePiMaPIThSj Verb, Imperfect, Masculine, Plural, 0~ yElmwna They(Ma,Pl)

Third Person, Subjunctive teach


..
VePiFePIThSj Verb, Imperfect, Feminine, Plural, ~ yElmna They (Fe,PI)

Third Person, Subjunctive teach


.
VePiMaSnThSj Verb, Imperfect, Masculine, Singu- ~ yElmhA He teach her

lar, Third Person, Subjunctive

VePiMaSnThJs Verb, Imperfect, Masculine, Singu- ~ yElmhmx He teach them

lar, Third Person, Subjunctive

VePiMaSnThDc Verb, Imperfect, Masculine, Singu- ~r- yElmu He teach

lar, Third Person, Indicative


. You(Du) teach
VePiFeDuSels Verb, Imperfect, Feminine, Dual, 0L;..W tElmAni

Second Person, Subjunctive


. You(Pl) teach
VePiFeP1SeSj Verb, Imperfect, Feminine, Plural, ~ tElmna

Second Person, Subjunctive

VePiFeSnThDc Verb, Imperfect, Feminine, Singu- W,.; tElmhu You(Sn) teach

him
lar, Third Person, Indicative

182
A.2. DETAILED TAGS

VePiFeSnThSj | Verb, Imperfect, Feminine, Singular, Third Person, Subjunctive | tElmhA | You (Sn) teach her
VePiFeSnThJs | Verb, Imperfect, Feminine, Singular, Third Person, Jussive | tElmhmx | You (Sn) teach them
VePmMaSnSeJs | Verb, Imperative, Masculine, Singular, Second Person, Jussive | Ouktbx | You (Sn, Ma) write
VePmFeSnSeJs | Verb, Imperative, Feminine, Singular, Second Person, Jussive | Ouktbyx | You (Sn, Fe) write
VePmNeDuSeSj | Verb, Imperative, Neuter, Dual, Second Person, Subjunctive | OuktbA | You (Du) write
VePmFePlSeSj | Verb, Imperative, Feminine, Plural, Second Person, Subjunctive | Ouktbna | You (Pl, Fe) write
VePmMaPlSeSj | Verb, Imperative, Masculine, Plural, Second Person, Subjunctive | OuktbwA | You (Pl, Ma) write
PrPp | Preposition Particle | fy | In
PrVo | Vocative Particle | yA | Announcement
PrCo | Conjunction Particle | w | And
PrEx | Exception Particle | IlA | Except
PrAn | Annulment Particle | lA | Negation
PrSb | Subjunctive Particle | ln | Never
PrJs | Jussive/Elision Particle | lm | Never
Pr | Particle | IdhA | If
NuPsMaSnThAcId | Personal Noun, Masculine, Singular, Third Person, Accusative, Indefinite | hwa | He
NuPsNeDuThAcId | Personal Noun, Neuter, Dual, Third Person, Accusative, Indefinite | hmA | They (Dual)
NuPsMaPlThNmId | Personal Noun, Masculine, Plural, Third Person, Nominative, Indefinite | hmx | They (Pl, Ma)
NuPsFeSnThAcId | Personal Noun, Feminine, Singular, Third Person, Accusative, Indefinite | hya | She
NuPsFePlThAcId | Personal Noun, Feminine, Plural, Third Person, Accusative, Indefinite | hna | They (Pl, Fe)
NuPsMaSnSeAcId | Personal Noun, Masculine, Singular, Second Person, Accusative, Indefinite | Onta | You (Sn, Ma)
NuPsNeDuSeAcId | Personal Noun, Neuter, Dual, Second Person, Accusative, Indefinite | OntmA | You (Dual)
NuPsMaPlSeNmId | Personal Noun, Masculine, Plural, Second Person, Nominative, Indefinite | Ontmx | You (Pl, Ma)
NuPsFeSnSeGeId | Personal Noun, Feminine, Singular, Second Person, Genitive, Indefinite | Onti | You (Sn, Fe)
NuPsFePlSeAcId | Personal Noun, Feminine, Plural, Second Person, Accusative, Indefinite | Ontna | You (Pl, Fe)
NuPsNeSnFsAcId | Personal Noun, Neuter, Singular, First Person, Accusative, Indefinite | OnA | Me
NuPsNePlFsNmId | Personal Noun, Neuter, Plural, First Person, Nominative, Indefinite | nHnu | We
NuDeSnAcId | Demonstrative Noun, Singular, Accusative, Indefinite | h*A | This
NuDeDuGeId | Demonstrative Noun, Dual, Genitive, Indefinite | h*Ani | These (Dual)
NuDeSnGeId | Demonstrative Noun, Singular, Genitive, Indefinite | h*h | This (Sn)
NuDePlGeId | Demonstrative Noun, Plural, Genitive, Indefinite | hWlA' | These (Pl)
NuDe | Demonstrative Noun, Indefinite | hnAka | There
NuInId | Interrogative Noun, Indefinite | kyfa | How
NuCvSnId | Conjunctive Noun, Singular, Indefinite | Aldhy | Which/Who (Sn)
NuCvDuId | Conjunctive Noun, Dual, Indefinite | All*Ani | Which/Who (Du)
NuAdId | Adverbial Noun, Indefinite | fwqa | Over
NuVnId | Verbal Noun, Indefinite | hyA | Come on
NuCdId | Conditional Noun, Indefinite | mtY | When
NuNmId | Numeral Noun, Indefinite | wAHd | One
NuAjMsSnNmId | Adjective Noun, Masculine, Singular, Nominative, Indefinite | mElmN | Instructor
NuAjMsSnAcId | Adjective Noun, Masculine, Singular, Accusative, Indefinite | mElmAF | Instructor
NuAjMsSnGeId | Adjective Noun, Masculine, Singular, Genitive, Indefinite | mElmK | Instructor
NuAjMsSnNmDf | Adjective Noun, Masculine, Singular, Nominative, Definite | AlmElmu | Instructor (Ma, Sn)
NuAjMsSnAcDf | Adjective Noun, Masculine, Singular, Accusative, Definite | AlmElma | Instructor (Ma, Sn)
NuAjMsSnGeDf | Adjective Noun, Masculine, Singular, Genitive, Definite | AlmElmi | Instructor (Ma, Sn)
NuAjMsDuGeId | Adjective Noun, Masculine, Dual, Genitive, Indefinite | mElmAni | Instructor (Ma, Du)
NuAjMsDuGeDf | Adjective Noun, Masculine, Dual, Genitive, Definite | AlmElmAni | Instructor (Ma, Du)
NuAjFeSnNmId | Adjective Noun, Feminine, Singular, Nominative, Indefinite | mElmpN | Instructor (Fe, Sn)
NuAjFeSnAcId | Adjective Noun, Feminine, Singular, Accusative, Indefinite | mElmpF | Instructor (Fe, Sn)
NuAjFeSnGeId | Adjective Noun, Feminine, Singular, Genitive, Indefinite | mElmpK | Instructor (Fe, Sn)
NuAjFeSnNmDf | Adjective Noun, Feminine, Singular, Nominative, Definite | AlmElmpu | Instructor (Fe, Sn)
NuAjFeSnAcDf | Adjective Noun, Feminine, Singular, Accusative, Definite | AlmElmpa | Instructor (Fe, Sn)
NuAjFeSnGeDf | Adjective Noun, Feminine, Singular, Genitive, Definite | AlmElmpi | Instructor (Fe, Sn)
NuAjFeDuGeId | Adjective Noun, Feminine, Dual, Genitive, Indefinite | mElmtAni | Instructor (Fe, Du)
NuAjFeDuGeDf | Adjective Noun, Feminine, Dual, Genitive, Definite | AlmElmtAni | Instructor (Fe, Du)
NuAjFePlNmId | Adjective Noun, Feminine, Plural, Nominative, Indefinite | mElmAtN | Instructor (Fe, Pl)
NuAjFePlGeId | Adjective Noun, Feminine, Plural, Genitive, Indefinite | mElmAtK | Instructor (Fe, Pl)
NuAjFePlNmDf | Adjective Noun, Feminine, Plural, Nominative, Definite | AlmElmAtu | Instructor (Fe, Pl)
NuAjFePlAcDf | Adjective Noun, Feminine, Plural, Accusative, Definite | AlmElmAta | Instructor (Fe, Pl)
NuAjFePlGeDf | Adjective Noun, Feminine, Plural, Genitive, Definite | AlmElmAti | Instructor (Fe, Pl)
NuAjMaPlAcId | Adjective Noun, Masculine, Plural, Accusative, Indefinite | mElmwna | Instructor (Ma, Pl)
NuAjMaPlAcDf | Adjective Noun, Masculine, Plural, Accusative, Definite | AlmElmwna | Instructor (Ma, Pl)
NuIsMaSnNmId | Instrument Noun, Masculine, Singular, Nominative, Indefinite | mftAHN | Key
NuIsMaDuGeId | Instrument Noun, Masculine, Dual, Genitive, Indefinite | mftAHAni | (Two) Keys
NuIsMaPlNmId | Instrument Noun, Masculine, Plural, Nominative, Indefinite | mfAtyHN | Keys
NuIsMsSnNmDf | Instrument Noun, Masculine, Singular, Nominative, Definite | AlmftAHu | The Key
NuIsMsSnAcDf | Instrument Noun, Masculine, Singular, Accusative, Definite | AlmftAHa | The Key
NuIsMsSnGeDf | Instrument Noun, Masculine, Singular, Genitive, Definite | AlmftAHi | The Key
NuIsMaDuGeDf | Instrument Noun, Masculine, Dual, Genitive, Definite | AlmftAHAni | (Two) Keys
NuIsMaPlNmDf | Instrument Noun, Masculine, Plural, Nominative, Definite | AlmfAtyHu | Keys
NuIsMaPlAcDf | Instrument Noun, Masculine, Plural, Accusative, Definite | AlmfAtyHa | Keys
NuIsMaPlGeDf | Instrument Noun, Masculine, Plural, Genitive, Definite | AlmfAtyHi | Keys
NuDmSnNmId | Diminutive Noun, Singular, Nominative, Indefinite | mTyEmN | Restaurant
NuReMaSnNmId | Relative Noun, Masculine, Singular, Nominative, Indefinite | ArdnyN | Jordanian (Ma, Sn)
NuReFeSnNmId | Relative Noun, Feminine, Singular, Nominative, Indefinite | ArdnypN | Jordanian (Fe, Sn)
NuReMaDuGeId | Relative Noun, Masculine, Dual, Genitive, Indefinite | ArdnyAni | Jordanian (Ma, Du)
NuReFeDuGeId | Relative Noun, Feminine, Dual, Genitive, Indefinite | ArdnytAni | Jordanian (Fe, Du)
NuReMaPlAcId | Relative Noun, Masculine, Plural, Accusative, Indefinite | Ardnywna | Jordanian (Ma, Pl)
NuReFePlNmId | Relative Noun, Feminine, Plural, Nominative, Indefinite | ArdnyAtN | Jordanian (Fe, Pl)
NuReMaSnNmDf | Relative Noun, Masculine, Singular, Nominative, Definite | AlArdnyu | Jordanian (Ma, Sn)
NuReFeSnNmDf | Relative Noun, Feminine, Singular, Nominative, Definite | AlArdnypu | Jordanian (Fe, Sn)
NuReMaDuGeDf | Relative Noun, Masculine, Dual, Genitive, Definite | AlArdnyAni | Jordanian (Ma, Du)
NuReFeDuGeDf | Relative Noun, Feminine, Dual, Genitive, Definite | AlArdnytAni | Jordanian (Fe, Du)
NuReMaPlAcDf | Relative Noun, Masculine, Plural, Accusative, Definite | AlArdnywna | Jordanian (Ma, Pl)
NuReFePlNmDf | Relative Noun, Feminine, Plural, Nominative, Definite | AlArdnyAtu | Jordanian (Fe, Pl)
NuCnMaSnNmId | Common Noun, Masculine, Singular, Nominative, Indefinite | ktAbN | Book (Sn)
NuCnFeSnNmId | Common Noun, Feminine, Singular, Nominative, Indefinite | mdrspN | School (Sn)
NuCnMaSnAcId | Common Noun, Masculine, Singular, Accusative, Indefinite | ktAbF | Book (Sn)
NuCnFeSnAcId | Common Noun, Feminine, Singular, Accusative, Indefinite | mdrspF | School (Sn)
NuCnMaSnGeId | Common Noun, Masculine, Singular, Genitive, Indefinite | ktAbK | Book (Sn)
NuCnFeSnGeId | Common Noun, Feminine, Singular, Genitive, Indefinite | mdrspK | School (Sn)
NuCnMaDuGeId | Common Noun, Masculine, Dual, Genitive, Indefinite | ktAbAni | Books (Du)
NuCnFeDuGeId | Common Noun, Feminine, Dual, Genitive, Indefinite | mdrstAni | Schools (Du)
NuCnFePlGeId | Common Noun, Feminine, Plural, Genitive, Indefinite | mdArsi | Schools (Pl)
NuCnFePlAcId | Common Noun, Feminine, Plural, Accusative, Indefinite | mdArsa | Schools (Pl)
NuCnFePlNmId | Common Noun, Feminine, Plural, Nominative, Indefinite | mdArsu | Schools (Pl)
NuCnMaPlNmId | Common Noun, Masculine, Plural, Nominative, Indefinite | ktbu | Books (Pl)
NuCnMaPlAcId | Common Noun, Masculine, Plural, Accusative, Indefinite | ktba | Books (Pl)
NuCnMaPlGeId | Common Noun, Masculine, Plural, Genitive, Indefinite | ktbi | Books (Pl)
NuCnMaSnNmDf | Common Noun, Masculine, Singular, Nominative, Definite | AlktAbu | Book (Sn)
NuCnMaSnAcDf | Common Noun, Masculine, Singular, Accusative, Definite | AlktAba | Book (Sn)
NuCnMaSnGeDf | Common Noun, Masculine, Singular, Genitive, Definite | AlktAbi | Book (Sn)
NuCnFeSnNmDf | Common Noun, Feminine, Singular, Nominative, Definite | Almdrspu | School (Sn)
NuCnFeSnAcDf | Common Noun, Feminine, Singular, Accusative, Definite | Almdrspa | School (Sn)
NuCnFeSnGeDf | Common Noun, Feminine, Singular, Genitive, Definite | Almdrspi | School (Sn)
NuCnMaDuGeDf | Common Noun, Masculine, Dual, Genitive, Definite | AlktAbAni | Books (Du)
NuCnFeDuGeDf | Common Noun, Feminine, Dual, Genitive, Definite | AlmdrstAni | Schools (Du)
NuCnFePlGeDf | Common Noun, Feminine, Plural, Genitive, Definite | AlmdArsi | Schools (Pl)
NuCnFePlAcDf | Common Noun, Feminine, Plural, Accusative, Definite | AlmdArsa | Schools (Pl)
NuCnFePlNmDf | Common Noun, Feminine, Plural, Nominative, Definite | AlmdArsu | Schools (Pl)
NuCnMaPlNmDf | Common Noun, Masculine, Plural, Nominative, Definite | Alktbu | Books (Pl)
NuCnMaPlAcDf | Common Noun, Masculine, Plural, Accusative, Definite | Alktba | Books (Pl)
NuCnMaPlGeDf | Common Noun, Masculine, Plural, Genitive, Definite | Alktbi | Books (Pl)
NuIf | Infinitive Noun | swmN | Fasting
NuPo | Proper Noun | ramzy | Proper Noun
NuPn | Noun of Place | mtbxN | Kitchen
NuTn | Noun of Time | mwEdN | Engagement
Appendix B

The Arabic Language Orthography

B.1 Arabic words and the Roman alphabet


The issue of the transliteration and transcription codes used to describe the Arabic language with the Roman alphabet, in order to give a reader unfamiliar with the language sufficient information for accurate pronunciation, is still unresolved. Marshall Hodgson ([70], p. 4) defines transliteration as "the rendering of the spelling of a word from the script of one language into another language", and transcription as "the rendering of the sound of a word so that a reader can pronounce it". For example, the transliteration of the Arabic word كتب may be 'ktb', while one of its transcriptions is "kataba" and another may be "kutub".1

Many different approaches and a variety of ways for transliterating (romanizing) the Arabic language have been developed. Some of these transliteration systems are listed below2:

• Deutsche Morgenländische Gesellschaft (1936): adopted by the International Convention of Orientalist Scholars in Rome.

• Romanization Tables adopted by the US Library of Congress and the American Library Association for cataloguing books (ALA-LC).

• ISO 233, published by the International Standards Organisation, and BS 4280:1968, produced by the British Standards Institute.


1 Since the word كتب is unvocalised, the problem of pronouncing the word may arise.
2 For more information: http://www.al-bab.com/arab/language/roman1.htm
See also: http://en.wikipedia.org/wiki/

• UNGEGN: United Nations Romanization System for Geographical Names.

• Romanization, Transcription and Transliteration, by Kenneth R. Beesley (Xerox company).

• The Buckwalter Transliteration System.

• The Al-kitaab Transliteration System, published by Kristen Brustad, Mahmoud Al-Batal, and Abbas Al-Tonsi.

• The Standard Arabic Technical Transliteration System (SATTS).

• DIN 31635, developed by the Deutsches Institut für Normung (German Institute for Standardization).

• SAS: Spanish Arabists School (José Antonio Conde and others).

• BGN/PCGN 1956: Romanization System for Arabic.

Unfortunately, none of the systems described above is a universal standard for the transliteration and transcription of the Arabic language. All the systems described above suffer from several difficulties: they use special characters or add special marks to normal characters, which makes these systems difficult to memorise, and most of them cannot be used easily with a standard computer keyboard. On the other hand, only a few Arabic letters have a clear equivalent in the Roman alphabet (B, F, K, L, M, N, R, and Z)3.

Due to the difficulties described above, each person tends to use their own standard. Throughout this thesis we use a transliteration system compounded from the Buckwalter and Al-kitaab transliteration systems, with a small update of our own to transliterate the diacritical marks described in Table B.4.

B.2 Arabic alphabet and other diacritical marks


The alphabet of the Arabic language consists of 28 letters. Unlike European languages, there is no separate printed form of the letters. Arabic script is cursive and is written from right to left [76].

Table B.1 shows the various forms of the Arabic letters and the transliteration of each letter, which has been used to transliterate Arabic words throughout this thesis.

In addition, the Arabic language has the hamza (glottal stop) consonant, which can also occur on the alif, waaw, or yaay consonants, as well as the Ta Marboota; these letters are shown in Table B.2.

No. | Name | Transliteration | Pronunciation (as in)
1 | Alif | A | man
2 | baa | b | back
3 | taa | t | tablet
4 | thaa | th | throw
5 | jiim | J | john
6 | Haa | H | Hat
7 | khaa | kh |
8 | daal | d | dad
9 | dhaal | dh |
10 | raa | r | rush
11 | zaay | z | crazy
12 | siin | s | sun
13 | shiin | sh | shadow
14 | Saad | S | Suffix
15 | Daad | D |
16 | Taa | T |
17 | DHaa | DH |
18 | ayn | E |
19 | ghayn | gh |
20 | faa | f | fat
21 | qaaf | q | quick
22 | kaaf | k |
23 | laam | l | laptop
24 | miim | m | mark
25 | nuun | n | novel
26 | haa | h | hassle
27 | waaw | w | welcome
28 | yaay | y | young

Table B.1: Arabic Alphabet

Name | Transliteration
Hamza | '
hamza above Alif | O
hamza below Alif | I
hamza above waaw | W
hamza above yaay | }
Ta Marboota | p
Alif Maqsoura | Y

Table B.2: Hamza (glottal stop) with Alif, waaw, and yaay consonants

Furthermore, Table B.3 shows the transliteration system for the short-vowel diacritical marks, while the transliteration of the other diacritical marks (nunation, sukun and gemination) is described in Table B.4.

Name | Transliteration | Pronunciation
Fatha sign | a | /a/
Damma sign | u | /u/
Kasra sign | i | /i/

Table B.3: Arabic short vowels

Name | Transliteration | Pronunciation
Tanween fath | an | /an/
Tanween damm | un | /un/
Tanween kasr | in | /in/
Sukun | x |
Shadda | ~ |

Table B.4: Other diacritical marks (nunation, sukun, gemination) in Arabic

3 For more information: http://www.al-bab.com/arab/language/roman1.htm
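To make the mapping concrete, the following is a minimal sketch of how the transliteration scheme of Tables B.1-B.4 could be represented in code. It covers only a handful of letters and diacritics; the dictionary contents follow the tables above, but the function name and the character coverage are our own illustrative choices rather than part of the AMT system.

# A minimal sketch (not the AMT implementation) of the transliteration scheme
# of Tables B.1-B.4, covering only a small sample of letters and diacritics.
ARABIC_TO_ROMAN = {
    "\u0627": "A",   # alif
    "\u0628": "b",   # baa
    "\u062A": "t",   # taa
    "\u0643": "k",   # kaaf
    "\u0644": "l",   # laam
    "\u0645": "m",   # miim
    "\u0629": "p",   # Ta Marboota
    "\u064E": "a",   # fatha
    "\u064F": "u",   # damma
    "\u0650": "i",   # kasra
    "\u0652": "x",   # sukun
}

def transliterate(word):
    """Map each Arabic character to its Roman symbol, keeping unknown ones as-is."""
    return "".join(ARABIC_TO_ROMAN.get(ch, ch) for ch in word)

if __name__ == "__main__":
    # kaf + fatha, taa + fatha, baa + fatha: the fully vocalised "he wrote".
    print(transliterate("\u0643\u064E\u062A\u064E\u0628\u064E"))   # -> kataba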

Appendix C

Lexical and Contextual Rules

C.1 Names and description of lexical rules

Rule Name Description


CWD The current word
CWDLM The last diacritical mark of the current word
F1CHCWD The first character of the current word
F2CHCWD The first two characters of the current word
L2CHCWD The last two characters of the current word
F3CHCWD The first three characters of the current word
L3CHCWD The last three characters of the current word

C.2 Lexical Rule Examples


If CWDLM is Tanween Damm, then the tag is NuCnNmId.
If CWDLM is Tanween Fath, then the tag is NuCnAcId.
If CWDLM is Tanween Kasr, then the tag is NuCnGeId.
If L3CHCWD matches the relative (nisba) ending, then the tag is NuRe.
If F3CHCWD is the prefix "wAl" (conjunction w + definite article Al) and CWDLM is the Kasra mark, then the tag is PrCo+NuCnGeDf.
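To illustrate how lexical rules of this kind could be applied, the sketch below encodes a few of the examples above as (condition, tag) pairs over the transliterated word form. It is a simplified, hypothetical rendering, not the AMT rule engine: detecting the last diacritical mark through the transliteration letters N, F, K and i is an assumption made for this example.

# A simplified, hypothetical sketch of applying lexical rules of the kind listed
# in Section C.2 (not the actual AMT rule engine). Words are assumed to be given
# in the thesis transliteration, where N, F, K and i stand for Tanween Damm,
# Tanween Fath, Tanween Kasr and Kasra respectively.
LEXICAL_RULES = [
    (lambda w: w.endswith("N"), "NuCnNmId"),                               # CWDLM = Tanween Damm
    (lambda w: w.endswith("F"), "NuCnAcId"),                               # CWDLM = Tanween Fath
    (lambda w: w.endswith("K"), "NuCnGeId"),                               # CWDLM = Tanween Kasr
    (lambda w: w.startswith("wAl") and w.endswith("i"), "PrCo+NuCnGeDf"),  # F3CHCWD + Kasra
]

def apply_lexical_rules(word):
    """Return the tag of the first matching rule, or None if no rule fires."""
    for condition, tag in LEXICAL_RULES:
        if condition(word):
            return tag
    return None

if __name__ == "__main__":
    print(apply_lexical_rules("ktAbN"))       # -> NuCnNmId      (a book, nominative)
    print(apply_lexical_rules("wAlktAbi"))    # -> PrCo+NuCnGeDf (and the book, genitive)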

C.3 Names and description of contextual rules

Rule Name Description


PWD The preceding word
PWDTAG The preceding tag

C.4 Examples of contextual rules


If PWDTAG is PrPp, then the tag is NuCnGeId.
If PWD is a particular trigger word, then the tag is NuCn.
If PWD is one of several specific trigger words, then the tag is NuPo.
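A contextual rule can be sketched in the same spirit, this time inspecting the tag already assigned to the preceding word. Again this is a hypothetical illustration rather than the AMT implementation.

# A hypothetical sketch of a contextual rule of the kind listed in Section C.4
# (not the actual AMT code): a common noun directly preceded by a preposition
# particle (PrPp) is tagged as genitive.
def apply_contextual_rules(prev_tag, current_tag):
    """Refine or supply the current word's tag using the preceding word's tag."""
    if prev_tag == "PrPp" and (current_tag is None or current_tag.startswith("NuCn")):
        return "NuCnGeId"   # a noun governed by a preposition takes the genitive case
    return current_tag

if __name__ == "__main__":
    # "fy mdrspK" ("in a school"): the word after the preposition is genitive.
    print(apply_contextual_rules("PrPp", None))   # -> NuCnGeId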

Appendix D

Permission for Collecting Testing Corpus

To whom it may concern

This is to certify that Mr. Shihadeh Alqrainy requested our curricula and textbooks to use for his PhD research in De Montfort University, UK.
We, Managing of Curricula and Textbooks - Ministry of Education, Jordan, agreed to grant Mr. Alqrainy our permission and authorization to use our curricula and textbooks to create a vocalised Arabic corpus and for it to be available (free) for any scientific research in the future.

Dr. Fawaz Jaradat
Managing Director of Curricula and Textbooks
Ministry of Education - JORDAN
Word-Class Tagger and Tagset Design for Vocalized Arabic Text
SHIHADEH ALQRAINY, ALADDIN AYESH
Centre for Computational Intelligence (CCI) - School of Computing
De Montfort University, Leicester - The Gateway, UNITED KINGDOM
{alqrainy, aayesh}@dmu.ac.uk

Abstract: - The Arabic language has a valuable and important feature, called diacritics, which are marks placed over and below the letters of the Arabic word. This feature plays a great role in adding linguistic attributes to Arabic words and in indicating the pronunciation and grammatical function of the words. It enriches the language syntactically while removing a great deal of morphological and semantic ambiguity. This paper presents a diacritics rule-based part-of-speech (POS) tagger which automatically tags a partially vocalized Arabic text. The aim is to remove ambiguity and to enable an accurate, fast automated tagging system. A tagset has been designed in support of this system. Tagset design is at an early stage of research related to automatic morphosyntactic annotation in the Arabic language. Preliminary results of the tagset design are reported in this paper.

Key-Words: - Arabic Language, Part-Of-Speech (POS), Diacritics, Tagset, Morphological, Syntactical

1 Introduction

The Arabic language is syntactically and morphologically a rich language, which means several words and meanings can be derived from the same word, leading to ambiguity. The ambiguity of Arabic lies on 3 different levels: the core word level, the derived word forms and the agglutinative forms of words [1]. In this paper we exploit the effect of vocalization, which is considered one of the Arabic language's distinctive features, on the tagging process. It is envisaged that the use of vocalization will increase the speed of the tagging process without sacrificing accuracy. Indeed, the use of vocalization, as we demonstrate in this paper, will reduce the ambiguity of the parsed text.
The paper starts with a brief summary of the Arabic language, followed by diacritics in the Arabic language. The tagset design and the uses and benefits of tagging systems are highlighted. Then, we present our tagging system architecture and our diacritical rule-based approach. Finally, analyses of experiment results are presented with future work and conclusion.

2 Arabic Language

2.1 Background
The Arabic language is spoken in more than 20 countries, from Egypt to Morocco and throughout the Arabian Peninsula. It is the native language of over 195 million people. Plus, at least another 35 million speak Arabic as a second language.
Modern Standard Arabic (MSA) is the official language throughout the Arab world, and its written form is relatively consistent across national boundaries. MSA is used in official documents, in educational settings, and for communication between Arabs of different nationalities. However, the spoken forms of Arabic vary widely, and each Arab country has its own dialect. Dialects are spoken in most informal settings, such as at home, with friends, or while shopping.
The Arabic language belongs to the Semitic family of languages, and, like Hebrew, is written from right to left. Arabic has been a literary language since the 6th century A.D., and is the liturgical language of Islam in its classical form.
The Arabic writing system is quite different from the English system. The Arabic alphabet consists of 28 letters that change shape depending on their position within a word and the letters by which they are surrounded. Some Arabic letters must be connected to other letters; others may stand alone. Arabic vowels are indicated by marks (diacritics) above and below the consonants. In many cases these diacritics play the role of vowels in English and thus influence pronunciation. Additionally, there are no special forms, such as the use of capital letters in English [17].

2.2 Diacritics in Arabic
The Arabic language has a valuable and important feature, called diacritics, which are marks placed over and below the letters of the Arabic word. This feature plays a great role in adding linguistic attributes to Arabic words and in indicating the pronunciation and grammatical function of the
words. It is particularly of interest for the purpose of this paper. Table 1 shows Arabic vowel diacritics.
The pronunciation of words in diacritized languages cannot be fully determined by spelling their characters only; special marks are put above or below the characters to determine the correct pronunciation. They also indicate the grammar function of the word within the context of the sentence [2].

Name | Written | Pronunciation (example)
Fatha | above the consonant | ba
Damma | above the consonant | bu
Kasra | below the consonant | bi
Tanween Fatha | above the consonant | ban
Tanween Damm | above the consonant | bun
Tanween Kasr | below the consonant | bin
Shadda | above the consonant | bb
Sukun | above the consonant | b

Table 1: Arabic vowel diacritics

In Arabic, short vowels are not a part of the Arabic alphabet; instead they are written as marks over or below the consonant. They are used in both the Noun and the Verb in the Arabic language. They indicate the case of the noun and the mood of the verb.
Many words are in general ambiguous in their part-of-speech, for various reasons. In English, for example, a word such as "make" can be a Verb or a Noun. In Arabic there are ambiguities as well. For example, the word "ذهب", which means either "go" or "gold", can be a Verb or a Noun.
Diacritics are used to prevent misunderstandings, to determine the correct pronunciation, to reduce the ambiguity, and to indicate grammatical functions. These functions play a great role in removing ambiguity and enabling an accurate, fast automated tagging system.
To remove ambiguity and to determine the correct tag of the word "ذهب" in the above example, adding the short vowel (Fatha sign) to the last letter of the word, so that it becomes "ذهبَ", is enough to get the correct tag [Verb] without any ambiguity and without regard to the context.

3 Arabic Tagset and EAGLES guidelines
A tag is a code which represents some feature or set of features and is attached to a segment in a text. Single or complex information is carried by a tag. The development of a tagset to support a diacritical-based tagging system is at an early stage. The need for such a tagset comes from the fact that there is no standardized and comprehensive Arabic tagset.
EAGLES [16] guidelines outline a set of features for tagsets; these guidelines were designed to help standardise tagsets for what were then the official languages of the European Union. EAGLES tags are defined as sets of morphosyntactic attribute-value pairs (e.g. Gender is an attribute that can have the values Masculine, Feminine or Neuter). The tagset discussed here is not being developed in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora. Arabic is very different from the languages for which EAGLES was designed, and belongs to the Semitic family rather than the Indo-European one. Following a normalised tagset and the EAGLES recommendations would not capture some of Arabic's relevant information, such as the jussive mood of the verb and the dual number that are integral to Arabic. Another important aspect of Arabic is inheritance, where all subclasses of words inherit properties from the classes from which they are derived. For example, all subclasses of the noun inherit the Nunation when in the indefinite, which is one of the main properties of the noun [14].

3.1 Previous work on POS tagsets
There are small numbers of popular tagsets for English, such as: the 87-tag tagset used by the Brown Corpus, the 45-tag Penn Treebank tagset and the 61-tag C5 tagset [3]. For Arabic also a very small number of tagsets had been built. El-Kareh and Al-Ansary [10] described a tagset in which they classify the words into three main classes: Verbs are sub-classified
into 3 subclasses, Nouns into 46 subclasses and Particles into 23 subclasses. Shereen Khoja [14] described a more detailed tagset. Her tagset contains 177 tags: 57 Verbs, 103 Nouns, 9 Particles, 7 residual and 1 punctuation.

3.2 Proposed Arabic Tagset
We have based our Arabic tagset on the inflectional morphology system. The traditional description of the Arabic grammarians is taken as a base to create the linguistic categories of the Arabic tagset. Arabic grammarians describe Arabic as being derived from three main categories: noun, verb and particle. Figure 1 shows the tagset hierarchy.

Fig. 1: Tagset Hierarchy.

The tagset has the following main formula:
[T, S, G, N, P, M, C, F], where:
T (Type) = {Verb, Noun, Particle}
S = Sub-Class {Common, Demonstrative, Relative, Personal, Adverb, Diminutive, Instrument, Conjunctive, Interrogative, Proper and Adjective}
G (Gender) = {Masculine, Feminine, Neuter}
N (Number) = {Singular, Plural, Dual}
P (Person) = {First, Second, Third}
M (Mood) = {Indicative, Subjunctive, Jussive}
C (Case) = {Nominative, Accusative, Genitive}
F (State) = {Definite, Indefinite}

Figure 2 shows the abbreviations which were used to define the words in our tagset.
Let us try to explain the symbols of the tagset formula for a moment. The symbols [T, S, G, N, P, M] are considered as linguistic attributes for the class Verb, while the symbols [T, S, G, N, P, C, F] are considered as linguistic attributes for the class Noun. For example, the word "كتب", which means "he wrote", has the following tag [VePeMaSnThSj], which means [Perfect Verb, Masculine Gender, Singular Number, Third Person, Subjunctive Mood].

Word / Abbreviation:
Verb Ve, Noun Nu, Particle Pr, Perfect Pe, Imperfect Pi, Imperative Pm, Common Cn, Adjective Aj, Demonstrative De, Relative Re, Personal Ps, Diminutive Dm, Instrument Is, Proper Pn, Adverb Ad, Interrogative In, Conjunction Cj, Preposition Pp, Vocative Vo, Conjunction Co, Exception Ex, Annulment An, Subjunctive Sb, Masculine Ma, Feminine Fe, Neuter Ne, Singular Sn, Plural Pl, Dual Du, First Fs, Second Se, Third Th, Indicative Dc, Subjunctive Sj, Jussive Js, Nominative Nm, Accusative Ac, Genitive Ge, Definite Df, Indefinite Id.

Fig. 2: Tagset Abbreviations

4 Part-Of-Speech Tagging

4.1 Related Work
Part-of-speech tagging is the process of assigning a part-of-speech or other syntactic class marker to each word in a corpus [3]. A tagger is necessary for many applications, such as speech synthesis systems, speech recognition systems, information retrieval (IR) and parsing systems. Many techniques have been used to tag English and other European language corpora. Greene and Rubin [4] developed the first rule-based technique to tag the Brown Corpus. Eric Brill [5] took an interest in rule-based tagging. Garside [15] used a hidden Markov Model to develop
the CLAWS tagger. More recently, taggers that use a combination of both statistical and rule-based [6], machine learning [7] and neural network [8, 9] techniques have been developed.
In terms of Arabic, a small number of popular part-of-speech (POS) taggers have been developed. El-Kareh and Al-Ansary [10] described a hybrid semi-automatic tagger that uses both morphological rules and statistical techniques in the form of hidden Markov models. Abuleil and Evens [11] describe a system for building an Arabic lexicon automatically by tagging Arabic newspaper text. Shereen Khoja [12] described an Arabic part-of-speech tagger called APT that uses statistical and rule-based techniques. Diab, Mona et al. [13] presented a Support Vector Machine (SVM) based approach to automatically tokenize and part-of-speech tag Arabic text.

4.2 Proposed Arabic POS Tagging System
Our tagger is called AWTS - short for Arabic Word-Tagging System - and its main function is to take as input untagged Arabic text and produce a POS tagged text. An overview of AWTS can be seen in Figure 3.

Fig. 3: An Overview of AWTS

The description of the AWTS modules is shown in Fig. 4.

Fig. 4: AWTS Modules

During tagging, the Arabic token is first looked up in the Full-Patterns lexicon, which contains the full patterns (prefixes + forms + suffixes) of most Arabic words. We introduce an algorithm describing how we match the tokens with their patterns. The pseudo code of the proposed algorithm is described in section 4.3. If the pattern of the token is found, then it is assigned the most likely tag of the word. If not, the word is then passed to the morpho-syntactic rules module to apply some linguistic rules to extract the most likely tag of the word. Some of these rules are shown in section 4.3.

4.3 Proposed Approach
The proposed approach consists of two parts: the Pattern-Base Approach and Linguistic Rules.

The Pattern-Base Approach is based on full patterns with diacritics. The Arabic language has a rich morphological system that contains a lot of patterns. These patterns assign the part-of-speech tag of the Arabic word. Some of the patterns belong to the Verb class, while the others belong to the Noun class. Particles have no patterns in the Arabic language. We generate automatically a Full-Patterns lexicon by collecting the prefixes, forms and suffixes of most Arabic words.

Algorithm-1 shows the pseudo code describing how we match the tokens with their patterns. Figure 5 shows an example of how to trace the steps of the algorithm to match the pattern for the word shown in Fig. 5.

Let P = Full-Pattern, W = Inflected Word, T = Tag.

Step-1: Return all P from the lexicon where Len(P) = Len(W). Store the results (the number of patterns) in N.
Step-2: For I = 1 to N:
            compute the number of identical letters between P(I) and W; store the result in Sim.
        Next I.
Step-3: Return all P which have the maximum Sim. Store the results in M.
Step-4: For J = 1 to M:
Step-5:     convert each of the template root letters (fa, ayn, lam) in P(J) to the corresponding letters of W.
Step-6:     If P(J) = W, then return P(J), T(J) and go to Step-8.
Step-7: Next J.
Step-8: Exit.

Algorithm-1: Pattern-Match-Algorithm
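To make the pseudo code concrete, the following is a runnable sketch of the Pattern-Matching Algorithm under our own assumptions: the three-entry pattern lexicon, its transliterated patterns and the associated tags are illustrative stand-ins for the real full-pattern lexicon, and the template root letters are written here as the Roman letters f, E and l rather than in Arabic script.

# A runnable sketch of the Pattern-Matching Algorithm (Algorithm-1) under our
# own assumptions; the tiny lexicon below is illustrative only.
PATTERN_LEXICON = {
    "faEala":  "VePeMaSnThSj",   # perfect-verb pattern, e.g. "kataba" (he wrote)
    "fAEilN":  "NuAjMsSnNmId",   # active-participle pattern, e.g. "kAtibN" (a writer)
    "mifEAlN": "NuIsMaSnNmId",   # instrument-noun pattern, e.g. "miftAHN" (a key)
}
ROOT_SLOTS = set("fEl")          # template root letters, written here in Roman form

def similarity(pattern, word):
    """Count the positions at which the pattern and the word share the same letter."""
    return sum(p == w for p, w in zip(pattern, word))

def match_pattern(word):
    """Return (instantiated pattern, tag) for the best-matching pattern, or None."""
    # Step 1: candidate patterns with the same length as the word.
    candidates = [p for p in PATTERN_LEXICON if len(p) == len(word)]
    if not candidates:
        return None
    # Steps 2-3: keep only the patterns with the maximum letter overlap (Sim).
    best = max(similarity(p, word) for p in candidates)
    finalists = [p for p in candidates if similarity(p, word) == best]
    # Steps 4-7: fill the root slots with the word's letters and compare.
    for p in finalists:
        filled = "".join(w if c in ROOT_SLOTS else c for c, w in zip(p, word))
        if filled == word:
            return filled, PATTERN_LEXICON[p]
    return None

if __name__ == "__main__":
    print(match_pattern("kataba"))   # -> ('kataba', 'VePeMaSnThSj')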
Fig. 5: Matching-Pattern Example

Linguistic Rules use syntactical information and morphological information, without regard to context, and lookup tables to assign the most likely tag to each unknown and ambiguous word in the text.
Some of these rules are listed below as examples. Consider W = the word, T = the tag.

Rule-1: If W ends with one of the relative (nisba) suffixes, then T = [NuRe]. For example, the word "AlArdny" ("Jordanian").
Rule-2: If W ends with one of certain other characters, then T = [NuCn]. For example, the word "rjl" ("Man").

5 Results
We tested our system to tag the words using partially diacritized documents from the holy Qur'an and another set chosen randomly from the proceedings of the Saudi Arabian National Computer Conference and other resources.
We ran our system on a group of these documents. The accuracy of our system has been calculated for tagging the words. The total accuracy is about 81%, with 19% errors. Some errors of the system came from Arabized words which are translated as pronounced from other international languages; these words do not have a root and a pattern. Others came from irregular verbs. Also, some words in the Arabic language are considered primitive verbs. These words were not tagged correctly and need special treatment.

6 Conclusion and Future Works
In this paper, we presented a diacritics rule-based part-of-speech (POS) tagger which automatically tags a partially vocalized Arabic text. Also, we described a morphosyntactic tagset that is derived from the ancient Arabic grammar, which is based on the Arabic system of inflectional morphology. The tagset does not follow the traditional Indo-European tagset that is based on Latin but is instead based on the Semitic tradition of analysing language. These tags contain a large amount of information and add more linguistic attributes to the word. Also, we are currently collecting many rules to reduce the amount of errors and expanding our tagset to cover most categories of words in the Arabic language.
It is clear that the overall ambiguity in a vocalised text is quite lower than in an unvocalised text. Diacritics are used to prevent misunderstandings and reduce the ambiguity; diacritics play a great role in speeding up the tagging process without sacrificing accuracy and remove a great deal of morpho-lexical ambiguity when the text is partially diacritized.

References
[1] M. V. Mol, "The semi-automatic tagging of Arabic corpora," COLING 94, USA, 1994.
[2] M. A. Elaraby, "A large scale computational processor of the Arabic morphology and applications," Master's thesis, Cairo University, Egypt, 2000.
[3] D. Jurafsky and J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, USA, 2000.
[4] B. Greene and G. Rubin, "Automatic grammatical tagging of English," Department of Linguistics, Brown University, Providence, RI, USA, 1971.
[5] E. Brill, "A simple rule-based part of speech tagger," Proceedings of the Twelfth International Conference on AI (AAAI-94), Seattle, WA, 1992.
[6] S. J. DeRose, "Grammatical category disambiguation by statistical optimization," Computational Linguistics 14(1), 31-39, 1988.
[7] B. Daelemans and Gills, "A memory-based part of speech tagger generator," Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen, Denmark, pp. 14-27, 1996.
[8] N. G. Marques, "A neural network approach to part-of-speech tagging," Proceedings of the Second Workshop on Spoken and Written Portuguese, Curitiba, Brazil, pp. 1-9, 1996.
[9] H. Schmid, "Part-of-speech tagging with neural networks," Proceedings of COLING-94, pp. 172-176, 1994.
[10] El-Kareh and Al-Ansary, "An Arabic interactive multi-feature POS tagger," In Proceedings of the ACIDCA conference, Monastir, Tunisia, pp. 204-210, 2000.
[11] S. Abuleil and M. Evens, "Discovering lexical information by tagging Arabic newspaper text," Workshop on Semitic Language Processing, COLING-ACL98, University of Montreal, Montreal, PQ, Canada, Aug 16 1998, pp. 1-7.
[12] S. Khoja, "APT: Arabic part-of-speech tagger," Proceedings of the Student Workshop at the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL2001), Carnegie Mellon University, Pittsburgh, Pennsylvania, June 2001, no. 2.
[13] K. H. Diab, Mona and D. Jurafsky, "Automatic tagging of Arabic text: From raw text to base phrase chunks," Proceedings of HLT-NAACL, 2004.
[14] Garside, Khojah and Knowels, "A tagset for the morphosyntactic tagging of Arabic," Paper presented at Corpus Linguistics 2001, Lancaster University, Lancaster, UK, March 2001.
[15] Roger Garside, Geoffrey Leech, and Geoffrey Sampson (1987), The Computational Analysis of English: A Corpus-Based Approach. Longman Group UK Limited.
[16] Leech G, Wilson A (1996), Recommendations for the Morphosyntactic Annotation of Corpora, EAGLES Report.
[17] Transparent Language, http://www.transparent.com/
Developing a tagset for automated POS tagging in Arabic

SHIHADEH ALQRAINY and ALADDIN AYESH

Centre for Computational Intelligence (CCI) - School of Computing
De Montfort University
Leicester - The Gateway
UNITED KINGDOM
{alqrainy, aayesh}@dmu.ac.uk

Abstract: - The Arabic language carries much syntactical and morphological information. Diacritics, which are marks placed over and below the letters of the Arabic word, play a great role in adding linguistic attributes to the Arabic word in a part-of-speech tagging system. This paper describes a tagset that was built based on the inflectional morphology system derived from traditional Arabic grammatical theory. The tagset developed represents an early stage of research related to automatic morphosyntactic annotation in the Arabic language. This paper aims to present a general tagset for use in a diacritics-based automated tagging system that is under development by the author.

Key-Words: - Part-of-Speech (POS), Arabic Language, Tagset, Diacritics, Syntactical, Morphological.

1 Introduction different nationalities. However, the spoken forms of


.-\ tag is a code which represents some features or set of Arabic vary widely, and each Arab country has its own
:'eatures and is attached to the segment in a text. Single dialect. Dialects are spoken in most informal settings,
or complex infonnation are carried by a tag [8]. In the such as at home, with friends, or while shopping.
:}Se of POS Tagging, a POS tagset to categories and The Arabic language belongs to the Semitic family of
Jark up the words of the target text is an absolutely languages, written from right to left. Arabic has been a
:~cessary preliminary [3]. The development of a tagset literary language since the 6th century A.D., and is the
:0 support diacritical based tagging system is at early liturgical language of Islam in its classical form.
):3.ge. Little work has been done in developing Arabic The Arabic writing system is quite different from the
Jgset. The need for such a tagset comes from the fact English system. The Arabic alphabet consists of 28
that there is no standardized and comprehensive Arabic letters that change shape depending on their position
tagset. within a word and the letters by which they are
:ill overview of Arabic language followed by diacritics
surrounded. Some Arabic letters must be connected to
other letters; others may stand alone. Arabic vowels are
in Arabic described in this paper. Tagset background
indicated by marks (Diacritics) above and below the
and EAGLES guidelines overview presented. Finally
consonants. In many cases, these diacritics play the role
we will present our tagset (Analysis and Hierarchy)
of vowels in English and thus influence pronunciation.
followed by conclusion and future work.
Additionally, there are no special forms, such as the use
of capital letters in English, to indicate proper nouns or
l Arabic Language the beginning of a sentence [10].
t1 BaCkground
The Arabic language is spoken in more than 20 2.2 Diacritics in Arabic
Countries, from Egypt to Morocco and throughout the Diacritics are marks placed over and below the letters
Arabian Peninsula. It is the native language of over 195 of Arabic word. This feature plays a great role in
million people. Plus, at least another 35 million speak adding linguistic attributes to Arabic words which help
Arabic as a second language. us to assign the most likely tag of the word in POS
Modem Standard Arabic (MSA) is the official tagging system and in indicating pronunciation and
language throughout the Arab world, and its written grammatical function of the words. It is particularly of
fonn is relatively consistent across national boundaries. interest for the purpose of this paper. Table 1 shows
MSA is used in official documents, in educational Arabic vowel diacritics.
settings, and for communication between Arabs of
The pronunciation of diacri tized languages word
s
caM ot be fully determined by spelling their characters 3 Arabic Ta~s~t and EAGLES guidelines
.' special marks are put above or below th e
only' EAGLES [9] gUIdelInes outline a set of features for
\haractt'rs (Diacritics) to determine the correc t Tagsets,. these guidelines were designed to help
promllciation and indicate the grammar function of th e standardIze tagsets for what were then the official
word within the sentence. For example, the word languages of the European Union.
"~ " without mark (Diacritic) may be pronounced to EA?LES tags are defined as sets of morpho syntactic
mean "He )I'rore" . "It )I'as lrritten", "books". Th e attrIbute-value pairs (e.g. Gender is an attribute that can
reader may refer to the context the word appears in to have the val~es Masculine, Feminine or Neuter)[3].
Jecidt' which of the \Yords is actually intended. In suc h The tagset dIscussed here is not being developed in
i.ll1guagt's. two different words may have identicaI
~ ~
accordance with the EAGLES guidelines for
spelling \yhereas their pronunciations and meanings are ~orphosyntactic annotation of corpora. Arabic is very
totally different [~]. dIfferent from the languages for which EAGLES was
In Arabic. short yowels are not apart of the Arabi c designed, and belongs to the Semitic family rather than
alphabet. They are used in both Noun and Verb in the Indo-European one.
Arabic Language. They indicate the case of the noun Following a normalized tagset and the EAGLES
lnQ the mood of the \·erb. recommendations would not capture some of Arabic
relevant information, such as the jussive mood of the
I Sholl Vowels ( Diacritics) -, verb and the dual number that are integral to Arabic.
I~anE Another important aspect of Arabic is inheritance,
Fatha Damma Kasra where all subclasses of words inherit properties from
S:YlT.bJl the classes from which they are derived. For example,
"", 'u I J tu I "., li I all subclasses of the noun inherit the "Tanween"
E~~anati.on Written above Written above Written b~lol,Y nunation when in the indefinite which is one of the
tre ::c:msonan1,
uample ,
tre ::c:msonan1. tre :nsonar.t,
.....>
main properties of the noun [7].
4...,..)
~'
.
Pnnun::iation
".
3.1 Previous work on POS tagsets
ba Bu bi
There are numbers of popular tagsets for English, such
I NUIl.ltioti .. TmlNeell" (Diacritics) I as : 87-tag tagset used Brown Corpus, 45-tag Penn
I~anE Treebank tagset and 61-tag C5 tagset, TOSCA tagset,
TaIlWeen Tamveen Tamveen ICE tagset, LUND tagset [5][3]. For Arabic also very
Tatil Damm KrMI small number of tagset had been built, El-Kareh S, AI-
Slin:bJl Ansary [1] described the tagset ,they classifying the
~ /~ln/ ~ /unl ~ !inl words into three main classes, Verbs are sub classified
E~~anati.on Writt en above V7r.tt:;r, fb~ Witten ':)elow
tre ::c:msonan1. the corucnanJ:, the OOJlSJr.ant,
into 3 subclasses; Nouns into 46 subclasses and
Example ~ Particles into 23 subclasses. Shereen Khoja [7]
1..1

~ ,..U
~.
described more detail tagset. Her tagset contains 177
Pnnun::iation bar. bU:1 tin tags, 57 Verbs, 103 Nouns, 9 Paricles, 7 residual and 1
punctuation.
Shali(ia & Sukull
( () iacritics ) 3.2 Proposed Arabic Tagset: Analysis
:.Jame
S1ladda SlJkuJJ.
It is necessary to have a model of the language to create
3Jmbo~
the linguistic categories of a tagset. An ideal approach
~ 0 would be to derive this model from the grammatical
~xplanation \?rit1eh roove WriterL above description of the language.
b.e con;onarrt , the CJrsorur.t. Since the grammar of Arabic has been standardized
~xarr.ple ~ <;'

....,.; for centuries, it is logical to derive our morpho syntactic


W. . Arabic tagset from this grammatical tradition that has
J'O:."nmci a:ior: bb b
... !
been used for around fourteen centuries by all students
of Arabic.
Table 1: Arabic vowel diacritics

2
.\rabic grammarians and linguists have always used the
For example, the words ".l.lJ" "61.l.l;" and "J'i I" h' h
" ' ; w IC
.\rabic system of inflectional morphology called
"~I.>"~I" when teaching Arabic grammar to students.
n:ean a boy ", " two boys " and "boys " indicate
smgular, dual, and plural respectively.
For example. given the sentence" jJ-Jl\ ~ ""the boy
played", students would have to say that the first word
• Gender: Arabic nouns have three genders: masculine
IS the indeclinable, indicative. perfect verb, while the
feminine and neuter. Most common noun ends with
second word is the nominative sUbject[7][3].
"Tanween". Most feminine singular nouns end with a
The proposed Arabic tagset in this paper is based on
round Ta (marbuta). For example, the words
the int1ectional morphology system. Arabic "~,, ,," "L" d:i -, ~_
,0.,)01'-01:1 an "-~", which mean" a king" "a
grammarians traditionally analyses all Arabic words I " and" group of people " indicate masculine
'
pane
into three main parts-of-speech. These parts-of-speech feminine and neuter respectively. '
are further sub-categorised into more detailed parts-of-
~peech which collectively cover the whole of the • Person: Arabic nouns have three persons: the speaker
Arabic language [-+]. These are: (First person), the individual spoken to (Second
'~oun: A noun in Arabic is a name or a describing- person), and individual spoken of (third person). For
word for a person, a thing or an idea. This includes not example, the personal noun and "ul" which mean" I" ,
only the English equivalent of a noun, but also " You" and" He " indicate First, Second, and third
adjectives. proper nouns and pronouns. person respectively.
'Verb: Verb: Yerbs are the same in Arabic as they are
in English in that they denote actions. 3.2.2 Verb
'Particle: Partie les include prepositions, conjunctions, Arabic verbs are deficient in tenses. Moreover, these
Exceptions, Vocative, Annulment, Subjunctive, and tenses do not have accurate time significances as in
Jussive. Indo-European languages [6].

U.1 ~oun The verb in the Arabic language implies a state or


.\ noun in Arabic indicates a meaning by itself without action and a notion of time combined with them and
}eing connected with the notion of time and refers to a has several aspects: Perfect, Imperfect and Imperative.
:er50n, place, thing, event, substance or quality. The Perfect verb indicates a state or a fact in the past.
\ouns are also divided into the following types: For example, the word "~" which means "He wrote".
(Common, Demonstrative, Relative, Personal, Adverb, The Imperfect verb expresses an action still unfinished
Diminutive, Instrument, Conjunctive, Interrogative, at the time to which reference is being made. For
Proper, and Adjective). example, the word "~~,, which means "He is writing".
The linguistic attributes of nouns that have been used in The Imperative verb indicates an action demanded to
this tagset are: be carried out in the future. For example, the word
"~1" which means "you write".
'Case: Arabic nouns have three cases: nominative,
accusative and genitive. For example, the words " ~J~\ The linguistic attributes of Verbs that have been used in
.... J~\ ,
o..J.l.l1 " which mean "the lesson", indicate the this tagset are:
above three cases respectively.
Without the case marker associated with the last letter • Mood: Arabic Verbs have three moods: Indicative,
of the above words (e.g short vowels), it's difficult to Subjunctive and Jussive. In Verbs, the words "~",
detennine the case of that word. "~" and "('.M" which mean" He wrote" , " I wrote
" and " You wrote " indicate Indicative, Subjunctive,
'State: Arabic nouns are marked for definiteness and Jussive mood respectively.
indefiniteness. Definiteness is marked by the article
')", which means" the". For example, the words • Number: Arabic has three numbers: singular, dual,
"1o:1L4.l1" and "~US" which mean" the boof('," a boof(' and plural. For example, the words "I.;l", "61ji;" and
indicate definiteness/indefiniteness respectively. "1;1.;l" which mean" He read ", " (two people) read"
and " they read" indicate singular, dual, and plural
I ~umber: Arabic has three numbers: singular, dual, number respectively.
and plural.

3
• Gender: Arabic verbs have two genders: masculine
feminine. For example, the words "~" and II~': The tagset has the following main formula:
which mean "He wrote "and " She wrote ". [ T , S , G , N , P , M , C , F) ,Where:
T (Type) = {Verb, Noun, Particle}
· Person: Arabic verbs have three persons: the speaker S = Sub-Class {Common, Demonstrative, Relative,
(First person), the individual spoken to (Second Personal, Adverb, Diminutive, Instrument,
Conjunctive, Interrogative, Proper
rcrson), and individual spoken of (third person).
and Adjective}
For example, the words which mean the words II~II ,
'~" and "~" which mean" He wrote" , " I wrote
. G (gender)= {Masculine, Feminine, Neuter}
N (Number) = {SingUlar, Plural, Dual}
., and " YOli wrote " indicate First, Second, and third P (Person) = {First, Second, Third}
person respectin~ly, M (Mood) = {Indicative, Subjunctive, Jussive}
C (Case) = {Nominative, Accusative, Genitive}
3.2.3 Particle F (State) = {Definite, Indefinite}
In Arabic, particles are classified as one of the three
main categories as part of speech, some of the particles Figure 2 shows the Abbreviations which was used to
used with \' erbs and effective the mood of verb when define the words in our tagset.
precedes the \' erb word. For example, the particles "~" A sample of our tagset shown in Table 2.
IJussi\'e), "..,.s" (Subjuncti\'e), some of them used with
\'ouns, For example. the particles "~" (Preposition), Wonl ~4\bb \VQrd Ahb
"YI" (Exception). and some used with both the noun Verb Vt' .A.r.tll\dment . .~n
and the verb. For example, the particle ",J" Noun Nu Subjunctive Sb
(Conjunction). Partide PI' Mascuhoe I\Ia
Pttfect Pf:a Feminme F ...
3.3 Proposed Arabic Tagset: Hierarchy Imperfect Pi
Me'uter Nt'
Imp erah'..re Pm
\\'e have based our Arabic tagset on inflectional Stngular S11
Common Cn r--
morphology system. The traditional description of
Adjectiv & Aj
Plural PI
-\rabic grammarians consider as a base to create the Dual Dn
Dem.onSlratlve Di:"
.:nguistic categories of Arabic tagset. Arabic R.elative Rf:'
rLfst Fs
grammarians describe Arabic as being derived from Perscm.al P,o;
Second Be
three main categories: noun, verb and particle. Figure 1 Dlrrunut ""'Ie Din Third Th
shows the tagset hierarchy. Instrurnenl Is IndicatIve Dc
Proper PI1 Subjunctive ~j

Adverb Ad Jussive Js
Interrogative In Nr)ftltnative Nm
Word ConjUMtiQn q AcctLsatlve ..\c
Preposition Pp Genitive G~
V-oc.atrve " -\"0- - Defirute Df
ConjOO(;tlCUl Co l'nde fmite Ed
u..ceplion Ex
Fig. 2: Tagset AbbreViatIOns

Fig. 1: Tagset Hierarchy.

Let us try to explain the symbols of the tagset formula for a moment. The symbols [T, S, G, N, P, M] are considered as linguistic attributes for the class Verb, while the symbols [T, S, G, N, P, C, F] are considered as linguistic attributes for the class Noun. For example, the word "كتب", which means "He wrote", has the following tag [VePeMaSnThSj], which means [Perfect Verb, Masculine Gender, Singular Number, Third Person, Subjunctive Mood].
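To show how much information such a tag string packs, the sketch below decodes a verb tag of the form used above into attribute-value pairs using the abbreviations of Fig. 2. The field names and the fixed two-letter layout for verb tags are our own assumptions for this illustration; only a subset of the abbreviations is listed.

# Hypothetical decoder for verb tags of the form shown above (e.g. VePeMaSnThSj),
# using the abbreviations of Fig. 2; the field names are ours.
VERB_FIELDS = ["Type", "Aspect", "Gender", "Number", "Person", "Mood"]
ABBREVIATIONS = {
    "Ve": "Verb", "Pe": "Perfect", "Pi": "Imperfect", "Pm": "Imperative",
    "Ma": "Masculine", "Fe": "Feminine", "Ne": "Neuter",
    "Sn": "Singular", "Pl": "Plural", "Du": "Dual",
    "Fs": "First Person", "Se": "Second Person", "Th": "Third Person",
    "Dc": "Indicative", "Sj": "Subjunctive", "Js": "Jussive",
}

def decode_verb_tag(tag):
    """Split a verb tag into two-letter codes and expand each one."""
    codes = [tag[i:i + 2] for i in range(0, len(tag), 2)]
    return {field: ABBREVIATIONS.get(code, code) for field, code in zip(VERB_FIELDS, codes)}

if __name__ == "__main__":
    print(decode_verb_tag("VePeMaSnThSj"))
    # {'Type': 'Verb', 'Aspect': 'Perfect', 'Gender': 'Masculine',
    #  'Number': 'Singular', 'Person': 'Third Person', 'Mood': 'Subjunctive'}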

4
4 Conclusion and Future Work
In this paper, we described a morphosyntactic tagset that is derived from the ancient Arabic grammar, which is based on the Arabic system of inflectional morphology. The tagset represents an early stage for use in a word-class based automated tagging system that is under development by the author. The tagset does not follow the traditional Indo-European tagset that is based on Latin but is instead based on the Semitic tradition of analyzing language.
These tags contain a large amount of information and add more linguistic attributes to the word. Also, we are currently expanding our tagset to cover most categories of words in Arabic.

Tag | Description
VePeMaSnThSj | Verb, Perfect, Masculine, Singular, Third Person, Subjunctive
VePeMaSnFsDc | Verb, Perfect, Masculine, Singular, First Person, Indicative
VePeMaSnSeSj | Verb, Perfect, Masculine, Singular, First Person, Subjunctive
VePeFeSnSeJs | Verb, Perfect, Feminine, Singular, Second Person, Jussive
VePeFeSnThJs | Verb, Perfect, Feminine, Singular, Third Person, Jussive
VePeNeDuSeSj | Verb, Perfect, Neuter, Dual, Second Person, Subjunctive
VePeMaDuThSj | Verb, Perfect, Masculine, Dual, Third Person, Subjunctive
VePeFeDuThSj | Verb, Perfect, Feminine, Dual, Third Person, Subjunctive
VePeMaPlFsSj | Verb, Perfect, Masculine, Plural, First Person, Subjunctive
VePeMaPlSeJs | Verb, Perfect, Masculine, Plural, Second Person, Jussive
VePeFePlSeJs | Verb, Perfect, Feminine, Plural, Second Person, Subjunctive
VePeFePlThJs | Verb, Perfect, Feminine, Plural, Third Person, Subjunctive
VePeMaPlThDc | Verb, Perfect, Masculine, Plural, Third Person, Indicative
VePiMaSnThDc | Verb, Imperfect, Masculine, Singular, Third Person, Indicative
VePiMaSnFsDc | Verb, Imperfect, Masculine, Singular, First Person, Indicative
VePiFeSnThDc | Verb, Imperfect, Feminine, Singular, Third Person, Indicative
VePiNePlFsDc | Verb, Imperfect, Neuter, Plural, First Person, Indicative
VePiMaDuThJs | Verb, Imperfect, Masculine, Dual, Third Person, Jussive
VePiFeDuSeJs | Verb, Imperfect, Masculine, Dual, Third Person, Jussive
VePiMaPlThSj | Verb, Imperfect, Masculine, Plural, Third Person, Subjunctive
VePiFePlThSj | Verb, Imperfect, Feminine, Plural, Third Person, Subjunctive
VePmMaSnSeJs | Verb, Imperative, Masculine, Singular, Second Person, Jussive
VePmNeDuSeSj | Verb, Imperative, Neuter, Dual, Second Person, Subjunctive
VePmFePlSeSj | Verb, Imperative, Feminine, Plural, Second Person, Subjunctive
VePmMaPlSeSj | Verb, Imperative, Feminine, Plural, Second Person, Subjunctive
NuAjMsSnNmId | Adjective Noun, Masculine, Singular, Nominative, Indefinite
NuAjMsSnAcId | Adjective Noun, Masculine, Singular, Accusative, Indefinite
NuAjMsSnGeId | Adjective Noun, Masculine, Singular, Genitive, Indefinite
NuAjMsSnNmDf | Adjective Noun, Masculine, Singular, Nominative, Definite
NuAjMsSnAcDf | Adjective Noun, Masculine, Singular, Accusative, Definite
NuAjMsSnGeDf | Adjective Noun, Masculine, Singular, Genitive, Definite
NuAjMsDuGeId | Adjective Noun, Masculine, Dual, Genitive, Indefinite
NuAjMsDuGeDf | Adjective Noun, Masculine, Dual, Genitive, Definite
NuAjFeSnNmId | Adjective Noun, Feminine, Singular, Nominative, Indefinite
NuAjFeSnAcId | Adjective Noun, Feminine, Singular, Accusative, Indefinite
NuAjFeSnGeId | Adjective Noun, Feminine, Singular, Genitive, Indefinite
NuAjFeSnNmDf | Adjective Noun, Feminine, Singular, Nominative, Definite
NuAjFeSnAcDf | Adjective Noun, Feminine, Singular, Accusative, Definite
NuAjFeSnGeDf | Adjective Noun, Feminine, Singular, Genitive, Definite
NuAjFeDuGeId | Adjective Noun, Feminine, Dual, Genitive, Indefinite
NuAjFeDuGeDf | Adjective Noun, Masculine, Dual, Genitive, Definite
NuAjMaPlAcId | Adjective Noun, Masculine, Plural, Accusative, Indefinite
NuAjMaPlGeId | Adjective Noun, Masculine, Plural, Genitive, Indefinite
NuAjMaPlNmId | Adjective Noun, Masculine, Plural, Nominative, Indefinite
NuAjMaPlNmDf | Adjective Noun, Masculine, Plural, Nominative, Definite
NuAjMaPlAcDf | Adjective Noun, Masculine, Plural, Accusative, Definite
NuAjMaPlGeDf | Adjective Noun, Masculine, Plural, Genitive, Definite
NuAjFePlNmId | Adjective Noun, Feminine, Plural, Nominative, Indefinite
NuAjFePlAcId | Adjective Noun, Feminine, Plural, Accusative, Indefinite
NuAjFePlGeId | Adjective Noun, Feminine, Plural, Genitive, Indefinite
NuAjFePlNmDf | Adjective Noun, Feminine, Plural, Nominative, Definite
NuAjFePlAcDf | Adjective Noun, Feminine, Plural, Accusative, Definite
NuAjFePlGeDf | Adjective Noun, Feminine, Plural, Genitive, Definite
NuIsMaSnNmId | Instrument Noun, Masculine, Singular, Nominative, Indefinite
NuIsMaDuGeId | Instrument Noun, Masculine, Dual, Genitive, Indefinite
NuIsMaPlNmId | Instrument Noun, Masculine, Plural, Nominative, Indefinite
NuIsMsSnNmDf | Instrument Noun, Masculine, Singular, Nominative, Definite
NuIsMsSnAcDf | Instrument Noun, Masculine, Singular, Accusative, Definite
NuIsMsSnGeDf | Instrument Noun, Masculine, Singular, Genitive, Definite
NuIsMaDuGeId | Instrument Noun, Masculine, Dual, Genitive, Indefinite
NuIsMaPlNmDf | Instrument Noun, Masculine, Plural, Nominative, Definite
NuIsMaPlAcDf | Instrument Noun, Masculine, Plural, Accusative, Definite
NuIsMaPlGeDf | Instrument Noun, Masculine, Plural, Genitive, Definite
PrPp | Preposition Particle
PrVo | Vocative Particle
PrCo | Conjunction Particle
PrEx | Exception Particle
PrAn | Annulment Particle
PrSb | Subjunctive Particle

Table 2: Sample of Arabic Tagset

References:
[1] El-Kareh and Al-Ansary, "An Arabic interactive multi-feature POS tagger," In Proceedings of the ACIDCA conference, Monastir, Tunisia, 2000, pp. 204-210.
[2] M. A. Elaraby, "A large scale computational processor of the Arabic morphology and application," Master's thesis, Cairo University, Egypt, 2000.
[3] Andrew Hardie, "Developing a tagset for automated part-of-speech tagging in Urdu," Proceedings of the Corpus Linguistics 2003 conference, Lancaster University, UK, 2003.
[4] J. A. Haywood and H. M. Nahmad, A New Arabic Grammar of the Written Language, Lund Humphries, USA, 2005.
[5] Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, USA, 2000.
[6] S. Khoja, "APT: Arabic part-of-speech tagger," Proceedings of the Student Workshop at the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL2001), Carnegie Mellon University, Pittsburgh, Pennsylvania, no. 2, 2001.
[7] Garside, Khojah and Knowels, "A tagset for the morphosyntactic tagging of Arabic," Paper presented at Corpus Linguistics 2001, Lancaster University, Lancaster, UK, March 2001; to appear in "A Rainbow of Corpora: Corpus Linguistics and the Languages of the World", edited by Andrew Wilson, Paul Rayson, and Tony McEnery; Lincom-Europa, Munich, 2001.
[8] B. Megyesi, "Brill's rule-based part of speech tagger for Hungarian," D-level thesis (Master's thesis) in Computational Linguistics, Stockholm University, Sweden, 1998.
[9] Leech G, Wilson A (1996), Recommendations for the Morphosyntactic Annotation of Corpora, EAGLES Report. http://www.ilc.pi.cnr.it/EAGLES96/annotate/
[10] Transparent Language, http://www.transparent.com/
