Dr. T.V. Geetha

A Special Talk
Tamil Computing
Dr. T.V. Geetha, Tamil Computing Lab (TACOLA), Dept. of CSE & IST, College of Engineering Guindy, Anna University, Chennai
Team Co-ordinators: Ranjani Parthasarathy & Dr. Madhan Karky
27th January 2012
Characteristics of Tamil
- Partially free-word-order language
- Morphologically rich language
- Morphological suffixes convey most of the roles played in a sentence
- Ambiguity at the morphological level
- Ambiguity at the semantic level
Our Basis
Linguistics
- Use of rich morphological features of Tamil
- Use of POS tags
- Use of word-based semantics with well-defined semantic constraints (primitives): UNL
Computer Science & Engineering
- Rule-based approach & FSA for Tamil
- Clustering approaches
- Probabilistic approaches: Naive Bayes, Conditional Random Field, HMM
- Machine learning: bootstrapping & unsupervised approach
Language Processing
- Morphological Analyzer
- POS Tagging
- Chunking
- Named Entity Recognition
- Parser
- Word Sense Disambiguation
- Anaphora Resolution
- Semantic Interpretation
Morphological Analyzer
Morphological Analyser - Introduction

- Most textual data contains compound, numeral and colloquial words.
- Due to its morphological richness, Tamil needs handling of those words.
- Development of an integrated morphological analyser (compound, numeral, colloquial)
  - Needed to tackle news and lyrics
  - Helps to increase the accuracy of morphological analysis
Morphological Analyser Word processing

- Morphological suffix stripping (conventional analyser) yields the resulting word W
- The word W is processed by the compound analyser
- The word W is processed by the numeral analyser
(Right to left, by concatenating vowels and consonants, iteratively checking alphabets in the Morph Dictionary and applying Tamil grammar rules)
Morphological Analyser
Compound Word Representation
Numeral Word Representation
Compound Analyser Rules Classification
Compound Analyser
- Based on Finite State Transducer (FST)
- Handles not only simple compounding but also compounding between two words that may cause inflectional variations during the compounding process
Ex (i): (Golden statue)
Rule: If the second constituent's first alphabet is a hard consonant and the first constituent's last alphabet is a vowel or medial consonant, then no modification
Compound Analyser
Ex: (Root tree)
Rule: If the second constituent's first alphabet is a consonant, then a letter is inserted as the first constituent's last alphabet
(Root)
Insertion
Ex: (Sand pot)
Rule: If the second constituent's first alphabet is a hard consonant, then the first constituent's last alphabet is replaced
Replacement
(Pot)
Ex: (Banana)
Rule: If the second constituent's first alphabet is a hard consonant and the first constituent's last alphabet is the same hard consonant, then it is deleted

(Fruit)
Deletion
Compound Analyser Rules Classification
Compound Word Analyser - FST

Finite State Transducer
- Two tapes, which describe the input (lexical form) and output (surface form) sequences
- It is a seven-tuple; its components include:
  - Σ1, the finite input alphabet (ai1, ..., aik)
  - Σ2, the finite output alphabet (bi1, ..., bik)
  - Q, a finite set of states (S0, S1, S2, S3, S4, S5, S6)
  - the initial state, a member of Q (S0)
  - F, a subset of Q, the set of final states (S6)
- Here a:b represents the replacement of a in the surface form by b in the lexical form
- c/d states that the transition can occur if either c or d is in the lexical form
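The two-tape idea can be illustrated with a very small transducer. The sketch below is only an assumption-laden toy: the `Arc` and `FST` classes, the state names and the single insertion arc (lexical boundary `+` realised as surface `t`) are hypothetical and are not the lab's compound or numeral FST.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    src: str      # state the arc leaves
    lexical: str  # symbol on the lexical tape
    surface: str  # symbol on the surface tape ("" consumes no surface input)
    dst: str      # state the arc enters


class FST:
    """Toy transducer: maps a surface string back to its lexical forms."""

    def __init__(self, arcs, initial, finals):
        self.arcs = list(arcs)
        self.initial = initial
        self.finals = set(finals)

    def analyse(self, surface):
        # Depth-first search over (state, surface position, lexical output).
        # Assumes no cycle made only of arcs with an empty surface symbol.
        results = set()
        stack = [(self.initial, 0, "")]
        while stack:
            state, pos, lexical = stack.pop()
            if pos == len(surface) and state in self.finals:
                results.add(lexical)
            for arc in self.arcs:
                if arc.src != state:
                    continue
                if arc.surface == "":
                    stack.append((arc.dst, pos, lexical + arc.lexical))
                elif surface.startswith(arc.surface, pos):
                    stack.append((arc.dst, pos + len(arc.surface),
                                  lexical + arc.lexical))
        return results


# Hypothetical insertion rule: the boundary "+" surfaces as an inserted "t".
toy_fst = FST(
    arcs=[
        Arc("S0", "a", "a", "S1"),
        Arc("S1", "+", "t", "S2"),
        Arc("S2", "b", "b", "S3"),
    ],
    initial="S0",
    finals={"S3"},
)
```

With this toy machine, `toy_fst.analyse("atb")` recovers the lexical form `a+b`, while a surface string without the inserted letter is rejected.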
Compound Analyser FST
Numeral analyzer
- Based on Finite State Transducer (FST)
- The number words one to ten, hundred, thousand, lakh and crore can be directly converted into numbers
Ex: (Ten)
Rule: No modification
10
Numeral Analyser
Ex: (Five Thousand)
Rule: If the second constituent's first alphabet is a vowel and the first constituent's last alphabet is a hard consonant, then insert ''
(Thousand)
Insertion
5000
Numeral Analyser
Ex: (Twenty Five)
Rule: If the second constituent's first alphabet is a vowel and the first constituent's last alphabet is a hard consonant, then that hard consonant is replaced
Replacement
(Five)
25
Numeral Analyser
Ex: (Twenty Three)
Rule: If the second constituent's first alphabet is a soft consonant and the first constituent's last alphabet is a hard consonant, then delete the hard consonant
23

(Three)
Deletion
Numeral Analyser FST
Colloquial Analyser
- Based on a pattern mapping approach
- To the best of our knowledge, no previous work has been done on converting informal words to formal words.
- Adopts spelling variation rules and performs the mapping that transforms an informal (colloquial) written word into a formal written word.
Colloquial Analyser
- Pattern-based approach based on spelling variation rules
- Word processing: right to left
- List of spelling variation rules
  - Suffix mapping of ending patterns
  - Suffix mapping of ending patterns with morphographemic changes
  - Suffix mapping of ending patterns with checking of one/two preceding characters
  - Suffix mapping of patterns occurring at any place
- In all the rules, pattern p1 of the colloquial form is converted into pattern p2 of the normal form
Colloquial Analyser
Suffix Mapping of ending patterns
- Ending pattern p1 is replaced with pattern p2
Ex: (irukean) -> (irukirean)
Pattern 1 is replaced by Pattern 2
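A minimal sketch of ending-pattern replacement follows. The rule table and the function name `normalise_ending` are hypothetical; the split of the slide's example into the endings `kean` and `kirean` is an assumption, since the original Tamil patterns are not available here.

```python
# Hypothetical rule table: (colloquial ending p1, formal ending p2).
# The single pair is inferred from the example irukean -> irukirean.
ENDING_RULES = [
    ("kean", "kirean"),
]


def normalise_ending(word, rules=ENDING_RULES):
    """Replace the longest matching colloquial ending p1 with p2."""
    for p1, p2 in sorted(rules, key=lambda rule: len(rule[0]), reverse=True):
        if word.endswith(p1):
            return word[: len(word) - len(p1)] + p2
    return word  # no rule applies: the word is left unchanged
```

Trying the longest ending first keeps a short pattern from masking a more specific one when the rule table grows.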
Colloquial Analyser
Suffix Mapping of ending patterns with Morphographemic changes
- Ending pattern p1 is replaced with pattern p2; p2 is then passed on for the morphographemic change
Ex: (thambi kittey) -> (thambi yidam)
Pattern 1 is replaced by Pattern 2, followed by the morphographemic change

Colloquial Analyser
Suffix Mapping of ending patterns with checking of one/two preceding characters
- Ending pattern p1 is replaced with pattern p2 after checking one or two preceding characters
Ex: (kaa nju) -> (kaa yndhu)
Pattern 1 is replaced by Pattern 2 after checking one preceding character (A)
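The context check can be sketched as an extra condition on the stem. Everything named here (`CONTEXT_RULES`, `normalise_with_context`) is hypothetical, and reading the slide's "(A)" as the transliterated character `a` is an assumption.

```python
# Hypothetical rules: (ending p1, replacement p2, allowed preceding strings).
# The pair nju -> yndhu after "a" is inferred from kaa nju -> kaa yndhu.
CONTEXT_RULES = [
    ("nju", "yndhu", ("a",)),
]


def normalise_with_context(word, rules=CONTEXT_RULES):
    """Replace ending p1 with p2 only if the stem ends in an allowed context."""
    for p1, p2, preceding in rules:
        if not word.endswith(p1):
            continue
        stem = word[: len(word) - len(p1)]
        if stem.endswith(preceding):  # str.endswith accepts a tuple of options
            return stem + p2
    return word
```

A word such as the made-up `minju` is left alone because the character before the ending is not in the allowed context.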
Compound Word - Semantic Relation

Extracting the semantic relation for compound words
- Identifying the metaphor words
- Identifying the characteristics of the components
- Identifying the comparison relation between the components
These relations are extracted using the Tamil grammar of compound words (thogai), the part-of-speech tag and UNL semantic constraints
Compound Word - Semantic Relation
14 relations are identified

Ex: To identify the Color relation in (black hair) = (black + hair)
- Concept: black; POS and semantic constraint: noun + icl>color
- Concept: hair; POS and semantic constraint: noun + pof>head
- Relation: Color
Part Of Speech (POS) Tagging
POS Tagging
What is POS tagging?
Part-of-speech tagging is the process of assigning a part of speech such as noun, verb, pronoun, preposition, adverb, adjective or another lexical class marker to each word in a sentence.
Tag sets for different languages

For Tamil, a tag set was formulated through a literature survey and a review of standard tag sets for English, such as the Penn Treebank and Wall Street Journal tag sets.
Noun category
N: Noun
NP: Noun phrase
NN: Noun + noun
NNP: Noun + noun phrase
IN: Interrogative noun
INP: Interrogative noun phrase
PN: Pronominal noun
PNP: Pronominal noun phrase
VN: Verbal noun
VNP: Verbal noun phrase
Pn: Pronoun
PnP: Pronoun phrase
Nn: Nominal noun
NnP: Nominal noun phrase
Other category
SP: Sub-ordinate clause conjunction phrase
SCC: Sub-ordinate clause conjunction
Par: Particle
adj: Adjective
Iadj: Interrogative adjective
Dadj: Demonstrative adjective
Inter: Interjection
Int: Intensifier
CNum: Character number
Num: Number
DT: Date time
PO: Post position
Verb category
V: Verb
VP: Verbal phrase
Vinf: Verb infinitive
Vvp: Verb verbal participle
Vrp: Verbal relative participle
AV: Auxiliary verb
FV: Finite verb
NFV: Negative finite verb
adv: Adverb
Characteristics by analyzing 4,70,000 words

- Tamil words take on more than one morphological suffix; often the number of suffixes is 3, with the maximum going up to 13.
- The sequence of morphological suffixes attached to a word plays a role in determining its part-of-speech tag.
- 79 morpheme components were identified, which can combine to form about 2000 meaningful combinations of integrated suffixes.
Using these morpheme properties, we designed a Naive Bayes probabilistic model for POS tagging.
Analysis of the morphology of words and design of a Naive Bayes model for POS based on morpheme components
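To make the idea concrete, here is a minimal Naive Bayes sketch that predicts a POS tag from a word's morpheme components. It is not the lab's model: the class `MorphemeNaiveBayes`, the add-one smoothing, the morpheme labels and the five training examples are all hypothetical; only the tag names N, FV and Vinf come from the tag set.

```python
import math
from collections import Counter, defaultdict


class MorphemeNaiveBayes:
    """Multinomial Naive Bayes over morpheme components (add-alpha smoothing)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.tag_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, examples):
        for morphemes, tag in examples:
            self.tag_counts[tag] += 1
            for morpheme in morphemes:
                self.feature_counts[tag][morpheme] += 1
                self.vocabulary.add(morpheme)
        return self

    def log_score(self, morphemes, tag):
        # log P(tag) + sum of log P(morpheme | tag)
        total = sum(self.tag_counts.values())
        score = math.log(self.tag_counts[tag] / total)
        denom = (sum(self.feature_counts[tag].values())
                 + self.alpha * len(self.vocabulary))
        for morpheme in morphemes:
            count = self.feature_counts[tag][morpheme]
            score += math.log((count + self.alpha) / denom)
        return score

    def predict(self, morphemes):
        return max(self.tag_counts, key=lambda tag: self.log_score(morphemes, tag))


# Hypothetical training data: (morpheme components of a word, POS tag).
TOY_TRAINING = [
    (("plural", "accusative"), "N"),
    (("locative",), "N"),
    (("past", "png"), "FV"),
    (("future", "png"), "FV"),
    (("infinitive",), "Vinf"),
]
toy_tagger = MorphemeNaiveBayes().fit(TOY_TRAINING)
```

On this toy data, a word carrying plural and locative components is scored highest as a noun, and one carrying past tense plus a PNG marker as a finite verb.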
Chunking
Chunking
What is Chunking?
Chunking is the task of identifying and segmenting text into syntactically related, non-overlapping groups of words.
Need for chunking
- One of the important preprocessing steps for all other language processing
- Aids in extracting the crux of the information from sentences and documents
The chunk types are ADJP, ADVP, CONJP, INTJ, NP, PP and VP.
Our Approach
- The morpheme features of words contribute to identifying chunk boundaries.
- Using these morpheme components as one of the features, a CRF model is designed.
Using morpheme components as features for Conditional Random Field models for identifying chunk boundaries
Transliteration   POS      Chunk
[intha]           <adj>    B-NP
[thakavalin]      <Ngen>   I-NP
[atippataiyil]    <Nloc>   I-NP
[pOlicAr]         <noun>   B-NP
[andthandtha]     <Dadj>   B-NP
[mAvatta]         <Madj>   I-NP
[pOlish]          <noun>   I-NP
[cOthanaic]       <adj>    I-NP
[cAvati]          <noun>   I-NP
[maiyangkalil]    <Nloc>   I-NP
[vAkanac]         <adj>    B-NP
[cOthanaiyil]     <Nloc>   I-NP
[Itupattanar]     <FV>     B-VP
State features
F(word(-2),Ctag)
F(word(-1),Ctag)
F(word(0),Ctag)
F(word(1),Ctag)
F(word(2),Ctag)
F(POS(-2),Ctag)
F(POS(-1),Ctag)
F(POS(0),Ctag)
F(POS(1),Ctag)
F(POS(2),Ctag)
Transition features
F(word(-2),word(-1),Ctag)
F(word(-1),word(0),Ctag)
F(word(0),word(1),Ctag)
F(word(1),word(2),Ctag)
F(POS(-2),POS(-1),Ctag)
F(POS(-1),POS(0),Ctag)
F(POS(0),POS(1),Ctag)
F(POS(1),POS(2),Ctag)
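The feature templates can be read as a sliding window over words and POS tags. The sketch below only builds the observation side of each template for one position; pairing it with the chunk tag Ctag is left to whichever CRF toolkit is used. The function name `chunk_features`, the `<PAD>` symbol and the `|` separator are assumptions; the word and POS lists are the transliterated example sentence.

```python
WORDS = ["intha", "thakavalin", "atippataiyil", "pOlicAr", "andthandtha",
         "mAvatta", "pOlish", "cOthanaic", "cAvati", "maiyangkalil",
         "vAkanac", "cOthanaiyil", "Itupattanar"]
POS = ["adj", "Ngen", "Nloc", "noun", "Dadj", "Madj", "noun",
       "adj", "noun", "Nloc", "adj", "Nloc", "FV"]


def chunk_features(words, pos_tags, i, window=2):
    """Window features for position i: unigrams and adjacent pairs."""
    def at(seq, j):
        # Positions outside the sentence are padded.
        return seq[j] if 0 <= j < len(seq) else "<PAD>"

    feats = {}
    for name, seq in (("word", words), ("POS", pos_tags)):
        # Unigram templates: word(-2) ... word(2), POS(-2) ... POS(2)
        for off in range(-window, window + 1):
            feats[f"{name}({off})"] = at(seq, i + off)
        # Pair templates: word(-2),word(-1) ... word(1),word(2), same for POS
        for off in range(-window, window):
            key = f"{name}({off}),{name}({off + 1})"
            feats[key] = at(seq, i + off) + "|" + at(seq, i + off + 1)
    return feats


first_token_features = chunk_features(WORDS, POS, 0)
```

Each position yields 18 features (5 unigram and 4 pair templates for words, and the same for POS tags).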
Named Entity Recognition
Named Entity Recognition (NER)

- Locate and classify atomic elements in text into predefined categories
  - Proper names (people, organizations, locations)
  - Expressions of time
  - Quantities
  - Monetary values
  - Percentages
- Need for NER: robust handling of proper names is essential for many applications
  - Pre-processing for different classification levels
  - Key part of an Information Extraction system
  - Information filtering
  - Information linking
Indian language NER

Use two levels of linguistic evidence to perform modeling:
- Context cues
- Attributes to identify entities
A standard list of attributes is maintained initially. The list is updated by a suitable learning algorithm. Attributes are thus extracted and used to identify NEs within the framework of the same system: an NER and its associated attribute extractor.

Challenges
- Absence of capitalization of entities
- Presence of free word order
- Lemmatization is difficult
Features
- Postpositions
- Case markers
- PNG marker in the verb

For Persons:
- Presence of titles and honorifics like [thiru], [thalaivar]
- Presence of suffixes like [Ar], [Al] in the corresponding noun phrase
For Locations:
- Presence of post-positions
- Presence of adjacent words like [ndakar], [ndathi], [mAvattam]
For Organizations:
- Presence of adjacent words like [ndiRuvanam], [thuRai]
For Time/Date:
- Presence of adjacent words like [thEthi], [ANtu], [mAtham]
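A cue-word lookup of this kind can be sketched in a few lines. This is only an illustration of the adjacent-word cues, not the statistical system described in the talk: the function `cue_labels`, the one-token window and the sample token list are hypothetical, while the cue words are the transliterations listed above.

```python
CUE_WORDS = {
    "PERSON": {"thiru", "thalaivar"},
    "LOCATION": {"ndakar", "ndathi", "mAvattam"},
    "ORGANIZATION": {"ndiRuvanam", "thuRai"},
    "TIME": {"thEthi", "ANtu", "mAtham"},
}
ALL_CUES = set().union(*CUE_WORDS.values())


def cue_labels(tokens, window=1):
    """Label a token with an entity type if a cue word is adjacent to it."""
    labels = []
    for i, token in enumerate(tokens):
        label = "O"
        if token not in ALL_CUES:  # cue words themselves are not entities
            neighbours = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
            for entity_type, cues in CUE_WORDS.items():
                if any(n in cues for n in neighbours):
                    label = entity_type
                    break
        labels.append(label)
    return labels


# Hypothetical transliterated tokens, for illustration only.
sample_labels = cue_labels(["thiru", "kumar", "chennai", "ndakar"])
```

Such cues would normally serve as features or seed evidence for a learner rather than as final decisions, since an adjacent cue word alone is ambiguous.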
[Figure: block diagram with the components training data, shallow parsing, semantic parsing, statistical processing, dictionary (dictionary entries, clue extraction, verb rules) and NE table]
Steps in Expectation Maximization

- Seed probability estimates, obtained by picking up contextual cues
  - Related words
  - Ordering
- Smoothen the seed probability
- Perform ambiguity resolution
- Maximized probability values
- Named entities are tagged accordingly
Modified EM algorithm
Two problems were encountered with the traditional E-M algorithm:
- It performed only positional analysis, and a modification was required for free-word-order languages like Tamil
- It was syntactically oriented, and a modification was required to include semantic information
The modification process, called quantum entanglement, solves both of the above problems.
Example
[Table: example probability values for the tags Enloc, Enorg, Noun and Verb]
Parser
Vaanavil Tamil parser

- Free word order
- Phrase structure grammar
- Simple sentences: any number of nouns, adjectives, adverbs, ...
- Clausal sentences: identification using cue words or suffixes
- Nested clauses
3. Constituent Formation
The two main components are the noun constituent and the verb constituent
Noun constituent: A noun constituent can contain only a noun (Ex. ) or can be of the following form:
(adjective)* (adjective clause)* (adjective)* noun (case marker) (post position)
Ex.
(or) noun clause
Ex.
Constituent Formation (cont..)

Verb constituent: (adverb clause)* (adverb)* verb (suffix)*
Ex. 1. 2.
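The constituent patterns are regular expressions over tags, so a simplified version can be tried directly with Python's `re` module. This sketch is an assumption-heavy simplification: it omits adjective and adverb clauses, and the tag names (`adj`, `noun`, `case`, `postp`, `adv`, `verb`, `suffix`) and the function `group_constituents` are hypothetical.

```python
import re

# Simplified patterns over space-separated tags (clauses are left out).
NOUN_CONSTITUENT = r"\b(?:adj )*noun\b(?: case\b)?(?: postp\b)?"
VERB_CONSTITUENT = r"\b(?:adv )*verb\b(?: suffix\b)*"
PATTERN = re.compile(rf"(?P<NC>{NOUN_CONSTITUENT})|(?P<VC>{VERB_CONSTITUENT})")


def group_constituents(tags):
    """Group a tag sequence into noun (NC) and verb (VC) constituents."""
    text = " ".join(tags)
    return [(m.lastgroup, m.group().split()) for m in PATTERN.finditer(text)]


example_groups = group_constituents(
    ["adj", "noun", "case", "adv", "verb", "suffix"]
)
```

Matching on a joined tag string keeps the sketch short; a real parser would work on token objects so that each constituent keeps the position of its words, which matters for free word order.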
Constituent Formation in Simple Sentence

Words are grouped based on the function they perform.
1. Adjectives are grouped with their nouns.
   - Adjectives are adjacent to their noun.
2. Adverbs are grouped with their verbs.
   - Adverbs can occur anywhere in the sentence prior to their verb.
Ex.
Constituent Formation in Complex Sentence

Noun, adjective and adverb clauses are considered.
Step 1: Conversion of the complex sentence to a minimal sentence by grouping the clauses
Step 2: The minimal simple sentence can be analyzed as mentioned earlier
Step 3: Integration of the clauses into the minimal simple sentence
Grouping of Clauses
- Distinguishing feature of the parser
- Clauses are generally indicated by special cue suffixes or cue phrases (e.g. verbal participles, relative participles, etc.)
- Grouping is done by the position of the cues and linguistically based heuristic rules
Grouping of Clauses (cont..)

Ex.
Noun clause:
Adjective clause:
Adverb clause:
Converted minimal simple sentence:
Tree Generation
- The position of each word in the sentence is also shown, to take care of free word order
- First, the converted minimal simple sentence is considered to generate the tree.
Ex.
- The NCs and VC are expanded to generate the tree for the actual input sentence.
Walk-through of the 4 phases with an example:

After phase 1: (unidentified) (adv) (V) (con) (N) (adv) (rpl) (N) (N) (vpl) (N) (V)
After phase 2: (N) (adv) (V) (con) (N) (adv) (rpl) (N) (N) (vpl) (N) (V)
After phase 3:
After phase 4:
After expanding with the three clauses
Word Sense Disambiguation
Word Sense Disambiguation

The process of identifying which sense of a word is used in a sentence, when the word has multiple meanings
Noun and verb sense disambiguation
- Bootstrapping uses morphological suffixes, POS, semantic constraints and UNL relations (for verbs)
- Pattern representation and features
  - Noun: <left (features), ambiguous word (sense set, features), right (features), main verb>
  - Verb: <ambiguous word + sense, relations of interest>
Example Noun Disambiguation

Word 1: showing the use of context for noun sense disambiguation
Example 1: aaru - river, number, get cold, heal
Example 1.1: Sense: number; POS: Noun
pandiyan aaru padaikal kondu por thoduthaan
<entity, aaru<number>, noun+plural suffix, verb+icl>action>
Example 1.2: Sense: river; POS: Noun
Tamilnattin periya aaru kaveri aagum
<adjective, aaru<river>, entity, verb+aoj>thing>
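The two noun patterns can be used as a tiny lookup with partial matching. The sketch below is hypothetical: the scoring (one point per matching context slot) and the function `disambiguate` are assumptions, and a real system would learn many patterns by bootstrapping rather than use two fixed ones.

```python
# Patterns for the ambiguous noun "aaru", taken from the two examples:
# (left feature, sense, right feature, main verb feature)
AARU_PATTERNS = [
    ("entity", "number", "noun+plural suffix", "verb+icl>action"),
    ("adjective", "river", "entity", "verb+aoj>thing"),
]


def disambiguate(left, right, main_verb, patterns=AARU_PATTERNS):
    """Return the sense whose pattern shares the most context slots."""
    best_sense, best_score = None, 0
    for p_left, sense, p_right, p_verb in patterns:
        # Booleans add up as 0/1, giving the number of matching slots.
        score = (p_left == left) + (p_right == right) + (p_verb == main_verb)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense  # None when no slot matches any pattern
```

Counting matching slots gives a crude form of partial matching: a context that agrees with a pattern in two of three slots still selects that pattern's sense.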
Example Verb Disambiguation

Word 2: showing the use of context for verb sense disambiguation
Example 2: padai - army, disease, offer, create
Example 2.1: Sense: offer; POS: Verb
pakthan iraivanukku pazangalai padaiththaan
<padai+offer, agt + obj + to>
Example 2.2: Sense: create; POS: Verb
iraivan makkalai padaithaan
<padai+create, agt + obj>
Anaphora Resolution
Anaphora Resolution
The problem of resolving what a pronoun or a noun phrase refers to
Approaches
- Centering Theory (Brennan et al., 1987)
- Hobbs algorithm (Hobbs, 1978)
Applications
- Summarization
- Question Answering
- Information Retrieval
Anaphora Resolution in Tamil

Resolving Anaphora in Tamil Text
- Partially free-word-order language
- Morphologically rich language
- Morphological suffixes convey most of the semantic roles played in a sentence
Use of UNL as the basis for semantic representation
Our Approach
- Classification of anaphora: persons, places and events
- Centering Theory, modified by incorporating word-level semantics (UNL semantic constraints)
- Graph-based approach: sentence-level semantics (UNL graphs)
- The absence of case suffixes has been handled using UNL graphs
- Plural and event pronouns associated with multiple antecedents are tackled using UNL graphs
Classification of Anaphor
Anaphora representing Persons
- Person anaphora: nouns, noun phrases
- avan, avaL, avar, ivan, ivaL, ivar and the plural pronouns avarkaL and ivarkaL
Examples
- Raju nandraaka padiththaan. avan thervu ezuthinaan.
- Maanavarkal nandraaka padiththaarkal. avarkaL thervu ezuthinarkal.
Anaphora representing Places
- Place anaphora: nouns, noun phrases
- athu, ithu
- Adverbs such as angu and ingu can also act as pronouns representing places
Examples
- tiruchy tamilnaattin periya nagarangalil ondru. Ingu amman kovil uLLathu. ithil aayiram thooNkal uLLana.
Anaphora representing Events

- Event anaphora: verb phrases, clauses and segments of sentences
- Pronouns such as athu, ithu with the dative or accusative case represent events
Example
- swami aanmeega soRppozhivu nikazhthinaar. athai kaaNa makkaL koodiyirunthanar.
Ambiguous Pronouns
- Pronouns such as athu, ithu can represent both places and events
- A higher level of semantics and verb semantics is needed
Examples
- maduraiyil meenatchi kovil ullathu. Ithil aayiram thoonkal uLLana.
- madurayil ulla meenatchi kovilil aanmeeka sorpozhivu nadaipeRRathu. Ithil eeralamaana makkal pangeRRanar.
Semantics Integrated Centering Theory

- Word-level semantics: UNL semantic constraints
- Anaphora classification: filter out the non-anaphoric expressions
- UNL semantic constraints: filter out the non-referring expressions
- Plural pronouns have been tackled to a certain extent
Anaphora Resolution using UNL Graphs

- Plural pronouns having multiple concepts as antecedents
- Event pronouns
Two components
- Use of UNL relations to extract the concepts for anaphora resolution
  - Co-ordinating UNL relations
  - Sub-ordinating UNL relations
- Use of UNL subgraphs for anaphora resolution
Use of UNL Relations to extract the concepts for Anaphora Resolution

Co-ordinating UNL Relations
The relations obtained for the referring expression exactly match the relations obtained for the anaphor
Sub-ordinating UNL Relations

The relations obtained for the anaphor depend on the relations of the referring expression
Rules
agt obj ben agt plc aoj
Co-ordinating UNL Relations Example

raju ramuvai mirattinaan. avan payanthaan
Mirattu, threaten (agt>thing, obj>thing)
- agt: Raju (iof>person)
- obj: Ramu (iof>person)
Payam, scare (obj>thing)
- obj: Avan, he (pronoun)
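For the co-ordinating case, resolution amounts to finding the antecedent concept that holds the same UNL relation as the pronoun. A minimal sketch for this example follows; the dictionary layout and the function `resolve_coordinating` are hypothetical, and real UNL graphs carry far more structure than a flat relation map.

```python
# Relations of the first sentence (raju ramuvai mirattinaan):
# relation label -> concept
antecedent_relations = {"agt": "Raju", "obj": "Ramu"}

# In the second sentence (avan payanthaan) the pronoun avan is the obj of scare.
anaphor = ("avan", "obj")


def resolve_coordinating(anaphor_relation, relations):
    """Pick the concept whose relation label matches the anaphor's relation."""
    return relations.get(anaphor_relation)  # None if no relation matches


resolved = resolve_coordinating(anaphor[1], antecedent_relations)
```

Here `avan` resolves to Ramu, the concept that is the obj of threaten, rather than to Raju.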
Subordinating UNL Relations Example

raamalingaththai annan sabapathi kaNdiththaar. Anaal avar avarukku kattuppadavillai.
Kandippu, scold (agt>thing, obj>thing)
- agt: Annan sabapathi (icl>person)
- obj: Ramalingam (iof>person)
Kattuppadu, abide (agt>thing, obj>thing)
- ben: Avar(kku), he (pronoun)
- agt: Avar, he (pronoun)
Event Pronouns - Example

[UNL graph: nodes Happen (icl>action), Ramalingam (iof>person), Spiritual (aoj>thing), Speech (icl>talk); relations obj, aoj, mod]
[UNL graph: node agree (agt>thing, obj>thing); relations agt: Avar, to: Athu(kku)]
Semantic Representation
Semantic Interpretation
- Binding the user utterance to a concept, or a representation of concepts, that the system can understand
- The process of mapping a syntactically analysed natural language text to a representation of its meaning
Semantic Interpretation - Aspects
- Word meaning & word sense disambiguation
- Lexical disambiguation
- Structural disambiguation
- Semantic relations
Issues
- Coreference and anaphora
- Lexical semantics
- Syntactical and grammatical categories
- Logical semantics
Applications of Semantic Interpretation

- Information Extraction: extracting meaningful templates
- Text Summarization: semantic representation of text for selecting the important inter-relations between concepts
- Question Answering: extracting semantically similar sentences that are answers to questions
- Multilingual Generation & Machine Translation: intermediate semantic representation
Proposed Work
- Semantic interpretation of Tamil text
- Use of UNL as the basis for semantic representation
- Use of UNL-based information for NLP processing
- Use of UNL graphs for summarization and question answering
Semantic Relation (UNL) Extraction Rule Based Approach

Morpho-Semantic Rule Based Approach
- Existing approaches
  - Most UNL enconverters use a syntactic parser
  - Morpho-syntactic features
- Use of rich morphological features of Tamil for semantic relation extraction
- Design of rules based on morpho-semantic features
- Use of semantic constraint information from the UNL list
Enconversion Process
Pass 1
- Identify the possible UNL relations of a word Wi
Pass 2
- Disambiguate the relations, if multiple UNL relations are assigned to a word
- Identify the concepts connected with the word Wi
- Morphology: case suffixes associated with the word (e.g. pazhathai)
- Connective: natural language words such as maRRum, allathu, etc.
- Co-occurrence: e.g. Raamanaal seyyappattathu
- POS: part-of-speech tag of the word (noun, verb, adjective, adverb)
- Semantics: icl>person, iof>place, icl>time, etc.
Construction of UNL Graph using Two Passes

[Figure: UNL graph after Pass 1 and after Pass 2]
Another Example
[Figure: UNL graph after Pass 1 and after Pass 2]
Our Approach - Bootstrapping

- Pattern representation
  - Generic pattern to tackle multiple relations
- Features
  - Morphological suffix
  - POS
  - UNL semantic constraints
- Matching
  - Exact matching
  - Partial matching
- Ignore tuples which do not take part in identifying semantic relations
Unsupervised Approach
Features used for Probability estimation
- Morphological suffix
- POS
- Semantic constraints
- Starting and ending symbols
Relation between concept pairs can occur anywhere
Semantic similarity based on the UNL ontology
Feature-tagged corpus: tagged using the rule-based approach
Question Answering
Question Classification
Need for QC & answer types
- QC: accurately classify a question into a question type and then map it to an expected answer type
  - What is the biggest city in the United States?
  - Question type: Q_LOCATION_CITY
- Extract and filter the answer type to improve the overall accuracy of a question answering system
Morpheme-based CRF approach to question classification and expected answer type detection
Factoid type
- Who: [Who is India's prime minister?]
- When: [When did India become an independent country?]
- Where: [Where was Gandhiji born?]
- Which: [Which state has the highest population in India?]
- Abbreviation: [What is the expansion of IAS?]
Definition type
- How: [How does a DC generator operate?]
- Who: [Who is Manmohan Singh?]
- Define: [Define Kirchhoff's Law]
List type
- Enumerate: [Enumerate the districts in Tamil Nadu]
- List: [List out the states in India]
Question Classification
DESC: ABBR, DEFINITION, MEANING, REASON, OTHER
NUM: AGE, AREA, CODE, COUNT, DISTANCE, FREQUENCY, ORDER, PERCENT, PHONENUMBER, POSTCODE, PRICE, RANGE, SPEED, TELCODE, TEMPERATURE, WEIGHT, LIST, OTHER
HUM: ALIAS, DESCRIPTION, ORGANIZATION, PERSON, LIST, OTHER
OBJ: ANIMAL, CITY, COLOR, CURRENCY, ENTERTAIN, FOOD, INSTRUMENT, LANGUAGE, PLANT, RELIGION, SUBSTANCE, VEHICLE, LIST, OTHER
LOC: ADDRESS, CITY, CONTINENT, COUNTRY, ISLAND, LAKE, MOUNTAIN, OCEAN, PLANET, PROVINCE, RIVER, LIST, OTHER
TIME: DAY, MONTH, RANGE, TIME, YEAR, LIST, OTHER
By TREC
Factoid Type Question Answering
Our Approach
- Bag-of-keywords matching.
- In the extracted passage, the terms that appear in the question are removed. The remaining concept or entity terms may be answers.
- Person: named entity, possible case marker, question word case marker
- Location: considering possible case markers
- Time: temporal word database, number range
- Quantity: possible words in database
- (): after question term as definition term
- (): before question term as definition term
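The keyword-removal step can be sketched as a set difference between passage terms and question terms. The example uses English for readability, which is an assumption, as are the function `candidate_answers`, the stop-word list and the sample passage; case-marker handling for Tamil is not modelled.

```python
def candidate_answers(question, passage, stop_words=frozenset()):
    """Return passage terms that do not occur in the question."""
    def normalise(term):
        return term.lower().strip(".,?")

    question_terms = {normalise(t) for t in question.split()}
    candidates = []
    for term in passage.split():
        key = normalise(term)
        if key and key not in question_terms and key not in stop_words:
            candidates.append(term.strip(".,?"))
    return candidates


# Hypothetical passage for the question "Where was Gandhiji born?"
answers = candidate_answers(
    "Where was Gandhiji born?",
    "Gandhiji was born in Porbandar.",
    stop_words=frozenset({"in"}),
)
```

The surviving terms are only candidates; the type-specific checks (named entity, case markers, temporal word database) would then filter them by expected answer type.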
Sentence to extract predicate relation

[Figure: block diagram with the components sentences, preprocessing (WordNet), predicate extraction, predicate relation extraction, rule dictionary, predicate rule learning, tagged training document and predicates A(x, y)]
The relation graph gives the semantic relations among all entities, along with the type of each entity.
This semantic information helps in filtering out the required answer part.
Definitional Type Question Answering
Definitional QA Process
Due to the free-word-order nature of Tamil, the ranked sentences will not be the precise answer to the question. So the definition terms are extracted from the sentences using some short patterns (K. Soo Han, 2007; Jinxi Xu, 2003), as given below.
<place of birth> lpiRanthAr
<year> Am Andu
<month> Am mAtham
ivarathu thanthaiyAr <father Name>, thAyAr <mother Name>
<year> Am Andu maRainthAr
The leaf nodes of the answer graph give the details presented in the sentence. The definition answer is created using the definition templates.
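Short patterns of this kind map naturally onto regular expressions. The sketch below covers three of the patterns over transliterated text; the slot names, the exact regular expressions and the sample names Raman and Sita are hypothetical, and it assumes years are written as four digits.

```python
import re

# Hypothetical regex versions of three of the short patterns.
DEFINITION_PATTERNS = {
    "year": re.compile(r"(\d{4}) Am Andu"),
    "father": re.compile(r"thanthaiyAr (\w+)"),
    "mother": re.compile(r"thAyAr (\w+)"),
}


def extract_definition_terms(sentence):
    """Return the definition terms found in one transliterated sentence."""
    found = {}
    for slot, pattern in DEFINITION_PATTERNS.items():
        match = pattern.search(sentence)
        if match:
            found[slot] = match.group(1)
    return found


parents = extract_definition_terms("ivarathu thanthaiyAr Raman, thAyAr Sita")
```

The extracted slot values are what would be attached as leaf nodes of the answer graph before a definition template is filled.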
Use of statistically processed seed information for classification and scoring of sentences for inclusion in the answer graph representing the definitional answer to who questions
[Figure: definitional QA architecture. Components: target term (name of person); web; document retrieval (using BAVANI); sentence classification; statistical processing of a definitional sentence corpus; seed information; sentence tagging based on seed information; term probability; sentence ranking; definition term extractor; answer graph generator; definition]
Sentence Classification
$$TW(t) = \frac{n_d}{N}$$
where TW is the term weight, n_d is the number of documents in which the term t occurred, and N is the total number of documents.
$$f(x) = w^{t} x + b$$
where w is the weight vector of the features and b is the intercept.

S. No   Category    Features
1       Birth       (piRappu) (piRanthAr) (thOnRiNar)
2       Parent      (peRROr) (thAyAr) (thanthaiyAr)
3       Education   (kalvi) (padippu) (paLLi)
4       Work        (paNi) (vElai)
5       Award       (viruthu) (parisu)
6       Death       (iRanthAr) (maRainthAr)
7       General     (pErasiriyar) (vinjAni) (arasiyalvAthi)
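Both formulas are short enough to compute directly. The sketch below assumes each document is represented as a set of transliterated terms and that features and weights are plain lists; the function names and the four toy documents are hypothetical.

```python
def term_weight(term, documents):
    """TW(t) = n_d / N: fraction of documents that contain the term."""
    n_d = sum(1 for doc in documents if term in doc)
    return n_d / len(documents)


def linear_score(weights, features, bias=0.0):
    """f(x) = w^t x + b for one feature vector."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical corpus of four documents, each a set of terms.
toy_documents = [
    {"piRanthAr", "kalvi"},
    {"piRanthAr", "viruthu"},
    {"paNi"},
    {"maRainthAr"},
]
birth_weight = term_weight("piRanthAr", toy_documents)
```

A sentence would then be assigned to the category whose classifier gives the highest `linear_score` over its term-weight features.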
[Example: sentences from the retrieved documents, first marked <D> or <ND>, then with the <D> sentences tagged with categories such as <BIR>, <PAR>, <EDU>, <WRK>, <AWD> and <S>]
- Based on the knowledge base tree
- Relations between the terms in the knowledge base
- Graph expansion with lower levels of the knowledge base tree
Tamil Document Summarization Using Semantic Graph Method
Our Work
- Capturing the semantic features of the document.
- Identifying key concepts and relations for summarization.
- Using a machine learning model to identify a sub-graph of the original document's semantic graph.
Detailed Design
[Figure: detailed design with three stages. SEMANTIC GRAPH GENERATION: linguistic analysis, syntactic and semantic analysis, logical form parsing, identification of named entities, co-reference resolution for named entities (light-weight approach), semantic normalization (WordNet), construction of the semantic graph. SUB GRAPH IDENTIFICATION: feature set identification (linguistic and semantic graph attributes, document discourse structure), learning algorithm (SVM). EXTRACTING SUMMARY SENTENCES.]
Prediction
After training, the learned model is used to predict the important nodes of the given document's semantic graph.
Identification of Sub-Graph

The sub-graph of the large semantic graph is generated from the SVM's output.
Extraction of Summary Sentences

The sentences containing the SOPs present in the sub-graph are extracted from the input document.
Sample Input Document
Morphological Analyzer Output
Logical Form Parser
Graph Generation
Identification of Sub-Graph
Extraction of Summary Sentences
Tamil Summary Generation for a Cricket Match
Objective
- To propose a framework for automatic analysis and summary generation for a cricket match in Tamil, with the scorecard of the match as the input.
- The framework proposes a method to evaluate the interestingness of a cricket match.
- The framework proposes a customization model for the summary.
- The framework also proposes methods for evaluating the humanness of the generated summary.
Data Mining and Analytics

- A modified version of the Apriori algorithm is used to find association rules from the feature vectors.
- Mathematical analysis using the coefficient of variation (CoV) is performed; CoV is plotted against the average to give an idea of how consistent a player is.
- The interestingness of the match is calculated as the weighted average of the scores assigned to the identified factors, which include the winning margin, team history, individual records made, high run rate, series state, relative position in international ranking, reaction in social networks, etc.
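The two numeric measures can be sketched as follows, assuming that CoV here means the coefficient of variation (standard deviation divided by the mean) and that each factor score is already scaled to a common range. The function names, factor names and weights are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """Standard deviation divided by the mean; lower means more consistent."""
    avg = mean(scores)
    return pstdev(scores) / avg if avg else float("inf")


def interestingness(factor_scores, weights):
    """Weighted average of the factor scores of one match."""
    total_weight = sum(weights[name] for name in factor_scores)
    weighted_sum = sum(factor_scores[name] * weights[name] for name in factor_scores)
    return weighted_sum / total_weight


# Hypothetical factor scores and weights, for illustration only.
match_score = interestingness(
    {"winning_margin": 0.5, "high_run_rate": 1.0},
    {"winning_margin": 3, "high_run_rate": 1},
)
```

Two players with the same average can then be told apart: the one with the lower CoV scores more steadily from innings to innings.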
Sentence Generation
- The sentence that is most apt for the current event under consideration is selected.
- The vocabulary used in the sentence and the depth to which an event is discussed are also varied based on the expertise level of the user.
- The nouns in the key events are passed to the morphological generator along with the desired case endings, and the generated variants are added to the sentences.
- The system uses the morphological generator developed at TaCoLa.
Event Clustering
Clustering for news event detection
Why UNL context Cluster?

Identifying semantic coherence between two sentences is based on the overlap of terms between the sentences.
In a newspaper article, the term overlap between two sentences is minimal.
Each sentence can describe more than one event, which makes it difficult to segment the sentences properly.
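To make the term-overlap problem concrete, here is a small sketch that measures overlap with the Jaccard coefficient. The choice of Jaccard and the two English sentences are assumptions made for illustration; the slides do not name a specific overlap measure.

```python
def term_overlap(sentence_a, sentence_b):
    """Jaccard overlap between the word sets of two sentences."""
    terms_a = set(sentence_a.lower().split())
    terms_b = set(sentence_b.lower().split())
    if not terms_a or not terms_b:
        return 0.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)


# Two hypothetical consecutive news sentences about the same event.
first = "the chief minister opened the new bridge today"
second = "traffic across the river is expected to ease"
overlap = term_overlap(first, second)
```

The two sentences share only the word "the", so a purely term-based coherence measure sees them as almost unrelated even though they describe one event.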
UNL-based context clustering for news event detection: Event Analysis

In natural language text, event analysis involves discovering the portion of text in a sentence that describes an event, the participants of the event, the actual event occurrence, and the time of the event.
All these event-specific properties are obtained from the UNL (Universal Networking Language) representation.
These properties help in separating event sentences from non-event sentences without much effort.
Contribution: the degree of connectivity of a concept with UNL event-specific semantics, combined with the concept distance score as well as the TF/IDF score.
Event Representation in UNL
Snapshot for Tamil News Event Search
Lyrics Mining & Generation
Lyric Mining
We have processed 2,000 lyrics.
Analysis
Word level analysis
Rhyme analysis
Concept co-occurrence analysis
Pleasantness score
This analysis has mainly been used in lyric generation and in computing freshness scores for lyrics.
Lyric Mining
Word Level Analysis
The frequency of words is used to associate a popularity score with each word.
The popularity score of a word is identified from lyrics.
In lyrics, words carry suffixes, so the root word is used to determine the frequency count.
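A minimal sketch of root-based frequency counting is shown below. The suffix list, the romanized words and the normalization by the most frequent root are all assumptions for illustration; the actual system would obtain roots from a morphological analyser rather than from this toy suffix stripper.

```python
from collections import Counter

# Hypothetical romanized suffixes (with sandhi letters), longest first.
SUFFIXES = ("kkalai", "kkal", "yil")


def root_of(word):
    """Strip the first matching suffix; a stand-in for a real morphological analyser."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def popularity_scores(lyrics):
    """Count root occurrences and scale by the most frequent root (an assumed normalization)."""
    counts = Counter(root_of(word) for line in lyrics for word in line.split())
    top = max(counts.values())
    return {root: count / top for root, count in counts.items()}


# Hypothetical lyric lines in romanized Tamil.
lyrics = ["mazhai mazhaiyil pookkal", "pookkalai mazhai"]
scores = popularity_scores(lyrics)
```

Here "mazhaiyil" is counted under "mazhai", and "pookkal" and "pookkalai" are both counted under "poo", so inflected forms contribute to the popularity of their root.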
Lyric Mining
Word Level Analysis - Results
A lyric corpus of two thousand songs was analysed for word, rhyme and concept co-occurrence usage.
List of top 5 words by usage in lyrics (usage counts): 1153, 1062, 793, 965, 857
Lyric Mining
Rhyme Level Analysis
Adapted Apriori algorithm
Frequency count of rhyme, alliteration and end rhyme pairs of Tamil lyrics
a) Top 5 word rhymes (EDHUGAI) by usage: 2291, 2255, 2028, 1973, 1952
b) Top 5 word alliterations (MONAI) by usage: 3338, 3145, 2947, 2763, 2480
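The frequency counting can be sketched as below. This is plain pair counting, not the adapted Apriori algorithm named on the slide, and it tests exact letter identity only, which simplifies the traditional prosody rules. Tamil letters are formed by attaching combining marks (vowel signs and the pulli) to the preceding base character; the sample word pairs are chosen for illustration.

```python
import unicodedata
from collections import Counter


def tamil_letters(word):
    """Split a Tamil word into letters by attaching combining marks to the previous base character."""
    letters = []
    for ch in word:
        if unicodedata.category(ch).startswith("M") and letters:
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def rhyme_kinds(first, second):
    """Return which of monai (first letter), edhugai (second letter) and end rhyme (last letter) match."""
    a, b = tamil_letters(first), tamil_letters(second)
    kinds = set()
    if a[0] == b[0]:
        kinds.add("monai")
    if len(a) > 1 and len(b) > 1 and a[1] == b[1]:
        kinds.add("edhugai")
    if a[-1] == b[-1]:
        kinds.add("end_rhyme")
    return kinds


def count_rhymes(pairs):
    """Count how often each rhyme kind occurs over a list of non-empty word pairs."""
    counts = Counter()
    for first, second in pairs:
        counts.update(rhyme_kinds(first, second))
    return counts


pairs = [("கண்", "மண்"), ("கடல்", "கரை")]
counts = count_rhymes(pairs)
```

Splitting on Unicode combining marks avoids treating a vowel sign or pulli as a letter of its own, which a naive character-by-character comparison would do.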
Lyric Mining
Concept Co-occurrence Analysis
Finds pairs of terms that frequently occur together in the lyric corpus.
Uses Agaraadhi, an online Tamil dictionary.
Ambiguous and polysemous words are cancelled out to improve the accuracy of the entire system.
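A minimal sketch of co-occurrence counting follows. It assumes each lyric has already been reduced to a list of concept labels (the English labels here are hypothetical) and that a pair counts once per lyric; the support threshold is likewise an assumption.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(lyrics_concepts):
    """Count, for every unordered concept pair, the number of lyrics containing both."""
    counts = Counter()
    for concepts in lyrics_concepts:
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    return counts


def frequent_pairs(counts, min_support):
    """Keep only the pairs seen in at least min_support lyrics."""
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical concept lists, one per lyric.
lyrics_concepts = [["rain", "cloud", "love"], ["rain", "cloud"], ["love", "moon"]]
counts = cooccurrence_counts(lyrics_concepts)
frequent = frequent_pairs(counts, min_support=2)
```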
Lyric Mining
Pleasantness score
Identify the pleasantness of a word based on 5 models
3 models: language independent
2 models: language dependent
In all the models, the given grapheme word is first converted into phoneme form using Tamil phonology rules.
Models
Meaning based model
Language Dependent Model I
Language Dependent Model II
Manner of articulation based model
Manner and place of articulation based model
Lyric Mining
Pleasantness score
Meaning based Model
Maintain lists of pleasant and unpleasant words
Calculate the frequency of each phoneme in the pleasant and unpleasant word lists
Language Dependent Model I
Judge the pleasantness based on the Vallinam, Mellinam and Idaiyinam classification, Maathirai, and kurukkams except kutriyalikaram
Language Dependent Model II
Lyric Mining
Pleasantness score
Manner of articulation based model
Category: Manner of Articulation
Greater Rough: Retroflex, Trill
Rough: Tap, Dental, Bilabial
Intermediate: Semivowels, Approximants
Soft: Nasal
Lyric Mining
Pleasantness score
Manner and place of articulation based model
For the place of articulation score, categories which arise from the parts near the oral cavity are considered more pleasant than those which go deeper.
Taking manner of articulation into consideration, Nasals are given the highest sweetness score, followed by Laterals, Fricatives, Stops and Trills.
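The manner-of-articulation ordering above can be sketched as a simple scoring function. Only the ordering (Nasals, then Laterals, Fricatives, Stops and Trills) comes from the slide; the numeric scores, the romanized phoneme table and the averaging are hypothetical, and the place-of-articulation component is left out.

```python
# Hypothetical scores that only preserve the ordering given on the slide.
MANNER_SCORE = {"nasal": 5, "lateral": 4, "fricative": 3, "stop": 2, "trill": 1}

# Hypothetical romanized phoneme table; vowels are deliberately not scored.
PHONEME_MANNER = {
    "m": "nasal", "n": "nasal",
    "l": "lateral",
    "s": "fricative",
    "k": "stop", "t": "stop", "p": "stop",
    "r": "trill",
}


def pleasantness(phonemes):
    """Average manner score over the consonants that appear in the table."""
    scores = [MANNER_SCORE[PHONEME_MANNER[p]] for p in phonemes if p in PHONEME_MANNER]
    return sum(scores) / len(scores) if scores else 0.0


soft_word = pleasantness(["m", "a", "n", "a", "m"])
rough_word = pleasantness(["k", "a", "r", "a", "t"])
```

A word built from nasals scores higher than one built from stops and a trill, matching the ranking on the slide.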
Applications
COREE: The Concept-based Search Engine
Components of a Search Engine

Crawler (or Worm or Spider)
collects pages
checks for page changes
Indexer
constructs a sophisticated file structure to enable fast page retrieval
Searcher
searches the indexed information that satisfies user queries
ranks output
Search Engine Architecture
The Concept based Search

Indexes concepts instead of words
Indexes concepts and relations between concepts
In this work
Representing the document
Use of UNL (Universal Networking Language) as intermediate structure
UNL consists of concepts and relations
Three indices: Concept-Relation-Concept, Concept-Relation, Concept
Query converted into UNL representation
Searching and ranking based on concepts & relations rather than words
COREE Architecture
[Figure: COREE architecture. Labels shown: Thesaurus, Input Processing, Parsed Query, UNL Index Based list, UNL Expressions/Query Translation [UNL Encoding], UNL Graph, UNL Based Ranking, IL Query, Light Weight WSD, Query Expansion, NER List, MWE List, UNL Based Matching, Set of Documents with UNL Expressions, Document Processing, WWW, UW List, UNL Based Indexing, Detailed UNL Expression, Focused Web Crawler, Document Processing using Semantic Approach, Tamil to UNL Converter, Selection of UNL Expressions for Indexing, WSD, Searching]
Modules of COREE
Focused Crawling
UNL-based Document Processing
Sentence Extraction, Enconversion, Construction of three types of multilist indexes
UNL-based Input Processing
Query Expansion, Query Translation
UNL-based Searching
Matching and Ranking
UNL-based Output Processing
Information Extraction, Summary Generation
Document Processing
[Figure: document processing. Stages shown: Tamil Document, Extraction of components of sentences (TF based), Enconversion (Concept & Relation), UNL Expression/Graph (multi-list). Resources shown: WSD, NER, Tamil Enconversion Rules, Tamil UW list]
UNL Lists
UWList: Universal Word List
MWList: Multiword List
UW concepts are iof>city
Relations shown: plf, via, plt
UNL relations are disambiguated using the semantics of the concepts (iof>city)
MULTILIST
[Figure: multilist structure. Labels shown: Head Node, Concept Nodes, Relation Nodes, To Concept Nodes]
Example

Concept Relation Concept
Concept Only
Concept Relation
The Index Structure

Input: a set of UNL graphs as a MultiList data structure.
Output: UNL indices stored in three Binary Search Trees.
The UNL Indexer parses the UNL graphs and builds an inverted list on the indices.
The indices are categorized into three different types to aid retrieval of semantically relevant documents:
CRC (Concept-Relation-Concept) indices
CR (Concept-Relation) indices
C (Concept Only) indices
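The three index types can be sketched as inverted lists keyed by triple, by concept-relation pair, and by single concept. This sketch uses Python dictionaries in place of the Binary Search Trees mentioned above, takes each UNL graph as a plain list of (concept, relation, concept) triples, and keys the CR index on the source concept; all three are simplifying assumptions, and the document ids and triples are hypothetical.

```python
from collections import defaultdict


def build_indices(doc_graphs):
    """Build CRC, CR and C inverted lists from {doc_id: [(source, relation, target), ...]}."""
    crc_index = defaultdict(set)
    cr_index = defaultdict(set)
    c_index = defaultdict(set)
    for doc_id, triples in doc_graphs.items():
        for source, relation, target in triples:
            crc_index[(source, relation, target)].add(doc_id)
            cr_index[(source, relation)].add(doc_id)
            c_index[source].add(doc_id)
            c_index[target].add(doc_id)
    return crc_index, cr_index, c_index


# Hypothetical documents with one UNL triple each.
doc_graphs = {
    "d1": [("lecture", "pos", "vivekanandar")],
    "d2": [("lecture", "plc", "chennai")],
}
crc_index, cr_index, c_index = build_indices(doc_graphs)
```

The CRC index gives the most specific lookup, while the C index is the broadest: both documents are reachable from the concept "lecture", but only d1 from the full triple.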
Query Translation

[s]
[w]
; vivekanandar; iof>person; Entity; 1
; lecture; icl>action; Noun; 2
[/w]
[r]
2 pos 1
[/r]
[/s]
Query Expansion
[Figure: query expansion. Labels shown: Input Query, NER, WSD, Parsed Query, Morphological Processing, Query Expansion (index based/verb-noun pairs), Expanded Query, Domain-specific noun-verb pairs, Analyzed index table. Note on the figure: other tools to be integrated]
Query Expansion
Columns: Query word; Query word with expanded word; Relation
Relations shown: pos, pos, and, and, pos, and, plf, iof, pos, pos, pos
Search
Indexed Concept, Concept-Relation and Concept-Relation-Concept
UNL Based Indexing
UNL Based Searcher: performing various levels of matching
Expanded query
UNL query
UNL Based Ranking
Exact match (CRC)
Concept-Relation match
Concept Only match
Ranked set of documents (O/P)
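The three levels of matching can be sketched as a tiered ranking in which an exact CRC match outranks a Concept-Relation match, which outranks a Concept Only match. The numeric levels, the tie-break on document id and the sample documents are assumptions; the actual ranking in COREE may combine further signals.

```python
def match_level(query, triples):
    """Return 3 for an exact CRC match, 2 for a CR match, 1 for a concept-only match, 0 otherwise."""
    q_source, q_relation, q_target = query
    best = 0
    for source, relation, target in triples:
        if (source, relation, target) == query:
            return 3
        if (source, relation) == (q_source, q_relation):
            best = max(best, 2)
        elif q_source in (source, target) or q_target in (source, target):
            best = max(best, 1)
    return best


def rank(query, doc_graphs):
    """Order matching documents by match level (highest first), then by document id."""
    scored = [(match_level(query, triples), doc_id) for doc_id, triples in doc_graphs.items()]
    ordered = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [doc_id for level, doc_id in ordered if level > 0]


# Hypothetical documents, each reduced to a list of UNL triples.
doc_graphs = {
    "d1": [("lecture", "pos", "vivekanandar")],
    "d2": [("lecture", "pos", "gandhi")],
    "d3": [("vivekanandar", "plc", "chennai")],
    "d4": [("temple", "plc", "madurai")],
}
ranked = rank(("lecture", "pos", "vivekanandar"), doc_graphs)
```

Document d4 shares no concept with the query and is dropped, while d3 is still returned because it mentions one of the query concepts.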
Output Processing
UNL Document
Capturing tourism-related info from the UNL documents
Filling Templates
Morphological Generator
Document summary
Summarization of the sentences generated from the morphological generator
Example Result Snapshot in 5 Windows
Actual query term Match
Actual query term+Concept Match
Conceptual Results
Single Term Match
Expanded Term Match Results
AGARAADHI - A NOVEL ONLINE DICTIONARY FRAMEWORK
OBJECTIVES
Agaraadhi is a dictionary framework for indexing and retrieving Tamil words, their meaning, analysis and related information.
The framework incorporates various unique features, designed to provide additional information to the user about the word they query.
AGARAADHI FRAMEWORK
Agaraadhi - Features

1. Morphological Analyzer
2. Morphological Generator
3. Spelling suggestion
4. Equivalent word
5. Picture Dictionary
6. Rare Word

1. Lyric Related
2. Kural Related
3. Popularity of the word
4. Pleasantness score
5. Bharathiyar Songs Related
Agaraadhi Meaning for the Word mazhai
Agaraadhi Meaning for the Word pookkal (example for case ending word)
Kuralagam - Concept Relation based Search Engine for Thirukkural
Objectives
Kuralagam is a conceptual search framework for Thirukkural based on the UNL framework.
Searching with keywords in kurals and interpretations
Concept-based search using CoReX conceptual indexing based on UNL
Bilingual search: English and Tamil
Showing relationships between the concepts
Kuralagam Framework
Online Processing
Search and Ranking fetches the Thirukkural number and its details.
Thirukkurals for a given query are fetched using the two types of concept relation indices, namely CRC and C.
The query concept is expanded using related CRC indices pointing to the query concept.
This helps in retrieving many Thirukkurals conceptually related to the query, which is not possible with keyword-based Thirukkural search engines.
The ranking is based on:
priority of the indices in the order CRC > C
usage score
frequency of occurrence of the query concept
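The ranking criteria listed here can be sketched as a multi-key sort. The order in which the three criteria break ties, the field names and the sample hits (including the kural numbers) are assumptions made for illustration only.

```python
# CRC hits take priority over C hits, as stated on the slide.
INDEX_PRIORITY = {"CRC": 2, "C": 1}


def rank_kurals(hits):
    """Sort hits by index priority, then usage score, then concept frequency (all descending)."""
    return sorted(
        hits,
        key=lambda hit: (
            -INDEX_PRIORITY[hit["index"]],
            -hit["usage_score"],
            -hit["frequency"],
            hit["kural"],
        ),
    )


# Hypothetical hits for one query concept.
hits = [
    {"kural": 751, "index": "C", "usage_score": 0.9, "frequency": 2},
    {"kural": 753, "index": "CRC", "usage_score": 0.4, "frequency": 1},
    {"kural": 760, "index": "CRC", "usage_score": 0.4, "frequency": 3},
]
ranked_kurals = [hit["kural"] for hit in rank_kurals(hits)]
```

The C hit ranks last despite its higher usage score because index priority is applied first.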
Kuralagam results for the query word paNam
Kuralagam conceptual results for the query word paNam: Dhanam
Kuralagam Meaning for a particular kural
Tamil Word Game
Tamil Word Game Miruginajumbo (Jumble words)
Tamil Word Game Kattaboman (Scramble words)
Tamil Word Game Thookku Thookki (Hang man)
Tamil Word Game

1. Miruginajumbo (Jumble words)
2. Kattaboman (Scramble words)
3. Thookku Thookki (Hang man)
Scoring
The score can be calculated using the word's usage across the web, the time taken, and the number of tile swaps.
Demo
COREE
Agaraadhi
Tamil Language Based Games
