Dr. TV. Geetha


A Special Talk

Tamil Computing
Dr. T.V. Geetha, Tamil Computing Lab (TACOLA), Dept. of CSE & IST, College of Engineering Guindy, Anna University, Chennai
Team Co-ordinators: Ranjani Parthasarathy & Dr. Madhan Karky
27th January 2012

Characteristics of Tamil
Partially free-word order language
Morphologically rich language
Morphological suffixes convey most of the roles played in a sentence
Ambiguity at morphological level
Ambiguity at semantic level

Our Basis
Linguistics
Use of rich morphological features of Tamil
Use of POS tags
Use of word-based semantics with well-defined semantic constraints (primitives): UNL

Computer Science & Engineering
Rule-based approach & FSA for Tamil
Clustering approaches
Probabilistic approaches: Naive Bayes, Conditional Random Field, HMM
Machine learning: bootstrapping & unsupervised approach

Language Processing
Morphological Analyzer
POS Tagging
Chunking
Named Entity Recognition
Parser
Word Sense Disambiguation
Anaphora Resolution
Semantic Interpretation

Morphological Analyzer


Morphological Analyser - Introduction


Most textual data contains compound, numeral and colloquial words.
Because of its morphological richness, Tamil needs special handling of those words.
Development of an integrated morphological analyser (compound, numeral, colloquial)
  Needed to tackle news and lyrics
  Helps to increase the accuracy of morphological analysis

Morphological Analyser - Word processing
Morphological suffix stripping (conventional analyser) yields the resulting word W
The word W is processed by the compound analyser
The word W is processed by the numeral analyser
(Processing runs right to left, concatenating vowels and consonants, iteratively checking letters against the Morph Dictionary and applying Tamil grammar rules)

Morphological Analyser
Compound Word Representation

Morphological Analyser
Numeral Word Representation

Compound Analyser Rules Classification

Morphological Analyser
Compound Analyser
Based on a Finite State Transducer (FST)
Handles not only simple compounding but also compounding between two words that causes inflectional variation during the compounding process
Ex: (i) (Golden statue)
Rule: If the second constituent's first letter is a hard consonant and the first constituent's last letter is a vowel / medial consonant, then no modification

Morphological Analyser
Compound Analyser
Ex: (Root tree)
Rule: If the second constituent's first letter is a consonant, then '' is inserted as the first constituent's last letter
(Root)
Insertion

Morphological Analyser
Compound Analyser
Ex: (Sand pot)
Rule: If the second constituent's first letter is a hard consonant, then the first constituent's last letter '' is replaced by ''
(Pot)
Replacement

Morphological Analyser
Compound Analyser
Ex: (Banana)
Rule: If the second constituent's first letter is a hard consonant and the first constituent's last letter is the same hard consonant, then it is deleted
(Fruit)
Deletion

Compound Analyser Rules Classification

Compound Word Analyser - FST
Finite State Transducer
Two tapes describe the input (lexical form) and output (surface form) sequences
It is specified by seven tuples, among them:
  Σ1 represents the finite input alphabet (ai1, ..., aik)
  Σ2 represents the finite output alphabet (bi1, ..., bik)
  Q is a finite set of states (S0, S1, S2, S3, S4, S5, S6)
  The initial state is S0, an element of Q
  F is a subset of Q, the set of final states (S6)
Here a:b represents the replacement of a in the surface form by b in the lexical form
c/d states that the transition can occur if either c or d is in the lexical form
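The transducer definition above can be illustrated with a minimal sketch. It is not the analyser's actual machine: the class name `FST`, the two-state toy machine and its Latin-letter symbols are assumptions made only to show how an a:b arc rewrites a surface symbol into a lexical symbol while the machine moves between states.

```python
from typing import Dict, Optional, Set, Tuple


class FST:
    """Deterministic finite-state transducer over single symbols (illustrative)."""

    def __init__(self, initial: str, finals: Set[str],
                 transitions: Dict[Tuple[str, str], Tuple[str, str]]):
        # transitions maps (state, surface_symbol) -> (lexical_symbol, next_state)
        self.initial = initial
        self.finals = finals
        self.transitions = transitions

    def analyse(self, surface: str) -> Optional[str]:
        """Return the lexical string, or None if the input is rejected."""
        state = self.initial
        output = []
        for symbol in surface:
            step = self.transitions.get((state, symbol))
            if step is None:
                return None  # no arc for this symbol in this state
            lexical, state = step
            output.append(lexical)
        return "".join(output) if state in self.finals else None


# Toy machine (hypothetical): copies 'x' and rewrites surface 'a' to lexical 'b'.
toy = FST(
    initial="S0",
    finals={"S1"},
    transitions={
        ("S0", "x"): ("x", "S0"),  # x:x arc, stay in S0
        ("S0", "a"): ("b", "S1"),  # a:b arc, move to the final state
    },
)
```

Calling `toy.analyse("xxa")` walks two x:x arcs and one a:b arc and ends in the final state; an input that stops in a non-final state or meets a missing arc is rejected.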

Compound Analyser FST

Morphological Analyser
Numeral Analyser
Based on a Finite State Transducer (FST)
Number words for one to ten, hundred, thousand, lakh and crore can be directly converted into numbers
Ex: (Ten)
Rule: No modification
10

Morphological Analyser
Numeral Analyser
Ex: (Five Thousand)
Rule: If the second constituent's first letter is a vowel and the first constituent's last letter is a hard consonant, then insert ''
(Thousand)
Insertion
5000

Morphological Analyser
Numeral Analyser
Ex: (Twenty Five)
Rule: If the second constituent's first letter is a vowel and the first constituent's last letter is a hard consonant, then replace that with ''
(Five)
Replacement
25

Morphological Analyser
Numeral Analyser
Ex: (Twenty Three)
Rule: If the second constituent's first letter is a soft consonant and the first constituent's last letter is a hard consonant, then delete the hard consonant
(Three)
Deletion
23

Numeral Analyser FST

Morphological Analyser
Colloquial Analyser
Based on a pattern mapping approach
To the best of our knowledge, no previous work has attempted to convert informal words to formal words.
Adopts spelling variation rules and performs the mapping that transforms an informal (colloquial) written word into a formal written word.

Colloquial Analyser
Pattern-based approach using spelling variation rules
Word processing: right to left
List of spelling variation rules
  Suffix mapping of ending patterns
  Suffix mapping of ending patterns with morphographemic changes
  Suffix mapping of ending patterns with checking of one/two preceding characters
  Suffix mapping of patterns occurring at any place
In all the rules, pattern p1 of the colloquial form is converted into pattern p2 of the normal form

Colloquial Analyser
Suffix mapping of ending patterns
Ending pattern p1 is replaced with pattern p2
Ex: (irukean) -> (irukirean)
    Pattern 1    Pattern 2 (replaced)

Colloquial Analyser
Suffix mapping of ending patterns with morphographemic changes
Ending pattern p1 is replaced with pattern p2, then p2 is passed on for morphographemic change
Ex: (thambi kittey) -> (thambi yidam)
    Pattern 1          Pattern 2 (replaced, then morphographemic change)

Colloquial Analyser
Suffix mapping of ending patterns with checking of one/two preceding characters
Ending pattern p1 is replaced with pattern p2, after checking one or two preceding characters.
Ex: (kaa nju) -> (kaa yndhu)
    Pattern 1     Pattern 2 (replaced)
Check one preceding character: (A)
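The ending-pattern rules for colloquial words can be sketched roughly as below. The rule table is an assumption: it contains only the two transliterated examples from the slides, and the exact boundaries of p1 and p2 (`kean`/`kirean`, `nju`/`yndhu`) and the required preceding character are guesses from those examples, not the analyser's real rule list. The names `Rule` and `normalise` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    p1: str                   # colloquial ending pattern
    p2: str                   # formal replacement pattern
    preceding: Optional[str]  # text required just before p1, or None for no check


# Hypothetical rule table built only from the transliterated slide examples.
RULES: List[Rule] = [
    Rule("kean", "kirean", None),  # irukean -> irukirean
    Rule("nju", "yndhu", "a"),     # kaanju -> kaayndhu, only when preceded by 'a'
]


def normalise(word: str, rules: List[Rule] = RULES) -> str:
    """Apply the first matching ending rule; return the word unchanged otherwise."""
    for rule in rules:
        if not word.endswith(rule.p1):
            continue
        stem = word[: -len(rule.p1)]
        # Optional check of the character(s) preceding the ending pattern.
        if rule.preceding is not None and not stem.endswith(rule.preceding):
            continue
        return stem + rule.p2
    return word
```

With this table, `normalise("irukean")` gives `irukirean`, while a word that ends in `nju` without the required preceding character is left untouched. The morphographemic step of the second rule type is not modelled here.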

Compound Word - Semantic Relation
Extracting the semantic relation for compound words
  Identifying the metaphor words
  Identifying the characteristics of the components
  Identifying the comparison relation between the components
These relations are extracted using the Tamil grammar of compound words (thogai), the part-of-speech tag and UNL semantic constraints

Compound Word - Semantic Relation
14 relations are identified
Ex: To identify the Color relation: (black hair) = (black + hair)
  black: noun + icl>color
  hair: noun + pof>head
(Labels in the figure: concept, semantic constraint, relation)

Part Of Speech (POS) Tagging


POS Tagging
What is POS tagging?
Part-of-speech tagging is the process of assigning a part of speech such as noun, verb, pronoun, preposition, adverb, adjective or another lexical class marker to each word in a sentence.

Tag sets for different languages
For Tamil, a tag set was formulated from a literature survey and a review of standard tag sets for English, such as the Penn Treebank (Wall Street Journal) tag set.

Noun category
N      Noun
NP     Noun phrase
NN     Noun + noun
NNP    Noun + noun phrase
IN     Interrogative noun
INP    Interrogative noun phrase
PN     Pronominal noun
PNP    Pronominal noun phrase
VN     Verbal noun
VNP    Verbal noun phrase
Pn     Pronoun
PnP    Pronoun phrase
Nn     Nominal noun
NnP    Nominal noun phrase

Other category
SP     Sub-ordinate clause conjunction phrase
SCC    Sub-ordinate clause conjunction
Par    Particle
adj    Adjective
Iadj   Interrogative adjective
Dadj   Demonstrative adjective
Inter  Interjection
Int    Intensifier
CNum   Character number
Num    Number
DT     Date time
PO     Post position

Verb category
V      Verb
VP     Verbal phrase
Vinf   Verb infinitive
Vvp    Verb verbal participle
Vrp    Verbal relative participle
AV     Auxiliary verb
FV     Finite verb
NFV    Negative finite verb
adv    Adverb

Characteristics by analyzing 4,70,000 words
Tamil words take on more than one morphological suffix; often the number of suffixes is 3, with the maximum going up to 13.
The sequence of morphological suffixes attached to a word plays a role in determining its part-of-speech tag.
79 morpheme components were identified, which can combine to form about 2000 meaningful combinations of integrated suffixes.

Using these morpheme properties, we design a Naive Bayes probabilistic model for POS tagging.
Analysis of the morphology of words and design of a Naive Bayes model for POS based on morpheme components
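A Naive Bayes tagger over morpheme components could look roughly like the sketch below. It is a generic multinomial Naive Bayes with add-one smoothing, not the lab's model: the class name `SuffixNaiveBayes`, the smoothing choice and the morpheme labels used with it (such as `TENSE_PAST` or `CASE_ACC`) are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple


class SuffixNaiveBayes:
    """Multinomial Naive Bayes over the morpheme components of a word (sketch)."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                       # add-alpha smoothing constant
        self.tag_counts: Counter = Counter()     # how often each POS tag was seen
        self.feat_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: Set[str] = set()             # all morpheme components seen

    def fit(self, data: Iterable[Tuple[List[str], str]]) -> None:
        for morphemes, tag in data:
            self.tag_counts[tag] += 1
            for m in morphemes:
                self.feat_counts[tag][m] += 1
                self.vocab.add(m)

    def predict(self, morphemes: List[str]) -> Optional[str]:
        total = sum(self.tag_counts.values())
        best_tag, best_score = None, -math.inf
        for tag, count in self.tag_counts.items():
            # log P(tag) + sum of log P(morpheme | tag), with smoothing
            score = math.log(count / total)
            denom = sum(self.feat_counts[tag].values()) + self.alpha * len(self.vocab)
            for m in morphemes:
                score += math.log((self.feat_counts[tag][m] + self.alpha) / denom)
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag
```

Training on a handful of (morpheme list, tag) pairs and calling `predict` with the morpheme components of an unseen word returns the tag with the highest smoothed posterior; in practice the components would come from the morphological analyser.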

Chunking


Chunking
What is Chunking?
Chunking is the task of identifying and segmenting text into syntactically related, non-overlapping groups of words.
Need for chunking
  An important preprocessing step for other language processing tasks
  Helps to extract the crux of the information from sentences and documents
The chunk types are ADJP, ADVP, CONJP, INTJ, NP, PP and VP.

Our Approach
The morpheme features of words contribute to identifying chunk boundaries.
Using these morpheme components as one of the features, a CRF model is designed.
Using morpheme components as features for Conditional Random Field models for identifying chunk boundaries

Transliteration     POS       Chunk
[intha]             <adj>     B-NP
[thakavalin]        <Ngen>    I-NP
[atippataiyil]      <Nloc>    I-NP
[pOlicAr]           <noun>    B-NP
[andthandtha]       <Dadj>    B-NP
[mAvatta]           <Madj>    I-NP
[pOlish]            <noun>    I-NP
[cOthanaic]         <adj>     I-NP
[cAvati]            <noun>    I-NP
[maiyangkalil]      <Nloc>    I-NP
[vAkanac]           <adj>     B-NP
[cOthanaiyil]       <Nloc>    I-NP
[Itupattanar]       <FV>      B-VP

State features
F(word(-2),Ctag)  F(word(-1),Ctag)  F(word(0),Ctag)  F(word(1),Ctag)  F(word(2),Ctag)
F(POS(-2),Ctag)   F(POS(-1),Ctag)   F(POS(0),Ctag)   F(POS(1),Ctag)   F(POS(2),Ctag)

Transition features
F(word(-2),word(-1),Ctag)  F(word(-1),word(0),Ctag)  F(word(0),word(1),Ctag)  F(word(1),word(2),Ctag)
F(POS(-2),POS(-1),Ctag)    F(POS(-1),POS(0),Ctag)    F(POS(0),POS(1),Ctag)    F(POS(1),POS(2),Ctag)
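The feature templates listed above can be generated per token with a small helper. This is a sketch of the windowing only: the function name `token_features` and the string format of the feature keys are assumptions, and pairing each feature with the chunk tag Ctag is left to whichever CRF toolkit consumes the dictionaries.

```python
from typing import Dict, List


def token_features(words: List[str], pos: List[str], i: int) -> Dict[str, str]:
    """Window features for position i; positions outside the sentence are skipped."""
    feats: Dict[str, str] = {}
    n = len(words)
    # Single-position features: word(-2)..word(2) and POS(-2)..POS(2)
    for off in range(-2, 3):
        j = i + off
        if 0 <= j < n:
            feats[f"word({off})"] = words[j]
            feats[f"POS({off})"] = pos[j]
    # Adjacent-pair features: word(k),word(k+1) and POS(k),POS(k+1) for k in -2..1
    for off in range(-2, 2):
        j, k = i + off, i + off + 1
        if 0 <= j and k < n:
            feats[f"word({off}),word({off + 1})"] = f"{words[j]}|{words[k]}"
            feats[f"POS({off}),POS({off + 1})"] = f"{pos[j]}|{pos[k]}"
    return feats
```

For the first word of the example sentence, only offsets 0 to 2 exist, so the left-context features are simply absent rather than padded.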


Named Entity Recognition


Named Entity Recognition (NER)
Locate and classify atomic elements in text into predefined categories
  Proper names (people, organizations, locations)
  Expressions of time
  Quantities
  Monetary values
  Percentages
Need for NER
  Robust handling of proper names is essential for many applications
  Pre-processing for different classification levels
  Key part of an Information Extraction system
  Information filtering
  Information linking

Indian language NER
Uses two levels of linguistic evidence to perform modeling:
  Context cues
  Attributes to identify entities
A standard list of attributes is maintained initially
The list is updated by a suitable learning algorithm
Attributes are thus extracted and used to identify NEs within the framework of the same system: an NER and its associated attribute extractor.

Named Entity Recognition
Challenges
  Absence of capitalization of entities
  Presence of a free word order
  Lemmatization is difficult
Features
  Postpositions
  Case markers
  PNG marker in verb

Named Entity Recognition
For Persons:
  Presence of titles and honorifics like [thiru], [thalaivar]
  Presence of suffixes like [Ar], [Al] in the corresponding noun phrase
For Locations:
  Presence of post-positions
  Presence of adjacent words like [ndakar], [ndathi], [mAvattam]
For Organizations:
  Presence of adjacent words like [ndiRuvanam], [thuRai]
For Time/Date:
  Presence of adjacent words like [thEthi], [ANtu], [mAtham]
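The adjacent-word cues listed above can be turned into a simple lookup, sketched below. This covers only the cue-word evidence, not the statistical model described on the following slides; the function name `cue_label`, the class labels and the choice to inspect exactly one neighbour on each side are assumptions. The cue words themselves are the transliterations given in the slides.

```python
from typing import Dict, List, Optional, Set

# Cue words copied from the slide transliterations; the class labels are illustrative.
CUES: Dict[str, Set[str]] = {
    "PERSON": {"thiru", "thalaivar"},
    "LOCATION": {"ndakar", "ndathi", "mAvattam"},
    "ORGANIZATION": {"ndiRuvanam", "thuRai"},
    "TIME": {"thEthi", "ANtu", "mAtham"},
}


def cue_label(tokens: List[str], i: int) -> Optional[str]:
    """Guess a class for tokens[i] from the words immediately before and after it."""
    neighbours = []
    if i > 0:
        neighbours.append(tokens[i - 1])
    if i + 1 < len(tokens):
        neighbours.append(tokens[i + 1])
    for label, cues in CUES.items():
        if any(word in cues for word in neighbours):
            return label
    return None  # no cue found next to this token
```

A token followed by `ndathi` would be labelled as a location candidate, and one preceded by `thiru` as a person candidate; tokens without a neighbouring cue get no label from this step.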

Figure: NER system components: training data, shallow parsing, semantic parsing, statistical processing, dictionary, NE table, dictionary entries, clue extraction, verb rules

Steps in Expectation Maximization
Seed probability estimates by picking up contextual cues
  Related words
  Ordering
Smoothen the seed probability
Perform ambiguity resolution
Maximize probability values
Named entities are tagged accordingly

Modified EM algorithm
Two problems were encountered with the traditional E-M algorithm:
  It performed only positional analysis, and a modification was required for free word order languages like Tamil
  It was syntactically oriented, and a modification was required to include semantic information
The modification process, called Quantum entanglement, solves both of the above problems.

Example
Probability table over the classes Enloc, Enorg, Noun and Verb. The row and column layout is lost; the values in reading order are:
0.49 0.49 0.64 0.75 0.01 0.01 0.34 0.01 0.01 0.23 0.01 0.86 0.01 0.01 0.01 0.06 0.92 0.12

Parser


Vaanavil Tamil parser
Free word order
Phrase structure grammar
Simple sentences: any number of nouns, adjectives, adverbs...
Clausal sentences: identification using cue words or suffixes
Nested clauses

3. Constituent Formation
The two main components are the noun constituent and the verb constituent
Noun constituent: a noun constituent can contain only a noun (Ex. ) or can be of the following form
  (adjective)* (adjective clause)* (adjective)* noun (case marker) (post position)   Ex.
  (or) noun clause   Ex.

Constituent Formation (cont.)
Verb constituent: (adverb clause)* (adverb)* verb (suffix)*
Ex. 1.   2.
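The noun and verb constituent patterns are regular expressions over tag sequences, so they can be checked mechanically. The sketch below assumes a hypothetical one-letter code per tag and treats the case marker and post position as optional; it does not cover the alternative "noun clause" form, and the tag names are illustrative rather than the parser's own.

```python
import re
from typing import List

# Hypothetical one-letter codes for the constituent tags named on the slides.
CODES = {
    "adjective": "a", "adjective_clause": "A", "noun": "n",
    "case_marker": "c", "post_position": "p",
    "adverb": "d", "adverb_clause": "D", "verb": "v", "suffix": "s",
}

# (adjective)* (adjective clause)* (adjective)* noun (case marker) (post position)
NOUN_CONSTITUENT = re.compile(r"a*A*a*nc?p?")
# (adverb clause)* (adverb)* verb (suffix)*
VERB_CONSTITUENT = re.compile(r"D*d*vs*")


def encode(tags: List[str]) -> str:
    """Map a tag sequence to its one-letter-per-tag string."""
    return "".join(CODES[t] for t in tags)


def is_noun_constituent(tags: List[str]) -> bool:
    return NOUN_CONSTITUENT.fullmatch(encode(tags)) is not None


def is_verb_constituent(tags: List[str]) -> bool:
    return VERB_CONSTITUENT.fullmatch(encode(tags)) is not None
```

Encoding each tag as one character keeps the patterns close to the notation on the slide and avoids whitespace handling inside the regular expressions.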


Constituent Formation in Simple Sentence
Words are grouped based on the function they perform.
1. Adjectives are grouped with their nouns.
   - Adjectives are adjacent to their noun.
2. Adverbs are grouped with their verbs.
   - Adverbs can occur anywhere in the sentence before their verb.
Ex.

Constituent Formation in Complex Sentence
Noun, adjective and adverb clauses are considered.
Step 1: Conversion of the complex sentence to a minimal simple sentence by grouping the clauses
Step 2: The minimal simple sentence is analyzed as described earlier
Step 3: Integration of the clauses into the minimal simple sentence

Grouping of Clauses
Distinguishing feature of the parser
Clauses are generally indicated by special cue suffixes or cue phrases (Ex. verbal participles, relative participles, etc.)
Grouping is done using the position of the cues and linguistics-based heuristic rules

Grouping of Clauses (cont.)
Ex.
Noun clause:
Adjective clause:
Adverb clause:
Converted minimal simple sentence:

Tree Generation
The position of each word in the sentence is also shown, to take care of free word order
First, the converted minimal simple sentence is considered to generate the tree.
Ex.

The NCs and VC are expanded to generate the tree for the actual input sentence.

Walk through of 4 phases with an example:
(After phase 1) (unidentified) (adv) (V) (con) (N) (adv) (rpl) (N) (N) (vpl) (N) (V).
(After phase 2) (N) (adv) (V) (con) (N) (adv) (rpl) (N) (N) (vpl) (N) (V).
(After phase 3)
(After phase 4)

After expanding with the three clauses

Word Sense Disambiguation

Word Sense Disambiguation
The process of identifying which sense of a word is used in a sentence when the word has multiple meanings
Noun and verb sense disambiguation
Bootstrapping uses morphological suffixes, POS, semantic constraints and UNL relations (for verbs)
Pattern representation and features
  Noun: <left (features), ambiguous word (sense set, features), right (features), main verb>
  Verb: <ambiguous word + sense, relations of interest>

Example - Noun Disambiguation
Word 1: showing use of context for noun sense disambiguation
Example 1: aaru - river, number, get cold, heal

Sense    POS    Example
number   Noun   Example 1.1: pandiyan aaru padaikal kondu por thoduthaan
                <entity, aaru<number>, noun+plural suffix, verb+icl>action>
river    Noun   Example 1.2: Tamilnattin periya aaru kaveri aagum
                <adjective, aaru<river>, entity, verb+aoj>thing>

Example - Verb Disambiguation
Word 2: showing use of context for verb sense disambiguation
Example 2: padai - army, disease, offer, create

Sense    POS    Example
offer    Verb   Example 2.1: pakthan iraivanukku pazangalai padaiththaan
                <padai+offer, agt + obj + to>
create   Verb   Example 2.2: iraivan makkalai padaithaan
                <padai+create, agt + obj>

Anaphora Resolution

Anaphora Resolution
The problem of resolving what a pronoun or a noun phrase refers to
Approaches
  Centering Theory (Brennan et al., 1987)
  Hobbs algorithm (Hobbs, 1978)
Applications
  Summarization
  Question Answering
  Information Retrieval

Anaphora Resolution in Tamil
Resolving anaphora in Tamil text
  Partially free-word order language
  Morphologically rich language
  Morphological suffixes convey most of the semantic roles played in a sentence
Use of UNL as the basis for semantic representation

Our Approach
Classification of anaphora: Persons, Places and Events
Centering Theory, modified by incorporating word-level semantics (UNL semantic constraints)
Graph-based approach using sentence-level semantics (UNL graphs)
Absence of case suffixes is handled using UNL graphs
Plural and event pronouns associated with multiple antecedents are tackled using UNL graphs

Classification of Anaphor
Anaphora representing Persons
  Person anaphora - nouns, noun phrases
  Pronouns avan, avaL, avar, ivan, ivaL, ivar and plural pronouns avarkaL and ivarkaL
  Examples
    Raju nandraaka padiththaan. avan thervu ezuthinaan.
    Maanavarkal nandraaka padiththaarkal. avarkaL thervu ezuthinarkal.
Anaphora representing Places
  Place anaphora - nouns, noun phrases
  Pronouns athu, ithu; adverbs such as angu and ingu can also act as pronouns representing places
  Examples
    tiruchy tamilnaattin periya nagarangalil ondru. Ingu amman kovil uLLathu. ithil aayiram thooNkal uLLana.

Anaphora representing Events
Event anaphora - verb phrases, clauses and segments of sentences
Pronouns such as athu, ithu with dative or accusative case represent events
Example
  swami aanmeega soRppozhivu nikazhthinaar. athai kaaNa makkaL koodiyirunthanar.

Ambiguous Pronouns
Pronouns such as athu, ithu can represent both places and events
A higher level of semantics and verb semantics is needed
Examples
  maduraiyil meenatchi kovil ullathu. Ithil aayiram thoonkal uLLana.
  madurayil ulla meenatchi kovilil aanmeeka sorpozhivu nadaipeRRathu. Ithil eeralamaana makkal pangeRRanar.

Semantics Integrated Centering Theory
Word-level semantics: UNL semantic constraints
Anaphora classification
  Filter out the non-anaphoric expressions
UNL semantic constraints
  Filter out the non-referring expressions
Plural pronouns have been tackled to a certain extent

Anaphora Resolution using UNL Graphs
Plural pronouns having multiple concepts as antecedents
Event pronouns
Two components
  Use of UNL relations to extract the concepts for anaphora resolution
    Co-ordinating UNL relations
    Sub-ordinating UNL relations
  Use of UNL subgraphs for anaphora resolution

Use of UNL Relations to extract the concepts for Anaphora Resolution
Co-ordinating UNL relations
  The relations obtained for the referring expression exactly match the relations obtained for the anaphor
Sub-ordinating UNL relations
  The relations obtained for the anaphor depend on the relations of the referring expressions
  Rules
    agt   obj
    ben   agt
    plc   aoj

Co-ordinating UNL Relations - Example
raju ramuvai mirattinaan. avan payanthaan.
Mirattu, threaten (agt>thing, obj>thing)
  agt: Raju (iof>person)
  obj: Ramu (iof>person)
Payam, scare (obj>thing)
  obj: Avan, he (pronoun)
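The co-ordinating case in this example amounts to matching relation labels: the pronoun avan is the obj of payam, so it is linked to the concept that was obj of mirattu. A minimal sketch of that matching step follows; the function name `resolve_coordinating` and the flat (label, concept) list standing in for a UNL graph are assumptions, and the sub-ordinating rules are not modelled.

```python
from typing import List, Optional, Tuple

Relation = Tuple[str, str]  # (UNL relation label, concept)


def resolve_coordinating(antecedent_rels: List[Relation],
                         anaphor_rel: str) -> Optional[str]:
    """Return the antecedent concept whose relation label equals the anaphor's."""
    for label, concept in antecedent_rels:
        if label == anaphor_rel:
            return concept
    return None  # no antecedent carries the same relation


# Relations of the first sentence: mirattu (threaten) has agt Raju and obj Ramu.
previous_sentence = [("agt", "Raju"), ("obj", "Ramu")]
# In the second sentence the pronoun avan is the obj of payam (scare).
antecedent = resolve_coordinating(previous_sentence, "obj")
```

Here `antecedent` is the concept Ramu, which agrees with the reading of the example: the one who was threatened is the one who got scared.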


Subordinating UNL Relations - Example
raamalingaththai annan sabapathi kaNdiththaar. Anaal avar avarukku kattuppadavillai.
Kandippu, scold (agt>thing, obj>thing)
  agt: Annan sabapathi (icl>person)
  obj: Ramalingam (iof>person)
Kattuppadu, abide (agt>thing, obj>thing)
  ben: Avar(kku), he (pronoun)
  agt: Avar, he (pronoun)

Event Pronouns - Example
First graph
  Nodes: Happen (icl>action), Ramalingam (iof>person), Spiritual (aoj>thing), Speech (icl>talk)
  Relation labels: obj, aoj, mod
Second graph
  Nodes: agree (agt>thing, obj>thing), Avar, Athu(kku)
  Relation labels: agt, to

Semantic Representation


Semantic Interpretation
Binding the user utterance to a concept, or a representation of concepts, that the system can understand
The process of mapping a syntactically analysed natural language text to a representation of its meaning
Semantic Interpretation - Aspects
  Word meaning & Word Sense Disambiguation
  Lexical disambiguation
  Structural disambiguation
  Semantic relations
Issues
  Coreference and anaphora
  Lexical semantics
  Syntactical and grammatical categories
  Logical semantics

Applications of Semantic Interpretation
Information Extraction: extracting meaningful templates
Text Summarization: semantic representation of text for selection of important inter-relations between concepts
Question Answering: extracting semantically similar sentences that are answers to questions
Multilingual Generation & Machine Translation: intermediate semantic representation

Proposed Work
Semantic interpretation of Tamil text
Use of UNL as the basis for semantic representation
Use of UNL-based information for NLP processing
Use of UNL graphs for Summarization and Question Answering

Semantic Relation (UNL) Extraction - Rule Based Approach
Morpho-semantic rule-based approach
Existing approaches
  Most UNL enconverters use a syntactic parser
  Morpho-syntactic features
Use of the rich morphological features of Tamil for semantic relation extraction
Design of rules based on morpho-semantic features
Use of semantic constraint information from the UNL list

Enconversion Process
Pass 1
  Identify the possible UNL relations of a word Wi
Pass 2
  Disambiguate the relations, if multiple UNL relations are assigned to a word
  Identify the concepts connected with the word Wi

Morphology - case suffixes associated with the word (e.g. pazhathai)
Connective - natural language words such as maRRum, allathu, etc.
Co-occurrence - e.g. Raamanaal seyyappattathu
POS - part-of-speech tag of the word: noun, verb, adjective, adverb
Semantics - icl>person, iof>place, icl>time, etc.

Construction of UNL Graph using Two Passes
Pass 1
Pass 2

Another Example
Pass 1
Pass 2

Our Approach - Bootstrapping
Pattern representation
  Generic pattern to tackle multiple relations
Features
  Morphological suffix
  POS
  UNL semantic constraints
Matching
  Exact matching
  Partial matching
  Ignore tuples which do not take part in identifying semantic relations

Unsupervised Approach
Features used for probability estimation
  Morphological suffix
  POS
  Semantic constraints
  Starting and ending symbols
Relation between concept pairs can occur anywhere
Semantic similarity based on the UNL ontology
Feature-tagged corpus, tagged using the rule-based approach

Question Answering


Question Classification
Need for QC & answer types
QC: accurately classify a question into a question type and then map it to an expected answer type
  What is the biggest city in the United States?
  Question type: Q_LOCATION_CITY
Extract and filter the answer type to improve the overall accuracy of a question answering system
Morpheme-based CRF approach to question classification and expected answer type detection

Factoid type
  Who - ? [Who is India's prime minister?]
  When - ? [When did India become an independent country?]
  Where - ? [Where was Gandhiji born?]
  Which - ? [Which state has the highest population in India?]
  Abbreviation - ? [What is the expansion of IAS?]
Definition type
  How - ? [How does a DC generator operate?]
  Who - ? [Who is Manmohan Singh?]
  Define - [Define Kirchhoff's Law]
List type
  Enumerate - [Enumerate districts in Tamil Nadu]
  List - [List out states in India]

Question Classification
DESC   ABBR, DEFINITION, MEANING, REASON, OTHER
NUM    AGE, AREA, CODE, COUNT, DISTANCE, FREQUENCY, ORDER, PERCENT, PHONENUMBER, POSTCODE, PRICE, RANGE, SPEED, TELCODE, TEMPERATURE, WEIGHT, LIST, OTHER
HUM    ALIAS, DESCRIPTION, ORGANIZATION, PERSON, LIST, OTHER
OBJ    ANIMAL, CITY, COLOR, CURRENCY, ENTERTAIN, FOOD, INSTRUMENT, LANGUAGE, PLANT, RELIGION, SUBSTANCE, VEHICLE, LIST, OTHER
LOC    ADDRESS, CITY, CONTINENT, COUNTRY, ISLAND, LAKE, MOUNTAIN, OCEAN, PLANET, PROVINCE, RIVER, LIST, OTHER
TIME   DAY, MONTH, RANGE, TIME, YEAR, LIST, OTHER
By TREC

Factoid Type Question Answering


Our Approach
Bag of keywords matching.
In the extracted passage, the terms that appear in the question are removed. The remaining concept or entity terms may be answers.
Person - named entity, possible case marker, question word case marker
Location - considering possible case markers
Time - temporal word database, number range
Quantity - possible words in database
() - after the question term as definition term
() - before the question term as definition term
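The keyword-removal step of the factoid approach can be sketched as follows. The function name `candidate_answers`, the case-insensitive comparison and the optional stop-word set are assumptions; the answer-type checks (named entity, case markers, temporal word database) are not modelled, and the example uses English tokens only for readability.

```python
from typing import List, Optional, Set


def candidate_answers(passage: List[str], question: List[str],
                      stopwords: Optional[Set[str]] = None) -> List[str]:
    """Drop passage terms that occur in the question; what remains are candidates."""
    stop = stopwords or set()
    question_terms = {t.lower() for t in question}
    seen: Set[str] = set()
    candidates: List[str] = []
    for token in passage:
        key = token.lower()
        if key in question_terms or key in stop or key in seen:
            continue  # question term, stop word, or already listed
        seen.add(key)
        candidates.append(token)
    return candidates


question = ["Where", "was", "Gandhiji", "born"]
passage = ["Gandhiji", "was", "born", "in", "Porbandar"]
remaining = candidate_answers(passage, question, stopwords={"in"})
```

After the question terms and the stop word are removed, the only remaining term in this passage is the location, which would then be checked against the expected answer type.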


Sentence to extract predicate relation
Figure: pipeline components: sentences, WordNet, preprocessing, predicate extraction, predicate relation extraction, rule dictionary, predicate rule learning, tagged training document; output predicates A(x, y)

The relation graph gives the semantic relations among all entities, along with the type of each entity
This semantic information helps to filter out the required answer part
Definitional Type Question f yp Q Answering

Tamil Computing

100

Definitional QA Process
Due to the free word order of Tamil, the ranked sentences will not be the precise answer to the question. So the definition terms are extracted from the sentences using short patterns (K. Soo Han, 2007) (Jinxi Xu, 2003), as given below.
  <place of birth> lpiRanthAr
  <year> Am Andu <month> Am mAtham
  ivarathu thanthaiyAr <father Name>, thAyAr <mother Name>
  <year> Am Andu maRainthAr
The leaf nodes of the answer graph give the details presented in the sentence. The definition answer is created using the definition templates.

Use of statistically processed seed information for classification and scoring of sentences for inclusion in the answer graph representing the definitional answer to "who" questions

Figure: definitional QA components: target term (name of person), web, document retrieval (using BAVANI), sentence classification, statistical processing, definitional sentence corpus, seed information, sentence tagging based on seed information, term probability, sentence ranking, definition term extractor, answer graph generator, definition

Sentence Classification
$TW(t) = n_d / N$
where TW is the term weight, $n_d$ is the number of documents in which the term t occurred, and N is the total number of documents

$f(x) = w^{T} x + b$
where w is the weight vector of the features and b is the intercept

S. No   Category    Features
1       Birth       (piRappu) (piRanthAr) (thOnRiNar)
2       Parent      (peRROr) (thAyAr) (thanthaiyAr)
3       Education   (kalvi) (padippu) (paLLi)
4       Work        (paNi) (vElai)
5       Award       (viruthu) (parisu)
6       Death       (iRanthAr) (maRainthAr)
7       General     (pErasiriyar) (vinjAni) (arasiyalvAthi)
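The two formulas on this slide, the term weight TW(t) = n_d / N and the linear decision function f(x) = w^T x + b, can be computed directly. The sketch below assumes documents are given as lists of tokens; the function names `term_weight` and `linear_score` and the sample numbers are hypothetical, and how w and b are learned is outside its scope.

```python
from typing import List, Sequence


def term_weight(term: str, documents: List[List[str]]) -> float:
    """TW(t) = n_d / N: fraction of the documents that contain the term."""
    if not documents:
        return 0.0
    n_d = sum(1 for doc in documents if term in doc)
    return n_d / len(documents)


def linear_score(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """f(x) = w^T x + b for a feature vector x, weights w and intercept b."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Toy corpus using feature words from the table (transliterated).
docs = [["piRanthAr", "kalvi"], ["kalvi"], ["viruthu"], ["kalvi", "paNi"]]
kalvi_weight = term_weight("kalvi", docs)
```

In this toy corpus the term kalvi occurs in three of the four documents, so its weight is 0.75; the sign or magnitude of `linear_score` would then decide the class of a sentence once w and b are available.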

Example: sample sentences, with non-definitional sentences marked <ND>
Example: the same sentences, with definitional sentences marked <D> and tagged by category: <BIR>, <PAR>, <EDU>, <WRK>, <AWD>, <S>

Based on the knowledge base tree
Relations between the terms in the knowledge base
Graph expansion with lower levels of the knowledge base tree

Tamil Document Summarization Using Semantic Graph Method


Our Work
Capturing semantic features of the document.
Identifying key concepts and relations for summarization.
Using a machine learning model to identify a sub-graph of the original document's semantic graph.

Detailed Design
SEMANTIC GRAPH GENERATION
  Linguistic analysis: syntactic and semantic analysis; analysing & logical form parsing
  Identification of named entities
  Co-reference resolution for named entities (light-weight approach)
  Semantic normalization (WordNet)
  Construction of semantic graph
SUB GRAPH IDENTIFICATION
  Feature set identification: linguistic & semantic graph attributes, document discourse structure
  Learning algorithm (SVM)
EXTRACTING SUMMARY SENTENCES

Prediction
After training, the learned model is used to predict the important nodes of the given document's semantic graph.

Identification of Sub-Graph
The sub-graph of the large semantic graph is generated from the SVM's output.

Extraction of Summary Sentences
The sentences containing the SOPs present in the sub-graph are extracted from the input document.

Sample Input Document

Morphological Analyzer Output

Logical Form Parser

Graph Generation

Identification of Sub-Graph

Extraction of Summary Sentences

Tamil Summary Generation for a Cricket Match

Objective
To propose a framework for automatic analysis and summary generation for a cricket match in Tamil, with the scorecard of the match as the input.
The framework proposes a method to evaluate the interestingness of a cricket match.
The framework proposes a customization model for the summary.
The framework also proposes methods for evaluating the humanness of the generated summary.

Data Mining and Analytics


A modified version of the Apriori algorithm is used to find the association rules from the feature vectors. Mathematical analysis using the coefficient of variation (CoV) is performed; CoV is plotted against the average to give an idea of how consistent the player is. The interestingness of the match is calculated as the weighted average of the scores assigned to the identified factors, which include the winning margin, team history, individual records made, high run rate, series state, relative position in international ranking, reaction in social networks, etc.
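A minimal sketch of the two calculations, under stated assumptions: CoV is taken to be the standard deviation divided by the mean, and the factor names, scores and weights are hypothetical placeholders, since the slides do not give the actual values.

```python
from statistics import mean, pstdev


def coefficient_of_variation(scores):
    """Population standard deviation divided by the mean.
    A lower value suggests a more consistent player."""
    avg = mean(scores)
    return pstdev(scores) / avg if avg else float("inf")


def interestingness(factor_scores, weights):
    """Weighted average of the per-factor scores."""
    total_weight = sum(weights[name] for name in factor_scores)
    weighted_sum = sum(score * weights[name]
                       for name, score in factor_scores.items())
    return weighted_sum / total_weight


# Hypothetical run tallies for two players with the same average.
steady = coefficient_of_variation([50, 50, 50, 50])
erratic = coefficient_of_variation([0, 100, 0, 100])

# Hypothetical factor scores in [0, 1] and hypothetical weights.
match_score = interestingness(
    {"winning_margin": 0.8, "run_rate": 0.4},
    {"winning_margin": 3, "run_rate": 1},
)
```

Both players average 50 runs here, so plotting CoV against the average separates them only on the CoV axis, which is the point of the consistency plot.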

Sentence Generation
The sentence which is the most apt to the current event under consideration is selected.
The vocabulary used in the sentence and the depth to which an event is discussed are also varied based on the expert level of the user.
The nouns in the key events are passed to the morphological generator along with the desired case endings, and the generated variants are added to the sentences.
The system uses the morphological generator developed at TaCoLa.

Event Clustering


Clustering for news event detection


Why UNL context Cluster?


Identifying semantic coherence between two sentences is based on the overlap of terms between the sentences.
In a newspaper article, the term overlap between two sentences is minimal.
Each sentence can contain more than one event, which makes it difficult to segment the sentences properly.

UNL based context clustering for news event detection: Event Analysis


In natural language text, event analysis involves discovering the portion of text in a sentence that describes an event, the participants of the event, the actual event occurrence and the time of the event. All these event-specific properties are obtained from the UNL (Universal Networking Language) representation. These properties help in separating the event sentences from non-event sentences without much effort.

Contribution
The degree of connectivity of a concept with UNL event-specific semantics, plus the concept distance score as well as the TF/IDF score.

Event Representation in UNL

Snapshot for Tamil News Event Search

Lyrics Mining & Generation


Lyric Mining
Processing was carried out on 2,000 lyrics.
Analysis
Word level analysis
Rhyme analysis
Concept co-occurrence analysis
Pleasantness score

These analyses have mainly been used in lyric generation and in computing freshness scores for lyrics.

Lyric Mining
Word Level Analysis
The frequency of words is used to associate a popularity score with each word.
The popularity score of a word has been identified from lyrics.
In lyrics, words are attached with suffixes, so the root word is used to determine the frequency count.
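The root-based counting can be sketched as follows. The mapping from inflected forms to roots stands in for the output of a morphological analyser; the transliterated word forms are illustrative, and normalising by the most frequent root is an assumption of this sketch rather than the scoring used in the talk.

```python
from collections import Counter


def popularity_scores(lyric_words, root_of):
    """Count occurrences per root word and scale each count by the
    largest count, giving a score in (0, 1]."""
    counts = Counter(root_of.get(word, word) for word in lyric_words)
    if not counts:
        return {}
    top = max(counts.values())
    return {root: n / top for root, n in counts.items()}


# Hypothetical analyser output: inflected form -> root word.
roots = {"mazhaiyil": "mazhai", "mazhaiyai": "mazhai", "pookkal": "poo"}
words = ["mazhaiyil", "mazhai", "pookkal", "mazhaiyai"]

scores = popularity_scores(words, roots)
```

Here the three surface forms of "mazhai" are pooled into one count, which is why the root rather than the surface word is scored.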

Lyric Mining
Word Level Analysis - Results
WORDS USAGE
1153 1062 793 965 857

A lyric corpus of two thousand songs was analysed for word, rhyme and concept co-occurrence usage.

List of top 5 usage words in lyrics

Lyric Mining
Rhyme Level Analysis

Adapted Apriori Algorithm
Frequency count of rhyme, alliteration and end rhyme pairs of Tamil lyrics

a) Top 5 usage word rhyme (EDHUGAI)
USAGE: 2291, 2255, 2028, 1973, 1952

b) Top 5 usage word alliteration (MONAI)
USAGE: 3338, 3145, 2947, 2763, 2480

Lyric Mining
Concept Co-occurrence Analysis
Frequent occurrence of two terms from a lyric corpus
Agaraadhi, an online Tamil dictionary
Cancelling the ambiguity and polysemy of words to improve the accuracy of the entire system.
Example : The word which has the concept , , , , , ,

Lyric Mining
Pleasantness score
Identify the pleasantness of a word based on 5 models
3 models: Language independent
2 models: Language dependent

In all the models, the given grapheme word is first converted into phoneme form using Tamil phonology rules.
Models
Meaning based model
Language Dependent Model I
Language Dependent Model II
Manner of articulation based model
Manner and place of articulation based model

Lyric Mining
Pleasantness score
Meaning based Model
Maintain the pleasant and unpleasant word lists
Calculate the frequency of each phoneme in the pleasant and unpleasant word lists
Language Dependent Model I
Judge the pleasantness based on the Vallinam, Mellinam, Idaiyinam classification, Maathirai and kurukkams except kutriyalikaram
Language Dependent Model II

Lyric Mining
Pleasantness score
Manner of articulation based model
Category: Manner of Articulation
Greater Rough: Retroflex, Trill
Rough: Tap, Dental, Bilabial
Intermediate: Semivowels, Approximants
Soft: Nasal

Lyric Mining
Pleasantness score
Manner and place of articulation based model
For the place of articulation score, categories which arise from the parts near the oral cavity are considered pleasanter than those which go deeper. Taking manner of articulation into consideration, Nasals are given the highest sweetness score, followed by Laterals, Fricatives, Stops and Trills.
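The manner-of-articulation ordering can be illustrated with a small scoring sketch. Only the ordering (Nasals, then Laterals, Fricatives, Stops, Trills) comes from the slide; the numeric scores, the romanised phoneme symbols, their category assignments and the use of a plain average are all assumptions made for the example.

```python
# Hypothetical scores that only preserve the stated ordering.
MANNER_SCORE = {"nasal": 5, "lateral": 4, "fricative": 3, "stop": 2, "trill": 1}

# Illustrative romanised phonemes; not the inventory used in the talk.
PHONEME_MANNER = {
    "m": "nasal", "n": "nasal",
    "l": "lateral",
    "s": "fricative",
    "k": "stop", "t": "stop", "p": "stop",
    "r": "trill",
}


def manner_pleasantness(phonemes):
    """Average the manner score over the consonants that are in the table;
    vowels and unknown symbols are skipped."""
    scores = [MANNER_SCORE[PHONEME_MANNER[p]]
              for p in phonemes if p in PHONEME_MANNER]
    return sum(scores) / len(scores) if scores else 0.0


soft_word = manner_pleasantness(["m", "a", "n", "a", "m"])
rough_word = manner_pleasantness(["k", "a", "r", "a", "t"])
```

A word made only of nasals gets the top score, while one with stops and a trill scores lower, matching the ordering described above.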

Applications


COREE: The Concept based Search Engine

Components of a Search Engine


Crawler (or Worm or Spider)
collects pages
checks for page changes

Indexer
constructs a sophisticated file structure to enable fast page retrieval

Searcher
searches the indexed information that satisfies user queries
ranks output

Search Engine Architecture


The Concept based Search


Indexes concepts instead of words
Indexes concepts and relations between concepts
In this work
Representing the document
Use of UNL (Universal Networking Language) as intermediate structure
UNL consists of concepts and relations

Three indices: Concept-Relation-Concept, Concept-Relation, Concept
Query converted into UNL representation
Searching and ranking based on concepts & relations rather than words

COREE Architecture
Thesaurus
Input Processing
Parsed Query
UNL Index Based list
UNL Expressions
Query Translation [UNL Encoding]
UNL Graph
UNL based ranking
IL Query
Morphological Analyzer
Light Weight WSD
Query Expansion
NER List
MWE List
UNL Based Matching
Set of Documents With UNL Expressions
Document Processing
WWW
UW List
UNL Based Indexing
Detailed UNL Expression
Focused Web Crawler
Document Processing using Semantic Approach
Tamil to UNL Converter
Selection of UNL Expressions For Indexing
WSD
NER List
MWE List
Searching

Modules of COREE
Focussed Crawling
UNL based Document Processing
  Sentence Extraction
  Enconversion
  Construction of three types of multilist indexes
UNL based Input Processing
  Query Expansion
  Query Translation
UNL based Searching
  Matching and Ranking
UNL based Output Processing
  Information Extraction
  Summary Generation

Document Processing
Tamil Document
Extraction of Components of sentences (TF based)
WSD
NER
Enconversion (Concept & Relation)
Tamil Enconversion Rules
Tamil UW list
UNL Expression/Graph (multi-list)

UNL Lists
UWList: Universal Word List

MWList: Multiword List

UW concepts are iof>city
plf, via, plt
UNL relations are disambiguated using the semantics of the concepts (iof>city)

MULTILIST
Head Node
Concept Nodes
Relation Nodes
To Concept Nodes

Example

Concept Relation Concept

Concept Only

Concept Relation

The Index Structure


Input: a set of UNL graphs as a MultiList data structure
Output: UNL indices stored in three Binary Search Trees
The UNL Indexer parses the UNL graphs and builds an inverted list on the indices.
The indices are categorized into three different types to aid retrieval of semantically relevant documents
CRC (Concept-Relation-Concept) indices
CR (Concept-Relation) indices
C (Concept Only) indices
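A rough sketch of the three inverted indices is given below. Python dictionaries stand in for the Binary Search Trees mentioned above, a UNL graph is simplified to a list of (concept, relation, concept) triples, and the function name `build_indices` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_indices(doc_graphs):
    """Build CRC, CR and C inverted lists: each key maps to the set of
    document ids whose UNL graph contains it."""
    crc = defaultdict(set)  # (concept, relation, concept) -> doc ids
    cr = defaultdict(set)   # (concept, relation) -> doc ids
    c = defaultdict(set)    # concept -> doc ids
    for doc_id, triples in doc_graphs.items():
        for src, rel, dst in triples:
            crc[(src, rel, dst)].add(doc_id)
            cr[(src, rel)].add(doc_id)
            c[src].add(doc_id)
            c[dst].add(doc_id)
    return crc, cr, c


# Hypothetical documents, already converted to simplified UNL triples.
graphs = {
    "doc1": [("vivekanandar", "pos", "lecture")],
    "doc2": [("vivekanandar", "pos", "house")],
}
crc_index, cr_index, c_index = build_indices(graphs)
```

The CRC key is the most specific and therefore matches the fewest documents, while the C key is the broadest.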


Query Translation

[s] [w] ; vivekanandar; iof>person; Entity; 1 ; lecture; icl>action; Noun; 2 [/w] [r] 2 [/r] [/s] pos 1

pos


Query Expansion
Input Query
NER
WSD
Morphological Processing
Parsed Query
Query Expansion (index based/verb-noun pairs)
Domain specific Noun verb pairs
Analyzed Index table
Expanded Query
Other tools to be Integrated

Query Expansion
Query Word, Query word With Expanded word, Relation

< - pos> < - pos> < - and> < - and> < - pos> < - and> < - plf> < - iof> < - pos> < - pos> < - pos>

Search
UNL Based Indexing: Indexed Concept, Concept-Relation and Concept-Relation-Concept
Expanded query
UNL query
UNL Based Searcher: performing various levels of matching
Exact match (CRC)
Concept-Relation Match
Concept Only Match
UNL Based Ranking
Ranked set of documents (O/P)
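The three matching levels can be sketched as a tiered lookup. This is an assumption-laden illustration: the indices are plain dictionaries, the query is a single (concept, relation, concept) triple, documents within a tier are simply sorted by id, and `tiered_search` is a hypothetical name. The actual ranking may use further scores.

```python
def tiered_search(query, crc, cr, c):
    """Return document ids ordered by match level:
    exact CRC match first, then Concept-Relation, then Concept only."""
    src, rel, dst = query
    tiers = [
        crc.get((src, rel, dst), set()),
        cr.get((src, rel), set()),
        c.get(src, set()) | c.get(dst, set()),
    ]
    ranked, seen = [], set()
    for tier in tiers:
        # Add only documents not already ranked at a stronger level.
        ranked.extend(sorted(tier - seen))
        seen |= tier
    return ranked


# Hypothetical indices for three documents.
crc = {("vivekanandar", "pos", "lecture"): {"doc1"}}
cr = {("vivekanandar", "pos"): {"doc1", "doc2"}}
c = {"vivekanandar": {"doc1", "doc2"}, "lecture": {"doc1", "doc3"}}

results = tiered_search(("vivekanandar", "pos", "lecture"), crc, cr, c)
```

A document is reported once, at the strongest level at which it matches, so weaker matches never outrank exact ones.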


Output Processing
UNL Document
Capturing tourism related info from the UNL documents
Filling Templates
Morphological Generator
Summarization of the sentences generated from the morphological generator
Document summary

Example Result Snap Shot in 5 Window

Actual query term Match

Actual query term+Concept Match

Conceptual Results

Single Term Match

Expanded Term Match Results


AGARAADHI - A NOVEL ONLINE DICTIONARY FRAMEWORK

OBJECTIVES
Agaraadhi, a dictionary framework for indexing and retrieving Tamil words, their meaning, analysis and related information.
Framework to incorporate various unique features, designed to provide additional information to the user regarding the word that they query about.

AGARAADHI FRAMEWORK


Agaraadhi - Features

1. Morphological Analyzer
2. Morphological Generator
3. Spelling suggestion
4. Equivalent word
5. Picture Dictionary
6. Rare Word

1. Lyric Related
2. Kural related
3. Popularity of the word
4. Pleasantness score
5. Bharathiyar Songs Related

Agaraadhi Meaning for the Word mazhai

Agaraadhi Meaning for the Word pookkal (example for case ending word)

Kuralagam - Concept Relation based Search Engine for Thirukkural


Objectives
Kuralagam is a conceptual search framework for Thirukkural based on the UNL Framework.
Searching with keywords in kurals and interpretations
Concept based search based on CoReX, conceptual indexing based on UNL
Bilingual search: English and Tamil
Showing relationships between the concepts

Kuralagam Framework


Online Processing
Search and Ranking fetches the Thirukkural number and its details.
Thirukkurals for a given query are fetched using the two types of concept relation indices, namely CRC and C.
The query concept is expanded using related CRC indices pointing to the query concept. This helps in retrieving many Thirukkurals conceptually related to the query, which is not possible with keyword Thirukkural search engines.
The ranking is based on priority to the indices in the order CRC > C, the usage score, and the frequency of occurrence of the query concept.

Kuralagam results for the query word paNam

Kuralagam conceptual results for the query word paNam is Dhanam

Kuralagam Meaning for a particular kural

Tamil Word Game

Tamil Word Game Miruginajumbo (Jumble words)

Tamil Word Game Kattaboman (Scramble words)

Tamil Word Game Thookku Thookki (Hang man)

Tamil Word Game


1. Miruginajumbo (Jumble words)
2. Kattaboman (Scramble words)
3. Thookku Thookki (Hang man)
Scoring
The score can be calculated using word usage across the web, time and number of tile swaps.

Demo
COREE
Agaraadhi
Tamil Language Based Games
