02 Text Operation
5. Thesaurus construction
Term categorization to allow expansion of query with related terms
Index
Terms
approximation
of importance for classification, summarization, etc.
Normalization
• Normalization standardizes tokens so that matches occur
despite superficial differences in their character
sequences
Terms in the indexed text and query terms need to be “normalized”
into the same form
Example: We want to match U.S.A. and USA, so we delete periods
in a term
Case folding: it is often best to lowercase everything, since users
will type lowercase regardless of ‘correct’ capitalization
Republican vs. republican
Fasil vs. fasil vs. FASIL
Anti-discriminatory vs. antidiscriminatory
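The normalization steps above can be sketched in a few lines. This is a minimal illustration, not a complete normalizer: the function name `normalize` is hypothetical, and deleting every period and hyphen is an assumption that suits the examples here but would be too aggressive for some collections.

```python
def normalize(token: str) -> str:
    """Map a token to a canonical form so superficial variants match."""
    token = token.replace(".", "")  # U.S.A. -> USA
    token = token.replace("-", "")  # Anti-discriminatory -> Antidiscriminatory
    return token.lower()            # case folding: FASIL -> fasil


# The same function must be applied to indexed text and to query terms.
for raw in ["U.S.A.", "USA", "Republican", "FASIL", "Anti-discriminatory"]:
    print(raw, "->", normalize(raw))
```

Because both sides go through the same function, a query for USA matches a document containing U.S.A.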
Stemming methods (diagram): Affix Removal Method (longest match, simple removal), Successor Variety Method, Table Lookup Method, n-gram Method
statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti
$S = \frac{2C}{A + B} = \frac{2 \times 6}{7 + 8} = 0.80$
A and B are the numbers of unique digrams in the first and the second words. C
is the number of unique digrams shared by the two words.
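The digram similarity can be checked with a short sketch. The function names are hypothetical, and the first word, "statistics", is an assumption chosen because it has 7 unique digrams and shares 6 with "statistical", which agrees with the numbers in the formula.

```python
def unique_digrams(word: str) -> set[str]:
    """Return the set of distinct adjacent letter pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(first: str, second: str) -> float:
    """Compute S = 2C / (A + B) over the unique digrams of two words."""
    a = unique_digrams(first)
    b = unique_digrams(second)
    if not a and not b:
        return 0.0  # words too short to have any digram
    shared = len(a & b)  # C: digrams common to both words
    return 2 * shared / (len(a) + len(b))


print(sorted(unique_digrams("statistical")))
print(dice_similarity("statistics", "statistical"))
```

Words whose similarity exceeds a chosen threshold can then be grouped into the same conflation class.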
Among the rules in a set, the one whose left-hand side matches the longest
sequence of letters is applied.
Applied to the word stresses, this yields the stem stress instead of the stem stresse.
A detailed description of the Porter algorithm can be found in the appendix of the
textbook, and its implementation at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer
The most common algorithm for stemming English words to
their common grammatical root
It is a simple procedure for removing known affixes in
English without using a dictionary. To get rid of plurals,
the following rules are used:
SSES → SS   caresses → caress
IES → Y   ponies → pony
SS → SS   caress → caress
S → (deleted)   cats → cat
EMENT → (deleted, if what remains is longer than 1 character)
replacement → replac
cement → cement
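The rules above can be tried out with a small sketch. It is not the full Porter algorithm, only the rules listed here, combined with longest-match selection. Several things are assumptions: the function names are hypothetical, the IES rule follows the listing here (the published Porter step maps IES to I), the last rule is read as deleting a final "ement", and its condition is a plain character count rather than Porter's measure of the stem.

```python
# (suffix, replacement) pairs for plural removal, as listed above.
PLURAL_RULES = [
    ("sses", "ss"),
    ("ies", "y"),   # as listed here; published Porter uses "i"
    ("ss", "ss"),
    ("s", ""),
]


def strip_plural(word: str) -> str:
    """Apply the plural rule whose suffix is the longest match."""
    matches = [(s, r) for s, r in PLURAL_RULES if word.endswith(s)]
    if not matches:
        return word
    suffix, replacement = max(matches, key=lambda rule: len(rule[0]))
    return word[:len(word) - len(suffix)] + replacement


def strip_ement(word: str) -> str:
    """Delete a final 'ement' if more than one character remains."""
    if word.endswith("ement"):
        stem = word[:-len("ement")]
        if len(stem) > 1:  # simplified condition, an assumption
            return stem
    return word


for w in ["caresses", "ponies", "caress", "cats", "stresses"]:
    print(w, "->", strip_plural(w))
for w in ["replacement", "cement"]:
    print(w, "->", strip_ement(w))
```

Longest-match selection is what keeps stresses from being cut by the shorter S rule.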
-color = colour, paint
Aim of Thesaurus
A thesaurus tries to control the use of the vocabulary by
showing a set of related words to handle synonyms and
homonyms
The aim of a thesaurus is therefore:
to provide a standard vocabulary for indexing and searching
Terms are rewritten to form equivalence classes, and these
equivalence classes are indexed
When the document contains automobile, index it under car as well
(usually also vice versa)
to assist users in locating terms for proper query
formulation: when the query contains automobile, look
under car as well to expand the query
to provide classified hierarchies that allow the broadening
and narrowing of the current request according to user needs
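Query expansion through equivalence classes might look like the sketch below. The class contents, the helper names, and the extra query term "rental" are hypothetical. Only the car/automobile pair comes from the text.

```python
# Each set is one equivalence class of interchangeable terms.
EQUIVALENCE_CLASSES = [
    {"car", "automobile"},
]


def build_lookup(classes):
    """Map every term to the equivalence class that contains it."""
    lookup = {}
    for cls in classes:
        for term in cls:
            lookup[term] = cls
    return lookup


def expand_query(terms, lookup):
    """Add the other members of each term's class, keeping query order."""
    expanded = []
    for term in terms:
        others = sorted(lookup.get(term, set()) - {term})
        for candidate in [term] + others:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


lookup = build_lookup(EQUIVALENCE_CLASSES)
print(expand_query(["automobile", "rental"], lookup))
```

The same lookup can be used at indexing time, so a document containing automobile is also indexed under car.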
Thesaurus Construction
Example: thesaurus built to assist IR for searching cars and
vehicles:
Term: Motor vehicles
UF: Automobiles, Cars, Trucks
BT: Vehicles
RT: Road Engineering, Road Transport
Example: thesaurus built to assist IR in the field of
computer science:
Term: natural languages
UF: natural language processing, NLP (UF = used for)
BT: languages (BT = broader term)
TT: languages (TT = top term)
RT: artificial intelligence, computational linguistics, formal languages,
query languages, speech recognition (RT = related term/s)
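A thesaurus entry of this shape can be held in a small record type. The sketch below is one possible representation, not a standard format: the class `ThesaurusEntry` and the two helper functions are hypothetical names, and matching is case-insensitive by assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    term: str                                          # preferred term
    used_for: list[str] = field(default_factory=list)  # UF
    broader: list[str] = field(default_factory=list)   # BT
    related: list[str] = field(default_factory=list)   # RT


ENTRIES = [
    ThesaurusEntry(
        term="Motor vehicles",
        used_for=["Automobiles", "Cars", "Trucks"],
        broader=["Vehicles"],
        related=["Road Engineering", "Road Transport"],
    ),
]


def find_entry(word, entries):
    """Find the entry whose preferred term or UF list contains the word."""
    key = word.lower()
    for entry in entries:
        names = [entry.term] + entry.used_for
        if key in (name.lower() for name in names):
            return entry
    return None


def broader_terms(word, entries):
    """Return the BT list for a word, used to broaden a request."""
    entry = find_entry(word, entries)
    return entry.broader if entry else []


print(find_entry("cars", ENTRIES).term)
print(broader_terms("Automobiles", ENTRIES))
```

Looking a word up through its UF list gives the standard indexing term, while the BT list supports broadening the current request.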
Language-specificity
Many of the above features embody transformations
that are language-specific and, often, application-specific
These are “plug-in” addenda to the indexing process
Both open source and commercial plug-ins are
available for handling these
probability mass is in the “tail”
Sample Word Frequency Data
If the most frequent term occurs $f_1$ times, then the second most
frequent term has half as many occurrences, the third most
frequent term has a third as many, etc.
Zipf's Law states that the frequency of the i-th most
frequent word is $1/i^{\theta}$ times that of the most frequent word
That is, the frequency of occurrence of some event (P), as a function of the rank (i)
when the rank is determined by the frequency of occurrence, is a
power-law function $P_i \sim 1/i^{\theta}$ with the exponent $\theta$ close to unity.
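The relationship can be made concrete with a short sketch. The function name `zipf_frequency` and the top frequency of 1200 are hypothetical values chosen for illustration, and the exponent defaults to exactly 1, which is the idealized case.

```python
def zipf_frequency(f1: float, rank: int, theta: float = 1.0) -> float:
    """Predicted frequency of the rank-th most frequent word: f1 / rank**theta."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    return f1 / rank ** theta


top = 1200  # hypothetical count of the most frequent term
for rank in range(1, 6):
    print(rank, zipf_frequency(top, rank))
```

With the exponent at 1, rank 2 gets half the top count and rank 3 a third, matching the description above.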
              retrieved                 not retrieved
irrelevant    irrelevant & retrieved    irrelevant & not retrieved
relevant      relevant & retrieved      relevant but not retrieved
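The four cells of this table can be computed with set operations. This is a minimal sketch with a hypothetical function name and made-up document identifiers.

```python
def contingency_cells(collection, retrieved, relevant):
    """Split a collection into the four relevant/retrieved cells."""
    collection = set(collection)
    retrieved = set(retrieved) & collection
    relevant = set(relevant) & collection
    return {
        "relevant & retrieved": relevant & retrieved,
        "relevant but not retrieved": relevant - retrieved,
        "irrelevant & retrieved": retrieved - relevant,
        "irrelevant & not retrieved": collection - retrieved - relevant,
    }


# Hypothetical document ids, for illustration only.
cells = contingency_cells(
    collection={1, 2, 3, 4, 5, 6},
    retrieved={1, 2, 3},
    relevant={2, 3, 4},
)
for name, docs in cells.items():
    print(name, sorted(docs))
```

Every document in the collection falls into exactly one of the four cells.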