02 Text Operation


Introduction to

Information Storage and Retrieval

Chapter Two: Text Operations


Chapter Objectives
At the end of this chapter students will be able to:
Identify major steps in text operations
Understand challenges in text operations
Differentiate content-bearing terms and function terms
Tokenize text into index terms
Eliminate stopwords from text
Understand stemming and stemming algorithms
Understand the need for thesaurus construction
Compute the distribution of terms in a text
Relate term distribution to term significance



Text Operations
 Five text operations
1. Lexical analysis
 Objective is to handle digits, hyphens, punctuation marks, and the case of letters
2. Stop word elimination
 Objective is to filter out words with very low discriminating power for retrieval purposes
3. Stemming
 Removing affixes to allow retrieval of documents containing syntactic variations of query terms
4. Term selection
 Select index terms that carry more semantic content

5. Thesaurus construction
 Term categorization to allow expansion of a query with related terms



Text Operations …
 Not all words in a document are equally significant in representing the contents/meaning of the document
 Some words carry more meaning than others
 Nouns are the most representative of a document's content
 Therefore, we need to preprocess the text of the documents in a collection to decide which words to use as index terms
 Using the set of all words in a collection to index documents creates too much noise for the retrieval task
 Reducing noise means reducing the number of words that can be used to refer to a document
 Preprocessing is the process of controlling the size of the vocabulary, i.e. the number of distinct words used as index terms
 Preprocessing leads to an improvement in information retrieval performance
 However, some search engines on the Web omit preprocessing: every word in the document is an index term



Generating Document Representatives
Text Processing System
Input text – full text, abstract or title
Output – a document representative adequate for use in an automatic
retrieval system
 The document representative consists of a list of class names, each
name representing a class of words occurring in the total input text.
A document will be indexed by a name if one of its significant words
occurs as a member of that class.

Documents → Tokenization → Stop word elimination → Stemming → Thesaurus → Index Terms



Tokenization/Lexical Analysis
Change text of the documents into words to be adopted as
index terms
Objective - identify words in the text
 Digits, hyphens, punctuation marks, case of letters
 Numbers are usually not good index terms (like 1910, 1999); but some, like 510 B.C., are unique and meaningful
 Hyphens – usually break up the words (e.g. state-of-the-art = state of the art); but some words, e.g. gilt-edged, B-49, are unique words which require their hyphens
 gilt-edged and gilt edged do not mean the same; similarly, B-49 refers to a Soviet submarine while B49 refers to a city bus in the US
 Punctuation marks – remove them entirely unless significant, e.g. in program code, x.exe and xexe differ
 Case of letters – usually not important; convert all text to upper or lower case
Tokenization …
Analyze text into a sequence of discrete tokens (words).
Input: “Friends, Romans and Countrymen”
Output: Tokens (a token is an instance of a sequence of characters
grouped together as a useful semantic unit for processing)
Friends
Romans
and
Countrymen
Each such token is now a candidate for an index entry,
after further processing
But what are valid tokens to produce?

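A minimal Python sketch of this kind of tokenization (the regular expression, and the choice to case-fold and keep only unbroken alphabetic strings, are illustrative assumptions rather than the only valid design):

```python
import re

def tokenize(text):
    # Keep only unbroken strings of alphabetic characters, case-folded.
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("Friends, Romans and Countrymen"))
# ['friends', 'romans', 'and', 'countrymen']
```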


Issues in Tokenization
One word or multiple: How do you decide it is one token or
two or more?
Splitting on whitespace can also split what should be regarded
as a single token. This occurs most commonly with names (San
Francisco, Los Angeles)
Hewlett-Packard  Hewlett and Packard as two tokens?
 state-of-the-art: break up hyphenated sequence.
 San Francisco, Los Angeles
lowercase, lower-case, lower case ?
 data base, database, data-base
• cases with internal spaces that we might wish to regard as a single
token include phone numbers ((800) 234-2333) and dates
(Mar 11, 1983)
• Splitting tokens on spaces can cause bad retrieval results, for
example, if a search for York University mainly returns documents
containing New York University
Issues in Tokenization …
• The problems of hyphens and non-separating whitespace can even
interact; Advertisements for air fares frequently contain items like San
Francisco-Los Angeles, where simply doing whitespace splitting would
give unfortunate results
• Elimination of periods can destroy meaningful tokens such as
 IP addresses (100.2.86.144)
How to handle special cases involving apostrophes, hyphens, etc.? C++,
C#, URLs, emails, …
Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs.
republican) can be a meaningful part of a token, though usually they are
not
 Simplest approach is to ignore all numbers and punctuation and use only
case-insensitive unbroken strings of alphabetic characters as tokens
Generally, don’t index numbers as text,
Will often index “meta-data” , including creation date, format, etc. separately

Issues of tokenization are language specific


Requires the language to be known



Elimination of STOPWORD
Stopwords are extremely common words across document
collections that have no discriminatory power
They may occur in 80% of the documents in a collection.
They appear to be of little value in helping select documents
matching a user need and need to be filtered out from the index list
Examples of stopwords:
articles (a, an, the);
pronouns: (I, he, she, it, their, his)
 prepositions (on, of, in, about, besides, against),
 conjunctions/ connectors (and, but, for, nor, or, so, yet),
 verbs (is, are, was, were),
 adverbs (here, there, out, because, soon, after) and
 adjectives (all, any, each, every, few, many, some)
Stopwords are language dependent
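A minimal sketch of stopword elimination in Python; the tiny stop list below is only illustrative, real systems use much longer, language-dependent lists:

```python
# Illustrative (incomplete) English stop list.
STOPWORDS = {"a", "an", "the", "i", "he", "she", "it", "on", "of", "in",
             "about", "and", "but", "or", "is", "are", "was", "were"}

def remove_stopwords(tokens):
    # Filter out tokens that appear in the stop list.
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "cat", "slept", "in", "the", "living", "room"]))
# ['cat', 'slept', 'living', 'room']
```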
Elimination of STOPWORD …
Stop word elimination used to be standard in older IR
systems
But the trend is away from doing this. Most web search
engines index stop words:
Good query optimization techniques mean you pay little
at query time for including stop words.
You need stopwords for:
Phrase queries: “King of Denmark”
Various song titles, etc.: “Let it be”, “To be or not to be”
“Relational” queries: “flights to London”
Elimination of stopwords might reduce recall (e.g. “To be
or not to be” – all eliminated except “be” – no or irrelevant
retrieval)
How to determine a list of stopwords?
Intuition:
Stopwords have little semantic content; it is typical to remove such
high-frequency words
Stopwords can take up a large share of the text; hence, document size
reduces by 30–50% when stopwords are eliminated
One method: Sort terms (in decreasing order) by collection
frequency and take the most frequent ones
In a collection about insurance practices, “insurance” would be a stop
word
Another method: Build a stop word list that contains a set of
articles, pronouns, etc.
 Why do we need stop lists? With a stop list, we can compare and exclude from the index terms entirely the most common words
 With the removal of stopwords, we can measure a better approximation of importance for classification, summarization, etc.
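The first method above (sorting by collection frequency) can be sketched as follows; the cut-off k is an assumption that would be tuned per collection, and the candidates would still be reviewed by hand:

```python
from collections import Counter

def frequency_stoplist(documents, k=10):
    # documents: a list of token lists; return the k most frequent terms
    # in the collection as stopword candidates.
    counts = Counter(t for tokens in documents for t in tokens)
    return [term for term, _ in counts.most_common(k)]
```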
Normalization
• Normalization is standardizing tokens so that matches occur
despite superficial differences in the character
sequences of the tokens
Need to “normalize” terms in indexed text as well as query terms
into the same form
Example: We want to match U.S.A. and USA, by deleting periods
in a term
Case Folding: Often best to lower case everything, since users
will use lowercase regardless of ‘correct’ capitalization…
Republican vs. republican
Fasil vs. fasil vs. FASIL
Anti-discriminatory vs. antidiscriminatory

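A minimal sketch of the two normalizations mentioned above (period deletion and case folding); real systems apply many more, language-specific rules:

```python
def normalize(token):
    # Delete periods (U.S.A. -> USA) and case-fold (Republican -> republican).
    return token.replace(".", "").lower()

print(normalize("U.S.A."))        # usa
print(normalize("Republican"))    # republican
```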


Normalization issues
Good for
Allow instances of Automobile at the beginning of a sentence to
match with a query of automobile
Helps a search engine when most users type ferrari when they
are interested in a Ferrari car
Bad for
Proper names vs. common nouns
 E.g. General Motors, Associated Press, …
Solution:
lowercase only words at the beginning of the sentence
In IR, lowercasing is most practical because of the way
users issue their queries



Stemming
 Stemming is the conflation of variant forms of a word into a single
representation, the stem, semantically related to the variants
The words connect, connects, connected, connecting, connectivity, connection can be stemmed into connect
 The stem does not need to be a valid word, but it needs to capture the meaning of the word
 Stemming aims to increase the effectiveness of information retrieval, particularly recall, whereby more of the relevant documents in the entire collection are retrieved
 It is also used to reduce the size of index files: since a single stem typically corresponds to several full terms, storing stems instead of terms can achieve a compression factor of about 50 percent
 Conflation of words, the so-called stemming, can either be done manually using some kind of regular expressions or automatically using stemmers
Term conflation
One of the problems involved in the use of free text for
indexing and retrieval is the variation in word forms that is
likely to be encountered
The most common types of variations are
spelling errors (father, fathor)
alternative spellings i.e. locality or national usage (color vs
colour, labor vs labour)
multi-word concepts (database, data base)
affixes (dependent, independent, dependently)
 abbreviations (i.e., that is).



Conflation or stemming methods
Automatic approaches:
 Affix removal method (longest match or simple removal)
 Successor variety method
 Table lookup method
 n-gram method



Affix removal
Affix removal method removes suffix or prefix from the
words so as to convert them into a common stem form
Most of the stemmers that are currently used use this type of
approach for conflation
Affix removal method is based on two principles one is
iterations and the other is longest match
An iterative stemming algorithm is simply a recursive
procedure, as its name implies, which removes strings in
each order-class one at a time, starting at the end of a word
and working toward its beginning
No more than one match is allowed within a single order-
class, by definition
Affix removal
 Iteration is usually based on the fact that suffixes are attached
to stems in a certain order, that is, there exist order-classes of
suffixes
 The longest-match principle states that within any given class
of endings, if more than one ending provides a match, the one
which is longest should be removed
 The first stemmer based on this approach is the one developed
by Lovins (1968); M.F. Porter (1980) also used this method
 However, Porter’s stemmer is more compact and easier to use
than Lovins’s
 YASS is another stemmer based on the same approach; it is
however language independent
Table lookup approach
Store terms and their corresponding stems in a table
Stemming is then done via lookups in the table
One way to do stemming is to store a table of all index
terms and their stems
Terms from queries and indexes could then be stemmed
via table lookup
Problems with this approach:
To make these lookup tables we need to work extensively on a
language
There is some probability that these tables will miss some
exceptional cases
There is a storage overhead for such a table

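A minimal sketch of table-lookup stemming; the toy table below is an assumption, a real table would have to cover the whole language:

```python
# Toy stem table; a realistic one would need an entry for every index term.
STEM_TABLE = {
    "engineering": "engineer",
    "engineered": "engineer",
    "engineer": "engineer",
}

def lookup_stem(term):
    # Fall back to the term itself when it is missing from the table,
    # one of the exceptional cases noted above.
    return STEM_TABLE.get(term, term)
```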


Successor Variety approach
 Determine word and morpheme boundaries based on the distribution of
phonemes in a large body of utterances
 The successor variety of a string is the number of different characters that follow
it in words in some body of text
 The successor variety of substrings of a term will decrease as more characters are
added until a segment boundary is reached
Test Word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,
READING, READS, RED, ROPE, RIPE

Prefix Successor Variety Letters


R 3 E,I,O
RE 2 A,D
REA 1 D
READ 3 A,I,S
READA 1 B
READAB 1 L
READABL 1 E
READABLE 1 (Blank)

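A short Python sketch that reproduces the successor-variety counts in the table above; counting the end-of-word "blank" only when a prefix has no longer continuations mirrors how the table treats READ and READABLE:

```python
def successor_variety(prefix, corpus):
    # Distinct letters that follow the prefix in the corpus words.
    successors = {w[len(prefix)] for w in corpus
                  if w.startswith(prefix) and len(w) > len(prefix)}
    # Count the end-of-word "blank" only when the prefix is itself a word
    # with no longer continuations (as for READABLE in the table).
    if not successors and prefix in corpus:
        return 1
    return len(successors)

CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]
for i in range(1, len("READABLE") + 1):
    prefix = "READABLE"[:i]
    print(prefix, successor_variety(prefix, CORPUS))
```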


n-gram stemmers
 Another method of conflating terms is called the shared digram method
 A digram is a pair of consecutive letters
 Besides digrams we can also use trigrams and hence it is called n-gram method
in general
 Association measures are calculated between pairs of terms based on shared
unique digrams
statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti

statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti

 Dice’s coefficient (similarity)

S = 2C / (A + B) = (2 × 6) / (7 + 8) = 0.80

 A and B are the numbers of unique digrams in the first and the second words; C is the number of unique digrams shared by the two words

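The digram extraction and Dice's coefficient can be sketched directly in Python; running it on the example above reproduces the value 0.80:

```python
def digrams(word):
    # Set of unique pairs of consecutive letters.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(w1, w2):
    a, b = digrams(w1), digrams(w2)
    # Dice's coefficient: S = 2C / (A + B).
    return 2 * len(a & b) / (len(a) + len(b))

print(dice("statistics", "statistical"))   # 2*6 / (7+8) = 0.8
```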


n-gram stemmers …
Similarity measures are determined for all pairs of terms in
the database, forming a similarity matrix
Once such similarity is computed for all the word pairs, they
are clustered into groups
The value of the Dice coefficient gives us the hint that the stem
for this pair of words lies in the first 8 unique digrams out
of 10.



Criteria for judging stemmers
Correctness
Overstemming: too much of a term is removed.
 Can cause unrelated terms to be conflated  retrieval of non-relevant
documents
 Over-stemming is when two words with different stems are stemmed to
the same root. This is also known as a false positive
Understemming: too little of a term is removed.
 Prevent related terms from being conflated  relevant documents may
not be retrieved
 Under-stemming is when two words that should be stemmed to the
same root are not. This is also known as a false negative
Term Stem
GASES GAS (correct)
GAS GA (over)
GASEOUS GASE (under)



Assumptions in Stemming
Words with the same stem are semantically
related and have the same meaning to the
user of the text
The chance of matching increases when the
index terms are reduced to their word stems,
because it is more common to search using
“retrieve” than “retrieving”



Of the four types of stemming strategies (affix removal, table
lookup, successor variety, and n-grams) which is preferable?
 Table lookup consists simply of looking for the stem of a word
in a table, a simple procedure but dependent on data on stems
for the whole language. Since such data is not readily available
and might require considerable storage space, this type of
stemming algorithm might not be practical.
Successor variety stemming is based on the determination of
morpheme boundaries, uses knowledge from structural
linguistics, more complex than affix removal algorithms
 N-grams stemming is based on the identification of digrams
and trigrams and is more a term clustering procedure than a
stemming one
 Affix removal stemming is intuitive, simple, and can be
implemented efficiently
Porter Stemmer
 Porter stemmer is the most popular affix (suffix) removal algorithm, for its
simplicity and elegance
 Despite being simpler, the Porter algorithm yields results comparable to those of
the more sophisticated algorithms
 The Porter algorithm uses a suffix list for suffix stripping; the idea is to apply a
series of rules to the suffixes of the words in the text. For instance, the rule
s → (nil)
replaces a final s by nil, converting a plural into a singular
 Among the rules in a set, the one whose left-hand side matches the longest
sequence of letters is applied

 Applied to the word stresses yields the stem stress instead of the stem stresse.
 A detailed description of the Porter algorithm can be found in the appendix of the
textbook and its implementation at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer
Most common algorithm for stemming English words to
their common grammatical root
It is a simple procedure for removing known affixes in
English without using a dictionary. To get rid of plurals
the following rules are used:
SSES → SS    caresses → caress
IES → Y      ponies → pony
SS → SS      caress → caress
S →          cats → cat
EMENT →  (delete the final ement if what remains is longer than 1 character)
replacement → replac
cement → cement

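A minimal sketch of the plural rules listed above, applied longest suffix first; this is only a fragment of the full Porter algorithm, not a complete stemmer:

```python
def strip_plural(word):
    # Rules are tried longest suffix first, as in the Porter algorithm.
    if word.endswith("sses"):
        return word[:-2]           # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "y"     # ponies -> pony
    if word.endswith("ss"):
        return word                # caress -> caress
    if word.endswith("s"):
        return word[:-1]           # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", strip_plural(w))
```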


Thesauri
Often, full-text searching cannot be accurate, since different
authors may select different words to represent the same
concept
Problem: The same meaning can be expressed using different terms
that are synonyms, homonyms, and related terms
How can we ensure that, for the same meaning, identical
terms are used in the index and in the query?
Thesaurus: The vocabulary of a controlled indexing
language, formally organized so that a priori relationships
between concepts (for example as "broader" and “related")
are made explicit.
A thesaurus contains terms and relationships between terms
IR thesauri rely typically upon the use of symbols such as USE/UF
(UF=used for), BT, and RT to demonstrate inter-term relationships.
e.g., car = automobile, truck, bus, taxi, motor vehicle
color = colour, paint
Aim of Thesaurus
Thesaurus tries to control the use of the vocabulary by
showing a set of related words to handle synonyms and
homonyms
The aim of thesaurus is therefore:
to provide a standard vocabulary for indexing and searching
The thesaurus rewrites terms to form equivalence classes, and we index
such equivalence classes
When the document contains automobile, index it under car as well
(usually, also vice-versa)
to assist users with locating terms for proper query
formulation: When the query contains automobile, look
under car as well for expanding query
to provide classified hierarchies that allow the broadening
and narrowing of the current request according to user needs
Thesaurus Construction
Example: thesaurus built to assist IR for searching cars and
vehicles :
Term: Motor vehicles
UF : Automobiles, Cars, Trucks
BT: Vehicles
RT: Road Engineering, Road Transport
Example: thesaurus built to assist IR in the fields of
computer science:
TERM: natural languages
UF natural language processing (UF=used for NLP)
BT languages (BT=broader term is languages)
TT languages (TT = top term is languages)
RT artificial intelligence (RT=related term/s)
computational linguistics, formal languages, query languages, speech recognition
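A minimal sketch of query expansion with a hand-built thesaurus entry; the dictionary mirrors the "motor vehicles" example above, and which relationship types to expand with (UF, RT, BT) is a design choice:

```python
# Toy thesaurus mirroring the "motor vehicles" entry above.
THESAURUS = {
    "motor vehicles": {
        "UF": ["automobiles", "cars", "trucks"],
        "BT": ["vehicles"],
        "RT": ["road engineering", "road transport"],
    }
}

def expand_query(term):
    entry = THESAURUS.get(term, {})
    # Expand with used-for and related terms; broader terms (BT)
    # could be added to broaden the request further.
    return [term] + entry.get("UF", []) + entry.get("RT", [])

print(expand_query("motor vehicles"))
```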
Language-specificity
Many of the above features embody transformations
that are
Language-specific and
Often, application-specific
These are “plug-in” addenda to the indexing process
Both open source and commercial plug-ins are
available for handling these



Statistical Properties of Text
How is the frequency of different words distributed?
Refer to Zipf’s law
How fast does vocabulary size grow with the size of a
corpus? Refer to Heaps’ law
Such factors affect the performance of IR system & can be used
to select suitable term weights & other aspects of the system.
A few words are very common.
2 most frequent words (e.g. “the”, “of”) can account for about
10% of word occurrences in a document.
Most words are very rare.
Half the words in a corpus appear only once, called “read only
once”
Called a “heavy tailed” distribution, since most of the probability mass is in the “tail”
Sample Word Frequency Data



Zipf’s distributions
For all the words in a documents collection, for each word w
• f : is the frequency that w appears
• r : is rank of w in order of frequency.
(The most commonly occurring word has rank 1, etc.)

[Figure: distribution of sorted word frequencies according to Zipf’s law; a word w has rank r and frequency f (y-axis: f, x-axis: r)]


Word distribution: Zipf's Law
Zipf's Law- named after the Harvard linguistics professor
George Kingsley Zipf (1902-1950),
attempts to capture the distribution of the frequencies (i.e., number
of occurrences) of the words within a text
Zipf's Law states that when the distinct words in a text
are arranged in decreasing order of their frequency of
occurrence (most frequent words first), the occurrence
characteristics of the vocabulary can be characterized by the
constant rank–frequency law of Zipf:
Frequency * Rank = constant
If the words, w, in a collection are ranked, r, by their
frequency, f, they roughly fit the relation: r * f = c
Different collections have different constants c
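The rank-frequency law is easy to inspect on a small sample; the toy sentence below is only illustrative, since the law shows up properly only on large collections:

```python
from collections import Counter

def rank_frequency(tokens):
    # Rank words by decreasing frequency; Zipf's law predicts that
    # rank * frequency stays roughly constant.
    counts = Counter(tokens)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

tokens = "the cat sat on the mat and the dog sat on the rug".split()
for rank, word, freq, product in rank_frequency(tokens):
    print(rank, word, freq, product)
```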
Example: Zipf's Law

The table shows the most frequently occurring words from a 336,310-document collection containing 125,720,891 total words, out of which there are 508,209 unique words
Zipf’s law: modeling word distribution
The collection frequency of the ith most common term is
proportional to 1/i:

f_i ∝ 1/i

If the most frequent term occurs f1 times, then the second most
frequent term has half as many occurrences, the third most
frequent term has a third as many, etc.
Zipf's Law states that the frequency of the i-th most
frequent word is 1/i^θ times that of the most frequent word
The probability of occurrence of some event (P), as a function of the rank (i)
when the rank is determined by the frequency of occurrence, is a
power-law function P_i ~ 1/i^θ with the exponent θ close to unity.



More Example: Zipf’s Law
 Illustration of the Rank-Frequency Law. Let the total number of word
occurrences in the sample be N = 1,000,000
Rank (R)  Term   Frequency (F)   R·(F/N)
1         the    69,971          0.070
2         of     36,411          0.073
3         and    28,852          0.086
4         to     26,149          0.104
5         a      23,237          0.116
6         in     21,341          0.128
7         that   10,595          0.074
8         is     10,099          0.081
9         was     9,816          0.088
10        he      9,543          0.095



Methods that Build on Zipf's Law
• Stop lists: Ignore the most frequent words (upper cut-
off)
• Used by almost all systems
• Significant words: Take words in between the most
frequent (upper cut-off) and least frequent words (lower
cut-off)
• Used in IR systems
• Term weighting: Give differing weights to terms based
on their frequency, with most frequent words weighted
less
• Used by almost all ranking methods
Explanations for Zipf’s Law
The law has been explained by “principle of least effort”
which makes it easier for a speaker or writer of a language
to repeat certain words instead of coining new words
Zipf’s explanation was his “principle of least effort”, which balances
the speaker’s desire for a small vocabulary against the hearer’s desire
for a large one
Zipf’s Law Impact on IR
Good News: Stopwords will account for a large fraction
of text so eliminating them greatly reduces inverted-index
storage costs
Bad News: For most words, gathering sufficient data for
meaningful statistical analysis (e.g. for correlation
analysis for query expansion) is difficult since they are
extremely rare
Word significance: Luhn’s Ideas

Luhn's idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance
Luhn suggested that both extremely common and
extremely uncommon words were not very useful for
indexing



Word significance: Luhn’s Ideas
For this, Luhn specifies two cutoff points: an upper and a
lower cutoffs based on which non-significant words are
excluded
The words exceeding the upper cutoff were considered to be
common
The words below the lower cutoff were considered to be rare
Hence they are not contributing significantly to the content of the
text
The ability of words to discriminate content reaches a peak at a
rank-order position halfway between the two cutoffs
Let f be the frequency of occurrence of words in a text, and r their
rank in decreasing order of word frequency; then a plot relating f and r
yields the following curve

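A sketch of Luhn-style term selection; the two cutoff values are assumptions that would in practice be set empirically for a given collection:

```python
from collections import Counter

def significant_words(tokens, upper_cutoff, lower_cutoff):
    # Keep words whose frequency lies between the two cutoffs: not so
    # frequent that they behave like stopwords, and not so rare that
    # they say little about the content.
    counts = Counter(tokens)
    return {w for w, f in counts.items() if lower_cutoff <= f <= upper_cutoff}
```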


Luhn’s Ideas

Luhn (1958) suggested that both extremely common and extremely uncommon words were not very useful for document representation and indexing



Vocabulary Growth: Heaps’ Law
How does the size of the overall vocabulary (number of
unique words) grow with the size of the corpus?
This determines how the size of the inverted index will scale with
the size of the corpus.
Heaps’ law: estimates the size of the vocabulary in a given corpus
For a text of n words, the vocabulary size grows as O(n^β), where β
is a constant, 0 < β < 1
If V is the size of the vocabulary and n is the length of the corpus
in words, Heaps’ law provides the following equation:

V = K · n^β

where the constants are typically K ≈ 10–100 and β ≈ 0.4–0.6 (approximately square-root growth)
Heaps’ distributions
• Distribution of the size of the vocabulary: there is a linear
relationship between vocabulary size and number of tokens
(on a log–log scale)

• Example: from 1,000,000,000 words, there may be
1,000,000 distinct words. Do you agree?
Example
We want to estimate the size of the vocabulary for a corpus of
1,000,000 words. However, we only know the statistics
computed on smaller corpus sizes:
For 100,000 words, there are 50,000 unique words
For 500,000 words, there are 150,000 unique words

Estimate the vocabulary size for the 1,000,000-word corpus.

How about for a corpus of 1,000,000,000 words?

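One way to work the exercise, assuming the corpus follows Heaps' law exactly: fit V = K·n^β to the two known points and extrapolate (the fitted K and β, and hence the estimates, are approximate):

```python
import math

# Known points: (100,000 words, 50,000 unique) and (500,000 words, 150,000 unique).
n1, v1 = 100_000, 50_000
n2, v2 = 500_000, 150_000

beta = math.log(v2 / v1) / math.log(n2 / n1)   # ~0.68
K = v1 / n1 ** beta                            # ~19

for n in (1_000_000, 1_000_000_000):
    print(n, round(K * n ** beta))             # roughly 241,000 and 27,000,000
```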


Relevance Measure
             retrieved                  not retrieved
irrelevant   irrelevant & retrieved     irrelevant & not retrieved
relevant     relevant & retrieved       relevant but not retrieved

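The four cells above give precision and recall directly; a minimal sketch in Python (the document identifiers are made up for illustration):

```python
def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved documents that are relevant.
    # Recall: fraction of relevant documents that were retrieved.
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall({"d1", "d2", "d3"}, {"d2", "d3", "d5", "d7"}))
# (0.666..., 0.5)
```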


Issues: recall and precision
Breaking up hyphenated terms increases recall
but decreases precision
Preserving case distinctions enhances precision
but decreases recall
Commercial information systems usually take a
recall-enhancing approach (numbers and words
containing digits are index terms, and all are
case insensitive)



Exercise: Tokenization
The cat slept peacefully in the living room. It’s a very old cat.

Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.



Index term selection
Index language is the language used to describe
documents and requests
Elements of the index language are index terms which
may be derived from the text of the document to be
described, or may be arrived at independently.
If a full-text representation of the text is adopted, then
all words in the text are used as index terms = full-text
indexing
Otherwise, we need to select the words to be used as index
terms to reduce the size of the index file, which is
basic to designing an efficient searching IR system



End of Chapter Two

