
Chapter Two

Text Operations
FARIS A. APRIL 2022
Statistical Properties of Text

 How is the frequency of different words distributed?
 How fast does vocabulary size grow with the size of a corpus?
 Such factors affect the performance of an IR system & can be used to select suitable term weights & other aspects of the system.

 A few words are very common.
  The 2 most frequent words (e.g. “the”, “of”) can account for about 10% of word occurrences.
 Most words are very rare.
  Half the words in a corpus appear only once, called hapax legomena (“read only once”)
 Called a “heavy-tailed” distribution, since most of the probability mass is in the “tail”
Sample Word Frequency Data
Zipf’s distributions
Rank-Frequency Distribution
For all the words in a collection of documents, for each word w
• f : the frequency with which w appears
• r : the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Figure: distribution of sorted word frequencies according to Zipf’s law; the word w with rank r has frequency f (frequency f on the vertical axis, rank r on the horizontal axis)]
Word distribution: Zipf's Law

 Zipf's Law - named after the Harvard linguistics professor George Kingsley Zipf (1902-1950) - attempts to capture the distribution of the frequencies (i.e., numbers of occurrences) of the words within a text.

 Zipf's Law states that when the distinct words in a text are arranged in decreasing order of their frequency of occurrence (most frequent words first), the occurrence characteristics of the vocabulary can be characterized by the constant rank-frequency law of Zipf:
   Frequency * Rank = constant
 that is, if the words, w, in a collection are ranked, r, by their frequency, f, they roughly fit the relation: r * f = c
 Different collections have different constants c.
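
As a rough, do-it-yourself check of the r * f = c relation, the following Python sketch counts word frequencies in a plain-text corpus and prints the rank-frequency product for the top terms; the file name is only a placeholder and the tokenization is deliberately simple.

```python
from collections import Counter
import re

def zipf_table(text, top=10):
    """Rank words by frequency and print the rank-frequency product for each.

    If Zipf's law (r * f = c) holds, the products stay roughly constant.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        print(f"{rank:>4}  {word:<12} f={freq:<8} r*f={rank * freq}")

# Usage (any plain-text file will do; "corpus.txt" is only a placeholder name):
# with open("corpus.txt", encoding="utf-8") as fh:
#     zipf_table(fh.read())
```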
Example: Zipf's Law

 The table shows the most frequently occurring words from a 336,310-document collection containing 125,720,891 total words, of which 508,209 are unique words
Zipf’s law: modeling word
distribution
 The collection frequency of the i-th most common term is proportional to 1/i:

   f_i ∝ 1/i

 If the most frequent term occurs f_1 times, then the second most frequent term has half as many occurrences, the third most frequent term has a third as many, etc.

 Zipf's Law states that the frequency of the i-th most frequent word is 1/i^θ times that of the most frequent word, for some θ >= 1
 The occurrence probability of some event (P_i), as a function of the rank i when the rank is determined by the frequency of occurrence, is a power-law function P_i ~ 1/i^θ with the exponent θ close to 1.
 More Example: Illustration of Zipf's Rank-Frequency Law.
 Let the total number of word occurrences in the sample be N = 1,000,000

   Rank (R)   Term   Frequency (F)   R*(F/N)
   1          the    69,971          0.070
   2          of     36,411          0.073
   3          and    28,852          0.086
   4          to     26,149          0.104
   5          a      23,237          0.116
   6          in     21,341          0.128
   7          that   10,595          0.074
   8          is     10,099          0.081
   9          was     9,816          0.088
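
The last column can be recomputed directly from the table (values taken from above; small differences against the printed figures are only rounding):

```python
N = 1_000_000
rows = [(1, "the", 69_971), (2, "of", 36_411), (3, "and", 28_852),
        (4, "to", 26_149), (5, "a", 23_237), (6, "in", 21_341),
        (7, "that", 10_595), (8, "is", 10_099), (9, "was", 9_816)]

for rank, term, freq in rows:
    # Under Zipf's law, rank * (freq / N) should stay roughly constant across ranks.
    print(f"{rank}  {term:<5} {rank * freq / N:.3f}")
```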
Methods that Build on Zipf's Law

• Stop lists: Ignore the most frequent words (upper cut-off). Used by almost all systems.
• Significant words: Take words in between the most frequent (upper cut-off) and least frequent words (lower cut-off).
• Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
Explanations for Zipf’s Law

 The law has been explained by the “principle of least effort”, which makes it easier for a speaker or writer of a language to repeat certain words instead of coining new and different words.
 Zipf's explanation was his “principle of least effort”, which balances the speaker's desire for a small vocabulary against the hearer's desire for a large one.

Zipf’s Law Impact on IR


 Good News: Stopwords will account for a large fraction of text so eliminating
them greatly reduces inverted-index storage costs.
 Bad News: For most words, gathering sufficient data for meaningful statistical
analysis (e.g. for correlation analysis for query expansion) is difficult since they are
extremely rare.
Word significance: Luhn’s Ideas

 Luhn's idea (1958): the frequency of word occurrence in a text furnishes a useful measurement of word significance.
 Luhn suggested that both extremely common and extremely uncommon words were not very useful for indexing.
 For this, Luhn specifies two cut-off points: an upper and a lower cut-off, based on which non-significant words are excluded
  The words exceeding the upper cut-off are considered to be common
  The words below the lower cut-off are considered to be rare
  Hence they do not contribute significantly to the content of the text
  The ability of words to discriminate content reaches a peak at a rank-order position halfway between the two cut-offs
 Let f be the frequency of occurrence of words in a text, and r their rank in decreasing order of word frequency; then a plot relating f & r yields the following curve
Luhn’s Ideas

Luhn (1958) suggested that both extremely common and extremely uncommon words
were not very useful for document representation & indexing.
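
A minimal sketch of Luhn-style significant-word selection follows; the cut-off values are illustrative assumptions for demonstration, not figures from Luhn's paper, and in practice they would be tuned (or read off a rank-frequency plot) per collection.

```python
from collections import Counter

def significant_words(tokens, lower_cutoff=2, upper_fraction=0.01):
    """Keep words whose frequency lies between a lower and an upper cut-off.

    lower_cutoff and upper_fraction are illustrative values only; words above
    the upper cut-off are treated as too common, words below the lower cut-off
    as too rare to carry content.
    """
    counts = Counter(tokens)
    upper_cutoff = upper_fraction * len(tokens)   # words above this are "too common"
    return {w for w, f in counts.items() if lower_cutoff <= f <= upper_cutoff}
```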
Vocabulary size : Heaps’ Law

 Dictionaries contain 600,000+ words
  But dictionary entries do not include all names of people, locations, products, etc.
 Heaps' law estimates the vocabulary size (number of distinct words) in a given corpus

Vocabulary Growth: Heaps’ Law

 How does the size of the overall vocabulary (number of


unique words) grow with the size of the corpus?
 This determines how the size of the inverted index will scale with the size of the corpus.

 Heaps' law estimates the vocabulary size (number of distinct words) in a given corpus
  The vocabulary size grows as O(n^β), where β is a constant between 0 and 1.
  If V is the size of the vocabulary and n is the length of the corpus in words, Heaps' law provides the following equation:

    V = K * n^β

  where the constants are typically:
    K ≈ 10–100
    β ≈ 0.4–0.6 (approx. square-root)
Heaps' distributions

• Distribution of the size of the vocabulary: there is a linear relationship between vocabulary size and number of tokens when both are plotted on logarithmic scales
Example: from 1,000,000,000 documents, there may be 1,000,000 distinct words. Can you agree?
Example

 We want to estimate the size of the vocabulary for a corpus of 1,000,000 words. However, we only know statistics computed on smaller corpora:
  For 100,000 words, there are 50,000 unique words
  For 500,000 words, there are 150,000 unique words

 Estimate the vocabulary size for the 1,000,000-word corpus (see the sketch below)
 How about for a corpus of 1,000,000,000 words?
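
A minimal sketch of how the two data points above pin down K and β in V = K·n^β; the 10^9-word figure is an extrapolation and should be read as an order-of-magnitude estimate.

```python
import math

# Known points from the example: (corpus size n, vocabulary size V)
n1, V1 = 100_000, 50_000
n2, V2 = 500_000, 150_000

# V = K * n**beta  =>  V2/V1 = (n2/n1)**beta  =>  beta = log(V2/V1) / log(n2/n1)
beta = math.log(V2 / V1) / math.log(n2 / n1)   # ≈ 0.68
K = V1 / n1 ** beta                            # ≈ 19

for n in (1_000_000, 1_000_000_000):
    print(f"n = {n:>13,}   estimated V ≈ {K * n ** beta:,.0f}")
# Roughly 240,000 distinct words for 10**6 tokens and about 27 million for 10**9 tokens.
```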
Text Operations

 Not all words in a document are equally significant for representing the contents/meanings of the document
  Some words carry more meaning than others
  Noun words are the most representative of a document's content
 Therefore, one needs to preprocess the text of a document in a collection to select the words to be used as index terms
  Using the set of all words in a collection to index documents creates too much noise for the retrieval task
  Reducing noise means reducing the number of words which can be used to refer to the document
 Preprocessing is the process of controlling the size of the vocabulary, or the number of distinct words used as index terms
  Preprocessing will lead to an improvement in information retrieval performance
 However, some search engines on the Web omit preprocessing
  Every word in the document is an index term
Text Operations

 Text operations are the process of transforming text into logical representations
 The main operations for selecting index terms, i.e. for choosing the words/stems (or groups of words) to be used as indexing terms, are:
  Lexical analysis/tokenization of the text - handling digits, hyphens, punctuation marks, and the case of letters
  Elimination of stop words - filtering out words which are not useful in the retrieval process
  Stemming words - removing affixes (prefixes and suffixes)
  Construction of term categorization structures such as a thesaurus, to capture relationships that allow the expansion of the original query with related terms
Generating Document Representatives

 Text Processing System
  Input text – full text, abstract, or title
  Output – a document representative adequate for use in an automatic retrieval system
 The document representative consists of a list of class names, each name representing a class of words occurring in the total input text. A document will be indexed by a name if one of its significant words occurs as a member of that class.

 documents → tokenization → stop word removal → stemming → thesaurus → index terms
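
A minimal sketch of this pipeline in Python; the stop word list, suffix rules, and thesaurus entries are tiny illustrative stand-ins rather than a real implementation.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "to", "were"}  # illustrative subset
SUFFIXES = ("ing", "ed", "s")                                           # crude stand-in for a real stemmer
THESAURUS = {"automobile": "car"}                                       # maps variants to a preferred class name

def document_representative(text):
    """Tokenize, remove stop words, stem (naively), and map terms through a thesaurus."""
    tokens = re.findall(r"[a-z]+", text.lower())           # tokenization + case folding
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop word elimination
    classes = []
    for t in tokens:
        for suffix in SUFFIXES:                            # naive suffix stripping
            if t.endswith(suffix) and len(t) - len(suffix) > 2:
                t = t[: -len(suffix)]
                break
        classes.append(THESAURUS.get(t, t))                # thesaurus normalization
    return classes

print(document_representative("The connected automobiles were parked in the garage"))
# -> ['connect', 'car', 'park', 'garage']
```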
Lexical Analysis/Tokenization of Text

 Change the text of the documents into the words to be adopted as index terms
 Objective - identify words in the text
 Issues: digits, hyphens, punctuation marks, case of letters
  Numbers are not good index terms (like 1910, 1999); but some, e.g. 510 B.C., are unique and worth keeping
  Hyphens – break up hyphenated words (e.g. state-of-the-art = state of the art); but some words, e.g. gilt-edged, B-49, are unique terms which require their hyphens
  Punctuation marks – remove totally unless significant, e.g. in program code: x.exe vs. xexe
  Case of letters – usually not important; can convert all to upper or lower case
Tokenization

 Analyze text into a sequence of discrete tokens (words).
  Input: “Friends, Romans and Countrymen”
  Output: Tokens (an instance of a sequence of characters that are grouped together as a useful semantic unit for processing)
   Friends
   Romans
   and
   Countrymen
 Each such token is now a candidate for an index entry, after further processing
  But what are valid tokens to omit?
Issues in Tokenization

 One word or multiple: How do you decide whether it is one token, or two or more?
 Hewlett-Packard  Hewlett and Packard as two tokens?
 state-of-the-art: break up hyphenated sequence.
 San Francisco, Los Angeles
 Addis Ababa, Arba Minch

 lowercase, lower-case, lower case ?


 data base, database, data-base

• Numbers:
 dates (3/12/91 vs. Mar. 12, 1991);
 phone numbers,
 IP addresses (100.2.86.144)
Issues in Tokenization

 How to handle special cases involving apostrophes, hyphens, etc.? C++, C#, URLs, emails, …
  Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
  However, frequently they are not.
 The simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
  Generally, don't index numbers as text, but they are often very useful. Systems will often index “meta-data”, including creation date, format, etc., separately
 Issues of tokenization are language-specific
  They thus require the language to be known
Exercise: Tokenization

 The cat slept peacefully in the living room. It’s a very old cat.

 Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t
amusing.
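
One possible (deliberately simple) answer to the exercise, using a regular expression that keeps internal apostrophes; this sketch encodes just one tokenization policy, not the only defensible one.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, keeping internal apostrophes (aren't, O'Neill)."""
    # Note: typographic apostrophes (’) would need to be normalized to ' first.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

print(tokenize("The cat slept peacefully in the living room. It's a very old cat."))
print(tokenize("Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."))
```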
Elimination of STOPWORDS

 Stop words are extremely common words across document collections that have no discriminatory power
  They may occur in 80% of the documents in a collection.
  They appear to be of little value in helping select documents matching a user need, and need to be filtered out from potential index terms
 Examples of stop words are articles, pronouns, prepositions, conjunctions, etc.:
  articles (a, an, the); pronouns (I, he, she, it, their, his)
  some prepositions (on, of, in, about, besides, against, over),
  conjunctions/connectors (and, but, for, nor, or, so, yet),
  verbs (is, are, was, were),
  adverbs (here, there, out, because, soon, after) and
  adjectives (all, any, each, every, few, many, some)
 can also be treated as stop words
 Stop words are language-dependent.
Stopwords
Intuition:
 Stop words have little semantic content; it is typical to remove such high-frequency words
 Stop words take up 50% of the text. Hence, eliminating them reduces document size by 30-50%
 Smaller indices for information retrieval
  Good compression techniques for indices: the 30 most common words account for 30% of the tokens in written text
 Better approximation of importance for classification, summarization, etc.
How to determine a list of stopwords?

 One method: Sort terms (in decreasing order) by collection frequency and take the most frequent ones (see the sketch below)
  Problem: In a collection about insurance practices, “insurance” would become a stop word
 Another method: Build a stop word list that contains a set of articles, pronouns, etc.
 Why do we need stop lists: with a stop list, we can compare and exclude from the index terms entirely the commonest words.
 With the removal of stop words, we get a better approximation of importance for classification, summarization, etc.
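
A minimal sketch of the first method, assuming the collection is given as a list of token lists; taking the top 30 terms is an illustrative choice, not a fixed rule.

```python
from collections import Counter

def frequency_stoplist(documents, top_n=30):
    """Build a stop word list from the top_n most frequent terms in the collection.

    `documents` is a list of token lists. Beware of the problem noted above:
    domain terms such as "insurance" may end up on the list.
    """
    counts = Counter(token for doc in documents for token in doc)
    return {term for term, _ in counts.most_common(top_n)}
```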
Stop words

 Stop word elimination used to be standard in older IR systems.
 But the trend is moving away from doing this. Most web search engines index stop words:
  Good query optimization techniques mean you pay little at query time for including stop words.
  You need stop words for:
   Phrase queries: “King of Denmark”
   Various song titles, etc.: “Let it be”, “To be or not to be”
   “Relational” queries: “flights to London”
  Elimination of stop words might reduce recall (e.g. for “To be or not to be”, all terms are eliminated except “be”, leading to no or irrelevant retrieval)
Normalization

• Normalization is canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens. It is, in a way, standardization of text
 We need to “normalize” terms in the indexed text as well as query terms into the same form
  Example: We want to match U.S.A. and USA, by deleting periods in a term
 Case folding: it is often best to lowercase everything, since users will use lowercase regardless of ‘correct’ capitalization…
  Republican vs. republican
  Fasil vs. fasil vs. FASIL
  Anti-discriminatory vs. antidiscriminatory
  Car vs. automobile?
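
A minimal sketch of these two normalization steps (period/hyphen deletion and case folding); the rules cover only the examples above and are not a complete normalization policy.

```python
def normalize(token):
    """Delete periods (U.S.A. -> USA), drop hyphens, and case-fold to lowercase."""
    return token.replace(".", "").replace("-", "").lower()

for t in ("U.S.A.", "USA", "Republican", "Anti-discriminatory", "antidiscriminatory", "FASIL"):
    print(t, "->", normalize(t))
# U.S.A. and USA both become "usa"; Anti-discriminatory and antidiscriminatory
# both become "antidiscriminatory".
```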
Normalization issues

 Good for
 Allow instances of Automobile at the beginning of a sentence to match with a query
of automobile
 Helps a search engine when most users type ferrari when they are interested in a
Ferrari car

 Bad for
 Proper names vs. common nouns
 E.g. General Motors, Associated Press, …

 Solution:
 lowercase only words at the beginning of the sentence

 In IR, lowercasing is most practical because of the


way users issue their queries
Stemming/Morphological analysis

 Stemming reduces tokens to their “root” form to recognize morphological variation.
  The process involves removal of affixes (i.e. prefixes and suffixes) with the aim of reducing variants to the same stem
  It often removes the inflectional and derivational morphology of a word
   Inflectional morphology: varies the form of words in order to express grammatical features, such as singular/plural or past/present tense. E.g. boy → boys, cut → cutting.
   Derivational morphology: makes new words from old ones. E.g. creation is formed from create, but they are two separate words. And also, destruction → destroy
 Stemming is language-dependent
  Correct stemming is language-specific and can be complex.

 For example, the phrase “compressed and compression are both accepted” stems to “compress and compress are both accept”.
Stemming

 The final output from a conflation (reducing words to the same token) algorithm is a set of classes, one for each stem detected.
  A stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes and/or suffixes).
  Example: ‘connect’ is the stem for {connected, connecting, connection, connections}
  Thus, [automate, automatic, automation] all reduce to  automat
 A class name is assigned to a document if and only if one of its members occurs as a significant word in the text of the document.
  A document representative then becomes a list of class names, which are often referred to as the document's index terms/keywords.
 Queries: queries are handled in the same way.
Ways to implement stemming

There are basically two ways to implement stemming.


 The first approach is to create a big dictionary that maps words to their stems.
 The advantage of this approach is that it works perfectly (insofar
as the stem of a word can be defined perfectly); the
disadvantages are the space required by the dictionary and the
investment required to maintain the dictionary as new words
appear.
 The second approach is to use a set of rules that extract stems from words.
 The advantages of this approach are that the code is typically
small, and it can gracefully handle new words; the disadvantage
is that it occasionally makes mistakes.
 But, since stemming is imperfectly defined anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
Porter Stemmer

 Stemming is the operation of stripping the suffixes from a word, leaving its stem.
 Google, for instance, uses stemming to search for web pages containing the
words connected, connecting, connection and connections when users ask for
a web page that contains the word connect.

In 1979, Martin Porter developed a stemming


algorithm that uses a set of rules to extract
stems from words, and though it makes some
mistakes, most common words seem to work
out right.
 Porter describes his algorithm and provides a reference implementation in C at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer

 It is the most common algorithm for stemming English words to their common grammatical root
 It uses a simple procedure for removing known affixes in English without using a dictionary. To get rid of plurals, the following rules are used:
  SSES  SS       caresses  caress
  IES   I        ponies  poni
  SS    SS       caress  caress
  S     (nil)    cats  cat
  EMENT  (nil)   (delete the final “ement” if what remains is longer than 1 character)
                  replacement  replac
                  cement  cement
 While step 1a gets rid of plurals, step 1b removes -ed or -ing.
  e.g.
   agreed -> agree        disabled -> disable
   matting -> mat         mating -> mate
   meeting -> meet        milling -> mill
   messing -> mess        meetings -> meet
   feed -> feed
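
A minimal sketch of the rules listed above, i.e. a rough approximation of step 1a and part of step 1b, not the full Porter algorithm; the reference implementation linked earlier should be used for real work.

```python
def strip_plurals(word):
    """Rough version of step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (nil)."""
    if word.endswith("sses"):
        return word[:-2]              # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]              # ponies -> poni
    if word.endswith("ss"):
        return word                   # caress -> caress
    if word.endswith("s"):
        return word[:-1]              # cats -> cat
    return word

def strip_ed_ing(word):
    """Rough version of part of step 1b: handle -eed, -ed and -ing."""
    if word.endswith("eed"):
        stem = word[:-3]
        # agreed -> agree, but feed -> feed (the remaining stem "f" has no vowel)
        return word[:-1] if any(v in stem for v in "aeiou") else word
    for suffix in ("ing", "ed"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if any(v in stem for v in "aeiou"):
                return stem           # matting -> matt (full Porter then trims this to mat)
            return word
    return word

for w in ("caresses", "ponies", "cats", "meetings", "agreed", "feed"):
    print(w, "->", strip_ed_ing(strip_plurals(w)))
```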
Stemming: challenges

 May produce unusual stems that are not English words:
  Removing ‘UAL’ from FACTUAL and EQUAL
 May conflate (reduce to the same token) words that are actually distinct.
  “computer”, “computational”, “computation” are all reduced to the same token “comput”
 May not recognize all morphological derivations.
Thesauri

 Mostly, full-text searching cannot be accurate, since different authors may select different words to represent the same concept
  Problem: The same meaning can be expressed using different terms that are synonyms (a word or phrase which has the same or nearly the same meaning as another word or phrase in the same language), homonyms (a word that sounds the same or is spelled the same as another word but has a different meaning), and related terms
  How can it be ensured that, for the same meaning, identical terms are used in the index and the query?
Thesauri

 Thesaurus: The vocabulary of a controlled


indexing language, formally organized so
that a priori relationships between concepts
(for example as "broader" and “related") are
made explicit.
 A thesaurus contains terms and relationships
between terms
 IR thesauri rely typically upon the use of symbols such as USE/UF (UF=used
for), BT(Broader Term), and RT(Related term) to demonstrate inter-term
relationships.
 e.g., car = automobile, truck, bus, taxi, motor vehicle
-color = colour, paint
Aim of Thesaurus

 A thesaurus tries to control the use of the vocabulary by showing a set of related words to handle synonyms and homonyms
 The aim of a thesaurus is therefore:
  to provide a standard vocabulary for indexing and searching
   The thesaurus rewrites terms to form equivalence classes, and we index such equivalence classes
   When the document contains automobile, index it under car as well (usually, also vice-versa)
  to assist users with locating terms for proper query formulation: when the query contains automobile, look under car as well when expanding the query
  to provide classified hierarchies that allow the broadening and narrowing of the current request according to user needs
Thesaurus Construction

Example: thesaurus built to assist IR for searching cars and vehicles :


Term: Motor vehicles
UF : Automobiles
Cars
Trucks
BT: Vehicles
RT: Road Engineering
Road Transport
More Example

Example: thesaurus built to assist IR in the fields of


computer science:
TERM: natural languages
 UF natural language processing (UF=used for NLP)
 BT languages (BT=broader term is languages)
 NT languages (NT= Narrower related term)
 TT languages (TT = top term is languages)
 RT artificial intelligence (RT=related term/s)
computational linguistics
formal languages
query languages
speech recognition
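
A minimal sketch of how such UF/RT relationships can drive query expansion; the dictionary below just encodes the two small examples from these slides and is purely illustrative.

```python
# Toy thesaurus: each preferred term maps to its UF (used-for), BT and RT entries.
THESAURUS = {
    "motor vehicles": {"UF": ["automobiles", "cars", "trucks"],
                       "BT": ["vehicles"],
                       "RT": ["road engineering", "road transport"]},
    "natural languages": {"UF": ["natural language processing"],
                          "BT": ["languages"],
                          "RT": ["artificial intelligence", "computational linguistics",
                                 "formal languages", "query languages", "speech recognition"]},
}

def expand_query(terms, include_related=False):
    """Add UF variants (and optionally RT terms) for every preferred term in the query."""
    expanded = set(terms)
    for term in terms:
        entry = THESAURUS.get(term, {})
        expanded.update(entry.get("UF", []))
        if include_related:
            expanded.update(entry.get("RT", []))
    return expanded

print(expand_query({"motor vehicles", "engine"}))
# -> {'motor vehicles', 'engine', 'automobiles', 'cars', 'trucks'}
```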
Language-specificity

 Many of the above features embody


transformations that are
 Language-specific and
 Often, application-specific
 These are “plug-in” addenda to the indexing
process
 Both open source and commercial plug-ins are
available for handling these
INDEX TERM SELECTION

 An index language is the language used to describe documents and requests
 Elements of the index language are index terms, which may be derived from the text of the document to be described, or may be arrived at independently.
  If a full-text representation of the text is adopted, then all words in the text are used as index terms = full-text indexing
  Otherwise, one needs to select the words to be used as index terms so as to reduce the size of the index file, which is basic to designing an efficient searching IR system
End of Chapter Two
Thank you
FARIS A. APRIL 2022
