2 - Text Operation
Statistical Properties of Text
How is the frequency of different words distributed?
How fast does vocabulary size grow with the size of a
corpus?
Such factors affect the performance of an IR system and can be
used to select suitable term weights and other aspects of the
system.
A few words are very common.
2 most frequent words (e.g. “the”, “of”) can account for
about 10% of word occurrences.
Most words are very rare.
Half the words in a corpus appear only once; these are called
“hapax legomena” (Greek for “read only once”)
Called a “heavy tailed” distribution, since most of the
probability mass is in the “tail”
For all the words in a collection of documents, for each word w
• f : is the frequency that w appears
• r : is rank of w in order of frequency. (The most commonly
occurring word has rank 1, etc.)
[Figure: distribution of sorted word frequencies (frequency f vs. rank r), following Zipf's law]
Word distribution: Zipf's Law
Zipf's Law – named after the Harvard linguistics
professor George Kingsley Zipf (1902-1950) –
attempts to capture the distribution of the frequencies (i.e.,
number of occurrences) of the words within a text.
Zipf's Law states that when the distinct words in a
text are arranged in decreasing order of their
frequency of occurrence (most frequent words first),
the occurrence characteristics of the vocabulary can be
characterized by the constant rank-frequency law of
Zipf:
Frequency * Rank = constant
If the words, w, in a collection are ranked, r, by their
frequency, f, they roughly fit the relation: r*f=c
Different collections have different constants c.
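As an illustration, the rank-frequency relation r * f ≈ c can be checked on a small corpus. The text below is a toy example chosen for this sketch, not real collection data:

```python
from collections import Counter

# A toy corpus; a real experiment would use a large document collection.
text = (
    "the cat sat on the mat and the dog sat on the log "
    "the cat and the dog saw the bird on the mat"
)

# Count word frequencies and sort in decreasing order of frequency.
counts = Counter(text.split())
ranked = counts.most_common()

# Zipf's law predicts that rank * frequency is roughly constant.
for rank, (word, freq) in enumerate(ranked, start=1):
    print(f"{rank:>3}  {word:<6} f={freq:<3} r*f={rank * freq}")
```

On a corpus this small the product r * f fluctuates noticeably; the law only emerges clearly over large collections.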
Example: Zipf's Law
[Table: sample index terms ranked by frequency]
Lexical Analysis/Tokenization of Text
Convert the text of the documents into words to be
adopted as index terms
Objective - identify words in the text
Digits, hyphens, punctuation marks, case of letters
Numbers are usually not good index terms (like 1910, 1999); but
some, like 510 B.C., are unique and worth keeping
Hyphens – break up hyphenated words (e.g. state-of-the-art → state
of the art); but some terms, e.g. gilt-edged, B-49, are unique
words which require their hyphens
Punctuation marks – remove totally unless significant,
e.g. in program code, x.exe and xexe are different identifiers
Case of letters – usually not important; all letters can be
converted to upper or lower case
Tokenization
Analyze text into a sequence of discrete tokens (words).
Input: “Friends, Romans and Countrymen”
Output: Tokens (an instance of a sequence of characters that
are grouped together as a useful semantic unit for processing)
Friends
Romans
and
Countrymen
Each such token is now a candidate for an index entry,
after further processing
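A minimal tokenizer along these lines can be sketched as follows; the regex-based `tokenize` helper is an illustrative assumption, not a complete solution to the issues discussed next:

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Each resulting token is a candidate for an index entry.
print(tokenize("Friends, Romans and Countrymen"))
# -> ['friends', 'romans', 'and', 'countrymen']
```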
But what are valid tokens to use?
Issues in Tokenization
• One word or multiple: How do you decide it is one token
or two or more?
– Hewlett-Packard: one token, or Hewlett and Packard as two tokens?
• state-of-the-art: break up hyphenated sequence.
• San Francisco, Los Angeles
– lowercase, lower-case, lower case ?
• data base, database, data-base
• Numbers:
• dates (3/12/91 vs. Mar. 12, 1991);
• phone numbers,
• IP addresses (100.2.86.144)
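One way to handle the number cases above is to recognize special patterns before ordinary word splitting, so that dates and IP addresses survive as single tokens. The patterns below are illustrative assumptions, not a standard:

```python
import re

# Hypothetical patterns: a slash-style date (3/12/91) or a dotted IP
# address (100.2.86.144). Real tokenizers use far more careful rules.
SPECIAL = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}|\d{1,3}(?:\.\d{1,3}){3}")

def tokenize_with_specials(text):
    tokens = []
    pos = 0
    for m in SPECIAL.finditer(text):
        # Ordinary words before the special token.
        tokens.extend(re.findall(r"\w+", text[pos:m.start()]))
        tokens.append(m.group())   # keep the whole date / IP intact
        pos = m.end()
    tokens.extend(re.findall(r"\w+", text[pos:]))
    return tokens

print(tokenize_with_specials("Logged from 100.2.86.144 on 3/12/91"))
```

Without the special-pattern pass, a naive word splitter would break the IP address into four separate number tokens.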
Issues in Tokenization
• How to handle special cases involving apostrophes,
hyphens etc? C++, C#, URLs, emails, …
– Sometimes punctuation (e-mail), numbers (1999), and case
(Republican vs. republican) can be a meaningful part of a
token.
– However, frequently they are not.
• Solution:
– lowercase only words at the beginning of the sentence
• In IR, lowercasing is most practical because of the way
users issue their queries
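As a minimal sketch of this lowercasing practice (the `normalize` helper name is hypothetical):

```python
def normalize(token):
    # Case-fold every token: users rarely capitalize query terms,
    # so mapping "Republican" and "republican" to one form helps matching,
    # at the cost of losing the case distinction.
    return token.lower()

print(normalize("Republican") == normalize("republican"))  # True
```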
Stemming/Morphological analysis
• Stemming reduces tokens to their “root” form so that
morphological variants of a word are recognized as the same term.
– The process involves removal of affixes (i.e. prefixes and suffixes) with the aim of
reducing variants to the same stem
– Often removes inflectional and derivational morphology of a word
• Inflectional morphology: vary the form of words in order to express grammatical
features, such as singular/plural or past/present tense. E.g. Boy → boys, cut →
cutting.
• Derivational morphology: makes new words from old ones. E.g. creation is formed
from create, but they are two separate words; similarly, destruction → destroy
• Stemming is language dependent
– Correct stemming is language specific and can be complex. In
practice, some mistakes are tolerable, and the rule-based approach
is the one that is generally chosen.
Porter Stemmer
• Stemming is the operation of stripping the suffixes
from a word, leaving its stem.
– Google, for instance, uses stemming to search for web
pages containing the words connected, connecting,
connection and connections when users ask for a web
page that contains the word connect.
• In 1979, Martin Porter developed a stemming
algorithm that uses a set of rules to extract stems
from words, and though it makes some mistakes,
most common words seem to work out right.
– Porter describes his algorithm and provides a reference
implementation in C at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer
• Most common algorithm for stemming English
words to their common grammatical root
• It is a simple procedure for removing known affixes in
English without using a dictionary. To get rid of
plurals, the following rules are used:
– SSES → SS      caresses → caress
– IES → I        ponies → poni
– SS → SS        caress → caress
– S → (null)     cats → cat
– EMENT → (null) (delete final -ement if what remains is longer
than 1 character)
replacement → replac
cement → cement
While step 1a gets rid of plurals, step 1b removes -ed or
-ing.
e.g.
;; agreed -> agree ;; disabled -> disable
;; matting -> mat ;; mating -> mate
;; meeting -> meet ;; milling -> mill
;; messing -> mess ;; meetings -> meet
;; feed -> feed
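The step-1a plural rules listed above can be sketched in Python. This is a simplified illustration of that one step, not the full Porter algorithm:

```python
def porter_step_1a(word):
    """Step 1a of the Porter stemmer: strip plural suffixes.
    Rules are tried in order; only the first matching rule fires."""
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS  (caresses -> caress)
    if word.endswith("ies"):
        return word[:-3] + "i"    # IES  -> I   (ponies -> poni)
    if word.endswith("ss"):
        return word               # SS   -> SS  (caress -> caress)
    if word.endswith("s"):
        return word[:-1]          # S    ->     (cats -> cat)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

Note the rule ordering matters: testing for a bare trailing "s" first would wrongly turn "caress" into "cares".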
Stemming: challenges
May produce unusual stems that are not English words:
removing “UAL” from FACTUAL gives FACT, but from EQUAL gives EQ