Chapter 2: Text Operations
FARIS A. APRIL 2022
Statistical Properties of Text
Word distribution: Zipf's Law
Luhn (1958) suggested that both extremely common and extremely uncommon words
were not very useful for document representation & indexing.
Vocabulary size: Heaps’ Law
Dictionaries
600,000+ words
Heaps’ Law: V = K · n^β, where V is the vocabulary size and n the number of tokens, with K ≈ 10–100 and β ≈ 0.4–0.6 (approximately square-root growth).
Heaps’ Law distribution
• Distribution of the vocabulary size: on a log–log scale there is a linear relationship between vocabulary size and the number of tokens (in raw counts the vocabulary grows sub-linearly, roughly as a square root).
Example: from 1,000,000,000 documents, there
may be 1,000,000 distinct words. Can you agree?
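As a rough sanity check of that figure, here is a minimal sketch that plugs illustrative Heaps’ Law parameters into V = K · n^β. It assumes the 1,000,000,000 figure refers to the number of tokens in the collection, and K = 30 and β = 0.5 are assumed values for illustration, not figures from this chapter:

    # Heaps' Law sketch: V = K * n**beta
    # K and beta are assumed, illustrative values.
    K, beta = 30, 0.5
    n = 1_000_000_000                 # number of tokens in the collection (assumed reading)
    V = K * n ** beta                 # predicted vocabulary size
    print(f"predicted vocabulary size: {V:,.0f}")   # ~948,683 distinct words

With these assumed constants the prediction is close to the 1,000,000 distinct words quoted above.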
Example: index terms
Lexical Analysis/Tokenization of Text
Convert the text of the documents into words that can be adopted as index terms.
Objective - identify words in the text
Issues to resolve: digits, hyphens, punctuation marks, and the case of letters
Numbers alone are usually not good index terms (e.g. 1910, 1999); but a qualified number such as 510 B.C. can be unique and useful
Hyphens – break up hyphenated words (e.g. state-of-the-art → state of the art); but some terms, e.g. gilt-edged, B-49, are unique words that require their hyphens
Punctuation marks – remove them entirely unless they are significant, e.g. in program code the file name x.exe would otherwise become xexe
Case of letters – usually not important; all text can be converted to upper or lower case
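A minimal sketch of these lexical-analysis decisions is shown below; the regular expression and the choice to always split hyphens and keep periods are illustrative assumptions, not a prescribed algorithm:

    import re

    def lexical_analysis(text):
        # Case of letters: fold everything to lower case
        text = text.lower()
        # Hyphens: break up hyphenated words (state-of-the-art -> state of the art)
        text = text.replace("-", " ")
        # Punctuation: drop it, except periods, so file names like x.exe survive
        text = re.sub(r"[^\w\s.]", " ", text)
        # Split on whitespace and drop bare numbers (1910, 1999) as index terms
        return [t for t in text.split() if not t.isdigit()]

    print(lexical_analysis("State-of-the-art programs from 1999, like x.exe!"))
    # -> ['state', 'of', 'the', 'art', 'programs', 'from', 'like', 'x.exe']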
Tokenization
One word or multiple words: how do you decide whether something is one token or two or more?
Hewlett-Packard: one token, or Hewlett and Packard as two tokens?
state-of-the-art: break up hyphenated sequence.
San Francisco, Los Angeles
Addis Ababa, Arba Minch
• Numbers:
dates (3/12/91 vs. Mar. 12, 1991);
phone numbers,
IP addresses (100.2.86.144)
Issues in Tokenization
The cat slept peacefully in the living room. It’s a very old cat.
Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t
amusing.
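Even these two example sentences are non-trivial to tokenize; the sketch below contrasts a naive whitespace split with a simple regex tokenizer (the regular expression is just one illustrative choice):

    import re

    sentence = "Mr. O'Neill thinks that the boys' stories about Chile's capital aren't amusing."

    # Naive whitespace split: trailing punctuation stays attached ('Mr.', 'amusing.')
    print(sentence.split())

    # Simple regex tokenizer: keeps internal apostrophes (O'Neill, aren't),
    # but drops the period of "Mr." and the possessive apostrophe of "boys'"
    print(re.findall(r"\w+(?:'\w+)*", sentence))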
Elimination of Stopwords
Stop words are extremely common words across document collections that
have no discriminatory power
They may occur in 80% of the documents in a collection.
They would appear to be of little value in helping select documents matching a user need, and so they need to be filtered out from the potential index terms
Examples of stop words are articles, pronouns, prepositions, conjunctions, etc.:
articles (a, an, the); pronouns: (I, he, she, it, their, his)
Some prepositions (on, of, in, about, besides, against, over),
conjunctions/ connectors (and, but, for, nor, or, so, yet),
verbs (is, are, was, were),
adverbs (here, there, out, because, soon, after) and
adjectives (all, any, each, every, few, many, some)
can also be treated as stop words
Stop words are language dependent.
Stopwords
Intuition: stop words have little semantic content; it is typical to remove such high-frequency words
Stopwords take up as much as 50% of the text; hence, removing them reduces document size by 30–50%
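A minimal sketch of stopword removal; the stoplist here is a tiny assumed sample, while real systems use lists of a few hundred words:

    # Tiny illustrative stoplist (assumed sample, not a standard list)
    STOPWORDS = {"a", "an", "the", "is", "are", "was", "were",
                 "and", "but", "or", "of", "in", "on", "about", "it"}

    def remove_stopwords(tokens):
        return [t for t in tokens if t.lower() not in STOPWORDS]

    print(remove_stopwords("The cat slept peacefully in the living room".split()))
    # -> ['cat', 'slept', 'peacefully', 'living', 'room']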
Case folding: converting all letters to lower case
Good for:
Allowing instances of Automobile at the beginning of a sentence to match a query for automobile
Helping a search engine when most users type ferrari while they are interested in a Ferrari car
Bad for
Proper names vs. common nouns
E.g. General Motors, Associated Press, …
Solution:
lowercase only words at the beginning of the sentence
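A sketch of that heuristic: lowercase only the word that starts a sentence and leave mid-sentence capitals (likely proper names) untouched. The sentence-splitting rule below is a deliberate simplification:

    import re

    def selective_lowercase(text):
        # Naive sentence split on ., ! or ? followed by whitespace (assumed rule)
        sentences = re.split(r"(?<=[.!?])\s+", text)
        out = []
        for s in sentences:
            words = s.split()
            if words:
                words[0] = words[0].lower()   # only the sentence-initial word
            out.append(" ".join(words))
        return " ".join(out)

    print(selective_lowercase("Automobile sales rose. The Associated Press reported it."))
    # -> 'automobile sales rose. the Associated Press reported it.'

Note that Associated Press keeps its capitals, so the proper name is preserved.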
Stemming/Conflation
The final output from a conflation algorithm (one that reduces word variants to the same token) is a set of classes, one for each stem detected.
A Stem: the portion of a word which is left after the removal of its affixes (i.e., prefixes
and/or suffixes).
Example: ‘connect’ is the stem for {connected, connecting, connection, connections}
Thus, [automate, automatic, automation] all reduce to automat
A class name is assigned to a document if and only if one of its members
occurs as a significant word in the text of the document.
A document representative then becomes a list of class names, which are often referred to as the document's index terms/keywords.
Queries are handled in the same way.
Ways to implement stemming
e.g. agreed -> agree; disabled -> disable
May conflate (reduce to the same token) words that are actually distinct: “computer”, “computational”, and “computation” are all reduced to the same token “comput”.
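A very small suffix-stripping sketch in the spirit of the examples above; the suffix list and its ordering are assumptions, and a real system would use a full algorithm such as Porter's:

    # Ordered, illustrative suffix list: longer suffixes are tried first.
    SUFFIXES = ["ations", "ation", "ions", "ion", "ing", "ed", "er", "s"]

    def stem(word):
        for suffix in SUFFIXES:
            # Keep at least a 3-letter stem so short words are left alone
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    for w in ["connected", "connecting", "connection", "connections", "computer"]:
        print(w, "->", stem(w))
    # all of the 'connect' forms -> 'connect'; 'computer' -> 'comput'

    # Note: bare stripping cannot handle forms like agreed -> agree or
    # disabled -> disable, which need rewrite rules as in Porter's algorithm.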
Index language is the language used to describe documents and requests.