Chapter 4 - Processing Text
⇒ Search engines are largely about deciphering the meaning of text from counts of word occurrences
and co-occurrences (the number of times groups of words occur together in documents)
● Knowing (1) how fast the vocabulary size grows with the size of a corpus and (2) how
the words are distributed helps us determine:
○ Indexing technique
○ Storage: how to store data + storage availability
Probability
● Experiment: a specific set of actions whose results cannot be predicted with
certainty
● Sample space: the set of all possible outcomes (recorded data) of an experiment
● Event space: subsets of a sample space defined by a specific event/outcome
○ Example: The event that the sum of 2 dice is 4
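○ Worked example (to make the definitions concrete): with two fair dice the sample space has
36 equally likely outcomes; the event “the sum is 4” is the subset {(1,3), (2,2), (3,1)}, so
its probability is 3/36 = 1/12 ≈ 0.083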
● Zipf’s law: The frequency of the rth most common word is inversely proportional to r
○ Graphing frequency vs. rank shows a very rapid (power-law) decay: the most frequent
word has rank 1, and frequency falls off quickly as rank increases
○ The rank r of a word times its frequency f is approximately a constant k (checked
empirically in the sketch after this list)
■ 𝑟 × 𝑓 ≈ 𝑘
● If the rank-1 term occurs cN times, then the rank-2 term occurs cN/2 times, …
○ The most frequent word (1/1) will occur 2x as frequently as the
2nd most frequent (½).
○ Alternatively:
■ 𝑟 × 𝑃𝑟 = 𝑐 where
● 𝑃𝑟: probability of occurrence for the rth ranked word
● 𝑐 (≈ 0.1): a constant
○ Might be inaccurate for low and high ranks
■ Not accurate at tails
○ There might be several words with the same frequency n ⇒ their count is rₙ − rₙ₊₁ (where rₙ is
the rank of the last word in the group, and rₙ₊₁ is the rank of the last word of the
previous group of words with the next higher frequency)
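A minimal Python sketch (not from the notes; the whitespace tokenization and the corpus file name
are illustrative assumptions) of checking Zipf's law: count word frequencies, rank them, and print
r × f and r × Pr for the top terms.

```python
from collections import Counter

def zipf_table(text, top=10):
    """Print rank, frequency, r*f, and r*Pr for the most common words."""
    words = text.lower().split()          # naive whitespace tokenization
    counts = Counter(words)
    n = len(words)
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        pr = freq / n                     # probability of the r-th ranked word
        print(f"{rank:>3}  {word:<15} f={freq:<8} r*f={rank * freq:<10} r*Pr={rank * pr:.3f}")

# Hypothetical usage: zipf_table(open("corpus.txt").read())
```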
● Vocabulary Growth: as the size of the corpus grows, new words occur
○ Can be due to: typos, new words in the vocabulary, …
○ Heaps' law: 𝑣 = 𝑘 × 𝑛^β (a vocabulary-growth sketch follows this list)
■ 𝑣: vocab size for a corpus of size 𝑛 words
■ 𝑘, β: parameters (peculiar to each collection)
● 10 ≤ 𝑘 ≤ 100
● β ≈ 0.5
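A minimal sketch (illustrative; k = 30 and β = 0.5 are just example values inside the ranges
above, and the corpus file is hypothetical) of Heaps' law: track observed vocabulary size as the
corpus is read and compare it with k × n^β.

```python
def vocabulary_growth(words, k=30, beta=0.5, step=10000):
    """Compare observed vocabulary size with the Heaps' law prediction k * n**beta."""
    seen = set()
    for n, word in enumerate(words, start=1):
        seen.add(word)
        if n % step == 0:
            predicted = k * n ** beta
            print(f"n={n:>9}  observed v={len(seen):>8}  predicted v={predicted:>10.0f}")

# Hypothetical usage:
# vocabulary_growth(open("corpus.txt").read().lower().split())
```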
Process:
1. Take a document
2. Turn it into a representation (like metadata)
a. No context: bag of words (word order and context are ignored)
b. word2vec: an ML algorithm that models the probability of seeing one word next to
another
On the query side:
1. Take a query
2. Turn it into a representation (it has to be of the same form as the document representation)
⇒ Compare the two representations (a minimal bag-of-words comparison is sketched below)
⇒ Evaluate the comparisons (how related is the document to the query?)
⇒ Feedback (e.g., based on how many times users clicked on a result)
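A minimal sketch (assumed details, not the notes' exact method) of this pipeline: turn a document
and a query into the same bag-of-words representation, then compare them with a simple
term-overlap score.

```python
from collections import Counter

def bag_of_words(text):
    """Context-free representation: term -> count, ignoring word order."""
    return Counter(text.lower().split())

def overlap_score(doc_bow, query_bow):
    """Simple comparison: sum of doc counts for each query term (a dot product)."""
    return sum(doc_bow[term] * count for term, count in query_bow.items())

doc = bag_of_words("Tropical fish include fish found in tropical environments")
query = bag_of_words("tropical fish")
print(overlap_score(doc, query))   # 4: 'tropical' and 'fish' each appear twice in the document
```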
Basic inverted index construction:
● Text preprocessing:
○ Remove punctuation
○ Lowercase
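A minimal sketch (illustrative) of basic inverted index construction using the preprocessing
above: remove punctuation, lowercase, then map each term to the sorted list of document IDs
that contain it.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Build term -> sorted list of doc IDs after simple preprocessing."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        tokens = re.findall(r"[a-z0-9]+", text.lower())   # lowercase, drop punctuation
        for token in tokens:
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["Tropical fish, and more fish!", "Fish tanks for tropical fish."]
print(build_inverted_index(docs)["fish"])   # [0, 1]
```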
What’s in a document
● Multiple languages
● Multiple character sets/encodings: how do we go from the raw bytes to characters?
● Different compression algorithms: documents may be zipped/compressed
● Recognition of the content and structure of text documents
○ Tokenizing/lexical analysis: recognizing each word occurrence in the sequence
of characters in a document
○ Metadata: document attributes (date/author); tags to identify document
components
a) Tokenizing
Challenges:
● Small words that are important in some queries get disregarded
● Hyphenated/non-hyphenated forms of many words are common
● Special characters
● Capitalized words (since tokenizing lowercases everything)
● Apostrophes (‘) that can be part of a word
● Periods
⇒ Can treat hyphens, apostrophes, and periods as spaces (see the sketch after this subsection)
⇒ There is no single right way to do it
1. First pass: identify markup or tags in the document (while accommodating syntax errors in
the structure)
2. Second pass: tokenize the text in the appropriate parts of the document
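A minimal sketch (one possible choice, since there is no single right way) of the tokenization
option above: treat hyphens, apostrophes, and periods as spaces, then lowercase and split.

```python
import re

def tokenize(text):
    """Treat hyphens, apostrophes, and periods as spaces; lowercase; keep alphanumeric runs."""
    text = re.sub(r"[-'.]", " ", text)
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("O'Connor's e-book, ver. 2"))   # ['o', 'connor', 's', 'e', 'book', 'ver', '2']
```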
b) Stopping
● Function words: words that have little meaning apart from other words
● Stopwords: function words in the context of information retrieval
○ We will throw them out → decrease index size, increase retrieval efficiency
○ Can be customized for specific parts of the document structure/fields (like
“click”, “here”, “privacy” are stopwords when processing anchor text)
⇒ Downside of just removing the stopwords: Cannot search on them!
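A minimal sketch (the stopword lists here are small illustrative samples) of stopping, including
a field-specific list for anchor text as described above.

```python
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}        # sample general list
ANCHOR_STOPWORDS = STOPWORDS | {"click", "here", "privacy"}          # extra stopwords for anchor text

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop function words from a token stream; the list can vary by document field."""
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(["click", "here", "to", "see", "tropical", "fish"], ANCHOR_STOPWORDS))
# ['see', 'tropical', 'fish']
```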
c) Stemming
Pros:
+ Smaller dictionary size
+ Increased recall (number of documents returned)
Cons:
- Exact quotes can no longer be matched
- Decrease in specificity
- Decrease in precision (more *junk*)
● 2 types of stemmers:
○ Algorithmic: decide whether 2 words are related
■ Simplest rule: a trailing “s” signifies plurality (a toy version is sketched after
this section)
● False negatives: fails to detect a real relationship (example: country
and countries)
● False positives: detects a relationship where there is none (example:
I and is)
⇒ Considering more kinds of suffixes reduces false
negatives, but increases false positives
■ Porter stemmer:
● The most popular algorithmic stemmer
● Used in many information retrieval experiments and systems
since the 70s
○ Dictionary-based: store term relationships based on pre-created dictionaries of
related terms
■ Instead of computing the relationship between words, these just store
lists of related words in a large dictionary
■ Drawback: tend to be static + less flexible (while language evolves
constantly)
○ Combine the 2:
■ Use algorithmic to deal with new/unrecognized words
■ Dictionary is used to detect relationships between common words
■ Krovetz stemmer: use dictionary to check the validity of a word + if not
found, then check for suffixes → feed to the dictionary again
● Lower false positive rate than Porter
● Higher false negative rate
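A toy sketch (far simpler than Porter or Krovetz) of the simplest algorithmic stemmer described
above, showing both kinds of errors.

```python
def naive_stem(word):
    """Toy rule: a trailing 's' signals a plural form."""
    if len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word

print(naive_stem("tanks"))       # 'tank'      (correct)
print(naive_stem("countries"))   # 'countrie'  (false negative: does not match 'country')
print(naive_stem("is"))          # 'i'         (false positive: wrongly conflated with 'I')
```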
● Document parsers use the structure of web pages (for example, from HTML markup) in
ranking algorithms
○ Headings indicate that the enclosed phrase is particularly important
● Anchor Text: 2-3 words describing the topic of the linked page
○ Example: links pointing to www.ebay.com very often use “eBay” as the anchor text
○ This motivates a simple ranking algorithm: search through all links in the
collection, looking for anchor text that matches the user’s query
● PageRank:
○ Naive approach: measure the popularity of a page directly
■ e.g., count the number of inlinks (links pointing to a page) for each page
■ Susceptible to spam!
○ PageRank: the link-analysis measure used by the Google search engine
■ A value associated with each web page
■ Helps search engines sift through all pages containing the word “eBay” to
find the most popular one
■ The PageRank of a page = the sum, over every page that links to it, of that
page’s PageRank divided by the number of pages it links to (see the sketch below)
○ Rank sinks: pages with no outbound links (accumulate PR but do not distribute it)
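A minimal sketch (simplified; the 0.85 damping factor and the toy graph are assumptions, and rank
sinks are not handled) of the PageRank idea above: each page's value is a share of the values of
the pages linking to it, each divided by that page's number of outgoing links.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> set of pages it links to. Returns page -> PageRank value."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}                  # start uniform
    for _ in range(iterations):
        pr = {
            p: (1 - damping) / len(pages)
               + damping * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return pr

# Toy web graph: A links to B and C, B links to C, C links back to A
print(pagerank({"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}))
```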
● Link quality:
○ Hard to assess: web page designers may try to create useless links to improve
the PR → link spam
■ Solution: detect link spam in comment sections and ignore those links during
indexing
● Easier solution: add rel="nofollow" to untrusted or unimportant links
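○ Example (hypothetical URL): a link written as <a href="http://example.com/signup"
rel="nofollow">sign up</a> tells crawlers not to pass ranking credit (PageRank) through
that link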
6. Internationalization