Chapter 4 - Processing Text

● Corpus: a collection of documents of some type
○ Example: a corpus of news articles
○ Documents in a corpus usually share some characteristics

● We can rely on these textual characteristics to build text-based search engines

● The statistical laws discussed below (e.g., Zipf’s law, Heaps’ law) apply across:
○ Different languages
○ Different text domains

1.​ Words to Terms

●​ Text transformation/processing: convert words into index terms


○​ Index term: representation of the content of a document (used for searching)
● Tokenization: splitting text into individual words (tokens)
●​ Stopping: ignore some words to make query processing more efficient and effective
●​ Stemming: allow similar words (like “run” and “running”) to match each other

⇒ Search engines decipher the meaning of text through counts of word occurrences and
co-occurrences (the number of times groups of words occur together in documents)

2.​ Text Statistics

●​ Distribution of word frequencies is very skewed


○ The two most frequent words (“the” and “of”) account for about 10% of all word occurrences
○ About half of all unique words in a large sample of text occur only once

●​ Knowing (1) how fast the vocabulary size grows with the size of a corpus and (2) how
the words are distributed helps us determine:
○​ Indexing technique
○​ Storage: how to store data + storage availability

Probability
● Experiment: a specific set of actions whose results cannot be predicted with certainty
● Sample space: the set of all possible outcomes (recorded data) of an experiment
● Event space: a subset of the sample space defined by a specific event/outcome
○​ Example: The event that the sum of 2 dice is 4

●​ Zipf’s law: The frequency of the rth most common word is inversely proportional to r
○ Graphing frequency vs. rank shows a very sharp (roughly hyperbolic) decay: the most
frequent word has rank 1, and frequency drops off rapidly as rank increases
○ The rank r of a word times its frequency f is approximately a constant k
■ 𝑟 × 𝑓 ≈ 𝑘
● If the rank 1 term occurs cN times, then the rank 2 term occurs cN/2 times, …
○ The most frequent word (rank 1) occurs about twice as frequently as the
2nd most frequent word
○​ Alternatively:
■​ 𝑟 × 𝑃𝑟 = 𝑐 where
●​ 𝑃𝑟: probability of occurrence for the rth ranked word
●​ 𝑐 (≈ 0. 1): a constant
○​ Might be inaccurate for low and high ranks
■​ Not accurate at tails
○ Many words can share the same frequency n; the number of words with frequency n is
𝑟𝑛 − 𝑟𝑛+1, where 𝑟𝑛 is the rank of the last word with frequency n and 𝑟𝑛+1 is the rank
of the last word of the previous group (the words with frequency n + 1). A quick
numerical check of 𝑟 × 𝑃𝑟 ≈ 𝑐 is sketched below.
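A minimal sketch (the toy word list is made up, not from the chapter) of how one might check
Zipf’s law by computing r × Pr for ranked word counts; on a realistic corpus the product hovers
near c ≈ 0.1, while this tiny example only shows the computation.

```python
# A minimal sketch: check how well r * Pr stays near a constant c (~0.1 under
# Zipf's law) for a toy list of word occurrences.
from collections import Counter

text = "the of and to in a is that it was the of and a the in to the of".split()

counts = Counter(text)                      # word -> frequency
total = sum(counts.values())                # N, total word occurrences

# Sort by frequency, highest first; rank 1 is the most frequent word.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

for rank, (word, freq) in enumerate(ranked, start=1):
    pr = freq / total                       # probability of occurrence of this word
    print(f"rank {rank:2d}  {word:5s}  f={freq}  r*Pr={rank * pr:.3f}")
```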

●​ Vocabulary Growth: as the size of the corpus grows, new words occur
○​ Can be due to: typos, new words in the vocabulary, …
○ Heaps’ law: 𝑣 = 𝑘 × 𝑛^β
■ 𝑣: vocabulary size for a corpus of 𝑛 total words
■ 𝑘, β: parameters peculiar to each collection
● 10 ≤ 𝑘 ≤ 100
● β ≈ 0.5
(A worked numeric example is sketched below.)
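A quick numeric sketch of Heaps’ law with assumed parameters (k = 50, β = 0.5; these values are
illustrative, not taken from the chapter):

```python
# A minimal sketch: predict vocabulary size with Heaps' law v = k * n**beta.
def heaps_vocab(n, k=50, beta=0.5):
    """Predicted number of unique words in a corpus of n total words."""
    return k * n ** beta

# e.g., a 1-billion-word corpus with k=50, beta=0.5:
print(heaps_vocab(1_000_000_000))   # 50 * sqrt(1e9) ~ 1.58 million unique words
```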

3.​ Document Parsing

Process:
1. Take a document
2. Turn it into a representation (e.g., metadata)
a. Context-free representation: bag of words
b. word2vec: an ML algorithm that models the probability of seeing one word next to
another
Similarly, for the query:
1. Take a query
2. Turn it into a representation (it must be of the same form as the document representation)

⇒ Compare them
⇒ Evaluate the comparisons (how related is this to the query)
⇒ Feedback (eg: based on how many times you clicked on something)
Basic inverted index construction:

●​ Token: a semantic unit


○​ Instance of a word occurring in a doc. Example: Bush, bush, friends
○​ Tokenizing: figuring out how to divide up the documents into fragments

●​ Text preprocessing:
○​ Remove punctuations
○​ Lowercase

● Inverted index: maps each word to the documents in which it appears (a minimal
construction sketch follows this list)
○ Term: a normalized entry in the dictionary. Example: bush, friend
○ Type: the class of all tokens containing the same character sequence (Example: “to
be or not to be” has 6 tokens and 4 types)
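A minimal sketch of basic inverted index construction as described above (the two toy documents
are made up): remove punctuation, lowercase, tokenize on whitespace, and map each term to the
documents containing it.

```python
# A minimal sketch of inverted index construction: lowercase, strip punctuation,
# tokenize on whitespace, and map each term to the set of documents containing it.
import re
from collections import defaultdict

docs = {                                   # hypothetical toy collection
    1: "Bush and his friends.",
    2: "A bush in the garden.",
}

index = defaultdict(set)                   # term -> set of doc ids

for doc_id, text in docs.items():
    tokens = re.sub(r"[^\w\s]", " ", text).lower().split()   # remove punctuation, lowercase
    for term in tokens:
        index[term].add(doc_id)

print(dict(index))   # e.g., 'bush' -> {1, 2}
```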

What’s in a document
● Multiple languages
● Multiple character sets/encodings: how do we get from raw bytes to characters?
● Different compression algorithms: documents may be zipped/compressed
●​ Recognition of the content and structure of text documents
○​ Tokenizing/lexical analysis: recognizing each word occurrence in the sequence
of characters in a document
○​ Metadata: document attributes (date/author); tags to identify document
components

Big challenge in querying: Vocabulary mismatch


●​ Word boundary variation
○​ eusa is different from e-usa = usa = u.s.a = us of a
●​ Morphological variation
○​ Policeman = policemen = police vs. policy = politics
●​ Spelling variation:
○​ Gaddafi = Gadhafi = …
●​ Synonymy and Polysemy
○​ apple vs. Apple vs. Big Apple

a)​ Tokenizing

Challenges:
● Small words that tokenizers disregard can be important in queries
● Both hyphenated and non-hyphenated forms of many words are common
● Special characters
● Capitalized words can carry meaning (e.g., “Bush” vs. “bush”), but tokenizing often
lowercases everything
● Apostrophes (‘) that can be part of a word
● Periods
⇒ A common simplification is to treat hyphens, apostrophes, and periods as spaces (see the
tokenizer sketch below)
⇒ There is no single right way to do it

1. First pass: identify markup or tags in the document (and accommodate syntax errors in
the structure)
2. Second pass: tokenize the text in the appropriate parts of the document
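A minimal tokenizer sketch reflecting the simplification above (treat hyphens, apostrophes, and
periods as spaces, then lowercase and split); real tokenizers make more nuanced choices.

```python
# A minimal tokenizer sketch: lowercase the text and treat hyphens, apostrophes,
# and periods as spaces before splitting on whitespace.
import re

def tokenize(text):
    text = text.lower()
    text = re.sub(r"[-'.]", " ", text)     # hyphens, apostrophes, periods -> spaces
    return text.split()

print(tokenize("Bill's U.S.A.-based e-mail"))
# ['bill', 's', 'u', 's', 'a', 'based', 'e', 'mail']
```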

b)​ Stopping

●​ Function words: words that have little meaning apart from other words
● Stopwords: function words in the context of information retrieval
○ We often throw them out → smaller index, more efficient and effective retrieval
○ Stopword lists can be customized for specific parts of the document structure/fields
(e.g., “click”, “here”, and “privacy” are reasonable stopwords when processing anchor text)
⇒ Downside of simply removing stopwords: you can no longer search on them! (A sketch of
index-time stopword filtering follows.)
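A minimal sketch of stopping at index time; the stopword list here is an illustrative subset,
not a standard list.

```python
# A minimal sketch of stopping: drop tokens found in a (small, illustrative)
# stopword list before they reach the index.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is"}   # illustrative subset only

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "quick", "brown", "fox", "is", "in", "a", "box"]))
# ['quick', 'brown', 'fox', 'box']
```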
c)​ Stemming

Pros:
+​ Smaller dictionary size
+​ Increased recall (number of documents returned)

Cons:
-​ Exact quotes
-​ Decrease in specificity
-​ Decrease in precision (more *junk*)

● Stemming: the component of text processing that captures the relationships between
different variations of a word
○ Reduces the different forms of a word to a common stem; variant forms occur because of
■ Inflection (e.g., plurals, tenses) or
■ Derivation (e.g., turning a verb into a noun by adding the suffix -ation)
→ For example, “swimming” and “swam” can be reduced to “swim”

●​ 2 types of stemmers:
○ Algorithmic: a program decides whether two words are related (see the sketch after
this list)
■ Simplest approach: assume a trailing “s” signifies plurality and strip it
● False negatives: fails to detect a real relationship (example: country
and countries)
● False positives: detects a relationship where there is none (example:
I and is)
⇒ Considering more kinds of suffixes reduces false
negatives, but increases false positives
■​ Porter stemmer:
●​ The most popular algorithmic stemmer
●​ Used in many information retrieval experiments and systems
since the 70s
○​ Dictionary-based: store term relationships based on pre-created dictionaries of
related terms
■​ Instead of computing the relationship between words, these just store
lists of related words in a large dictionary
■​ Drawback: tend to be static + less flexible (while language evolves
constantly)

○​ Combine the 2:
■​ Use algorithmic to deal with new/unrecognized words
■​ Dictionary is used to detect relationships between common words
■​ Krovetz stemmer: use dictionary to check the validity of a word + if not
found, then check for suffixes → feed to the dictionary again
● Lower false positive rate than Porter
● Somewhat higher false negative rate
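A minimal sketch of the simplest algorithmic stemmer mentioned above (trailing-“s” stripping),
written so the false positives and false negatives are easy to see; it is not the Porter or
Krovetz algorithm.

```python
# A deliberately naive algorithmic stemmer: strip a trailing "s" to undo plurals.
# It illustrates the trade-off above; it is NOT the Porter or Krovetz stemmer.
def naive_stem(word):
    if word.endswith("s") and len(word) > 1:
        return word[:-1]
    return word

print(naive_stem("friends"))    # 'friend'  (correct)
print(naive_stem("countries"))  # 'countrie' - false negative: not reduced to 'country'
print(naive_stem("is"))         # 'i'       - false positive: now matches the word 'I'
```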

d)​ Phrases and N-grams

●​ Phrase: a simple noun phrase (can be nouns, or adjectives followed by nouns)


○​ POS tagging:
■​ Too slow
■​ Alternative: store word position information in the indexes → only identify
phrases when a query is processed

⇒ Testing word proximities at query time is likely to be too slow


⇒ Identify n words to be a unit during text processing

● n-gram: a sequence of n words (or n characters)
○ Generated by choosing a particular value for n and sliding a window of that size
forward one unit (character or word) at a time (see the sketch below)
■ Example: “hello” contains the character bigrams he, el, ll, lo
○ Indexes based on n-grams are larger than word indexes
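A minimal sketch of sliding-window n-gram generation, matching the “hello” bigram example
above; the same function works for both character and word n-grams.

```python
# A minimal sketch of overlapping n-gram generation by sliding a window one
# unit at a time; works for characters (a string) or words (a list of tokens).
def ngrams(sequence, n):
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

print(ngrams("hello", 2))                    # ['he', 'el', 'll', 'lo']
print(ngrams(["to", "be", "or", "not"], 2))  # [['to', 'be'], ['be', 'or'], ['or', 'not']]
```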

4.​ Document Structure and Markup

● Document parsers use the structure of web pages (e.g., from HTML markup) in the
ranking algorithm
○​ Headings indicate that the phrase is particularly important

● XML: allows each application to define its own element types


○​ Defined by a schema (like a database schema)
○​ Therefore, its elements are more tied to the semantics of the data than HTML
elements

5.​ Link Analysis

● Links indicate relationships between pages, which helps with ranking pages

● Anchor Text: typically 2-3 words describing the topic of the linked page
○ Links to www.ebay.com, for example, very often have “eBay” in the anchor text
○ This motivates a simple ranking algorithm: search through all links in the
collection, looking for anchor text that matches the user’s query
●​ PageRank:
○ Naive approach: measure the popularity of a page by counting its inlinks (links
pointing to the page)
■ Susceptible to spam!
○ PageRank: used by the Google search engine
■ A value associated with each web page
■ Helps search engines sift through all pages containing the word “eBay” to
find the most popular one
■ The PageRank of a page u is the sum, over every page v linking to u, of
PR(v) / Lv, where Lv is the number of outgoing links on page v (see the
sketch after this list)

○ To model a “random surfer” who sometimes gets distracted and jumps to a random page
instead of following a link on the site…
■ Adjust PageRank to a weighted average: with probability λ jump to a random
page, with probability 1 − λ follow a random outgoing link, giving
PR(u) = λ/N + (1 − λ) × Σ PR(v)/Lv, where N is the total number of pages

○​ Rank sinks: pages with no outbound links (accumulate PR but do not distribute it)
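A minimal sketch (the toy link graph is made up, not from the chapter) of the iterative
computation implied by the formula above; the λ/N term corresponds to the random-jump
adjustment described earlier.

```python
# A minimal PageRank sketch: iterate PR(u) = lam/N + (1 - lam) * sum(PR(v)/L_v)
# over all pages v that link to u. The toy link graph below is made up.
links = {                      # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

N = len(links)
lam = 0.15                     # probability of jumping to a random page
pr = {page: 1.0 / N for page in links}    # start with a uniform distribution

for _ in range(50):            # a fixed number of iterations is enough for a toy graph
    new_pr = {}
    for page in links:
        incoming = sum(pr[v] / len(links[v]) for v in links if page in links[v])
        new_pr[page] = lam / N + (1 - lam) * incoming
    pr = new_pr

print(pr)                      # approximate PageRank values; they sum to ~1
```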

●​ Link quality:
○​ Hard to assess: web page designers may try to create useless links to improve
the PR → link spam
■​ Solution: detect link spam in comment sections and ignore them during
indexing
● Easier solution: add rel=nofollow to untrusted or unimportant links so they are
ignored when computing PageRank

6.​ Internationalization

●​ Monolingual search engine: search engine designed for a particular language


●​ Challenge:
○ Word segmentation: identifying the breaks between words/index terms in a sequence of
characters, which is especially hard for languages written without spaces between words
■ Techniques: Hidden Markov Models
