Chapter 4 - Processing Text
⇒ Search engines are largely about deciphering the meaning of text from counts of word occurrences
and co-occurrences (the number of times groups of words occur together in documents)
● Knowing (1) how fast the vocabulary size grows with the size of a corpus and (2) how
the words are distributed helps us determine:
○ Indexing technique
○ Storage: how to store data + storage availability
Probability
● Experiment: a specific set of actions whose results cannot be predicted with
certainty
● Sample space: the set of all possible outcomes (recorded data) of an experiment
● Event space: subsets of a sample space defined by a specific event/outcome
○ Example: The event that the sum of 2 dice is 4
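○ Worked example (to make the definitions concrete): with two fair dice the sample space has
36 equally likely outcomes; the event “the sum is 4” is the subset {(1,3), (2,2), (3,1)}, so
its probability is 3/36 = 1/12 ≈ 0.083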
● Zipf’s law: The frequency of the rth most common word is inversely proportional to r
○ Graphing frequency vs. rank shows a very rapid (power-law) decay: the most frequent
word has rank 1, and frequency falls off quickly as rank increases
○ The rank r of a word times its frequency f is approximately a constant k (checked
empirically in the sketch after this list)
■ 𝑟 × 𝑓 ≈ 𝑘
● If the rank-1 term occurs cN times, then the rank-2 term occurs cN/2 times, …
○ The most frequent word (1/1) will occur 2x as frequently as the
2nd most frequent (½).
○ Alternatively:
■ 𝑟 × 𝑃𝑟 = 𝑐 where
● 𝑃𝑟: probability of occurrence for the rth ranked word
● 𝑐 (≈ 0.1): a constant
○ Might be inaccurate for low and high ranks
■ Not accurate at tails
○ There might be several words with the same frequency n ⇒ their count is rₙ − rₙ₊₁ (where rₙ is
the rank of the last word in the group, and rₙ₊₁ is the rank of the last word of the
previous group of words with the next higher frequency)
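A minimal Python sketch (not from the notes; the whitespace tokenization and the corpus file name
are illustrative assumptions) of checking Zipf's law: count word frequencies, rank them, and print
r × f and r × Pr for the top terms.

```python
from collections import Counter

def zipf_table(text, top=10):
    """Print rank, frequency, r*f, and r*Pr for the most common words."""
    words = text.lower().split()          # naive whitespace tokenization
    counts = Counter(words)
    n = len(words)
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        pr = freq / n                     # probability of the r-th ranked word
        print(f"{rank:>3}  {word:<15} f={freq:<8} r*f={rank * freq:<10} r*Pr={rank * pr:.3f}")

# Hypothetical usage: zipf_table(open("corpus.txt").read())
```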
● Vocabulary Growth: as the size of the corpus grows, new words occur
○ Can be due to: typos, new words in the vocabulary, …
○ Heaps' law: 𝑣 = 𝑘 × 𝑛^β (a vocabulary-growth sketch follows this list)
■ 𝑣: vocab size for a corpus of size 𝑛 words
■ 𝑘, β: parameters (peculiar to each collection)
● 10 ≤ 𝑘 ≤ 100
● β ≈ 0.5
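A minimal sketch (illustrative; k = 30 and β = 0.5 are just example values inside the ranges
above, and the corpus file is hypothetical) of Heaps' law: track observed vocabulary size as the
corpus is read and compare it with k × n^β.

```python
def vocabulary_growth(words, k=30, beta=0.5, step=10000):
    """Compare observed vocabulary size with the Heaps' law prediction k * n**beta."""
    seen = set()
    for n, word in enumerate(words, start=1):
        seen.add(word)
        if n % step == 0:
            predicted = k * n ** beta
            print(f"n={n:>9}  observed v={len(seen):>8}  predicted v={predicted:>10.0f}")

# Hypothetical usage:
# vocabulary_growth(open("corpus.txt").read().lower().split())
```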
Process:
1. Take a document
2. Turn it into a representation (like metadata)
a. No context: bag of words (word order and context are ignored)
b. word2vec: an ML algorithm that models the probability of seeing one word next to
another
On the query side:
1. Take a query
2. Turn it into a representation (it has to be of the same form as the document representation)
⇒ Compare the two representations (a minimal bag-of-words comparison is sketched below)
⇒ Evaluate the comparisons (how related is the document to the query?)
⇒ Feedback (e.g., based on how many times users clicked on a result)
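A minimal sketch (assumed details, not the notes' exact method) of this pipeline: turn a document
and a query into the same bag-of-words representation, then compare them with a simple
term-overlap score.

```python
from collections import Counter

def bag_of_words(text):
    """Context-free representation: term -> count, ignoring word order."""
    return Counter(text.lower().split())

def overlap_score(doc_bow, query_bow):
    """Simple comparison: sum of doc counts for each query term (a dot product)."""
    return sum(doc_bow[term] * count for term, count in query_bow.items())

doc = bag_of_words("Tropical fish include fish found in tropical environments")
query = bag_of_words("tropical fish")
print(overlap_score(doc, query))   # 4: 'tropical' and 'fish' each appear twice in the document
```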
Basic inverted index construction:
● Text preprocessing:
○ Remove punctuation
○ Lowercase
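A minimal sketch (illustrative) of basic inverted index construction using the preprocessing
above: remove punctuation, lowercase, then map each term to the sorted list of document IDs
that contain it.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Build term -> sorted list of doc IDs after simple preprocessing."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        tokens = re.findall(r"[a-z0-9]+", text.lower())   # lowercase, drop punctuation
        for token in tokens:
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["Tropical fish, and more fish!", "Fish tanks for tropical fish."]
print(build_inverted_index(docs)["fish"])   # [0, 1]
```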
What’s in a document
● Multiple languages
● Multiple character sets/encodings: how do we go from the raw bytes to characters?
● Different compression algorithms: documents may be zipped/compressed
● Recognition of the content and structure of text documents
○ Tokenizing/lexical analysis: recognizing each word occurrence in the sequence
of characters in a document
○ Metadata: document attributes (date/author); tags to identify document
components
a) Tokenizing
Challenges:
● Small words that are important in some queries get disregarded
● Hyphenated/non-hyphenated forms of many words are common
● Special characters
● Capitalized words (since tokenizing lowercases everything)
● Apostrophes (‘) that can be part of a word
● Periods
⇒ Can treat hyphens, apostrophes, and periods as spaces (see the sketch after this subsection)
⇒ There is no single right way to do it
1. First pass: identify markup or tags in the document (while accommodating syntax errors in
the structure)
2. Second pass: tokenize the text in the appropriate parts of the document
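A minimal sketch (one possible choice, since there is no single right way) of the tokenization
option above: treat hyphens, apostrophes, and periods as spaces, then lowercase and split.

```python
import re

def tokenize(text):
    """Treat hyphens, apostrophes, and periods as spaces; lowercase; keep alphanumeric runs."""
    text = re.sub(r"[-'.]", " ", text)
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("O'Connor's e-book, ver. 2"))   # ['o', 'connor', 's', 'e', 'book', 'ver', '2']
```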
b) Stopping
● Function words: words that have little meaning apart from other words
● Stopwords: function words in the context of information retrieval
○ We will throw them out → decrease index size, increase retrieval efficiency
○ Can be customized for specific parts of the document structure/fields (like
“click”, “here”, “privacy” are stopwords when processing anchor text)
⇒ Downside of just removing the stopwords: Cannot search on them!
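A minimal sketch (the stopword lists here are small illustrative samples) of stopping, including
a field-specific list for anchor text as described above.

```python
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}        # sample general list
ANCHOR_STOPWORDS = STOPWORDS | {"click", "here", "privacy"}          # extra stopwords for anchor text

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop function words from a token stream; the list can vary by document field."""
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(["click", "here", "to", "see", "tropical", "fish"], ANCHOR_STOPWORDS))
# ['see', 'tropical', 'fish']
```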
c) Stemming
Pros:
+ Smaller dictionary size
+ Increased recall (number of documents returned)
Cons:
- Exact quotes can no longer be matched
- Decrease in specificity
- Decrease in precision (more *junk*)
● 2 types of stemmers:
○ Algorithmic: decide whether 2 words are related
■ Simplest rule: a trailing “s” signifies plurality (a toy version is sketched after
this section)
● False negatives: fails to detect a real relationship (example: country
and countries)
● False positives: detects a relationship where there is none (example:
I and is)
⇒ Considering more kinds of suffixes reduces false
negatives, but increases false positives
■ Porter stemmer:
● The most popular algorithmic stemmer
● Used in many information retrieval experiments and systems
since the 70s
○ Dictionary-based: store term relationships based on pre-created dictionaries of
related terms
■ Instead of computing the relationship between words, these just store
lists of related words in a large dictionary
■ Drawback: tend to be static + less flexible (while language evolves
constantly)
○ Combine the 2:
■ Use algorithmic to deal with new/unrecognized words
■ Dictionary is used to detect relationships between common words
■ Krovetz stemmer: use dictionary to check the validity of a word + if not
found, then check for suffixes → feed to the dictionary again
● Lower false positive rate than Porter
● Higher false negative rate
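A toy sketch (far simpler than Porter or Krovetz) of the simplest algorithmic stemmer described
above, showing both kinds of errors.

```python
def naive_stem(word):
    """Toy rule: a trailing 's' signals a plural form."""
    if len(word) > 1 and word.endswith("s"):
        return word[:-1]
    return word

print(naive_stem("tanks"))       # 'tank'      (correct)
print(naive_stem("countries"))   # 'countrie'  (false negative: does not match 'country')
print(naive_stem("is"))          # 'i'         (false positive: wrongly conflated with 'I')
```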
● Document parsers use the structure of web pages (for example, from HTML markup) in
ranking algorithms
○ Headings indicate that the enclosed phrase is particularly important
● Anchor Text: 2-3 words describing the topic of the linked page
○ Example: links pointing to www.ebay.com very often use “eBay” as the anchor text
○ This motivates a simple ranking algorithm: search through all links in the
collection, looking for anchor text that matches the user’s query
● PageRank:
○ Naive approach: measure the popularity of a page directly
■ e.g., count the number of inlinks (links pointing to a page) for each page
■ Susceptible to spam!
○ PageRank: the link-analysis measure used by the Google search engine
■ A value associated with each web page
■ Helps search engines sift through all pages containing the word “eBay” to
find the most popular one
■ The PageRank of a page = the sum, over every page that links to it, of that
page’s PageRank divided by the number of pages it links to (see the sketch below)
○ Rank sinks: pages with no outbound links (accumulate PR but do not distribute it)
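A minimal sketch (simplified; the 0.85 damping factor and the toy graph are assumptions, and rank
sinks are not handled) of the PageRank idea above: each page's value is a share of the values of
the pages linking to it, each divided by that page's number of outgoing links.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> set of pages it links to. Returns page -> PageRank value."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}                  # start uniform
    for _ in range(iterations):
        pr = {
            p: (1 - damping) / len(pages)
               + damping * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            for p in pages
        }
    return pr

# Toy web graph: A links to B and C, B links to C, C links back to A
print(pagerank({"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}))
```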
● Link quality:
○ Hard to assess: web page designers may try to create useless links to improve
the PR → link spam
■ Solution: detect link spam in comment sections and ignore those links during
indexing
● Easier solution: add rel="nofollow" to untrusted or unimportant links
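○ Example (hypothetical URL): a link written as <a href="http://example.com/signup"
rel="nofollow">sign up</a> tells crawlers not to pass ranking credit (PageRank) through
that link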
6. Internationalization