Indexing and Searching: Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto
Indexing and Searching: Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto
Indexing and Searching: Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto
1
Outline
Inverted Files
Other Indices for Text
Sequential Searching
Pattern Matching
Compression
2
Inverted Files
And inverted file (or inverted index) is a word-
oriented mechanism for indexing a text collection
in order to speed up the searching task.
Structure:vocabulary and occurrences
Block addressing
The text is divided in blocks, and the
occurrences point to the blocks
Full inverted indices:exact occurrences
3
4
5
Inverted Files
The search algorithm on an inverted index
Vocabulary search
Retrieval of occurrences
Manipulation of occurrences
8
9
Other Indices for Text
Suffix Trees
Suffix Arrays
Signature Files
10
Suffix Trees and Suffix Arrays
Each position in the text is considered as a
text suffix
Index points are selected form the text,
which point to the beginning of the text
positions which will be retrievable
11
12
Suffix arrays
The main drawbacks of Suffix Array are its
costly construction process.
Allow binary searches done by comparing
the contents of each pointer.
Supra-indices (for large suffix array)
13
14
15
Construction of Suffix Arrays for
Large Texts
16
Signature Files
Word-oriented index structures base on hashing
Maps words to bit masks of B bits
Divides the text in blocks of b words each
The mask is obtained by bitwise ORing the
signatures of all the words in the text block.
Hash the query to a bit mask W
If W & Bi = W, the text block may contain the
word
17
18
Sequential Searching
Brute Force
Knuth-Morris-Pratt
Boyer-Moore Family
Shift-Or
Suffix Automaton
Backward DAWG matching (BDM)
BNDM
19
Knuth-Morris-Pratt
20
Boyer-Moore Family
21
Shift-Or
22
Suffix Automaton
23
24
Pattern Matching
Searching allowing errors
Dynamic Programming
Automaton
25
Dynamic Programming
26
Automaton
27
Regular Expressions
28
Pattern Matching Using Indices
Inverted Files
The types of queries such as suffix or
substring queries, searching allowing
errors and regular expressions, are solved
by a sequential search
The restriction is to find approximate
matches or regular expressions that span
many word.
29
Pattern Matching Using Indices
Suffix Trees
Suffix trees are able to perform complex
searches
Word, prefix, suffix, substring, and Range
queries
Regular expressions
31
Compression
Compressed text--Huffman coding
Taking words as symbols
Compressed indices
Inverted Files
Signature Files
32