PAT Trees and PAT Arrays

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 12

PAT Trees and PAT Arrays

PAT Trees and PAT Arrays

Problems of tradition IR models


Documents and words are assumed.
Keywords must be extracted from the text (indexing).
Queries are restricted to keywords.

New indices for text


A text is regarded as a long string.
Each position corresponds to a semi-infinite string (sistring).
No structures and no keywords

Semi-infinite Strings

Example Text
Once upon a time, in a far away land
sistring 1 Once upon a time
sistring 2 nce upon a time
sistring 8 on a time, in a
sistring 11 a time, in a far
sistring 22 a far away land
3

PAT Tree

PATRICIA Practical Algorithm To Retrieve Information Coded In


Alphanumerics
A PAT tree is an unbalanced, binary digital tree defined by sistrings.
The individual bits of the keys are used to decide on the branching
A 0 bit will cause a branch to the left subtree
A 1 bit will cause a branch to the right subtree

each internal node indicates which bit of the query is used for
branching
absolute bit position
a count of the number of bits to skip

each external node points to a sistring


the integer displacement to original text

For a text input of size n there are n leaf nodes and n-1 at most
higher level nodes.
4

Example
Text
sistring 1
sistring 2
sistring 3
sistring 4
sistring 5
sistring 6
sistring 7
sistring 8

01100100010111
01100100010111
1100100010111
100100010111
00100010111
0100010111
100010111
00010111
0010111 ...

1
2
4

2
3

3
5

: external node sistring


(integer displacement)
total displacement of the
bit to be inspected

: internal node
skip counter & pointer

Signature File Structure

Goal: To provide a fast test to eliminate the majority of items that are not
related to a query.
File structure is highly compressed and unordered, hence needs less
storage space which leads to fast retrieval.
Response time is linear to the file size as it performs linear scan of the
compressed items. O(N)
Word Signature is a fixed length code with fixed number of bits set to 1
The bit positions of 1 are determined by a hash function of the word.
Word signatures are ORed together to create Item Signature.
Items are logically divided into fixed length blocks to reduce density of
1s in signature.

Example
Block Size 5 words
Code Length 16 bits
#bits allowed to be 1 5
Text: Computer Science Graduate Students Study
Word
Signature
Computer
0001 0110 0000 0110
Science
1001 0000 1110 0000
Graduate
1000 0101 0100 0010
Students
0000 0111 1000 0100
Study 0000 0110 0110 0100
Block Signature: 1001 0111 1110 0110

Signature Files
Signature size. Number of bits in a signature, F.
Word signature. A bit pattern of size F with exactly
m bits set to 1 and the others 0.
Block. A sequence of text that contains D distinct
words.
Block signature. The logical OR of all the word
signatures in a block of text.
8

Search Procedure

Query terms are mapped to their signature.


Search is accomplished by template matching on the bit
positions specified by the words in the query.
Longer code lengths reduce probability of collision.
Fewer bits/code reduce the effect of code word pattern being
in the final block signature even though the word is not in the
item.
Horizontal or Vertical Partitioning are applied to reduce search
time.
Signature files are applicable to medium size databases with
low frequency of terms, WORM devices, parallel processing
machines and distributed environments.
9

False Drop
probability that the signature test will fail, creating
a false hit or false drop
A word signature may match the block signature,
but the word is not in the block. This is a false hit.

10

Comparisons

Signature files
Use hashing techniques to produce an index
Advantage
storage overhead is small (10%-20%)

Disadvantages
the search time on the index is linear
some answers may not match the query, thus filtering
must be done

11

Comparisons

(Continued)

Inverted files
storage overhead (30% ~ 100%)
search time for word searches is logarithmic

PAT arrays
potential use in other kind of searches
phrases
regular expression searching
approximate string searching
longest repetitions
most frequent searching
12

You might also like