PAT Trees and PAT Arrays
PAT Trees and PAT Arrays
PAT Trees and PAT Arrays
Semi-infinite Strings
Example Text
Once upon a time, in a far away land
sistring 1 Once upon a time
sistring 2 nce upon a time
sistring 8 on a time, in a
sistring 11 a time, in a far
sistring 22 a far away land
3
PAT Tree
each internal node indicates which bit of the query is used for
branching
absolute bit position
a count of the number of bits to skip
For a text input of size n there are n leaf nodes and n-1 at most
higher level nodes.
4
Example
Text
sistring 1
sistring 2
sistring 3
sistring 4
sistring 5
sistring 6
sistring 7
sistring 8
01100100010111
01100100010111
1100100010111
100100010111
00100010111
0100010111
100010111
00010111
0010111 ...
1
2
4
2
3
3
5
: internal node
skip counter & pointer
Goal: To provide a fast test to eliminate the majority of items that are not
related to a query.
File structure is highly compressed and unordered, hence needs less
storage space which leads to fast retrieval.
Response time is linear to the file size as it performs linear scan of the
compressed items. O(N)
Word Signature is a fixed length code with fixed number of bits set to 1
The bit positions of 1 are determined by a hash function of the word.
Word signatures are ORed together to create Item Signature.
Items are logically divided into fixed length blocks to reduce density of
1s in signature.
Example
Block Size 5 words
Code Length 16 bits
#bits allowed to be 1 5
Text: Computer Science Graduate Students Study
Word
Signature
Computer
0001 0110 0000 0110
Science
1001 0000 1110 0000
Graduate
1000 0101 0100 0010
Students
0000 0111 1000 0100
Study 0000 0110 0110 0100
Block Signature: 1001 0111 1110 0110
Signature Files
Signature size. Number of bits in a signature, F.
Word signature. A bit pattern of size F with exactly
m bits set to 1 and the others 0.
Block. A sequence of text that contains D distinct
words.
Block signature. The logical OR of all the word
signatures in a block of text.
8
Search Procedure
False Drop
probability that the signature test will fail, creating
a false hit or false drop
A word signature may match the block signature,
but the word is not in the block. This is a false hit.
10
Comparisons
Signature files
Use hashing techniques to produce an index
Advantage
storage overhead is small (10%-20%)
Disadvantages
the search time on the index is linear
some answers may not match the query, thus filtering
must be done
11
Comparisons
(Continued)
Inverted files
storage overhead (30% ~ 100%)
search time for word searches is logarithmic
PAT arrays
potential use in other kind of searches
phrases
regular expression searching
approximate string searching
longest repetitions
most frequent searching
12