資料工程 Data Engineering: Pattern Matching 張賢宗
資料工程 Data Engineering: Pattern Matching 張賢宗
資料工程 Data Engineering: Pattern Matching 張賢宗
Pattern Matching
張賢宗
2
Outline
• What is Pattern Matching
• The Brute Force Algorithm
• The Knuth-Morris-Pratt(KMP) Algorithm
• The Boyer-Moore Algorithm
• External Pattern Matching
3
Basic Concepts
• Assume S is a string with length m
• S[i…j] is a fragment between indexes i and j, we
call the fragment as substring of S
• S[0…i] is a prefix of S, where 0<=i<=m-1
• S[i…m-1] is a suffix of S, where 0<=i<=m-1
5
Examples
• S: smallpig
• Substring
▫ mal
▫ lpig
• Prefix
▫ smallpig, smallpi, smallp, small, smal, sma, sm, s
• Suffix
▫ smallpig, mallpig, allpig, llpig, lpig, pig, ig, g
6
T s ma l l p i g
:P a l l
: P al l
: P al l
7
Time Complexity
• Brute force pattern matching runs in time O(mn)
in the worst case.
• But most searches of ordinary text take
O(m+n), which is very quick.
9
Alphabets
• Alphabet
▫ The variations of a character in a string
▫ English: a~z, A~Z, 0~9
▫ Computer: ASCII(0) ~ ASCII(255)
▫ Bits: 0, 1
• The brute force algorithm is fast when the alpha
bet of the text is large
• It is slower when the alphabet is small
10
Answer
• The largest prefix of P[0 .. j-1] that is a suffix of P
[1 .. j-1]
13
Why
• Let u is the largest prefix of P that is a also a suffi
x of P (P[0 .. k-1] = P[m-k…m-1])
• Assume that we can find a match from T[i+d],
where 0<d<|m|-|u|
• T[i+d… i+|m|-1] = P[0…|m|-|d|-1]
• T[i+d… i+|m|-1] is the suffix of P with length |
m|-d
• |m|-d > |u|, Contradiction.
• We cannot find a such match from T[i+d]
14
Why?
Pattern Matching 110/12/07
15
Example
16
Example
• Find largest prefix of:
"a b a a b" ( P[0..j-1] )
which is suffix of:
"b a a b" ( p[1 .. j-1] )
• It is "a b"
• Set j = 2 // the new j value
17
Failure Function
• KMP preprocesses the pattern to find matches of
prefixes of the pattern with the pattern itself.
• j = mismatch position in P[]
• k = position before the mismatch (k = j-1).
• The failure function F(k) is defined as the size of
the largest prefix of P[0..k] that is also a suffix of
P[1..k].
18
F(4)=2
• Find the size of the largest prefix of P[0..4] that i
s also a suffix of P[1..4]
▫ Find the size largest prefix of "abaab" that
is also a suffix of "baab“
▫ It is "ab“ =2
20
KMP in C Code
• Knuth-Morris-Pratt’s algorithm modifies the
brute-force algorithm.
▫ if a mismatch occurs at P[j]
(i.e. P[j] != T[i]), then
k = j-1;
j = F(k); // obtain the new j
21
KMP in C Code
while (i < n) {
if (pattern[j] == text[i]) {
if (j == m - 1)
return i - m + 1; // match
i++;
j++;
}
else if (j > 0)
j = fail[j-1];
else
i++;
}
22
Analysis of KMP
• KMP runs in optimal time: O(m+n)
• The algorithm never needs to move backwards i
n the input text, T
▫ This makes the algorithm good for processing very
large files that are read in from external devices or
through a network stream.
23
Analysis of KMP
• KMP doesn’t work so well as the size of the alpha
bet increases
▫ More chance of a mismatch (more possible misma
tches)
▫ Mismatches tend to occur early in the pattern, but
KMP is faster when the mismatches occur later
24
BM Case 1
26
BM Case 2
27
BM Case 3
28
BM Example
T:
a p a t t e r n m a t c h i n g a l g o r i t h m
1 3 5 11 10 9 8 7
r i t h m r i t h m r i t h m r i t h m
P: r i
2
t h m r i
4
t h m r i
6
t h m
29
x a b c d
BMBC 1 1 2 6
31
BM Good Suffix
• Assume that a mismatch occurs between the character x[i]=a of the
pattern and the character y[i+j]=b of the text during an attempt at
position j.
• Then, x[i+1 .. m-1]=y[i+j+1 .. j+m-1]=u and x[i] y[i+j]. The good-
suffix shift consists in aligning the segment y[i+j+1 .. j+m-1]=x[i+1 ..
m-1] with its rightmost occurrence in x that is preceded by a
character different from x[i]
32
BM Good Suffix
• If there exists no previous segment, the shift
consists in aligning the longest suffix v of y[i+j+1
.. j+m-1] with a matching prefix of x
33
Analysis of BM
• Boyer-Moore worst case running time is
O(nm)
• Best Case of Moyer-Moore is O(n/m)
• Boyer-Moore is fast when the alphabet is large,
slow when the alphabet is small.
• In practice, the running time BM < KMP < BF
35
KMP & BM
• KMP
▫ Small alphabet
▫ Network Stream
▫ External Disk
• BM
▫ Large alphabet
▫ Faster in average