Lecture 05
Factor-based algorithms use an automaton that accepts suffixes of the reverse pattern P^R (or equivalently reversed prefixes of the pattern P).

(Figure: the automaton for P = assi, with states 3, 2, 1, 0, −1 and transitions labelled a, s, s, i.)
Suppose we are currently comparing P against T [j..j + m). We use the automaton to scan the text backwards from T [j + m − 1]. The state of the automaton after scanning T [j + i..j + m) tells whether that string is a prefix of P, some other factor of P, or not a factor at all; the details are spelled out below for the bit-parallel version.
BNDM does a bit-parallel simulation of the nondeterministic automaton, which is quite similar to Shift-And.
The state of the automaton is stored in a bitvector D. When the automaton has scanned T [j + i..j + m):
• D.k = 1 if and only if there is a path from the initial state to state k with the string (T [j + i..j + m))^R, and thus T [j + i..j + m) = P [m − k − 1..2m − k − i − 1).
• If D.(m − 1) = 1, then T [j + i..j + m) is a prefix of the pattern.
• If D = 0, then the automaton can no longer reach an accept state.
Algorithm 2.16: BNDM
Input: text T = T [0..n), pattern P = P [0..m)
Output: position of the first occurrence of P in T
Preprocess:
(1) for c ∈ Σ do B[c] ← 0
(2) for i ← 0 to m − 1 do B[P [m − 1 − i]] ← B[P [m − 1 − i]] + 2^i
Search:
(3) j ← 0
(4) while j + m ≤ n do
(5)   i ← m; shift ← m
(6)   D ← 2^m − 1 // D ← 1^m
(7)   while D ≠ 0 do
        // Now T [j + i..j + m) is a pattern factor
(8)     i ← i − 1
(9)     D ← D & B[T [j + i]]
(10)    if D & 2^(m−1) ≠ 0 then
          // Now T [j + i..j + m) is a pattern prefix
(11)      if i = 0 then return j
(12)      else shift ← i
(13)    D ← D << 1
(14)  j ← j + shift
(15) return n
Example 2.17: P = assi, T = apassi.
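Algorithm 2.16 translates almost directly into Python. The sketch below is mine, not part of the lecture: the names B, D and shift follow the pseudocode, a dictionary stands in for the table B, and an explicit mask keeps D to m bits (Python integers are unbounded, so D << 1 would otherwise grow past bit m − 1). On Example 2.17 it reports the occurrence at position 2.

def bndm(P, T):
    """Return the position of the first occurrence of P in T, or len(T)."""
    m, n = len(P), len(T)
    mask = (1 << m) - 1
    # Preprocess: bit i of B[c] is set iff P[m - 1 - i] == c.
    B = {}
    for i in range(m):
        B[P[m - 1 - i]] = B.get(P[m - 1 - i], 0) | (1 << i)
    j = 0
    while j + m <= n:
        i, shift = m, m
        D = mask                          # D = 1^m
        while D != 0:
            # Now T[j+i..j+m) is a pattern factor.
            i -= 1
            D &= B.get(T[j + i], 0)
            if D & (1 << (m - 1)):
                # Now T[j+i..j+m) is a pattern prefix.
                if i == 0:
                    return j
                shift = i
            D = (D << 1) & mask
        j += shift
    return n

print(bndm("assi", "apassi"))             # 2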
In the integer alphabet model, when m ≤ w, the preprocessing of BNDM takes O(σ + m) time, and each window is scanned in O(m) time, so the worst-case search time is O(mn); the average-case behaviour is summarized below.
• On average, the search time of BDM and BOM is O((n/m) + (n/m) logσ m), which is optimal. (BNDM achieves this bound only when m ≤ w.)
• MP and KMP are optimal in the worst case but not in the average case.
• There are also algorithms that are optimal in both cases. They are
based on similar techniques, but we will not describe them here.
Crochemore
• Two of the exact string matching algorithms we have seen, the brute
force algorithm and the Karp–Rabin algorithm, use only a constant
amount of extra space in addition to the text and the pattern. All the
other algorithms use some additional data structures whose size is
proportional to the length of the pattern or the alphabet size.
• On the other hand, the brute force algorithm and the Karp–Rabin
algorithm have a worst case running time of O(mn).
• Remarkably, there exist algorithms with O(n) worst case time that need only a constant amount of extra space. One of them is the Crochemore algorithm.
• We will only outline the main ideas of the algorithm here without
detailed proofs.
All the linear time, constant extra space algorithms are based on the
periodicity properties of the pattern.
Example 2.19: The periods of S1 = aabaaabaa are 4, 7, 8 and 9. The periods of S2 = abcabcabcabca are 3, 6, 9, 12 and 13. S2 is 3-periodic but S1 is not. (Recall that p is a period of S if S[i] = S[i + p] for all i ∈ [0..|S| − p), that per(S) denotes the smallest period of S, and that S is k-periodic if per(S) ≤ |S|/k.)
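The periods in Example 2.19 are easy to verify mechanically. A small brute-force check (the function name is mine):

def periods(S):
    """All periods of S: p is a period iff S[i] == S[i + p] for all valid i."""
    n = len(S)
    return [p for p in range(1, n + 1)
            if all(S[i] == S[i + p] for i in range(n - p))]

print(periods("aabaaabaa"))      # [4, 7, 8, 9]
print(periods("abcabcabcabca"))  # [3, 6, 9, 12, 13]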
The Crochemore algorithm resembles the Morris–Pratt algorithm at a high
level:
• When the pattern P is aligned against a text factor T [j..j + m), they compute the longest common prefix ℓ = lcp(P, T [j..j + m)) and report an occurrence if ℓ = m. Otherwise, they shift the pattern forward.
• MP shifts the pattern forward by ℓ − fail[ℓ] positions. Recall that fail[ℓ] in MP is the length of the longest proper border of P [0..ℓ). Thus the pattern shift by MP is per(P [0..ℓ)).
• In the next lcp computation, MP skips the first fail[ℓ] = ℓ − per(P [0..ℓ)) characters (cf. lcp-comparison).
• Thus knowing per(P [0..ℓ)) is sufficient to emulate MP's shift and skip.
Crochemore either does the same shift and skip as MP, or a shorter shift than MP after which the lcp comparison starts from scratch: if p = per(P [0..ℓ)) can be verified to satisfy p ≤ ℓ/3, the shift is p and the first ℓ − p matched characters are skipped exactly as in MP; otherwise per(P [0..ℓ)) > ℓ/3, the shift is ⌊ℓ/3⌋ + 1 ≤ per(P [0..ℓ)), and the next lcp computation starts from the beginning of the pattern (see Algorithm 2.28).
Definition 2.22: Let MS(S) denote the lexicographically maximal suffix of a string S. If S = MS(S), S is called self-maximal.
Lemma 2.24: Let S[0..m) be a self-maximal string and let p = per(S). For any c ∈ Σ:
• If c = S[m − p], then MS(Sc) = Sc and per(Sc) = p.
• If c < S[m − p], then MS(Sc) = Sc and per(Sc) = m + 1.
• If c > S[m − p], then MS(Sc) ≠ Sc. Furthermore, let r = m mod p and R = S[m − r..m). Then R is self-maximal and MS(Sc) = MS(Rc).
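Both the definition and the lemma are small enough to check by brute force. A throwaway sketch (the helper names are mine) illustrating the three cases of Lemma 2.24 on the self-maximal string S = ba, where m = p = 2 and S[m − p] = b:

def ms(S):
    """Lexicographically maximal suffix of S, by brute force."""
    return max(S[i:] for i in range(len(S)))

def per(S):
    """Smallest period of S, by brute force."""
    n = len(S)
    return next(p for p in range(1, n + 1)
                if all(S[i] == S[i + p] for i in range(n - p)))

print(ms("bab"), per("bab"))  # bab 2   c = b = S[m-p]: period is preserved
print(ms("baa"), per("baa"))  # baa 3   c = a < S[m-p]: period becomes m+1
print(ms("bac"))              # c       c = c > S[m-p]: Sc is not self-maximal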
Crochemore’s algorithm computes the maximal suffix and its period for P [0..ℓ) incrementally using Lemma 2.24. The following algorithm updates the maximal suffix information when the match is extended by one character.
Algorithm 2.25: Update-MS(P, ℓ, s, p)
Input: a string P and integers ℓ, s, p such that MS(P [0..ℓ)) = P [s..ℓ) and p = per(P [s..ℓ)).
Output: a triple (ℓ + 1, s′, p′) such that MS(P [0..ℓ + 1)) = P [s′..ℓ + 1) and p′ = per(P [s′..ℓ + 1)).
(1) if ℓ = 0 then return (1, 0, 1)
(2) i ← ℓ
(3) while i < ℓ + 1 do
      // P [s..i) is self-maximal and p = per(P [s..i))
(4)   if P [i] > P [i − p] then
(5)     i ← i − ((i − s) mod p)
(6)     s ← i
(7)     p ← 1
(8)   else if P [i] < P [i − p] then
(9)     p ← i − s + 1
(10)  i ← i + 1
(11) return (ℓ + 1, s, p)
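In runnable Python, Algorithm 2.25 becomes the following; this is my direct transcription, and it can be tested against the brute-force ms and per above.

def update_ms(P, l, s, p):
    """Given MS(P[0:l]) = P[s:l] and p = per(P[s:l]),
    return the corresponding triple (l + 1, s, p) for P[0:l + 1]."""
    if l == 0:
        return (1, 0, 1)
    i = l
    while i < l + 1:
        # Invariant: P[s:i] is self-maximal and p = per(P[s:i]).
        if P[i] > P[i - p]:
            i -= (i - s) % p          # jump back to the start of R
            s, p = i, 1
        elif P[i] < P[i - p]:
            p = i - s + 1
        i += 1
    return (l + 1, s, p)

# Compute MS("babc") incrementally, one character at a time:
l = s = p = 0
for _ in "babc":
    l, s, p = update_ms("babc", l, s, p)
print("babc"[s:l], p)                 # c 1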
As the final piece of the Crochemore algorithm, the following result shows how to use the maximal suffix information to obtain information about the periodicity of the full string.
Lemma 2.26: Let S[0..m) be a string, and let S[s..m) = MS(S) and p = per(MS(S)). If p ≤ m/3 and S[0..s) = S[p..p + s), then per(S) = p; otherwise per(S) > m/3.
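The formulation above matches the invariants asserted in Algorithm 2.28 below; it can also be sanity-checked by brute force (reusing the per sketch from earlier) over all short binary strings:

from itertools import product

# Check Lemma 2.26 on every binary string of length at most 12.
for m_ in range(1, 13):
    for S in map("".join, product("ab", repeat=m_)):
        s = max(range(m_), key=lambda i: S[i:])   # MS(S) = S[s:]
        p = per(S[s:])
        if 3 * p <= m_ and S[:s] == S[p:p + s]:
            assert per(S) == p
        else:
            assert 3 * per(S) > m_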
Algorithm 2.28: Crochemore
Input: strings T = T [0..n) (text) and P = P [0..m) (pattern).
Output: position of the first occurrence of P in T
(1) j ← ℓ ← p ← s ← 0
(2) while j + m ≤ n do
(3)   while j + ℓ < n and ℓ < m and T [j + ℓ] = P [ℓ] do
(4)     (ℓ, s, p) ← Update-MS(P, ℓ, s, p)
      // ℓ = lcp(P, T [j..j + m))
(5)   if ℓ = m then return j
      // MS(P [0..ℓ)) = P [s..ℓ) and p = per(P [s..ℓ))
(6)   if 0 < p ≤ ℓ/3 and P [0..s) = P [p..p + s) then
        // per(P [0..ℓ)) = p
(7)     j ← j + p
(8)     ℓ ← ℓ − p
(9)   else // per(P [0..ℓ)) > ℓ/3
(10)    j ← j + ⌊ℓ/3⌋ + 1
(11)    (ℓ, s, p) ← (0, 0, 0)
(12) return n
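A runnable sketch of the whole search, reusing update_ms above. This is my transcription: the test 0 < 3p ≤ ℓ restates p ≤ ℓ/3 in integer arithmetic, with the p > 0 guard (line (6)) making the empty-lcp case ℓ = p = 0 take the ⌊ℓ/3⌋ + 1 = 1 shift; a strictly constant-space implementation would replace the slice comparison by a character-by-character loop.

def crochemore(T, P):
    """Return the position of the first occurrence of P in T, or len(T)."""
    n, m = len(T), len(P)
    j = l = s = p = 0
    while j + m <= n:
        while j + l < n and l < m and T[j + l] == P[l]:
            l, s, p = update_ms(P, l, s, p)
        # Now l = lcp(P, T[j:j+m]), MS(P[0:l]) = P[s:l] and p = per(P[s:l]).
        if l == m:
            return j
        if 0 < 3 * p <= l and P[0:s] == P[p:p + s]:
            j += p                    # per(P[0:l]) = p: MP-style shift and skip
            l -= p
        else:
            j += l // 3 + 1           # per(P[0:l]) > l/3: start from scratch
            l = s = p = 0
    return n

print(crochemore("apassi", "assi"))   # 2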
In the general alphabet model, the Crochemore algorithm runs in O(n) worst-case time using only a constant amount of extra space.
There are also other linear time, constant extra space algorithms. All of them are based on string periodicity in some way.
Aho–Corasick
Given a text T and a set P = {P1, P2, . . . , Pk} of patterns, the multiple exact string matching problem asks for the occurrences of all the patterns in the text. The Aho–Corasick algorithm is an extension of the Morris–Pratt algorithm for multiple exact string matching.

(Figure: the AC automaton for the pattern set {he, she, his, hers}: the trie has nodes 0–9 with root 0, and the fallback node −1 has a Σ-transition to the root.)
Let Sv denote the string represented by a node v in the trie. The
components of the AC automaton are:
• root is the root and child() the child function of the trie.
• fail(v) = u, where u is the trie node such that Su is the longest proper suffix of Sv represented by any trie node.
• patterns(v) is the set of pattern indices i such that Pi is a suffix of Sv .
At each stage of the matching, the algorithm computes the node v such
that Sv is the longest suffix of T [0..j] represented by any node.
Algorithm 2.31: Aho–Corasick
Input: text T = T [0..n), pattern set P = {P1, P2, . . . , Pk}.
Output: all pairs (i, j) such that Pi occurs in T ending at j.
(1) (root, child(), fail(), patterns()) ← Construct-AC-Automaton(P)
(2) v ← root
(3) for j ← 0 to n − 1 do
(4)   while child(v, T [j]) = ⊥ do v ← fail(v)
(5)   v ← child(v, T [j])
(6)   for i ∈ patterns(v) do output (i, j)
Algorithm 2.34: Compute-AC-Fail
Input: AC trie: root, child() and patterns()
Output: AC failure function fail() and updated patterns()
(1) create a new node fallback
(2) for c ∈ Σ do child(fallback, c) ← root
(3) fail(root) ← fallback
(4) queue ← {root}
(5) while queue ≠ ∅ do
(6)   u ← popfront(queue)
(7)   for c ∈ Σ such that child(u, c) ≠ ⊥ do
(8)     v ← child(u, c)
(9)     w ← fail(u)
(10)    while child(w, c) = ⊥ do w ← fail(w)
(11)    fail(v) ← child(w, c)
(12)    patterns(v) ← patterns(v) ∪ patterns(fail(v))
(13)    pushback(queue, v)
(14) return (fail(), patterns())
The algorithm does a breadth-first traversal of the trie. This ensures that the correct values of fail() and patterns() have already been computed when they are needed.
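Below is a compact Python sketch of the whole automaton. It is mine, not the lecture's: dictionaries stand in for child(), the integer −1 plays the role of the fallback node, and the trie construction (whose pseudocode is not part of this extract) is reconstructed in the obvious way. Inserting the patterns in the order he, she, his, hers reproduces the node numbering of the figure above.

from collections import deque

def build_ac(patterns):
    """Build the AC automaton: child[v] maps characters to child nodes,
    fail[v] is the failure link, out[v] = patterns(v)."""
    child, out = [{}], [set()]
    for idx, P in enumerate(patterns):          # insert each pattern into the trie
        v = 0
        for c in P:
            if c not in child[v]:
                child[v][c] = len(child)
                child.append({})
                out.append(set())
            v = child[v][c]
        out[v].add(idx)
    fail = [0] * len(child)
    fail[0] = -1                                # -1 acts as the fallback node
    queue = deque([0])                          # breadth-first traversal
    while queue:
        u = queue.popleft()
        for c, v in child[u].items():
            w = fail[u]
            while w != -1 and c not in child[w]:
                w = fail[w]
            fail[v] = child[w][c] if w != -1 else 0
            out[v] |= out[fail[v]]              # patterns(v) ∪ patterns(fail(v))
            queue.append(v)
    return child, fail, out

def ac_search(T, patterns):
    """Yield all pairs (i, j) such that patterns[i] occurs in T ending at j."""
    child, fail, out = build_ac(patterns)
    v = 0
    for j, c in enumerate(T):
        while v != -1 and c not in child[v]:
            v = fail[v]
        v = child[v][c] if v != -1 else 0       # the fallback consumes c into the root
        for i in out[v]:
            yield (i, j)

print(sorted(ac_search("ushers", ["he", "she", "his", "hers"])))
# [(0, 3), (1, 3), (3, 5)]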
fail(v) is correctly computed on lines (8)–(11):
• Let fail*(v) = {v, fail(v), fail(fail(v)), . . . , root}. These nodes are exactly the trie nodes that represent suffixes of Sv.
• Let u = parent(v) and child(u, c) = v. Then Sv = Su c, and a string S is a suffix of Su iff Sc is a suffix of Sv. Thus for any node w:
  – If w ∈ fail*(v) \ {root}, then parent(w) ∈ fail*(u).
  – If w ∈ fail*(u) and child(w, c) ≠ ⊥, then child(w, c) ∈ fail*(v).
• Therefore, fail(v) = child(w, c), where w is the first node in fail*(u) other than u such that child(w, c) ≠ ⊥, or fail(v) = root if no such node w exists.
In the constant alphabet model, the AC automaton can be constructed in O(m) time, where m is the total length of the patterns, and the search then takes O(n + occ) time, where occ is the number of reported occurrences.
Summary: Exact String Matching
The properties of the algorithms vary with respect to worst case time complexity, average case time complexity, alphabet model, and even space complexity.
• There exist numerous algorithms for exact string matching (see study group) but most of them use variations or combinations of the techniques we have seen.
• Many of the techniques can be adapted to other problems. All of the techniques have some uses in practice.
Selected Literature
• Survey
Faro & Lecroq: The Exact Online String Matching Problem: A
Review of the Most Recent Results. ACM Computing Surveys,
45(2), Article 13, 2013.
• Knuth–Morris–Pratt
Knuth, Morris & Pratt: Fast pattern matching in strings. SIAM
Journal on Computing, 6(2), 1977, 323–350.
• Shift-Or / Shift-And
Baeza-Yates & Gonnet: A new approach to text searching.
Communications of the ACM, 35(10), 1992, 74–82.
• Karp–Rabin
Karp & Rabin: Efficient randomized pattern-matching
algorithms. IBM Journal of Research and Development, 31(2),
1987, 249–260.
• Boyer–Moore / Horspool
Boyer & Moore: A fast string searching algorithm.
Communications of the ACM, 20(10), 1977, 762–772.
Horspool: Practical fast searching in strings. Software: Practice
and Experience, 10(6), 1980, 501–506.
Hume & Sunday: Fast string searching. Software: Practice and
Experience, 21(11), 1991, 1221–1248.
• BDM / BOM / BNDM
Crochemore, Czumaj, Gąsieniec, Jarominek, Lecroq, Plandowski
& Rytter: Speeding Up Two String-Matching Algorithms.
Algorithmica, 12(4/5), 1994, 247–267.
Allauzen, Crochemore & Raffinot: Factor oracle: A new
structure for pattern matching. Proc. 26th Conference on
Current Trends in Theory and Practice of Informatics
(SOFSEM’99). Lecture Notes in Computer Science, vol. 1725,
Springer, 291–306.
Navarro & Raffinot: Fast and flexible string matching by
combining bit-parallelism and suffix automata. ACM Journal on
Experimental Algorithmics, 5, Article 4, 2000.
• Crochemore
Crochemore: String-matching on ordered alphabets. Theoretical
Computer Science, 92(1), 1992, 33–47.
• Aho–Corasick
Aho & Corasick: Efficient string matching: An aid to
bibliographic search. Communications of the ACM, 18(6), 1975,
333–340.