Lecture 05


BNDM

Starting the matching from the end enables long shifts.

• The Horspool algorithm bases the shift on a single character.

• The Boyer–Moore algorithm uses the matching suffix and the mismatching character.

• Factor based algorithms continue matching until no pattern factor matches. This may require more comparisons but it enables longer shifts.

Example 2.14: Pattern ainaisen-ainainen aligned against the text
varmasti-aikaisen-ainainen. The "/" marks how far the backward scan of
the window went; the last line shows the alignment after the shift.

Horspool shift:
varmasti-aikai/sen-ainainen
ainaisen-ainainen
  ainaisen-ainainen            ⇒ shift = 2

Boyer–Moore shift:
varmasti-aikai/sen-ainainen
ainaisen-ainainen
         ainaisen-ainainen     ⇒ shift = 9

Factor shift:
varmasti-ai/kaisen-ainainen
ainaisen-ainainen
                 ainaisen-ainainen   ⇒ shift = 17

92
Factor based algorithms use an automaton that accepts suffixes of the
reverse pattern P^R (or equivalently reverse prefixes of the pattern P).

• BDM (Backward DAWG Matching) uses a deterministic automaton that accepts exactly the suffixes of P^R. DAWG (Directed Acyclic Word Graph) is also known as suffix automaton.

• BNDM (Backward Nondeterministic DAWG Matching) simulates a nondeterministic automaton.

Example 2.15: P = assi. The nondeterministic automaton that accepts the
suffixes of P^R = issa has ε-transitions from the initial state to every
other state, and the chain

      a      s      s      i
  3 <--- 2 <--- 1 <--- 0 <--- -1

where state 3 = m − 1 is the accept state.

• BOM (Backward Oracle Matching) uses a much simpler deterministic automaton that accepts all suffixes of P^R but may also accept some other strings. This can cause shorter shifts but not incorrect behaviour.

93
Suppose we are currently comparing P against T[j..j + m). We use the
automaton to scan the text backwards from T[j + m − 1]. When the
automaton has scanned T[j + i..j + m):

• If the automaton is in an accept state, then T[j + i..j + m) is a prefix of P.
  ⇒ If i = 0, we found an occurrence.
  ⇒ Otherwise, mark the prefix match by setting shift = i. This is the length of the shift that would achieve a matching alignment.

• If the automaton can still reach an accept state, then T[j + i..j + m) is a factor of P.
  ⇒ Continue scanning.

• When the automaton can no longer reach an accept state:
  ⇒ Stop scanning and shift: j ← j + shift.

94
BNDM does a bit-parallel simulation of the nondeterministic automaton,
which is quite similar to Shift-And.

The state of the automaton is stored in a bitvector D. When the
automaton has scanned T[j + i..j + m):

• D.k = 1 if and only if there is a path from the initial state to state k with the string (T[j + i..j + m))^R, and thus T[j + i..j + m) = P[m − k − 1..2m − k − i − 1).
• If D.(m − 1) = 1, then T[j + i..j + m) is a prefix of the pattern.
• If D = 0, then the automaton can no longer reach an accept state.

Updating D uses precomputed bitvectors B[c], for all c ∈ Σ:

• B[c].i = 1 if and only if P[m − 1 − i] = P^R[i] = c.

The update when reading T[j + i] is familiar: D ← (D << 1) & B[T[j + i]]

• Note that there is no "| 1". This is because D.(−1) = 0 always after reading at least one character, so the shift brings the right bit to D.0.
• Before reading anything D.(−1) = 1. This exception is handled by starting the computation with the first shift already performed. Because of this, the shift is done at the end of the loop.

95
Algorithm 2.16: BNDM
Input: text T = T[0 . . . n), pattern P = P[0 . . . m)
Output: position of the first occurrence of P in T
Preprocess:
(1) for c ∈ Σ do B[c] ← 0
(2) for i ← 0 to m − 1 do B[P[m − 1 − i]] ← B[P[m − 1 − i]] + 2^i
Search:
(3) j ← 0
(4) while j + m ≤ n do
(5)   i ← m; shift ← m
(6)   D ← 2^m − 1    // D ← 1^m
(7)   while D ≠ 0 do
        // Now T[j + i..j + m) is a pattern factor
(8)     i ← i − 1
(9)     D ← D & B[T[j + i]]
(10)    if D & 2^(m−1) ≠ 0 then
          // Now T[j + i..j + m) is a pattern prefix
(11)      if i = 0 then return j
(12)      else shift ← i
(13)    D ← D << 1
(14)  j ← j + shift
(15) return n
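
To make the bit-parallel simulation concrete, here is a minimal Python
sketch of Algorithm 2.16 (not part of the slides; the function name bndm
and the dictionary used for B are our own choices). Unlike the
pseudocode, it masks D to m bits after every shift so that Python's
unbounded integers behave like an m-bit machine word.

def bndm(T, P):
    # Returns the position of the first occurrence of P in T, or len(T)
    # if there is none, following the convention of Algorithm 2.16.
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return n  # degenerate cases; the pseudocode assumes 1 <= m <= n
    mask = (1 << m) - 1
    # Preprocess: B[c] has bit i set iff P[m-1-i] == c (reverse pattern)
    B = {}
    for i in range(m):
        B[P[m - 1 - i]] = B.get(P[m - 1 - i], 0) | (1 << i)
    # Search
    j = 0
    while j + m <= n:
        i, shift = m, m
        D = mask                     # D = 1^m: every state is active
        while D != 0:
            i -= 1
            D &= B.get(T[j + i], 0)  # characters not in P kill all states
            if D & (1 << (m - 1)):   # T[j+i..j+m) is a prefix of P
                if i == 0:
                    return j         # occurrence found
                shift = i            # remember the prefix match
            D = (D << 1) & mask      # the shift is done at the end of the loop
        j += shift
    return n

For instance, bndm("apassi", "assi") returns 2, in line with
Example 2.17 below.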
96
Example 2.17: P = assi, T = apassi. Rows are labeled with the characters
of P^R = issa, i.e., the row for character c on line i shows bit i.

B[c], c ∈ {a,i,p,s}:

      a i p s
  i   0 1 0 0
  s   0 0 0 1
  s   0 0 0 1
  a   1 0 0 0

D when scanning apas backwards (first column: initial D; then D after
reading s, a, p):

        s a p
  i   1 0 0 0
  s   1 1 0 0
  s   1 1 0 0
  a   1 0 1 0   ⇒ shift = 2

D when scanning apassi backwards (first column: initial D; then D after
reading i, s, s, a):

        i s s a
  i   1 1 0 0 0
  s   1 0 1 0 0
  s   1 0 0 1 0
  a   1 0 0 0 1   ⇒ occurrence

97
In the integer alphabet model when m ≤ w:

• Preprocessing time is O(σ + m).
• In the worst case, the search time is O(mn). For example, P = a^(m−1) b and T = a^n.
• In the best case, the search time is O(n/m). For example, P = b^m and T = a^n.
• In the average case, the search time is O((n/m) + (n/m) log_σ m). This is optimal! It has been proven that any algorithm needs to inspect Ω((n/m) + (n/m) log_σ m) text characters on average.

When m > w, there are several options:

• Use multi-word bitvectors.
• Search for a pattern prefix of length w and check the rest when the prefix is found.
• Use BDM or BOM.

98
• The search time of BDM and BOM is O((n/m) + (n/m) log_σ m), which is optimal on average. (BNDM is optimal only when m ≤ w.)

• MP and KMP are optimal in the worst case but not in the average case.

• There are also algorithms that are optimal in both cases. They are
based on similar techniques, but we will not describe them here.

99
Crochemore

• Two of the exact string matching algorithms we have seen, the brute
force algorithm and the Karp–Rabin algorithm, use only a constant
amount of extra space in addition to the text and the pattern. All the
other algorithms use some additional data structures whose size is
proportional to the length of the pattern or the alphabet size.

• On the other hand, the brute force algorithm and the Karp–Rabin
algorithm have a worst case running time of O(mn).

• Remarkably, there exist algorithms with O(n) worst case time that need only a constant amount of extra space. One of them is the Crochemore algorithm.

• We will only outline the main ideas of the algorithm here without
detailed proofs.

100
All the linear time, constant extra space algorithms are based on the
periodicity properties of the pattern.

Definition 2.18: Let S[0..m) be a string. An integer p ∈ [1..m] is a
period of S, if S[i] = S[i + p] for all i ∈ [0..m − p). The smallest
period of S is denoted per(S). S is k-periodic if m ≥ k · per(S).

Example 2.19: The periods of S1 = aabaaabaa are 4, 7, 8 and 9. The
periods of S2 = abcabcabcabca are 3, 6, 9, 12 and 13. S2 is 3-periodic
but S1 is not.

There is a strong connection between periods and borders.

Lemma 2.20: p is a period of S[0..m) if and only if S has a proper
border of length m − p.

Proof. Both conditions hold if and only if S[0..m − p) = S[p..m). □

Corollary 2.21: The length of the longest proper border of S is m − per(S).
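
As a small illustration (not in the slides), Corollary 2.21 gives a
direct way to compute per(S): take the Morris–Pratt failure function
from earlier in the course and subtract the longest proper border from
m. The function name per below is our own; note that the border array
uses O(m) extra space, unlike the constant-space techniques developed
next.

def per(S):
    # Smallest period of S: per(S) = m - (longest proper border of S).
    m = len(S)
    fail = [0] * (m + 1)  # fail[i] = longest proper border of S[0..i)
    fail[0] = -1
    k = -1
    for i in range(m):
        while k >= 0 and S[k] != S[i]:
            k = fail[k]
        k += 1
        fail[i + 1] = k
    return m - fail[m]

For example, per("aabaaabaa") == 4 and per("abcabcabcabca") == 3,
matching Example 2.19, and S is k-periodic iff len(S) >= k * per(S).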

101
The Crochemore algorithm resembles the Morris–Pratt algorithm at a high
level:

• When the pattern P is aligned against a text factor T[j..j + m), they compute the longest common prefix ℓ = lcp(P, T[j..j + m)) and report an occurrence if ℓ = m. Otherwise, they shift the pattern forward.
• MP shifts the pattern forward by ℓ − fail[ℓ] positions. Recall that fail[ℓ] in MP is the length of the longest proper border of P[0..ℓ). Thus the pattern shift by MP is per(P[0..ℓ)).
• In the next lcp computation, MP skips the first fail[ℓ] = ℓ − per(P[0..ℓ)) characters (cf. lcp-comparison).
• Thus knowing per(P[0..ℓ)) is sufficient to emulate the MP shift and skip.

102
Crochemore either does the same shift and skip as MP, or a shorter shift
than MP and starts the lcp comparison from scratch:

• If P[0..ℓ) is 3-periodic, then compute per(P[0..ℓ)) and do the MP shift and skip.
• If P[0..ℓ) is not 3-periodic, then shift by ⌊ℓ/3⌋ + 1 ≤ per(P[0..ℓ)) and start the lcp comparison from scratch (no skip).
• Note that the latter case is suboptimal but always safe: no occurrence is missed.

To find out if P[0..ℓ) is 3-periodic and to compute per(P[0..ℓ)) if it
is, Crochemore uses another combinatorial concept.

103
Definition 2.22: Let MS(S) denote the lexicographically maximal suffix
of a string S. If S = MS(S), S is called self-maximal.

Example 2.23: MS(aabaaabaa) = baaabaa and
MS(abcabcabcabca) = cabcabcabca.

Period computation is easier for maximal suffixes and self-maximal
strings than for arbitrary strings.

Lemma 2.24: Let S[0..m) be a self-maximal string and let p = per(S). For
any c ∈ Σ,

  MS(Sc) = Sc and per(Sc) = p        if c = S[m − p]
  MS(Sc) = Sc and per(Sc) = m + 1    if c < S[m − p]
  MS(Sc) ≠ Sc                        if c > S[m − p]

Furthermore, let r = m mod p and R = S[m − r..m). Then R is self-maximal
and

  MS(Sc) = MS(Rc)                    if c > S[m − p]

104
Crochemore's algorithm computes the maximal suffix and its period for
P[0..ℓ) incrementally using Lemma 2.24. The following algorithm updates
the maximal suffix information when the match is extended by one
character.

Algorithm 2.25: Update-MS(P, ℓ, s, p)
Input: a string P and integers ℓ, s, p such that
  MS(P[0..ℓ)) = P[s..ℓ) and p = per(P[s..ℓ)).
Output: a triple (ℓ + 1, s′, p′) such that
  MS(P[0..ℓ + 1)) = P[s′..ℓ + 1) and p′ = per(P[s′..ℓ + 1)).
(1) if ℓ = 0 then return (1, 0, 1)
(2) i ← ℓ
(3) while i < ℓ + 1 do
      // P[s..i) is self-maximal and p = per(P[s..i))
(4)   if P[i] > P[i − p] then
(5)     i ← i − ((i − s) mod p)
(6)     s ← i
(7)     p ← 1
(8)   else if P[i] < P[i − p] then
(9)     p ← i − s + 1
(10)  i ← i + 1
(11) return (ℓ + 1, s, p)
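
The following is a direct Python transcription of Algorithm 2.25 (a
sketch; the function name update_ms is our own). It is reused by the
code sketches further below.

def update_ms(P, l, s, p):
    # Given MS(P[0..l)) = P[s..l) and p = per(P[s..l)), return the
    # corresponding triple (l+1, s', p') for P[0..l+1).
    if l == 0:
        return (1, 0, 1)
    i = l
    while i < l + 1:
        # invariant: P[s..i) is self-maximal and p = per(P[s..i))
        if P[i] > P[i - p]:
            # Lemma 2.24: the maximal suffix changes; restart the scan
            # from R = P[i - ((i - s) mod p)..i)
            i -= (i - s) % p
            s = i
            p = 1
        elif P[i] < P[i - p]:
            p = i - s + 1
        i += 1
    return (l + 1, s, p)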
105
As the final piece of the Crochemore algorithm, the following result
shows how to use the maximal suffix information to obtain information
about the periodicity of the full string.

Lemma 2.26: Let S[0..m) be a string and let S[s..m) = MS(S) and
p = per(MS(S)).

• S is 3-periodic if and only if p ≤ m/3 and S[0..s) = S[p..p + s).
• If S is 3-periodic, then per(S) = p.

Example 2.27:

• S1[0..9) = aabaaabaa: MS(S1) = S1[2..9) = baaabaa, s = 2, p = per(baaabaa) = 4 > 9/3, and thus S1 is not 3-periodic.
• S2[0..9) = aaabaabaa: MS(S2) = S2[3..9) = baabaa, s = 3, p = per(baabaa) = 3 ≤ 9/3, S2[0..3) = aaa ≠ baa = S2[3..6), and thus S2 is not 3-periodic.
• S3[0..13) = abcabcabcabca: MS(S3) = S3[2..13) = cabcabcabca, s = 2, p = per(cabcabcabca) = 3, S3[0..2) = ab = S3[3..5), and thus S3 is 3-periodic and per(S3) = p = 3.
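
Lemma 2.26 translates into a simple 3-periodicity test. The sketch below
(our own code, reusing update_ms from above) reproduces the conclusions
of Example 2.27.

def ms_info(S):
    # Compute s and p with MS(S) = S[s..m) and p = per(MS(S)).
    l = s = p = 0
    while l < len(S):
        l, s, p = update_ms(S, l, s, p)
    return s, p

def is_3_periodic(S):
    # Lemma 2.26: S is 3-periodic iff p <= m/3 and S[0..s) = S[p..p+s).
    s, p = ms_info(S)
    return 3 * p <= len(S) and S[:s] == S[p:p + s]

Here is_3_periodic("aabaaabaa") and is_3_periodic("aaabaabaa") return
False, while is_3_periodic("abcabcabcabca") returns True.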

106
Algorithm 2.28: Crochemore
Input: strings T[0..n) (text) and P[0..m) (pattern).
Output: position of the first occurrence of P in T
(1) j ← ℓ ← p ← s ← 0
(2) while j + m ≤ n do
(3)   while j + ℓ < n and ℓ < m and T[j + ℓ] = P[ℓ] do
(4)     (ℓ, s, p) ← Update-MS(P, ℓ, s, p)
      // ℓ = lcp(P, T[j..j + m))
(5)   if ℓ = m then return j
      // MS(P[0..ℓ)) = P[s..ℓ) and p = per(P[s..ℓ))
(6)   if 0 < p ≤ ℓ/3 and P[0..s) = P[p..p + s) then
        // per(P[0..ℓ)) = p
(7)     j ← j + p
(8)     ℓ ← ℓ − p
(9)   else // per(P[0..ℓ)) > ℓ/3
(10)    j ← j + ⌊ℓ/3⌋ + 1
(11)    (ℓ, s, p) ← (0, 0, 0)
(12) return n
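
Putting the pieces together, here is a Python sketch of Algorithm 2.28
(our own transcription, reusing update_ms from above). The slice
comparison P[:s] == P[p:p+s] mirrors line (6) of the pseudocode but is
not constant-space as written; a strict implementation would compare
character by character.

def crochemore(T, P):
    n, m = len(T), len(P)
    j = l = s = p = 0
    while j + m <= n:
        while j + l < n and l < m and T[j + l] == P[l]:
            l, s, p = update_ms(P, l, s, p)
        # now l = lcp(P, T[j..j+m))
        if l == m:
            return j
        # MS(P[0..l)) = P[s..l) and p = per(P[s..l))
        if 0 < 3 * p <= l and P[:s] == P[p:p + s]:
            # P[0..l) is 3-periodic with per(P[0..l)) = p: MP shift and skip
            j += p
            l -= p
        else:
            # per(P[0..l)) > l/3: safe shift, restart from scratch
            j += l // 3 + 1
            l = s = p = 0
    return n

For example, crochemore("apassi", "assi") returns 2. The guard 0 < p
(as on line (6) above) matters: after a reset, p = 0 and the safe shift
of at least one position must be taken.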

107
In the general alphabet model:

• The time complexity is O(n).
• The algorithm uses only a constant number of integer variables in addition to the strings P and T.

Crochemore is not competitive in practice. However, there are situations
where the pattern can be very long and the space complexity is more
important than speed.

There are also other linear time, constant extra space algorithms. All
of them are based on string periodicity in some way.

108
Aho–Corasick

Given a text T and a set P = {P1, P2, . . . , Pk} of patterns, the
multiple exact string matching problem asks for the occurrences of all
the patterns in the text. The Aho–Corasick algorithm is an extension of
the Morris–Pratt algorithm for multiple exact string matching.

Aho–Corasick uses the trie trie(P) as an automaton and augments it with
a failure function similar to the Morris–Pratt failure function.

Example 2.29: Aho–Corasick automaton for P = {he, she, his, hers}. The
trie consists of the nodes 0 (root) to 9 with the edges

  0 -h-> 1 -e-> 2 -r-> 8 -s-> 9      (he, hers)
  1 -i-> 6 -s-> 7                    (his)
  0 -s-> 3 -h-> 4 -e-> 5             (she)

together with a fallback node -1 with child(-1, c) = 0 for all c ∈ Σ and
fail(0) = -1. The nontrivial failure links are fail(4) = 1, fail(5) = 2,
fail(7) = 3 and fail(9) = 3; all other nodes fail to the root.
109
Let Sv denote the string represented by a node v in the trie. The
components of the AC automaton are:

• root is the root and child() is the child function of the trie.
• fail(v) = u such that Su is the longest proper suffix of Sv represented by any trie node u.
• patterns(v) is the set of pattern indices i such that Pi is a suffix of Sv.

Example 2.30: For the automaton in Example 2.29, patterns(2) = {1}
({he}), patterns(5) = {1, 2} ({he, she}), patterns(7) = {3} ({his}),
patterns(9) = {4} ({hers}), and patterns(v) = ∅ for all other nodes v.

110
At each stage of the matching, the algorithm computes the node v such
that Sv is the longest suffix of T[0..j] represented by any node.

Algorithm 2.31: Aho–Corasick
Input: text T, pattern set P = {P1, P2, . . . , Pk}.
Output: all pairs (i, j) such that Pi occurs in T ending at j.
(1) (root, child(), fail(), patterns()) ← Construct-AC-Automaton(P)
(2) v ← root
(3) for j ← 0 to n − 1 do
(4)   while child(v, T[j]) = ⊥ do v ← fail(v)
(5)   v ← child(v, T[j])
(6)   for i ∈ patterns(v) do output (i, j)

The construction of the automaton is done in two phases: the trie
construction and the failure links computation.

Algorithm 2.32: Construct-AC-Automaton
Input: pattern set P = {P1, P2, . . . , Pk}.
Output: AC automaton: root, child(), fail() and patterns().
(1) (root, child(), patterns()) ← Construct-AC-Trie(P)
(2) (fail(), patterns()) ← Compute-AC-Fail(root, child(), patterns())
(3) return (root, child(), fail(), patterns())
111
Algorithm 2.33: Construct-AC-Trie
Input: pattern set P = {P1, P2, . . . , Pk}.
Output: AC trie: root, child() and patterns().
(1) Create new node root
(2) for i ← 1 to k do
(3)   v ← root; j ← 0
(4)   while child(v, Pi[j]) ≠ ⊥ do
(5)     v ← child(v, Pi[j]); j ← j + 1
(6)   while j < |Pi| do
(7)     Create new node u
(8)     child(v, Pi[j]) ← u
(9)     v ← u; j ← j + 1
(10)  patterns(v) ← {i}
(11) return (root, child(), patterns())

Lines (3)–(10) perform the standard trie insertion (Algorithm 1.2).

• Line (10) marks v as a representative of Pi.
• The creation of a new node v initializes patterns(v) to ∅ (in addition to initializing child(v, c) to ⊥ for all c ∈ Σ).

112
Algorithm 2.34: Compute-AC-Fail
Input: AC trie: root, child() and patterns()
Output: AC failure function fail() and updated patterns()
(1) Create new node fallback
(2) for c ∈ Σ do child(fallback, c) ← root
(3) fail(root) ← fallback
(4) queue ← {root}
(5) while queue ≠ ∅ do
(6)   u ← popfront(queue)
(7)   for c ∈ Σ such that child(u, c) ≠ ⊥ do
(8)     v ← child(u, c)
(9)     w ← fail(u)
(10)    while child(w, c) = ⊥ do w ← fail(w)
(11)    fail(v) ← child(w, c)
(12)    patterns(v) ← patterns(v) ∪ patterns(fail(v))
(13)    pushback(queue, v)
(14) return (fail(), patterns())

The algorithm does a breadth-first traversal of the trie. This ensures
that correct values of fail() and patterns() are already computed when
needed.
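
The following compact Python sketch combines Algorithms 2.31–2.34 (our
own naming; pattern indices are 0-based here). Instead of a separate
fallback node, missing transitions from the root simply lead back to the
root, which has the same effect.

from collections import deque

def aho_corasick(T, patterns):
    # --- Construct-AC-Trie (Algorithm 2.33) ---
    children = [{}]               # children[v]: char -> child node
    out = [set()]                 # out[v] = patterns(v)
    for idx, P in enumerate(patterns):
        v = 0
        for c in P:
            if c not in children[v]:
                children.append({})
                out.append(set())
                children[v][c] = len(children) - 1
            v = children[v][c]
        out[v].add(idx)
    # --- Compute-AC-Fail (Algorithm 2.34): breadth-first traversal ---
    fail = [0] * len(children)
    queue = deque(children[0].values())  # the root's children fail to the root
    while queue:
        u = queue.popleft()
        for c, v in children[u].items():
            w = fail[u]
            while w != 0 and c not in children[w]:
                w = fail[w]              # follow failure links, as on line (10)
            fail[v] = children[w][c] if c in children[w] else 0
            out[v] |= out[fail[v]]       # line (12): inherit patterns(fail(v))
            queue.append(v)
    # --- Search (Algorithm 2.31) ---
    v, result = 0, []
    for j, c in enumerate(T):
        while v != 0 and c not in children[v]:
            v = fail[v]
        v = children[v].get(c, 0)
        result.extend((i, j) for i in out[v])
    return result

For example, aho_corasick("ushers", ["he", "she", "his", "hers"])
reports "she" and "he" ending at position 3 and "hers" at position 5.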

113
fail(v) is correctly computed on lines (8)–(11):

• Let fail*(v) = {v, fail(v), fail(fail(v)), . . . , root}. These nodes are exactly the trie nodes that represent suffixes of Sv.
• Let u = parent(v) and child(u, c) = v. Then Sv = Su c and a string S is a suffix of Su iff Sc is a suffix of Sv. Thus for any node w:
  – If w ∈ fail*(v) \ {root}, then parent(w) ∈ fail*(u).
  – If w ∈ fail*(u) and child(w, c) ≠ ⊥, then child(w, c) ∈ fail*(v).
• Therefore, fail(v) = child(w, c), where w is the first node in fail*(u) other than u such that child(w, c) ≠ ⊥, or fail(v) = root if no such node w exists.

patterns(v) is correctly computed on line (12):

  patterns(v) = {i | Pi is a suffix of Sv}
              = {i | Pi = Sw and w ∈ fail*(v)}
              = {i | Pi = Sv} ∪ patterns(fail(v))

114
In the constant alphabet model:

• The search time is O(n).
• The space complexity is O(m), where m = ||P||.
  – The implementation of patterns() requires care (exercise).
• The preprocessing time is O(m), where m = ||P||.
  – The only non-trivial issue is the while-loop on line (10) in Compute-AC-Fail.
  – Let root, v1, v2, . . . , vℓ be the nodes on the path from root to a node representing a pattern Pi. Let wj = fail(vj) for all j. Let depth(v) be the depth of a node v (depth(root) = 0).
  – When processing vj and computing wj = fail(vj), we have depth(wj) = depth(wj−1) + 1 before line (10) and depth(wj) ≤ depth(wj−1) + 1 − tj after line (10), where tj is the number of rounds in the while-loop.
  – Thus, the total number of rounds in the while-loop when processing the nodes v1, v2, . . . , vℓ is at most ℓ = |Pi|, and thus over the whole algorithm at most ||P||.

115
Summary: Exact String Matching

Exact string matching is a fundamental problem in stringology. We have
seen several different algorithms for solving the problem.

The properties of the algorithms vary with respect to worst case time
complexity, average case time complexity, alphabet model, and even space
complexity.

The algorithms use a wide range of completely different techniques:

• There exist numerous algorithms for exact string matching (see study group) but most of them use variations or combinations of the techniques we have seen.
• Many of the techniques can be adapted to other problems. All of the techniques have some uses in practice.

116
Selected Literature

• Survey
Faro & Lecroq: The Exact Online String Matching Problem: A
Review of the Most Recent Results. ACM Computing Surveys,
45(2), Article 13, 2013.

• Knuth–Morris–Pratt
Knuth, Morris & Pratt: Fast pattern matching in strings. SIAM
Journal on Computing, 6(2), 1977, 323–350.

• Shift-Or / Shift-And
Baeza-Yates & Gonnet: A new approach to text searching.
Communications of the ACM, 35(10), 1992, 74–82.

117
• Karp–Rabin
Karp & Rabin: Efficient randomized pattern-matching
algorithms. IBM Journal of Research and Development, 31(2),
1987, 249–260.

• Boyer–Moore / Horspool
Boyer & Moore: A fast string searching algorithm.
Communications of the ACM, 20(10), 1977, 762–772.
Horspool: Practical fast searching in strings. Software: Practice
and Experience, 10(6), 1980, 501–506.
Hume & Sunday: Fast string searching. Software: Practice and
Experience, 21(11), 1991, 1221–1248.

118
• BDM / BOM / BNDM
Crochemore, Czumaj, Gąsieniec, Jarominek, Lecroq, Plandowski
& Rytter: Speeding Up Two String-Matching Algorithms.
Algorithmica, 12(4/5), 1994, 247–267.
Allauzen, Crochemore & Raffinot: Factor oracle: A new
structure for pattern matching. Proc. 26th Conference on
Current Trends in Theory and Practice of Informatics
(SOFSEM’99). Lecture Notes in Computer Science, vol. 1725,
Springer, 291–306.
Navarro & Raffinot: Fast and flexible string matching by
combining bit-parallelism and suffix automata. ACM Journal on
Experimental Algorithmics, 5, Article 4, 2000.

• Average case complexity


Yao: The complexity of pattern matching for a random string.
SIAM Journal on Computing, 8(3), 1979, 368–387.

119
• Crochemore
Crochemore: String-matching on ordered alphabets. Theoretical
Computer Science, 92(1), 1992, 33–47.

• Aho–Corasick
Aho & Corasick: Efficient string matching: An aid to
bibliographic search. Communications of the ACM, 18(6), 1975,
333–340.

120
