Two-Way String Algorithm

The two-way string-matching algorithm is a string-searching algorithm, discovered by Maxime Crochemore and Dominique Perrin in 1991.[1] It takes a pattern of size m, called a “needle”, and preprocesses it in linear time O(m), producing information that can then be used to search for the needle in any “haystack” string in linear time O(n), with n being the haystack's length.

The two-way algorithm can be viewed as a combination of the forward-going Knuth–Morris–Pratt algorithm (KMP) and the backward-running Boyer–Moore string-search algorithm (BM). Like those two, the two-way algorithm preprocesses the pattern to find partially repeating periods and computes “shifts” based on them, indicating what offset to “jump” to in the haystack when a given character is encountered.

Unlike BM and KMP, it uses only O(log m) additional space to store information
about those partial repeats: the search pattern is split into two parts (its
critical factorization), represented only by the position of that split. Being a
number less than m, it can be represented in ⌈log₂ m⌉ bits. This is sometimes
treated as "close enough to O(1) in practice", as the needle's size is limited by
the size of addressable memory; the overhead is a number that can be stored in a
single register, and treating it as O(1) is like treating the size of a loop
counter as O(1) rather than log of the number of iterations. The actual matching
operation performs at most 2n − m comparisons.[2]
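The search described above translates fairly directly into code. The sketch below follows the published pseudocode: compute the critical factorization from two maximal-suffix scans, then match the right half forward and the left half backward, with the periodic case remembering already-matched characters. Function names and the `rev` flag are my own; this is an illustration, not a drop-in replacement for a library routine.

```python
def _maximal_suffix(needle, rev):
    """Return (i, p): the maximal suffix of `needle` (under normal order if
    rev=False, reversed character order if rev=True) starts at i + 1 and
    has period p."""
    m = len(needle)
    i, j, k, p = -1, 0, 1, 1
    while j + k < m:
        a, b = needle[j + k], needle[i + k]
        if (a < b) if not rev else (b < a):   # candidate suffix at j loses
            j += k
            k = 1
            p = j - i
        elif a == b:                          # still inside a repeat of length p
            if k == p:
                j += p
                k = 1
            else:
                k += 1
        else:                                 # new maximal suffix starts at j + 1
            i, j = j, j + 1
            k = p = 1
    return i, p


def _critical_factorization(needle):
    """Critical split position l (left part is needle[:l+1]) and the
    period p of the corresponding maximal suffix."""
    i1, p1 = _maximal_suffix(needle, rev=False)
    i2, p2 = _maximal_suffix(needle, rev=True)
    return (i1, p1) if i1 >= i2 else (i2, p2)


def two_way_find(haystack, needle):
    """Return the index of the first occurrence of needle in haystack, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    l, p = _critical_factorization(needle)
    if needle[: l + 1] == needle[p : p + l + 1]:
        # Periodic case: p is a period of the whole needle. After shifting by
        # p, `s` remembers a prefix that is already known to match.
        pos, s = 0, -1
        while pos + m <= n:
            j = max(l, s) + 1
            while j < m and needle[j] == haystack[pos + j]:
                j += 1
            if j < m:                     # mismatch in the right half
                pos += j - l
                s = -1
            else:                         # right half matched; scan left half backward
                j = l
                while j > s and needle[j] == haystack[pos + j]:
                    j -= 1
                if j <= s:
                    return pos
                pos += p
                s = m - p - 1
        return -1
    # Non-periodic case: a fixed shift of q can never skip an occurrence.
    q = max(l + 1, m - l - 1) + 1
    pos = 0
    while pos + m <= n:
        j = l + 1
        while j < m and needle[j] == haystack[pos + j]:
            j += 1
        if j < m:
            pos += j - l
        else:
            j = l
            while j >= 0 and needle[j] == haystack[pos + j]:
                j -= 1
            if j < 0:
                return pos
            pos += q
    return -1
```

For example, `two_way_find("abaabab", "aab")` returns 2, matching the needle's only occurrence; note that only `l` and `p` are carried from preprocessing into the search, which is the source of the O(log m) space bound.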

Breslauer later published two improved variants performing fewer comparisons, at the cost of storing additional data about the preprocessed needle:[3]

The first one performs at most n + ⌊(n − m)/2⌋ comparisons, ⌈(n − m)/2⌉ fewer than the original. It must however store ⌈log_φ m⌉ additional offsets in the needle, using O(log² m) space.
The second adapts it to store only a constant number of such offsets, denoted c, but must perform n + ⌊(½ + ε)(n − m)⌋ comparisons, with ε = ½(F_{c+2} − 1)⁻¹ = O(φ⁻ᶜ) going to zero exponentially quickly as c increases.
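The exponential decay of ε can be made concrete with a few lines of arithmetic. This assumes F_{c+2} denotes the (c+2)-nd Fibonacci number, which is consistent with the O(φ⁻ᶜ) bound (φ being the golden ratio):

```python
def fib(k):
    """k-th Fibonacci number, with fib(1) = fib(2) = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

# epsilon = (1/2) * (F_{c+2} - 1)^-1 for a few values of c
for c in range(1, 7):
    eps = 0.5 / (fib(c + 2) - 1)
    print(f"c = {c}: epsilon = {eps:.4f}")
```

Each additional stored offset shrinks ε by roughly a factor of φ ≈ 1.618, e.g. ε = 0.25 at c = 2 and ε = 0.125 at c = 3.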

The algorithm is considered fairly efficient in practice, being cache-friendly and using several operations that can be implemented in well-optimized subroutines. It is used by the C standard libraries glibc, newlib, and musl to implement the memmem and strstr family of substring functions.[4][5][6] As with most advanced string-search algorithms, the naïve implementation may be more efficient on small-enough instances;[7] this is especially so if the needle isn't searched in multiple haystacks, which would amortize the preprocessing cost.
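For comparison, the naïve scan referred to above is only a few lines, with no preprocessing at all; that lack of setup cost is much of why it can win on short needles and haystacks (function name is mine):

```python
def naive_find(haystack, needle):
    """Straightforward O(n*m) search: try every alignment in turn."""
    n, m = len(haystack), len(needle)
    for i in range(n - m + 1):
        if haystack[i : i + m] == needle:
            return i
    return -1
```

Libraries such as glibc and musl typically dispatch on needle length, falling back to a simple scan for very short needles and using two-way only beyond a threshold.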
