String Matching
• We formalize the string-matching problem as
follows. We assume that the text is an array
T[1.. n] of length n and that the pattern is an
array P[ 1..m] of length m≤ n. We further
assume that the elements of P and T are
characters drawn from a finite alphabet .
For example, we may have = {0,1} or =
,a, b, …, z-. The character arrays P and T are
often called strings of characters.
• we say that pattern P occurs with shift s in text T (or,
equivalently, that pattern P occurs beginning at
position s + 1 in text T ) if 0 ≤ s ≤ n-m and T*s+1… s
+m+ = P*1…m+ (that is, if T[s+j] = P[j], for 1 ≤ j ≤ m).
If P occurs with shift s in T , then we call s a valid
shift; otherwise, we call s an invalid shift. The string-
matching problem is the problem of finding all valid
shifts with which a given pattern P occurs in a given
text T .
Notation and terminology
• We denote by * (read “sigma-star”) the set of
all finite-length strings formed using characters
from the alphabet . The zero-length empty
string, denoted 𝜀, also belongs to * . The
concatenation of two strings x and y, denoted xy.
• We say that a string w is a prefix of a string x,
denoted w x, if x = wy for some string y ∈ * .
Note that if w x, then |w| ≤ |x|. Similarly, we
say that a string w is a suffix of a string x,
denoted w x, if x = yw for some y ∈ * .
The naive string-matching algorithm
Procedure NAIVE-STRING-MATCHER takes time O((n - m + 1)m)
The Rabin-Karp algorithm
• This algorithm makes use of elementary number-theoretic notions such as
the equivalence of two numbers modulo a third number
• For expository purposes, let us assume that = {0, 1, 2,…, 9}, so that each
character is a decimal digit. (In the general case, we can assume that each
character is a digit in radix-d notation, where d =| |).
• Given a pattern P[1…m+, let p denote its corresponding decimal value. In a
similar manner, given a text T[1…n+, let ts denotes the decimal value of the
length-m substring T[s+1…s+m], for s = 0,1,…,n -m. Certainly, ts = p if and
only if T[s+1…s+m] = P[1…m+; thus, s is a valid shift if and only if ts = p. If
we could compute p in time (m) and all the ts values in a total of ( n-
m+1) time, then we could determine all valid shifts s in time (m) + ( n-
m+1) = (n) by comparing p with each of the ts values.
• T=56489005050
• P=5648
• We can compute p in time (m) using Horner’s rule
• p = P[m] + 10(P[m-1] + 10(P[m-2++ …+10(P*2+
+10(P*1+)…))
• Similarly, we can compute t0 from T*1…m+ in time (m).
• To compute the remaining values t1, t2, …, tn-m in time
(n-m), we observe that we can compute ts+1 from ts in
constant time, since
• ts+1 = 10(ts - 10m-1T[s+1]) + T[s+ m +1]
• Subtracting 10m-1T[s+1]removes the high-order digit
from ts , multiplying the result by 10 shifts the number
left by one digit position, and adding T[s+ m +1] brings
in the appropriate low-order digit.
• For example, if m = 5 and ts = 31415, then we
wish to remove the high-order digit T[s+1] = 3
and bring in the new low-order digit (suppose it is
T[s + 5 + 1] = 2) to obtain
• ts+1 = 10(31415 – 10000.3) + 2
• = 14152
• p and ts may be too large to work with
conveniently. If P contains m characters, then we
cannot reasonably assume that each arithmetic
operation on p (which is m digits long) takes
constant time.
• ts+1 = (d(ts – T[s + 1]h) + T[s + m + 1]) mod q
• where h =dm-1 (mod q)
• In general, q is a prime number such that dq
fits within a computer word where d=| |.
Run-Time-Analysis
• RABIN-KARP-MATCHER takes (m) preprocessing
time, and its matching time is (n-m+1)m) in the
worst case, since (like the naive string-matching
algorithm) the Rabin-Karp algorithm explicitly
verifies every valid shift.
• In many applications, we expect few valid shifts—
perhaps some constant c of them. In such
applications, the expected matching time of the
algorithm is only
• O((n-m+1)+ cm) = O(n + m)=O(n)
The Knuth-Morris-Pratt (KMP)
Algorithm
• In the Brute-Force algorithm, if a mismatch
occurs at P[j]( j>1), it only slides P to right by 1
step. It throws away one piece of information
that we’ve already known. What is that piece
of information?
• Let s be the current shift value. Since it is a
mismatch at P[j], we know T[s+1…s+j-
1]=p[1…j-1].