0% found this document useful (0 votes)
4 views18 pages

17 StringMatching

The document discusses the string-matching problem, defining the text and pattern as arrays of characters and outlining valid and invalid shifts. It presents various algorithms for string matching, including the naive string-matching algorithm and the Rabin-Karp algorithm, detailing their time complexities and methodologies. Additionally, it introduces the Knuth-Morris-Pratt algorithm, emphasizing the importance of utilizing previously known information to improve efficiency in matching.

Uploaded by

Amrutha V
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views18 pages

17 StringMatching

The document discusses the string-matching problem, defining the text and pattern as arrays of characters and outlining valid and invalid shifts. It presents various algorithms for string matching, including the naive string-matching algorithm and the Rabin-Karp algorithm, detailing their time complexities and methodologies. Additionally, it introduces the Knuth-Morris-Pratt algorithm, emphasizing the importance of utilizing previously known information to improve efficiency in matching.

Uploaded by

Amrutha V
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 18

String Matching

• We formalize the string-matching problem as


follows. We assume that the text is an array
T[1.. n] of length n and that the pattern is an
array P[ 1..m] of length m≤ n. We further
assume that the elements of P and T are
characters drawn from a finite alphabet .
For example, we may have = {0,1} or =
,a, b, …, z-. The character arrays P and T are
often called strings of characters.
• we say that pattern P occurs with shift s in text T (or,
equivalently, that pattern P occurs beginning at
position s + 1 in text T ) if 0 ≤ s ≤ n-m and T*s+1… s
+m+ = P*1…m+ (that is, if T[s+j] = P[j], for 1 ≤ j ≤ m).
If P occurs with shift s in T , then we call s a valid
shift; otherwise, we call s an invalid shift. The string-
matching problem is the problem of finding all valid
shifts with which a given pattern P occurs in a given
text T .
Notation and terminology
• We denote by * (read “sigma-star”) the set of
all finite-length strings formed using characters
from the alphabet . The zero-length empty
string, denoted 𝜀, also belongs to * . The
concatenation of two strings x and y, denoted xy.
• We say that a string w is a prefix of a string x,
denoted w x, if x = wy for some string y ∈ * .
Note that if w x, then |w| ≤ |x|. Similarly, we
say that a string w is a suffix of a string x,
denoted w x, if x = yw for some y ∈ * .
The naive string-matching algorithm

Procedure NAIVE-STRING-MATCHER takes time O((n - m + 1)m)


The Rabin-Karp algorithm
• This algorithm makes use of elementary number-theoretic notions such as
the equivalence of two numbers modulo a third number

• For expository purposes, let us assume that = {0, 1, 2,…, 9}, so that each
character is a decimal digit. (In the general case, we can assume that each
character is a digit in radix-d notation, where d =| |).
• Given a pattern P[1…m+, let p denote its corresponding decimal value. In a
similar manner, given a text T[1…n+, let ts denotes the decimal value of the
length-m substring T[s+1…s+m], for s = 0,1,…,n -m. Certainly, ts = p if and
only if T[s+1…s+m] = P[1…m+; thus, s is a valid shift if and only if ts = p. If
we could compute p in time (m) and all the ts values in a total of ( n-
m+1) time, then we could determine all valid shifts s in time (m) + ( n-
m+1) = (n) by comparing p with each of the ts values.
• T=56489005050
• P=5648
• We can compute p in time (m) using Horner’s rule
• p = P[m] + 10(P[m-1] + 10(P[m-2++ …+10(P*2+
+10(P*1+)…))
• Similarly, we can compute t0 from T*1…m+ in time (m).
• To compute the remaining values t1, t2, …, tn-m in time
(n-m), we observe that we can compute ts+1 from ts in
constant time, since
• ts+1 = 10(ts - 10m-1T[s+1]) + T[s+ m +1]
• Subtracting 10m-1T[s+1]removes the high-order digit
from ts , multiplying the result by 10 shifts the number
left by one digit position, and adding T[s+ m +1] brings
in the appropriate low-order digit.
• For example, if m = 5 and ts = 31415, then we
wish to remove the high-order digit T[s+1] = 3
and bring in the new low-order digit (suppose it is
T[s + 5 + 1] = 2) to obtain
• ts+1 = 10(31415 – 10000.3) + 2
• = 14152
• p and ts may be too large to work with
conveniently. If P contains m characters, then we
cannot reasonably assume that each arithmetic
operation on p (which is m digits long) takes
constant time.
• ts+1 = (d(ts – T[s + 1]h) + T[s + m + 1]) mod q
• where h =dm-1 (mod q)
• In general, q is a prime number such that dq
fits within a computer word where d=| |.
Run-Time-Analysis
• RABIN-KARP-MATCHER takes (m) preprocessing
time, and its matching time is (n-m+1)m) in the
worst case, since (like the naive string-matching
algorithm) the Rabin-Karp algorithm explicitly
verifies every valid shift.
• In many applications, we expect few valid shifts—
perhaps some constant c of them. In such
applications, the expected matching time of the
algorithm is only
• O((n-m+1)+ cm) = O(n + m)=O(n)
The Knuth-Morris-Pratt (KMP)
Algorithm
• In the Brute-Force algorithm, if a mismatch
occurs at P[j]( j>1), it only slides P to right by 1
step. It throws away one piece of information
that we’ve already known. What is that piece
of information?
• Let s be the current shift value. Since it is a
mismatch at P[j], we know T[s+1…s+j-
1]=p[1…j-1].

You might also like