Strings and Pattern Matching
Strings and Pattern Matching
1
String Searching
• The object of string searching is to find the location
of a specific text pattern within a larger body of text
(e.g., a sentence, a paragraph, a book, etc.).
• As with most algorithms, the main considerations
for string searching are speed and efficiency.
• There are a number of string searching algorithms in
existence today, but the two we shall review are
Naive String Matching and Rabin-Karp algorithm.
2
Rules to be remember
• Text T= [1….n] of length n and pattern P = [1….m] where m<= n
• We assume that P and T are the characters drawn from a finite
alphabet ∑ *
– ∑ * = [0,1] or ∑ * =[a,b,----z]
• Character array P and T are also called string of characters
• Pattern P occurs with Shift s in text T (Posses at S+1 in text T or
Equivalently). If 0 ≤ s ≤ n-m and T[s+1 ------ s+m] = P[1-----m]
– T[s+j] = P[j] where 1 ≤ j ≤ m
• If P occurs with shift s in T then we call s a valid shift. Other wise
we call s an invalid shift.
• Find all shifts s in the range 0 ≤ s ≤ n-m such that p is suffix of
3
T s+m
Rules to be remember
Problem
• In string matching problem is of finding all valid shifts with which a
given pattern P in a given text T.
a b c a b a a b c a b a c
Text
S0 S1 S2 S3
Pattern a b a a
4
Rules to be remember
Notations
∑ denotes the set of all finite length strings here we consider only
finite strings
ε denote to Zero length “empty string” it also belong to ∑*
|x| denotes the length of string x
|x| + |y| the concatenation of two strings x and y denoted xy, has length
|x| + |y| and consists of characters from X followed by the
characters from y.
w x string w is a prefix of a string x. if x= wy for some string y Є ∑*.
if w x then |w| ≤ |x|
Example: ab abcca
w x string w is a suffix of a string x. if x= yw for some string y Є ∑*.
It follows from w x then |w| ≤ |x|
Example: cca abcca
5
Rules to be remember
x x x
z z z
y y y
a c a a b c
Text
Pattern a a b
8
Brute Force-Complexity
• Given a pattern M characters in length, and a text N characters in
length...
• Worst case: compares pattern to each substring of text of length M.
For example, M=5.
• This kind of case can occur for image data.
13
Rabin-Karp Algorithm
pattern is M characters long
hash_p=hash value of pattern
hash_t=hash value of first M letters in body of text
do
if (hash_p == hash_t)
brute force comparison of pattern
and selected section of text
hash_t= hash value of next section of text, one character
over
while (end of text or brute force comparison == true)
14
Rabin-Karp
15
Rabin-Karp Math Example
16
Rabin-Karp Complexity
• If a sufficiently large prime number is used for the hash function,
the hashed values of two different patterns will usually be distinct.
• If this is the case, searching takes O(N) time, where N is the
number of characters in the larger body of text.
• It is always possible to construct a scenario with a worst case
complexity of O(MN). This, however, is likely to happen only if
the prime number used for hashing is small.
17