0% found this document useful (0 votes)
43 views17 pages

Strings and Pattern Matching

The document discusses string matching algorithms including naive string matching, Rabin-Karp algorithm, and finite automata string matching. It also covers string searching goals of finding a pattern in text efficiently. The Rabin-Karp algorithm calculates hash values for patterns and text to quickly eliminate non-matches before doing character-by-character comparisons.

Uploaded by

Moatter Butt
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
43 views17 pages

Strings and Pattern Matching

The document discusses string matching algorithms including naive string matching, Rabin-Karp algorithm, and finite automata string matching. It also covers string searching goals of finding a pattern in text efficiently. The Rabin-Karp algorithm calculates hash values for patterns and text to quickly eliminate non-matches before doing character-by-character comparisons.

Uploaded by

Moatter Butt
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 17

Strings and Pattern Matching

• Naive String Matching,


• Rabin-Karp algorithm,
• String matching with Finite Automata
• Knuth-Morris-Pratt algorithm

1
String Searching
• The object of string searching is to find the location
of a specific text pattern within a larger body of text
(e.g., a sentence, a paragraph, a book, etc.).
• As with most algorithms, the main considerations
for string searching are speed and efficiency.
• There are a number of string searching algorithms in
existence today, but the two we shall review are
Naive String Matching and Rabin-Karp algorithm.
2
Rules to be remember
• Text T= [1….n] of length n and pattern P = [1….m] where m<= n
• We assume that P and T are the characters drawn from a finite
alphabet ∑ *
– ∑ * = [0,1] or ∑ * =[a,b,----z]
• Character array P and T are also called string of characters
• Pattern P occurs with Shift s in text T (Posses at S+1 in text T or
Equivalently). If 0 ≤ s ≤ n-m and T[s+1 ------ s+m] = P[1-----m]
– T[s+j] = P[j] where 1 ≤ j ≤ m
• If P occurs with shift s in T then we call s a valid shift. Other wise
we call s an invalid shift.
• Find all shifts s in the range 0 ≤ s ≤ n-m such that p is suffix of
3
T s+m
Rules to be remember
Problem
• In string matching problem is of finding all valid shifts with which a
given pattern P in a given text T.

a b c a b a a b c a b a c
Text
S0 S1 S2 S3

Pattern a b a a

Valid shift s comes ?

4
Rules to be remember
Notations
∑ denotes the set of all finite length strings here we consider only
finite strings
ε denote to Zero length “empty string” it also belong to ∑*
|x| denotes the length of string x
|x| + |y| the concatenation of two strings x and y denoted xy, has length
|x| + |y| and consists of characters from X followed by the
characters from y.
w x string w is a prefix of a string x. if x= wy for some string y Є ∑*.
if w x then |w| ≤ |x|
Example: ab abcca
w x string w is a suffix of a string x. if x= yw for some string y Є ∑*.
It follows from w x then |w| ≤ |x|
Example: cca abcca

5
Rules to be remember

x x x
z z z
y y y

If |x| ≤ |y| then x is suffix of y

If |x| = |y| then x is equal of y


If |x| ≥ |y| then y is suffix of x
6
Brute Force
• The Brute Force algorithm compares the pattern to the text, one
character at a time, until unmatching characters are found

Compared characters are italicized.


Correct matches are in boldface type.
• The algorithm can be designed to stop on either the first
occurrence of the pattern, or upon reaching the end of the text. 7
Brute Force Pseudo-Code
• NAIVE String Matcher (T,P)
1 n ← length[T]
2 m ← length[P]
3 For s ← 0 to n-m
4 do if p[1…..m] = T[s+1……… s+m]
5 then print “Pattern occurs with shift” s

a c a a b c
Text

Pattern a a b

8
Brute Force-Complexity
• Given a pattern M characters in length, and a text N characters in
length...
• Worst case: compares pattern to each substring of text of length M.
For example, M=5.
• This kind of case can occur for image data.

Total number of comparisons: M (N-M+1) 9


Worst case time complexity: O(MN)
Brute Force-Complexity(cont.)
• Given a pattern M characters in length, and a text N characters in
length...
• Best case if pattern found: Finds pattern in first M positions of text.
For example, M=5.

Total number of comparisons: M


Best case time complexity: O(M) 10
Brute Force-Complexity(cont.)
• Given a pattern M characters in length, and a text N characters in length...
• Best case if pattern not found: Always mismatch on first character. For
example, M=5.

Total number of comparisons: N 11


Best case time complexity: O(N)
Rabin-Karp
• The Rabin-Karp string searching algorithm calculates a hash value
for the pattern, and for each M-character subsequence of text to be
compared.
• If the hash values are unequal, the algorithm will calculate the hash
value for next M-character sequence.
• If the hash values are equal, the algorithm will do a Brute Force
comparison between the pattern and the M-character sequence.
• In this way, there is only one comparison per text subsequence,
and Brute Force is only needed when hash values match.
• Perhaps an example will clarify some things...
12
Rabin-Karp Example
• Hash value of “AAAAA” is 37
• Hash value of “AAAAH” is 100

13
Rabin-Karp Algorithm
pattern is M characters long
hash_p=hash value of pattern
hash_t=hash value of first M letters in body of text
do
if (hash_p == hash_t)
brute force comparison of pattern
and selected section of text
hash_t= hash value of next section of text, one character
over
while (end of text or brute force comparison == true)
14
Rabin-Karp

• Common Rabin-Karp questions:


“What is the hash function used to calculate values for
character sequences?”
“Isn’t it time consuming to hash very one of the M-character
sequences in the text body?”
“Is this going to be on the final?”

• To answer some of these questions, we’ll have to get mathematical.

15
Rabin-Karp Math Example

• Let’s say that our alphabet consists of 10 letters.


• our alphabet = a, b, c, d, e, f, g, h, i, j
• Let’s say that “a” corresponds to 1, “b” corresponds to 2 and so
on.
The hash value for string “cah” would be ...

3*100 + 1*10 + 8*1 = 318

16
Rabin-Karp Complexity
• If a sufficiently large prime number is used for the hash function,
the hashed values of two different patterns will usually be distinct.
• If this is the case, searching takes O(N) time, where N is the
number of characters in the larger body of text.
• It is always possible to construct a scenario with a worst case
complexity of O(MN). This, however, is likely to happen only if
the prime number used for hashing is small.

17

You might also like