0% found this document useful (0 votes)
28 views

Robin Karp Algorithm For String Matching

Uploaded by

Patel Vedant
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
28 views

Robin Karp Algorithm For String Matching

Uploaded by

Patel Vedant
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 13

String Matching Using the Rabin-

Karp Algorithm
String Matching Problem

• We assume that the text is an array T [1..N] of


length n and that the pattern is an array P [1..M]
of length m, where m < n. We also assume that
the elements of P and T are characters in the
finite alphabet 

(e.g., {a,b} We want to find P


= ‘aab’ in T = ‘abbaabaaaab’)
String Matching Problem (Continued)
• The idea of the string matching problem is that we
want to find all occurrences of the pattern P in the
given text T.
• We could use the brute force method for string
matching, which utilizes iteration over T. At each
letter, we compare the sequence against P until all
letters match of until the end of the alphabet is
reached.
• The worst case scenario can reach O(N*M)
Definition of Rabin-Karp
• A string search algorithm which compares a
string's hash values, rather than the strings
themselves. For efficiency, the hash value of
the next position in the text is easily
computed from the hash value of the current
position.
How Rabin-Karp works
• Let characters in both arrays T and P be digits in
radix- notation. (
• ssume d= |
• Let p be the decimal value of the characters in P
• Choose a prime number q such that fits within a
computer word to speed computations.
• Compute (p mod q)
– The value of p mod q is what we will be using to find all
matches of the pattern P in T.
Preprocessing
Let ts= T[s+1,s+2,s+3…..s+m] for all s
ts=p iff T[s+1,s+2,s+3…..s+m]=P[1…m]
Compute p
How Rabin-Karp works (continued)
• Compute (T[s+1, .., s+m] mod q) for all s till n-
m
• Test against P only those sequences in T
having the same (mod q) value
A Rabin-Karp example
• Given T = 31415926535 and P = 26
• We choose q = 11
• P mod q = 26 mod 11 = 4

3 1 4 1 5 9 2 6 5 3 5
31 mod 11 = 9 not equal to 4
3 1 4 1 5 9 2 6 5 3 5
14 mod 11 = 3 not equal to 4
3 1 4 1 5 9 2 6 5 3 5
41 mod 11 = 8 not equal to 4
Rabin-Karp example continued
3 1 4 1 5 9 2 6 5 3 5
15 mod 11 = 4 equal to 4 -> spurious hit
3 1 4 1 5 9 2 6 5 3 5
59 mod 11 = 4 equal to 4 -> spurious hit
3 1 4 1 5 9 2 6 5 3 5
92 mod 11 = 4 equal to 4 -> spurious hit
3 1 4 1 5 9 2 6 5 3 5
26 mod 11 = 4 equal to 4 -> an exact match!!
3 1 4 1 5 9 2 6 5 3 5
65 mod 11 = 10 not equal to 4
Rabin-Karp example continued
3 1 4 1 5 9 2 6 5 3 5
53 mod 11 = 9 not equal to 4
3 1 4 1 5 9 2 6 5 3 5
35 mod 11 = 2 not equal to 4
As we can see, when a match is found, further testing is
done to insure that a match has indeed been found.
Complexity
• The running time of the Rabin-Karp algorithm in the
worst-case scenario is O(n-m+1)m but it has a good
average-case running time.
• If the expected number of valid shifts is small O(1)
and the prime q is chosen to be quite large, then the
Rabin-Karp algorithm can be expected to run in time
O(n+m) plus the time to required to process spurious
hits.
Applications
• Bioinformatics
– Used in looking for similarities of two or more
proteins; i.e. high sequence similarity usually implies
significant structural or functional similarity.

Example:
Hb A_human
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
G+ +VK+HGKKV A++++++AH+ D++ ++ +++LS+LH KL
Hb B_human
GNPKVKAHGKKVLGAFSDGLAH LDNLKGTF ATLSELH CDKL
+ similar amino acids
Applications continued
• Alpha hemoglobin and beta hemoglobin are subunits that
make up a protein called hemoglobin in red blood cells.
Notice the similarities between the two sequences, which
probably signify functional similarity.
• Many distantly related proteins have domains that are similar
to each other, such as the DNA binding domain or cation
binding domain. To find regions of high similarity within
multiple sequences of proteins, local alignment must be
performed. The local alignment of sequences may provide
information of similar functional domains present among
distantly related proteins.

You might also like