patternmatching

Download as pdf or txt
Download as pdf or txt
You are on page 1of 29

String matching algorithms

Deliverables

String Basics

Naïve String matching Algorithm

Boyer Moore Algorithm

Rabin-Karp Algorithm

Knuth-Morris- Pratt Algorithm


11/15/2021 10:08 AM Copyright @ gdeepak.Com® 2
String Basics

A string is a sequence of characters

Examples of strings: C++ program, HTML


document, DNA sequence, Digitized image
An alphabet S is the set of possible characters for a
family of strings
Example of alphabets: ASCII (used by C and C++),
Unicode (used by Java), {0, 1}, {A, C, G, T}
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 3
String Basics
Let P be a string of size m
A substring P[i .. j] of P is the subsequence of P consisting of
the characters with ranks between i and j
A prefix of P is a substring of the type P[0 .. i]

A suffix of P is a substring of the type P[i ..m - 1]


Given strings T (text) and P (pattern), pattern matching
problem consists of finding a substring of T equal to P
Applications: Text editors, Search engines, Biological
research

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 4


Brute Force String Matching
Naive-String-Matcher(T, P)
1. n ← length[T]
2. m ← length[P]
3. for s ← 0 to n - m
4. do if P[1 .. m]=T[s+1..s+m]
5. then print "Pattern occurs
with shift" s Match Q with A, not matching, so
Worst case O(m*n) shift the pattern by one and so on.
T = aaa … ah
P = aaah
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 5
Boyer-Moore Algorithm
It uses two heuristics

Looking-glass heuristic: Compare P


with T moving backwards
Character-jump heuristic: When a
mismatch occurs at T[i] = c
If P contains c, shift P to align the last
occurrence of c in P with T[i]
Else, shift P to align P[0] with T[i + 1]

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 6


Boyer Moore Algorithm
Character shift
M 0
• Boyer-Moore’s runs in time O(nm + s) H 1
T 2
• Example of worst case: T = aaa …a P = baaa I
R
3
4

• Worst case may occur in images and DNA sequences but


unlikely in English text so BM is better for English Text
a p a t t e r n m a t c h i n g a l g o r i t h m

1 3 5 11 10 9 8 7
r i t h m r i t h m r i t h m r i t h m

2 4 6
r i t h m r i t h m r i t h m

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 7


Boyer Moore Algorithm
Algorithm BoyerMooreMatch(T, P, S) else
L  lastOccurenceFunction(P, S ) { character-jump }
im-1 l  L[T[i]]
jm-1 i  i + m – min(j, 1 + l)
repeat jm-1
if T[i] = P[j] until i > n - 1
if j = 0 return -1 { no match }
return i { match at i }
else
ii-1
jj-1

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 8


Example
• Text: Does the text match the pattern?
• Patter: tern

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 9


Boyer Moore Algorithm
• Bad Character
– Upon mismatch, skips alignments until
• mismatch becomes a match or
• p moves past mismatched character
• Good Suffix Rule
– Substring matched by inner loop; skip until
• There are no mismatch between p and substring
• p moves past substring.

Ref: https://www.youtube.com/watch?v=4Xyhb72LCX4
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 10
Rabin-Karp Algorithm
• It speeds up testing of equality of the pattern to the
substrings in the text by using a hash function.
• A hash function is a function which converts every string
into a numeric value, called its hash value; e.g.
hash("hello")=5. It exploits the fact that if two strings are
equal, their hash values are also equal.
• There are two problems: Different strings can also result in
the same hash value and there is extra cost of calculating
the hash for each group of strings.
• Worst case is O(mn)

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 11


How Rabin-Karp works
• Let characters in both arrays T and P be digits in radix-S
notation. (S = (0,1,...,9)
• Let p be the value of the characters in P
• Choose a prime number q such that fits within a computer
word to speed computations.
• Compute (p mod q)
– The value of p mod q is what we will be using to find all matches
of the pattern P in T.
How Rabin-Karp works (continued)
• Compute (T[s+1, .., s+m] mod q) for s = 0 .. n-m
• Test against P only those sequences in T having the same
(mod q) value
• (T[s+1, .., s+m] mod q) can be incrementally computed
by subtracting the high-order digit, shifting, adding the
low-order bit, all in modulo q arithmetic.
A Rabin-Karp example
• Given T = 31415926535 and P = 26
• We choose q = 11
• P mod q = 26 mod 11 = 4

3 1 4 1 5 9 2 6 5 3 5

31 mod 11 = 9 not equal to 4

3 1 4 1 5 9 2 6 5 3 5

14 mod 11 = 3 not equal to 4

3 1 4 1 5 9 2 6 5 3 5

41 mod 11 = 8 not equal to 4


Rabin-Karp example continued
3 1 4 1 5 9 2 6 5 3 5

15 mod 11 = 4 equal to 4 -> spurious hit

3 1 4 1 5 9 2 6 5 3 5

59 mod 11 = 4 equal to 4 -> spurious hit


3 1 4 1 5 9 2 6 5 3 5

92 mod 11 = 4 equal to 4 -> spurious hit

3 1 4 1 5 9 2 6 5 3 5

26 mod 11 = 4 equal to 4 -> an exact match!!


3 1 4 1 5 9 2 6 5 3 5

65 mod 11 = 10 not equal to 4


Rabin-Karp example continued
3 1 4 1 5 9 2 6 5 3 5

53 mod 11 = 9 not equal to 4


3 1 4 1 5 9 2 6 5 3 5

35 mod 11 = 2 not equal to 4

As we can see, when a match is found, further testing


is done to insure that a match has indeed been
found.
Rolling Hash
• A rolling hash allows an algorithm to calculate a hash
value without having the rehash the entire string.
• For example, when searching for a word in a text, as the
algorithm shifts one letter to the right instead of having
to calculate the hash of the section.
• Eg: Convert a text from a[8-9]=53 to a[9-10]=35.
a[9-10] = a[8-9]-a[8]+a[10]
a[9-10] = ((5*101 + 3*100) – (5*101)) *101+ (5*10o)
a[9-10] = (53-50)*101+ 5 = 35
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 17
Rabin Karp Example
Here we are matching 31415 in the given string. We can have various valid hits
but there may be few valid matches.

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 18


KMP Algorithm
• It never re-compares a text symbol that has matched a pattern
symbol. As a result, complexity of the searching phase of the
KMP = O(n). Preprocessing phase has a complexity of O(m).
Since m< n, the overall complexity of is O(n). A border of x is a
substring that is both proper prefix and proper suffix of x. We
call its length b the width of the border.
• Let x=abacab Proper prefixes of x are ε, a, ab, aba, abac, abaca
• The proper suffixes of x are ε, b, ab, cab, acab, bacab
• The borders of x are ε, ab

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 19


KMP-Preprocess Algorithm
void kmpPreprocess(){ For pattern p = ababaa the
int i=0, j=-1; widths of the borders in
b[i]=j;
array b have the following
while (i<m) {
values.
while (j>=0 && p[i]!=p[j])
j=b[j]; For instance we
i++; have b[5] = 3, since the
j++; prefix ababa of length 5 has
b[i]=j; a border of width 3
}}

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 23


Preprocessing
pre-processing algorithm
could be applied to the
string pt instead of p. If
borders up to a width of m are
computed only, then a border
of width m of some
prefix x of pt corresponds to a
match of the pattern
in t (provided that the border
is not self-overlapping)

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 24


KMP Search Algorithm
void kmpSearch() {
int i=0, j=0;
while (i<n){
while (j>=0 && t[i]!=p[j])
j=b[j];
i++;
j++;
if (j==m) {
report(i-j);
j=b[j];
}}}

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 25


KMP Search Algorithm
• When in inner while loop a mismatch at position j occurs, the
widest border of the matching prefix of length j of the pattern
is considered. Resuming comparisons at position b[j], the
width of the border, yields a shift of the pattern such that the
border matches. If again a mismatch occurs, the next-widest
border is considered, and so on, until there is no border left or
the next symbol matches. Then we have a new matching prefix
of the pattern and continue with the outer while loop.
• If all m symbols of the pattern have matched the
corresponding text window (j = m), a function report is called
for reporting the match at position i-j. Afterwards, the pattern
is shifted as far as its widest border allows

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 26


a b a c a a b a c c a b a c a b a a b b
1 2 3 4 5 6
a b a c a b
7
a b a c a b
8 9 10 11 12

a b a c a b
13
x 0 1 2 3 4 5
a b a c a b
P[x] a b a c a b
14 15 16 17 18 19
f(x) 0 0 1 0 1 2
a b a c a b
11/15/2021 10:08 AM Copyright @ gdeepak.Com® 27
KMP Video

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 28


Questions, comments and Suggestions

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 29


Question 1

How many nonempty prefixes of the string


p=“aaabbaaa” are also suffixes of P?

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 30


Question 2

What is the longest proper prefix of the string


cgtacgttcgtacg that is also the suffix of this string.

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 31


Question 3
What is the complexity of the KMP Algorithm, if we have a
main string of length s and we wish to find the pattern of
length p.
A) O( s+p)
B) O(p)
C) O(sp)
D) O(s)

11/15/2021 10:08 AM Copyright @ gdeepak.Com® 32

You might also like