0% found this document useful (0 votes)
5 views146 pages

String Matching Algorithms

The document discusses string matching algorithms, including the naive algorithm and the Knuth-Morris-Pratt algorithm. It defines the problem of finding a pattern within a text and provides examples, complexities, and applications such as DNA sequencing. The naive algorithm's complexity is Θ(nm), and it explores improvement strategies for more efficient matching.

Uploaded by

ahmadxaidi2004
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views146 pages

String Matching Algorithms

The document discusses string matching algorithms, including the naive algorithm and the Knuth-Morris-Pratt algorithm. It defines the problem of finding a pattern within a text and provides examples, complexities, and applications such as DNA sequencing. The naive algorithm's complexity is Θ(nm), and it explores improvement strategies for more efficient matching.

Uploaded by

ahmadxaidi2004
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 146

String Matching Algorithms

Aimal Rextin

Faculty of Computing, SEECS

Design and Analysis of Algorithms


Outline
Problem definition

Naı̈ve algorithm

Knuth-Morris-Pratt algorithm

Design and Analysis of Algorithms


Problem

Design and Analysis of Algorithms


Problem
Given the text
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
che la dritta via era smarrita. . .
Find the string “trova”

Design and Analysis of Algorithms


Problem
Given the text
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
che la dritta via era smarrita. . .
Find the string “trova”

A more challenging example: How many times does the string


“110011” appear in the following text
0011110101011010011000110101111011010111
0110111001001010101011111011110110000101
1011000010111111011110011000011111000100
1001010010111011101011011110101001100101
0010111001000011111110010011011101011010
0110011011101001010010101000010100111110

Design and Analysis of Algorithms


Problem
Given the text
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
che la dritta via era smarrita. . .
Find the string “trova”

A more challenging example: How many times does the string


“110011” appear in the following text
0011110101011010011000110101111011010111
0110111001001010101011111011110110000101
1011000010111111011110011000011111000100
1001010010111011101011011110101001100101
0010111001000011111110010011011101011010
0110011011101001010010101000010100111110

Design and Analysis of Algorithms


Application 1: Find / Find Replace
Even an inefficient will be ok in practical situations due to the limited
size of these files.

Design and Analysis of Algorithms


Application 2: DNA Sequencing

Design and Analysis of Algorithms


Application 2: DNA Sequencing
Human DNA has 6 billion characters

Design and Analysis of Algorithms


String Matching: Definitions
Σ is the alphabet set and the Σ∗ is the set of all possible finite
strings
Given a text T
◮ T ∈ Σ∗ : finite alphabet Σ
◮ |T | = n: the length of T is n

Design and Analysis of Algorithms


String Matching: Definitions
Σ is the alphabet set and the Σ∗ is the set of all possible finite
strings
Given a text T
◮ T ∈ Σ∗ : finite alphabet Σ
◮ |T | = n: the length of T is n

Given a pattern P
◮ P ∈ Σ∗ : same finite alphabet Σ
◮ |P| = m: the length of P is m

Design and Analysis of Algorithms


String Matching: Definitions
Σ is the alphabet set and the Σ∗ is the set of all possible finite
strings
Given a text T
◮ T ∈ Σ∗ : finite alphabet Σ
◮ |T | = n: the length of T is n

Given a pattern P
◮ P ∈ Σ∗ : same finite alphabet Σ
◮ |P| = m: the length of P is m

Both T and P can be modeled as arrays


◮ T [1 . . . n] and P[1 . . . m]

Design and Analysis of Algorithms


String Matching: Definitions
Σ is the alphabet set and the Σ∗ is the set of all possible finite
strings
Given a text T
◮ T ∈ Σ∗ : finite alphabet Σ
◮ |T | = n: the length of T is n

Given a pattern P
◮ P ∈ Σ∗ : same finite alphabet Σ
◮ |P| = m: the length of P is m

Both T and P can be modeled as arrays


◮ T [1 . . . n] and P[1 . . . m]

Pattern P occurs with shift s in T iff


◮ 0≤s ≤n−m
◮ T [s + i] = P[i] for all positions 1 ≤ i ≤ m
Design and Analysis of Algorithms
Example
Problem: find all s such that
◮ 0≤s ≤n−m
◮ T [s + i] = P[i] for 1 ≤ i ≤ m

Design and Analysis of Algorithms


Example
Problem: find all s such that
◮ 0≤s ≤n−m
◮ T [s + i] = P[i] for 1 ≤ i ≤ m

n = 14
T a b c a a b a a b a b a c a

Design and Analysis of Algorithms


Example
Problem: find all s such that
◮ 0≤s ≤n−m
◮ T [s + i] = P[i] for 1 ≤ i ≤ m

n = 14
T a b c a a b a a b a b a c a

m=3
P a b a

Result

Design and Analysis of Algorithms


Example
Problem: find all s such that
◮ 0≤s ≤n−m
◮ T [s + i] = P[i] for 1 ≤ i ≤ m

n = 14
T a b c a a b a a b a b a c a

s=4
P a b a

Result
s=4

Design and Analysis of Algorithms


Example
Problem: find all s such that
◮ 0≤s ≤n−m
◮ T [s + i] = P[i] for 1 ≤ i ≤ m

n = 14
T a b c a a b a a b a b a c a

s=7
P a b a

Result
s=4
s=7

Design and Analysis of Algorithms


Example
Problem: find all s such that
◮ 0≤s ≤n−m
◮ T [s + i] = P[i] for 1 ≤ i ≤ m

n = 14
T a b c a a b a a b a b a c a

s=9
P a b a

Result
s=4
s=7
s=9

Design and Analysis of Algorithms


Naı̈ve Algorithm

Design and Analysis of Algorithms


Naı̈ve Algorithm
Naı̈ve means a person or act meaning experience or wisdom

Design and Analysis of Algorithms


Naı̈ve Algorithm
Naı̈ve means a person or act meaning experience or wisdom
For each position s in 0 . . . n − m, see if T [s + i] = P[i] for all
1≤i ≤m

Design and Analysis of Algorithms


Naı̈ve Algorithm
Naı̈ve means a person or act meaning experience or wisdom
For each position s in 0 . . . n − m, see if T [s + i] = P[i] for all
1≤i ≤m

Naive-String-Matching(T , P)
1 n = length(T )
2 m = length(P)
3 for s = 0 to n − m
4 if Substring-At(T , P, s)
5 output(s)

Design and Analysis of Algorithms


Naı̈ve Algorithm
Naı̈ve means a person or act meaning experience or wisdom
For each position s in 0 . . . n − m, see if T [s + i] = P[i] for all
1≤i ≤m

Naive-String-Matching(T , P)
1 n = length(T )
2 m = length(P)
3 for s = 0 to n − m
4 if Substring-At(T , P, s)
5 output(s)

Substring-At(T , P, s)
1 for i = 1 to length(P)
2 if T [s + i] 6= P[i]
3 return false
4 return true

Design and Analysis of Algorithms


Complexity of the Naı̈ve Algorithm

Design and Analysis of Algorithms


Complexity of the Naı̈ve Algorithm
The Substring-At function takes time proportional to m the length
of the pattern P

Design and Analysis of Algorithms


Complexity of the Naı̈ve Algorithm
The Substring-At function takes time proportional to m the length
of the pattern P
How many shifts do we need to make in the main function?

Design and Analysis of Algorithms


Complexity of the Naı̈ve Algorithm
The Substring-At function takes time proportional to m the length
of the pattern P
How many shifts do we need to make in the main function?
no shift, shift of 1, shift of 2 · · · all the way upto when the first char
of P P[1] aligned to T [n − m] i.e shift of n − m + 1

Design and Analysis of Algorithms


Complexity of the Naı̈ve Algorithm
The Substring-At function takes time proportional to m the length
of the pattern P
How many shifts do we need to make in the main function?
no shift, shift of 1, shift of 2 · · · all the way upto when the first char
of P P[1] aligned to T [n − m] i.e shift of n − m + 1 Hence the
complexity of Naive-String-Match is

Θ((n − m + 1)m)

Design and Analysis of Algorithms


Complexity of the Naı̈ve Algorithm
The Substring-At function takes time proportional to m the length
of the pattern P
How many shifts do we need to make in the main function?
no shift, shift of 1, shift of 2 · · · all the way upto when the first char
of P P[1] aligned to T [n − m] i.e shift of n − m + 1 Hence the
complexity of Naive-String-Match is

Θ((n − m + 1)m)
since usually n ≫ m so we can say that its complexity is Θ(nm)

Design and Analysis of Algorithms


Improvement Strategy

Design and Analysis of Algorithms


Improvement Strategy
Observation

T a b c a a b a a b a b a c a

P a b a

Design and Analysis of Algorithms


Improvement Strategy
Observation

T a b c a a b a a b a b a c a
=
P a b a

Design and Analysis of Algorithms


Improvement Strategy
Observation

T a b c a a b a a b a b a c a
= =
P a b a

Design and Analysis of Algorithms


Improvement Strategy
Observation

T a b c a a b a a b a b a c a
= = 6=
P a b a

Design and Analysis of Algorithms


Improvement Strategy
Observation

T a b c a a b a a b a b a c a
= = 6=
P a b a

What now?

Design and Analysis of Algorithms


Improvement Strategy
Observation

T a b c a a b a a b a b a c a
= = 6=
P a b a

What now?
◮ the naı̈ve algorithm goes back to the second position in T and starts
from the beginning of P

Design and Analysis of Algorithms


Improvement Strategy
Observation

T a b c a a b a a b a b a c a
= = 6=
P a b a

What now?
◮ the naı̈ve algorithm goes back to the second position in T and starts
from the beginning of P
◮ can’t we simply move along through T ?

Design and Analysis of Algorithms


Improvement Strategy
Observation

T a b c a a b a a b a b a c a
= = 6=
P a b a

What now?
◮ the naı̈ve algorithm goes back to the second position in T and starts
from the beginning of P
◮ can’t we simply move along through T ?

◮ why?

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o Output: 10

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o Output: 10

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o Output: 10

q+1

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o Output: 10

Done. Perfect!

Design and Analysis of Algorithms


Improvement Strategy (2)
Example run of Wrong-String-Matching

T p a g l i a i o b a g o r d o

P a g o Output: 10

Done. Perfect!

Complexity: Θ(n)

Design and Analysis of Algorithms


Improvement Strategy (2)
Here’s the algo for the wrong but insightful strategy

Design and Analysis of Algorithms


Improvement Strategy (2)
Here’s the algo for the wrong but insightful strategy
Wrong-String-Matching(T , P)
1 n = length(T )
2 m = length(P)
3 q =0 // number of characters matched in P
4 s =1
5 while s ≤ n
6 s = s +1
7 if T [s] == P[q + 1]
8 q = q+1
9 if q == m
10 output(s − m)
11 q =0
12 else q = 0

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a

P a a b

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a
output(0)

P a a b

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a
missed!

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (4)
What is wrong with Wrong-String-Matching?

T a a b a a a b a b a b a c a
missed!

P a a b

q+1

So Wrong-String-Matching doesn’t work, but it tells us


something useful

Design and Analysis of Algorithms


Improvement Strategy (5)
Where did Wrong-String-Matching go wrong?

Design and Analysis of Algorithms


Improvement Strategy (5)
Where did Wrong-String-Matching go wrong?

T a a b a a a b a b a b a c a

P a a b

Design and Analysis of Algorithms


Improvement Strategy (5)
Where did Wrong-String-Matching go wrong?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (5)
Where did Wrong-String-Matching go wrong?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (5)
Where did Wrong-String-Matching go wrong?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (5)
Where did Wrong-String-Matching go wrong?

T a a b a a a b a b a b a c a

P a a b

q+1

Design and Analysis of Algorithms


Improvement Strategy (5)
Where did Wrong-String-Matching go wrong?

T a a b a a a b a b a b a c a

P a a b

q+1

Wrong: by going all the way back to q = 0 we throw away a good


prefix of P that we already matched

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

q+1

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

q+1

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

q+1

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

q+1

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

q+1

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

q+1

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

q+1

We have matched “ababa”

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

q+1

We have matched “ababa”


◮ suffix “aba” can be reused as a prefix

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

q+1

We have matched “ababa”


◮ suffix “aba” can be reused as a prefix
◮ i.e. restart from position 3

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

q+1

We have matched “ababa”


◮ suffix “aba” can be reused as a prefix
◮ i.e. restart from position 3

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

q+1

We have matched “ababa”


◮ suffix “aba” can be reused as a prefix
◮ i.e. restart from position 3

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a

P a b a b a c

q+1

We have matched “ababa”


◮ suffix “aba” can be reused as a prefix
◮ i.e. restart from position 3

Design and Analysis of Algorithms


Improvement Strategy (6)
Another example

T a b a b a b a c b a c b c a
output(2)

P a b a b a c

We have matched “ababa”


◮ suffix “aba” can be reused as a prefix
◮ i.e. restart from position 3

Design and Analysis of Algorithms


Definitions
Prefix: A prefix of a string is any sequence of characters that
appears at the beginning of the string.
Suffix: A suffix of a string is any sequence of characters that
appears at the end of the string.
Proper Suffix: A proper suffix is a suffix of a string that is not
equal to the entire string itself.

Design and Analysis of Algorithms


Definitions-Examples

Design and Analysis of Algorithms


Definitions-Examples
1. For the string ”hello,” ”he” is a prefix, ”lo” is a suffix
2. ”hello” itself is a prefix but not a proper prefix

Design and Analysis of Algorithms


Definitions-Examples
1. For the string ”hello,” ”he” is a prefix, ”lo” is a suffix
2. ”hello” itself is a prefix but not a proper prefix
3. The empty string is both a prefix and a suffix of any string.

Design and Analysis of Algorithms


Definitions-Examples
1. For the string ”hello,” ”he” is a prefix, ”lo” is a suffix
2. ”hello” itself is a prefix but not a proper prefix
3. The empty string is both a prefix and a suffix of any string.
4. For the string ”programming,” ”programming” is not a proper suffix
because it is equal to the string itself.

Design and Analysis of Algorithms


New Strategy
P[1 . . . q] is the prefix of P matched so far and on the next char we
have a mismatch

Design and Analysis of Algorithms


New Strategy
P[1 . . . q] is the prefix of P matched so far and on the next char we
have a mismatch
To avoid unnecessary comparisons, we look for the longest proper
suffix of the matched pattern segment P[1 · · · q] that is also a prefix
of the entire pattern P.
We want the this suffix-prefix pair to be as long as possible ???

Design and Analysis of Algorithms


New Strategy
P[1 . . . q] is the prefix of P matched so far and on the next char we
have a mismatch
To avoid unnecessary comparisons, we look for the longest proper
suffix of the matched pattern segment P[1 · · · q] that is also a prefix
of the entire pattern P.
We want the this suffix-prefix pair to be as long as possible ???
Find the longest prefix of P that is also a suffix of P[2 . . . q]
◮ i.e., find the longest prefix P[1 . . . π] = P[q − π + 1 . . . q]
◮ π = 0 means that such a prefix does not exist

Design and Analysis of Algorithms


New Strategy
P[1 . . . q] is the prefix of P matched so far and on the next char we
have a mismatch
To avoid unnecessary comparisons, we look for the longest proper
suffix of the matched pattern segment P[1 · · · q] that is also a prefix
of the entire pattern P.
We want the this suffix-prefix pair to be as long as possible ???
Find the longest prefix of P that is also a suffix of P[2 . . . q]
◮ i.e., find the longest prefix P[1 . . . π] = P[q − π + 1 . . . q]
◮ π = 0 means that such a prefix does not exist

P a b a b a c

q+1

Design and Analysis of Algorithms


New Strategy
P[1 . . . q] is the prefix of P matched so far and on the next char we
have a mismatch
To avoid unnecessary comparisons, we look for the longest proper
suffix of the matched pattern segment P[1 · · · q] that is also a prefix
of the entire pattern P.
We want the this suffix-prefix pair to be as long as possible ???
Find the longest prefix of P that is also a suffix of P[2 . . . q]
◮ i.e., find the longest prefix P[1 . . . π] = P[q − π + 1 . . . q]
◮ π = 0 means that such a prefix does not exist

P a b a b a c
π=3
q+1

Design and Analysis of Algorithms


New Strategy
P[1 . . . q] is the prefix of P matched so far and on the next char we
have a mismatch
To avoid unnecessary comparisons, we look for the longest proper
suffix of the matched pattern segment P[1 · · · q] that is also a prefix
of the entire pattern P.
We want the this suffix-prefix pair to be as long as possible ???
Find the longest prefix of P that is also a suffix of P[2 . . . q]
◮ i.e., find the longest prefix P[1 . . . π] = P[q − π + 1 . . . q]
◮ π = 0 means that such a prefix does not exist

P a b a b a c
π=3
q+1

Design and Analysis of Algorithms


New Strategy
P[1 . . . q] is the prefix of P matched so far and on the next char we
have a mismatch
To avoid unnecessary comparisons, we look for the longest proper
suffix of the matched pattern segment P[1 · · · q] that is also a prefix
of the entire pattern P.
We want the this suffix-prefix pair to be as long as possible ???
Find the longest prefix of P that is also a suffix of P[2 . . . q]
◮ i.e., find the longest prefix P[1 . . . π] = P[q − π + 1 . . . q]
◮ π = 0 means that such a prefix does not exist

P a b a b a c

q+1
Restart from q = π

Design and Analysis of Algorithms


New Strategy
P[1 . . . q] is the prefix of P matched so far and on the next char we
have a mismatch
To avoid unnecessary comparisons, we look for the longest proper
suffix of the matched pattern segment P[1 · · · q] that is also a prefix
of the entire pattern P.
We want the this suffix-prefix pair to be as long as possible ???
Find the longest prefix of P that is also a suffix of P[2 . . . q]
◮ i.e., find the longest prefix P[1 . . . π] = P[q − π + 1 . . . q]
◮ π = 0 means that such a prefix does not exist

P a b a b a c

q+1
Restart from q = π
Iterate as usual. This is Knuth-Morris-Pratt algorithm
Design and Analysis of Algorithms
The Prefix Function
Given a pattern P, the matched prefix P[1 . . . q], the longest prefix
of P that is also a suffix of P[2 . . . q] depends only on P and q

Design and Analysis of Algorithms


The Prefix Function
Given a pattern P, the matched prefix P[1 . . . q], the longest prefix
of P that is also a suffix of P[2 . . . q] depends only on P and q

This prefix is identified by its length π(q)

Design and Analysis of Algorithms


The Prefix Function
Given a pattern P, the matched prefix P[1 . . . q], the longest prefix
of P that is also a suffix of P[2 . . . q] depends only on P and q

This prefix is identified by its length π(q)

π can be computed at the beginning by Prefix-Function


◮ we represent π as an array of length m

Design and Analysis of Algorithms


The Prefix Function
Given a pattern P, the matched prefix P[1 . . . q], the longest prefix
of P that is also a suffix of P[2 . . . q] depends only on P and q

This prefix is identified by its length π(q)

π can be computed at the beginning by Prefix-Function


◮ we represent π as an array of length m

Example

P a b a b a c

Design and Analysis of Algorithms


The Prefix Function
Given a pattern P, the matched prefix P[1 . . . q], the longest prefix
of P that is also a suffix of P[2 . . . q] depends only on P and q

This prefix is identified by its length π(q)

π can be computed at the beginning by Prefix-Function


◮ we represent π as an array of length m

Example

P a b a b a c

π 0 0 1 2 3 0

Design and Analysis of Algorithms


The Knuth-Morris-Pratt Algorithm

KMP-String-Matching(T , P)
1 n = length(T )
2 m = length(P)
3 π = Prefix-Function(P)
4 q =0 // number of character matched
5 for i = 1 to n // scan the text left-to-right
6 while q > 0 and P[q + 1] 6= T [i]
7 q = π[q] // no match: go back using π
8 if P[q + 1] == T [i]
9 q = q+1
10 if q == m
11 output(i − m + 1)
12 q = π[q] // go back for the next match

Design and Analysis of Algorithms


If Condition ?
Question
Why do we have an if condition on the line 8?

If q = 0 then looks will not skipped and we implicitly assume that


P[q + 1] == T [i]

Design and Analysis of Algorithms


Prefix Function Algorithm

Design and Analysis of Algorithms


Prefix Function Algorithm
Finds self similarity in the pattern P in itself
For each P[q] is finds the longest proper prefix A[1 · · · k] of
A[1 · · · q] that also equal to a suffix of A[1 · · · q]

Design and Analysis of Algorithms


Prefix Function Algorithm
Finds self similarity in the pattern P in itself
For each P[q] is finds the longest proper prefix A[1 · · · k] of
A[1 · · · q] that also equal to a suffix of A[1 · · · q]
In fact, Prefix-Function is remarkably similar to
KMP-String-Matching

Design and Analysis of Algorithms


Prefix Function

Prefix-Function(P)
1 m = length(P)
2 π[1] = 0
3 k =0
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q]
6 k = π[k]
7 if P[k + 1] == P[q]
8 k = k +1
9 π[q] = k

k is the length of the longest prefix so far.

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q]
6 k = π[k]
7 if P[k + 1] == P[q] π
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q]
6 k = π[k]
7 if P[k + 1] == P[q] π 0
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0 1
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0 1
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0 1
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0 1 2
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0 1 2
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0 1 2
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0 1 2 3
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0 1 2 3
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0 1 2 3
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0 1 2 3
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Prefix Function at Work

Prefix-Function(P)
q
1 m = length(P)
2 π[1] = 0
3 k =0 P a b a b a c
4 for q = 2 to m
5 while k > 0 and P[k + 1] 6= P[q] k+1
6 k = π[k]
7 if P[k + 1] == P[q] π 0 0 1 2 3 0
8 k = k +1
9 π[q] = k

Design and Analysis of Algorithms


Complexity of KMP

Design and Analysis of Algorithms


Complexity of KMP
O(n) for the search phase

Design and Analysis of Algorithms


Complexity of KMP
O(n) for the search phase

O(m) for the pre-processing of the pattern

Design and Analysis of Algorithms


Complexity of KMP
O(n) for the search phase

O(m) for the pre-processing of the pattern

Hence it is O(n)

Design and Analysis of Algorithms

You might also like