UNIT-4 PPT New

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 47

Pattern Matching

a b a c a a b
1
a b a c a b
4

2
a b a c
a b
Strings

A string is a sequence of characters


Examples of strings:
• Python program
• HTML document
• DNA sequence
• Digitized image
An alphabet ∑ is the set of possible characters for a family of strings
Example of alphabets:
• ASCII
• Unicode
• {0, 1}
• {A, C, G, T}

Strings

Let P be a string of size m

• A substring P[i .. j] of P is the subsequence of P consisting of the


characters with ranks between i and j
• A prefix of P is a substring of the type P[0 .. i]
• A suffix of P is a substring of the type P[i ..m - 1]

Given strings T (text) and P (pattern), the pattern matching problem


consists of finding a substring of T equal to P
Applications:
• Text editors
• Search engines
• Biological research

Patten Matching Algorithms

Generally used pattern matching algorithms are

Brute-force algorithm
Boyer-Moore algorithm

Knuth-Morris-Pratt algorithm
Brute-force algorithm
• The Brute Force algorithm compares the pattern to the text, one
character at a time, until unmatching characters are found
• The algorithm can be designed to stop on either the first occurrence of
the pattern, or upon reaching the end of the text.
• Compared characters are italicized.
• Correct matches are in boldface type.
Brute-force algorithm
• The Brute Force algorithm compares the pattern P with the text T for
each possible shift of P relative to T, until either
• a match is found, or
• all placements of the pattern are tried
Brute-force pattern matching runs in time O(nm)

Example of worst case:


• T = aaa … ah
• P = aaah
• may occur in images and DNA sequences
• unlikely in English text
Brute-force algorithm
Algorithm BruteForceMatch (T, P)
Input text T of size n and pattern P of size m
Output starting index of a substring of T equal to P
or − 1 if no such substring exists
for i ← 0 to n − m
{ test shift i of the pattern }
j←0
while j < m ∧ T[i + j] = P[j]
j←j+1
if j = m
return i {match at i }
{else mismatch at i}
return -1 {no match anywhere}
Brute-force algorithm
• Given a pattern M characters in length, and a text N characters in
length...
• Worst case: compares pattern to each substring of text of length M. For
example, M=5.
• This kind of case can occur for image data.

Total number of comparisons: M (N-M+1)


Worst case time complexity: O(MN)
Brute-force algorithm
• Given a pattern M characters in length, and a text N characters in
length...
• Best case if pattern found: Finds pattern in first M positions of text. For
example, M=5

Total number of comparisons: M


Best case time complexity: O(M)
Boyer-moore algorithm
• Problem with Brute-Force search
• We keep considering to many bad options for comparison
• We can eliminate lot of possibilities

• TO overcome this Boyer-Moore algorithm is came into


picture
• Very efficient string search algorithm
• It preprocess the pattern instead of whole text
• Runs faster as the length of the pattern increases
• The key features of the algorithm are to match on the tail of pattern
rather than the head
• It is good because we can skip multiple characters at the same time
instead of searching for every single character in the text.
Boyer-moore algorithm

Preprocessing of Pattern:
• Construct bad match table for preprocessing of the pattern
• This table never has elements smaller than 1
• Keep comparing the pattern to the text starting from rightmost
character in the pattern
• When mismatch occurs we have to shift the pattern to the
right corresponding to the value in the bad match table
• With this we can skip several characters unlike brute force
search
• So that the algorithm runs faster
Boyer-moore algorithm

BAD MATCH TABLE or LAST OCCURANCE FUNCTION:


• Table of characters
• Make sure the table doesn’t contain repetitive characters
• If several ‘a’ are there, the table should contain only one ‘a’
• The formula to be used
• Max(1, length of pattern – actual index -1)
• We iterate over the pattern and compute the values to the bad
match table
• We keep updating old values for the same characters
Boyer-moore algorithm
Boyer-moore algorithm
Boyer-moore algorithm
Boyer-moore algorithm
Boyer-moore algorithm
Boyer-moore algorithm
Boyer-moore algorithm
Boyer-moore algorithm
Boyer-moore algorithm
Boyer-moore algorithm
Function for bad match table or pre-computing number of shifts
# include <limits.h>
# include <string.h>
# include <stdio.h>

# define NO_OF_CHARS 256

int max (int a, int b) { return (a > b)? a: b; }


// The preprocessing function for Boyer Moore's
void badCharHeuristic( char *str, int size,
int badchar[NO_OF_CHARS])
{
int i;
for (i = 0; i < NO_OF_CHARS; i++)
badchar[i] = -1;
for (i = 0; i < size; i++)
badchar[(int) str[i]] = i;
}
Boyer-moore algorithm
Function for Boyer-Moore Algorithm: if (j < 0)
{
printf("\n pattern occurs at shift = %d",
void search( char *txt, char *pat)
s);
{ s += (s+m < n)? m-badchar[txt[s+m]] : 1;
int m = strlen(pat); }
int n = strlen(txt); else s += max(1, j - badchar[txt[s+j]]);
int badchar[NO_OF_CHARS]; }
badCharHeuristic(pat, m, badchar); }
int s = 0; // s is shift of the pattern with
// respect to text
while(s <= (n - m))
{
int j = m-1;
while(j >= 0 && pat[j] == txt[s+j])
j--;
Boyer-moore algorithm
Function for Boyer-Moore Algorithm:

Best-case performance: Θ(m) preprocessing + Ω(n/m) matching


Worst-case performance: Θ(m) preprocessing + O(mn) matching

The worst case may occur in images and DNA sequences


Boyer-Moore’s algorithm is significantly faster than the brute-force
algorithm on English text
Knuth-Morris – Pratt algorithm
• This algorithm was developed by Knuth-Morris-Pratt.
• It is the first linear time pattern matching algorithm.
• It prevents reexamination of previously matched characters.
• The implementation of Knuth-Morris-Pratt algorithm is efficient because
it minimizes the total number of comparisons of the pattern against the
input string.

text t1 . . . . . tj . . .
tj-r+1
.
pattern p1 . . . pr . . .

pattern p1 . . . pr
. . .
pattern p1 . . . pr . .
.
Knuth-Morris – Pratt algorithm
• Knuth-Morris-Pratt’s algorithm
compares the pattern to the text in
. . a b a a b x . . . . .
left-to-right, but shifts the pattern
more intelligently than the brute-
force algorithm. a b a a b a
• When a mismatch occurs, what is j
the most we can shift the pattern
so as to avoid redundant a b a a b a
comparisons?
• Answer: the largest prefix of P[0..j] No need to Resume
that is a suffix of P[1..j] repeat these comparing
comparisons
Knuth-Morris – Pratt algorithm
KMP Failure function or LPS(Longest Prefix that is also Suffix) array:

Knuth-Morris-Pratt’s algorithm preprocesses the pattern to find matches of


prefixes of the pattern with the pattern itself
The failure function F(j) is defined as the size of the largest prefix of P[0..j]
that is also a suffix of P[1..j]
Knuth-Morris – Pratt algorithm
KMP Failure function or LPS(Longest Prefix that is also Suffix) array:
Function computelpsarray(pat,m,lps)
{
len=0,i=1;lps[0]=0
while(i<m)
{
if(pat[i]==pat[len])
{
lps[i]=len+1;
len++; i++;
}
else
{
if(len!=0)
len=lps[len-1]
else
lps[i]=0
i++
}
}
< M) { if (pat[i] == pat[length]) { length++; pps[i] = length; i++; } else { if (length != 0)

Knuth-Morris – Pratt algorithm


KMP Failure function or LPS(Longest Prefix that is also Suffix) array:
Function computelpsarray(pat,m,lps)
{
len=0,i=1;lps[0]=0
while(i<m)
{
if(pat[i]==pat[len])
{
lps[i]=len+1;
len++; i++;
}
else
{
if(len!=0)
len=lps[len-1]
else
lps[i]=0
i++
}
}
Knuth-Morris – Pratt algorithm
if (j == M) {
KMP ALGORITHM: printf("Found pattern at index %d\
void KMPAlgorithm(char* text, char*
n", i - j);
pattern) {
j = pps[j - 1];
int M = strlen(pattern);
}
int N = strlen(text);
else if (i < N && pattern[j] != text[i]) {
int pps[M];
if (j != 0)
prefixSuffixArray(pattern, M, pps);
j = pps[j - 1];
int i = 0;
else
int j = 0;
i = i + 1;
while (i < N) {
}
if (pattern[j] == text[i]) {
}
j++;
}
i++;
int main() {
}
char text[] = "xyztrwqxyzfg";
char pattern[] = "xyz";
printf("The pattern is found in the text at
the following index : \n");
KMPAlgorithm(text, pattern);
}
Knuth-Morris – Pratt algorithm
KMP Failure function or LPS(Longest Prefix that is also Suffix) array:
Knuth-Morris – Pratt algorithm
Knuth-Morris – Pratt algorithm

Best-case performance: Θ(m) preprocessing +O(m)


Worst-case performance: O(n).

The best case may occur in images and DNA sequences



Text Processing
We have seen that preprocessing the pattern speeds up pattern matching
queries
• After preprocessing the pattern in time proportional to the pattern length,
the Boyer-Moore algorithm searches an arbitrary English text in (average)
time proportional to the text length
• If the text is large, immutable and searched for often (e.g., works by
Shakespeare), we may want to preprocess the text instead of the pattern in
order to perform pattern matching queries in time proportional to the
pattern length.
• Tradeoffs in text searching

Dr.L.Lakshmi 34
Tries
• A trie is a tree-based data structure for representing a set of strings,
such as all the words in a text

A tries supports pattern matching queries in time proportional to the pattern size
(or)
• A trie is a tree-based data structure for storing strings in order to
support fast pattern matching.

e i mi nimize ze

mize nimize ze nimize ze

Dr.L.Lakshmi 35
Tries
• Standard Tries
• Compressed Tries
• Suffix Tries

Dr.L.Lakshmi 36
Standard Tries
• The standard trie for a set of strings S is an ordered tree such that:
• each node but the root is labeled with a character
• the children of a node are alphabetically ordered
• the paths from the external nodes to the root yield the strings of S

• Example: standard trie for


the set of strings
S = { bear, bell, bid, bull,
buy, sell, stock, stop }

Dr.L.Lakshmi 37
Standard Tries

• A standard trie uses O(n) space. Operations (find, insert,


remove) take time O(dm) each, where:
n = total size of the strings in S,
m =size of the string parameter of the operation
d =alphabet size,

Dr.L.Lakshmi 38
Word Matching with a Trie
• A standard trie supports the following operations on a preprocessed text in time O(m), where
m = |X|
-word matching: find the first occurrence of word X in the text
-prefix matching: find the first occurrence of the longest prefix of word X in the text
• Each operation is performed by tracing a path in the trie starting at the root

Dr.L.Lakshmi 39
Compressed Tries
• Trie with nodes of degree at least 2
• Obtained from standard trie by compressing chains of redundant nodes

Standard Trie:

Compressed Trie:

Dr.L.Lakshmi 40
Compact Storage of Compressed Tries
• A compressed trie can be stored in space O(s), where s = |S|, by using O(1) space index ranges at
the nodes

Dr.L.Lakshmi 41
Insertion and Deletion into/from a Compressed Trie

Dr.L.Lakshmi 42
Suffix Trie
• The suffix trie of a string X is the compressed trie of all the suffixes of X

m i n i m i z e
0 1 2 3 4 5 6 7

e i mi nimize ze

mize nimize ze nimize ze


Dr.L.Lakshmi 43
Compact representation:

Dr.L.Lakshmi 44
Properties of Suffix Tries
• The suffix trie for a text X of size n from an alphabet of size d
-stores all the n(n-1)/2 suffixes of X in O(n) space
-supports arbitrary pattern matching and prefix matching queries in O(dm)
time, where m is the length of the pattern
-can be constructed in O(dn) time

Dr.L.Lakshmi 45
Application of Tries
• The index of a search engine (collection of all searchable words) is stored into a
compressed trie
• Each leaf of the trie is associated with a word and has a list of pages (URLs)
containing that word, called occurrence list
• The trie is kept in internal memory
• The occurrence lists are kept in external memory and are ranked by relevance
• Boolean queries for sets of words (e.g., Java and coffee) correspond to set operations
(e.g., intersection) on the occurrence lists
• Additional information retrieval techniques are used, such as
• stopword elimination (e.g., ignore “the” “a” “is”)
• stemming (e.g., identify “add” “adding” “added”)
• link analysis (recognize authoritative pages)

Dr.L.Lakshmi 46
Tries and Internet Routers
• Computers on the internet (hosts) are identified by a unique 32-bit IP (internet
protocol) addres, usually written in “dotted-quad-decimal” notation
• E.g., www.cs.brown.edu is 128.148.32.110
• Use nslookup on Unix to find out IP addresses
• An organization uses a subset of IP addresses with the same prefix, e.g., Brown
uses 128.148.*.*, Yale uses 130.132.*.*
• Data is sent to a host by fragmenting it into packets. Each packet carries the IP
address of its destination.
• The internet whose nodes are routers, and whose edges are communication links.
• A router forwards packets to its neighbors using IP prefix matching rules. E.g., a
packet with IP prefix 128.148. should be forwarded to the Brown gateway router.
• Routers use tries on the alphabet 0,1 to do prefix matching.

47

You might also like