Data Structures Unit-5 Notes
Data Structures Unit-5 Notes
Pattern Matching
• A pattern matching algorithm is used to determine the index positions where a given
pattern string (P) is matched in a text string (T).
• It returns "pattern not found" if the pattern does not match in the text string.
• A string is a sequence of characters. Let
Text[0,…n-1] be the string of length n, and
Pattern[0,…,m-1] be some substring of length m, then pattern matching is a
technique of finding the substring in text which is equal to pattern.
• The pattern matching problem is also called as PMP.
• For example, for the given string (s) = "packt publisher", and the pattern
(p)= "publisher", the pattern matching algorithm returns the index position where the
pattern is matched in the text string.
Applications of pattern matching :
1.pattern matching technique is used in text editors.
2.Search engines use the pattern matching algorithem for matching the query submitted by
the user.
3.In biological research pattern matching algorithm is used.
Example:
T
a b b b a b a b a a b
P
a b a a
Step 1: |
a b b b a b a b a a b
a b a a
|
The match is found. Hence compare immediate right characters.
Step 2 : |
a b b b a b a b a a b
a b a a
|
The match is found. Hence compare immediate right characters.
Step 3 : |
a b b b a b a b a a b
a b a a
|
The match is not found. So shift pattern by one position to right.
Step 4 :
|
a b b b a b a b a a b
a b a a
|
The match is not found. So shift pattern by one position to right.
Step 5 : |
a b b b a b a b a a b
a b a a
|
The match is not found. So shift pattern by one position to right.
Step 6 : |
a b b b a b a b a a b
a b a a
|
The match is not found. So shift pattern by one position to right.
Step 7 : |
a b b b a b a b a a b
a b a a
|
The match is found. Hence compare immediate right characters.
Step 8 : |
a b b b a b a b a a b
a b a a
|
The match is found. Hence compare immediate right characters.
Step 9 : |
a b b b a b a b a a b
a b a a
|
The match is found. Hence compare immediate right characters.
Step 10 : |
a b b b a b a b a a b
a b a a
|
The match is not found. So shift pattern by one position to right.
Step 11 : |
a b b b a b a b a a b
a b a a
|
The match is not found. So shift pattern by one position to right.
Step 12 : |
a b b b a b a b a a b
a b a a
|
The match is found. Hence compare immediate right characters
Step 13 : | |
a b b b a b a b a a b
a b a a
| |
The match is found. Hence compare immediate right characters
Step 14 : |
a b b b a b a b a a b
b a aa
|
The match is found. Hence compare immediate right characters
Step 15 : | | | |
a b b b a b a b a a b
a b a a
| | | |
Match is found.
• Both of the above heuristics can also be used independently to search a pattern in a
text.
• It processes the pattern and creates different arrays for both heuristics.
• At every step, it slides the pattern by the max of the slides suggested by the two
heuristics.
• So it uses best of the two heuristics at every step.
• Unlike the previous pattern searching algorithms, Boyer Moore algorithm starts
matching from the last character of the pattern.
Explanation:
• Here we have a mismatch at position 7.
• The mismatching character “C” does not exist in pattern before position 7 so we’ll
shift pattern past to the position 7 and eventually in above example we have got a
perfect match of pattern (displayed in Green).
• We are doing this because, “C” do not exist in pattern so at every shift before
position 7 we will get mismatch and our search will be fruitless.
Explanation:
• In the above example, we have got a substring t of text T matched with pattern P (in
green) before mismatch at index 2.
• Now we will search for occurrence of t (“AB”) in P.
• We have found an occurrence starting at position 1 (in yellow background) so we
will right shift the pattern 2 times to align t in P with t in T.
• This is weak rule of original Boyer Moore and not much effective, we will discuss
a Strong Good Suffix rule shortly.
Explanation:
• In above example, we have got t (“BAB”) matched with P (in green) at index 2-4
before mismatch .
• But because there exists no occurrence of t in P we will search for some prefix of P
which matches with some suffix of t.
• We have found prefix “AB” (in the yellow background) starting at index 0 which
matches not with whole t but the suffix of t “AB” starting at index 3.
• So now we will shift pattern 3 times to align prefix with the suffix.
Knuth-Morris-Pratt Algorithm
• KMP Algorithm is one of the most popular patterns matching algorithms.
• KMP stands for Knuth Morris Pratt.
• KMP algorithm was invented by Donald Knuth and Vaughan Pratt together and
independently by James H Morris in the year 1970.
• In the year 1977, all the three jointly published KMP Algorithm.
• KMP algorithm is one of the string matching algorithms used to find a Pattern in a
Text.
• This algorithm compares character by character from left to right. But whenever a
mismatch occurs, it uses a preprocessed table called "Prefix Table" to skip characters
comparison while matching.
• Some times prefix table is also known as LPS Table.
• Here LPS stands for "Longest proper Prefix which is also Suffix".
Steps for Creating LPS Table (Prefix Table)
Step 1: Define a one dimensional array with the size equal to the length of the Pattern.
(LPS[size])
Step 2: Define variables i & j. Set i = 0, j = 1 and LPS[0] = 0.
Step 3: Compare the characters at Pattern[i] and Pattern[j].
Step 4: If both are matched then set LPS[j] = i+1 and increment both i & j values by one.
Goto to Step 3.
Step 5: If both are not matched then check the value of variable 'i'. If it is '0' then set LPS[j] =
0 and increment 'j' value by one, if it is not '0' then set i = LPS[i-1]. Goto Step 3.
Step 6: Repeat above steps until all the values of LPS[] are filled.
Let us use above steps to create prefix table for a pattern...
How to use LPS Table
• We use the LPS table to decide how many characters are to be skipped for
comparison when a mismatch has occurred.
• When a mismatch occurs, check the LPS value of the previous character of the
mismatched character in the pattern.
• If it is '0' then start comparing the first character of the pattern with the next
character to the mismatched character in the text.
• If it is not '0' then start comparing the character which is at an index value equal to
the LPS value of the previous character to the mismatched character in pattern with
the mismatched character in the Text.
How the KMP Algorithm Works
Let us see a working example of KMP Algorithm to find a Pattern in a Text...
Example 2:
T A B C A A B A C A A B A B A D B C B C A D
P A B A D B C B C
0 1 2 3 4 5 6 7
LPS 0 0 1 0 0 0 0 0
Step 1 Start comparing first character of pattern with first character of Text from left to right
T A B C A A B A C A A B A B A D B C B C A D
P A B A D B C B C
0 1 2 3 4 5 6 7
Here mismatch occurred at Pattern[2],So we need to consider LPS[1] value.Since LPS[1]=0
we must compare first character in Pattern with next character in Text.
Step 2 Start comparing first character of pattern with first character of Text from left to right
T A B C A A B A C A A B A B A D B C B C A D
P A B A D B C B C
0 1 2 3 4 5 6 7
Here mismatch occurred at Pattern[1],So we need to consider LPS[0] value.Since LPS[0]=0
we must compare first character in Pattern with next character in Text.
Step 3 Start comparing first character of pattern with first character of Text from left to right
T A B C A A B A C A A B A B A D B C B C A D
P A B A D B C B C
0 1 2 3 4 5 6 7
Here mismatch occurred at Pattern[0],So we need to shift pattern one position to right.
Step 4 Start comparing first character of pattern with first character of Text from left to right
T A B C A A B A C A A B A B A D B C B C A D
P A B A D B C B C
0 1 2 3 4 5 6 7
Here mismatch occurred at Pattern[1],So we need to consider LPS[0] value.Since LPS[0]=0
we must compare first character in Pattern with next character in Text.
Step 5 Start comparing first character of pattern with first character of Text from left to right
T A B C A A B A C A A B A B A D B C B C A D
P A B A D B C B C
0 1 2 3 4 5 6 7
Here mismatch occurred at Pattern[1],So we need to consider LPS[0] value.Since LPS[0]=0
we must compare first character in Pattern with next character in Text.
Step 6 Start comparing first character of pattern with first character of Text from left to right
T A B C A A B A C A A B A B A D B C B C A D
P A B A D B C B C
0 1 2 3 4 5 6 7
Here mismatch occurred at Pattern[0],So we need to shift pattern one position to right.
Step 7 Start comparing first character of pattern with first character of Text from left to right
T A B C A A B A C A A B A B A D B C B C A D
P A B A D B C B C
0 1 2 3 4 5 6 7
Match is Found at index value 11.
Tries
• All the search trees are used to store the collection of numerical values but they are
not suitable for storing the collection of words or strings.
• Trie is a data structure which is used to store the collection of strings and makes
searching of a pattern in words more easy.
• The term trie came from the word retrieval.
• Trie data structure makes retrieval of a string from the collection of strings more
easily.
• Trie is also called as Prefix Tree and some times Digital Tree.
Trie Properties:
• The trie data structure provides fast pattern matching for string data values.
• In trie, every node except the root stores a character value.
• Every node in trie can have one or a number of children.
• All the children of a node are alphabetically ordered.
• If any two strings have a common prefix then they will have the same ancestors.
Example
Standard Trie
A standard trie have the following properties:
1. It is an ordered tree like data structure.
2. Each node(except the root node) in a standard trie is labeled with a character.
3. The children of a node are in alphabetical order.
4. Each node or branch represents a possible character of keys or words.
5. Each node or branch may have multiple branches.
6. The last node of every key or word is used to mark the end of word or node.
Example:
Compressed Trie
A Compressed trie have the following properties:
1. A Compressed Trie is an advanced version of the standard trie.
2. Each node(except the leaf nodes) have atleast 2 children.
3. It is used to achieve space optimization.
4. To derive a Compressed Trie from a Standard Trie, compression of chains of redundant
nodes is performed.
5. It consists of grouping, re-grouping and un-grouping of keys of characters.
6. While performing the insertion operation, it may be required to un-group the already
grouped characters.
7. While performing the deletion operation, it may be required to re-group the already
grouped characters.
8. A compressed trie T storing s strings(keys) has s external nodes and O(s) total number
of nodes.
Example:
Suffix Trie
A Suffix trie have the following properties:
1. A Suffix Trie is an advanced version of the compressed trie.
2. The most common application of suffix trie is Pattern Matching.
3. While performing the insertion operation, both the word and its suffixes are stored.
4. A suffix trie is also used in word matching and prefix matching.
5. To generate a suffix trie, all the suffixes of given string are considered as individual
words.
6. Using the suffixes, compressed trie is built.
Example:
Let us consider an example text “banana\0” where ‘\0’ is string termination character.
Following are all suffixes of “banana\0”
banana\0
anana\0
nana\0
ana\0
na\0
a\0
\0
If we consider all of the above suffixes as individual words and build a trie, we get
following.
If we join chains of single nodes, we get the following compressed trie, which is the Suffix
Tree for given text “banana\0”