Acstv10n8 43
Acstv10n8 43
Acstv10n8 43
Jiji. N
Associate Professor,
Department of CSE, Younus College of Engg. & Tech.,
Kollam, Kerala, India.
Dr. T Mahalakshmi
Principal, Sree Narayana Institute of Technology,
Kollam, Kerala, India.
Abstract
Extracting and detecting normal/anomaly patterns from the massive amount of
online data are gaining more attention in the world of Big Data Analysis. Last
two decades, researchers proposed a number of searching algorithms for pattern
matching from large volume of data. Most of these algorithms are based on
string matching concept. In this paper, we have made a survey of string
matching algorithms for pattern matching in protein sequence. The exact string
matching algorithms are taken for this survey. And only protein sequence
datasets are considered for experiment evaluation.
Keyword: String matching, pattern matching, gene database
I. INTRODUCTION
String matching is a process of finding a particular string pattern from a large volume
of text. String matching detects a particular string pattern from the stored data.
Nowadays, most of the applications are using string matching concept for data retrieval
or pattern matching from massive amount of data. Formal definition is given below,
Definition 1: Let Σ be an alphabet (finite set). Formally, both the pattern and searched
text are vectors of elements of Σ. The pattern is a combination of alphabets from the Σ
2708 Jiji. N and Dr. T Mahalakshmi
and the searching pattern fromΣ = {𝐴, … , 𝑍|𝑎, … , 𝑧|0, … 9}. Other applications may use
binary alphabet (Σ = {0,1}) or DNA alphabet (Σ = {A,C,G,T}) in bioinformatics.
Definition 2: Find one, or more generally, all the occurrences of a pattern 𝑥 =
[𝑥0 𝑥1 𝑥2 … 𝑥𝑚−1 ]; 𝑥𝑖 ∈ Σ; 𝑖 = 0,1,2, … 𝑚 − 1, in a text of 𝑦 = [𝑦0 𝑦1 𝑦2 … 𝑦𝑛−1 ]; 𝑦𝑗 ∈
Σ; 𝑗 = 0,1,2, … , 𝑛 − 1
There are two techniques of string matching ,
1) Exact String Matching : For given two strings T and P and wants to find all
substrings of T that are equal to P. Formally, it calculates all indices i such that
𝑇[𝑖 + 𝑠] = 𝑃[𝑠] for each 0 ≤ 𝑠 ≤ |𝑃| − 1. The following algorithms are used
to find the exact substring matching, Needleman Wunsch (NW) [16], Smith
Waterman(SW) [15], Knuth Morris Pratt (KMP) [7], Dynamic Programming,
Boyer Moore Horspool (BMH) [13].
2) Approximate String Matching: Approximate string matching (fuzzy string
searching) is the technique of finding strings that match a pattern approximately
(rather than exactly). The problem of approximate string matching is typically
divided into two sub-problems: finding approximate substring matches inside a
given string and finding dictionary strings that match the pattern approximately
(Fuzzy string searching, Rabin Karp, Brute Force).
Various string matching algorithms were proposed by the researchers over the period
of time for solving the string matching problems. The following algorithms are widely
used in various string matching applications, wide window pattern matching,
approximate string matching, polymorphic string matching, string matching with
minimum mismatches, prefix matching, suffix matching, similarity measure, longest
common subsequence (dynamic programming algorithm), BMH, Brute Force, KMP,
Quick search, Rabin Karp [12].
We have used Gene Database for analyzing the similarity measurements by using
various kinds of string matching algorithms such as Boyer Moore (BM) algorithm [13],
NW algorithm, SW algorithm, Hamming Distance, Levenshtein Distance, Aho-
Corasick (AC) algorithm [12], KMP algorithm, Rabin Karp algorithm [14],
CommentZ-walter (CZW) algorithm. We have compared the proposed two methods
with these algorithms.
The remaining sections are organized as follows; section 2 provides a survey on basic
string matching algorithms with applications. Section 3 provides detailed view about
the proposed two string matching algorithms with examples. In section 4, we have
provided the result and discussion for the proposed algorithm with related string
matching algorithm. Section 5 gives the conclusion with remarks.
string matching and approximate string matching. Here we have considered only the
exact string matching algorithms.
Brute Force algorithm
The brute force string matching algorithm is a classical matching scheme and
this algorithm requires no preprocessing phase. The brute force algorithm consists in
checking, at all positions in the text between 0 and n-m, whether an occurrence of the
pattern starts there or not. The extracted patterns are compared one by one character.
The searching window is shifted exactly one position from right to left. The searching
may start from any order (left to right/right to left). The searching phase time
complexity is Ο(𝑚𝑛) and minimum 2𝑛 expected text character comparisons. This
following algorithm illustrates the steps in brute force algorithm,
FINITE-AUTOMATON-MATCHER(Text, 𝛿,m)
1. 𝑛 ← 𝑙𝑒𝑛𝑔𝑡ℎ[𝑇𝑒𝑥𝑡]
2. 𝑞←0
3. 𝐹𝑜𝑟 𝑖 ← 1 𝑡𝑜 𝑛
4. 𝑑𝑜
5. 𝑞 ← 𝛿(𝑞, 𝑇𝑒𝑥𝑡[𝑖])
6. 𝐼𝑓 𝑞 == 𝑚 𝑡ℎ𝑒𝑛
7. 𝑝𝑟𝑖𝑛𝑡 "𝑃𝑎𝑡𝑡𝑒𝑟𝑛 𝑜𝑐𝑐𝑢𝑟𝑠 𝑤𝑖𝑡ℎ 𝑠ℎ𝑖𝑓𝑡 𝑖 − 𝑚
Figure 2: Finite-Automation String Matching Algorithm
requires Ο(𝑛) time complexity. This algorithm needs minimum 2n text character
comparisons in the worst case.
1 𝑖𝑓 𝑥[𝑚 − 1] = 𝑎
𝑚−𝑖+1 𝑖𝑓 𝑥[𝑖]𝑥[𝑖 + 1] = 𝑏
𝑏𝑟𝐵𝑐[𝑎, 𝑏] = 𝑀𝑖𝑛 { }
𝑚+1 𝑖𝑓 𝑥[0] = 𝑏
𝑚+2 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
The preprocessing phase requires Ο(𝑚 + 𝜎 2 ) space and time complexity. The
searching phase of the Berry-Ravindran algorithm has a Ο(𝑚𝑛) time complexity.
Table 1: Proposed algorithm Accuracy for string matching with related algorithms
Execution
Algorithm Preprocessing String Matching Accuracy
Time
Accuracy
90.00%
80.00%
70.00%
60.00%
50.00%
40.00%
30.00%
20.00%
10.00% Accuracy
0.00%
BM
SW
BR
Raita
RF
DFA
AG
NW
AS
AC
Brute Force
Rabin-Karp
Morris-Pratt
Tuned BM
Colussi
Turbo-BM
Reverse Colussi
BM
SW
Raita
BR
RF
DFA
AG
NW
AS
AC
Rabin-Karp
Morris-Pratt
Colussi
Tuned BM
Turbo-BM
Reverse Colussi
Figure 2: Execution time comparison of String matching algorithms for Gene Dataset
2718 Jiji. N and Dr. T Mahalakshmi
IV. CONCLUSION
String matching algorithms are the most important research area in the field of content
retrieval and pattern searching. Most of the string matching algorithms are designed for
specific applications and the accuracy differs based on the dataset. In this paper, we
have made an extensive survey of exact string matching algorithms and basic working
principles. We have taken 18 different string matching algorithms for comparison. We
have considered the Gene Dataset for the experiment evaluation with 3481 characters
and the searching pattern size is same for all the related algorithms. Boyer-Moore [13]
string matching algorithm provides an accuracy of 83% and the execution time is ≈
84𝑚𝑠. Reverse Colussi [22] string matching algorithm provides an accuracy of 79%
and the execution time is ≈ 57𝑚𝑠. In future, researcher have a scope in the field of
designing an efficient algorithm for string matching/searching algorithm for Gene
Dataset, to reduce the execution time and to increase the accuracy.
REFERENCE
[1] Singla N., Garg D., “String Matching Algorithms and their Applicability in
various Applications,” International Journal of Soft Computing and
Engineering, ISSN: 2231-2307, Volume-I, Issue-6, January 2012
[2] Vidanagamachchi S.M., Dewasurendra S.D., Ragal B.G., Niranjan M.,
“Comment Z Walter: Any Better than Aho-Corasick for Peptide Sequence
Identification,” International Journal of Research in Computer Science,
eISSN:2249-8265, Vol 2, Issue 6(2012) PP 33-37
[3] http://www.ebi.ac.uk/Tools/psa/emboss_water/nucleotide.html
[4] Gomaa N.H., Fahmy A.A., “Short Answer Grading using String Similarity
and Corpus-Based Similarity,” International Journal of Advanced Computer
Science and Applications, Vol 3,No.11, 2012
[5] Jain P., Pandey S., “Comparative Study on Text Pattern Matching for
Heterogeneous System,” International Journal of Computer Science and
Engineering Technology, ISSN: 2229-3345, Vol.3 No.11 Nov 2012
[6] Alsmadi I., Nuser M., “String Matching Evaluation Methods for DNA
Comparisons,” International Journal of Advanced Science and Technology,
Vol.47, 2012
[7] Knuth D.E., Morris J.H., and Pratt V.R., “Fast Pattern Matching in Strings,”
Journal of Computing, Vol.6, No.2, 1977
[8] Hussain I., Kausar S., Hussain L., and Asifkhan M., “Improved Approach for
Exact Pattern Matching,” International Journal of Computer Science Issues,
Vol.10, Issue 3, No.1, 2013
[9] Amir A., Lewenstein M., and Porat E., “Faster Algorithms for String
Matching with K-Mismatches,” Journal of Algorithms 50(2004) 257-275
Survey of Exact String Matching Algorithm for Detecting Patterns in Protein Sequence 2719
[10] Yeh M, Cheng K.T, “A String Matching Approach for Visual Retrieval and
Classification,” in proceeding of the ACM SIGMOD 978-1-60558-312-
9/08/10
[11] Boytsov, Leonid, “Indexing methods for approximate dictionary searching:
Comparative analysis,” Journal of ACM. Vol. 16, No.1, pp.1–91, 2011
doi:10.1145/1963190.1963191
[12] Aho, Alfred V., Corasick, Margaret J, 1975, “Efficient string matching: An
aid to bibliographic search,” Communications of the ACM, Vol. 18, No.6,
pp.333–340, 1975.
[13] Boyer, Robert S., Moore, J Strother. “A Fast String Searching Algorithm,”
Comm. ACM. New York, NY, USA, 1977, Association for Computing
Machinery, Vol. 20, No.10, pp. 762–772
[14] Karp, Richard M., Rabin, Michael O, “Efficient randomized pattern-matching
algorithms,”1987, IBM Journal of Research and Development, Vol. 31, No.2,
pp. 249–260
[15] Smith, Temple F. & Waterman, Michael S, “Identification of Common
Molecular Subsequences,” 1981, Journal of Molecular Biology. Vol. 147,
pp.195–197
[16] Needleman, Saul B. & Wunsch, Christian D, “A general method applicable to
the search for similarities in the amino acid sequence of two proteins,” 1970,
Journal of Molecular Biology, Vol. 48, No.3, pp.443–453
[17] Crochemore, M., Hancart, C., 1997. Automata for Matching Patterns, in
Handbook of Formal Languages, Volume 2, Linear Modeling: Background
and Application, G. Rozenberg and A. Salomaa ed., Chapter 9, pp 399-462,
Springer-Verlag, Berlin
[18] Morris (Jr) J.H., Pratt V.R., 1970, A linear pattern-matching algorithm,
Technical Report 40, University of California, Berkeley
[19] Colussi L., 1991, Correctness and efficiency of the pattern matching
algorithms, Information and Computation 95(2):225-251
[20] Crochemore, M., Czumaj A., Gasieniec L., Jarominek S., Lecroq T.,
Plandowski W., Rytter W., 1992, Deux méthodes pour accélérer l'algorithme
de Boyer-Moore, in Théorie des Automates et Applications, Actes des 2e
Journées Franco-Belges, D. Krob ed., Rouen, France, 1991, pp 45-63, PUR
176, Rouen, France
[21] Hume A. and Sunday D.M., 1991. Fast string searching. Software - Practice
& Experience 21(11):1221-1248
[22] Colussi L., 1994, Fastest pattern matching in strings, Journal of Algorithms.
16(2):163-189
2720 Jiji. N and Dr. T Mahalakshmi
[23] Apostolico A., Giancarlo R., 1986, The Boyer-Moore-Galil string searching
strategies revisited, SIAM Journal on Computing 15(1):98-105
[24] Raita T., 1992, Tuning the Boyer-Moore-Horspool string searching
algorithm, Software - Practice & Experience, 22(10):879-884
[25] Lecroq T., 1992, A variation on the Boyer-Moore algorithm, Theoretical
Computer Science 92(1):119—144
[26] Berry, T., Ravindran, S., 1999, A fast string matching algorithm and
experimental results, in Proceedings of the Prague Stringology Club
Workshop`99, J. Holub and M. Simánek ed., Collaborative Report DC-99-05,
Czech Technical University, Prague, Czech Republic, 1999, pp 16-26
[27] Charras C., Lecroq T., Pehoushek J.D., 1998, A very fast string matching
algorithm for small alphabets and long patterns, in Proceedings of the 9th
Annual Symposium on Combinatorial Pattern Matching , M. Farach-Colton
ed., Piscataway, New Jersey, Lecture Notes in Computer Science 1448, pp
55-64, Springer-Verlag, Berlin