Approximate String Matching For Music Retrieval System
Seminar Report
Submitted in partial fulfillment of the degree of
Master of Technology
by
Pankaj E. Gulhane
03329007
Kanwal Rekhi School of Information Technology
1 Introduction
Today, users can access large amounts of music data, so it is strongly
required to provide them with the ability to search music by content.
Conventional keyword-based methods attach a set of keywords to a music
file, and users are supposed to pose queries against these keywords. We
want a method that enables users to search based on the contents of a
music file, and the most natural way to search a song database is by
humming a fragment of a song.
In a Melody Information Retrieval (MIR) system [6], there is a melody
database and a mechanism to query this database. When a user wants to
fetch some melody, (s)he fires a query, typically a few notes sung or
hummed by the user. The system processes this query, tries to extract
melody information [4], and represents it in some standard form. The
database is searched for this melody information (in standard form), and
the nearest match is returned. So the major modules here are extraction
of melody information and a similarity-calculating function. The
functional part of MIR is shown in figure 1.
• 3-level: U/D/S, indicating that the pitch goes Up, Down, or remains the Same.
[Figure 1: The functional part of MIR — an acoustic query captured by a
microphone and sound card goes through pitch detection, is matched
against the melody database, and a ranked listing is returned.]
These levels decide the accuracy of the search (e.g., TaNSEN [7] uses 5
levels). Using the above representation we can get a melody string
associated with some music; similarly, we can get the user query in this
string format. The only job remaining is finding the closest match
between a query and a melody string.
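For concreteness, the mapping from a pitch sequence to a 3-level contour
string can be sketched as follows; this is a minimal illustration (the
function name and the MIDI-note input are assumptions, not details taken
from TaNSEN):

```python
def contour(pitches):
    """Map a sequence of pitch values (e.g. MIDI note numbers) to a
    3-level melody string: U (up), D (down), S (same)."""
    out = []
    for prev, cur in zip(pitches, pitches[1:]):
        out.append("U" if cur > prev else "D" if cur < prev else "S")
    return "".join(out)
```

A hummed query would be run through the same mapping, so both the
database melody and the query end up as strings over the same alphabet.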
2 Problem
We are given a music string and a query string. A music string is a
string associated with some song (i.e. a string representing the music
information present in that song), whereas a query string is a string
generated from the user query. These strings can have any, but the same,
alphabet size; for TaNSEN it is five, which can be changed.
So the algorithm should
1. allow us to change the weights associated with operations.
2.1 Notations
From now on, the query string will be called the pattern P1...m and the
music string the text T1...n . ed(x, y) will be the edit distance between
strings x and y, and k will be the maximum allowed error. Possible
operations are insertion, deletion, and substitution.
we have to perform in converting one string to another. That is why many
methods assign equal weights to all operations.
1. Dynamic programming
This is the oldest method for finding edit distance, and will be
discussed at length in this paper.
2. Automaton
In this method, we model the problem as a nondeterministic finite
automaton (NFA). Operations like insert, delete, and substitute change
the state of the system.
3. Filters
Filtering is based on the fact that it may be easier to tell that a
text position does not match. Filter algorithms discard areas of the
text that cannot contain a match. These algorithms also use exact
string matching, since exact string matching algorithms are faster than
approximate string matching algorithms.
4. Bit-parallelism
These algorithms exploit the parallelism of a computer when it works on
bits. Many algorithms in this category are modified implementations of
DP algorithms.
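To make the bit-parallel idea concrete, below is the classic Shift-And
algorithm for exact matching, the kind of bit-vector kernel that
approximate variants such as Myers's algorithm [8] build on; this is an
illustrative sketch, not code from any of the surveyed papers:

```python
def shift_and(text, pattern):
    """Exact matching with bit-parallelism: bit i of the state R is set
    iff pattern[0..i] matches the text ending at the current position."""
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    R, accept, out = 0, 1 << (len(pattern) - 1), []
    for pos, c in enumerate(text):
        # shift in a fresh match attempt, keep only bits whose char matches
        R = ((R << 1) | 1) & masks.get(c, 0)
        if R & accept:
            out.append(pos - len(pattern) + 1)  # start of an occurrence
    return out
```

Each text character is processed with a constant number of word
operations, which is where the speedup over character-by-character
comparison comes from.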
4 Dynamic programming(DP)
This is the oldest method for calculating edit distance. In this area
there are two flavours of algorithms: those with good average-case time
and those with good worst-case time complexity. Good worst-case time
algorithms are generally difficult to implement; good average-case time
algorithms are simple and mostly work out pretty well.
To simplify the analysis, we will assume that all operations cost 1,
unless otherwise specified.
Computing edit distance
Let w(α, β) represent the cost of the operation transforming α into β.
Let D(A, B) denote the distance between two strings A and B. Here D
should have the following properties [6]:

D(A, B) ≥ 0
D(A, B) = D(B, A)
D(A, B) ≤ D(A, C) + D(C, B)

With unit costs, the edit distance is given by

Di,0 = i
D0,j = j
Di,j = if (xi = yj ) then Di−1,j−1
       else 1 + min(Di−1,j , Di,j−1 , Di−1,j−1 )        (1)

The rationale behind formula 1 is as follows. First, Di,0 and D0,j
represent the edit distance between a string of length i or j and the
empty string; clearly i (respectively j) deletions are needed on the
nonempty string. For two nonempty strings of length i and j, we assume
inductively that all the edit distances between shorter strings have
already been computed, and try to convert x1...i into y1...j . Here we
can see that to calculate the whole matrix we need |x||y| operations.
The space complexity can be reduced to O(m), as we need only one column
at a time.
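Recurrence (1) translates directly into code; a minimal sketch (the name
ed follows the notation above):

```python
def ed(x, y):
    """Edit distance with unit-cost insertion, deletion and substitution,
    filling the full (m+1) x (n+1) matrix of recurrence (1)."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # i deletions against the empty string
    for j in range(n + 1):
        D[0][j] = j          # j insertions against the empty string
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                D[i][j] = D[i - 1][j - 1]
            else:
                D[i][j] = 1 + min(D[i - 1][j],      # deletion
                                  D[i][j - 1],      # insertion
                                  D[i - 1][j - 1])  # substitution
    return D[m][n]
```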
Example
Let us assume that the costs of all operations are the same and equal to
one. The given text string is duddddu and the pattern string is dudddU.
Note that the first row is initialized to zero (not to j) because, when
searching for the pattern in a text, a match may start at any text
position.

Table 1: DP matrix for text duddddu and pattern dudddU.

      d  u  d  d  d  d  u
   0  0  0  0  0  0  0  0
d  1  0  1  0  0  0  0  1
u  2  1  0  1  1  1  1  0
d  3  2  1  0  1  1  1  1
d  4  3  2  1  0  1  1  2
d  5  4  3  2  1  0  1  2
U  6  5  4  3  2  1  1  2
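The matrix above can be reproduced with the search variant of the DP, in
which row 0 is all zeros; the function name below is chosen for
illustration:

```python
def search_matrix(text, pattern):
    """DP matrix for approximate search: D[i][j] is the minimum edit
    distance between pattern[:i] and some substring of text ending at j.
    Row 0 is zero because a match may start anywhere in the text."""
    m, n = len(pattern), len(text)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i
        for j in range(1, n + 1):
            if pattern[i - 1] == text[j - 1]:
                D[i][j] = D[i - 1][j - 1]
            else:
                D[i][j] = 1 + min(D[i - 1][j], D[i][j - 1],
                                  D[i - 1][j - 1])
    return D
```

The last row of search_matrix("duddddu", "dudddU") is exactly the bottom
row of the table, and its minimum (1) is the best match distance.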
The complexity of the algorithm can be calculated by looking at table 1.
As there are m ∗ n entries in the table, we have to calculate that many
terms. Looking at equation 1, we can see that only the previous column is
used for the calculation of a new column, which gives a space complexity
of O(m).
The Landau–Vishkin (LV) algorithm computes, for each diagonal d and
error count e, the furthest row Ld,e reachable along that diagonal:

1: for 0 ≤ d ≤ n + 1 do
2:   Set Ld,−1 = −1
3: end for
4: for −(k + 1) ≤ d ≤ −1 do
5:   Set Ld,|d|−2 = −∞
6:   Set Ld,|d|−1 = |d| − 1
7: end for
8: for e = 0 to k do
9:   for d = −e to n do
10:    Set maxRow = min(max(Ld−1,e−1 , Ld,e−1 + 1, Ld+1,e−1 + 1), m)
11:    Extend maxRow while PmaxRow+1 = TmaxRow+d+1 , i.e. calculate the
       length of the run on diagonal d having constant error e
12:    Assign this length to Ld,e
13:    if Ld,e = m then
14:      Match found ending at td+m
15:    end if
16:  end for
17: end for
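A direct transcription of the pseudocode into Python might look like the
following; this is an illustrative sketch (0-indexed, with a dictionary
for the L table), not the original LV implementation:

```python
def landau_vishkin(text, pattern, k):
    """Report end positions in text where pattern occurs with at most
    k errors.  L[(d, e)] is the furthest pattern row matched on diagonal
    d (pattern[i] aligned with text[i + d]) using exactly e errors."""
    n, m = len(text), len(pattern)
    NEG = float("-inf")
    L = {}
    for d in range(0, n + 2):
        L[(d, -1)] = -1
    for d in range(-(k + 1), 0):
        L[(d, abs(d) - 2)] = NEG
        L[(d, abs(d) - 1)] = abs(d) - 1
    matches = set()
    for e in range(0, k + 1):
        for d in range(-e, n + 1):
            row = max(L.get((d - 1, e - 1), NEG),
                      L.get((d, e - 1), NEG) + 1,
                      L.get((d + 1, e - 1), NEG) + 1)
            row = min(row, m)
            # slide down the diagonal while characters keep matching
            while row < m and row + d < n and pattern[row] == text[row + d]:
                row += 1
            L[(d, e)] = row
            if row == m and d + m <= n:
                matches.add(d + m)  # match ends just before text[d + m]
    return sorted(matches)
```

Each diagonal is extended at most once per error level, which is where
the O(kn) running time comes from.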
strings). Doing this puts every common substring in an internal node,
and the length of a common substring can be found in constant time.
Different weights can be assigned to each operation by changing the way
we calculate Ld,e .
2. k is smaller than

k* = m/(logb (m) + c1 ) − c2                                  (2)

Lemma
For suitably chosen c1 and c2 , P[Sk*+3 − S1 ≥ m − k*] ≤ 1/m^3
Proof
For this proof, we will assume that b = 2 (b > 2 gives slightly smaller
ci ).
Let Xj be the random variable Sj+1 − Sj . Note that the Xj 's are
i.i.d., and

Sk*+3 − S1 = Σj=1..k*+2 Xj

Suppose T[p, · · · , p + d − 1] matches the pattern P. There are m
different words of length lg(m) + d in the pattern, out of m·2^d
different possible strings; hence, letting D be the random variable
taking value d, P[D ≥ d] ≤ 2^−d for all d > 0, and

E[X1 ] = E[lg(m) + D]
       = lg(m) + E[D]
       < lg(m) + 2                                            (3)

Define

Yi = Xi − (m − k*)/(k* + 2)

Since

m/(k* + 2) > m/(k* + c2 )       for any c2 > 2                (4)

we get

Yi = Xi − m/(k* + 2) + k*/(k* + 2)
   < Xi − m/(k* + c2 ) + 1
   = Xi − lg(m) − c1 + 1        from Eq. 2
   < Xi − lg(m) − 2             for c1 > 3                    (5)

Now, for any t > 0, by Markov's inequality and independence of the Yi ,

P[Sk*+3 − S1 ≥ m − k*] = P[Y1 + Y2 + · · · + Yk*+2 ≥ 0]
                       ≤ E[e^{t·Y1}]^{k*+2}                   (6)

Since X1 = lg(m) + D with P[D = d] ≤ 2^−d for any d ≥ 0,

E[e^{t·Y1}] < Σd=0..∞ e^{t·lg(m) + t·d − t(m−k*)/(k*+2)} · 2^−d
            = e^{t(lg(m) − (m−k*)/(k*+2))} Σd=0..∞ e^{t·d} · 2^−d

Now select t = (loge 2)/2:

E[e^{t·Y1}] < 2^{(lg(m) − (m−k*)/(k*+2))/2} Σd=0..∞ 2^−d/2

The geometric sum is 1/(1 − 2^−1/2 ) < 2^1.8 , so

E[e^{t·Y1}]^{k*+2} < 2^{((k*+2)·lg(m) − (m−k*) + 3.6(k*+2))/2}

Substituting m = (k* + c2 )(lg(m) + c1 ) from Eq. 2 and using
k* + 3.6(k*+2) < 4.6(k*+2),

E[e^{t·Y1}]^{k*+2} < 2^{((4.6−c1 )(k*+2) − (c2 −2)(c1 +lg(m)))/2}   (7)
5 Other DP Algorithms
5.1 Hariharan: a simpler faster algorithm [2]
This is a good worst-case time algorithm and the fastest till date. It
uses a filtering approach. The only problem is that it is complex, with
many restrictions, and it has different complexities for periodic and
aperiodic patterns. This algorithm can give better performance for
longer pattern lengths (> 5k^3 ), but for smaller pattern lengths it is
better to use the LV algorithm.
5.2 Other promising methods
5.2.1 Filtering Method [9]
Filtering is based on the fact that it may be easier to tell that
strings do not match than to tell that they match. A filtering method
reduces the search space by discarding substrings where we cannot get a
match. For example, if we have to find a match with one error, we can
search for either half of the pattern string: if both halves are absent,
there is no way to get a match with one error, and we can skip that
region. Most filtering methods take advantage of exact string matching
algorithms, as those are faster than approximate string matching. This
approach is very useful for long patterns.
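The idea can be sketched as a pattern-partitioning filter: split the
pattern into k + 1 pieces, locate the pieces exactly, and verify only
the neighbourhoods of exact hits with the DP. Function names here are
hypothetical, and the sketch assumes the pattern has at least k + 1
characters:

```python
def occurs_within(window, pattern, k):
    """True if pattern matches some substring of window with <= k errors
    (search DP with row 0 all zeros, kept in O(m) space per column)."""
    m, n = len(pattern), len(window)
    prev = [0] * (n + 1)
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            if pattern[i - 1] == window[j - 1]:
                cur[j] = prev[j - 1]
            else:
                cur[j] = 1 + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return min(prev) <= k

def filter_search(text, pattern, k):
    """If pattern occurs with <= k errors, at least one of its k + 1
    pieces occurs exactly; exact hits seed candidate windows."""
    m = len(pattern)
    piece = m // (k + 1)
    starts = set()
    for i in range(k + 1):
        lo = i * piece
        hi = m if i == k else lo + piece
        j = text.find(pattern[lo:hi])
        while j != -1:
            starts.add(max(0, j - lo - k))  # candidate window start
            j = text.find(pattern[lo:hi], j + 1)
    return sorted(s for s in starts
                  if occurs_within(text[s:s + m + 2 * k], pattern, k))
```

Only the windows around exact piece hits are verified, so on most texts
the expensive DP runs rarely, which is exactly the filtering speedup
described above.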
6 Conclusion
We saw three major algorithms, each discarding some unimportant
information. Levenshtein distance is the most basic and the most
flexible in terms of assigning weights. The time complexity of this
algorithm is O(mn), so it may not be useful when comparing long melody
and query strings. It is surely better when strings are short and we
want to experiment with the weights assigned to operations.
The Landau-Vishkin algorithm is a major breakthrough in the field of
dynamic programming. It works on the monotonically non-decreasing error
level along each diagonal. It is faster than the previous one and works
in time O(kn). Deciding the right value of k is very important in this
algorithm.
Expected linear time algorithms are useful only when the pattern is
large enough; e.g., for c1 = 4.6, c2 = 8.0, k = 3, |Σ| = 5, we need the
pattern length to be more than 32 characters. Systems with larger
alphabets need a smaller m. The restriction on pattern length is not a
problem as such, since most patterns will be of this length; but if the
pattern length is less than this (here, 32), the algorithm boils down to
LV. Here we have improved the average-case time, but the worst-case time
remains the same as that of the LV algorithm, O(kn).
References
[1] William Chang and E. L. Lawler. Approximate string matching in
sublinear time. Technical report, Computer Science Division,
U.C. Berkeley, 1990.
[7] M. Anand Raju, Bharat Sundaram, and Preeti Rao. TaNSEN: A
query-by-humming based music retrieval system. Technical report,
Department of Electrical Engineering, IIT Bombay, 2003.
[8] Gene Myers. A fast bit-vector algorithm for approximate string
matching based on dynamic programming. Journal of the ACM (JACM),
46(3):395–415, 1999.
[10] Anand Raju and Preeti Rao. Building a melody retrieval system.
Technical report, Department of Electrical Engineering, IIT Bombay,
2002.
[11] Gad M. Landau and Uzi Vishkin. Introducing efficient parallelism
into approximate string matching and a new serial algorithm. Technical
report, Tel Aviv University, 1986.