
Approximate String Matching

for Music Retrieval System

Seminar Report
Submitted in partial fulfillment of the degree of
Master of Technology
by

Pankaj E. Gulhane
03329007
Kanwal Rekhi School of Information Technology

Under the guidance of


Prof. Preeti Rao
Department of Electrical Engineering and
Advanced Centre for Research in Electronics

Prof. Kavi Arya


Department of Information Technology

Indian Institute of Technology, Powai,


Mumbai
Abstract
Music databases are growing rapidly, and finding a particular song
in such a database is no longer a trivial task. Tagging each song
with a singer name, year, etc. does not help much. A more natural way
of searching is query by humming (QBH). Current QBH methods
represent both a song and a query in some string format and output
a match by comparing these strings. Since the comparison must
tolerate some errors, approximate string matching is used.
This paper surveys various approximate string matching (ASM)
algorithms and methods. Some dynamic programming algorithms are
explained which bring the complexity of ASM down from quadratic
time to sublinear expected time.

1 Introduction
Today users can access large collections of music, so it is strongly
desirable to let them search music by content. Conventional keyword-based
methods attach a set of keywords to a music file, and users are supposed
to pose queries against these keywords. We want a method that lets the
user search based on the contents of a music file, and the most natural
way to search a song database is by humming a fragment of a song.
A Melody Information Retrieval (MIR) system [6] consists of a melody
database and a mechanism to query it. When users want to fetch some
melody, they fire a query, typically a few notes sung or hummed. The
system processes this query, tries to extract melody information [4],
and represents it in some standard form. The database is searched for
this melody information (in standard form), and the nearest match is
returned. The major modules here are the extraction of melody
information and a similarity-calculating function. The functional
parts of MIR are shown in Figure 1.

1.1 Pitch detection


Pitch is one of the fundamental attributes of music. A melody is a
sequence of pitches and durations, and the variation of pitches can
characterise it. This variation is called the pitch contour [10], and
it is invariant to key transposition.
There are many pitch contour representations. A few are listed below:

• 3 Level: U/D/S, indicating that the pitch goes Up, goes Down, or remains the Same.

• 5 Level: U/u/S/d/D, levels that are more fine-grained than the above.

[Figure 1: an acoustic query captured through a microphone and sound
card passes through pitch detection and note segmentation to produce
a contour; the search engine (which uses approximate string matching)
compares it against the melody database and returns a ranked listing.]

Figure 1: Basic blocks of melody retrieving system

These levels decide the accuracy of the search (e.g., TaNSEN [7] uses
5 levels). Using the above representation we can get a melody string
associated with a piece of music; similarly, we can put the user query
into this string format. The only job remaining is finding the closest
match between a query string and a melody string.
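The contour step above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the report: the function name, the raw-pitch input in Hz, and the 50-cent threshold for "Same" are all assumptions.

```python
import math

def contour_3level(pitches, same_cents=50.0):
    """Turn a pitch track (Hz) into a 3-level U/D/S contour string.

    A step is 'S' (Same) when the pitch stays within +/- same_cents of
    the previous note, 'U' when it rises further, 'D' when it falls.
    The 50-cent threshold is an assumed value, not one from the report.
    """
    out = []
    for prev, cur in zip(pitches, pitches[1:]):
        cents = 1200.0 * math.log2(cur / prev)  # interval in cents
        if abs(cents) < same_cents:
            out.append('S')
        elif cents > 0:
            out.append('U')
        else:
            out.append('D')
    return ''.join(out)

# A rising-then-level-then-falling hum (A3 -> B3 -> B3 -> A3):
print(contour_3level([220.0, 246.9, 246.9, 220.0]))  # → USD
```

A 5-level variant would only add a second, larger threshold separating u from U and d from D.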

1.2 String Matching


The database of songs is indexed by melody strings. The user query is
compared with these entries, and a match is returned. The user may not
hum or sing exactly: he or she may add, delete, or replace some notes,
and these issues need to be taken care of while comparing the two
strings. This can be done well using approximate string matching.
The approximate string matching module is very important, as it
decides the accuracy and efficiency of the system. It has to respond
quickly when working with large databases.
This paper presents some algorithms that reduce the runtime
complexity of approximate string matching. It also discusses some
implemented methods and other promising methods.

2 Problem
We are given a music string and a query string. A music string is a
string associated with a song (i.e., a string representing the music
information present in that song), whereas a query string is generated
from the user query. These strings can be over an alphabet of any
size, provided both use the same one; for TaNSEN it is five, which
can be changed.
So the algorithm should

1. allow us to change the weights associated with operations, and

2. be simple and fast.

2.1 Notations
From now on, the query string will be called the pattern P1...m and the
music string the text T1...n. ed(x, y) denotes the edit distance between
strings x and y, and k is the maximum allowed error. The possible
operations are insertion, deletion, and substitution.

3 Approximate string matching(ASM)


Given two strings, we have to find out how many operations are needed
to convert one string into the other. Every insert, delete, and replace
operation is associated with some cost, and the total cost of converting
one string into the other is called the distance between the two
strings. There are various distance functions; a few of them are listed
below [9].

• Levenshtein or edit distance: allows insertions, deletions, and substitutions (k differences).

• Hamming distance: allows only substitutions (k mismatches).

• Episode distance: allows only insertions.

• Longest common subsequence distance: allows only insertions and deletions.

In the MIR context the user can make mistakes (insertion, deletion, or
substitution of a note) while querying, so edit distance is the better choice.

How is it different from general ASM?


In normal ASM, characters are compared exactly. If two characters are
unequal, one is replaced with the other without regard to the distance
between them; i.e., the cost is independent of the characters being
replaced. This is not the case with melody string comparison, where
the characters are taken into account: the same operation on two
different pairs of characters may incur different costs.
Most of the time, however, the emphasis is on the number of operations
needed to convert one string into the other, which is why many methods
assign equal weights to all operations.

3.1 Taxonomy of algorithms [9]


The approximate string matching problem can be attacked in various ways:

1. Dynamic programming
This is the oldest method for finding edit distance and will be
discussed at length in this paper.

2. Automaton
In this method, we model the problem as a nondeterministic finite
automaton (NFA). Operations like insert, delete, and substitute
change the state of the automaton.

3. Filters
Filtering is based on the fact that it may be easier to tell that a
text position does not match. Filter algorithms discard areas of the
text that cannot contain a match. These algorithms also use exact
string matching, whose algorithms are faster than approximate string
matching algorithms.

4. Bit-parallelism
These algorithms exploit the parallelism of a computer when it works
on machine words of bits. Many algorithms in this category are
modified implementations of DP algorithms.

4 Dynamic programming(DP)
This is the oldest method for calculating edit distance. There are two
flavours of algorithms in this area: those with good average case time
and those with good worst case time complexity. Algorithms with good
worst case time are generally difficult to implement; algorithms with
good average case time are simple and mostly work out pretty well.
To simplify the analysis, we will assume that all operations cost 1
unless specified otherwise.

4.1 The General Method [9]


This is the most common method for finding the similarity between two
strings. It is not very efficient, but it is the most flexible in
adapting to various distance functions.

Computing edit distance
Let w(α, β) represent the cost of the operation transforming α into β,
and let D(A, B) denote the distance between two strings A and B. D
should have the following properties [6]:

D(A, B) ≥ 0
D(A, B) = D(B, A)
D(A, B) ≤ D(A, C) + D(C, B)

The matrix entry Di,j represents the minimum number of operations
needed to match P1...i to a substring of T ending at position j (for
the classic two-string distance, D|x|,|y| = ed(x, y)).

Di,0 = i
D0,j = 0
Di,j = Di−1,j−1                                if Pi = Tj
Di,j = 1 + min(Di−1,j , Di,j−1 , Di−1,j−1 )    otherwise        (1)

The rationale behind formula (1) is as follows. Di,0 = i because i
deletions are needed to convert P1...i into the empty string, while
D0,j = 0 because the empty pattern matches at any text position with
no error; this is what allows a match to begin anywhere in the text
(note the row of zeros in Table 1). For two nonempty prefixes of
lengths i and j, we assume inductively that all edit distances between
shorter prefixes have already been computed, and take the cheapest of
a deletion, an insertion, and a substitution. Computing the whole
matrix requires mn operations, and the space complexity can be reduced
to O(m) as only one column is needed at a time.

Example
Let us assume that all operations have the same cost, equal to one.
The given text string is duddddu and the pattern string is dudddU.

          d   u   d   d   d   d   u
      0   0   0   0   0   0   0   0
  d   1   0   1   0   0   0   0   1
  u   2   1   0   1   1   1   1   0
  d   3   2   1   0   1   1   1   1
  d   4   3   2   1   0   1   1   2
  d   5   4   3   2   1   0   1   2
  U   6   5   4   3   2   1   1   2

Table 1: Dynamic programming

The complexity of the algorithm can be read off Table 1. As there are
m ∗ n entries in the table, we have to compute that many terms. Looking
at equation (1), only the previous column is used in the calculation of
a new column, which gives a space complexity of O(m).

Time complexity O(mn)


Space complexity O(m)
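As a sketch, recurrence (1) and Table 1 can be reproduced directly in Python. The `cost` parameter is a hypothetical hook for the character-dependent weights discussed in Section 3; by default every operation costs 1, as assumed above.

```python
def dp_search(pattern, text, cost=lambda a, b: 1):
    """Fill the matrix of equation (1): D[i][0] = i, D[0][j] = 0,
    so a match may end at any text position.  `cost` allows different
    weights per character pair (unit cost by default)."""
    m, n = len(pattern), len(text)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pattern[i - 1] == text[j - 1]:
                D[i][j] = D[i - 1][j - 1]
            else:
                D[i][j] = min(
                    D[i - 1][j] + cost(pattern[i - 1], None),             # deletion
                    D[i][j - 1] + cost(None, text[j - 1]),                # insertion
                    D[i - 1][j - 1] + cost(pattern[i - 1], text[j - 1]),  # substitution
                )
    return D

D = dp_search("dudddU", "duddddu")
print(D[6])           # → [6, 5, 4, 3, 2, 1, 1, 2], the last row of Table 1
print(min(D[6][1:]))  # → 1: the pattern occurs in the text with one error
```

Keeping only one column at a time gives the O(m) space bound stated above.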

4.2 Landau and Vishkin(LV) [5]


This is a diagonal transition algorithm. Diagonal transition algorithms
are based on the fact that the error along a diagonal of the matrix is
monotonically non-decreasing. The method computes the edit distance
matrix D by diagonals: let d be a diagonal index and e an error count
(number of differences). Ld,e denotes the largest row i such that the
entry of D on diagonal d in row i equals e; that is, P1...Ld,e matches
some text substring ending at TLd,e +d with e differences.

1: for 0 ≤ d ≤ n + 1 do
2:   Set Ld,−1 = −1
3: end for
4: for −(k + 1) ≤ d ≤ −1 do
5:   Set Ld,|d|−2 = −∞
6:   Set Ld,|d|−1 = |d| − 1
7: end for
8: for e = 0 to k do
9:   for d = −e to n do
10:     Set maxRow = max (Ld−1,e−1 , Ld,e−1 + 1, Ld+1,e−1 + 1)
11:     Extend maxRow along diagonal d while the characters match
12:     Assign the extended row to Ld,e
13:     if Ld,e = m then
14:       Match found ending at TLd,e +d
15:     end if
16:   end for
17: end for

Figure 2: Landau Vishkin Algorithm

From Figure 2 we can infer that the complexity is O(kn), provided the
length of each diagonal extension can be calculated in constant time.
A suffix tree can be used for this [11].
A suffix tree is a data structure that represents a string in tree
form. The out-degree of any internal node is more than one; internal
nodes represent substrings common to the suffixes below them, and each
leaf represents a suffix of the string. Here we build the suffix tree
of P·T (the concatenation of the P and T strings). Doing so puts every
common substring at an internal node, and the length of a common
substring can then be found in constant time.
Different weights can be assigned to each operation by changing the
way we calculate Ld,e :

maxRow = max(Ld−1,e−insertCost , Ld,e−replaceCost + 1, Ld+1,e−deleteCost + 1)

Time complexity O(kn)


Space complexity O(n)
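A minimal runnable sketch of the scheme in Figure 2 follows. The constant-time suffix-tree jump is replaced here by a plain character-by-character loop, so the O(kn) bound holds only on average; the function name and the (end position, errors) output format are my own.

```python
def landau_vishkin(pattern, text, k):
    """Diagonal-transition search for all matches with at most k
    differences.  L[(d, e)] is the furthest pattern row reached on
    diagonal d (text column = row + d) with e errors, as in Figure 2.
    The suffix-tree jump is replaced by a plain while loop, so this
    sketch is O(kn) only in the expected case."""
    m, n = len(pattern), len(text)
    NEG = -10 ** 9
    L = {}
    matches = []
    for e in range(k + 1):
        for d in range(n + 1):
            row = max(L.get((d - 1, e - 1), NEG),      # insertion
                      L.get((d, e - 1), NEG) + 1,      # substitution
                      L.get((d + 1, e - 1), NEG) + 1,  # deletion
                      0)                               # a match may start at any column
            row = min(row, m)
            while row < m and row + d < n and pattern[row] == text[row + d]:
                row += 1                               # extend the run of exact matches
            L[(d, e)] = row
            if row == m and d + m <= n:
                matches.append((d + m, e))             # (end position in text, errors)
    return matches

print(landau_vishkin("dudddU", "duddddu", 1))  # → [(6, 1)]
```

The single reported match says "dudddU" fits the first six text letters with one difference (the U/d substitution), consistent with Table 1.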

4.3 Extension to the LV algorithm [Chang and Lawler] [1]


4.3.1 Linear expected time algorithm
Here is an algorithm (Figure 3) which runs in O(n) expected time, provided that

1. T is a uniformly distributed random string of length n over a finite alphabet, and

2. k is smaller than

k∗ = m/(logb m + c1 ) − c2        (2)

where c1 , c2 are constants and b is the alphabet size.

Let M (T, P )i denote the length of the longest substring of the
pattern P matching the text T starting at position i. M (T, P )i can
be calculated using a suffix tree.

1: Set S1 = 1 {Sj denotes the beginning of the j-th maximal common substring}
2: for each j > 0 do
3:   Sj+1 = Sj + M (T, P )Sj + 1 {the (j + 1)-st start position is found by
     taking the maximal jump at Sj plus one more letter}
4:   if Sj+k+2 − Sj ≥ m − k then
5:     Apply the Landau-Vishkin algorithm to T [Sj · · · Sj+k+2 − 1]
6:   end if
7: end for

Figure 3: Linear expected time algorithm
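The scan of Figure 3 might be sketched as below. M (T, P )i is computed naively here (Chang and Lawler obtain it in amortized constant time from a suffix tree of the pattern), and `verify` stands in for any k-differences checker such as Landau-Vishkin; all names are illustrative.

```python
def matching_statistics_jump(text, pattern, k, verify):
    """Sketch of the Figure 3 scan.  M(i) is the length of the longest
    substring of `pattern` matching `text` starting at position i,
    computed naively here (quadratic per call; Chang and Lawler get it
    in amortized O(1) from a suffix tree of the pattern).  `verify` is
    any k-differences checker, run only on surviving stretches."""
    def M(i):
        best = 0
        for s in range(len(pattern)):
            l = 0
            while (i + l < len(text) and s + l < len(pattern)
                   and text[i + l] == pattern[s + l]):
                l += 1
            best = max(best, l)
        return best

    m = len(pattern)
    S = [0]                                  # S[j] = start of the j-th jump
    while S[-1] < len(text):
        S.append(S[-1] + M(S[-1]) + 1)       # maximal jump plus one more letter
    hits = []
    for j in range(len(S) - (k + 2)):
        if S[j + k + 2] - S[j] >= m - k:     # enough pattern material survived
            hits.extend(verify(text[S[j]:S[j + k + 2]]))
    return hits

cands = matching_statistics_jump("qqqqabcdeqqqq", "abcde", 0,
                                 verify=lambda s: [s] if "abcde" in s else [])
print(cands)  # → ['qabcdeq', 'abcdeqq']
```

On random text the jumps are long and almost no stretch passes the test, which is what the probabilistic argument below makes precise.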

Next we have to show that the probability of having to do work at any Sj is small.

Lemma
For suitably chosen c1 and c2 , P [Sk∗ +3 − S1 ≥ m − k∗ ] ≤ 1/m³

Proof
For this proof we will assume that b = 2 (b > 2 gives slightly smaller ci ).
Let Xj be the random variable Sj+1 − Sj , and note that the Xj are i.i.d. with

Sk∗ +3 − S1 = Σj=1..k∗+2 Xj .

Suppose T [p, · · · , p + d − 1] matches a substring of the pattern P.
There are m different words of length lg(m) + d in P, out of m·2^d
possible strings of that length, so for all d > 0

P [X1 = lg(m) + d] < 2^−d.

Let D be the random variable taking value d. Then

E[X1 ] = E[lg(m) + D]
       = lg(m) + E[D]
       < lg(m) + 2        (3)

In equation (3), E[D] < 2 because P [D = d] < 2^−d.
Now we derive P [X1 + · · · + Xk∗+2 > m − k∗ ]. Define

Yi = Xi − (m − k∗ )/(k∗ + 2).

Since (m − k∗ )/(k∗ + 2) > (m − k∗ )/(k∗ + c2 ) for any c2 > 2,        (4)

Yi < Xi − (m − k∗ )/(k∗ + c2 )
   = Xi − m/(k∗ + c2 ) + k∗ /(k∗ + c2 )
   < Xi − m/(k∗ + c2 ) + 1
   = Xi − lg(m) − c1 + 1        from Eq. (2)
   < Xi − lg(m) − 2             for c1 > 3        (5)

Together with (3), this implies E[Yi ] < 0, and for any t > 0

P [X1 + X2 + · · · + Xk∗+2 > m − k∗ ]
  = P [Y1 + Y2 + · · · + Yk∗+2 ≥ 0]
  ≤ E[e^{tY1}]^{k∗+2}        (6)

by Markov's inequality applied to e^{t(Y1 +···+Yk∗+2 )}. For any d ≥ 0,

P [Y1 = lg(m) + d − (m − k∗ )/(k∗ + 2)] < 2^−d,

so

E[e^{tY1}] < Σd=0..∞ e^{t·lg(m) + td − t(m−k∗)/(k∗+2)} · 2^−d
           = e^{t(lg(m) − (m−k∗)/(k∗+2))} Σd=0..∞ e^{td} · 2^−d.

Now select t = (loge 2)/2, so that e^{tx} = 2^{x/2}:

E[e^{tY1}] < 2^{(lg(m) − (m−k∗)/(k∗+2))/2} Σd=0..∞ 2^{−d/2}
           < 2^{(lg(m) − (m−k∗)/(k∗+2) + 3.6)/2}

since Σd 2^{−d/2} < 2^{1.8}. Raising to the power k∗ + 2,

E[e^{tY1}]^{k∗+2} < 2^{((k∗+2)(lg(m) + 3.6) − (m − k∗))/2}
                  < 2^{((4.6 − c1)(k∗+2) − (c2 − 2)(c1 + lg(m)))/2}        (7)

If we put c1 = 4.6 and c2 = 8 in equation (7), its value is less than
1/m³. From this we can infer that the expected work done at start
position S1 is O(1). The same applies to all Sj , hence the total work
done is O(n).


4.3.2 SubLinear expected time algorithm [1]


We can get sublinear expected time provided k < k∗ /2 − 3. We partition
the text into regions of length (m − k)/2; if the pattern matches a
substring of the text, the match has to contain one complete region.
Starting from the left end of each region R, we calculate k + 1 jumps,
ending at some point p. If p lies within the same region R, there can
be no match. If p lies beyond R, we apply the LV algorithm to a stretch
beginning (m + 3k)/2 letters to the left of the region and ending at
position p. The expected time of this algorithm can be obtained by
dividing (m − k∗ ) and (k∗ + 2) by 2 everywhere in the above proof.
Time Complexity O((n/m)k logb m) expected time.

5 Other DP Algorithms
5.1 Hariharan : a simpler faster algorithm [2]
This is a good worst-case-time algorithm and the fastest known to
date. It uses a filtering approach. The only problem is that it is
complex and has many restrictions, and its complexity differs
depending on whether the pattern is periodic or aperiodic. This
algorithm performs better for longer patterns (length > 5k³); for
smaller pattern lengths it is better to use the LV algorithm.

5.2 Other promising methods
5.2.1 Filtering Method [9]
Filtering is based on the fact that it may be easier to tell that
strings do not match than to tell that they do. A filtering method
reduces the search space by discarding substrings where a match is
impossible. For example, to find a match with one error, we can search
exactly for either half of the pattern string: if both halves are
absent from a region, there is no way to get a match with one error
there, and we can skip that region. Most filtering methods take
advantage of exact string matching algorithms, as those are faster
than approximate string matching. Filtering is especially useful for
long patterns.
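A sketch of this one-error filter, under assumed names: the pattern is split into two halves, exact `str.find` locates candidate regions, and a small DP check runs only there.

```python
def edit_distance(a, b):
    """Plain Levenshtein DP in O(|a||b|) time and O(|b|) space."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[-1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))  # substitution / match
        prev = cur
    return prev[-1]

def filtered_search(pattern, text, k=1):
    """Pigeonhole filter for k = 1: an occurrence with at most one
    error must contain one pattern half exactly, so cheap exact search
    (str.find) locates candidate regions and the DP check runs only
    there.  Returns (start, end) slices of text within distance k."""
    m = len(pattern)
    halves = [(0, pattern[:m // 2]), (m // 2, pattern[m // 2:])]
    hits = set()
    for offset, half in halves:
        pos = text.find(half)
        while pos != -1:
            start_guess = pos - offset       # where the whole pattern would begin
            for s in range(max(0, start_guess - k), start_guess + k + 1):
                for e in range(s + m - k, min(len(text), s + m + k) + 1):
                    if 0 <= s < e and edit_distance(pattern, text[s:e]) <= k:
                        hits.add((s, e))
            pos = text.find(half, pos + 1)
    return sorted(hits)

print(filtered_search("dudddU", "xxduddduxx"))  # → [(2, 7), (2, 8)]
```

For general k the same idea splits the pattern into k + 1 pieces, at least one of which must appear exactly.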

5.2.2 Bit-Parallism [11][6]


These algorithms exploit the power of a processor to process the bits
of a machine word in parallel, which reduces the time complexity
drastically. In these algorithms the operation costs are fixed and
cannot be changed easily, so we lose the flexibility of assigning an
arbitrary weight to each operation. But if we stick to the same cost
for every operation, these algorithms are certainly faster than their
DP equivalents.
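As an illustration of the technique (my sketch, not code from the cited papers), here is a bit-parallel search in the style of Wu and Manber: one Shift-And bit mask per error level, updated with a few word operations per text character.

```python
def wu_manber(pattern, text, k):
    """Bit-parallel approximate search with up to k differences.

    R[e] holds one bit per matched pattern prefix at error level e.
    Each text character costs O(k) word operations, which is why the
    cost of every edit operation is fixed at 1 in this family."""
    m = len(pattern)
    B = {}                                    # character -> bit mask of positions
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << i)
    R = [(1 << e) - 1 for e in range(k + 1)]  # level e starts with e free errors
    out = []
    for pos, c in enumerate(text, 1):
        mask = B.get(c, 0)
        old = R[0]
        R[0] = ((R[0] << 1) | 1) & mask       # exact Shift-And step
        for e in range(1, k + 1):
            prev_old = old                    # R[e-1] before this character
            old = R[e]
            R[e] = ((((R[e] << 1) | 1) & mask)  # match
                    | prev_old                  # insertion
                    | (prev_old << 1)           # substitution
                    | (R[e - 1] << 1))          # deletion (uses new R[e-1])
        for e in range(k + 1):
            if R[e] & (1 << (m - 1)):
                out.append((pos, e))          # match ends at text[:pos], e errors
                break
    return out

print(wu_manber("dudddU", "duddddu", 1))  # → [(5, 1), (6, 1)]
```

The output agrees with Table 1: matches with one difference end after the fifth and sixth text letters.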

6 Conclusion
We saw three major algorithms, each discarding some unimportant
information. Levenshtein distance is the most basic and the most
flexible in terms of assigning weights. Its time complexity is O(mn),
so it may not be useful when comparing long melody and query strings,
but it is surely the better choice when the strings are short and we
want to experiment with the weights assigned to operations.
The Landau-Vishkin algorithm is a major breakthrough in the field of
dynamic programming. It exploits the monotonically non-decreasing
error level along each diagonal. It is faster than the previous
algorithm, working in time O(kn); deciding the right value of k is
very important in this algorithm.
Expected linear time algorithms are useful only when the pattern is
large enough: e.g., for c1 = 4.6, c2 = 8.0, k = 3, |Σ| = 5, we need the
pattern length to be more than 32 characters. Systems with larger
alphabets need smaller values of m. The restriction on pattern length
is not a problem as such, as most patterns will be of this length; but
if the pattern is shorter, the algorithm boils down to LV. These
algorithms improve the average case time, while the worst case time
remains the same as that of the LV algorithm, O(kn).

References
[1] William Chang and E. L. Lawler. Approximate string matching in
sublinear time. Technical report, Computer Science Division,
U.C. Berkeley, 1990.

[2] Richard Cole and Ramesh Hariharan. Approximate string matching:
a simpler faster algorithm. In Proceedings of the Ninth Annual
ACM-SIAM Symposium on Discrete Algorithms, pages 463–472. Society
for Industrial and Applied Mathematics, 1998.

[3] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest.
Introduction to Algorithms. The MIT Press, Cambridge, Massachusetts,
1997.

[4] Giuliano Monti and Mark Sandler. Monophonic transcription with
autocorrelation. Technical report, Department of Electronic
Engineering, King's College London, Strand, London WC2R 2LS, UK, 1999.

[5] G. M. Landau and U. Vishkin. Fast parallel and serial approximate
string matching. Journal of Algorithms, 10(2):157–169, 1989.

[6] K. Lemström. String Matching Techniques for Music Retrieval. PhD
thesis, University of Helsinki, Faculty of Science, Department of
Computer Science, 2000.

[7] M. Anand Raju, Bharat Sundaram, and Preeti Rao. TANSEN: A
query-by-humming based music retrieval system. Technical report,
Department of Electrical Engineering, IIT Bombay, 2003.

[8] Gene Myers. A fast bit-vector algorithm for approximate string
matching based on dynamic programming. Journal of the ACM (JACM),
46(3):395–415, 1999.

[9] Gonzalo Navarro. A guided tour to approximate string matching.
ACM Computing Surveys (CSUR), 33(1):31–88, 2001.

[10] Anand Raju and Preeti Rao. Building a melody retrieval system.
Technical report, Department of Electrical Engineering, IIT Bombay,
2002.

[11] Gad M. Landau and Uzi Vishkin. Introducing efficient parallelism
into approximate string matching and a new serial algorithm.
Technical report, Tel Aviv University, 1986.
