Electronics: Audio Fingerprint Extraction Based On Locally Linear Embedding For Audio Retrieval System

electronics
Article
Audio Fingerprint Extraction Based on Locally Linear
Embedding for Audio Retrieval System
Maoshen Jia 1, * , Tianhao Li 1 and Jing Wang 2, *
1 Beijing Key Laboratory of Computational Intelligence and Intelligent System, Faculty of Information
Technology, Beijing University of Technology, Beijing 100124, China; litianhao29@126.com
2 School of Information and Electronic, Beijing Institute of Technology, Beijing 100081, China
* Correspondence: jiamaoshen@bjut.edu.cn (M.J.); wangjing@bit.edu.cn (J.W.); Tel.: +86-150-1112-0926 (M.J.)

Received: 3 August 2020; Accepted: 8 September 2020; Published: 10 September 2020
Abstract: With the appearance of a large amount of audio data, people have a higher demand for
audio retrieval, which can quickly and accurately find the required information. Audio fingerprint
retrieval is a popular choice because of its excellent performance. However, there is a problem about
the large amount of audio fingerprint data in the existing audio fingerprint retrieval method which
takes up more storage space and affects the retrieval speed. Aiming at the problem, this paper
presents a novel audio fingerprinting method based on locally linear embedding (LLE) that has smaller
fingerprints and the retrieval is more efficient. The proposed audio fingerprint extraction divides the
bands around each peak in the frequency domain into four groups of sub-regions and the energy
of every sub-region is computed. Then the LLE is performed for each group, respectively, and the
audio fingerprint is encoded by comparing adjacent energies. To solve the distortion of linear speed
changes, a matching strategy based on dynamic time warping (DTW) is adopted in the retrieval part
which can compare two audio segments with different lengths. To evaluate the retrieval performance
of the proposed method, the experiments are carried out under different conditions of single and
multiple groups’ dimensionality reduction. Both of them can achieve a high recall and precision rate
and has a better retrieval efficiency with less data compared with some state-of-the-art methods.
Keywords: audio fingerprint; sub-regions; dimensionality reduction; audio retrieval
1. Introduction
With the development of internet technology, people are in an age of information explosion.
The amount of audio information increased sharply in 30 years, which means people are exposed to
various information every day [1]. For individuals, only a small fraction of it needs to be used, thus,
how to obtain the content we want from the numerous audio information sources we are exposed to,
namely audio retrieval, is an important problem worth studying [2,3].
In the field of audio retrieval, high-dimensional audio features reduce the searching efficiency
because of their redundancy of information and large storage. These features are computed in the
time or frequency domain [4]. For instance, short time energy, linear prediction coefficients (LPC),
perceptual linear predictive (PLP), and Mel frequency cepstral coefficient (MFCC) are commonly
adopted as audio features [5,6]. Additionally, the human auditory system (HAS) reflects similarities
in auditory perception, so pitch is also used in related research. With the development of audio
compression and storage technology, quantities of digital audio information have appeared on the
internet which puts forward a higher requirement for the efficiency of audio features and gives rise to
audio fingerprinting [7].
Audio fingerprinting can be seen as a short summary of audio signals that is a compact
representation of acoustic characteristics [8]. Therefore an audio fingerprint with a limited number of
Electronics 2020, 9, 1483; doi:10.3390/electronics9091483 www.mdpi.com/journal/electronics

Electronics 2020, 9, x FOR PEER REVIEW 2 of 15
Electronics 2020, 9, 1483 2 of 15
of bits can represent an audio signal consisting of a large amount of data. It is extracted from the most
important part of HAS,
bits can represent an audiowhich has similarities
signal consisting of ofaperception.
large amount Meanwhile
of data. Itaudio fingerprinting
is extracted from the shows
most
higher
importantsearchpartefficiency
of HAS, in audio
which hasretrieval works.
similarities It is widely Meanwhile
of perception. used in music identification,
audio fingerprinting copyright
shows
protection,
higher search integrity verification,
efficiency in audio and otherworks.
retrieval aspectsIt[9].
is widely used in music identification, copyright
In a variety
protection, integrity of verification,
algorithms, and the other
most aspects
classic [9].
one is Philips’ audio fingerprinting algorithm
proposed by Haitsma
In a variety and Kalker
of algorithms, the[10].
most It classic
computes one fingerprints
is Philips’ audioby comparing
fingerprinting the energies
algorithm inproposed
adjacent
bands. The algorithm has high robustness to different kinds of signal
by Haitsma and Kalker [10]. It computes fingerprints by comparing the energies in adjacent bands. degradations and the retrieval
performance
The algorithmcan be maintained
has high robustness towhen differentthe kinds
audioof signal is interfered.and
signal degradations Moreover,
the retrieval it performance
has a short
granularity of about 3 s, which is the minimum length of an audio segment
can be maintained when the audio signal is interfered. Moreover, it has a short granularity of about required for retrieval. The
algorithm
3 s, which has is thelow resistance
minimum to of
length linear
an audiospeedsegment
changesrequired
of the for audio signalThe
retrieval. [11,12]. Another
algorithm has
representative
low resistance algorithm,
to linear speed named Shazam
changes of fingerprinting,
the audio signal proposed by Wangrepresentative
[11,12]. Another encodes the relationship
algorithm,
between
named Shazam every two spectral maxima
fingerprinting, proposed byby using
Wang their time the
encodes andrelationship
frequency between
information every[13].twoItspectral
gains
better
maxima robustness
by using especially
their time and under noisy condition,
frequency information but[13].
it uses a large
It gains number
better of bits especially
robustness in its encoding,
under
which reduces thebut
noisy condition, retrieval
it uses speed.
a largeIn addition
number to the
of bits in above two common
its encoding, which methods,
reduces the other researchers
retrieval speed.
also introduced
In addition to thewavelet
above two transform
commoninto fingerprint
methods, other extraction
researchers[14,15]. This kind
also introduced of method
wavelet transformcan
provide time-frequency
into fingerprint extraction analysis of the
[14,15]. This signal
kindwhich has a can
of method good resolution
provide both in time and
time-frequency frequency
analysis of the
[16]. It has a high recall rate in audio retrieval, but its accuracy drops
signal which has a good resolution both in time and frequency [16]. It has a high recall rate in audio when the audio signal is
disturbed.
retrieval, but its accuracy drops when the audio signal is disturbed.
In
Inrecent
recent years,
years, Anguera
Anguera combined
combinedthe the two
two methods
methods mentioned
mentioned above above and and proposed
proposed aa novel novel
fingerprint called Masked Audio Spectral Keypoints (MASK)
fingerprint called Masked Audio Spectral Keypoints (MASK) [17]. The algorithm builds regions [17]. The algorithm builds regions
around selected salient
around selected salientspectral
spectralpoints
pointsand and generates
generates binarybinary fingerprints
fingerprints by comparing
by comparing the band theenergy.
band
energy. Although it has excellent performance of audio retrieval, its fingerprint
Although it has excellent performance of audio retrieval, its fingerprint dimension has room for further dimension has room
for further reduction.
reduction. Thus, this
Thus, this paper paper the
optimizes optimizes
methodthe bymethod
extracting by aextracting a smaller fingerprint.
smaller fingerprint.
In
In this paper,ananaudio
this paper, audio fingerprint
fingerprint extraction
extraction basedbased on locally
on locally linear embedding
linear embedding (LLE) is
(LLE) is proposed
proposed for audio retrieval system. The method uses LLE to
for audio retrieval system. The method uses LLE to reduce the number of sub-region energies and reduce the number of sub-region
energies and the
the reduction reduction
reflects in thereflects
dimensionin the ofdimension
fingerprints ofonfingerprints
account ofon account ofthe
comparing comparing
energies [18].the
energies
In addition,[18]. aInmatching
addition, strategy
a matching strategy
based based ontime
on dynamic dynamic
warpingtime (DTW)
warpingis(DTW) introducedis introduced
into the
into the system to gain a high resistance against linear speed changes
system to gain a high resistance against linear speed changes of audio signals [19]. The two process of audio signals [19]. The two
process above realize
above jointly jointly an realize
audio anretrieval
audio retrieval
system system
and theand the framework
framework includes includes
two main twoparts:
main parts:
audio
audio fingerprint extraction and audio fingerprint retrieval. The former
fingerprint extraction and audio fingerprint retrieval. The former consists of pre-process and fingerprint consists of pre-process and
fingerprint computation. The latter contains the establishment of database
computation. The latter contains the establishment of database and fingerprint matching. The novel and fingerprint matching.
The novel fingerprint
fingerprint has less data hasandlessimproves
data and the improves
retrievaltheefficiency.
retrieval efficiency.
Figure
Figure 1 shows a common audio fingerprinting system
1 shows a common audio fingerprinting system including
including extraction,
extraction, database
database
establishment, and matching of
establishment, and matching of fingerprints. fingerprints.
Audio fingerprint Audio fingerprint

extraction retrieval
Reference Audio
audio Pre-process fingerprint
computation
Reference
ID Database
Query
Audio Results
audio Pre-process fingerprint Match
computation
Figure
Figure 1.
1. Audio
Audiofingerprinting
fingerprintingsystem
systemframework.
framework.
Electronics 2020, 9, 1483 3 of 15
The rest2020,
Electronics of paper is structured
9, x FOR PEER REVIEW as follows: Section 2 explains the proposed audio fingerprint 3 of 15
extraction based on LLE. Section 3 describes the process of audio fingerprint matching by bit error rate
(BER). Then,The in
rest of paper
Section is structured
4 we as follows: Section
show the experiments that we2 explains
test the the proposedand
fingerprint audio fingerprint with
comparisons
extraction based on LLE. Section 3 describes the
other algorithm. Finally, we draw a conclusion in Section 5. process of audio fingerprint matching by bit error
rate (BER). Then, in Section 4 we show the experiments that we test the fingerprint and comparisons
with other
2. Audio algorithm.
Fingerprint Finally, we
Extraction drawon
Based a conclusion
LLE in Section 5.
A typicalFingerprint
2. Audio audio fingerprint
Extraction is Based
encoded by comparing the energy of bands or sub-regions in the
on LLE
frequencyA domain. This paper introduces LLE into audio fingerprint extraction which maps the
typical audio fingerprint is encoded by comparing the energy of bands or sub-regions in the
energy vector to
frequency a lower
domain. dimension
This so that smaller
paper introduces LLE into fingerprint can be obtained.
audio fingerprint The newly-acquired
extraction which maps the
fingerprint can represent
energy vector to a lowerthe audio signal
dimension so thatuniquely with fewer
smaller fingerprint canbits and it is The
be obtained. more efficient in audio
newly-acquired
retrieval. The extraction
fingerprint process
can represent is assignal
the audio follows: first with
uniquely the data
fewerisbits
pre-processed,
and it is morethen band
efficient in division
audio is
retrieval. The extraction process is as follows: first the data is pre-processed, then
implemented in the frequency domain and a set of energies is computed. Finally, after being processed band division is
implemented
by LLE, the audio in the frequency
fingerprint domainbyand
is encoded a set of energies
comparing the adjacentis computed.
energy. Finally, after being
processed
As seen inbyFigure
LLE, the audio
2 the fingerprint is encoded
dimensionality reduction by consists
comparing of the adjacent
three steps.energy.
First, K frames with the
As seen in Figure 2 the dimensionality reduction consists of three steps. First, K frames with the
nearest energy vector are selected for each frame according to the shortest Euclidean distance. Then,
nearest energy vector are selected for each frame according to the shortest Euclidean distance. Then,
all the neighbors’ reconstruction weights are computed. Finally, we obtain an energy vector mapping
all the neighbors’ reconstruction weights are computed. Finally, we obtain an energy vector mapping
in a lower dimensional
in a lower dimensionalspace.
space.More
Moredetails
detailsare
are described
described ininthe
thefollowing
following sections.
sections.
Energy
Audio input Pre-process Band division
computation
Audio Energy Embedding Reconstruction K-nearest

fingerprint comparison computation weights neighbor
LLE algorithm
Figure 2. Block
Figure diagram
2. Block diagramof
ofaudio fingerprintextraction
audio fingerprint extraction based
based on on LLE.
LLE.
2.1. Pre-Processing
2.1. Pre-Processing
As theAs audio signal
the audio hashas
signal short-time
short-timestability, pre-processingsteps,
stability, pre-processing steps, such
such as pre-emphasis,
as pre-emphasis, are are
required before the fingerprint extraction. First, we down-sample the input audio
required before the fingerprint extraction. First, we down-sample the input audio signal to 8 KHz signal to 8 KHz
because
because the sensitive
the sensitive frequencyarea
frequency areaofofhuman
human is is concentrated
concentratedonon 4 KHz
4 KHz andand
lower which
lower can also
which can also
reduce the amount of data involved in calculation. Audio fingerprinting
reduce the amount of data involved in calculation. Audio fingerprinting needs high frequencyneeds high frequency
resolution and wide time coverage so we use frame of length 100 ms overlapped with an interval of
resolution and wide time coverage so we use frame of length 100 ms overlapped with an interval of
10 ms.This framing strategy can reduce the error caused by the inconsistency of the boundary
10 ms.This framing strategy can reduce the error caused by the inconsistency of the boundary between
between test and sample and have a good retrieval efficiency even in the worst case (5 ms off). In
test and
ordersample and
to reduce thehave a goodofretrieval
poor impact efficiency
muted segments even
in the in theprocess
matching worst we casesift(5the
msoriginal
off). In order to
signal
reduce the poor impact of muted segments in the matching
roughly using voice activity detection (VAD) with frame energy. process we sift the original signal roughly
using voice activity detection (VAD) with frame energy.
2.2. Audio Fingerprint Extraction
2.2. Audio Fingerprint Extraction
(1) Maximum Spectral Points Extraction
The discrete
(1) Maximum FourierPoints
Spectral transform (DFT) is applied to the pre-processed audio signal and the
Extraction
spectrum
The is divided
discrete Fourierinto 18 bands.(DFT)
transform In thisiswork, we to
applied select
the apre-processed
peak which has the maximum
audio signal and amplitude
the spectrum
is divided into 18 bands. In this work, we select a peak which has the maximum amplitudetime
for each frame recording its current frame number and band number. To maintain a wide for each
framecoverage the number of selected peaks should be in the range of 70 to 100 samples.
recording its current frame number and band number. To maintain a wide time coverage the
(2) MASK Region Construction
number of selected peaks should be in the range of 70 to 100 samples.
After the peak selection step, two index sets of data is obtained, namely the frame index and
(2) MASKband
frequency Region Construction
index where the selected peak of each frame is located, which is equivalent to the
After the peak selection
coordinate information of each step,salient
two index
point. sets
Thenof datadata
these is obtained,
are used tonamely theMASK
construct frameregions
index and
frequency band
centered indexselected
by each wherepeak,
the selected
so there peak
shouldofbe
each
oneframe
MASKisregion
located, which
for each is equivalent
frame. The regionto the
coordinate information
covers five frequency of each
bands salient
(one point.
located band,Then theseabove,
two bands data are
and used to construct
two bands below) and MASK regions
the time
centered by each selected peak, so there should be one MASK region for each frame. The region covers
Electronics 2020, 9, 1483 4 of 15
five frequency
Electronics 2020,
Electronics bands
9, x9,FOR
2020, x FOR(one
PEER
PEER located
REVIEWband, two bands above, and two bands below) and the time
REVIEW span
of 15 is
4 of415
19 frames (one located frame, nine frames before, and nine frames after). It is worth mentioning that
span is1919frames
frames(one (onelocated
located frame,
frame, nine
span
overlaps isare allowed between regions with nine frames before,
frames
different before,
peaks and
during nine
andthenine frames
framesafter).
construction after).It It
process. is is
worth
Toworth
prevent
mentioning
mentioning that
that overlapsare
overlaps areallowed
allowedbetween
between regions
regions with
with different
different peaks
peaks during
during the construction
the construction
peaks from appearing at the band boundary, the first and last two bands are not used as a selection
process.
process. ToTo preventpeaks
prevent peaksfrom
fromappearing
appearing at at the
the band
band boundary,
boundary,the thefirst
firstand
andlast two
last twobands
bands areare
notnot
area used
so therea is no casearea where a region is built out of scope.
used as as selectionarea
a selection sosothere
thereisisnonocase
case where
where aa region
region isisbuilt
builtout
outofofscope.
scope.
Supposing
Supposing thatthat thethepeakpeak in n-th
is is n-thband
bandofofm-th
m-th frame anditsits MASKregionregion is shown in Figure 3.
Supposing that the peak is ininn-th band of m-th frame
frameandand itsMASK
MASK regionis is shown
shown in in
Figure
Figure
Note3.thatNote the frequency
that the bands
frequencybands and
bandsand time
and time frames
time frames are
frames aredistributed symmetrically. The MASK region is
3. Note that the frequency are distributed
distributedsymmetrically.
symmetrically.The TheMASK
MASK region
region
represented as a grid
is represented range
as a grid of 5of×519,
range and
× 19, andthe
theordinate hasfive
ordinate has fivefrequency
frequency bands
bands andand the abscissa
the abscissa
is represented as a grid range of 5 × 19, and the ordinate has five frequency bands and the abscissa
consists
consists of 19oftime
19 time frames.
frames.
consists of 19 time frames.
Frequency
Frequency
bands
bands
n+2
n+2n+1
n+1 n
n n-1
n-1n-2
n-2 Time frames
m-9 m m+9
Time frames
m-9 m m+9
Figure
Figure 3. 3.
AAMASK
MASKregion
region centered
centered by byaaselected
selected peak
peak(red block).
(red block).
Figure 3. A MASK region centered by a selected peak (red block).
(3) Sub-region
(3) Sub-region Division
Division
The(3)The MASK
Sub-region
MASK region
regionDivisionis divided
is divided into
into 4 4groups
groupsofofsub-regions
sub-regions as as shown
shownininFigureFigure4.4.The Thefourfour groups
groups of
sub-regions are defined as follows: the first group is horizontal sub-regions which corresponds toto1a–1h;
of sub-regions
The MASK are defined
region is as
divided follows:
into 4the first
groups group
of is horizontal
sub-regions as sub-regions
shown in which
Figure 4. corresponds
The four groups
1a–1h; the seconddefined
group is asvertical
follows:sub-regions including 2a–2d; the third group represents central to
theofsecond
sub-regions
groupare is vertical sub-regions the first group
including is horizontal
2a–2d; the third group sub-regions which
represents corresponds
central sub-regions
sub-regions
1a–1h; the with group
second 3a–3d; is thevertical
fourth group represent
sub-regions marginal
including sub-regions
2a–2d; the in the
third formrepresents
group of 4a–4h. Itcentral
can
with 3a–3d; the fourth group represent marginal sub-regions in the form of 4a–4h. It can be seen that
be seen that the horizontal and marginal groups have eight sub-regions, while the vertical and central
thesub-regions
horizontal with 3a–3d; the fourth group represent marginal sub-regions in the form of 4a–4h. It can
andfour
marginal groupsAll have eight
them sub-regions, while the vertical and central groups have
be groups
seen that have sub-regions.
the horizontal and marginal of groups correspond
have eight to the same
sub-regions, number
while theofvertical
energies and which
central
four represented
sub-regions. as L (L = 4 or 8). Then the energies of these sub-regions are calculated and the energy as L
All of them correspond to the same number of energies which represented
groups have four sub-regions. All of them correspond to the same number of energies which
(L = 4vector
or 8).ofThen the group
a single energies of theseas
is denoted 𝒆𝐇
sub-regions𝐕
𝒎 , 𝒆𝒎 , 𝒆𝒎
𝐂are calculated
and 𝒆𝐌 𝒎 which and the energy
means the energy vector vector of aofsingle
represented as L (L H = 4 or 8). Then the energies of these sub-regions are calculated and the energy
group is denoted
horizontal, vertical, eV
as em ,central, e C and
and
m ismdenoted e M which
marginal means
group in the
that energy
order. vector
We take of
m as 𝒆𝐇 , 𝒆𝐕 , 𝒆𝐂 and 𝒆𝐌 which means the energy vector 𝒆 = [𝑒
horizontal,
𝒎 , 𝑒 , … , 𝑒
vertical, ] tocentral
vector of a single group 𝐇 𝐕 𝐂 𝒎 𝒎 𝒎 𝒎 of
uniformlygroup
and marginal presentin𝒆that
𝒎 , 𝒆order.
𝒎 , 𝒆𝒎 , We 𝒆𝐌
and take𝒎 for
em simplicity,
= [em1 , em2where 𝑒 ] to
, . . . , emL is the energy of
uniformly eH
l-th sub-region
present , eV , eof
C , and
horizontal, vertical, central and marginal group in that order. We take 𝒆𝒎 = [𝑒 , 𝑒 , … , 𝑒 ] to m m m
eM form-th frame andwhere
simplicity, the𝐇value
e 𝐕 isof the
L
𝐂
depends
energy on the group type.
𝐌 of l-th sub-region of m-th frame and the value of L depends
muniformly present 𝒆 , 𝒆ml , 𝒆 , and 𝒆
𝒎 𝒎 𝒎 𝒎 for simplicity, where 𝑒 is the energy of l-th sub-region of
onm-th
the group type.the value of L depends on the group type.
frame and
Frequency
bands
n+2
Frequency
n+1
bands
n+2 n
n+1n-1
n n-2 Time frames
n-1 m-9 m m+9
n-2 (a) Horizontal sub-regions. Time frames
m-9 m m+9
Frequency
(a) Horizontal sub-regions.
bands
n+2
Frequency
n+1
bands
n
n+2n-1
n+1n-2
n Time frames
m-9 m m+9
n-1
(b) Vertical sub-regions
n-2
Time frames
m-9 m m+9
(b) Vertical sub-regions
Figure 4. Cont.
Electronics 2020, 9, 1483 5 of 15
Frequency
bands
n+2
n+1
n
n-1
n-2
m-9 m m+9 Time frames
(c) Central sub-regions
Frequency
bands
n+2
n+1
n
n-1
n-2
m-9 m m+9 Time frames
(d) Marginal sub-regions
Figure 4.
Figure 4. 44 layers
layers of
of sub-region
sub-regiondivision.
division.
(4)(4) AudioFingerprint
Audio FingerprintExtraction
Extractionbased based on on LLE
LLE
The dimension of the energy vectors obtained above
The dimension of the energy vectors obtained aboveisisstill
stillhigh
highand andthetheredundancy
redundancy cancanbe be
reduced. Thus, in consideration of the relationship of the energies, this
reduced. Thus, in consideration of the relationship of the energies, this paper use LLE to reduce the paper use LLE to reduce the
number of energies of each group respectively and finally extracts smaller fingerprints. LLE
number of energies of each group respectively and finally extracts smaller fingerprints. LLE algorithm
algorithm is a typical nonlinear dimensionality reduction method, which considers that the observed
is a typical nonlinear dimensionality reduction method, which considers that the observed data is
data is actually a result of mapping from a low-dimensional space to a high-dimensional space. Due
actually a result of mapping from a low-dimensional space to a high-dimensional space. Due to
to the internal characteristics of the data, there is redundancy in some high-dimensional data which
the internal characteristics of the data, there is redundancy in some high-dimensional data which
can be uniquely represented with less data. The algorithm needs to first obtain the reconstruction
can be uniquely
weights represented
of the sample frame within theless data. The algorithm
neighborhood, and then solveneedsthe tomapping
first obtain thesample
of the reconstruction
in the
weights
low-dimensional space according to the principle that the reconstruction weights remains sample
of the sample frame in the neighborhood, and then solve the mapping of the unchanged in the
low-dimensional space according to the principle that the reconstruction
equally to minimize the loss. The method in this paper fully considers the linear relationship between weights remains unchanged
equally
sample toframes
minimize which thecorresponds
loss. The method in this
to the time paper fully
sequence considers
structure of audio the signals,
linear relationship
and describes between
the
sample
essential characteristics of the data in the low dimension with the minimal feature loss. We reducethe
frames which corresponds to the time sequence structure of audio signals, and describes
essential characteristics
the dimension of thevector
of the energy data in ofthe
audiolowsignals
dimension withto
belonging the
theminimal
same group feature loss. We reduce
of sub-regions whichthe
dimension
also reduce of the
the energy
amountvector of energyof audio signals
involved belonging calculation.
in fingerprint to the same The group new ofenergy
sub-regions
vectorwhich
is used also
to extract
reduce low dimensional
the amount of energy fingerprints
involved in byfingerprint
comparing the adjacent energies.
calculation. The new energy vector is used to
extract(i) lowSelection of K nearest
dimensional frames by comparing the adjacent energies.
fingerprints
(i)The calculation
Selection of K is closelyframes
nearest related to the group of its sub-regions. Therefore, the processed data is
theThe
energy vector composed
calculation is closelyof all theto
related sub-regions’
the group energies of an audioTherefore,
of its sub-regions. frame in a thesame group which
processed data is
is denoted as 𝒆 𝒎 mentioned above.
the energy vector composed of all the sub-regions’ energies of an audio frame in a same group which
is denotedAssuming that there are
as em mentioned above.M frames in the given audio data, the MASK region of each frame has
fourAssuming that there are M Lframes
groups of sub-regions and energiesin are
the calculated
given audio in adata,
singlethegroup
MASK (L =region
4 or 8) of
which
eachforms
frameanhas
energy matrix with M L-dimensional data E = [e1, e2, …, eM]. The algorithm selects K nearest frames
four groups of sub-regions and L energies are calculated in a single group (L = 4 or 8) which forms
by finding the results with minimum Euclidean distance between the m-th energy vector 𝒆𝒎 and
an energy matrix with M L-dimensional data E = [e1 , e2 , . . . , eM ]. The algorithm selects K nearest
other data in matrix E. The results are arranged in ascending order of Euclidean distance and
frames by finding the results with minimum Euclidean distance between the m-th energy vector em
represented as 𝚯 = {𝜼𝟏 , 𝜼𝟐 , . . , 𝜼𝑲 } where 𝜼𝒌 is the k-th nearest energy vector of 𝒆𝒎 and the value
and other data in matrix E. The results are arranged in ascending order of Euclidean distance and
range of K is from 3to 8 which can ensure the efficiency and prevent the complexity from becoming
represented as Θ = η1 , η2 , . . . , ηK where ηK is the k-th nearest energy vector of em and the value
too high. However, if we only use Euclidean distance to select the proximities, it inevitably occurs
range
that oftwo K frames
is fromare 3 totoo8 which
far apart can
in ensure
time. Inthe efficiency
practical and prevent
conditions, the complexity
it is obviously from becoming
unreasonable for two
tooframes
high. that
However,
are not related in information to be determined as proximity. In order to ensure thatoccurs
if we only use Euclidean distance to select the proximities, it inevitably the
that twospan
times frames arenearest
of the too farselection
apart inresults
time. In practical
is not conditions,
too long and in view it isofobviously
the upper endunreasonable for two
of the K’s value
frames
rangethat is 8,are
thisnot related
paper limitsintheinformation to be determined
range of selection as proximity.
within 10 frames (five before In order
and fiveto ensure that the
after) which
times span of the nearest selection results is not too long and in view of the upper end of the K’s value
range is 8, this paper limits the range of selection within 10 frames (five before and five after) which
Electronics 2020, 9, 1483 6 of 15
considers the correlation between audio frames and reduces the possibility of proximities appearing
out of the matching segment, making a contribution to reducing the error of retrieval.
(ii) Reconstruction weights computation
The purpose of this step is to represent each energy vector by its K nearest frames with linear
weights, called reconstruction weights. For the regression problem of calculating the reconstruction
weights, this paper uses the mean square error as the loss function of the algorithm:
XM XK
J (wm ) = kem − wmk ηk k22 (1)
m=1 k =1
where Θ = η1 , η2 , . . . , ηK is the set of energy vectors of K nearest frames, the loss function J (wm )

represents the construction error which aimed to minimize. wm = [wm1 , wm2 , . . . , wmK ] where wmk is
the construction weight between the energy vector em of the m-th frame and its k-th proximity ηk with
the following constraint:
XK
wmk = 1 (2)
k =1
The normalized representation of the reconstruction vector wm can be obtained by solving the
Equation (1) with the Lagrange multiplier method:
Z−1
m cK
wm = (3)
cTK Z−1
m cK
where cK is K-dimensional 1 vector, Zm is the local covariance matrix and its formula is as follows:
Zm = (em − ηk )T (em − ηk ) (4)
Finally, the reconstruction vectors of all frames are combined into a reconstruction matrix, denoted
as W = [w1 , w2 , . . . , wM ].
(iii) Low dimensional mapping
0
Suppose the mapping result of the low dimension is E’ = [e10 , e20 , . . . , eM ], and the linear
relationship is expected to be consistent with that of the high dimension. The loss function J (wm )
also needs to be minimized which is solved by the Lagrange multiplier method. In this step the
dimensionality reduction problem can be simplified as the eigenvalue decomposition of the target
matrix D:
D = (I − W)T (I − W) (5)
where the constraint is:

1 XM 1 0 0 T
I= em 0 (e0m )T = E E (6)
M m = 1 M
where em 0 = [e0m1 , e0m2 , . . . , e0md ] is the energy vector in the low dimension.
To figure out em 0 , the loss function should be also minimized in the low dimension and it can be
rewritten as follow according to matrix theory:
0 0 T
0
J E = trace E D E (7)
Then the Lagrange multiplier method is adopted, then the equation can be represented as:
0 0 T T
0 0 0
Lag E = E D E + λ E E − MI (8)
where λ is the Lagrange multiplier. Finally the derivative of Equation (8) is taken and set it equal to 0:
0
∂Lag E 0 T 0 T
0 = 2D E + 2 λ E = 0 (9)
∂E
Electronics 2020, 9, 1483 7 of 15
Equation (9) formally conforms to the definition of matrix eigenvalues and eigenvectors. Thus,
the optimization result of this paper is a matrix composed of the eigenvectors corresponding to the
smallest d non-0 eigenvalues of matrix D (d < L). Until now, the algorithm has obtained a new energy
vector e0m = [e0m1 , e0m2 , . . . , e0md ] and the number of vector is still M corresponding the M frames audio
signal. But the dimension of the elements reduces from L to d reflecting the reduction of sub-regions’
energy. Finally, the audio fingerprints are extracted by comparing the energies in adjacent sub-regions
as follow. 
 1, i f em,i+1 − em,i > 0
 0 0
, i = 1, 2, . . . , d − 1

F(m, i) =  (10)
 0, i f e0m,i+1 − e0m,i < 0

where e0m,i is the energy of i-th sub-region in the m-th frame, F(m, i) is the i-th bit of the fingerprint in
one of the sub-region group and the fingerprints in other groups can also be computed in the same
way. By changing the d and observing the retrieval results, the limitation of dimensionality reduction
in a single group can be determined.
3. Audio Fingerprint Matching
3.1. Similarity Measurement

In this paper, the similarity measurement used in audio fingerprint matching is the bit error rate:
PR PS L
r=1 s=1 F(r, s) F0 (r, s)
BER = (11)
R×S
L
where is XOR operation, F(r,s) is the reference fingerprint in database and F0 (r,s) is the actual input
which called the query fingerprint, s is bit index and r is frame index, R is the total number of frames in
the matching segment, S is the number of the fingerprint dimensions.
The threshold is set based on the experimental results. If the BER of the query is less than the
reference which means they are very similar, the current reference should be as one of the candidate
result, whereas there is a big difference between the query and the reference, the latter would not be
selected as a retrieval result.
3.2. Dynamic Time Warping

For improving the resistance against linear distortion of the audio signal in the matching process,
the DTW is adopted. The DTW is mainly used to solve the case that the two sequences are not equal in
length of the comparison, which exactly corresponds to the change of audio signal in linear speed.
The retrieved objects include speech and music. If the query is a recording or someone’s speech received
by microphone, the playback speeds depends on the speaking speeds which varies from person to
person. The same goes for retrieving music by humming. Especially in the copyright protection,
differences in audio versions can result in different playback speeds. When the reference is compared
with the query, the length of two sequences is no longer equal, denoted R and R’. The algorithm
calculates the similarity between every two fingerprints, then plans a matching path from the first
to the last comparison according to the minimum accumulative error. The path planning strategy
ensures the alignment in time and reduces the errors caused by time dislocation. The specific method
to achieve the following:
(1) Similarity Matrix Computation
Before aligning the query and the reference, first a similarity matrix of size R × R0 needs to be
constructed, denoted as X where R is the length of reference and R’ is the length of query. The elements
in the matrix is the similarity of every two fingerprints in the query and the reference. The row and
column represent the positions of two fingerprints in the respective sequences of the comparison. To be
expressed more clarity, an example of matrix X is given in Figure 5.
Electronics 2020, 9, 1483 8 of 15
2 1 7 5
Electronics 2020, 9, x FOR PEER REVIEW 1 5 1 6 8 of 15
4 7 2 4
R 2 1 7 5
5 2 4 3
1 5 1 6
3 4 8 2
4 7 2 4
R 2 1 5 1
5 2 4 3
R’
3 4 8 2
Figure 5. Similarity matrix.
Figure 5.2 Similarity 5 . 1
1 matrix
The matrix X is shown schematically in Figure 5 where the numbers in the grid is the similarities
between(2)the Similarity Accumulation
reference and queryMatrix Computation
fingerprints computed by the Equation (11). The length of the
Accumulating the above similarity matrix X from R’the starting point (1, 1) to (R, R′). There are
reference and the query could be different or be set artificially to gain a better resistance against linear
three kinds of paths for each accumulation: transverse, longitudinal, or oblique. The direction with
speed changes.
the minimum similarity is chosen for each accumulation. The computation is carried out first in row
(2) Similarity Accumulation Matrix 5. Similarity matrix.
Computation
Figure
and then in column, and the result is recorded as matrix Y. Its formula can be express as:
Accumulating the above similarity matrix X from the starting point (1, 1) to (R, R0 ). There are
𝐘(𝑖, 𝑗) = 𝐗(𝑖,Matrix
(2) Similarity Accumulation 𝑗) + min {𝐘(𝑖 − 1, 𝑗 − 1), 𝐘(𝑖 − 1, 𝑗), 𝐘(𝑖, 𝑗 − 1)}
Computation (12)
three kinds of paths for each accumulation: transverse, longitudinal, or oblique. The direction with the
Accumulating the above similarity matrix X from the starting point (1, 1) to (R, R′). There are
minimum The similarityisaccumulation
similarity chosen for eachof theaccumulation.
first row is computed from left to right
The computation as shown
is carried outinfirst
Figure 6a and
in row
three kinds of paths for each accumulation: transverse, longitudinal, or oblique. The direction with
thenand
in the second
column, androw
theis result
computed
is in the same
recorded as order represented
matrix Y. Its in Figure
formula can 6b.express
be The computation
the minimum similarity is chosen for each accumulation. The computation is carried out first in row
as: rule
of the similarity accumulation is the number in grid plus the former accumulation according to the
and then in column, and the result is recorded as matrix Y. Its formula can be express as:
minimum of the sum Y(as
i, jmentioned
) = X(i, j)in +Equation
min Y(i −(12).
1, jAll
− 1the
), Yaccumulations
(i − 1, j), Y(i, jare
− 1labeled
)
in the upper (12)
𝐘(𝑖, 𝑗) in
right corner of each number = grid
𝐗(𝑖, 𝑗) + min
and {𝐘(𝑖
these − 1, 𝑗 − 1), form
superscripts 𝐘(𝑖 −the
1, 𝑗), 𝐘(𝑖, 𝑗 −
matrix Y.1)} (12)
The similarity accumulation
The similarity accumulation ofofthe
thefirst
firstrow
row is computedfrom
is computed fromleft
left
to to right
right as shown
as shown in Figure
in Figure 6a 6a
and the
andsecond rowrow
the second is computed
is computed ininthe
thesame
sameorder
order represented
represented ininFigure
Figure6b.6b.
TheThe computation
computation rule rule
of theofsimilarity
the similarity 2
accumulation 1 is is
accumulation 7 thenumber
the 5
number in
in grid plusthe
grid plus 2 accumulation
theformer
former 1accumulation
7 5 according
according to theto the
minimum of the sum as mentioned in Equation (12). All the accumulations are
minimum of the sum as mentioned in Equation (12). All the accumulations are labeled in the upper labeled in the upper
right right corner
corner of each
of each 1number
number 5in in 1 and
grid
grid and6these
thesesuperscripts
superscripts form 1thematrix
formthe 5 Y. Y.
matrix 1 6
4 7 2 4 4 7 2 4
R 2 1 7 5 R 2 1 7 5
5 2 4 3 5 2 4 3
1 5 1 6 1 5 1 6
3 4 8 2 35 46 811 210
4 7 2 4 4 7 2 4
R 22 13 58 19 R 22 13 58 19
5 2 4 3 5 2 4 3
R’ R’
3 4 8 2 35 46 811 210
(a) Similarity accumulation of the first row (b) Similarity accumulation of the second row
22 13 58 19 22 13 58 19
Figure 6. Similarity accumulation.
R’ R’
(a) Similarity accumulation of the first row (b) Similarity accumulation of the second row
Figure6.6.Similarity
Figure Similarity accumulation.
accumulation.
Electronics 2020, 9, 1483 9 of 15
(3) The Shortest Path Planning

(3) The Shortest Path Planning
Once the accumulation matrix Y is obtained, the DTW works backwards from the end point to
Once the accumulation matrix Y is obtained, the DTW works backwards from the end point to
the starting point to plan the path with minimum similarity accumulation and count the number of
the starting point to plan the path with minimum similarity accumulation and count the number of
matching points it passes served as compensation. The path planning process is shown in Figure 7:
matching points it passes served as compensation. The path planning process is shown in Figure 7:
The superscripts in Figure 7a shows the accumulation results and the red arrow is the
The superscripts in Figure 7a shows the accumulation results and the red arrow is the accumulation
accumulation direction. The blue arrow in Figure 7b is the matching path obtained by looking back
direction. The blue arrow in Figure 7b is the matching path obtained by looking back from the end to
from the end to the beginning. It is important to notice that three conditions must be met for path
the beginning. It is important to notice that three conditions must be met for path planning:
planning:
a. The path planning begins from (1, 1) and ends at (R, R0 );
a. The path planning begins from (1, 1) and ends at (R, R′);
b. The path has continuity, that is, the fingerprints cannot be matched across each other, but can
b. The path has continuity, that is, the fingerprints cannot be matched across each other, but can
only be aligned with their adjacent fingerprints, so as to ensure that each fingerprint appears in
only be aligned with their adjacent fingerprints, so as to ensure that each fingerprint appears in
the path;
the path;
c.c. Monotonicity.
Monotonicity. The
The path
path selection
selection must
must bebe carried
carried out
out monotonicallyinintime
monotonically timeorder
ordertotoprevent
preventany
any
crossover occurred in the alignment process.
crossover occurred in the alignment process.
Finally,the
Finally, theminimum
minimumsimilarity
similarityaccumulation
accumulationand
andthe
thenumber
numberofofmatching
matchingpoints
pointsthough
thoughthe
the
pathare
path areobtained
obtainedand
andnow
nowthe
thesimilarity
similaritymeasure
measurecan
canbe
berepresented
represented as
as follow:
follow:
𝑌(𝑅,
Y(R,𝑅′)
R0 )
𝛾
γ== (13)
(13)
𝑝p
where 𝑌(𝑅, 𝑅′) is the minimum similarity accumulation, 𝑝 is the total number of the matching
where Y(R, R0 ) is the minimum similarity accumulation, p is the total number of the matching points on
points on the path and 𝛾 is the final similarity measurement. If 𝛾 is less than the threshold value,
the path and γ is the final similarity measurement. If γ is less than the threshold value, the matching is
the matching is considered successful and the corresponding reference is taken as a candidate.
considered successful and the corresponding reference is taken as a candidate. Otherwise, the matching
Otherwise, the matching fails and the query continues to match with other references. When the
fails and the query continues to match with other references. When the matching process is over, all
matching process is over, all the candidates are arranged in ascending order of 𝛾.
the candidates are arranged in ascending order of γ.
217 116 717 515 217 116 717 515
115 519 110 615 115 519 110 615
414 714 29 413 414 714 29 413

R R
510 27 410 313 510 27 410 313
35 46 811 210 35 46 811 210
22 13 58 19 22 13 58 19
R’ R’
(a) Similarity accumulations (b) Path planning

Figure
Figure7.7.The
Theshortest
shortestpath
pathplanning.
planning.
4.4.Experimental
ExperimentalResults
Resultsand
andAnalysis
Analysis
Theexperimental
The experimentaldata
dataisisintroduced
introducedas asfollows:
follows:
Database11 consists
Database consists of 5000
5000 audio
audiofiles,
files,including
includingthethedata collected
data in the
collected laboratory
in the andand
laboratory the data
the
data collected from the internet. The length of each audio file is between 3 s and 5 min. The storageof
collected from the internet. The length of each audio file is between 3 s and 5 min. The storage size
the of
size data
theisdata
12.3isGB and
12.3 GBthe total
and thelength is 230 is
total length h. 230 h.
Database 2 is the dataset generated by adding 40 dB white noise to database 1.
Electronics 2020, 9, 1483 10 of 15

Database 6 is a random interception of 1000 audio segments with the length of 3 s from database 1
and it is used as the query audio set.
Database 7 is a random interception of 200 audio segments with the length of 3 s from database 1.
The playback speeds of the segments are in a random range of 95% to 105%.
4.1. Performance Evaluation

In this paper, recall and precision rate which commonly used in the field of information retrieval,
are selected as the metrics to evaluate the performance of the algorithm. They are defined as follows:
Recall rate = the number of correct targets detected from the retrieval source/the number of targets
that should have been retrieved × 100%
Precision rate = the number of correct targets detected from the retrieval source/the number of
targets actually retrieved × 100%
4.2. Results and Analysis

It is preliminarily verified by experiments that if all groups of LLE fingerprint extraction is carried
out at the same time, too much features are lost and the retrieval performance is poor. So as to ensure the
efficiency of audio retrieval is not seriously affected by the feature lost in the dimensionality reduction
process, the algorithm does not reduce the dimension of energy vectors in all groups. The rest of the
energy vectors that have not been reduced by LLE are calculated as shown in Table 1.
Table 1. Non-reduced fingerprint extraction.
Dimension Group Type Comparison

1–7 Horizontal 1a−1b, 1b−1c, 1c−1d, 1d−1e, 1e−1f, 1f−1g, 1g−1h
8–10 Vertical 2a−2b, 2b−2c, 2c−2d
11–14 Central 3a−3b, 3b−3c, 3c−3d, 3d−3a
15–18 Marginal 4a−4b, 4c−4d, 4e−4f, 4g−4h
19–22 Extended 4a + 4b−(4c + 4d), 4c + 4d−(4e + 4f), 4e + 4f−(4g + 4h), 4g + 4h−(4a + 4b)
As can be seen in Table 1, 1–18 dimensions are computed in a single group and represent the
energy distribution of horizontal, vertical, central, and marginal sub-regions. The extended sub-region
is composed of marginal sub-region in pairs and then generates the fingerprint by the comparison of
energy pairs corresponding to the 19–22 dimensions.
Once the fingerprint has extracted by the reduced energy vector, the computation has uncertain
effect on the efficiency of retrieval. The maximum reduced dimensions in a single group and the group
numbers can be reduced at the same time are presented in the rest of paper by giving the experimental
results. The performance is evaluated by the metrics mentioned in Section 4.1.
4.2.1. Influence of Nearest Frames Number K on Retrieval Performance

The algorithm is sensitive to the selection of K and different K has great influence on the retrieval
result. The influence of different K is analyzed in each group reduction and the optimal K is selected
for the experiment. Generally K is between 3 and 8. Table 2 shows the retrieval results corresponding
to different K under the condition that the horizontal sub-region is reduced by 1 dimension.
As can be seen in Table 2, when K is set to 3, the best retrieval performance is obtained. Through
the same experimental steps, the selection of K value under different conditions is determined and the
optimal K value is applied in the subsequent experiment. It is worth noting that the influence of K
value on the retrieval results is not always monotonous, but the higher its value means the increase of
Electronics 2020, 9, 1483 11 of 15
algorithm complexity. Therefore, this paper comprehensively considers both the retrieval results and
the algorithm complexity to select the most appropriate K value.
Table 2. The retrieval result of different K (reduced dimension of horizontal group = 1).
K Value Recall (%) Precision (%)

3 99.6 98.6
4 97.2 95.4
5 93.3 89.3
6 89.1 85.4
7 82.6 76.5
8 75.2 70.8
4.2.2. Dimensionality Reduction in a Single Group

In order to verify the maximum dimensionality reduction that LLE can achieve, this paper
increases the number of reduced dimension step by step on the single sub-region group. The increasing
is implemented until the group has no space for further reduction or its retrieval performance cannot
be accepted by the actual situation. However, since the construction of the fingerprint in the extended
group is a combination of multiple sub-region energies, this group is not included in the consideration
of dimensionality reduction in the paper. First, Table 3 shows the retrieval results of different reduction
of horizontal group.
In the case that the reduced dimension is set to 4 in the horizontal group, the method can
still maintain high recall and precision rate (Table 3). The original dimension of fingerprint in the
horizontal group is 7 and the reduced dimension is raised from 1 to 4. It can be seen that the reduction
proportion is more than 50% in the horizontal group and 20% in the overall fingerprint (22 dimensions).
Similar results are obtained in other groups.
Table 3. Horizontal dimensionality reduction results.
Reduced Proportion of Proportion of

Optimal K Recall (%) Precision (%)
Dimension Group Fingerprint
1 3 1/7 1/22 99.6 98.6
2 3 2/7 2/22 99.4 98.4
3 5 3/7 3/22 99.0 98.2
4 5 4/7 4/22 98.8 97.9
The original fingerprint dimensions in each group are different, so the maximum reduction are also
different. The reduction of the three groups shown in Table 4 reached a proportion 50% of group and
about 10% of the overall fingerprint. The result is mainly limited by the original fingerprint dimension.
The original fingerprint dimension in the horizontal group is quite large (seven dimensions), while the
other groups have no reduction space as large as that in the horizontal group, making the overall
dimensionality reduction lower than that in the horizontal area.
Table 4. None horizontal dimensionality reduction results.
Reduced Proportion Proportion of Precision

Group Optimal K Recall (%)
Dimension of Group Fingerprint (%)
1 Vertical 5 1/3 1/22 99.2 98.0
2 Vertical 5 2/3 2/22 98.1 96.5
1 Central 5 1/4 1/22 99.3 98.9
2 Central 5 2/4 2/22 98.5 97.6
1 Marginal 6 1/4 1/22 99.1 97.8
2 Marginal 4 2/4 2/22 97.4 94.2
Electronics 2020, 9, 1483 12 of 15
4.2.3.Single
4.2.3. SingleReduction
ReductionResult
ResultCompared
Comparedwith
withOther
OtherAlgorithm
Algorithmunder
underDifferent
DifferentSNR
SNR
Theresults
The resultsofofthe
thehorizontal
horizontalgroup
groupwith
withthethelargest
largestdimensionality
dimensionality(four (fourdimensions)
dimensions)reduction
reduction
are selected in the comparison. To verify its robustness under noisy conditions,
are selected in the comparison. To verify its robustness under noisy conditions, Philips and Shazam Philips and Shazam
algorithmsare
algorithms areapplied
appliedasasreference
referencemethods.
methods.As Asclassical
classicalalgorithms,
algorithms,they theyare arewidely
widelyused
usedininaudio
audio
fingerprint retrieval and have good retrieval performance. In Philips’ algorithm, the spectrum ofof
fingerprint retrieval and have good retrieval performance. In Philips’ algorithm, the spectrum
audiosignal
audio signalisisdivided
dividedinto into3333bands
bandsandandgenerates
generatesaa32-bit 32-bitbinary
binaryfingerprint
fingerprintby bycomparing
comparingthe the
energiesininadjacent
energies adjacentbands.
bands.On Onthetheother
otherhand,
hand,Shazam
Shazamuses usesthe thepeak
peakinformation
informationextracted
extractedininthethe
frequency domain to construct a hash value with the storage size of 32 bits. In
frequency domain to construct a hash value with the storage size of 32 bits. In addition, a comparison addition, a comparison
ofofthe
theoriginal
originalMASKMASKaudioaudiofingerprint
fingerprintisisadded
addedininthe theexperiment
experimenttotoprove provethe thefeasibility
feasibilityofofthe
the
dimensionality reduction. The original MASK size is 22 bits and the reduced
dimensionality reduction. The original MASK size is 22 bits and the reduced fingerprint size is 18 bits. fingerprint size is 18
bits. Figure 8 shows the retrieval performance of each method
Figure 8 shows the retrieval performance of each method under different SNR. under different SNR.
Thecomparison
The comparisonofofthe therecall
recalland
andprecision
precisionrate rateofofeach
eachmethod
methodisisshown
shownininFigure
Figure8a,b.
8a,b.Both
Bothofof
them represent the retrieval performance and the performance of the horizontal
them represent the retrieval performance and the performance of the horizontal dimension reduced dimension reduced
fingerprintininthis
fingerprint thispaper
paperisisbasically
basicallythe
thesame
sameasasthe theoriginal
originalMASK,
MASK,and andslightly
slightlyhigher
higherthan
thanthat
thatofof
Philipsand
Philips andShazam
Shazamalgorithms.
algorithms.With Withthe
thedecrease
decreaseofofSNR, SNR,the theretrieval
retrievalperformance
performanceofofthe themethod
method
presented in this paper decreases slowly which indicates that it has stronger
presented in this paper decreases slowly which indicates that it has stronger resistance against noise.resistance against noise.
Tosum
To sumup,
up,
thethe method
method in this
in this paperpaper reduces
reduces the amount
the amount of fingerprint
of fingerprint data which
data which saves storage
saves storage space
space and improves retrieval speed on the premise that the robust performance
and improves retrieval speed on the premise that the robust performance is almost not affected. is almost not affected.
(a) Comparison of the recall rate (b) Comparison of the precision rate
Figure8.8.Comparison
Figure Comparisonofofhorizontal
horizontalreduction
reductionwith
withother
othermethods.
methods.
4.2.4.
4.2.4.Dimensionality
DimensionalityReduction
ReductionininMultiple
MultipleGroups
Groups
Once
Oncethe themaximum
maximumreduction
reductionin in
single group
single groupis determined,
is determined, thisthis
section presents
section the result
presents of theof
the result
reduction
the reductionin multiple groups.
in multiple The results
groups. are based
The results on theon
are based central group group
the central and combine it with itother
and combine with
groups to analyze the influence on retrieval performance. The reason for choosing
other groups to analyze the influence on retrieval performance. The reason for choosing the central the central group
isgroup
that inis the
thatprocess of sub-region
in the process division,division,
of sub-region there arethere
more areormore
less overlaps betweenbetween
or less overlaps central group
central
and other groups which means that they have partial similarity in features and
group and other groups which means that they have partial similarity in features and the feature the feature loss in the
loss
multiple
in the multiple reduction process should be small. In the multiple reduction experiment, the Kisvalue
reduction process should be small. In the multiple reduction experiment, the K value the
optimal value invalue
is the optimal each ingroup
eachand the and
group reduction operation
the reduction is relatively
operation independent,
is relatively so there issonothere
independent, needis
tonoadjust the K value again. The experiment is conducted to prove all cases
need to adjust the K value again. The experiment is conducted to prove all cases of multiple of multiple reduction
starting
reduction from 1 reduction
starting from 1to the maximum.
reduction The result The
to the maximum. of the central
result and
of the horizontal
central group is firstly
and horizontal group
shown in shown
is firstly Table 5.in Table 5.
The
Theretrieval performance of
retrieval performance ofmultiple
multiplereduction
reduction is lower
is lower than than the corresponding
the corresponding resultsresults of
of single
single group due to the increase of feature loss. When both groups reach
group due to the increase of feature loss. When both groups reach the maximum reduction, the the maximum reduction,
the algorithm
algorithm also
also hashas highretrieval
high retrievalefficiency
efficiencywhich
whichindicates
indicates thatthat the
the reduction
reduction in in central
central and
and
horizontal group can be carried out simultaneously to achieve a total reduction
horizontal group can be carried out simultaneously to achieve a total reduction of six dimensions of six dimensions with
with a proportion of about 27%, more than 1/4 (Table 5). Moreover, the reduction in central and
vertical group has similar results.
Electronics 2020,9,9,1483
Electronics2020, x FOR PEER REVIEW 13
13of
of15
15
The reduction in central and vertical group can also reach a total dimensionality reduction of
a proportion of about 27%, more than 1/4 (Table 5). Moreover, the reduction in central and vertical
four dimensions with a reduction rate of about 18% (Table 6). However, there is a significant decrease
group has similar results.
in retrieval performance, and it cannot be applied to large databases.
The reduction in central and vertical group can also reach a total dimensionality reduction of four
In addition to the above two combinations, the fingerprint features in other cases are greatly
dimensions with a reduction rate of about 18% (Table 6). However, there is a significant decrease in
different and the feature loss is quite large, which makes it difficult to guarantee the retrieval
retrieval performance, and it cannot be applied to large databases.
efficiency and fails to achieve ideal results. The horizontal group has more bits of fingerprint, that is
In addition to the above two combinations, the fingerprint features in other cases are greatly
to say, has more reduction space. Thus, the reduction in central and horizontal can get less data and
different and the feature loss is quite large, which makes it difficult to guarantee the retrieval efficiency
better retrieval performance.
and fails to achieve ideal results. The horizontal group has more bits of fingerprint, that is to say,
has more reduction space. Thus, the reduction in central and horizontal can get less data and better
Table 5. Central and horizontal reduction results.
retrieval performance.
Central Reduced Dimension = 1 Reduced Dimension = 2
Table 5. Central and horizontal reduction results.
Precision Recall
Recall (%) Precision (%)
Horizontal
Central Reduced Dimension = (%)
1 (%)
Reduced Dimension = 2
Reduced dimension = 1 99.2 98.6 99.0 98.5
Horizontal Recall (%) Precision (%) Recall (%) Precision (%)
Reduced dimension
Reduced dimension = 3 = 4 98.8 98.5 98.097.7 98.2
98.5 97.497.8
Table 6. Central and vertical reduction results.
Central Table 6. Central

ReducedandDimension
vertical reduction
= 1 results.
Reduced Dimension = 2
Central Reduced Dimension =1
Precision Reduced Dimension = 2
Recall
Vertical
Horizontal Recall (%) (%)
Precision (%) (%) (%)
Recall Precision (%)
Reduced
Reduced dimension
dimension =1 = 1 98.6 98.6 97.297.2 97.2
97.2 96.2 96.2
Reduced
Reduced dimension
dimension =2 = 2 96.8 96.8 94.694.6 95.4
95.4 93.8 93.8
4.2.5. Multiple
4.2.5. Multiple Reduction
Reduction Result
Result Compared
Compared with
with Other
Other Algorithm
Algorithm under
underDifferent
DifferentSNR
SNR
The combination
The combination of of central
central and
and horizontal
horizontal group
group with
with the
the largest
largest reduction
reduction isis selected
selected for
for the
the
experiment which has a reduction of six dimensions and fingerprint size of 16 bits. The
experiment which has a reduction of six dimensions and fingerprint size of 16 bits. The comparison comparison
methods are
methods arestill
stillPhilips,
Philips,Shazam,
Shazam,andandMASK
MASKalgorithms,
algorithms,and
andthe
theresults
resultsare
areshown
shownininFigure
Figure9.9.
(a) Comparison of the recall rate (b) Comparison of the precision rate
Figure 9.
Figure 9. Comparison
Comparison of
of multiple
multiple reduction
reduction with
with other
other methods.
methods.
The retrieval
The retrieval performance
performance of of the
the reduction
reduction combination
combination in in this
this paper
paper isis slightly
slightly lower
lower than
than that
that
of the Shazam and the original MASK, which indicates that the combination
of the Shazam and the original MASK, which indicates that the combination of groups aggravates of groups aggravates
the feature
the feature loss.
loss. But
But its
its performance
performance is is basically
basically the
the same
same asas that
that ofof Philips
Philips and
and Shazam
Shazam algorithms,
algorithms,
indicatingthat
indicating thatthe
the method
method in this
in this paper paper can the
can meet meet the retrieval
retrieval requirements
requirements to someFor
to some extent. extent. For
example,
example, when dealing with small- or medium-sized audio retrieval, the multiple reduction
proposed in this paper reduces the amount fingerprint data to the greatest extent and has basic
Electronics 2020, 9, 1483 14 of 15
when dealing with small- or medium-sized audio retrieval, the multiple reduction proposed in this
paper reduces the amount fingerprint data to the greatest extent and has basic retrieval efficiency.
In large-scale audio retrieval, the performance of multiple reduction is difficult to maintain the accuracy,
but single reduction can still be used for it.
To present the compression potential more clearly, a table of the fingerprint size of each method is
shown below.
Table 7 includes the fingerprint size of three comparison methods and two proposed methods.
All the methods have same length of the frame and extract a fingerprint from each frame, so the
experiment only needs to compare the size of a single fingerprint. The multiple reduction has the
smallest fingerprint size which is half that of the Philips and Shazam. The fingerprint size of proposed
method is also a significant reduction compared with the MASK.
Table 7. Fingerprint size of different methods.
Method Fingerprint Size (per Frame)

Philips 32 bits
Shazam 32 bits
MASK 22 bits
LLE (horizontal) 18 bits
LLE (horizontal + central) 16 bits
4.2.6. Resistance to Linear Speed Changes

In order to prove the validity of DTW, 200 audio segments are randomly selected and artificially
modified playback speeds at a random rate (Database 7). These segments are used as a query set and
the fingerprint with the smallest size is applied. An experiment is carried out not using DTW for the
purpose of comparison.
The experiment obtains a higher recall and precision rate with using DTW (Table 8), so the DTW
can indeed improve the resistance of the retrieval system to linear speed changes.
Table 8. Results of linear speed changes.

Using DTW 100 99.6
Not using DTW 99.4 98.5
5. Conclusions
A fingerprint dimensionality reduction method based on LLE and a fingerprint matching method
based on DTW are introduced in this paper. Firstly, the energy vectors of sub-regions in the frequency
domain is reduced which involved in fingerprint calculation. Then the reduction of the fingerprint
is obtained by comparing the adjacent energies. Additionally, the matching segment is adjusted
by the DTW algorithm, which improved the resistance to linear speed changes of the audio signal.
The experimental results show that this novel method has good retrieval performance both in single
group and multiple group reduction, and the maximum fingerprint reduction ratio can reach 27%,
which greatly reduces the fingerprint data.
Author Contributions: M.J. and T.L. contributed equally in conceiving the whole proposed codec architecture,
designing and performing the experiments, collecting and analyzing the data, and writing the paper. J.W. collected
the data and implemented the final revision. J.W. supervised all aspects of this research. All authors have read
and agreed to the published version of the manuscript.
Funding: This work has been supported by the National Natural Science Foundation of China under grant
no. 61971015.
Conflicts of Interest: The authors declare no conflict of interest.
Electronics 2020, 9, 1483 15 of 15
References
1. Zhang, X.; Zou, X.; Hu, Q.; Zhang, P. Audio retrieval method based on weighted DNA. China Sci. 2018, 13,
2295–2300.
2. Wang, Y.; Liu, Z.; Huang, J. Multimedia content analysis using both audio and visual clues. IEEE Signal
Process. Mag. 2000, 17, 12–36. [CrossRef]
3. Li, G.; Wu, D.; Zhang, J. Concept framework for audio information retrieval. J. Comput. Sci. Technol. 2003, 18,
667–673. [CrossRef]
4. Klapuri, A.P. Multiple fundamental frequency estimation based on harmonicity and spectral smoothness.
IEEE Trans. Speech Audio Process. 2003, 11, 804–816. [CrossRef]
5. Jiang, X. An audio data retrieval method based on MFCC. Comput. Digit. Eng. 2008, 9, 24–26.
6. Jiang, X.; Li, Y. Audio data retrieval method based on LPCMCC. Comput. Eng. 2009, 35, 246–247.
7. Li, W.; Li, X.; Chen, F.; Wang, S. Review of digital audio fingerprinting. J. Chin. Comput. Syst. 2008, 29,
2124–2130.
8. Cano, P.; Batle, E.; Kalker, T.; Haitsma, J. A Review of algorithms for audio fingerprinting. In Proceedings of
the Multimedia Signal Processing, St. Thomas, VI, USA, 9–11 December 2002; pp. 169–173.
9. Doets, P.J.O.; Lagendijk, R.L. Extracting quality parameters for compressed audio from fingerprints.
In Proceedings of the International Conference on ISMIR, London, UK, 11–15 September 2005; pp. 498–503.
10. Haitsma, J.; Kalker, T. A highly robust audio fingerprinting system. In Proceedings of the International
Conference on Music Information Retrieval, Paris, France, 13–17 October 2002; pp. 107–115.
11. Doets, P.J.O.; Lagendijk, R.L. Stochastic model of a robust audio fingerprinting system. In Proceedings of the
5th International Conference on Music Information Retrieval, Barcelona, Spain, 10–14 October 2004; pp. 2–5.
12. Park, M.; Kim, H.; Yang, S. Frequency-temporal filtering for a robust audio fingerprinting scheme in real-noise
environments. IEEE Trans. Inf. Syst. 2006, 28, 509–512. [CrossRef]
13. Wang, A.; Li, C. An industrial strength audio search algorithm. In Proceedings of the International Conference
on Music Information Retrieval, Baltimore, MD, USA, 27–30 October 2003; pp. 7–13.
14. Baluja, S.; Covell, M. Waveprint: Efficient wavelet-based audio fingerprinting. Pattern Recognit. 2008, 41,
3467–3480. [CrossRef]
15. Pucciarelli, G. Wavelet analysis in volcanology: The case of phlegrean fields. J. Environ. Sci. Eng. A 2017, 6,
300–307.
16. Chen, D.; Zhang, W.; Zhang, Z. Audio retrieval based on wavelet transform. In Proceedings of the IEEE/ACIS
16th International Conference on Computer and Information Science (ICIS), Wuhan, China, 24–26 May 2017;
pp. 531–534.
17. Anguera, X.; Garzon, A.; Adamek, T. MASK: Robust local features for audio fingerprinting. In Proceedings
of the IEEE International Conference on Multimedia and Expo, Melbourne, Australia, 9–13 July 2012;
pp. 455–460.
18. Roweis, S.T.; Saul, L.K. Nonlinear dimensionality reduction by locally linear embedding. Science 2001, 290,
2323–2326. [CrossRef] [PubMed]
19. Vavrek, J.; Viszlay, P.; Lojka, M.; Juhar, J.; Pleva, M. Weighted fast sequential DTW for multilingual audio
query-by-example retrieval. J. Intell. Inf. Syst. 2018, 51, 439–455. [CrossRef]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Electronics: Audio Fingerprint Extraction Based On Locally Linear Embedding For Audio Retrieval System

Uploaded by

Copyright:

Available Formats

Electronics: Audio Fingerprint Extraction Based On Locally Linear Embedding For Audio Retrieval System

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Electronics: Audio Fingerprint Extraction Based On Locally Linear Embedding For Audio Retrieval System

Uploaded by

Copyright:

Available Formats

electronics

Keywords: audio fingerprint; sub-regions; dimensionality reduction; audio retrieval

Electronics 2020, 9, 1483; doi:10.3390/electronics9091483 www.mdpi.com/journal/electronics

Audio fingerprint Audio fingerprint

Audio Energy Embedding Reconstruction K-nearest

Zm = (em − ηk )T (em − ηk ) (4)

where the constraint is:

3. Audio Fingerprint Matching

3.1. Similarity Measurement

3.2. Dynamic Time Warping

Electronics 2020, 9, x FOR PEER REVIEW 1 5 1 6 8 of 15

(3) The Shortest Path Planning

217 116 717 515 217 116 717 515

115 519 110 615 115 519 110 615

414 714 29 413 414 714 29 413

35 46 811 210 35 46 811 210

(a) Similarity accumulations (b) Path planning

Database 2 is the dataset generated by adding 40 dB white noise to database 1.

4.1. Performance Evaluation

4.2. Results and Analysis

Table 1. Non-reduced fingerprint extraction.

Dimension Group Type Comparison

4.2.1. Influence of Nearest Frames Number K on Retrieval Performance

K Value Recall (%) Precision (%)

4.2.2. Dimensionality Reduction in a Single Group

Table 3. Horizontal dimensionality reduction results.

Reduced Proportion of Proportion of

Table 4. None horizontal dimensionality reduction results.

Reduced Proportion Proportion of Precision

Central Table 6. Central

Table 7. Fingerprint size of different methods.

Method Fingerprint Size (per Frame)

4.2.6. Resistance to Linear Speed Changes

Table 8. Results of linear speed changes.

Recall (%) Precision (%)

You might also like