
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 5, SEPTEMBER 2006, 1783

A Chorus Section Detection Method for Musical Audio Signals and Its Application to a Music Listening Station

Masataka Goto

Abstract—This paper describes a method for obtaining a list of repeated chorus ("hook") sections in compact-disc recordings of popular music. The detection of chorus sections is essential for the computational modeling of music understanding and is useful in various applications, such as automatic chorus-preview/search functions in music listening stations, music browsers, or music retrieval systems. Most previous methods detected as a chorus a repeated section of a given length and had difficulty identifying both ends of a chorus section and dealing with modulations (key changes). By analyzing relationships between various repeated sections, our method, called RefraiD, can detect all the chorus sections in a song and estimate both ends of each section. It can also detect modulated chorus sections by introducing a perceptually motivated acoustic feature and a similarity that enable detection of a repeated chorus section even after modulation. Experimental results with a popular music database showed that this method correctly detected the chorus sections in 80 of 100 songs. This paper also describes an application of our method, a new music-playback interface for trial listening called SmartMusicKIOSK, which enables a listener to directly jump to and listen to the chorus section while viewing a graphical overview of the entire song structure. The results of implementing this application have demonstrated its usefulness.

Index Terms—Chorus detection, chroma vector, music-playback interface, music structure, music understanding.

I. INTRODUCTION

CHORUS ("hook" or refrain) sections of popular music are the most representative, uplifting, and prominent thematic sections in the music structure of a song, and human listeners can easily understand where the chorus sections are because these sections are the most repeated and memorable portions of a song. Automatic detection of chorus sections is essential for building a music-scene-description system [1], [2] that can understand musical audio signals in a human-like fashion, and is useful in various practical applications. In music browsers or music retrieval systems, it enables a listener to quickly preview a chorus section as an "audio thumbnail" to find a desired song. It can also increase the efficiency and precision of music retrieval systems by enabling them to match a query with only the chorus sections.

This paper describes a method, called Refrain Detecting Method (RefraiD), that exhaustively detects all repeated chorus sections appearing in a song with a focus on popular music. It can obtain a list of the beginning and end points of every chorus section in real-world audio signals and can detect modulated chorus sections. Furthermore, because it detects chorus sections by analyzing various repeated sections in a song, it can generate an intermediate-result list of repeated sections that usually reflect the song structure; for example, the repetition of a structure like verse A, verse B, and chorus is often found in the list.

This paper also describes a music listening station called SmartMusicKIOSK that was implemented as an application system of the RefraiD method. In music stores, customers typically search out the chorus or "hook" of a song by repeatedly pressing the fast-forward button, rather than passively listening to the music. This activity is not well supported by current technology. Our research has led to a function for jumping to the chorus section and other key parts (repeated sections) of a song, plus a function for visualizing the song structure. These functions eliminate the hassle of searching for the chorus and make it easier for a listener to find desired parts of a song, thereby facilitating an active listening experience.

The following sections introduce related research, describe the problems dealt with, explain the RefraiD method in detail, and show experimental results indicating that the method is robust enough to correctly detect the chorus sections in 80 of 100 songs of a popular-music database. Finally, the SmartMusicKIOSK system and its usefulness are described.

Manuscript received January 31, 2005; revised October 10, 2005. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Malcom Slaney. The author is with the Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki 305-8568, Japan (e-mail: m.goto@aist.go.jp). Digital Object Identifier 10.1109/TSA.2005.863204

II. RELATED WORK

Most previous chorus detection methods [3]–[5] only extract a single segment from several chorus sections by detecting a repeated section of a designated length as the most representative part of a song. Logan and Chu [3] developed a method using clustering techniques and hidden Markov models (HMMs) to categorize short segments (1 s) in terms of their acoustic features, where the most frequent category is then regarded as a chorus. Bartsch and Wakefield [4] developed a method that calculates the similarity between acoustic features of beat-length segments obtained by beat tracking and finds the given-length segment with the highest similarity averaged over its segment. Cooper and Foote [5] developed a method that calculates a similarity matrix of acoustic features of short frames (100 ms) and
finds the given-length segment with the highest similarity between it and the whole song. Note that these methods assume that the output segment length is given and do not identify both ends of a chorus section.

Music segmentation or structure discovery methods [6]–[13] where the output segment length is not assumed have also been studied. Dannenberg and Hu [6], [7] developed a structure discovery method of clustering pairs of similar segments obtained by several techniques such as efficient dynamic programming or iterative greedy algorithms. This method finds, groups, and removes similar pairs from the beginning to group all the pairs. Peeters et al. [8] and Peeters and Rodet [9] developed a supervised learning method of modeling dynamic features and studied two structure discovery approaches: the sequence approach of obtaining repetitions of patterns and the state approach of obtaining a succession of states. The dynamic features are selected from the spectrum of a filter-bank output by maximizing the mutual information between the selected features and hand-labeled music structures. Aucouturier and Sandler [14] developed two methods of finding repeated patterns in a succession of states (texture labels) obtained by HMMs. They used two image processing techniques, the kernel convolution and Hough transform, to detect line segments in the similarity matrix between the states. Foote and Cooper [10], [11] developed a method of segmenting music by correlating a kernel along the diagonal of the similarity matrix, and clustering the obtained segments on the basis of the self-similarity of their statistics. Chai and Vercoe [12] developed a method of detecting segment repetitions by using dynamic programming, clustering the obtained segments, and labeling the segments based on heuristic rules such as the rule of first labeling the most frequent segments, removing them, and repeating the labeling process. Wellhausen and Crysandt [13] studied the similarity matrix of spectral-envelope features defined in the MPEG-7 descriptors and a technique of detecting noncentral diagonal line segments.

None of these methods, however, address the problem of detecting all the chorus sections in a song. Furthermore, while chorus sections are sometimes modulated (the key is changed) during their repetition in a song, previously reported methods did not deal with modulated repetition.

III. CHORUS SECTION DETECTION PROBLEM

To enable the handling of a large number of songs in popular music, this research aims for a general and robust chorus section detection method using no prior information on acoustic features unique to choruses. To this end, we focus on the fact that chorus sections are usually the most repeated sections of a song and adopt the following basic strategy: find sections that repeat and output those that appear most often. It must be pointed out, however, that it is difficult for a computer to judge repetition because it is rare for repeated sections to be exactly the same. The following summarizes the main problems that must be addressed in this regard.

[Problem 1] Acoustic Features and Similarity: Whether a section is a repetition of another must be judged on the basis of the similarity between the acoustic features obtained from each section. In this process, the similarity must be high between acoustic features even if the accompaniment or melody line changes somewhat in the repeated section (e.g., the absence of accompaniment on bass and/or drums after repetition). This condition is difficult to satisfy if acoustic features are taken to be simple power spectrums or mel-frequency cepstral coefficients (MFCC) as used in audio/speech signal processing.

[Problem 2] Repetition Judgment Criterion: The criterion establishing how high similarity must be to indicate repetition depends on the song. For a song containing many repeated accompaniment phrases, for example, only a section with very high similarity should be considered the chorus section repetition. For a song containing a chorus section with accompaniments changed after repetition, on the other hand, a section with somewhat lower similarity can be considered the chorus section repetition. This criterion can be easily set for a small number of specific songs by manual means. For a large open song set, however, the criterion should be automatically modified based on the song being processed.

[Problem 3] Estimating Both Ends of Repeated Sections: Both ends (the beginning and end points) of repeated sections must be estimated by examining the mutual relationships among the various repeated sections. For example, given a song having the structure (A B C B C C), the long repetition corresponding to (B C) would be obtained by a simple repetition search. Both ends of the C section in (B C) could be inferred, however, from the information obtained regarding the final repetition of C in this structure.

[Problem 4] Detecting Modulated Repetition: Because the acoustic features of a section generally undergo a significant change after modulation (key change), similarity with the section before modulation is low, making it difficult to judge repetition. The detection of modulated repetition is important since modulation sometimes occurs in chorus repetitions, especially in the latter half of a song.1

1Although a reviewer of this paper pointed out that songs with modulation are generally rare in Western popular music, they are not rare in Japanese popular music, which has been influenced by Western music. We conducted a survey on Japan's popular music hit chart (top 20 singles ranked weekly from fiscal 2000 to fiscal 2003) and found that modulation occurred in chorus repetitions in 152 songs (10.3%) out of 1481.

IV. CHORUS SECTION DETECTION METHOD: REFRAID

Fig. 1 shows the process flow of the RefraiD method. First, a 12-dimensional feature vector called a chroma vector, which is robust with respect to changes of accompaniments, is extracted from each frame of an input audio signal and then the similarity between these vectors is calculated (solution to Problem 1). Each element of the chroma vector corresponds to one of the 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, and B) and is the sum of magnitude at frequencies of its pitch class over six octaves. Pairs of repeated sections are then listed (found) using an adaptive repetition-judgment criterion that is configured by an automatic threshold selection method based on a discriminant criterion [17] (solution to Problem 2). To organize common repeated sections into groups and to identify both ends of each section, the pairs of repeated sections are integrated (grouped) by analyzing their relationships over the whole song (solution
to Problem 3). Because each element of a chroma vector corresponds to a different pitch class, a before-modulation chroma vector is close to the after-modulation chroma vector whose elements are shifted (exchanged) by the pitch difference of the key change. By considering 12 kinds of shift (pitch differences), 12 sets of the similarity between nonshifted and shifted chroma vectors are then calculated, pairs of repeated sections from those sets are listed, and all of them are integrated (solution to Problem 4). Finally, the chorus measure, which is the possibility of being chorus sections for each group, is evaluated, and the group of chorus sections with the highest chorus measure as well as other groups of repeated sections are output (Fig. 2).

The main symbols used in this section are listed in Table I.

TABLE I
LIST OF SYMBOLS

Fig. 1. Overview of chorus section detection method RefraiD.

Fig. 2. Example of chorus sections and repeated sections detected by the RefraiD method. The horizontal axis is the time axis (in seconds) covering the entire song. The upper window shows the power. The top row in the lower window shows the list of the detected chorus sections, which were correct for this song (RWC-MDB-P-2001 No. 18 of the RWC Music Database [15], [16]) and the last of which was modulated. The bottom five rows show the list of various repeated sections (only the five longest repeated sections are shown).

A. Extract Acoustic Feature

Fig. 3 shows an overview of calculating the chroma vector, which is a perceptually motivated feature vector using the concept of chroma in Shepard's helix representation of musical pitch perception [18]. According to Shepard [18], the perception of pitch with respect to a musical context can be graphically represented by using a continually cyclic helix that has two dimensions, chroma and height, as shown at the right of Fig. 3. Chroma refers to the position of a musical pitch within an octave that corresponds to a cycle of the helix: it refers to the position on the circumference of the helix seen from directly above. On the other hand, height refers to the vertical position of the helix seen from the side (the position of an octave). Here, there are two major types of cue for pitch perception: "temporal cues" based on the periodicity of auditory nerve firing and "place cues" based on the position on the basilar membrane [19]. A study by Fujisaki and Kashino [20] indicates that the temporal cue is important for chroma identification, and that the place cue is important for height judgment.

Fig. 3. Overview of calculating a 12-dimensional chroma vector. The magnitude at six different octaves is summed into just one octave which is divided into 12 log-spaced divisions corresponding to pitch classes. Shepard's helix representation of musical pitch perception [18] is shown at the right.

The chroma vector represents magnitude distribution on the chroma that is discretized into 12 pitch classes within an octave: the basic idea is to coil the magnitude spectrum around the helix and squash it flat to project the frequency axis to the chroma.

The 12-dimensional chroma vector v(t) is extracted from the magnitude spectrum Ψ(F, t), at the log-scale frequency F (in cents) at time t, calculated by using the short-time Fourier transform (STFT). Each element v_c(t) of v(t) corresponds to a pitch class
in the equal temperament (c = 1, 2, ..., 12) and is represented as

  v_c(t) = Σ_{h=Oct_L}^{Oct_H} ∫ BPF_{c,h}(F) Ψ(F, t) dF.   (1)

The BPF_{c,h} is a bandpass filter that passes the signal at the log-scale frequency F_{c,h} (in cents) of pitch class c (chroma) in octave position h (height)

  F_{c,h} = 1200 h + 100 (c − 1),   (2)

where frequency f_Hz in hertz is converted to frequency F_cent in cents so that there are 100 cents to a tempered semitone and 1200 to an octave

  F_cent = 1200 log₂ ( f_Hz / (440 × 2^(3/12 − 5)) ).   (3)

The BPF is defined using a Hanning window as follows:

  BPF_{c,h}(F) = (1/2) (1 − cos( 2π (F − (F_{c,h} − 100)) / 200 )),  F_{c,h} − 100 ≤ F ≤ F_{c,h} + 100.   (4)

This filter is applied to octaves from Oct_L to Oct_H. In the current implementation, the input signal is digitized at 16 bit/16 kHz, and then the STFT with a 4096-sample Hanning window is calculated using the fast Fourier transform (FFT). Since the FFT frame is shifted by 1280 samples, the discrete time step (1 frame shift) is 80 ms. The Oct_L and Oct_H, the octave range for the summation of (1), are, respectively, three and eight. This covers six octaves (130 Hz–8 kHz).

There are several advantages to using the chroma vector.2 Because it captures the overall harmony (pitch-class distribution), it can be similar even if accompaniments or melody lines are changed in some degree after repetition. In fact, we have confirmed that the chroma vector is effective for identifying chord names [22], [23].3 The chroma vector also enables modulated repetition to be detected as described in Section IV-E.

2The chroma vector is similar to the chroma spectrum [21] that is used in reference [4], although its formulation is different.

3Other studies [24]–[26] have also shown the effectiveness of using the concept of chroma for identifying chord names.

B. Calculate Similarity

The similarity r(t, l) between the chroma vectors v(t) and v(t − l) is defined as

  r(t, l) = 1 − (1/√12) | v(t)/max_c v_c(t) − v(t − l)/max_c v_c(t − l) |,   (5)

where l is the lag. Since the denominator √12 is the length of the diagonal line of a 12-dimensional hypercube with edge length 1, r(t, l) satisfies 0 ≤ r(t, l) ≤ 1. In our experience with chroma vectors, the combination of the above similarity using the Euclidean distance and the vector normalization using a maximum element is superior to the similarity using the cosine angle (scalar product) and other vector normalization techniques.

C. List Repeated Sections

Pairs of repeated sections are obtained from the similarity r(t, l). Considering that r(t, l) is drawn within a right-angled isosceles triangle in the two-dimensional time-lag space (time-lag triangle) as shown in Fig. 4, the method finds line segments that are parallel to the horizontal time axis and that indicate consecutive regions with high r(t, l). When the section between times t_s and t_e is denoted [t_s, t_e], each line segment between the points (t_1, l_1) and (t_2, l_1) is represented as ([t_1, t_2], l_1), which means that the section [t_1, t_2] is similar to (i.e., is a repetition of) the section [t_1 − l_1, t_2 − l_1]. In other words, each horizontal line segment in the time-lag triangle indicates a repeated-section pair. We, therefore, need to detect all horizontal line segments in the time-lag triangle of r(t, l). To find a horizontal line segment ([t_1, t_2], l_1), the possibility R_all(t, l) of containing line segments at the lag l,4 is evaluated at the current time t (e.g., at the end of a song) as follows (Fig. 4):

  R_all(t, l) = (1/t) ∫₀ᵗ r(τ, l) dτ.   (6)

Fig. 4. Sketch of line segments, the similarity r(t, l), and the possibility R_all(t, l) of containing line segments. The similarity r(t, l) is defined in the right-angled isosceles triangle (time-lag triangle) in the lower right-hand corner. The actual r(t, l) is noisy and ambiguous and usually contains many line segments irrelevant to chorus sections.

4This can be considered the Hough transform where only horizontal lines are detected: the parameter (voting) space R_all(t, l) is, therefore, simply one dimensional along l.

Before this calculation, r(t, l) is normalized by subtracting a local mean value while removing noise and emphasizing horizontal lines. In more detail, given each point (t, l) in the time-lag triangle, six-directional local mean values of r along the right, left, upper, lower, upper-right, and lower-left directions starting from the point are calculated, and the maximum and minimum of them are obtained (the length over which each mean is taken is a parameter listed in Table II). If the local mean along the right or left direction (i.e., along the horizontal time axis) takes the maximum, r(t, l) is considered part of a horizontal line and emphasized by subtracting the minimum from it. Otherwise, r(t, l) is
considered noise and suppressed by subtracting the maximum from it; noise tends to appear as lines along the upper, lower, upper-right, and lower-left directions.

The method then picks up each peak in R_all(t, l) along the lag l by finding a point where the smoothed differential (7) of R_all(t, l) changes sign from positive to negative [27]. Before this calculation, it removes the global drift caused by cumulative noise in r(t, l) from R_all(t, l): it subtracts, from R_all(t, l), a smoothed R_all(t, l) low-pass filtered by using a moving average whose weight function is the second-order cardinal B-spline (its width is a parameter listed in Table II); this subtraction is equivalent to obtaining a high-pass-filtered R_all(t, l).

The method then selects only high peaks above a threshold to search the line segments. Because this threshold is closely related to the repetition-judgment criterion which should be adjusted for each song, we use an automatic threshold selection method based on a discriminant criterion [17]. When dichotomizing the peak heights into two classes by a threshold, the optimal threshold is obtained by maximizing the discriminant criterion measure defined by the following between-class variance:

  σ_B² = ω₁ ω₂ (μ₁ − μ₂)²,   (8)

where ω₁ and ω₂ are the probabilities of class occurrence (number of peaks in each class/total number of peaks), and μ₁ and μ₂ are the means of the peak heights in each class.

For each picked-up high peak with lag l_1, the line segments are finally searched in the direction of the horizontal time axis on the one-dimensional function r(τ, l_1) (Fig. 5). After smoothing using a moving average filter whose weight function is the second-order cardinal B-spline, the method obtains line segments on which the smoothed r(τ, l_1) is above a threshold and whose length is long enough (more than 6.4 s). This threshold is also adjusted using the above automatic threshold selection method based on the discriminant criterion. Here, instead of dichotomizing peak heights, the method selects the top five peak heights of R_all(t, l), obtains the five lags corresponding to those selected high peaks, and dichotomizes all the values of the smoothed r(τ, l) at those lags.

Fig. 5. Examples of the similarity r(τ, l_1) at high-peak lags l_1. The bottom horizontal bars indicate the regions above an automatically adjusted threshold, which means they correspond to line segments.

D. Integrate Repeated Sections

Since each line segment indicates just a pair of repeated sections, it is necessary to organize into a group the line segments that have common sections. Suppose a section is repeated n times; the number of line segments to be grouped together should theoretically be n(n − 1)/2 if all of them are found in the time-lag triangle. First, line segments that have almost the same section [T_s, T_e] are organized into a group; more specifically, two line segments are grouped when both the difference between their beginning points and the difference between their end points are smaller than a dynamic threshold equal to a fixed percentage of the segment length with a ceiling (parameters listed in Table II). The group is represented as φ = ([T_s, T_e], Γ), where Γ = {γ_1, γ_2, ..., γ_M} (M is the number of line segments in the group) is a set of the lags of those segments—corresponding to the high peaks in R_[T_s, T_e](l)—in this group (Fig. 6). A set of these groups is denoted by Φ = {φ_1, φ_2, ..., φ_N} (N is the number of all groups).

Fig. 6. Sketch of a group φ = ([T_s, T_e], Γ) of line segments that have almost the same section [T_s, T_e], a set Γ of those lags γ_j (j = 1, 2, ..., 5), and the possibility R_[T_s, T_e](l) of containing line segments within [T_s, T_e].

Aiming to exhaustively detect all the repeated (chorus) sections, the method then redetects some missing (hidden) line segments not found in the bottom-up detection process (described in Section IV-C) through top-down processing using information on other detected line segments. In Fig. 4, for example, we
can expect that two line segments corresponding to the repetition of the first and third C and the repetition of the second and fourth C, which overlap with the long line segment corresponding to the repetition of ABCC, are found even if they were hard to find in the bottom-up process.

For this purpose, line segments are searched again by using R_[T_s, T_e](l) just within the section [T_s, T_e] of each group φ. Starting from

  R_[T_s, T_e](l) = (1/(T_e − T_s)) ∫_{T_s}^{T_e} r(t, l) dt   (9)

instead of R_all(t, l), the method performs almost the same peak-picking process described in Section IV-C and forms a new set of high peaks above a threshold in R_[T_s, T_e](l) (Fig. 6). In more detail, it picks up each peak by finding a point where the smoothed differential (10) of R_[T_s, T_e](l) changes sign from positive to negative. Before this calculation, it also removes the global drift in the same way by smoothing with the second-order cardinal B-spline. This threshold is again adjusted using the above automatic threshold selection method based on the discriminant criterion. Here, the method optimizes the threshold by dichotomizing all local peak heights of R_[T_s, T_e](l) taken from all groups of Φ.

The method then removes inappropriate peaks in each group as follows.

1) Remove unnecessary peaks that are equally spaced. When similar accompaniments are repeated throughout most of a song, peaks irrelevant to chorus sections tend to appear at even intervals in R_[T_s, T_e](l). A group where the number of equally spaced peaks exceeds a fixed limit (Table II) is judged to be irrelevant to chorus sections and is removed from Φ. For this judgment, we consider only peaks that are higher than a threshold determined by the standard deviation of the lower half of peaks. In addition, when the number of equally spaced low peaks is more than a fixed limit (Table II), those peaks are judged to be irrelevant to chorus sections and are removed from Γ. For this judgment, we consider only peaks that are higher than the above threshold and lower than the average of the above threshold and the highest peak.

2) Remove a peak whose line segment has a highly deviated similarity. When only part of the similarity at a peak within [T_s, T_e] is high, its peak is not appropriate for use. A peak is removed from Γ when the standard deviation of the similarity at its lag, after smoothing with the above second-order cardinal B-spline, is larger than a threshold. Since peaks detected in Section IV-C can be considered reliable, this threshold is determined as a constant multiplied by the maximum of the above standard deviation at all those peaks.

3) Remove a peak that is too close to other peaks and causes sections to overlap. To avoid sections overlapping, it is necessary to make the interval between adjacent peaks along the lag greater than the length of its section. One of every pair of peaks having an interval less than the section length is removed so that higher peaks can remain overall.

Finally, by using the lag γ corresponding to each peak of R_[T_s, T_e](l), the method searches for a group whose section is [T_s − γ, T_e − γ] (i.e., is shared by the current group φ) and integrates it with φ if it is found. They are integrated by adding all the peaks of the found group to Γ after adjusting the lag values (peak positions); the found group is then removed. In addition, if there is a group that has a peak indicating the section [T_s, T_e], it too is integrated.

E. Integrate Repeated Sections With Modulation

The processes described above do not deal with modulation (key change), but they can easily be extended to it. A modulation can be represented by the pitch difference of its key change, tr, which denotes the number of tempered semitones. For example, tr = 9 means the modulation of nine semitones upward or the modulation of three semitones downward. One of the advantages of the 12-dimensional chroma vector is that a transposition amount tr of the modulation can naturally correspond to the amount by which its 12 elements are shifted (rotated). When v(t) is the chroma vector of a certain performance and v'(t) is the chroma vector of the performance that is modulated by tr semitones upward from the original performance, they tend to satisfy

  v'(t) ≈ S^tr v(t),   (11)

where S is a 12-by-12 shift matrix5 defined by

      [ 0 0 ⋯ 0 1 ]
      [ 1 0 ⋯ 0 0 ]
  S = [ 0 1 ⋯ 0 0 ]   (12)
      [ ⋮    ⋱   ⋮ ]
      [ 0 0 ⋯ 1 0 ].

5Note that this shift (rotation) operation is not applicable to other acoustic features such as simple power spectrums and MFCC features.

To detect the modulated repetition by using this feature of chroma vectors and considering 12 destination keys, we calculate 12 kinds of extended similarity r_tr(t, l) for each tr as follows:

  r_tr(t, l) = 1 − (1/√12) | S^tr v(t) / max_c (S^tr v(t))_c − v(t − l) / max_c v_c(t − l) |.   (13)

Starting from each r_tr(t, l), the processes of listing and integrating the repeated sections are performed as described in Sections IV-C and D, except that the threshold automatically adjusted at tr = 0 is used for the processes at tr ≠ 0 (which suppresses harmful false detection of nonrepeated sections). After these processes, 12 sets of line-segment groups are obtained for 12 kinds of tr. To organize nonmodulated and modulated repeated sections into the same groups, the method integrates several groups across all the sets if they share the same section.
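As a concrete illustration of (11)–(13), the cyclic shift of a chroma vector and the extended similarity can be sketched in Python as follows. This is a minimal sketch, not the paper's implementation: the function names, the zero-based pitch-class indexing, and the rotation direction of S are our assumptions.

```python
import numpy as np

# (12): 12-by-12 cyclic shift (permutation) matrix; S @ v moves the energy
# of each pitch class up by one semitone, so S**tr models a modulation of
# tr tempered semitones upward.
S = np.roll(np.eye(12), 1, axis=0)

def extended_similarity(v_t, v_t_l, tr):
    """(13): similarity between the S^tr-shifted v(t) and v(t - l)."""
    shifted = np.linalg.matrix_power(S, tr) @ v_t
    d = np.linalg.norm(shifted / shifted.max() - v_t_l / v_t_l.max())
    return 1.0 - d / np.sqrt(12.0)

# Example: a vector concentrated on pitch class C, modulated two semitones
# up, matches a vector concentrated on pitch class D only when tr = 2.
c_vec = np.zeros(12); c_vec[0] = 1.0
d_vec = np.zeros(12); d_vec[2] = 1.0
```

Because S is a permutation, twelve applications return the identity (a full octave), which matches the statement that nine semitones upward equals three semitones downward (S^9 = S^-3).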
Hereafter, we use to denote the groups TABLE II


of line segments obtained from all the . By unfolding each line PARAMETER VALUES
segment of to the pair of repeated sections indicated by it,
we can obtain
(14)
where represents an unfolded repeated sec-
tion that corresponds to the lag and is calculated by
. The is the possibility
of being repeated sections (the possibility that the sections
and are really repeated), and is defined
as the mean of the similarity on the corresponding line
segment. For corresponding not to a line segment
but to the section itself, we define and
as and . The V. EXPERIMENTAL RESULTS
modulated sections are labeled with their key-shift widths for reference.

F. Select Chorus Sections

After evaluation of the chorus measure v_m, which is the possibility that the sections in group m are the chorus sections, the group m̂ that maximizes the chorus measure is selected as the chorus sections:

    m̂ = argmax_m v_m.    (15)

The chorus measure v_m is a sum of the possibility ν_i of each repeated section i in the group, weighted by the section length λ_i, and is defined by

    v_m = Σ_{i ∈ group m} ν_i (λ_i + D_len),    (16)

where D_len is a constant (1.4 s). Before v_m is calculated, the possibility ν_i of each repeated section is adjusted according to three assumptions (heuristics), which fit a large class of popular music.

[Assumption 1]: The length of the chorus section has an appropriate, allowed range (7.7 to 40 s in the current implementation). If the length is out of this range, ν_i is set to 0.

[Assumption 2]: When a repeated section is long enough to be likely to correspond to long-term repetition such as verse A, verse B, and chorus, the chorus section is likely to be near its end. If there is a repeated section whose end is close to the end of another long repeated section (longer than 50 s), its ν_i is doubled; i.e., ν_i is doubled if the difference of the end points of those sections is smaller than a fixed threshold.

[Assumption 3]: Because a chorus section tends to have two half-length repeated subsections within its section, a section having such subsections is likely to be the chorus section. If there is a repeated section that has such subsections in another group, half of the mean of the possibility of the two subsections is added to its ν_i.

The RefraiD method then outputs a list of chorus sections found as explained above as well as a list of repeated sections obtained as its intermediate result. As postprocessing for the chorus sections determined by (15), only a small gap between adjacent chorus sections is padded (eliminated) by equally prolonging the end of those sections; more specifically, the gap is padded only when it is smaller than a threshold or half of the section length.

V. EXPERIMENTAL RESULTS

The RefraiD method has been implemented in a real-time system that takes a musical audio signal as input and outputs a list of the detected chorus sections and repeated sections. Along with the real-time audio input, the system can display visualized lists of chorus sections and repeated sections, which are obtained using just the past input and are considered the most probable at every instance. The final detected results for a song are obtained at the end of the song. The parameter values in the current implementation are listed in Table II.

We evaluated the accuracy of chorus section detection done through the RefraiD method. The method was tested on 100 songs6 of the popular-music database "RWC Music Database: Popular Music" (RWC-MDB-P-2001 Nos. 1–100) [15], [16], which is an original database available to researchers around the world. These 100 songs were originally composed, arranged, performed, and recorded in a way that reflected the complexity and diversity of real-world music. In addition, to provide a reference for judging whether detection results were right or wrong, correct chorus sections in targeted songs had to be labeled manually. To enable this, we developed a song structure labeling editor that can divide up a song and correctly label chorus sections.

We compared the output of the proposed method with the correct chorus sections that were hand-labeled by using this labeling editor. The degree of matching between the detected and correct chorus sections was evaluated using the F-measure [28], which is the harmonic mean of the recall rate R and the precision rate P:

    F-measure = 2RP / (R + P)    (17)

    R = (total length of correctly detected chorus sections) / (total length of correct chorus sections)    (18)

    P = (total length of correctly detected chorus sections) / (total length of detected chorus sections)    (19)

The output for a song was judged to be correct if its F-measure was more than 0.75. For the case of modulation (key change), a chorus section was judged correctly detected only if the relative width of the key shift matched the actual width.

6 99, 64, and 54 songs out of 100 fit assumptions 1–3, respectively.
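The selection step in Section F above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Section` class and its field names are hypothetical, `nu` stands in for the possibility ν_i, `s.length` for λ_i, and `D_LEN` for the constant D_len; only Assumption 1 (the allowed length range) is applied here, whereas the full method also adjusts ν_i by Assumptions 2 and 3 before summing.

```python
from dataclasses import dataclass

@dataclass
class Section:
    start: float  # seconds
    end: float    # seconds
    nu: float     # possibility nu_i of this repeated section being a chorus

    @property
    def length(self) -> float:
        return self.end - self.start

D_LEN = 1.4                    # the constant D_len in Eq. (16), in seconds
MIN_LEN, MAX_LEN = 7.7, 40.0   # allowed chorus length range (Assumption 1)

def chorus_measure(group: list[Section]) -> float:
    """Eq. (16): sum of possibilities weighted by section length."""
    total = 0.0
    for s in group:
        # Assumption 1: zero the possibility outside the allowed length range.
        nu = s.nu if MIN_LEN <= s.length <= MAX_LEN else 0.0
        total += nu * (s.length + D_LEN)
    return total

def select_chorus(groups: list[list[Section]]) -> list[Section]:
    """Eq. (15): the group maximizing the chorus measure is the chorus."""
    return max(groups, key=chorus_measure)
```

The gap-padding postprocessing described above would then run over the selected group, merging adjacent sections whose gap is small enough.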
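The length-based evaluation in (17)–(19) reduces to interval-overlap arithmetic. The sketch below is illustrative only; it assumes each list holds non-overlapping (start, end) pairs in seconds, and the function names are stand-ins rather than anything from the released evaluation code.

```python
def overlap_length(detected, correct):
    """Total time shared between two lists of (start, end) sections."""
    return sum(
        max(0.0, min(de, ce) - max(ds, cs))
        for ds, de in detected
        for cs, ce in correct
    )

def f_measure(detected, correct):
    """Eqs. (17)-(19): length-based recall R, precision P, and F-measure."""
    hit = overlap_length(detected, correct)
    recall = hit / sum(e - s for s, e in correct)      # Eq. (18)
    precision = hit / sum(e - s for s, e in detected)  # Eq. (19)
    if recall + precision == 0.0:
        return 0.0
    return 2 * recall * precision / (recall + precision)  # Eq. (17)
```

Per the criterion above, a song would then be counted as correct when this value exceeds 0.75.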
1790 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING , VOL. 14, NO. 5, SEPTEMBER 2006

TABLE III 0.974 by using the modulation detector. In all cases, the modu-
RESULTS OF EVALUATING REFRAID: NUMBER OF SONGS WHOSE CHORUS lated sections themselves were not correctly detected when the
SECTIONS WERE DETECTED CORRECTLY UNDER FOUR SETS OF CONDITIONS
modulation detector was disabled because the similarity based
on chroma vectors is sensitive to the modulation. These results
show the effectiveness of the modulation detector and assump-
tions 2 and 3.

VI. APPLICATION: MUSIC LISTENING STATION


The results are listed in Table III. The method dealt correctly WITH CHORUS-SEARCH FUNCTION
with 80 songs7 out of 100 (with the averaged F-measure of When “trial listening” to prerecorded music on compact discs
those 80 songs being 0.938). The main reasons for the method (CDs) at a music store, a listener often takes an active role in the
making mistakes were choruses that did not repeat more often playback of musical pieces or songs by picking out only those
than other sections and the repetition of similar accompaniments sections of interest. This new type of music interaction differs
throughout most of a song. Among these 100 songs, ten songs from passive music appreciation in which people usually listen
(RWC-MDB-P-2001 Nos. 3, 18, 22, 39, 49, 71, 72, 88, 89, and to entire musical selections. To give some background, music
90) included modulated chorus sections (these songs are re- stores in recent years have installed music listening stations to
ferred to as “modulated songs” in the following), and nine of allow customers to listen to CDs on a trial basis to facilitate a
these songs (except for no. 72) were dealt with correctly (the purchasing decision. In general, the main objective of listening
F-measure was more than 0.75). While the modulation itself to music is to appreciate it, and it is common for a listener to
was correctly detected in all of the ten modulated songs, modu- play a musical selection from start to finish. In trial listening,
lated sections were not correctly selected as the chorus sections however, the objective is to quickly determine whether a selec-
in two of the songs (Nos. 72 and 89).8 There were 22 songs tion is the music one has been looking for and whether one likes
(RWC-MDB-P-2001 Nos. 3, 5, 9, 14, 17, 19, 24, 25, 33, 36, it, so listening to entire selections in the above manner is rare.
37, 38, 44, 46, 50, 57, 58, 64, 71, 91, 96, and 100) that had In the case of popular music, for example, customers often want
choruses exhibiting significant changes in accompaniment or to listen to the chorus to pass judgment on that song. This de-
melody on repetition, and 21 of these (except for no. 91) were sire produces a special way of listening in which the trial lis-
dealt with correctly (the F-measure was more than 0.75); the re- tener first listens briefly to a song’s “intro” and then jumps ahead
peated chorus section itself was correctly detected in 16 of these in search of the chorus by repeatedly pushing the fast-forward
(except for Nos. 5, 17, 25, 44, 57, and 91). These results show button, eventually finding the chorus and listening to it.
that the method is robust enough to deal with real-world audio The functions provided by conventional listening stations for
signals. music CDs, however, do not support this unique way of trial
When the function to detect the modulated repetition (re- listening very well. These listening stations are equipped with
ferred to as the “modulation detector” in the following) was playback-operation buttons typical of an ordinary CD player,
disabled, only 74 songs were dealt with correctly. On the other and among these, only the fast-forward and rewind buttons can
hand, when assumptions 2 and 3 were not used, the performance be used to find the chorus section of a song. On the other hand,
fell as shown by the entries in the two rightmost columns of the digital listening stations that have recently been installed
Table III. Enabling the modulation detector without assump- in music stores enable playback of musical selections from a
tions 2 and 3 increased the number of correctly detected songs hard disk or over a network. Here, however, only one part (e.g.,
from 68 to 73 (the five additional songs were Nos. 3, 4, 22, 88, the beginning) of each musical selection (an interval of about
and 90), and enabling it with assumptions 2 and 3 increased the 30–45 s) is mechanically excerpted and stored, which means
number from 74 to 80 (additional songs were the above five that a trial listener may not necessarily hear the chorus section.
songs plus no. 39). Using assumptions 2 and 3 increased the Against the above background, we propose SmartMusic-
number of correctly detected songs from 68 to 74 with the mod- KIOSK, a music listening station equipped with a chorus search
ulation detector and from 73 to 80 songs without it: in the former function. With SmartMusicKIOSK, a trial listener can jump
case, seven additional songs were correctly detected (Nos. 10, to the beginning of a song’s chorus (perform an instantaneous
25, 33, 38, 44, 46, and 82), but one song (no. 39) which was pre- fast-forward to the chorus) by simply pushing the button for
viously detected correctly was not detected; in the latter case, the this function. This eliminates the hassle of manually searching
same additional songs were correctly detected. Under the four for the chorus. SmartMusicKIOSK also provides a function
sets of conditions, four songs (Nos. 18, 49, 71, and 89) of the for jumping to the beginning of the next structural (repeated)
ten modulated songs were always dealt with correctly and one section of the song.
song (no. 72) was never dealt with correctly, while the averaged Much research has been performed in the field of music infor-
F-measure of Nos. 18, 49, and 71 was improved from 0.827 to mation processing, especially in relation to music information
retrieval and music understanding, but there has been practically
7The F-measure was not more than 0.75 for RWC-MDB-P-2001 Nos. 2, 12,
none in the area of trial listening. Interaction between people and
16, 29, 30, 31, 41, 53, 56, 59, 61, 66, 67, 69, 72, 79, 83, 91, 92, and 95.
8Even if the modulated chorus section itself was not selected in
music can be mainly divided into two types: the creating/ac-
RWC-MDB-P-2001 no. 89, the song was detected correctly because its tive side (composing, performing, etc.) and the receiving/pas-
F-measure (0.877) was more than 0.75. sive side (appreciating music, hearing background music, etc.).
GOTO: CHORUS SECTION DETECTION METHOD FOR MUSICAL AUDIO SIGNALS 1791

Trial listening, on the other hand, differs from the latter type, that
is, musical appreciation, since it involves listening to musical
selections while taking an active part in their playback. This is
why we felt that this activity would be a new and interesting
subject for research.

A. Past Forms of Interaction in Music Playback


The ability to play an interactive role in music playback by
changing the current playback position is a relatively recent de-
velopment in the history of music. In the past, before it became
possible to record the audio signals of music, a listener could
only listen to a musical piece at the place where it was per-
formed live. Then, when the recording of music to records and
tape became a reality, it did become possible to change playback
from one musical selection to another, but the bother and time
involved in doing so made this a form of nonreal-time interaction. The ability of a listener to play back music interactively really only began with the coming of technology for recording music onto optical media like CDs. These media made it possible to move the playback position almost instantly with just a push of a button, making it easy to jump from one song to another while listening to music.

Fig. 7. SmartMusicKIOSK screen display. The lower window presents the playback operation buttons and the upper window provides a visual representation of a song's contents (results of automatic chorus section detection using RWC-MDB-P-2001 no. 18 of the RWC Music Database [15], [16]).

However, while it became easy to move between selections (CD tracks), there was not sufficient support for interactively changing the playback position within a selection as demanded by trial listening. Typical playback operation buttons found on conventional CD players (including music listening stations) are play, pause, stop, fast-forward, rewind, jump to next track, and jump to previous track (a single button may be used to perform more than one function). Among these, only the fast-forward and rewind buttons can change the playback position within a musical selection. Here, however, listeners are provided with only the following three types of feedback as aids to finding the position desired:
1) sound of fast playback that can be heard while holding down the fast-forward/rewind button;
2) sound after releasing the button;
3) display of elapsed time from the start of the selection in question.
Consequently, a listener who wanted to listen to the chorus of a song, for example, would have to look for it manually by pressing and releasing a button any number of times.

These types of feedback are essentially the same when using media-player software on a personal computer (PC) to listen to songs recorded on a hard disk, although a playback slider may be provided. The total length of the playback slider corresponds to the length of a song, and the listener can manipulate the slider to jump to any position in a song. Here as well, however, the listener must use manual means to search out a specific playback position, so nothing has really changed.

B. Intelligent Music Listening Station: SmartMusicKIOSK

For music that would normally not be understood unless some time was taken for listening, the problem here is how to enable changing between specific playback positions before actual listening. We propose the following two functions to solve this problem, assuming the main target is popular music.

1) "Jump to chorus" function: automatic jumping to the beginning of sections relevant to a song's structure: Functions are provided enabling automatic jumping to sections that will be of interest to listeners. These functions are "jump to chorus (NEXT CHORUS button)," "jump to previous section in song (PREV SECTION button)," and "jump to next section in song (NEXT SECTION button)," and they can be invoked by pushing the buttons shown above in parentheses. With these functions, a listener can directly jump to and listen to chorus sections, or jump to the previous or next repeated section of the song.

2) "Music map" function: visualization of song contents: A function is provided to enable the contents of a song to be visualized to help the listener decide where to jump next. Specifically, this function provides a visual representation of the song's structure consisting of chorus sections and repeated sections, as shown in Fig. 7. While examining this display, the listener can use the automatic jump buttons, the usual fast-forward/rewind buttons, or a playback slider to move to any point of interest in the song.

The following describes the lower and upper windows shown in Fig. 7.
• Playback operation window (lower window): The three automatic jump buttons added to the conventional playback-operation buttons are named NEXT CHORUS, PREV SECTION, and NEXT SECTION. These buttons are marked with newly designed symbols.
Pressing the NEXT CHORUS button causes the system to search for the next chorus in the song from the present position (returning to the first one if none remain) and to jump to the start of that chorus. Pressing the other two buttons causes the system to search for the immediately following section or immediately preceding section with respect to the present position and to jump to the start of that section. While searching, the system ignores section-end points.
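The jump behaviors just described amount to a simple search over section start times. The sketch below is a hypothetical reimplementation under those rules, not the SmartMusicKIOSK code itself; sections are assumed to be given as (start, end) pairs in seconds, and the function names are illustrative.

```python
def next_chorus_start(position, chorus_sections):
    """NEXT CHORUS button: return the start time of the next chorus
    after `position`, wrapping to the first chorus if none remain."""
    starts = sorted(s for s, _ in chorus_sections)
    for s in starts:
        if s > position:
            return s
    return starts[0]  # none remain: wrap around to the first chorus

def next_section_start(position, all_sections):
    """NEXT SECTION button: jump to the nearest section start after
    `position`; section-end points are ignored while searching."""
    starts = sorted({s for s, _ in all_sections if s > position})
    return starts[0] if starts else None
```

Repeatedly calling `next_chorus_start` thus loops through all chorus sections, which is the looping behavior the interface relies on.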

• Song-structure display window (upper window): The top row of this display provides a visual representation of chorus sections while the lower rows (five maximum in the current implementation) provide a visual representation of repeated sections. On each row, colored sections indicate similar (repeated) sections. In Fig. 7, for example, the second row from the top indicates the structural repetition of "verse A → verse B → chorus" (the longest repetition of a visual representation often suggests such a structural repetition); the bottom row with two short colored sections indicates the similarity between the "intro" and "ending" of this song. In addition, the thin horizontal bar at the very bottom of this window is a playback slider whose position corresponds to elapsed time in the song. Clicking directly on a section (touching in the case of a touch panel or tablet PC) plays that section, and clicking the playback slider changes the playback position.

The above interface functions promote a type of listening in which the listener can first listen to the intro of a song for just a short time and then jump and listen to the chorus with just a push of a button.9 Furthermore, visualizing the entire structure of a song allows the listener to choose various parts of a song for trial listening.

C. System Implementation and Results

We built a SmartMusicKIOSK system incorporating all the functions described in Section VI-B. The system is executed with files that include descriptions of chorus sections and repeated sections, which can be obtained beforehand by the RefraiD method. Although the results of automatic detection include errors and are, therefore, not 100% accurate as described in Section V, they still provide the listener with a valuable aid to finding a desired playback position and make a listening station much more convenient than in the past. If, however, there are times when an accurate description is required, results of automatic detection may be manually corrected. The song structure labeling editor described in Section V can also be used for this manual correction and labeling. This is useful especially for songs not suitable for automatic detection or outside the category of popular music.

In the SmartMusicKIOSK system, the song file playback engine, graphical user interface (GUI) module, and audio device control module are all implemented as separate processes to improve extendibility. These processes have been ported to several operating systems, such as Linux, SGI IRIX, and Microsoft Windows, and can be distributed over a LAN (Ethernet) and connected by using a network protocol called Remote Audio Control Protocol (RACP), which we have designed to enable efficient sharing of audio signals and various types of control information. This protocol is an extension of the remote music control protocol (RMCP) [29] enabling the transmission of audio signals.

9 Both a "PREV CHORUS" and a "NEXT CHORUS" button may also be prepared in the playback operation window. Only one button was used here for the following reasons. 1) Pushing the present NEXT CHORUS button repeatedly loops through all chorus sections, enabling the desired chorus to be found quickly. 2) A previous chorus can be returned to immediately by simply clicking on that section in the song structure display window.

Fig. 8. Demonstration of SmartMusicKIOSK implemented on a tablet PC.

Fig. 8 shows a photograph of the SmartMusicKIOSK system taken during a technical demonstration in February 2003. This system can be executed on a stand-alone tablet PC (Microsoft Windows XP Tablet PC Edition, Pentium III 933-MHz CPU) as shown in the center of the photograph. It can be operated by touching the screen with a pen or by pushing the keys of an external keypad (center-right of the photograph) that duplicates the playback button group shown on the screen.

Our experience with the SmartMusicKIOSK demonstration showed that the proposed interface was effective enough to enable listeners to play back songs in an interactive manner by pushing jump buttons while receiving visual assistance from the music map display. The music map facilitated jump operations and the listening to various parts of a song while moving back and forth as desired on the song structure. The proposed functions were intuitively easy to use, requiring no training: listeners who had received no explanation about jump button functions or display windows were nevertheless able to surmise their purpose in little time.

D. Discussion

In the following, we consider how interaction in music playback need not be limited to trial listening scenarios, and discuss what kinds of situations our method can be applied to.

1) Interface for Active Listening of Music: In recent years, the music usage scene has been expanding, and usage styles of choosing music as one wishes, checking its content, and at times even extracting portions of music have likewise been increasing. For example, in addition to trial listening of CDs at music stores, end users select musical ring tones for cellular phones, find background music appropriate to certain situations, and use music on the World Wide Web. On the other hand, interfaces for music playback have become fixed to standard playback operation buttons even after the appearance of the CD player and computer-based media players as described in Section VI-A. Interfaces of this type, while suitable for passive appreciation of music, are inadequate for interactively finding sections of interest within a song.

As a general interface for music playback, we can see SmartMusicKIOSK as adding an interface that targets structural sections of a song as operational units, in contrast to the conventional interface (e.g., a CD player) that targets only songs as operational units. With this conventional interface, songs of no interest to the listener can easily be skipped, but skipping sections of no interest within a particular song is not as easy. An outstanding advantage of the SmartMusicKIOSK interface is the ability to "listen to any part of a song whenever one likes" without having to follow the timeline of the original song. Extending this idea, it would be interesting to add a "shuffle play" function in units of musical sections by drawing an analogy from operation in song units.

While not expected when building this interface, an interesting phenomenon has appeared in situations that permit long-term listening as opposed to trial listening. Specifically, we have found some listeners tend to listen to music in a more analytical fashion, compared to past forms of music appreciation, when they can interactively change the playback position while viewing the structure of a musical piece. For example, we have observed listeners checking the kind of structure possessed by an entire piece, listening to each section in that structure, and comparing sections that repeat. Another finding is that visualization of a song's structure has proven to be interesting and useful for listeners who just want to passively appreciate music.

2) Other Applications: In addition to the SmartMusicKIOSK application, the RefraiD method has a potentially wide range of application. The following presents other application examples.
• Digital listening station: The RefraiD method could enable digital listening stations to excerpt and store chorus sections instead of mechanically stored excerpts. In the future, we hope to see digital listening stations in music stores upgrade to functions such as those of SmartMusicKIOSK.
• Music thumbnail: The ability to play back (preview) just the beginning of a chorus section detected by the RefraiD method would provide added convenience when browsing through a large set of songs or when presenting search results of music information retrieval. This function can be regarded as a music version of the image thumbnail.
• Computer-based media players: A variety of functions have recently been added to media players, such as exchangeable appearance (skins) and music-synchronized animation in the form of geometrical drawings moving synchronously with waveforms and frequency spectra during playback. No essential progress, however, has been seen in the interface itself. We hope not only that the SmartMusicKIOSK interface will be adopted for various media players, but also that other approaches of reexamining the entire functional makeup of music playback interfaces will follow.

VII. CONCLUSION

We have described the RefraiD method, which detects chorus sections and repeated sections in real-world popular music audio signals. It basically regards the most repeated sections as the chorus sections. Analysis of the relationships between various repeated sections enables all the chorus sections to be detected with their beginning and end points. In addition, introducing the similarity between nonshifted and shifted chroma vectors makes it possible to detect modulated chorus sections. Experimental results with the "RWC Music Database: Popular Music" showed that the method was robust enough to correctly detect the chorus sections in 80 of 100 songs.

We have also described the SmartMusicKIOSK application system, which is a music listening station based on the RefraiD method. It provides content-based playback controls allowing a listener to skim rapidly through music, plus a graphical overview of the entire song structure. While entire songs of no interest to a listener can be skipped on conventional music playback interfaces, SmartMusicKIOSK is the first interface that allows the listener to easily skip sections of no interest even within a song.

The RefraiD method has relevance to music summarization studies [6], [8]–[12], [30], none of which has addressed the problem of detecting all the chorus sections. One of the chorus sections detected by our method can be regarded as a song summary, as could another long repeated section in the intermediate-result list of repeated sections. Music summarization studies aimed at shortening the length of a song are also related to SmartMusicKIOSK because they share one of the objectives of trial listening, that is, to listen to music in a short time. Previous studies, however, have not considered an interactive form of listening as taken up by our research. From the viewpoint of trial listening, the ability of a listener to easily select any section of a song for listening in a true interactive fashion is very effective, as discussed in Section VI-D.1.

Our repetition-based approach of the RefraiD method has proven effective for popular music. To improve the performance of the method, however, we will need to use prior information on acoustic features unique to choruses. We also plan to experiment with other music genres and extend the method to make it widely applicable. In addition, our future work will include research on new directions of making interaction between people and music even more active and enriching.

ACKNOWLEDGMENT

The author would like to thank H. Asoh (National Institute of Advanced Industrial Science and Technology) for his valuable discussions and the anonymous reviewers for their helpful comments and suggestions.

REFERENCES

[1] M. Goto, "Music scene description project: toward audio-based real-time music understanding," in Proc. Int. Conf. Music Information Retrieval, 2003, pp. 231–232.
[2] M. Goto, "A real-time music scene description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals," Speech Commun., vol. 43, no. 4, pp. 311–329, 2004.
[3] B. Logan and S. Chu, "Music summarization using key phrases," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 2000, pp. II-749–II-752.
[4] M. A. Bartsch and G. H. Wakefield, "To catch a chorus: using chroma-based representations for audio thumbnailing," in Proc. IEEE Workshop Applications of Signal Processing to Audio and Acoustics, 2001, pp. 15–18.

[5] M. Cooper and J. Foote, "Automatic music summarization via similarity analysis," in Proc. Int. Conf. Music Information Retrieval, 2002, pp. 81–85.
[6] R. B. Dannenberg and N. Hu, "Pattern discovery techniques for music audio," J. New Music Res., vol. 32, no. 2, pp. 153–163, 2003.
[7] R. B. Dannenberg and N. Hu, "Discovering musical structure in audio recordings," in Proc. Int. Conf. Music and Artificial Intelligence, 2002, pp. 43–57.
[8] G. Peeters, A. L. Burthe, and X. Rodet, "Toward automatic music audio summary generation from signal analysis," in Proc. Int. Conf. Music Information Retrieval, 2002, pp. 94–100.
[9] G. Peeters and X. Rodet, "Signal-based music structure discovery for music audio summary generation," in Proc. Int. Computer Music Conference, 2003, pp. 15–22.
[10] J. T. Foote and M. L. Cooper, "Media segmentation using self-similarity decomposition," in Proc. SPIE Storage and Retrieval for Media Databases, vol. 5021, 2003, pp. 167–175.
[11] M. Cooper and J. Foote, "Summarizing popular music via structural similarity analysis," in Proc. IEEE Workshop Applications of Signal Processing to Audio and Acoustics, 2003, pp. 127–130.
[12] W. Chai and B. Vercoe, "Structural analysis of musical signals for indexing and thumbnailing," in Proc. ACM/IEEE Joint Conf. Digital Libraries, 2003, pp. 27–34.
[13] J. Wellhausen and H. Crysandt, "Temporal audio segmentation using MPEG-7 descriptors," Proc. SPIE Storage and Retrieval for Media Databases, vol. 5021, pp. 380–387, 2003.
[14] J.-J. Aucouturier and M. Sandler, "Finding repeating patterns in acoustic musical signals: Applications for audio thumbnailing," in Proc. AES 22nd Int. Conf. Virtual, Synthetic and Entertainment Audio, 2002, pp. 412–421.
[15] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Popular, classical, and jazz music databases," in Proc. Int. Conf. Music Information Retrieval, 2002, pp. 287–288.
[16] M. Goto, "Development of the RWC music database," in Proc. Int. Congr. Acoustics, 2004, pp. I-553–I-556.
[17] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Syst., Man, Cybern., vol. SMC-9, no. 1, pp. 62–66, Jan. 1979.
[18] R. N. Shepard, "Circularity in judgments of relative pitch," J. Acoust. Soc. Amer., vol. 36, no. 12, pp. 2346–2353, 1964.
[19] B. C. J. Moore, An Introduction to the Psychology of Hearing, 4th ed. New York: Academic, 1997.
[20] W. Fujisaki and M. Kashino, "Basic hearing abilities and characteristics of musical pitch perception in absolute pitch possessors," in Proc. Int. Congr. Acoustics, 2004, pp. V-3607–V-3610.
[21] G. H. Wakefield, "Mathematical representation of joint time-chroma distributions," Proc. SPIE, pp. 637–645, 1999.
[22] H. Yamada, M. Goto, H. Saruwatari, and K. Shikano, "Multi-timbre chord classification for musical audio signals (in Japanese)," in Proc. Autumn Meeting Acoustical Soc. Japan, Sep. 2002, pp. 641–642.
[23] H. Yamada, M. Goto, H. Saruwatari, and K. Shikano, "Multi-timbre chord classification method for musical audio signals: Application to musical pieces (in Japanese)," in Proc. Spring Meeting Acoustical Soc. Japan, Mar. 2003, pp. 835–836.
[24] T. Fujishima, "Realtime chord recognition of musical sound: A system using common lisp music," in Proc. Int. Computer Music Conf., 1999, pp. 464–467.
[25] A. Sheh and D. P. Ellis, "Chord segmentation and recognition using EM-trained hidden Markov models," in Proc. Int. Conf. Music Information Retrieval, 2003, pp. 183–189.
[26] T. Yoshioka, T. Kitahara, K. Komatani, T. Ogata, and H. G. Okuno, "Automatic chord transcription with concurrent recognition of chord symbols and boundaries," in Proc. Int. Conf. Music Information Retrieval, 2004, pp. 100–105.
[27] A. Savitzky and M. J. Golay, "Smoothing and differentiation of data by simplified least squares procedures," Anal. Chem., vol. 36, no. 8, pp. 1627–1639, 1964.
[28] C. J. van Rijsbergen, Information Retrieval, 2nd ed. London, U.K.: Butterworths, 1979.
[29] M. Goto, R. Neyama, and Y. Muraoka, "RMCP: Remote music control protocol, design and applications," in Proc. Int. Computer Music Conf., 1997, pp. 446–449.
[30] K. Hirata and S. Matsuda, "Interactive music summarization based on GTTM," in Proc. Int. Conf. Music Information Retrieval, 2002, pp. 86–93.

Masataka Goto received the Doctor of Engineering degree in electronics, information, and communication engineering from Waseda University, Tokyo, Japan, in 1998.

He then joined the Electrotechnical Laboratory (ETL; reorganized as the National Institute of Advanced Industrial Science and Technology (AIST) in 2001), Tsukuba, Ibaraki, Japan, where he has since been a Research Scientist. He served concurrently as a Researcher in Precursory Research for Embryonic Science and Technology (PRESTO), Japan Science and Technology Corporation (JST) from 2000 to 2003, and an Associate Professor of the Department of Intelligent Interaction Technologies, Graduate School of Systems and Information Engineering, University of Tsukuba, since 2005. His research interests include music information processing and spoken-language processing.

Dr. Goto is a member of the Information Processing Society of Japan (IPSJ), Acoustical Society of Japan (ASJ), Japanese Society for Music Perception and Cognition (JSMPC), Institute of Electronics, Information, and Communication Engineers (IEICE), and the International Speech Communication Association (ISCA). He has received 17 awards, including the IPSJ Best Paper Award and IPSJ Yamashita SIG Research Awards (special interest group on music and computer, and spoken language processing) from the IPSJ, the Awaya Prize for Outstanding Presentation and Award for Outstanding Poster Presentation from the ASJ, Award for Best Presentation from the JSMPC, Best Paper Award for Young Researchers from the Kansai-Section Joint Convention of Institutes of Electrical Engineering, WISS 2000 Best Paper Award and Best Presentation Award, and Interaction 2003 Best Paper Award.
