Abstract—This paper describes a method for obtaining a list of repeated chorus ("hook") sections in compact-disc recordings of popular music. The detection of chorus sections is essential for the computational modeling of music understanding and is useful in various applications, such as automatic chorus-preview/search functions in music listening stations, music browsers, or music retrieval systems. Most previous methods detected as a chorus a repeated section of a given length and had difficulty identifying both ends of a chorus section and dealing with modulations (key changes). By analyzing relationships between various repeated sections, our method, called RefraiD, can detect all the chorus sections in a song and estimate both ends of each section. It can also detect modulated chorus sections by introducing a perceptually motivated acoustic feature and a similarity measure that enable detection of a repeated chorus section even after modulation. Experimental results with a popular music database showed that this method correctly detected the chorus sections in 80 of 100 songs. This paper also describes an application of our method, a new music-playback interface for trial listening called SmartMusicKIOSK, which enables a listener to jump directly to and listen to the chorus section while viewing a graphical overview of the entire song structure. The results of implementing this application have demonstrated its usefulness.

Index Terms—Chorus detection, chroma vector, music-playback interface, music structure, music understanding.

Manuscript received January 31, 2005; revised October 10, 2005. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Malcom Slaney. The author is with the Information Technology Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki 305-8568, Japan (e-mail: m.goto@aist.go.jp). Digital Object Identifier 10.1109/TSA.2005.863204

1558-7916/$20.00 © 2006 IEEE

I. INTRODUCTION

CHORUS ("hook" or refrain) sections of popular music are the most representative, uplifting, and prominent thematic sections in the music structure of a song, and human listeners can easily understand where the chorus sections are because these sections are the most repeated and memorable portions of a song. Automatic detection of chorus sections is essential for building a music-scene-description system [1], [2] that can understand musical audio signals in a human-like fashion, and is useful in various practical applications. In music browsers or music retrieval systems, it enables a listener to quickly preview a chorus section as an "audio thumbnail" to find a desired song. It can also increase the efficiency and precision of music retrieval systems by enabling them to match a query with only the chorus sections.

This paper describes a method, called Refrain Detecting Method (RefraiD), that exhaustively detects all repeated chorus sections appearing in a song, with a focus on popular music. It can obtain a list of the beginning and end points of every chorus section in real-world audio signals and can detect modulated chorus sections. Furthermore, because it detects chorus sections by analyzing various repeated sections in a song, it can generate an intermediate-result list of repeated sections that usually reflects the song structure; for example, the repetition of a structure like verse A, verse B, and chorus is often found in the list.

This paper also describes a music listening station called SmartMusicKIOSK that was implemented as an application system of the RefraiD method. In music stores, customers typically search out the chorus or "hook" of a song by repeatedly pressing the fast-forward button, rather than passively listening to the music. This activity is not well supported by current technology. Our research has led to a function for jumping to the chorus section and other key parts (repeated sections) of a song, plus a function for visualizing the song structure. These functions eliminate the hassle of searching for the chorus and make it easier for a listener to find desired parts of a song, thereby facilitating an active listening experience.

The following sections introduce related research, describe the problems dealt with, explain the RefraiD method in detail, and show experimental results indicating that the method is robust enough to correctly detect the chorus sections in 80 of 100 songs of a popular-music database. Finally, the SmartMusicKIOSK system and its usefulness are described.

II. RELATED WORK

Most previous chorus detection methods [3]–[5] extract only a single segment from among several chorus sections by detecting a repeated section of a designated length as the most representative part of a song. Logan and Chu [3] developed a method using clustering techniques and hidden Markov models (HMMs) to categorize short segments (1 s) in terms of their acoustic features, where the most frequent category is then regarded as a chorus. Bartsch and Wakefield [4] developed a method that calculates the similarity between acoustic features of beat-length segments obtained by beat tracking and finds the given-length segment with the highest similarity averaged over its segment. Cooper and Foote [5] developed a method that calculates a similarity matrix of acoustic features of short frames (100 ms) and
1784 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING , VOL. 14, NO. 5, SEPTEMBER 2006
finds the given-length segment with the highest similarity between it and the whole song. Note that these methods assume that the output segment length is given and do not identify both ends of a chorus section.

Music segmentation or structure discovery methods [6]–[13], where the output segment length is not assumed, have also been studied. Dannenberg and Hu [6], [7] developed a structure discovery method of clustering pairs of similar segments obtained by several techniques, such as efficient dynamic programming or iterative greedy algorithms. This method finds, groups, and removes similar pairs from the beginning to group all the pairs. Peeters et al. [8] and Peeters and Rodet [9] developed a supervised learning method of modeling dynamic features and studied two structure discovery approaches: the sequence approach of obtaining repetitions of patterns and the state approach of obtaining a succession of states. The dynamic features are selected from the spectrum of a filter-bank output by maximizing the mutual information between the selected features and hand-labeled music structures. Aucouturier and Sandler [14] developed two methods of finding repeated patterns in a succession of states (texture labels) obtained by HMMs. They used two image-processing techniques, kernel convolution and the Hough transform, to detect line segments in the similarity matrix between the states. Foote and Cooper [10], [11] developed a method of segmenting music by correlating a kernel along the diagonal of the similarity matrix and clustering the obtained segments on the basis of the self-similarity of their statistics. Chai and Vercoe [12] developed a method of detecting segment repetitions by using dynamic programming, clustering the obtained segments, and labeling the segments based on heuristic rules, such as the rule of first labeling the most frequent segments, removing them, and repeating the labeling process. Wellhausen and Crysandt [13] studied the similarity matrix of spectral-envelope features defined in the MPEG-7 descriptors and a technique of detecting noncentral diagonal line segments.

None of these methods, however, addresses the problem of detecting all the chorus sections in a song. Furthermore, while chorus sections are sometimes modulated (the key is changed) during their repetition in a song, previously reported methods did not deal with modulated repetition.

III. CHORUS SECTION DETECTION PROBLEM

To enable the handling of a large number of songs in popular music, this research aims for a general and robust chorus section detection method using no prior information on acoustic features unique to choruses. To this end, we focus on the fact that chorus sections are usually the most repeated sections of a song and adopt the following basic strategy: find sections that repeat and output those that appear most often. It must be pointed out, however, that it is difficult for a computer to judge repetition because it is rare for repeated sections to be exactly the same. The following summarizes the main problems that must be addressed in this regard.

[Problem 1] Acoustic Features and Similarity: Whether a section is a repetition of another must be judged on the basis of the similarity between the acoustic features obtained from each section. In this process, the similarity must be high between acoustic features even if the accompaniment or melody line changes somewhat in the repeated section (e.g., the absence of accompaniment on bass and/or drums after repetition). This condition is difficult to satisfy if acoustic features are taken to be simple power spectrums or mel-frequency cepstral coefficients (MFCCs) as used in audio/speech signal processing.

[Problem 2] Repetition Judgment Criterion: The criterion establishing how high similarity must be to indicate repetition depends on the song. For a song containing many repeated accompaniment phrases, for example, only a section with very high similarity should be considered the chorus section repetition. For a song containing a chorus section with accompaniments changed after repetition, on the other hand, a section with somewhat lower similarity can be considered the chorus section repetition. This criterion can be easily set for a small number of specific songs by manual means. For a large open song set, however, the criterion should be automatically adjusted based on the song being processed.

[Problem 3] Estimating Both Ends of Repeated Sections: Both ends (the beginning and end points) of repeated sections must be estimated by examining the mutual relationships among the various repeated sections. For example, given a song having the structure (A B C B C C), the long repetition corresponding to (B C) would be obtained by a simple repetition search. Both ends of the C section in (B C) could be inferred, however, from the information obtained regarding the final repetition of C in this structure.

[Problem 4] Detecting Modulated Repetition: Because the acoustic features of a section generally undergo a significant change after modulation (key change), similarity with the section before modulation is low, making it difficult to judge repetition. The detection of modulated repetition is important since modulation sometimes occurs in chorus repetitions, especially in the latter half of a song.¹

IV. CHORUS SECTION DETECTION METHOD: REFRAID

Fig. 1 shows the process flow of the RefraiD method. First, a 12-dimensional feature vector called a chroma vector, which is robust with respect to changes of accompaniments, is extracted from each frame of an input audio signal, and then the similarity between these vectors is calculated (solution to Problem 1). Each element of the chroma vector corresponds to one of the 12 pitch classes (C, C#, D, D#, E, F, F#, G, G#, A, A#, and B) and is the sum of magnitude at frequencies of its pitch class over six octaves. Pairs of repeated sections are then listed (found) using an adaptive repetition-judgment criterion that is configured by an automatic threshold selection method based on a discriminant criterion [17] (solution to Problem 2). To organize common repeated sections into groups and to identify both ends of each section, the pairs of repeated sections are integrated (grouped) by analyzing their relationships over the whole song (solution to Problem 3).

¹Although a reviewer of this paper pointed out that songs with modulation are generally rare in Western popular music, they are not rare in Japanese popular music, which has been influenced by Western music. We conducted a survey of Japan's popular music hit chart (top 20 singles ranked weekly from fiscal 2000 to fiscal 2003) and found that modulation occurred in chorus repetitions in 152 songs (10.3%) out of 1481.
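The adaptive repetition-judgment criterion mentioned above relies on automatic threshold selection by a discriminant criterion [17], i.e., Otsu's method applied to one-dimensional similarity values. As a hedged illustration of that idea (the function name and the histogram binning are my own assumptions, not the paper's implementation):

```python
import numpy as np

def otsu_threshold(values, n_bins=256):
    """Pick the threshold maximizing between-class variance
    (Otsu's discriminant criterion) over a 1-D sample."""
    hist, edges = np.histogram(values, bins=n_bins)
    p = hist.astype(float) / hist.sum()        # probability mass per bin
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(p)                          # weight of the lower class
    w1 = 1.0 - w0                              # weight of the upper class
    mu0 = np.cumsum(p * centers)               # unnormalized lower-class mean
    mu_total = mu0[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        m0 = mu0 / w0
        m1 = (mu_total - mu0) / w1
        between = w0 * w1 * (m0 - m1) ** 2     # between-class variance
    between = np.nan_to_num(between)           # empty classes contribute 0
    return centers[np.argmax(between)]
```

Applied to the distribution of similarity values or peak heights within one song, the selected threshold separates likely repetitions from background similarity, which is how the criterion can adapt to each song without manual tuning.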
GOTO: CHORUS SECTION DETECTION METHOD FOR MUSICAL AUDIO SIGNALS 1785
TABLE I
LIST OF SYMBOLS

Each element v_c(t) of the chroma vector for pitch class c at time t is obtained by summing, over six octaves, the power spectrum passed through a band-pass filter (BPF) for that pitch class:

$$v_c(t) = \sum_{h=\mathrm{Oct_L}}^{\mathrm{Oct_H}} \int_{-\infty}^{\infty} \mathrm{BPF}_{c,h}(f)\,\Psi_p(f,t)\,df \qquad (1)$$

where $\Psi_p(f,t)$ is the power spectrum at time t on the log-scale frequency axis measured in cents,

$$f_{\mathrm{cent}} = 1200 \log_2\!\left(\frac{f_{\mathrm{Hz}}}{440 \times 2^{3/12-5}}\right) \qquad (2)$$

and $F_{c,h}$, the center frequency (in cents) of pitch class c (1 ≤ c ≤ 12) in octave position h, is

$$F_{c,h} = 1200\,h + 100\,(c-1) \qquad (3)$$

The BPF is defined using a Hanning window as follows:

$$\mathrm{BPF}_{c,h}(f) = \frac{1}{2}\left(1 - \cos\frac{2\pi\,(f-(F_{c,h}-100))}{200}\right) \qquad (4)$$

This filter is applied to octaves from Oct_L to Oct_H. In the current implementation, the input signal is digitized at 16 bit/16 kHz, and then the STFT with a 4096-sample Hanning window is calculated using the fast Fourier transform (FFT). Since the FFT frame is shifted by 1280 samples, the discrete time step (one frame shift) is 80 ms. Oct_L and Oct_H, the octave range for the summation in (1), are three and eight, respectively. This covers six octaves (130 Hz–8 kHz).

There are several advantages to using the chroma vector.² Because it captures the overall harmony (pitch-class distribution), it can be similar even if accompaniments or melody lines are changed to some degree after repetition. In fact, we have confirmed that the chroma vector is effective for identifying chord names [22], [23].³ The chroma vector also enables modulated repetition to be detected, as described in Section IV-E.

B. Calculate Similarity

The similarity r(t, l) between the chroma vectors v(t) and v(t − l), where l is the lag, is defined as

$$r(t,l) = 1 - \frac{1}{\sqrt{12}}\left\| \frac{\mathbf{v}(t)}{\max_c v_c(t)} - \frac{\mathbf{v}(t-l)}{\max_c v_c(t-l)} \right\| \qquad (5)$$

Because $\sqrt{12}$ is the length of the diagonal of the 12-dimensional unit hypercube, r(t, l) satisfies 0 ≤ r(t, l) ≤ 1.

Fig. 4. Sketch of line segments, the similarity r(t, l), and the possibility R_all(t, l) of containing line segments. The similarity r(t, l) is defined in the right-angled isosceles triangle (time-lag triangle) in the lower right-hand corner. The actual r(t, l) is noisy and ambiguous and usually contains many line segments irrelevant to chorus sections.

C. List Repeated Sections

Pairs of repeated sections are obtained from the similarity r(t, l). Considering that r(t, l) is drawn within a right-angled isosceles triangle in the two-dimensional time-lag space (time-lag triangle), as shown in Fig. 4, the method finds line segments that are parallel to the horizontal time axis and that indicate consecutive regions with high r(t, l). When the section between times Ts and Te is denoted [Ts, Te], each line segment between the points (Ts, l1) and (Te, l1) is represented as ([Ts, Te], l1), which means that the section [Ts, Te] is similar to (i.e., is a repetition of) the section [Ts − l1, Te − l1]. In other words, each horizontal line segment in the time-lag triangle indicates a repeated-section pair. We, therefore, need to detect all horizontal line segments in the time-lag triangle r(t, l). To find a horizontal line segment ([Ts, Te], l1), the possibility R_all(t, l)⁴ of containing line segments at the lag l is evaluated at the current time t (e.g., at the end of a song) as follows (Fig. 4):

$$R_{\mathrm{all}}(t,l) = \frac{1}{t}\int_{l}^{t} r(\tau, l)\,d\tau \qquad (6)$$
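To make the chroma and similarity computations above concrete, here is a hedged NumPy sketch using the paper's stated parameters (16-kHz input, 4096-sample Hanning window, 1280-sample hop, octaves 3–8). The helper names are mine, and a rectangular half-semitone band around each pitch-class center stands in for the cent-domain Hanning band-pass filter of (4):

```python
import numpy as np

FS = 16000      # sampling rate: the paper digitizes input at 16 bit / 16 kHz
N_FFT = 4096    # STFT Hanning window length
HOP = 1280      # frame shift, giving an 80-ms discrete time step

def chroma_frame(spectrum):
    """Fold one STFT spectrum into a 12-bin chroma vector over
    octaves 3..8 (about 130 Hz to 8 kHz).  A rectangular band of
    +/- 50 cents approximates the paper's Hanning-shaped BPF."""
    freqs = np.fft.rfftfreq(N_FFT, d=1.0 / FS)
    v = np.zeros(12)
    for h in range(3, 9):                 # octaves Oct_L = 3 .. Oct_H = 8
        for c in range(12):               # pitch classes C .. B
            midi = 12 * (h + 1) + c       # MIDI note number of (c, h)
            f_center = 440.0 * 2.0 ** ((midi - 69) / 12.0)
            lo = f_center * 2.0 ** (-0.5 / 12.0)
            hi = f_center * 2.0 ** (+0.5 / 12.0)
            band = (freqs >= lo) & (freqs < hi)
            v[c] += np.abs(spectrum[band]).sum()
    return v

def similarity(v_t, v_tl):
    """r(t, l): one minus the distance between the two chroma vectors,
    each normalized by its maximum element; sqrt(12), the diagonal of
    the 12-dimensional unit hypercube, keeps the result in [0, 1]."""
    d = v_t / v_t.max() - v_tl / v_tl.max()
    return 1.0 - np.linalg.norm(d) / np.sqrt(12.0)
```

For example, a frame containing a 440-Hz tone concentrates its chroma energy in the pitch class A, and any chroma vector has similarity 1 with itself.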
Fig. 5. Examples of the similarity r(τ, l1) at high-peak lags l1. The bottom horizontal bars indicate the regions above an automatically adjusted threshold, which means they correspond to line segments.

Fig. 6. Sketch of a group φ = ([Ts, Te], Γ) of line segments that have almost the same section [Ts, Te], a set Γ of those lags γj (j = 1, 2, ..., 5), and the possibility R_[Ts,Te](l) of containing line segments within [Ts, Te].
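The time-lag construction of Fig. 4 and the lag-wise evidence of (6) can be sketched numerically as follows; this is a simplified paraphrase under my own assumptions (a discrete sum stands in for the integral, and the function names are illustrative):

```python
import numpy as np

def time_lag_triangle(chroma):
    """Build r(t, l) for all 0 <= l <= t from a (frames x 12) chroma
    sequence; entries outside the triangle stay at 0."""
    n = len(chroma)
    v = chroma / chroma.max(axis=1, keepdims=True)   # per-frame max-normalize
    r = np.zeros((n, n))
    for t in range(n):
        for l in range(t + 1):
            r[t, l] = 1.0 - np.linalg.norm(v[t] - v[t - l]) / np.sqrt(12.0)
    return r

def r_all(r, t):
    """Lag-wise possibility of line segments up to time t, in the
    spirit of Eq. (6): a normalized sum of r(tau, l) over tau."""
    return np.array([r[l:t + 1, l].sum() / (t + 1) for l in range(t + 1)])
```

A horizontal run of high r(t, l) at a fixed lag l then marks a repeated-section pair, and lags where r_all peaks are candidates for the line-segment search.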
can expect that two line segments corresponding to the repetition of the first and third C and the repetition of the second and fourth C, which overlap with the long line segment corresponding to the repetition of ABCC, are found even if they were hard to find in the bottom-up process.

For this purpose, line segments are searched again just within the section [Ts, Te] of each group φ. Starting from the possibility of containing line segments within [Ts, Te],

$$R_{[T_s,T_e]}(l) = \frac{1}{T_e - T_s}\int_{T_s}^{T_e} r(t, l)\,dt \qquad (9)$$

instead of R_all(t, l), the method performs almost the same peak-picking process described in Section IV-C and forms a new set of high peaks above a threshold in Γ (Fig. 6). In more detail, it picks up each peak by finding a point where the smoothed differential (10) of R_[Ts,Te](l) changes sign from positive to negative. Before this calculation, it also removes the global drift in the same way, by smoothing with the second-order cardinal B-spline having a fixed number of points on each slope. This threshold is again adjusted using the above automatic threshold selection method based on the discriminant criterion. Here, the method optimizes the threshold by dichotomizing all local peak heights of R_[Ts,Te](l) taken from all groups φ.

The method then removes inappropriate peaks in each Γ as follows.

1) Remove unnecessary peaks that are equally spaced. When similar accompaniments are repeated throughout most of a song, peaks irrelevant to chorus sections tend to appear at even intervals in R_[Ts,Te](l). A group where the number of equally spaced peaks exceeds a threshold is judged to be irrelevant to chorus sections and is removed. For this judgment, we consider only peaks that are higher than a threshold determined by the standard deviation of the lower half of peaks. In addition, when the number of equally spaced low peaks is more than a threshold, those peaks are judged to be irrelevant to chorus sections and are removed from Γ. For this judgment, we consider only peaks that are higher than the above threshold and lower than the average of the above threshold and the highest peak.

2) Remove a peak whose line segment has a highly deviated similarity. When only part of the similarity r(t, l) at a peak within [Ts, Te] is high, its peak is not appropriate for use. A peak is removed from Γ when the standard deviation of r(t, l) after smoothing with the above second-order cardinal B-spline (having a fixed number of points on each slope) is larger than a threshold. Since peaks detected in Section IV-C can be considered reliable, this threshold is determined as a constant multiplied by the maximum of the above standard deviation at all those peaks.

3) Remove a peak that is too close to other peaks and causes sections to overlap. To avoid sections overlapping, it is necessary to make the interval between adjacent peaks along the lag greater than the length of its section. One of every pair of peaks having an interval less than the section length is removed so that higher peaks can remain overall.

Finally, by using the lag corresponding to each peak of R_[Ts,Te](l), the method searches for a group whose section is shared by the current group φ and integrates it with φ if it is found. They are integrated by adding all the peaks of the found group to Γ after adjusting the lag values (peak positions); the found group is then removed. In addition, if there is a group that has a peak indicating the section [Ts, Te], it too is integrated.

E. Integrate Repeated Sections With Modulation

The processes described above do not deal with modulation (key change), but they can easily be extended to it. A modulation can be represented by the pitch difference of its key change, tr, which denotes the number of tempered semitones. For example, tr = 9 means the modulation of nine semitones upward or the modulation of three semitones downward. One of the advantages of the 12-dimensional chroma vector is that the transposition amount tr of the modulation can naturally correspond to the amount by which its 12 elements are shifted (rotated). When v(t) is the chroma vector of a certain performance and v′(t) is the chroma vector of the performance that is modulated by tr semitones upward from the original performance, they tend to satisfy

$$\mathbf{v}'(t) \approx S^{tr}\,\mathbf{v}(t) \qquad (11)$$

where S is a 12-by-12 shift matrix⁵ defined by

$$S = \begin{pmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix} \qquad (12)$$

To detect the modulated repetition by using this feature of chroma vectors and considering 12 destination keys, we calculate 12 kinds of extended similarity r_tr(t, l) for each tr as follows:

$$r_{tr}(t,l) = 1 - \frac{1}{\sqrt{12}}\left\| S^{tr}\frac{\mathbf{v}(t)}{\max_c v_c(t)} - \frac{\mathbf{v}(t-l)}{\max_c v_c(t-l)} \right\| \qquad (13)$$

Starting from each r_tr(t, l), the processes of listing and integrating the repeated sections are performed as described in Sections IV-C and D, except that the threshold automatically adjusted at tr = 0 is used for the processes at tr ≠ 0 (which suppresses harmful false detection of nonrepeated sections). After these processes, 12 sets of line-segment groups are obtained for the 12 kinds of tr. To organize nonmodulated and modulated repeated sections into the same groups, the method integrates several groups across all the sets if they share the same section.

⁵Note that this shift (rotation) operation is not applicable to other acoustic features such as simple power spectrums and MFCC features.
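The shift-matrix formulation can be sanity-checked numerically; in the sketch below, np.roll realizes the rotation that S applies, the helper names are mine rather than the paper's, and which of the two vectors carries the shift is a convention of this paraphrase:

```python
import numpy as np

def shift_matrix(tr):
    """12x12 rotation that raises each pitch-class bin by tr semitones
    (with wrap-around), i.e., the tr-th power of the one-semitone
    shift matrix S."""
    S = np.zeros((12, 12))
    for c in range(12):
        S[(c + 1) % 12, c] = 1.0       # bin c feeds bin c+1 (mod 12)
    return np.linalg.matrix_power(S, tr % 12)

def extended_similarity(v_t, v_tl, tr):
    """r_tr-style score: shift one max-normalized chroma vector by tr
    semitones before comparing, so a modulated repetition can still
    score close to 1."""
    a = shift_matrix(tr) @ (v_tl / v_tl.max())
    b = v_t / v_t.max()
    return 1.0 - np.linalg.norm(a - b) / np.sqrt(12.0)

# Example: a chroma vector and the same content modulated up 3 semitones.
rng = np.random.default_rng(1)
v = rng.random(12)
v_mod = np.roll(v, 3)                  # rotation equals multiplying by S^3
```

With the matching tr, the modulated copy scores 1; with tr = 0 it scores clearly lower, which is why plain chroma similarity misses modulated repetitions while the 12 extended similarities recover them.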
TABLE III
RESULTS OF EVALUATING REFRAID: NUMBER OF SONGS WHOSE CHORUS SECTIONS WERE DETECTED CORRECTLY UNDER FOUR SETS OF CONDITIONS

0.974 by using the modulation detector. In all cases, the modulated sections themselves were not correctly detected when the modulation detector was disabled because the similarity based on chroma vectors is sensitive to the modulation. These results show the effectiveness of the modulation detector and assumptions 2 and 3.
Trial listening, on the other hand, differs from the latter type, that
is, musical appreciation, since it involves listening to musical
selections while taking an active part in their playback. This is
why we felt that this activity would be a new and interesting
subject for research.
As a general interface for music playback, we can see SmartMusicKIOSK as adding an interface that targets structural sections of a song as operational units, in contrast to the conventional interface (e.g., a CD player) that targets only songs as operational units. With this conventional interface, songs of no interest to the listener can easily be skipped, but skipping sections of no interest within a particular song is not as easy. An outstanding advantage of the SmartMusicKIOSK interface is the ability to "listen to any part of a song whenever one likes" without having to follow the timeline of the original song. Extending this idea, it would be interesting to add a "shuffle play" function in units of musical sections by drawing an analogy from operation in song units.

While not expected when building this interface, an interesting phenomenon has appeared in situations that permit long-term listening as opposed to trial listening. Specifically, we have found that some listeners tend to listen to music in a more analytical fashion, compared to past forms of music appreciation, when they can interactively change the playback position while viewing the structure of a musical piece. For example, we have observed listeners checking the kind of structure possessed by an entire piece, listening to each section in that structure, and comparing sections that repeat. Another finding is that visualization of a song's structure has proven to be interesting and useful for listeners who just want to passively appreciate music.

2) Other Applications: In addition to the SmartMusicKIOSK application, the RefraiD method has a potentially wide range of applications. The following presents other application examples.

• Digital listening station: The RefraiD method could enable digital listening stations to excerpt and store chorus sections instead of mechanically stored excerpts. In the future, we hope to see digital listening stations in music stores upgrade to functions such as those of SmartMusicKIOSK.

• Music thumbnail: The ability to play back (preview) just the beginning of a chorus section detected by the RefraiD method would provide added convenience when browsing through a large set of songs or when presenting search results of music information retrieval. This function can be regarded as a music version of the image thumbnail.

• Computer-based media players: A variety of functions have recently been added to media players, such as exchangeable appearance (skins) and music-synchronized animation in the form of geometrical drawings moving synchronously with waveforms and frequency spectrums during playback. No essential progress, however, has been seen in the interface itself. We hope not only that the SmartMusicKIOSK interface will be adopted for various media players, but also that other approaches of reexamining the entire functional makeup of music playback interfaces will follow.

VII. CONCLUSION

We have described the RefraiD method, which detects chorus sections and repeated sections in real-world popular music audio signals. It basically regards the most repeated sections as the chorus sections. Analysis of the relationships between various repeated sections enables all the chorus sections to be detected with their beginning and end points. In addition, introducing the similarity between nonshifted and shifted chroma vectors makes it possible to detect modulated chorus sections. Experimental results with the "RWC Music Database: Popular Music" showed that the method was robust enough to correctly detect the chorus sections in 80 of 100 songs.

We have also described the SmartMusicKIOSK application system, which is a music listening station based on the RefraiD method. It provides content-based playback controls allowing a listener to skim rapidly through music, plus a graphical overview of the entire song structure. While entire songs of no interest to a listener can be skipped on conventional music playback interfaces, SmartMusicKIOSK is the first interface that allows the listener to easily skip sections of no interest even within a song.

The RefraiD method has relevance to music summarization studies [6], [8]–[12], [30], none of which has addressed the problem of detecting all the chorus sections. One of the chorus sections detected by our method can be regarded as a song summary, as could another long repeated section in the intermediate-result list of repeated sections. Music summarization studies aimed at shortening the length of a song are also related to SmartMusicKIOSK because they share one of the objectives of trial listening, that is, to listen to music in a short time. Previous studies, however, have not considered an interactive form of listening as taken up by our research. From the viewpoint of trial listening, the ability of a listener to easily select any section of a song for listening in a truly interactive fashion is very effective, as discussed in Section VI-D.1.

Our repetition-based approach in the RefraiD method has proven effective for popular music. To improve the performance of the method, however, we will need to use prior information on acoustic features unique to choruses. We also plan to experiment with other music genres and extend the method to make it widely applicable. In addition, our future work will include research on new directions for making interaction between people and music even more active and enriching.

ACKNOWLEDGMENT

The author would like to thank H. Asoh (National Institute of Advanced Industrial Science and Technology) for his valuable discussions and the anonymous reviewers for their helpful comments and suggestions.

REFERENCES

[1] M. Goto, "Music scene description project: Toward audio-based real-time music understanding," in Proc. Int. Conf. Music Information Retrieval, 2003, pp. 231–232.
[2] M. Goto, "A real-time music scene description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals," Speech Commun., vol. 43, no. 4, pp. 311–329, 2004.
[3] B. Logan and S. Chu, "Music summarization using key phrases," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, 2000, pp. II-749–II-752.
[4] M. A. Bartsch and G. H. Wakefield, "To catch a chorus: Using chroma-based representations for audio thumbnailing," in Proc. IEEE Workshop Applications of Signal Processing to Audio and Acoustics, 2001, pp. 15–18.
[5] M. Cooper and J. Foote, "Automatic music summarization via similarity analysis," in Proc. Int. Conf. Music Information Retrieval, 2002, pp. 81–85.
[6] R. B. Dannenberg and N. Hu, "Pattern discovery techniques for music audio," J. New Music Res., vol. 32, no. 2, pp. 153–163, 2003.
[7] R. B. Dannenberg and N. Hu, "Discovering musical structure in audio recordings," in Proc. Int. Conf. Music and Artificial Intelligence, 2002, pp. 43–57.
[8] G. Peeters, A. L. Burthe, and X. Rodet, "Toward automatic music audio summary generation from signal analysis," in Proc. Int. Conf. Music Information Retrieval, 2002, pp. 94–100.
[9] G. Peeters and X. Rodet, "Signal-based music structure discovery for music audio summary generation," in Proc. Int. Computer Music Conf., 2003, pp. 15–22.
[10] J. T. Foote and M. L. Cooper, "Media segmentation using self-similarity decomposition," in Proc. SPIE Storage and Retrieval for Media Databases, vol. 5021, 2003, pp. 167–175.
[11] M. Cooper and J. Foote, "Summarizing popular music via structural similarity analysis," in Proc. IEEE Workshop Applications of Signal Processing to Audio and Acoustics, 2003, pp. 127–130.
[12] W. Chai and B. Vercoe, "Structural analysis of musical signals for indexing and thumbnailing," in Proc. ACM/IEEE Joint Conf. Digital Libraries, 2003, pp. 27–34.
[13] J. Wellhausen and H. Crysandt, "Temporal audio segmentation using MPEG-7 descriptors," in Proc. SPIE Storage and Retrieval for Media Databases, vol. 5021, 2003, pp. 380–387.
[14] J.-J. Aucouturier and M. Sandler, "Finding repeating patterns in acoustic musical signals: Applications for audio thumbnailing," in Proc. AES 22nd Int. Conf. Virtual, Synthetic and Entertainment Audio, 2002, pp. 412–421.
[15] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Popular, classical, and jazz music databases," in Proc. Int. Conf. Music Information Retrieval, 2002, pp. 287–288.
[16] M. Goto, "Development of the RWC music database," in Proc. Int. Congr. Acoustics, 2004, pp. I-553–I-556.
[17] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Trans. Syst., Man, Cybern., vol. SMC-9, no. 1, pp. 62–66, Jan. 1979.
[18] R. N. Shepard, "Circularity in judgments of relative pitch," J. Acoust. Soc. Amer., vol. 36, no. 12, pp. 2346–2353, 1964.
[19] B. C. J. Moore, An Introduction to the Psychology of Hearing, 4th ed. New York: Academic, 1997.
[20] W. Fujisaki and M. Kashino, "Basic hearing abilities and characteristics of musical pitch perception in absolute pitch possessors," in Proc. Int. Congr. Acoustics, 2004, pp. V-3607–V-3610.
[21] G. H. Wakefield, "Mathematical representation of joint time-chroma distributions," Proc. SPIE, pp. 637–645, 1999.
[22] H. Yamada, M. Goto, H. Saruwatari, and K. Shikano, "Multi-timbre chord classification for musical audio signals (in Japanese)," in Proc. Autumn Meeting Acoustical Soc. Japan, Sep. 2002, pp. 641–642.
[23] H. Yamada, M. Goto, H. Saruwatari, and K. Shikano, "Multi-timbre chord classification method for musical audio signals: Application to musical pieces (in Japanese)," in Proc. Spring Meeting Acoustical Soc. Japan, Mar. 2003, pp. 835–836.
[24] T. Fujishima, "Realtime chord recognition of musical sound: A system using common lisp music," in Proc. Int. Computer Music Conf., 1999, pp. 464–467.
[25] A. Sheh and D. P. Ellis, "Chord segmentation and recognition using EM-trained hidden Markov models," in Proc. Int. Conf. Music Information Retrieval, 2003, pp. 183–189.
[26] T. Yoshioka, T. Kitahara, K. Komatani, T. Ogata, and H. G. Okuno, "Automatic chord transcription with concurrent recognition of chord symbols and boundaries," in Proc. Int. Conf. Music Information Retrieval, 2004, pp. 100–105.
[27] A. Savitzky and M. J. E. Golay, "Smoothing and differentiation of data by simplified least squares procedures," Anal. Chem., vol. 36, no. 8, pp. 1627–1639, 1964.
[28] C. J. van Rijsbergen, Information Retrieval, 2nd ed. London, U.K.: Butterworths, 1979.
[29] M. Goto, R. Neyama, and Y. Muraoka, "RMCP: Remote music control protocol, design and applications," in Proc. Int. Computer Music Conf., 1997, pp. 446–449.
[30] K. Hirata and S. Matsuda, "Interactive music summarization based on GTTM," in Proc. Int. Conf. Music Information Retrieval, 2002, pp. 86–93.

Masataka Goto received the Doctor of Engineering degree in electronics, information, and communication engineering from Waseda University, Tokyo, Japan, in 1998.

He then joined the Electrotechnical Laboratory (ETL; reorganized as the National Institute of Advanced Industrial Science and Technology (AIST) in 2001), Tsukuba, Ibaraki, Japan, where he has since been a Research Scientist. He served concurrently as a Researcher in Precursory Research for Embryonic Science and Technology (PRESTO), Japan Science and Technology Corporation (JST) from 2000 to 2003, and as an Associate Professor of the Department of Intelligent Interaction Technologies, Graduate School of Systems and Information Engineering, University of Tsukuba, since 2005. His research interests include music information processing and spoken-language processing.

Dr. Goto is a member of the Information Processing Society of Japan (IPSJ), Acoustical Society of Japan (ASJ), Japanese Society for Music Perception and Cognition (JSMPC), Institute of Electronics, Information, and Communication Engineers (IEICE), and the International Speech Communication Association (ISCA). He has received 17 awards, including the IPSJ Best Paper Award and IPSJ Yamashita SIG Research Awards (special interest groups on music and computer, and spoken language processing) from the IPSJ, the Awaya Prize for Outstanding Presentation and Award for Outstanding Poster Presentation from the ASJ, Award for Best Presentation from the JSMPC, Best Paper Award for Young Researchers from the Kansai-Section Joint Convention of Institutes of Electrical Engineering, WISS 2000 Best Paper Award and Best Presentation Award, and Interaction 2003 Best Paper Award.