Creating Music by Listening
by
Tristan Jehan
Diplôme d'Ingénieur en Informatique et Télécommunications
IFSIC, Université de Rennes 1, France, 1997
M.S. Media Arts and Sciences
Massachusetts Institute of Technology, 2000
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning,
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September, 2005
© Massachusetts Institute of Technology 2005. All rights reserved.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Program in Media Arts and Sciences
June 17, 2005
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Tod Machover
Professor of Music and Media
Thesis Supervisor
Accepted by. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Andrew B. Lippman
Chairman, Departmental Committee on Graduate Students
Creating Music by Listening
by Tristan Jehan
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning, on June 17, 2005,
in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
Abstract
Machines have the power and potential to make expressive music on their own.
This thesis aims to computationally model the process of creating music using
experience from listening to examples. Our unbiased signal-based solution mod-
els the life cycle of listening, composing, and performing, turning the machine
into an active musician, instead of simply an instrument. We accomplish this
through an analysis-synthesis technique based on combined perceptual and structural modeling of the musical surface, which leads to a minimal data representation.
We introduce a music cognition framework that results from the interaction
of psychoacoustically grounded causal listening, a time-lag embedded feature
representation, and perceptual similarity clustering. Our bottom-up analysis in-
tends to be generic and uniform by recursively revealing metrical hierarchies and
structures of pitch, rhythm, and timbre. Training is suggested for top-down un-
biased supervision, and is demonstrated with the prediction of downbeat. This
musical intelligence enables a range of original manipulations including song
alignment, music restoration, cross-synthesis or song morphing, and ultimately
the synthesis of original pieces.
Thesis supervisor: Tod Machover, D.M.A.
Title: Professor of Music and Media
Thesis Committee
Thesis supervisor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Tod Machover
Professor of Music and Media
MIT Program in Media Arts and Sciences
Thesis reader. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Peter Cariani
Research Assistant Professor of Physiology
Tufts Medical School
Thesis reader. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
François Pachet
Senior Researcher
Sony Computer Science Laboratory
Thesis reader. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Julius O. Smith III
Associate Professor of Music and (by courtesy) Electrical Engineering
CCRMA, Stanford University
Thesis reader. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Barry Vercoe
Professor of Media Arts and Sciences
MIT Program in Media Arts and Sciences
Acknowledgments
It goes without saying that this thesis is a collaborative piece of work. Much
like the system presented here draws musical ideas and sounds from multiple
song examples, I personally drew ideas, influences, and inspirations from many people, to whom I am very thankful:
My committee: Tod Machover, Peter Cariani, François Pachet, Julius O. Smith
III, Barry Vercoe.
My collaborators and friends: Brian, Mary, Hugo, Carla, Cati, Ben, Ali, An-
thony, Jean-Julien, Hedlena, Giordano, Stacie, Shelly, Victor, Bernd, Fredo,
Joe, Peter, Marc, Sergio, Joe Paradiso, Glorianna Davenport, Sile O'Modhrain,
Deb Roy, Alan Oppenheim.
My Media Lab group and friends: Adam, David, Rob, Gili, Mike, Jacqueline,
Ariane, Laird.
My friends outside of the Media Lab: Jad, Vincent, Gaby, Erin, Brazilnut, the
Wine and Cheese club, 24 Magazine St., 45 Banks St., Rustica, 1369, Anna's Taqueria.
My family: Micheline, René, Cécile, François, and Co.
A good composer does not imitate; he steals.
Igor Stravinsky
Table of Contents
1 Introduction 23
2 Background 29
2.1 Symbolic Algorithmic Composition . . . . . . . . . . . . . . . . . 29
2.2 Hybrid MIDI-Audio Instruments . . . . . . . . . . . . . . . . . . 30
2.3 Audio Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Music information retrieval . . . . . . . . . . . . . . . . . . . . . 33
2.5 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.1 Music analysis/resynthesis . . . . . . . . . . . . . . . . . . 35
2.5.2 Description . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5.3 Hierarchical description . . . . . . . . . . . . . . . . . . . 39
2.5.4 Meaningful sound space . . . . . . . . . . . . . . . . . . . 40
2.5.5 Personalized music synthesis . . . . . . . . . . . . . . . . 41
3 Music Listening 43
3.0.6 Anatomy . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.0.7 Psychoacoustics . . . . . . . . . . . . . . . . . . . . . . . . 44
3.1 Auditory Spectrogram . . . . . . . . . . . . . . . . . . . . . . . . 45
3.1.1 Spectral representation . . . . . . . . . . . . . . . . . . . 45
3.1.2 Outer and middle ear . . . . . . . . . . . . . . . . . . . . 46
3.1.3 Frequency warping . . . . . . . . . . . . . . . . . . . . . . 46
3.1.4 Frequency masking . . . . . . . . . . . . . . . . . . . . . . 48
3.1.5 Temporal masking . . . . . . . . . . . . . . . . . . . . . . 49
3.1.6 Putting it all together . . . . . . . . . . . . . . . . . . . . 50
3.2 Loudness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3 Timbre . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Onset Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4.1 Prior approaches . . . . . . . . . . . . . . . . . . . . . . . 54
3.4.2 Perceptually grounded approach . . . . . . . . . . . . . . 54
3.4.3 Tatum grid . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Beat and Tempo . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5.1 Comparative models . . . . . . . . . . . . . . . . . . . . . 58
3.5.2 Our approach . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.6 Pitch and Harmony . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7 Perceptual feature space . . . . . . . . . . . . . . . . . . . . . . . 61
4 Musical Structures 65
4.1 Multiple Similarities . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.1 Hierarchical representations . . . . . . . . . . . . . . . . . 66
4.2.2 Global timbre methods . . . . . . . . . . . . . . . . . . . 67
4.2.3 Rhythmic similarities . . . . . . . . . . . . . . . . . . . . 67
4.2.4 Self-similarities . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . 69
4.4 Sound Segment Similarity . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Beat Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.6 Pattern Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.6.1 Pattern length . . . . . . . . . . . . . . . . . . . . . . . . 74
4.6.2 Heuristic approach to downbeat detection . . . . . . . . . 76
4.6.3 Pattern-synchronous similarities . . . . . . . . . . . . . . 77
4.7 Larger Sections . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.8 Chapter Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 79
5 Learning Music Signals 81
5.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1.1 Supervised, unsupervised, and reinforcement learning . . 82
5.1.2 Generative vs. discriminative learning . . . . . . . . . . . 83
5.2 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2.1 Regression and classification . . . . . . . . . . . . . . . . . . . . . 84
5.2.2 State-space forecasting . . . . . . . . . . . . . . . . . . . . 84
5.2.3 Principal component analysis . . . . . . . . . . . . . . . . 84
5.2.4 Understanding musical structures . . . . . . . . . . . . . . 84
5.2.5 Learning and forecasting musical structures . . . . . . . . 86
5.2.6 Support Vector Machine . . . . . . . . . . . . . . . . . . . 86
5.3 Downbeat prediction . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.1 Downbeat training . . . . . . . . . . . . . . . . . . . . . . 88
5.3.2 The James Brown case . . . . . . . . . . . . . . . . . . . . 90
5.3.3 Inter-song generalization . . . . . . . . . . . . . . . . . . . 90
5.4 Time-Axis Redundancy Cancellation . . . . . . . . . . . . . . . . 92
5.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4.2 Nonhierarchical k-means clustering . . . . . . . . . . . . . 93
5.4.3 Agglomerative Hierarchical Clustering . . . . . . . . . . . 94
5.4.4 Compression . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6 Composing with sounds 99
6.1 Automated DJ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.1.1 Beat-matching . . . . . . . . . . . . . . . . . . . . . . . . 100
6.1.2 Time-scaling . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2 Early Synthesis Experiments . . . . . . . . . . . . . . . . . . . . 103
6.2.1 Scrambled Music . . . . . . . . . . . . . . . . . . . . . . . 103
6.2.2 Reversed Music . . . . . . . . . . . . . . . . . . . . . . . . 104
6.3 Music Restoration . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3.1 With previously known structure . . . . . . . . . . . . . . 106
6.3.2 With no prior knowledge . . . . . . . . . . . . . . . . . . 106
6.4 Music Textures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.5 Music Cross-Synthesis . . . . . . . . . . . . . . . . . . . . . . . . 111
6.6 Putting it all together . . . . . . . . . . . . . . . . . . . . . . . . 114
7 Conclusion 115
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
7.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.3.1 Scientific contributions . . . . . . . . . . . . . . . . . . . . 116
7.3.2 Engineering contributions . . . . . . . . . . . . . . . . . . 117
7.3.3 Artistic contributions . . . . . . . . . . . . . . . . . . . . 118
7.4 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.5 Final Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Appendix A Skeleton 121
A.1 Machine Listening . . . . . . . . . . . . . . . . . . . . . . . . . . 122
A.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
A.3 Music Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
A.4 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
A.5 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Bibliography 127
List of Figures
1-1 Example of paintings by computer program AARON . . . . . . . 24
1-2 Life cycle of the music making paradigm . . . . . . . . . . . . . . 27
2-1 Sound analysis/resynthesis paradigm . . . . . . . . . . . . . . . . 35
2-2 Music analysis/resynthesis paradigm . . . . . . . . . . . . . . . . 36
2-3 Machine listening, transformation, and concatenative synthesis . 36
2-4 Analysis framework . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2-5 Example of a song decomposition in a tree structure . . . . . . . 40
2-6 Multidimensional scaling perceptual space . . . . . . . . . . . . . 41
3-1 Anatomy of the ear . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3-2 Transfer function of the outer and middle ear . . . . . . . . . . . 46
3-3 Cochlea and scales . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3-4 Bark and ERB scales compared . . . . . . . . . . . . . . . . . . . 47
3-5 Frequency warping examples: noise and pure tone . . . . . . . . 48
3-6 Frequency masking example: two pure tones . . . . . . . . . . . . 49
3-7 Temporal masking schematic . . . . . . . . . . . . . . . . . . . . 50
3-8 Temporal masking examples: four sounds . . . . . . . . . . . . . 50
3-9 Perception of rhythm schematic . . . . . . . . . . . . . . . . . . . 51
3-10 Auditory spectrogram: noise, pure tone, sounds, and music . . . 52
3-11 Timbre and loudness representations on music . . . . . . . . . . . 53
3-12 Segmentation of a music example . . . . . . . . . . . . . . . . . . 55
3-13 Tatum tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3-14 Beat tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3-15 Chromagram schematic . . . . . . . . . . . . . . . . . . . . . . . 60
3-16 Chroma analysis example: four sounds . . . . . . . . . . . . . . . 61
3-17 Chromagram of a piano scale . . . . . . . . . . . . . . . . . . . . 62
3-18 Pitch-content analysis of a chord progression . . . . . . . . . . . 63
3-19 Musical metadata extraction . . . . . . . . . . . . . . . . . . . . 64
4-1 Similarities in the visual domain . . . . . . . . . . . . . . . . . . 66
4-2 3D representation of the hierarchical structure of timbre . . . . . 70
4-3 Dynamic time warping schematic . . . . . . . . . . . . . . . . . . 71
4-4 Weight function for timbre similarity of sound segments . . . . . 72
4-5 Chord progression score . . . . . . . . . . . . . . . . . . . . . . . 73
4-6 Timbre vs. pitch analysis . . . . . . . . . . . . . . . . . . . . . . 74
4-7 Hierarchical self-similarity matrices of timbre . . . . . . . . . . . 75
4-8 Pattern length analysis . . . . . . . . . . . . . . . . . . . . . . . . 76
4-9 Heuristic analysis of downbeat: simple example . . . . . . . . . . 78
4-10 Heuristic analysis of downbeat: real-world example . . . . . . . . 78
4-11 Pattern self-similarity matrices of rhythm and pitch . . . . . . . 79
5-1 PCA schematic . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5-2 Manifold examples: electronic, funk, jazz music . . . . . . . . . . 85
5-3 Rhythm prediction with CWM and SVM . . . . . . . . . . . . . 87
5-4 SVM classification schematic . . . . . . . . . . . . . . . . . . . . 87
5-5 Time-lag embedding example . . . . . . . . . . . . . . . . . . . . 89
5-6 PCA reduced time-lag space . . . . . . . . . . . . . . . . . . . . . 89
5-7 Supervised learning schematic . . . . . . . . . . . . . . . . . . . . 90
5-8 Intra-song downbeat prediction . . . . . . . . . . . . . . . . . . . 91
5-9 Causal downbeat prediction schematic . . . . . . . . . . . . . . . 91
5-10 Typical Maracatu rhythm score notation . . . . . . . . . . . . . . 91
5-11 Inter-song downbeat prediction . . . . . . . . . . . . . . . . . . . 92
5-12 Segment distribution demonstration . . . . . . . . . . . . . . . . 94
5-13 Dendrogram and musical path . . . . . . . . . . . . . . . . . . . . 95
5-14 Compression example . . . . . . . . . . . . . . . . . . . . . . . . 97
6-1 Time-scaling schematic . . . . . . . . . . . . . . . . . . . . . . . . 101
6-2 Beat matching example . . . . . . . . . . . . . . . . . . . . . . . 102
6-3 Beat matching schematic . . . . . . . . . . . . . . . . . . . . . . 102
6-4 Scrambled music source example . . . . . . . . . . . . . . . . . . 103
6-5 Scrambled music result . . . . . . . . . . . . . . . . . . . . . . . . 104
6-6 Reversed music result . . . . . . . . . . . . . . . . . . . . . . . . 104
6-7 Fragment-based image completion . . . . . . . . . . . . . . . . . 105
6-8 Restoring music schematic . . . . . . . . . . . . . . . . . . . . . . 106
6-9 Segment-based music completion example . . . . . . . . . . . . . 107
6-10 Video textures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6-11 Music texture schematic . . . . . . . . . . . . . . . . . . . . . . . 109
6-12 Music texture example (1600%) . . . . . . . . . . . . . . . . . . . 110
6-13 Cross-synthesis schematic . . . . . . . . . . . . . . . . . . . . . . 112
6-14 Photomosaic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6-15 Cross-synthesis example . . . . . . . . . . . . . . . . . . . . . . . 113
A-1 Skeleton software screenshot . . . . . . . . . . . . . . . . . . . . . 121
A-2 Skeleton software architecture . . . . . . . . . . . . . . . . . . . . 124
CHAPTER ONE
Introduction
The secret to creativity is knowing how to hide your sources.
Albert Einstein
Can computers be creative? The question drives an old philosophical debate
that goes back to Alan Turing's claim that a computational system can possess
all important elements of human thinking or understanding (1950). Creativity
is one of those things that makes humans special, and is a key issue for artificial
intelligence (AI) and cognitive sciences: if computers cannot be creative, then
1) they cannot be intelligent, and 2) people are not machines [35].
The standard argument against computers' ability to create is that they merely follow instructions. Lady Lovelace stated that they have "no pretensions to originate anything." Boden proposes a distinction between psychological creativity (P-creativity) and historical creativity (H-creativity) [14]. Something is P-creative if it is fundamentally novel for the individual, whereas it is H-creative if it is fundamentally novel with respect to the whole of human history. A work is therefore deemed H-creative only with respect to its context. There seems to be no evidence of whether there is a continuum between P-creativity and H-creativity.
Despite the lack of conceptual and theoretical consensus, there have been sev-
eral attempts at building creative machines. Harold Cohen's AARON [29] is a
painting program that produces both abstract and lifelike works (Figure 1-1).
The program is built upon a knowledge base full of information about the mor-
phology of people, and painting techniques. It plays randomly with thousands
of interrelated variables to create works of art. It is arguable that the creator
in this case is Cohen himself, since he provided the rules to the program, but even more so because AARON is not able to analyze its own work.
Figure 1-1: Example of paintings by Cohen's computer program AARON. In Cohen's own words: "I do not believe that AARON constitutes an existence proof of the power of machines to think, or to be creative, or to be self-aware, to display any of those attributes coined specifically to explain something about ourselves. It constitutes an existence proof of the power of machines to do some of the things we had assumed required thought, and which we still suppose would require thought, and creativity, and self-awareness, of a human being. If what AARON is making is not art, what is it exactly, and in what ways, other than its origin, does it differ from the real thing?"
Composing music is creating by putting sounds together. Although it is known
that humans compose, it turns out that only a few of them actually do it. Com-
position is still regarded as an elitist, almost mysterious ability that requires
years of training. And of those people who compose, one might wonder how
many of them really innovate. Not so many, if we believe Lester Young, who is
considered one of the most important tenor saxophonists of all time:
The trouble with most musicians today is that they are copycats. Of
course you have to start out playing like someone else. You have a model,
or a teacher, and you learn all that he can show you. But then you start
playing for yourself. Show them that you're an individual. And I can count those who are doing that today on the fingers of one hand.
If truly creative music is rare, then what can be said about the rest? Perhaps it is not fair to expect a computer program either to become the next Arnold Schönberg or not to be creative at all. In fact, if the machine brings into existence a piece of music by assembling sounds together, doesn't it compose music? We may argue that the programmer who dictates the rules and the constraint space is the composer, as in the case of AARON. The computer
remains an instrument, yet a sophisticated one.
The last century has been particularly rich in movements that consisted of
breaking the rules of previous music, from serialists like Schönberg and Stock-
hausen, to aleatorists like Cage. The realm of composition principles today
is so disputed and complex that it would not be practical to try to define a set of rules that fits them all. Perhaps a better strategy is a generic modeling tool that can accommodate specific rules from a corpus of examples. This is
the approach that, as modelers of musical intelligence, we wish to take. Our
goal, more specifically, is to build a machine that defines its own creative rules
by listening to and learning from musical examples.
Humans naturally acquire knowledge, and comprehend music from listening.
They automatically hear collections of auditory objects and recognize patterns.
With experience they can predict, classify, and make immediate judgments
about genre, style, beat, composer, performer, etc. In fact, every composer
was once ignorant, musically inept, and learned certain skills essentially from
listening and training. The act of composing music is an act of bringing personal
experiences, or influences, together. In the case of a computer program, that
personal experience is obviously quite nonexistent. However, it is reasonable to believe that the musical experience is the most essential, and it is already
accessible to machines in a digital form.
There is a fairly high degree of abstraction between the digital representation
of an audio file (WAV, AIFF, MP3, AAC, etc.) in the computer, and its mental representation in the human brain. Our task is to make that connection by modeling the way humans perceive, learn, and finally represent music. The
latter, a form of memory, is assumed to be the most critical ingredient in their
ability to compose new music. Now, if the machine is able to perceive music
much like humans, learn from the experience, and combine the knowledge into
creating new compositions, is the composer: 1) the programmer who conceives
the machine; 2) the user who provides the machine with examples; or 3) the
machine that makes music, influenced by these examples?
Such ambiguity is also found on the synthesis front. While composition (the cre-
ative act) and performance (the executive act) are traditionally distinguishable
notions (except with improvised music, where both occur simultaneously), with
new technologies the distinction can disappear, and the two notions merge.
With machines generating sounds, the composition, which is typically repre-
sented in a symbolic form (a score), can be executed instantly to become a
performance. It is common in electronic music for a computer program to synthesize music live, while the musician interacts with the parameters of the
synthesizer, by turning knobs, selecting rhythmic patterns, note sequences,
sounds, filters, etc. When the sounds are "stolen" (sampled) from already
existing music, the authorship question is also supplemented with an ownership
issue. Undoubtedly, the more technical sophistication is brought to computer
music tools, the more the musical artifact gets disconnected from its creative
source.
The work presented in this document is focused on composing new music automatically by recycling preexisting music. We are not concerned with
the question of transcription, or separating sources, and we prefer to work di-
rectly with rich and complex, polyphonic sounds. This sound collage procedure
has recently become popular, defining the term "mash-up" music: the practice
of making new music out of previously existing recordings. One of the most
popular composers of this genre is probably John Oswald, best known for his
project Plunderphonics [121].
The number of digital music titles available is currently estimated at about 10
million in the western world. This is a large quantity of material to recycle
and potentially to add back to the space in a new form. Nonetheless, the space
of all possible music is finite in its digital form. There are 12,039,300 16-bit audio samples at CD quality in a 4-minute and 33-second song¹, which account for 65,536^12,039,300 options. This is a large amount of music! However, due to limitations of our perception, only a tiny fraction of that space makes any sense to us. The large majority of it sounds essentially like random noise². From the space that makes any musical sense, an even smaller fraction of it is perceived as unique (just-noticeably different from others).
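As a worked check of these figures (a sketch assuming a single 16-bit channel at the CD sampling rate of 44,100 Hz):

4 min 33 s = 273 s
273 s × 44,100 samples/s = 12,039,300 samples
2^16 = 65,536 possible values per 16-bit sample
65,536^12,039,300 ≈ 10^(5.8 × 10^7) distinct 4-minute-33-second waveforms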
In a sense, by recycling both musical experience and sound material, we can
more intelligently search through this large musical space and find more efficiently some of what is left to be discovered. This thesis aims to computationally
model the process of creating music using experience from listening to exam-
ples. By recycling a database of existing songs, our model aims to compose
and perform new songs with similar characteristics, potentially expanding
the space to yet unexplored frontiers. Because it is purely based on the signal
content, the system is not able to make qualitative judgments of its own work,
but can listen to the results and analyze them in relation to others, as well
as recycle that new music again. This unbiased solution models the life cycle
of listening, composing, and performing, turning the machine into an active
musician, instead of simply an instrument (Figure 1-2).
In this work, we claim the following hypothesis:
Analysis and synthesis of musical audio can share a minimal data
representation of the signal, acquired through a uniform approach
based on perceptual listening and learning.
In other words, the analysis task, which consists of describing a music signal,
is equivalent to a structured compression task. Because we are dealing with a
perceived signal, the compression is perceptually grounded in order to give the
most compact and most meaningful description. Such a specific representation is
analogous to a music cognition modeling task. The same description is suitable
¹ In reference to Cage's silent piece: 4′33″.
² We are not referring to heavy metal!
[Figure 1-2 diagram: a cycle connecting Listening, Composing, and Performing.]
Figure 1-2: Life cycle of the music making paradigm.
for synthesis as well: by reciprocity of the process, and redeployment of the
data into the signal domain, we can resynthesize the music in the form of a
waveform. We say that the model is lossy as it removes information that is not
perceptually relevant, but optimal in terms of data reduction. Creating new
music is a matter of combining multiple representations before resynthesis.
The motivation behind this work is to personalize the music experience by seam-
lessly merging together listening, composing, and performing. Recorded music
is a relatively recent technology, which has already found a successor: syn-
thesized music, in a sense, will enable a more intimate listening experience by
potentially providing the listeners with precisely the music they want, whenever
they want it. Through this process, it is potentially possible for our metacomposer to turn listeners (who induce the music) into composers themselves. Music will flow and be live again. The machine will have the capability of monitoring and improving its prediction continually, and of working in communion with millions of other connected music fans.
The next chapter reviews some related works and introduces our framework.
Chapters 3 and 4 deal with the machine listening and structure description
aspects of the framework. Chapter 5 is concerned with machine learning, gen-
eralization, and clustering techniques. Finally, music synthesis is presented
through a series of applications including song alignment, music restoration,
cross-synthesis, song morphing, and the synthesis of new pieces. This research
was implemented within a stand-alone environment called Skeleton developed
by the author, as described in Appendix A. Interested readers may refer to
the supporting website of this thesis, and listen to the audio examples that are
analyzed or synthesized throughout the document:
http://www.media.mit.edu/tristan/phd/
CHAPTER TWO
Background
When there's something we think could be better, we must make an effort to try and make it better.
John Coltrane
Ever since Max Mathews made his first sound on a computer in 1957 at Bell
Telephone Laboratories, there has been an increasing appeal and effort for using
machines in the disciplines that involve music, whether composing, performing,
or listening. This thesis is an attempt at bringing together all three facets by
closing the loop that can make a musical system entirely autonomous (Figure
1-2). A complete survey of precedents in each field goes well beyond the scope
of this dissertation. This chapter only reviews some of the most related and
inspirational works for the goals of this thesis, and finally presents the framework
that ties the rest of the document together.
2.1 Symbolic Algorithmic Composition
Musical composition has historically been considered at a symbolic, or conven-
tional level, where score information (i.e., pitch, duration, dynamic material,
and instrument, as defined in the MIDI specifications) is the output of the
compositional act. The formalism of music, including the system of musical
sounds, intervals, and rhythms, goes as far back as the ancient Greeks, to Pythago-
ras, Ptolemy, and Plato, who thought of music as inseparable from numbers.
The automation of composing through formal instructions comes later with the
canonic composition of the 15th century, and leads to what is now referred to
as algorithmic composition. Although it is not restricted to computers¹, using
algorithmic programming methods as pioneered by Hiller and Isaacson in Il-
liac Suite (1957), or Xenakis in Atrées (1962), has opened the door to new vistas in the expansion of the computer's development as a unique instrument with significant potential [31].
Computer-generated, automated composition can be organized into three main
categories: stochastic methods, which use sequences of jointly distributed ran-
dom variables to control specific decisions (Aleatoric movement); rule-based
systems, which use a strict grammar and set of rules (Serialism movement);
and artificial intelligence approaches, which differ from rule-based approaches mostly by their capacity to define their own rules: in essence, to learn. The latter is the approach that is most significant to our work, as it aims at creating music through unbiased techniques, though with an intermediary MIDI represen-
tation.
Probably the most popular example is David Cope's system called Experiments in Musical Intelligence (EMI). EMI analyzes the score structure of a MIDI sequence in terms of recurring patterns (a "signature"), creates a database
of the meaningful segments, and learns the style of a composer, given a certain
number of pieces [32]. His system can generate compositions with surprising
stylistic similarities to the originals. It is, however, unclear how automated the
whole process really is, and whether the system is able to extrapolate from what it
learns.
A more recent system by François Pachet, named "Continuator" [122], is capable of learning live the improvisation style of a musician who plays on a
polyphonic MIDI instrument. The machine can continue the improvisation
on the fly, and performs autonomously, or under user guidance, yet in the style
of its teacher. A particular parameter controls the "closeness" of the generated
music, and allows for challenging interactions with the human performer.
2.2 Hybrid MIDI-Audio Instruments
George Lewis, trombone improviser and composer, is a pioneer in building
computer programs that create music by interacting with a live performer
through acoustics. The so-called "Voyager" software listens via a microphone
to his trombone improvisation, and comes to quick conclusions about what was
played. It generates a complex response that attempts to make appropriate
decisions about melody, harmony, orchestration, ornamentation, rhythm, and
silence [103]. In Lewis's own words, the idea is to get the machine to pay atten-
tion to the performer as it composes. As the performer engages in a dialogue,
¹ Automated techniques (e.g., through randomness) have been used, for example, by Mozart in Dice Music, by Cage in Reunion, and by Messiaen in Mode de valeurs et d'intensités.
the machine may also demonstrate an independent behavior that arises from
its own internal processes.
The so-called Hyperviolin developed at MIT [108] uses multichannel audio
input and perceptually-driven processes (i.e., pitch, loudness, brightness, noisi-
ness, timbre), as well as gestural data input (bow position, speed, acceleration,
angle, pressure, height). The relevant but high dimensional data stream un-
fortunately comes together with the complex issue of mapping that data to
meaningful synthesis parameters. Its latest iteration, however, features an un-
biased and unsupervised learning strategy for mapping timbre to intuitive and
perceptual control input (section 2.3).
The piece Sparkler (composed by Tod Machover) exploits similar techniques,
but for a symphony orchestra [82]. Unlike many previous works where only solo
instruments are considered, in this piece a few microphones capture the entire
orchestral sound, which is analyzed into perceptual data streams expressing
variations in dynamics, spatialization, and timbre. These instrumental sound
masses, performed with a certain freedom by players and conductor, drive a
MIDI-based generative algorithm developed by the author. It interprets and
synthesizes complex electronic textures, sometimes blending, and sometimes
contrasting with the acoustic input, turning the ensemble into a kind of Hy-
perorchestra.
These audio-driven systems employ rule-based generative principles for synthe-
sizing music [173][139]. Yet, they differ greatly from score-following strategies
in their creative approach, as they do not rely on aligning pre-composed mate-
rial to an input. Instead, the computer program is the score, since it describes
everything about the musical output, including notes and sounds to play. In
such a case, the created music is the result of a compositional act by the pro-
grammer. Pushing even further, Lewis contends that:
[...] notions about the nature and function of music are embedded in
the structure of software-based music systems, and interactions with these
systems tend to reveal characteristics of the community of thought and
culture that produced them. Thus, Voyager is considered as a kind of
computer music-making embodying African-American aesthetics and mu-
sical practices. [103]
2.3 Audio Models
Analyzing the musical content in audio signals rather than symbolic signals is
an attractive idea that requires some sort of perceptual models of listening.
Perhaps an even more difficult problem is being able to synthesize meaningful audio signals without intermediary MIDI notation. Most works, often driven by a particular synthesis technique, do not really make a distinction between sound and music.
CNMAT's Additive Synthesis Tools (CAST) are flexible and generic real-time
analysis/resynthesis routines based on sinusoidal decomposition, Sound De-
scription Interchange Format (SDIF) content description format [174], and
Open Sound Control (OSC) communication protocol [175]. The system can
analyze, modify, and resynthesize a live acoustic instrument or voice, encour-
aging a dialogue with the acoustic performer. Nonetheless, the synthesized
music is controlled by the electronic performer who manipulates the inter-
face. As a result, performers remain in charge of the music, while the software
generates the sound.
The Spectral Modeling Synthesis (SMS) technique initiated in 1989 by Xavier
Serra is a powerful platform for the analysis and resynthesis of monophonic and
polyphonic audio [149][150]. Through decomposition into its deterministic and
stochastic components, the software enables several applications, including time
scaling, pitch shifting, compression, content analysis, sound source separation,
instrument modeling, and timbre morphing.
The Perceptual Synthesis Engine (PSE), developed by the author, is an ex-
tension of SMS for monophonic sounds [79][83]. It first decomposes the au-
dio recording into a set of streaming signal coefficients (frequencies and am-
plitudes of sinusoidal functions) and their corresponding perceptual correlates
(instantaneous pitch, loudness, and brightness). It then learns the relationship
between the two data sets: the high-dimensional signal description, and the
low-dimensional perceptual equivalent. The resulting timbre model allows for
greater control over the sound than previous methods by removing the time
dependency from the original le
2
. The learning is based on a mixture of Gaus-
sians with local linear models and converges to a unique solution through the
Expectation-Maximization (EM) algorithm. The outcome is a highly compact
and unbiased synthesizer that enables the same applications as SMS, with in-
tuitive control and no time-structure limitation. The system runs in real time
and is driven by audio, such as the acoustic or electric signal of traditional in-
struments. The work presented in this dissertation is, in a sense, a step towards
extending this monophonic timbre model to polyphonic structured music.
Methods based on data-driven concatenative synthesis typically discard the
notion of analytical transcription, but instead, they aim at generating a musical
surface (i.e., what is perceived) through a set of compact audio descriptors, and
the concatenation of sound samples. The task consists of searching through a
sound database for the most relevant segments, and of sequencing the small
units granularly, so as to best match the overall target data stream. The method
was first developed as part of a text-to-speech (TTS) system, which exploits
large databases of speech phonemes in order to reconstruct entire sentences [73].
² The "re" prefix in "resynthesis".
Schwarz's "Caterpillar" system [147] aims at synthesizing monophonic musi-
cal phrases via the concatenation of instrumental sounds characterized through
a bank of descriptors, including: unit descriptors (location, duration, type);
source descriptors (sound source, style); score descriptors (MIDI note, polypho-
ny, lyrics); signal descriptors (energy, fundamental frequency, zero crossing rate,
cutoff frequency); perceptual descriptors (loudness, sharpness, timbral width);
spectral descriptors (centroid, tilt, spread, dissymmetry); and harmonic descrip-
tors (energy ratio, parity, tristimulus, deviation). The appropriate segments are
selected through constraint-solving techniques, and aligned into a continuous
audio stream.
Zils and Pachet's "Musical Mosaicing" [181] aims at generating music from arbitrary audio segments. A first application uses a probabilistic generative algorithm to compose the music, and an overlap-add technique for synthesizing the sound. An overall measure of concatenation quality and a constraint-solving strategy for sample selection ensures a certain continuity in the stream of audio
descriptors. A second application uses a target song as the overall set of con-
straints instead. In this case, the goal is to replicate an existing audio surface
through granular concatenation, hopefully preserving the underlying musical
structures (section 6.5).
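The unit-selection idea shared by these systems can be sketched in a few lines. The following is an illustration of the general principle only, not a reconstruction of Caterpillar or Musical Mosaicing; the descriptor choices and names are hypothetical.

```python
import numpy as np

def select_units(target_descriptors, database_descriptors):
    """For each target descriptor vector, return the index of the nearest
    database segment (Euclidean distance). Real systems add concatenation
    costs and constraint solving on top of this purely local match."""
    choices = []
    for target in target_descriptors:
        distances = np.linalg.norm(database_descriptors - target, axis=1)
        choices.append(int(np.argmin(distances)))
    return choices

# Hypothetical descriptors, e.g., [loudness, brightness, noisiness] per segment.
database = np.random.rand(1000, 3)    # pre-analyzed segment database
target = np.random.rand(120, 3)       # descriptor stream of a target song
sequence = select_units(target, database)
# 'sequence' lists which database segments to concatenate, in order.
```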
Lazier and Cook's "MoSievius" system [98] takes up the same idea, and allows
for real-time interactive control over the mosaicing technique by fast sound
sieving: a process of isolating sub-spaces as inspired by [161]. The user can
choose input and output signal specications in real time in order to generate
an interactive audio mosaic. Fast time-stretching, pitch shifting, and k-nearest
neighbor search is provided. An (optionally pitch-synchronous) overlap-add
technique is used for synthesis. Few or no audio examples of Schwarz's, Zils's, and Lazier's systems are available.
2.4 Music information retrieval
The current proliferation of compressed digital formats, peer-2-peer networks,
and online music services is transforming the way we handle music, and increases
the need for automatic management technologies. Music Information Retrieval
(MIR) is looking into describing the bits of the digital music in ways that
facilitate searching through this abundant world without structure. The signal
is typically tagged with additional information called metadata (data about the
data). This is the endeavor of the MPEG-7 file format, whose goal is to
enable content-based search and novel applications. Still, no commercial use
of the format has yet been proposed. In the following paragraphs, we briefly
describe some of the most popular MIR topics.
Fingerprinting aims at describing the audio surface of a song with a compact
representation metrically distant from other songs, i.e., a musical signa-
ture. The technology enables, for example, cell-phone carriers or copy-
right management services to automatically identify audio by comparing
unique fingerprints extracted live from the audio with fingerprints in a
specially compiled music database running on a central server [23].
Query by description consists of querying a large MIDI or audio database
by providing qualitative text descriptors of the music, or by humming
the tune of a song into a microphone (query by humming). The system
typically compares the entry with a pre-analyzed database metric, and
usually ranks the results by similarity [171][54][26].
Music similarity is an attempt at estimating the closeness of music signals.
There are many criteria with which we may estimate similarities, in-
cluding editorial (title, artist, country), cultural (genre, subjective qualifiers), symbolic (melody, harmony, structure), perceptual (energy, texture,
beat), and even cognitive (experience, reference) [167][6][69][9].
Classification tasks integrate similarity technologies as a way to cluster music
into a nite set of classes, including genre, artist, rhythm, instrument,
etc. [105][163][47]. Similarity and classication applications often face
the primary question of defining a ground truth to be taken as actual
facts for evaluating the results without error.
Thumbnailing consists of building the most representative audio summary
of a piece of music, for instance by removing the most redundant and
least salient sections from it. The task is to detect the boundaries and
similarities of large musical structures, such as verses and choruses, and
finally assemble them together [132][59][27].
The Music Browser, developed by Sony CSL, IRCAM, UPF, Fraunhofer, and
others, as part of a European effort (Cuidado, Semantic Hi-Fi), is the first
entirely automatic chain for extracting and exploiting musical metadata for
browsing music [124]. It incorporates several techniques for music description
and data mining, and allows for a variety of queries based on editorial (i.e.,
entered manually by an editor) or acoustic metadata (i.e., the sound of the
sound), as well as providing browsing tools and sharing capabilities among
users.
Although this thesis deals exclusively with the extraction and use of acous-
tic metadata, music as a whole cannot be solely characterized by its objec-
tive content. Music, as experienced by listeners, carries much subjective
value that evolves in time through communities. Cultural metadata attached
to music can be extracted online in a textual form through web crawling and
natural-language processing [125][170]. Only a combination of these different
types of metadata (i.e., acoustic, cultural, editorial) can lead to viable music
management and retrieval systems [11][123][169].
2.5 Framework
Much work has already been done under the general paradigm of analysis/resyn-
thesis. As depicted in Figure 2-1, the idea is rst to break down a sound
into some essential, quantiable components, e.g., amplitude and phase partial
coecients. These are usually altered in some way for applications including
time stretching, pitch shifting, timbre morphing, or compression. Finally, the
transformed parameters are reassembled into a new sound through a synthesis
procedure, e.g., additive synthesis. The phase vocoder [40] is an obvious example
of this procedure where the new sound is directly an artifact of the old sound
via some describable transformation.
[Figure 2-1 diagram: sound → analysis → transformation → synthesis → new sound.]
Figure 2-1: Sound analysis/resynthesis paradigm.
2.5.1 Music analysis/resynthesis
The mechanism applies well to the modification of audio signals in general, but is generally blind³ regarding the embedded musical content. We introduce
an extension of the sound analysis/resynthesis principle for music (Figure 2-
2). Our music-aware analysis/resynthesis approach readily enables higher-level transformations independently of the sound content, including beat matching, music morphing, music cross-synthesis, and music similarities.
The analysis framework characterizing this thesis work is the driving force of
the synthesis focus of section 6, and it can be summarized concisely by the
following quote:
Everything should be made as simple as possible but not simpler.
Albert Einstein
³ Perhaps we should say "deaf"...
[Figure 2-2 diagram: sound analysis → music analysis → transformation → music synthesis → sound synthesis → new sound.]
Figure 2-2: Music analysis/resynthesis procedure, including sound analysis into
music features, music analysis, transformation, music synthesis, and finally back into
sound through sound synthesis.
We seek to simplify the information of interest to its minimal form. Depending
on the application, we can choose to approximate or discard some of that infor-
mation, consequently degrading our ability to resynthesize. Reconstructing the
original signal as it reaches the ear is a signal modeling problem. If the source is
available though, the task consists of labeling the audio as it is being perceived:
a perception modeling problem. Optimizing the amount of information required
to describe a signal of given complexity is the endeavor of information theory
[34]: here suitably perceptual information theory.
Our current implementation uses concatenative synthesis for resynthesizing rich
sounds without having to deal with signal modeling (Figure 2-3). Given the
inherent granularity of concatenative synthesis, we safely reduce the description
further, resulting in our final acoustic metadata, or music characterization.
[Figure 2-3 diagram: music listening (sound analysis and music analysis) → transformation → concatenative synthesis → new sound.]
Figure 2-3: In our music analysis/resynthesis implementation, the whole synthesis
stage is a simple concatenative module. The analysis module is referred to as music
listening (section 3).
We extend the traditional music listening scheme as described in [142] with a learning component. Indeed, listening cannot be dissociated from learning.
Certain problems such as, for example, downbeat prediction, cannot be fully
solved without this part of the framework (section 5.3).
Understanding the mechanisms of the brain, in particular the auditory path,
is the ideal basis for building perceptual models of music cognition. Although
particular models show great promise [24][25], it is still a great challenge to
make these models work in real-world applications today. However, we can
attempt to mimic some of the most basic functionalities of music perception,
and build a virtual listener that will process, interpret, and describe music
signals much as humans do; that is, primarily, from the ground-up.
The model depicted below, inspired by some empirical research on human listen-
ing and learning, may be considered the first practical attempt at implement-
ing a music cognition machine. Although we implement most of the music
listening through deterministic signal processing algorithms, we believe that
the whole process may eventually be solved via statistical learning approaches
[151]. But, since our goal is to make music, we favor practicality over a truly
uniform approach.
2.5.2 Description
We propose a four-building-block diagram, where each block represents a sig-
nal reduction stage of another. Information flows from left to right between each stage and always corresponds to a simpler, more abstract, and slower-rate signal (Figure 2-4). Each of these four successive stages (hearing, feature extraction, short-term memory, and long-term memory) embodies a different concept, respectively: filtering, symbolic representation, time dependency, and
storage. The knowledge is re-injected to some degree through all stages via a
top-down feedback loop.
music cognition
The first three blocks roughly represent what is often referred to as listening,
whereas the last three blocks represent what is often referred to as learning. The
interaction between music listening and music learning (the overlapping area of
our framework schematic) is what we call music cognition, where most of the
interesting musical phenomena occur. Obviously, the boundaries of music
cognition are not very well defined and the term should be used with great
caution. Note that there is more to music cognition than the signal path itself.
Additional external influences may act upon the music cognition experience,
including vision, culture, emotions, etc., but these are not represented here.
[Figure 2-4 diagram: four blocks (Hearing; Feature extraction; Short-term memory; Long-term memory) labeled respectively with signal processing/psychoacoustics, segmentation/audio descriptors, pattern recognition/online learning, and clustering/classification; the first three form Listening, the last three Learning, and their overlap Cognition, with attention, prediction, and expectation feeding back.]
Figure 2-4: Our music signal analysis framework. The data flows from left to right and is reduced in each stage. The first stage is essentially an auditory filter whose output data describes the audio surface. The second stage analyzes that audio surface in terms of perceptual features, which are represented in the form of a symbolic "musical DNA" stream. The third stage analyzes the time component of the streaming data, extracting redundancies and patterns, and enabling prediction-informed decision making. Finally, the last stage stores and compares macrostructures. The first three stages represent listening; the last three stages represent learning. The overlapping area may represent musical cognition. All stages feed back to each other, allowing, for example, memories to alter our listening perception.
hearing
This is a filtering stage, where the output signal only carries what we hear.
The ear being physiologically limited, only a portion of the original signal is
actually heard (in terms of coding, this represents less than 10% of the incoming
signal). The resulting signal is presented in the form of an auditory spectrogram,
where what appears in the time-frequency display corresponds strictly to what
is being heard in the audio. This is where we implement psychoacoustics as in
[183][116][17]. The analysis period here is on the order of 10 ms.
feature extraction
This second stage converts the auditory signal into a symbolic representation.
The output is a stream of symbols describing the music (a sort of musical-
DNA sequence). This is where we could implement sound source separation.
Here we may extract perceptual features (more generally audio descriptors) or
we describe the signal in the form of a musical surface. In all cases, the output
of this stage is a much more compact characterization of the musical content.
The analysis period is on the order of 100 ms.
short-term memory
The streaming music DNA is analyzed in the time-domain during this third
stage. The goal here is to detect patterns and redundant information that
may lead to certain expectations, and to enable prediction. Algorithms with
a built-in temporal component, such as symbolic learning, pattern matching,
dynamic programming or hidden Markov models are especially applicable here
[137][89][48]. The analysis period is on the order of 1 sec.
long-term memory
Finally, this last stage clusters the macro information and classifies the analysis results for long-term learning, i.e., storage memory. All clustering techniques may apply here, as well as regression and classification algorithms, including mixtures of Gaussians, artificial neural networks, or support vector machines [42][22][76]. The analysis period is on the order of several seconds or more.
feedback
For completeness, all stages must feed back to each other. Indeed, our prior knowledge of the world (memories and previous experiences) alters our listening experience and general musical perception. Similarly, our short-term memory (pattern recognition, beat) drives our future prediction, and finally these may direct our attention (section 5.2).
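The block diagram can also be read as a simple data-flow program. The sketch below is purely illustrative (hypothetical function names, placeholder bodies, not the Skeleton implementation of Appendix A); it shows the four reduction stages and the top-down feedback path carried by a shared context.

```python
# Illustrative sketch only: the four reduction stages of Figure 2-4, with
# long-term knowledge passed back down as a shared context to stand in for
# the top-down feedback loop.

def hearing(audio_block, context):
    """~10 ms rate: psychoacoustic filtering into an auditory spectrogram frame."""
    return audio_block                      # placeholder

def feature_extraction(auditory_frame, context):
    """~100 ms rate: symbolic audio descriptors (loudness, timbre, pitch, ...)."""
    return auditory_frame                   # placeholder

def short_term_memory(descriptors, context):
    """~1 s rate: patterns, beat, and predictions from the recent past."""
    return descriptors                      # placeholder

def long_term_memory(patterns, context):
    """Several seconds and up: clustering, classification, and storage."""
    context["knowledge"] = patterns         # placeholder
    return context

def listen(audio_blocks):
    context = {"knowledge": None}           # carries the top-down feedback state
    for block in audio_blocks:
        frame = hearing(block, context)
        descriptors = feature_extraction(frame, context)
        patterns = short_term_memory(descriptors, context)
        context = long_term_memory(patterns, context)
    return context
```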
2.5.3 Hierarchical description
Interestingly, this framework applies nicely to the metrical analysis of a piece
of music. By analogy, we can describe the music locally by its instantaneous
sound, and go up in the hierarchy through metrical grouping of sound segments,
beats, patterns, and finally larger sections. Note that the length of these audio
fragments coincides roughly with the analysis period of our framework (Figure
2-5).
Structural hierarchies [112], which have been studied in the frequency domain
(relationship between notes, chords, or keys) and the time domain (beat, rhyth-
mic grouping, patterns, macrostructures), reveal the intricate complexity and
interrelationship between the various components that make up music. Deliège
[36] showed that listeners tend to prefer grouping rules based on attack and
timbre over other rules (i.e., melodic and temporal). Lerdahl [101] stated that
music structures could not only be derived from pitch and rhythm hierarchies,
but also from timbre hierarchies. In auditory scene analysis, by which humans
build mental descriptions of complex auditory environments, abrupt events rep-
resent important sound source-separation cues [19][20]. We choose to rst detect
sound events and segment the audio in order to facilitate its analysis, and refine
[Figure 2-5 diagram: a song decomposed into sections (4-60 s), measures (1-5 s), beats (0.25-1 s), segments (60-300 ms), and frames (3-20 ms).]
Figure 2-5: Example of a song decomposition in a tree structure.
the description of music. This is going to be the recurrent theme throughout
this document.
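The tree of Figure 2-5 maps directly onto a handful of nested types. The sketch below is illustrative only (hypothetical class names); the nesting and the durations in the comments follow the figure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:                      # 60-300 ms: one sound event
    frames: List[list] = field(default_factory=list)   # analysis frames, 3-20 ms each

@dataclass
class Beat:                         # 0.25-1 s
    segments: List[Segment] = field(default_factory=list)

@dataclass
class Measure:                      # 1-5 s: one pattern
    beats: List[Beat] = field(default_factory=list)

@dataclass
class Section:                      # 4-60 s: verse, chorus, ...
    measures: List[Measure] = field(default_factory=list)

@dataclass
class Song:
    sections: List[Section] = field(default_factory=list)
```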
2.5.4 Meaningful sound space
Multidimensional scaling (MDS) is a set of data analysis techniques that display
the structure of distance-like data as a geometrical picture, typically into a low
dimensional space. For example, it was shown that instruments of the orchestra
could be organized in a timbral space of three main dimensions [66][168] loosely
correlated to brightness, the bite of the attack, and the spectral energy dis-
tribution. Our goal is to extend this principle to all possible sounds, including
polyphonic and multitimbral.
Sound segments extracted from a piece of music may be represented by data
points scattered around a multidimensional space. The music structure appears
as a path in the space (Figure 2-6). Consequently, musical patterns materialize
literally into geometrical loops. The concept is simple, but the outcome may
turn out to be powerful if one describes a complete music catalogue within that
common space. Indeed, the boundaries of the space and the dynamics within
it determine the extent of knowledge the computer may acquire: in a sense, its
influences. Our goal is to learn what that space looks like, and find meaningful
ways to navigate through it.
Obviously, the main difficulty is to define the similarity of sounds in the first
place. This is developed in section 4.4. We also extend the MDS idea to other
scales of analysis, i.e., beat, pattern, section, and song. We propose a three-way
hierarchical description, in terms of pitch, timbre, and rhythm. This is the main
Figure 2-6: Geometrical representation of a song in a perceptual space (only 3 dimensions are represented). The sound segments are described as data points in the space, while the music structure is a path through the space. Patterns naturally correspond to geometrical loops. The perceptual threshold here is an indication of the ear's resolution: within its limits, sound segments sound similar.
topic of chapter 4. Depending on the application, one may project the data onto
one of these musical dimensions, or combine them by their significance.
2.5.5 Personalized music synthesis
The ultimate goal of this thesis is to characterize a musical space given a large
corpus of music titles by only considering the acoustic signal, and to propose
novel active listening strategies through the automatic generation of new mu-
sic, i.e., a way to navigate freely through the space. For the listener, the system
is the creator of personalized music that is derived from his/her own song
library. For the machine, the system is a synthesis algorithm that manipulates
and combines similarity metrics of a highly constrained sound space.
By only providing machine listening (chapter 3) and machine learning (chapter
5) primitive technologies, it is our goal to build a bias-free system that learns the
structure of particular music by only listening to song examples. By considering
the structural content of music, our framework enables novel transformations,
or music processing, which goes beyond traditional audio processing.
So far, music-listening systems have been implemented essentially to query mu-
sic information [123]. Much can be done on the generative side of music man-
agement through acoustic analysis. In a way, our framework elevates audio
signals to the rank of music signals, leveraging music cognition, and enabling
various applications as described in chapter 6.
CHAPTER THREE
Music Listening
My first relationship to any kind of musical situation is
as a listener.
Pat Metheny
Music listening [68] is concerned with the understanding of how humans perceive
music. As modelers, our goal is to implement algorithms known as machine
listening, capable of mimicking this process. There are three major machine-
listening approaches: the physiological approach, which attempts to model the
neurophysical mechanisms of the hearing system; the psychoacoustic approach,
which is rather interested in modeling the effect of the physiology on perception; and the statistical approach, which models mathematically the reaction of a sound input to specific outputs. For practical reasons, this chapter presents a psychoacoustic
approach to music listening.
3.0.6 Anatomy
The hearing system is physiologically limited. The torso, head, and outer ear filter the sound field (mostly below 1500 Hz) through shadowing and reflection. The outer ear canal is about 2 cm long, which corresponds to a quarter of the wavelength of frequencies near 4000 Hz, and emphasizes the ear sensitivity to those frequencies. The middle ear is a transducer that converts oscillations in the air into oscillations in the inner ear, which contains fluids. To avoid large losses of energy through reflection, impedance matching is achieved by a mechanical lever system (eardrum, malleus, incus, stapes, and oval window, as in Figure 3-1) that reaches an almost perfect match around 1000 Hz.
Along the basilar membrane, there are roughly 3000 inner hair cells arranged in a regular geometric pattern. Their vibration causes ionic flows that lead to the firing of short-duration electrical pulses (the "language" of the brain) in the nerve fibers connected to them. The entire flow of information runs from the inner ear through approximately 30,000 afferent nerve fibers to reach the midbrain, thalamus, and finally the temporal lobe of the cerebral cortex, where it is finally perceived as sound. The nature of the central auditory processing is, however, still very much unclear, which mainly motivates the following psychophysical approach [183].
Figure 3-1: Anatomy of the ear (pinna, external auditory canal, eardrum, malleus, incus, stapes, oval window, Eustachian tube, semicircular canals, cochlea, and vestibular cochlear nerve). The middle ear is essentially a transducer that converts air oscillations in the outer ear (on the left) into fluid oscillations in the inner ear (on the right). It is depicted in greater detail in the bottom drawing. The vestibular cochlear nerves connect the cochlea with the auditory processing system of the brain. Image from [44].
3.0.7 Psychoacoustics
Psychoacoustics is the study of the subjective human perception of sounds. It
connects the physical world of sound vibrations in the air to the perceptual
world of things we actually hear when we listen to sounds. It is not directly
concerned with the physiology of the hearing system as discussed earlier, but
rather with its effect on listening perception. This is found to be the most practical and robust approach for application-driven work. This chapter is
about modeling our perception of music through psychoacoustics. Our model
is causal, meaning that it does not require knowledge about the future, and can
be implemented both in real time, and faster than real time. A good review of
reasons that motivate and inspire this approach can also be found in [142].
Let us begin with a monophonic audio signal of arbitrary length and sound
quality. Since we are only concerned with the human appreciation of music,
the signal may have been previously compressed, filtered, or resampled. The
music can be of any kind: we have tested our system with excerpts taken from
jazz, classical, funk, electronic, rock, pop, folk and traditional music, as well as
speech, environmental sounds, and drum loops.
3.1 Auditory Spectrogram
The goal of our auditory spectrogram is to convert the time-domain waveform
into a reduced, yet perceptually meaningful, time-frequency representation. We
seek to remove the information that is the least critical to our hearing sensation
while retaining the most important parts, therefore reducing signal complexity
without perceptual loss. The MPEG1 audio layer 3 (MP3) codec [18] is a good
example of an application that exploits this principle for compression purposes.
Our primary interest here is understanding our perception of the signal rather
than resynthesizing it; therefore, the reduction process is sometimes simplified,
but also extended and fully parametric in comparison with usual perceptual
audio coders.
3.1.1 Spectral representation
First, we apply a standard Short-Time Fourier Transform (STFT) to obtain a
standard spectrogram. We experimented with many window types and sizes, which did not have a significant impact on the final results. However, since we are mostly concerned with timing accuracy, we favor short windows (e.g., a 12-ms Hanning), which we compute every 3-6 ms (i.e., every 128-256 samples at 44.1 KHz). The Fast Fourier Transform (FFT) is zero-padded up to 46 ms to gain
additional interpolated frequency bins. We calculate the power spectrum and
scale its amplitude axis to decibels (dB SPL, a measure of sound pressure level)
as in the following equation:
I_i(\mathrm{dB}) = 20 \log_{10}\left(\frac{I_i}{I_0}\right) \qquad (3.1)
where i > 0 is an index of the power-spectrum bin of intensity I_i, and I_0 is an arbitrary threshold-of-hearing intensity. For a reasonable tradeoff between dynamic range and resolution, we choose I_0 = 60, and we clip sound pressure levels below -60 dB. The threshold of hearing is in fact frequency-dependent and is a consequence of the outer- and middle-ear response.
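For illustration, a minimal numpy sketch of this front end follows (the helper name and defaults are assumptions; the window, hop, and clipping values are those given in the text and equation 3.1):

    import numpy as np

    def power_spectrogram_db(x, sr=44100, win_ms=12, hop=128, nfft=2048, i0=60.0):
        """Power spectrum in dB per equation 3.1: 12-ms Hanning windows every
        128 samples (~3 ms), zero-padded FFT, levels clipped below -60 dB."""
        win_len = int(sr * win_ms / 1000)
        window = np.hanning(win_len)
        frames = []
        for start in range(0, len(x) - win_len, hop):
            frame = x[start:start + win_len] * window
            power = np.abs(np.fft.rfft(frame, n=nfft)) ** 2
            db = 20.0 * np.log10(np.maximum(power, 1e-12) / i0)
            frames.append(np.maximum(db, -60.0))
        return np.array(frames).T        # shape: (frequency bins, time frames)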
3.1.2 Outer and middle ear
As described earlier, the outer and middle ear physiologically have a strong effect on the overall frequency response of the ear. A transfer function was proposed by Terhardt in [160], and is defined in decibels as follows:
A_{\mathrm{dB}}(f_{\mathrm{KHz}}) = -3.64 f^{-0.8} + 6.5 \exp\left(-0.6 (f - 3.3)^2\right) - 10^{-3} f^4 \qquad (3.2)
As depicted in Figure 3-2, the function is mainly characterized by an attenuation in the lower and higher registers of the spectrum, and an emphasis around 2-5 KHz, interestingly where much of the speech information is carried [136].
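Equation 3.2 translates directly into code; the following sketch (hypothetical helper name, frequencies given in KHz) can be added, in dB, to each bin of the spectrogram:

    import numpy as np

    def outer_middle_ear_db(f_khz):
        """Terhardt outer/middle-ear transfer function of equation 3.2 (f > 0, in KHz)."""
        f = np.asarray(f_khz, dtype=float)
        return -3.64 * f ** -0.8 + 6.5 * np.exp(-0.6 * (f - 3.3) ** 2) - 1e-3 * f ** 4

    # e.g., weighting the bins of a zero-padded 2048-point spectrum at 44.1 KHz:
    # f = np.fft.rfftfreq(2048, d=1/44100) / 1000.0
    # spec_db = spec_db + outer_middle_ear_db(np.maximum(f, 1e-3))[:, None]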
Figure 3-2: Transfer function of the outer and middle ear in decibels, as a function of logarithmic frequency. Note the ear sensitivity between 2 and 5 KHz.
3.1.3 Frequency warping
The inner ear (cochlea) is shaped like a 32 mm long snail and is filled with two different fluids separated by the basilar membrane. The oscillation of the oval window takes the form of a traveling wave which moves along the basilar membrane. The mechanical properties of the cochlea (wide and stiff at the base, narrower and much less stiff at the tip) act as a cochlear filterbank: a roughly logarithmic decrease in bandwidth (i.e., constant-Q on a logarithmic scale) as we move linearly away from the cochlear opening (the oval window).
The difference in frequency between two pure tones by which the sensation of roughness disappears and the tones sound smooth is known as the critical band. It was found that at low frequencies, critical bands show an almost constant width of about 100 Hz, while at frequencies above 500 Hz, their bandwidth is about 20% of the center frequency. A Bark unit was defined and led to the
so-called critical-band rate scale. The spectrum frequency f is warped to the
Bark scale z(f) as in equation (3.3) [183]. An Equivalent Rectangular Band-
Figure 3-3: Different scales (length along the basilar membrane in mm, just-audible pitch steps, ratio pitch in mel, critical-band rate in Bark, and frequency in KHz) shown in relation to the unwound cochlea. Mel in particular is a logarithmic scale of frequency based on human pitch perception. Note that all of them are on a linear scale except for frequency. The tip is shown on the left and the base on the right.
width (ERB) scale was later introduced by Moore and is shown in comparison with the Bark scale in Figure 3-4 [116].
z(f) = 13 \arctan(0.00076 f) + 3.5 \arctan\left((f/7500)^2\right) \qquad (3.3)
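Equation 3.3 and the grouping of linear-frequency bins into 25 critical bands might be sketched as follows (pooling each band by its maximum level is an assumption, not a detail given in the text):

    import numpy as np

    def hz_to_bark(f_hz):
        """Critical-band rate z(f) of equation 3.3, in Bark."""
        f = np.asarray(f_hz, dtype=float)
        return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

    def warp_to_bark(spec_db, freqs_hz, n_bands=25):
        """Pool a linear-frequency dB spectrogram (bins x frames) into Bark bands."""
        band_of_bin = np.clip(np.floor(hz_to_bark(freqs_hz)).astype(int), 0, n_bands - 1)
        warped = np.full((n_bands, spec_db.shape[1]), -60.0)
        for b in range(n_bands):
            rows = band_of_bin == b
            if rows.any():
                warped[b] = spec_db[rows].max(axis=0)
            # empty bands keep the -60 dB floor
        return warped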
Figure 3-4: Bark critical bandwidth and ERB as a function of frequency. The rule-of-thumb Bark-scale approximation Max(100, f/5) is also plotted (figure adapted from [153]).
The effect of warping the power spectrum to the Bark scale is shown in Figure 3-5 for white noise, and for a pure tone sweeping linearly from 20 Hz to 20 KHz. Note the non-linear auditory distortion of the frequency (vertical) axis.
Figure 3-5: Frequency warping onto a Bark scale for [top] white noise; [bottom] a pure tone sweeping linearly from 20 Hz to 20 KHz.
3.1.4 Frequency masking
Simultaneous masking is a property of the human auditory system where certain
maskee sounds disappear in the presence of so-called masker sounds. Masking
in the frequency domain not only occurs within critical bands, but also spreads
to neighboring bands. Its simplest approximation is a triangular function with
slopes +25 dB/Bark and -10 dB/Bark (Figure 3-6), where the lower frequencies
have a stronger masking influence on higher frequencies than vice versa [146]. A more refined model is highly non-linear and depends on both frequency and amplitude. Masking is the most powerful characteristic of modern lossy coders: more details can be found in [17]. A non-linear spreading function, as found in [127] and modified by Lincoln in [104], is:
SF(z) = (15.81 - i) + 7.5 (z + 0.474) - (17.5 - i) \sqrt{1 + (z + 0.474)^2} \qquad (3.4)
where i = \min\left(5 \cdot PS(f) \cdot BW(f),\ 2.0\right), with BW(f) = 100 for f < 500 and BW(f) = 0.2 f for f \geq 500; PS is the power spectrum, and z is defined in equation 3.3. \qquad (3.5)
Instantaneous masking was essentially defined through experimentation with pure tones and narrow-band noises [50][49]. Integrating spreading functions in the case of complex tones is not very well understood. To simplify, we compute the full spectral mask through a series of individual partials.
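As an illustration, the simple triangular approximation (a slope of 25 dB/Bark toward lower bands and 10 dB/Bark toward higher bands) can be applied frame by frame, taking the strongest mask over all partials; the refined level-dependent function of equation 3.4 would replace the fixed slopes. The slopes and the max-combination are the only assumptions in this sketch:

    import numpy as np

    def apply_frequency_masking(bark_db, bark_centers, lower_slope=25.0, upper_slope=10.0):
        """Triangular spectral masking: each band spreads a masking curve that
        falls off at 25 dB/Bark toward lower bands and 10 dB/Bark toward higher
        bands; per band, the output keeps the louder of signal and mask."""
        dz = bark_centers[None, :] - bark_centers[:, None]     # maskee minus masker, in Bark
        drop = np.where(dz >= 0.0, -upper_slope * dz, lower_slope * dz)
        masked = np.empty_like(bark_db)
        for t in range(bark_db.shape[1]):
            curves = bark_db[:, t][:, None] + drop             # mask curve of every band
            masked[:, t] = np.maximum(bark_db[:, t], curves.max(axis=0))
        return masked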
Figure 3-6: [right] Spectral masking curves in the Bark scale as in reference [104] (for i = 0, 1, 2), and their triangular approximation with +25 dB/Bark and -10 dB/Bark slopes (dashed green). [left] The effect of frequency masking is demonstrated with two pure tones at 1000 and 1200 Hz. The two Bark spectrograms are zoomed around the frequency range of interest. The top one is raw; the bottom one includes frequency masking curves. In zone A, the two sinusoids are equally loud. In zones B and C, the amplitude of the tone at 1200 Hz is decreased exponentially. Note that in zone C1 the tone at 1200 Hz is clearly visible, while in zone C2 it entirely disappears under the masker, which makes it inaudible.
3.1.5 Temporal masking
Another perceptual phenomenon that we consider as well is temporal masking.
As illustrated in Figure 3-7, there are two types of temporal masking besides
simultaneous masking: pre-masking and post-masking. Pre-masking is quite
unexpected and not yet conclusively researched, but studies with noise bursts
revealed that it lasts for about 20 ms [183]. Within that period, sounds softer
than the masker are typically not audible. We do not implement it, since signal-windowing artifacts have a similar smoothing effect. However, post-masking is a kind of "ringing" phenomenon which lasts for almost 200 ms. We convolve the
envelope of each frequency band with a 200-ms half-Hanning (raised cosine)
window. This stage induces smoothing of the spectrogram, while preserving
attacks. The eect of temporal masking is depicted in Figure 3-8 for various
sounds, together with their loudness curve (more on loudness in section 3.2).
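A sketch of this post-masking stage follows (the kernel normalization and the max-combination with the original envelope are assumptions about how the convolution is applied):

    import numpy as np

    def apply_temporal_masking(bark_db, frame_rate_hz, mask_ms=200.0):
        """Post-masking: convolve each Bark-band envelope with a decaying 200-ms
        half-Hanning (raised cosine) kernel; decays are smoothed, attacks kept."""
        n = max(int(frame_rate_hz * mask_ms / 1000.0), 1)
        kernel = np.hanning(2 * n)[n:]                # decaying half, from ~1 down to 0
        kernel = kernel / kernel.sum()
        out = np.empty_like(bark_db)
        for b in range(bark_db.shape[0]):
            mask = np.convolve(bark_db[b], kernel, mode='full')[:bark_db.shape[1]]
            out[b] = np.maximum(bark_db[b], mask)     # the masking "tail" never hides onsets
        return out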
The temporal masking effects have important implications on the perception of rhythm. Figure 3-9 depicts the relationship between subjective and physical duration of sound events. The physical duration of the notes gives an incorrect estimation of the rhythm (in green), while if processed through a psychoacoustic model, the rhythm estimation is correct (in blue), and corresponds to what the performer and audience actually hear.
Figure 3-7: Schematic drawing of temporal masking, including pre-masking, simultaneous masking, and post-masking. Note that post-masking uses a different time origin (the masker offset).
Figure 3-8: Bark spectrogram of four sounds with temporal masking: a digital click, a clave, a snare drum, and a staccato violin sound. Note the 200-ms smoothing effect in the loudness curve.
3.1.6 Putting it all together
Finally, we combine all the preceding pieces together, following that order,
and build our hearing model. Its outcome is what we call the "audio surface." Its graphical representation, the auditory spectrogram, merely approximates a "what-you-see-is-what-you-hear" type of spectrogram, meaning that the "just visible" in the time-frequency display corresponds to the "just audible" in the underlying sound. Note that we do not understand music yet, but only sound.
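Assembled in that order, the front end might be sketched as below; it reuses the hypothetical helper functions sketched in sections 3.1.1 through 3.1.5 above, so it is a composition sketch under those assumptions rather than a self-contained or exact implementation:

    import numpy as np

    def auditory_spectrogram(x, sr=44100, hop=128):
        """Audio surface: dB spectrogram, ear transfer function, Bark warping,
        frequency masking, then temporal masking (sections 3.1.1 to 3.1.5)."""
        spec_db = power_spectrogram_db(x, sr=sr, hop=hop)                  # 3.1.1
        freqs = np.fft.rfftfreq(2048, d=1.0 / sr)
        spec_db = spec_db + outer_middle_ear_db(np.maximum(freqs, 1.0) / 1000.0)[:, None]  # 3.1.2
        bark_db = warp_to_bark(spec_db, freqs)                             # 3.1.3
        centers = np.arange(bark_db.shape[0]) + 0.5                        # Bark band centers
        bark_db = apply_frequency_masking(bark_db, centers)                # 3.1.4
        bark_db = apply_temporal_masking(bark_db, frame_rate_hz=sr / hop)  # 3.1.5
        return np.clip(bark_db, -60.0, None)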
Figure 3-9: Importance of subjective duration for the estimation of rhythm. A rhythmic pattern performed by a musician (see staff) results in a subjective sensation (blue) much different from the physical reality (green), i.e., the physical duration of the audio signal. A temporal model is implemented for accurate duration analysis and correct estimation of rhythm.
Figure 3-10 displays the audio surface of white noise, a sweeping pure tone, four
distinct sounds, and a real-world musical excerpt.
3.2 Loudness
The area below the audio surface (the zone covered by the mask) is called the excitation level; this area, minus the area covered by the threshold in quiet, leads to the sensation of loudness: the subjective judgment of the intensity of a sound. It is easily derived from our auditory spectrogram by adding the amplitudes across all frequency bands:
L_{\mathrm{dB}}(t) = \frac{1}{N} \sum_{k=1}^{N} E_k(t) \qquad (3.6)
where E_k is the amplitude of frequency band k out of a total of N in the auditory spectrogram. Advanced models of loudness by Moore and Glasberg can be found in [117][57]. An example is depicted in Figure 3-11.
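Equation 3.6 is a one-line reduction of the auditory spectrogram; a sketch follows (hypothetical helper names), with an optional mapping to [0, 1] for display as in the figures:

    import numpy as np

    def loudness_db(auditory_spec_db):
        """Loudness curve of equation 3.6: mean level over the N bands, per frame."""
        return auditory_spec_db.mean(axis=0)

    def loudness_normalized(auditory_spec_db):
        """Same curve mapped to [0, 1], as plotted in Figure 3-11 (display only)."""
        l = loudness_db(auditory_spec_db)
        return (l - l.min()) / max(l.max() - l.min(), 1e-12)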
3.3 Timbre
Timbre, or "tone color," is a relatively poorly defined perceptual quality of sound. The American Standards Association (ASA) defines timbre as "that attribute of sensation in terms of which a listener can judge that two sounds having the same loudness and pitch are dissimilar" [5]. In music, timbre is the quality of a musical note that distinguishes musical instruments.
Figure 3-10: Auditory spectrogram of [A] white noise; [B] a pure tone sweeping linearly from 20 Hz to 20 KHz; [C] four short sounds, including a digital click, a clave, a snare drum, and a staccato violin sound; [D] a short excerpt of "Sex Machine" by James Brown.
It was shown by Grey [66] and Wessel [168] that important timbre characteristics of orchestral sounds are attack quality (temporal envelope), spectral flux (evolution of the spectral distribution over time), and brightness (spectral centroid). In fact, this "psychoacoustician's waste basket" includes so many factors that the latest trend for characterizing timbre, sounds, and other high-level musical attributes consists of using a battery of so-called low-level audio descriptors (LLD), as specified for instance in the MPEG-7 standardization format [118].
Those can be organized in various categories including temporal descriptors
computed from the waveform and its envelope, energy descriptors referring
to various energy measurements of the signal, spectral descriptors computed
from the STFT, harmonic descriptors computed from the sinusoidal harmonic
modeling of the signal, and perceptual descriptors computed using a model of
the human hearing process [111][133][147].
Figure 3-11: 25 critical Bark bands for the short excerpt of James Brown's "Sex Machine" as in Figure 3-10, and its corresponding loudness curve with 256 frequency bands (dashed red), or only 25 critical bands (blue). The measurement of loudness through critical-band reduction is fairly reasonable, and computationally much more efficient.
The next step typically consists of finding the combination of those LLDs which hopefully best matches the perceptive target [132]. An original approach by Pachet and Zils substitutes the basic LLDs with primitive operators. Through
genetic programming, the Extraction Discovery System (EDS) aims at compos-
ing these operators automatically, and discovering signal-processing functions
that are locally optimal for a given descriptor extraction task [126][182].
Rather than extracting specic high-level musical descriptors, or classifying
sounds given a specic taxonomy and arbitrary set of LLDs, we aim at rep-
resenting the timbral space of complex polyphonic signals with a meaningful,
yet generic description. Psychoacousticians tell us that the critical band can be
thought of as a frequency-selective channel of psychoacoustic processing. For
humans, only 25 critical bands cover the full spectrum (via the Bark scale).
These can be regarded as a reasonable and perceptually grounded description
of the instantaneous timbral envelope. An example of that spectral reduction
is given in Figure 3-11 for a rich polyphonic musical excerpt.
3.4 Onset Detection
Onset detection (or segmentation) is the means by which we can divide the
musical signal into smaller units of sound. This section only refers to the most
atomic level of segmentation, that is, the smallest rhythmic events possibly
found in music: individual notes, chords, drum sounds, etc. Organized in time,
a sequence of sound segments infers our perception of music. Since we are not
concerned with sound source separation, a segment may represent a rich and
complex polyphonic sound, usually short. Other kinds of segmentations (e.g.,
voice, chorus) are specific aggregations of our minimal segments which require
source recognition, similarity, or continuity procedures.
3.4.1 Prior approaches
Many applications, including the holy-grail transcription task, are primarily
concerned with detecting onsets in a musical audio stream. There has been a
variety of approaches, including finding abrupt changes in the energy envelope
[38], in the phase content [10], in pitch trajectories [138], in audio similarities
[51], in autoregressive models [78], in spectral frames [62], through a multifea-
ture scheme [162], through ICA and hidden Markov modeling [1], and through
neural networks [110]. Klapuri [90] stands out for using psychoacoustic knowl-
edge; this is the solution proposed here as well.
3.4.2 Perceptually grounded approach
We define a sound segment by its onset and offset boundaries. It is assumed perceptually meaningful if its timbre is consistent, i.e., it does not contain any noticeable abrupt changes. Typical segment onsets include abrupt loudness, pitch, or timbre variations. All of these events translate naturally into an abrupt spectral variation in the auditory spectrogram.
We convert the auditory spectrogram into an event detection function by calculating the first-order difference function of each spectral band, and by summing across channels. The resulting signal reveals peaks that correspond to onset transients (Figure 3-12, pane 4). Transients within a 50-ms window typically fuse perceptually into a single event [155]. We model fusion by convolving the raw event detection signal with a Hanning window. Best results (i.e., with segments greater than 50 ms) are obtained with a 150-ms window. The filtering generates a smooth function now appropriate for the peak-picking stage. Unlike traditional methods, which usually rely heavily on designing an adaptive threshold mechanism, we can simply select the local maxima (Figure 3-12, pane 5). We may also reject the flattest peaks through a threshold, but this stage and its settings are not critical.
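A sketch of this event-detection function and peak picking follows (half-wave rectifying the band differences before summing is an assumption):

    import numpy as np

    def detect_onsets(auditory_spec_db, frame_rate_hz, smooth_ms=150.0):
        """Raw detection = summed positive first-order differences across bands;
        smoothed with a 150-ms Hanning window; onsets = local maxima."""
        diff = np.diff(auditory_spec_db, axis=1)
        raw = np.maximum(diff, 0.0).sum(axis=0)
        win = np.hanning(max(int(frame_rate_hz * smooth_ms / 1000.0), 3))
        smooth = np.convolve(raw, win / win.sum(), mode='same')
        is_peak = (smooth[1:-1] > smooth[:-2]) & (smooth[1:-1] >= smooth[2:])
        return np.flatnonzero(is_peak) + 1, smooth     # onset frame indices, detection curve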
Figure 3-12: A short 3.25 sec. excerpt of "Watermelon Man" by Herbie Hancock. [1] waveform (blue) and segment onsets (red); [2] auditory spectrogram; [3] loudness function; [4] raw event detection function; [5] smoothed detection function.
Since we are concerned with reusing the audio segments for synthesis, we refine the onset location by analyzing it in relation to its corresponding loudness function. An onset generally occurs with an increase in loudness. To retain the entire attack, we seek the previous local minimum in the loudness signal (in general a small time shift of at most 20 ms), which corresponds to the softest pre-onset moment, that is, the best time to cut. Finally, we look within
the corresponding waveform, and search for the closest zero-crossing, with an
arbitrary but consistent choice of direction (e.g., from negative to positive).
This stage is important to ensure signal continuity at synthesis.
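This refinement might look as follows in code (hypothetical helper; the 20-ms bound is expressed in frames through the hop size):

    import numpy as np

    def refine_onset(onset_frame, loudness, waveform, hop, sr=44100, max_shift_ms=20.0):
        """Back up to the previous local minimum of loudness (softest pre-onset
        moment), then forward to the next negative-to-positive zero crossing."""
        max_back = max(int(max_shift_ms / 1000.0 * sr / hop), 1)
        f = onset_frame
        while f > max(onset_frame - max_back, 1) and loudness[f - 1] <= loudness[f]:
            f -= 1
        s = f * hop                                        # frame index -> sample index
        while s + 1 < len(waveform) and not (waveform[s] < 0.0 <= waveform[s + 1]):
            s += 1
        return s                                           # refined cut point, in samples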
3.4.3 Tatum grid
Segment sequencing is the reason for musical perception, and the inter-onset
interval (IOI) is at the origin of the metrical-structure perception [74]. The
tatum, named after jazz pianist Art Tatum in [12] can be dened as the
lowest regular pulse train that a listener intuitively infers from the timing of
perceived musical events: a time quantum. It is roughly equivalent to the time
division that most highly coincides with note onsets: an equilibrium between 1)
how well a regular grid explains the onsets, and 2) how well the onsets explain
the grid.
The tatum is typically computed via a time-varying IOI histogram [64], with
an exponentially decaying window for past data, enabling the tracking of ac-
celerandos and ritardandos [148]. The period is found by calculating the great-
est common divisor (GCD) integer that best estimates the histogram harmonic
structure, or by means of a two-way mismatch error procedure as originally
proposed for the estimation of the fundamental frequency in [109], and applied
to tatum analysis in [65][67]. Two error functions are computed: one that il-
lustrates how well the grid elements of period candidates explain the peaks of
the measured histogram; another one illustrates how well the peaks explain the
grid elements. The TWM error function is a linear combination of these two
functions. Phase is found in a second stage, for example through circular mean
in a grid-to-onset alignment procedure as in [148].
Instead of a discrete IOI histogram, our method is based on a moving autocorre-
lation computed on the smooth event-detection function as found in section 3.4.
The window length is chosen adaptively from the duration of x past segments to ensure rough salience stability in the first-peak estimation of the autocorrelation (e.g., x ≈ 15). The autocorrelation is only partially calculated, since we are guaranteed to find a peak in the ±(100/x)% range around its center. The first peak gives the approximate tatum period. To refine that estimation, and
detect the phase, we run a search through a set of templates.
Templates are patterns or filters that we aim to align against the signal. We pre-compute dozens of regular pulse trains in the range 1.5-15 Hz through a series of click trains convolved with a Hanning window: the same used to smooth the detection function in section 3.4.2. To account for memory fading, we shape the templates with a half-raised cosine of several seconds, e.g., 3-6 sec. The templates are finally normalized by their total energy (Figure 3-13, left). At a given estimation time, the optimal template is the one with the highest energy when cross-correlated with the current smoothed detection function. For maximum efficiency, we only estimate templates within the range ±10% of our rough period estimation. We limit the cross-correlation lag search for the optimal template to only the tatum period length τ, since it contains the peak that will account for the phase offset ε and allows us to predict the next tatum location: φ[i+1] = φ[i] + τ[i] - c ε[i], where c is a smoothing coefficient and ε[i] ∈ [-τ[i]/2, +τ[i]/2[. The system quickly phase-locks and is efficiently updated at tatum-period rate.
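A compact sketch of the template search follows (the 150-ms click smoothing, the decay shaping, and the scoring by maximum cross-correlation follow the text; the exact parameter values and helper name are assumptions):

    import numpy as np

    def refine_tatum_period(detection, frame_rate_hz, rough_period, search=0.10):
        """Score regular pulse-train templates (clicks convolved with a Hanning
        window and shaped by a memory-decay envelope) against the recent smoothed
        detection function; return the best period within +/-10% of the rough one."""
        n = len(detection)
        decay = np.hanning(2 * n)[:n]                       # past (left) fades out
        target = detection * decay
        click = np.hanning(max(int(0.15 * frame_rate_hz), 3))   # same window as section 3.4.2
        best_period, best_score = rough_period, -np.inf
        lo = max(2, int(rough_period * (1.0 - search)))
        hi = int(rough_period * (1.0 + search)) + 1
        for period in range(lo, hi):
            template = np.zeros(n)
            template[::period] = 1.0                        # regular pulse train
            template = np.convolve(template, click, mode='same') * decay
            template = template / (np.sqrt((template ** 2).sum()) + 1e-12)
            score = max(np.dot(np.roll(template, lag), target) for lag in range(period))
            if score > best_score:                          # the arg-max lag gives the phase
                best_period, best_score = period, score
        return best_period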
Figure 3-13: Tatum tracking. [left] A bank of dozens of templates like the ones displayed here is pre-computed: eight are shown, with a memory decay of about 3.5 seconds; present is on the right, past is on the left. [right] Example of tracking "Misery" by The Beatles. The top pane shows the smooth detection function (blue) and the current best template match (red). The bottom pane displays the cross-correlation response around the predicted phase for the optimal template. Here the template is in perfect phase with the signal.
3.5 Beat and Tempo
The beat (or tactus) is a perceptually induced periodic pulse that is best de-
scribed by the action of foot-tapping to the music, and is probably the most
studied metrical unit. It defines tempo: a pace reference that typically ranges
from 40 to 260 beats per minute (BPM) with a mode roughly around 120 BPM.
Tempo is shown to be a useful time-normalization metric of music (section 4.5).
The beat is a down-sampled, aligned version of the tatum, although there is
no clear and right answer on how many tatum periods make up a beat period:
unlike tatum, which is derived directly from the segmentation signal, the beat
sensation is cognitively more complex and requires information both from the
temporal and the frequency domains.
3.5.1 Comparative models
Beat induction models can be categorized by their general approach: top-down
(rule- or knowledge-based), or bottom-up (signal processing). Early techniques
usually operate on quantized and symbolic representations of the signal, for
instance after an onset detection stage. A set of heuristic and gestalt rules
(based on accent, proximity, and grouping) is applied to infer the underly-
ing metrical structure [99][37][159][45]. More recently, the trend has been toward signal-processing approaches. The scheme typically starts with a front-end subband analysis of the signal, traditionally using a filter bank [165][141][4] or a discrete Fourier Transform [59][96][91]. Then, a periodicity estimation algorithm (including oscillators [141], histograms [39], autocorrelations [63], or probabilistic methods [95]) finds the rate at which signal events occur in concurrent channels. Finally, an integration procedure combines all channels into the final beat estimation. Goto's multiple-agent strategy [61] (also used by Dixon [38][39]) combines heuristics and correlation techniques together, including a chord change detector and a drum pattern detector. Klapuri's Bayesian probabilistic method, applied on top of Scheirer's bank of resonators, determines the best metrical hypothesis with constraints on continuity over time [92]. Both approaches stand out for their concern with explaining a hierarchical organization of the meter (section 4.6).
3.5.2 Our approach
A causal and bottom-up beat tracker based on our front-end auditory spec-
trogram (25 bands) and Scheirers bank of resonators [141] is developed. It
assumes no prior knowledge, and includes a condence value, which accounts
for the presence of a beat in the music. The range 60240 BPM is logarithmi-
cally distributed to a large bank of comb lters, whose properties are to resonate
at a given tempo. The lters are tested on multiple frequency channels of the
auditory spectrogram simultaneously, and are tuned to fade out within seconds,
as a way to model short-term memory. At any given time, their internal en-
ergy can be summed across channels by tempo class, which results in a tempo
spectrum as depicted in Figure 3-14 (bottom). Yet, one of the main drawbacks
of the model is its unreliable tempo-peak selection mechanism. A few peaks
of the spectrum may give a plausible answer, and choosing the highest is not
necessarily the best, or most stable strategy. A template mechanism is used
to favor the extraction of the fastest tempo in case of ambiguity
1
. Section 5.3,
however, introduces a bias-free method that can overcome this stability issue
through top-down feedback control.
Figure 3-14 shows an example of beat tracking a polyphonic jazz-fusion piece at
supposedly 143 BPM. A tempogram (middle pane) displays the tempo knowl-
edge gained over the course of the analysis. It starts with no knowledge, but
slowly the tempo space emerges. Note in the top pane that beat tracking was
stable after merely 1 second. The bottom pane displays the current output
of each resonator. The highest peak is our extracted tempo. A peak at the
sub octave (72 BPM) is visible, as well as some other harmonics of the beat.
A real-time implementation of our beat tracker is available for the Max/MSP
environment [180].
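For illustration, a bank of feedback comb filters applied to a single onset or loudness channel is sketched below (the thesis version runs on all 25 bands of the auditory spectrogram and sums their energies; the feedback gain used here is an assumption):

    import numpy as np

    def tempo_spectrum(signal, frame_rate_hz, bpm_min=60, bpm_max=240, step=2, alpha=0.9):
        """Scheirer-style resonators: one feedback comb filter per tempo candidate;
        the output energy of each filter forms a tempo spectrum."""
        bpms = np.arange(bpm_min, bpm_max + 1, step)
        energies = np.zeros(len(bpms))
        for k, bpm in enumerate(bpms):
            delay = max(int(round(frame_rate_hz * 60.0 / bpm)), 1)
            y = np.zeros(len(signal) + delay)
            for t in range(len(signal)):                   # y[n] = a*y[n-delay] + (1-a)*x[n]
                y[t + delay] = alpha * y[t] + (1.0 - alpha) * signal[t]
            energies[k] = (y[delay:] ** 2).sum()
        return bpms, energies / (energies.max() + 1e-12)   # peak location ~ extracted tempo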
Figure 3-14: Beat tracking of a 27 sec. excerpt of "Watermelon Man" by Herbie Hancock. [top] waveform (blue) and beat markers (red); [middle] tempogram: the system starts with no knowledge (black area) and gets gradually more confident; [bottom] tempo spectrum after 15 sec. of tracking.
3.6 Pitch and Harmony
The atomic audio fragments found through sound segmentation in section 3.4
represent individual notes, chords, drum sounds, or anything timbrally and
harmonically stable. If segmented properly, there should not be any abrupt
variations of pitch within a segment. Therefore it makes sense to analyze its
pitch content, regardless of its complexity, i.e., monophonic, polyphonic, noisy.
Since polyphonic pitch-tracking is yet to be solved, especially in a mixture
of sounds that includes drums, we opt for a simpler, yet quite relevant 12-
dimensional chroma (a pitch class regardless of its register) description as in [8].
A chromagram is a representation of chromas against time. It was previously
used for chord recognition [178], key analysis [131], chorus detection [60], and
thumbnailing [27].
Figure 3-15: Computing schematic for building a chromagram. The power spectrum energy is accumulated into 12 pitch classes through a bank of Hanning filters tuned to the equal temperament chromatic scale.
We compute the FFT of the whole segment (generally between 80 and 300 ms long), which gives us sufficient frequency resolution. A standard Hanning window is applied first, which slightly attenuates the effect of noisy transients while emphasizing the sustained part of the segment. A chroma vector is the result of folding the energy distribution of much of the entire power spectrum (6 octaves ranging from C1 = 65 Hz to B7 = 7902 Hz) into 12 discrete pitch classes. This is a fair approximation given that both the fundamental and the first harmonic correspond to the same pitch class and are often the strongest partials of the sound. The output of the 72 logarithmically spaced Hanning filters of a whole-step bandwidth (accordingly tuned to the equal temperament chromatic scale) is accumulated into their corresponding pitch class (Figure 3-15). The scale is best suited to western music, but applies to other tunings (Indian, Chinese, Arabic, etc.), although it is not as easily interpretable or ideally represented. The final 12-element chroma vector is normalized by dividing each of its elements by the maximum element value. We aim at canceling the effect of loudness across vectors (in time) while preserving the ratio between pitch classes within a vector (in frequency). An example of a segment-synchronized chromagram for four distinct sounds is displayed in Figure 3-16.
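A simplified sketch of the segment-synchronous chroma vector follows (it assigns each FFT bin to its nearest pitch class instead of using the 72 Hanning filters; that shortcut and the helper name are assumptions):

    import numpy as np

    def segment_chroma(segment, sr=44100, fmin=65.0, fmax=7902.0):
        """12-element chroma: Hanning window, one FFT over the whole segment,
        energy folded into pitch classes (C1 to B7), normalized by the maximum."""
        spec = np.abs(np.fft.rfft(segment * np.hanning(len(segment)))) ** 2
        freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)
        keep = (freqs >= fmin) & (freqs <= fmax)
        midi = 69.0 + 12.0 * np.log2(freqs[keep] / 440.0)     # fractional MIDI pitch
        pitch_class = np.mod(np.round(midi).astype(int), 12)  # 0 = C, ..., 11 = B
        chroma = np.zeros(12)
        np.add.at(chroma, pitch_class, spec[keep])
        return chroma / (chroma.max() + 1e-12)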
Figure 3-16: Four segment-synchronous chroma spectra for the sound example in Figure 3-8: a digital click, a clave, a snare drum, and a staccato violin sound. The broadband click sound is the noisiest (flat). The violin sound is the most harmonic: the pitch played is an A. Even the visible A# emphasis is recognizable by listening to the sound of the noisy clave sound.
Our implementation differs significantly from others, as we compute chromas on a segment basis. The great benefits of doing a segment-synchronous computation (the term is borrowed from its equivalent in pitch analysis, as in Pitch-Synchronous Overlap Add, PSOLA) are:
1. accurate time resolution by short-window analysis of onsets;
2. accurate frequency resolution by adaptive window analysis;
3. computation speed, since there is no need to overlap FFTs; and
4. an efficient and meaningful representation for future use.
The usual time-frequency paradox and our simple solution are shown in Fig-
ure 3-17 in the case of a chromatic scale. Our method optimizes both time and
frequency analysis while describing the signal in terms of its musical features
(onsets and inter-onset harmonic content). Figure 3-18 demonstrates the benefit of our pitch representation in the case of monotimbral music. Indeed, the critical-band representation, suitable for timbre recognition (section 3.3), is not suitable for pitch-content analysis. However, the segment-synchronous chromagram, which discards the timbre effect through its normalization process,
appears suitable for describing pitch and harmonic content.
3.7 Perceptual feature space
In this chapter, we have modeled human perception through psychoacoustics.
We have constructed a low-level representation of music signals in the form of
a sequence of short audio segments, and their associated perceptual content:
Figure 3-17: Time-frequency resolution paradox. A three-octave chromatic scale (from C2 to B4) is played on a piano at a rate of 5 notes per second. [A] Example of the distortion effect in the low range of the spectrum when computing the chromagram with short 12-ms windows (512 samples at 44.1 KHz, every 6 ms); [B] long 93-ms windows (4096 samples at 44.1 KHz, every 6 ms) mostly fix the frequency resolution issue, but 1) compromise on temporal accuracy, and 2) are computationally more costly; [C] our solution employs short windows for the segmentation, and one adapted segment-synchronous window (about 200 ms) for computing the frequency content of a segment. The result is both accurate and fast to compute.
1. rhythm, represented by a loudness curve (we use the terms rhythm, loudness function, and dynamic features interchangeably);
2. timbre, represented by 25 channels of an auditory spectrogram;
3. pitch, represented by a 12-class chroma vector.
Although the pitch content of a given segment is now represented by a compact 12-feature vector, loudness, and more so timbre, are still large (by an order of magnitude), since we still describe them on a frame basis. Empirically, it was found (via resynthesis experiments constrained only by loudness) that for a given segment, the maximum value in the loudness curve is a better approximation of the overall perceived loudness than the average of the curve. For a more accurate representation, we describe the loudness curve with 5 dynamic features: loudness at onset (dB), maximum loudness (dB), loudness at offset (dB), length of the segment (ms), and time location of the maximum loudness relative to the onset (ms).
Figure 3-18: Typical II-V-I progressions played on a piano. No distinction can be made by looking at the waveform, the loudness curve, or the 25-critical-band auditory spectrogram, since there is only one timbre. However, the chromagram allows us to differentiate chords and recognize similarities between chords of the same class, even when they are inverted. The last pane shows the chromagram computed on a per-segment basis, while the previous one is computed every 6 ms with a 93-ms long window. Our representation is much simpler (only one chroma per segment), and preserves the most important information.
A currently rough approximation of timbre consists of averaging the Bark features over time, which reduces the timbre space to only 25 Bark features per segment. We finally label our segments with a total of 5 rhythm features + 25 timbral features + 12 pitch features = 42 features (as already predicted by supercomputer Deep Thought in [2]).
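Assembling the 42-element description might look as follows (feature ordering and helper names are assumptions):

    import numpy as np

    def segment_features(loudness_curve_db, bark_slice_db, chroma, frame_ms):
        """5 dynamic + 25 timbre + 12 pitch features = 42 features per segment."""
        peak = int(np.argmax(loudness_curve_db))
        dynamic = np.array([
            loudness_curve_db[0],                     # loudness at onset (dB)
            loudness_curve_db[peak],                  # maximum loudness (dB)
            loudness_curve_db[-1],                    # loudness at offset (dB)
            len(loudness_curve_db) * frame_ms,        # segment length (ms)
            peak * frame_ms,                          # time of maximum, from onset (ms)
        ])
        timbre = bark_slice_db.mean(axis=1)           # average of the 25 Bark bands
        return np.concatenate([dynamic, timbre, chroma])    # shape (42,)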
Figure 3-19: A short excerpt of "Sex Machine" by James Brown illustrating the 12 pitch features in synchrony with segments, and corresponding timbral features, equally distributed in time.
Such a compact, low-rate, segment-synchronous feature description is regarded as a musical signature, or signal metadata, and has proven to be efficient and meaningful in a series of applications, as demonstrated in chapter 6. In the following chapter, we interpret this musical metadata through a set of hierarchical structures of pitch, rhythm, and timbre.
CHAPTER FOUR
Musical Structures
One has no right, under pretext of freeing yourself, to be illogical
and incoherent by getting rid of structure.
Thelonious Monk
This chapter is a natural extension of music listening as described in chapter 3.
The goal is to extract higher-level musical structures from the low-level musical
metadata already derived from the audio. Therefore, the work here implies
using more pattern-recognition techniques [42] than signal processing. With
perception in particular, pattern recognition has much to do with similarity,
as we are dealing with interpreted signals: there is no absolute truth or consensus on how to organize audio signals. Musical similarity, though, remains to be defined properly [81].
4.1 Multiple Similarities
Measuring similarities in music audio is a particularly ambiguous task that has
not been adequately addressed. The first reason why this problem is difficult is
simply because there is a great variety of criteria (both objective and subjective)
that listeners must consider and evaluate simultaneously, e.g., genre, artist,
instrument, tempo, melody, rhythm, timbre. Figure 4-1 depicts this problem
in the visual domain, with at least two obvious classes of similarities that can
be considered: outline and texture.
Another major difficulty comes from the scale at which the similarity is estimated. Indeed, measuring the similarity of sound segments or entire records raises different questions. Previous works have typically focused on one single task, for instance: instruments [111], rhythmic patterns [46], or entire songs [6].
Figure 4-1: Simple example of the similarity problem in the visual domain. When asked "How would you group these four images?" about one third of Media Lab students would group images as {A, B} and {C, D}, one third would group them as {A, C} and {B, D}, and the rest would not answer. The panda outlines are from Cesar Coelho. The computer-generated 3D renderings are from Hedlena Bezerra.
Our approach, on the other hand, is an attempt at analyzing multiple levels of
musically meaningful similarities, in a uniform fashion.
4.2 Related Work
Recent original methods of mapping music into a subjective semantic space for
measuring the similarity of songs and artists have been proposed [11]. These
methods rely on statistical models of labeled signals (for example, acquired through web-crawling and natural language processing, NLP), which do not easily scale down to measuring similarities within songs. Here, we are only con-
cerned with objective acoustic criteria (or musical dimensions) measurable
directly from the signal with no prior knowledge, such as the ones extracted
through music listening: pitch, rhythm, and timbre. The analysis task is chal-
lenging for the following reasons:
1. non-orthogonality of the musical dimensions; and
2. hierarchical dependency of the various scales of analysis.
Non-orthogonality because the pitch, loudness and timbre percepts depend
on a shared underlying signal structure. Hierarchical because the sensation
of tatum, tempo, or pattern all depend on the lowest-level segment structure.
We believe that a multi-scale and multi-class approach to acoustic similarities
is a step forward towards a more generalized model.
4.2.1 Hierarchical representations
Hierarchical representations of pitch and rhythm have already been proposed by
Lerdahl and Jackendoff in the form of a musical grammar based on Chomskian linguistics [100]. Although a questionable oversimplification of music, among
other rules their theory includes the metrical structure, as in our representation.
However, in addition to pitch and rhythm, we introduce the notion of hierarchi-
cal timbre structure, a perceptually grounded description of music audio based
on the metrical organization of its timbral surface: the perceived spectral shape
in time, as described in section 3.3.
4.2.2 Global timbre methods
Global timbre analysis has received much attention in recent years as a means
to measure the similarity of songs [11][6][69]. Typically, the estimation is built
upon a pattern-recognition architecture. The algorithm aims at capturing the
overall timbre distance between songs, given a large set of short audio frames, a
small set of acoustic features, a statistical model of their distribution, and a dis-
tance function. It was shown, however, by Aucouturier and Pachet, that these
generic approaches quickly lead to a glass ceiling, at about 65% R-precision
[7]. They conclude that substantial improvements would most likely rely on a
deeper understanding of the cognitive processes underlying the perception of
complex polyphonic timbres, and the assessment of their similarity.
It is indeed unclear how humans perceive the superposition of sounds, or what "global" means, and how much it is actually more significant than local similarities. Comparing the most salient segments, or patterns in songs, may perhaps
lead to more meaningful strategies.
4.2.3 Rhythmic similarities
A similarity analysis of rhythmic patterns is proposed by Paulus and Klapuri
[130]. Their method, only tested with drum sounds, consists of aligning tempo-
variant patterns via dynamic time warping (DTW), and comparing their nor-
malized spectral centroid, weighted with the log-energy of the signal. It is
pointed out that aligning patterns turns out to be the most difficult task.
Ellis and Arroyo [46] present an approach to rhythm similarity called "eigen-rhythm," using Principal Component Analysis (PCA) of MIDI drum patterns from 100 popular songs. First, the length and downbeat of input patterns are
estimated via autocorrelation and by alignment to an average reference pat-
tern. Finally, through PCA analysis, they reduce the high-dimensionality data
to a small space of combined basis patterns that can be used for classification
and visualization.
In [129], Parry and Essa extend the notion of rhythmic elaboration to audio, first
proposed by Tanguiane [158] for symbolic music. They divide the amplitude
envelope via beat tracking, and measure the pattern length as in [130]. This
method stands out for its ability to estimate the complexity of transitioning
between intra- or inter-song patterns.
4.2.4 Self-similarities
In 2001, Foote and Cooper [52] were the first to introduce self-similarities as a way to visualize musical structures. The method consists of building a square matrix where time runs from left to right, as well as from top to bottom. Cells {t_i, t_j} of the matrix represent the audio similarity at times t_i and t_j in the waveform. Typically the distance measure is symmetric (i.e., dist(t_i, t_j) = dist(t_j, t_i)); thus the matrix is symmetric, and the diagonal is null, as shown for the example in Figure 4-7.
From matrix to segmentation
Foote and Cooper propose the cosine distance between Mel-Frequency Cepstral-Coefficient (MFCC, a popular front-end choice for state-of-the-art speech recognition systems) feature vectors. They suggest that "psychoacoustically motivated parameterizations [...] may be especially appropriate if they match the similarity judgments of human listeners." From their matrix can be derived a representation of the rhythmic structure that they call the "beat spectrum." In
[30], they propose a statistically based framework for segmenting and clustering
larger audio segments via Singular Value Decomposition (SVD). The analysis
can, for instance, return the structural summarization of a piece by recognizing
its most representative sections, such as chorus and verse.
Peeters et al. also propose to summarize songs by detecting occurrences of
large sections in self-similarity matrices [132]. Their approach stands out for
introducing dynamic features, which model the temporal evolution of the spectral shape over a fixed duration of time. Interestingly, they consider various scales of analysis (up to 1 Hz), and employ an unsupervised grouping strategy of successive states, through K-means and a hidden Markov model (HMM). In a first pass, states are defined as templates, by segmenting the matrix.
From segmentation to matrix
Our approach starts with a series of significant segmentations, naturally derived from perceptual models of listening, and hierarchically organized. Our similarity matrices differ from others in the nature of the audio representation, and in how we derive larger segment similarities (i.e., sound, beat, pattern). The standard approach usually consists of 1) computing a fine-grain matrix of generic similarities, through short windows of fixed duration, and 2) segmenting the matrix at a given scale by detecting sequence repetitions (upper/lower diagonals). Instead, we first compute meaningful segmentations of specific
perceptual characteristics (i.e., pitch, rhythm, timbre), and only then compute
their similarity matrices. Our benefits are:
1. multiple representations for more specific applications;
2. segment boundaries accurately preserved; and
3. independence to time (tempo) variations.
In a sense, our segmented and synchronous matrices (i.e., segment, beat, pat-
tern), which are orders of magnitude smaller than usual frame-based matrices
(Figure 4-2), are graphical artifacts of intermediate structure-analysis results,
rather than an algorithmic starting point, as is usually the case. We are not
particularly concerned with extracting very large structures from the lowest-
level matrices directly. Nonetheless, these algorithms still apply at higher lev-
els, where the typical O(N^2) computation cost might originally be a limitation
with standard frame-based representations.
Our hierarchical representation is recursively computed using dynamic pro-
gramming (section 4.3), which itself is a recursive algorithm. We use larger
matrices (lower-level) to infer the smaller ones (higher-level), such that the
pattern-synchronous self-similarity matrices (our smallest) are computed via dy-
namic programming from the beat-synchronous self-similarity matrices, which
themselves are computed from the segment-synchronous self-similarity matri-
ces, which are found via dynamic programming at the frame scale.
4.3 Dynamic Programming
If you don't do the best with what you have happened to have got,
you will never do the best with what you should have had.
Rutherford Aris
Dynamic programming (DP) is a method for solving dynamic (i.e., with time structure) optimization problems using recursion. (The word "programming" in the name has nothing to do with writing computer programs; mathematicians use the word to describe a set of rules which anyone can follow to solve a problem.) It is typically used for text processing, and for the alignment and similarity of biological sequences, such as protein or DNA sequences. First, a distance function determines the similarity of items between sequences (e.g., 0 or 1 if items match or not).
Figure 4-2: 3D representation of our hierarchical similarity representation in the timbre case for a 47-second excerpt of the song "Feel No Pain" by Sade. [right] The first number (black) represents the linear ratio (i.e., per dimension) going from one scale (audio, frame, segment, beat, and pattern layers) to another; the second number (blue) is the square ratio (i.e., the number of cells in the matrix).
Then, an edit cost penalizes alignment changes in the sequence (e.g., deletion, insertion, substitution). Finally, the total accumulated cost is a measure of dissimilarity.
A particular application of DP is known as dynamic time warping (DTW) [33].
The concept of dynamic time alignment has been applied to solve the problems
associated with spectral sequence comparison of speech in automated speech
recognition (ASR) systems [72].
The principle illustrated in Figure 4-3 is the following: given a test pattern T = {t_1, ..., t_N} and a reference pattern R = {r_1, ..., r_M}, a particular point-to-point mapping
\phi = (\phi_t, \phi_r) \quad \text{of length } K_{\phi} \qquad (4.1)
The optimal alignment minimizes the overall distortion D(T, R), where D(T, R) is based on a sum of local distances between elements, i.e., d(t_i, r_j), where d is a distance defined as Euclidean, Cosine, Mahalanobis, Manhattan, Minkowski, etc., and t_i, r_j are specific feature vectors. The recursive algorithm incrementally evaluates all possible paths to the desired endpoint {t_N, r_M}. Once the endpoint is reached, the optimal path can be determined by backtracking through the minimum cost function:
Figure 4-3: Example of dynamic time warping (DTW) in the case of one-dimensional signals. In our case, we deal with two-dimensional signals: the distance between each point t_i and r_j is measured as a distance between two feature vectors (e.g., two 25-dimensional Bark frames).
D(T, R) = \min_{\phi} D_{\phi}(T, R) \qquad (4.2)
D_{\phi}(T, R) = \frac{1}{M_{\phi}} \sum_{k=1}^{K_{\phi}} d\left(t_{\phi_t(k)}, r_{\phi_r(k)}\right) m_k \qquad (4.3)
with the following endpoint and monotonicity constraints:
\phi_t(1) = 1, \quad \phi_r(1) = 1, \quad \phi_t(K_{\phi}) = N, \quad \phi_r(K_{\phi}) = M,
\phi_t(k+1) \geq \phi_t(k), \quad \phi_r(k+1) \geq \phi_r(k), \quad M_{\phi} = \sum_{k=1}^{K_{\phi}} m_k \qquad (4.4)
and m_k is a path weight, tuned according to the problem being solved as in Figure 4-4, or discarded: m_k = 1 for k \in (1, K_{\phi}).
Figure 4-4: Weight function used for the timbre similarity of sound segments: a half-raised cosine, maximal at the attack and decaying to a value h at the release. The parameter h is chosen empirically.
4.4 Sound Segment Similarity
In a first task, we seek to measure the timbral similarity of the segments found through section 3.4. A first approximation of D(T, R) consists of measuring the Euclidean distance (equation 4.5) between the 25-dimensional feature vectors proposed in section 3.7. However, since timbre depends highly on temporal resolution, if we have access to the auditory spectrogram, a much more advanced solution is to time-align the segments using dynamic time warping. In this case, d(t_i, r_j) is the distance between the 25-dimensional feature vectors of the auditory spectrogram, and D(T, R) is defined by the optimal path in the DTW algorithm (section 4.3). The Euclidean distance (equation 4.5), defined as the straight-line distance between two points, should be appropriate, since the auditory spectrogram is perceptually normalized, i.e., the geometric distance between segments is proportional to the perceptual distance, as defined psychoacoustically. It is computed for two points x = (x_1, ..., x_N) and y = (y_1, ..., y_N) in Euclidean N-space as:
D_{\mathrm{Euclidean}} = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2} \qquad (4.5)
It was shown in [66] that attacks are more important than decays for timbre recognition. We dynamically weigh the path (parameter m_k in equation 4.3) with a half-raised cosine function, as displayed in Figure 4-4, therefore increasing the alignment cost at the attack more than at the decay. Two parameters are chosen empirically (the edit cost, and the weight function value h), which could be optimized in future implementations. We compute the segment-synchronous self-similarity matrix of timbre as displayed in Figure 4-7 (top-right). Note that the structural information (visible in the frame-based self-similarity matrix) remains, although we are now looking at a much lower, yet musically informed, analysis rate (a matrix size ratio of almost 1600 to 1).
The pitch similarity of segments is computed directly by measuring the distance between chroma vectors, since it was shown that time resolution is not really an issue (section 3.6). We choose the Euclidean distance. However, here is a place to insert specific heuristics on the perceptual distance between chords: for example, CM7 may sound closer to Em7 than to C7 [93]. A simple example of decorrelating timbre from pitch content in segments is shown in Figure 4-6. Note how timbre boundaries are easily detected regardless of their pitch content, and how chord patterns are clearly identified, regardless of their timbre. Finally, the dynamic-loudness similarity of segments (we cannot really talk about rhythm at this level) can be computed by DTW of the one-dimensional loudness curve.
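The first-approximation similarity computations just described can be sketched as follows; the placeholder arrays stand in for the per-segment 25-dimensional timbre vectors and 12-dimensional chroma vectors, and the DTW of section 4.3 over the auditory spectrogram would simply replace the inner Euclidean distance for timbre.

import numpy as np

def self_similarity(features):
    # Segment-synchronous self-similarity from per-segment feature vectors.
    # features: (n_segments, d); small values mean perceptually close segments.
    diff = features[:, None, :] - features[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))   # Euclidean distance (equation 4.5)

rng = np.random.default_rng(0)
segment_timbre = rng.random((129, 25))    # placeholder 25-dimensional segment vectors
segment_chroma = rng.random((129, 12))    # placeholder 12-dimensional segment chromas

timbre_matrix = self_similarity(segment_timbre)   # cf. Figure 4-7 (top-right)
pitch_matrix = self_similarity(segment_chroma)    # cf. Figure 4-6 (right)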
Figure 4-5: Chord progression (Dm7, G7, CM7, C#o) played successively with various timbres, as in the example of Figure 4-6.
4.5 Beat Analysis
The beat analysis (section 3.5) reveals the underlying musical metric on which sounds arrange. There are generally between 2 and 5 segments per beat. Using the segment-synchronous self-similarity matrix of timbre as a new distance function d(t_i, r_j), we can repeat the DP procedure again, and infer a beat-synchronous self-similarity matrix of timbre. Although we do not consider it, here is a place to insert more heuristics, e.g., by weighting on-beat segments more than off-beat segments.
Figure 4-6: Test example of decorrelating timbre from pitch content. A typical II-V-I chord progression (Dm7, G7, CM7, C#o, as in Figure 4-5) is played successively with piano, brass, and guitar sounds on a MIDI instrument. [left] Segment-synchronous self-similarity matrix of timbre. [right] Segment-synchronous self-similarity matrix of pitch.
Another option consists of computing the similarity of beats directly from the auditory spectrogram, as in section 4.4, which is more costly.
Musically, it makes sense to consider pitch similarities at the beat level as well. We compute the beat-synchronous self-similarity matrix of pitch much like we do for timbre. Unlike segment-synchronous self-similarity matrices, beat-synchronous self-similarity matrices are perfectly normalized in time, regardless of their local tempo. This is particularly important when comparing music whose tempo is not perfectly steady. Note in Figure 4-7 (bottom-left) that the much smaller matrix displays a series of upper/lower diagonals, which reveal the presence of musical patterns.
4.6 Pattern Recognition
Beats can often be grouped into patterns, also referred to as meter and indicated
by a symbol called a time signature in western notation (e.g., 3/4, 4/4, 12/8).
This section, however, deals with patterns as perceived by humans, rather than
their original score notation, as organized by measures.
4.6.1 Pattern length
A typical method for finding the length of a pattern consists of applying the autocorrelation function to the signal energy (here the loudness curve). This is a good approximation based on dynamic variations of amplitude (i.e., the rhythmic content), but it does not consider pitch or timbre variations.
Figure 4-7: First 47 seconds of Sade's Feel no pain represented hierarchically in the timbre space as self-similarity matrices: [top-left] frames; [top-right] segments; [bottom-left] beats; [bottom-right] patterns. Note that each representation is a scaled transformation of the other, yet synchronized to a meaningful musical metric. Only beat and pattern representations are tempo invariant. This excerpt includes 8 bars of instrumental introduction, followed by 8 bars of instrumental plus singing. The two sections appear clearly in the pattern representation.
Our system computes pattern similarities from a short-term version of the beat-synchronous self-similarity matrices (considering only a limited section around the main diagonal), thereby synchronizing the analysis to the beat.
We run parallel tests on a beat basis, measuring the similarity between successive patterns of 1 to 11 beats (typical patterns are 3, 4, 5, 6, or 8 beats long), much like our bank of oscillators with the beat tracker (section 3.5). We pick the first peak, which corresponds to a particular number of beats (Figure 4-8). Note that patterns in the pitch dimension, if they exist, could be of different lengths than those found in the timbre or rhythm dimensions. An example where only analyzing the pitch content can characterize the length of a pattern is shown in Figure 4-9. A complete model should include all representations, such that the length L of an ideal pattern is found as the first peak in a combination of all similarities. However, this is not currently implemented in our system: we choose timbre similarities for finding the pattern length.
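As an illustration, the following sketch estimates a pattern length by comparing successive windows of beat-synchronous feature vectors; the window comparison, the similarity measure (inverse mean distance), and the placeholder data are simplifications of the short-term self-similarity analysis described above, not the exact procedure.

import numpy as np

def pattern_length(beat_features, max_beats=11):
    # Estimate the pattern length in beats by comparing successive windows.
    # beat_features: (n_beats, d) beat-synchronous feature vectors (e.g., timbre).
    # Assumes enough beats are available for every candidate length.
    n = len(beat_features)
    similarity = np.zeros(max_beats + 1)
    for l in range(1, max_beats + 1):
        dists = []
        for start in range(0, n - 2 * l, l):
            a = beat_features[start:start + l].ravel()
            b = beat_features[start + l:start + 2 * l].ravel()
            dists.append(np.linalg.norm(a - b))
        similarity[l] = 1.0 / (1.0 + np.mean(dists))   # high = successive patterns alike
    # pick the first local peak of the similarity-versus-length curve
    for l in range(2, max_beats):
        if similarity[l] >= similarity[l - 1] and similarity[l] >= similarity[l + 1]:
            return l
    return int(np.argmax(similarity[1:]) + 1)

rng = np.random.default_rng(1)
beats = rng.random((64, 25))          # placeholder beat-synchronous timbre vectors
print(pattern_length(beats))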
Figure 4-8: Causal analysis of pattern length through timbre [A] or rhythm (via the autocorrelation of our loudness function) [B] for the Sade excerpt of Figure 4-7. Overall, both timbre and rhythm agree on a first peak at 4 beats [C]. In the first half of the excerpt, however, the rhythm's first peak is at 2 beats, while it is at 4 beats in the timbre space [D]. In the second half of the excerpt, both show a pattern length of 4 beats again [E].
A drawback of autocorrelation methods is that they do not return the phase
information, i.e., where patterns begin. This problem of downbeat prediction is
addressed in section 5.3 by taking advantage of our current multi-class repre-
sentation, together with a training scheme. However, we now present a simpler, heuristic approach that is generally quite reasonable for popular music.
4.6.2 Heuristic approach to downbeat detection
We assume that chord changes are most likely to occur on the downbeat as opposed to other beats: a fair assumption for most western music, which is typically annotated on a measure basis. We suppose that we already know the length L (in number of beats) of a pattern, as found in section 4.6.1. We define the dissimilarity D between two successive patterns as the Euclidean distance (equation 4.5) between their chromas averaged over L beats. For a given pattern i, the downbeat estimation consists of finding the maximum likelihood \max_{\phi_j} D_{\phi_j}[i] in a set of L dissimilarity evaluations, i.e., over all beat phases \phi_j, where 0 \leq j \leq L - 1.
If L can be divided by two, then it is likely that the minimum likelihood \min_{\phi_j} D_{\phi_j}[i] occurs at the opposite phase ((\phi_j + L/2) \bmod L) of the maximum likelihood. Indeed, averaging chromas over two chords is more likely to minimize dissimilarities. Therefore, a more robust strategy first computes the absolute difference between pairs of dissimilarities in phase opposition, and chooses the best candidate (maximum likelihood) from the pair with the highest absolute difference.
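A minimal sketch of this heuristic is given below, assuming beat-synchronous chroma vectors and a known pattern length L; the function name, the averaging of dissimilarities over the whole excerpt, and the random placeholder data are illustrative simplifications of the causal, pattern-by-pattern procedure described here.

import numpy as np

def downbeat_phase(beat_chromas, L):
    # Heuristic downbeat phase from beat-synchronous chromas.
    # beat_chromas: (n_beats, 12); L: pattern length in beats.
    # For each candidate phase j, chromas are averaged over successive L-beat
    # patterns and the dissimilarity between consecutive patterns is accumulated;
    # the phase maximizing dissimilarity (chord change on the downbeat) wins.
    n = len(beat_chromas)
    D = np.zeros(L)
    for j in range(L):
        means = [beat_chromas[s:s + L].mean(axis=0)
                 for s in range(j, n - L + 1, L)]
        diffs = [np.linalg.norm(a - b) for a, b in zip(means[:-1], means[1:])]
        D[j] = np.mean(diffs)
    if L % 2 == 0:
        # pick the opposition pair (j, j + L/2) with the largest contrast,
        # then the more dissimilar member of that pair
        half = L // 2
        j = int(np.argmax(np.abs(D[:half] - D[half:])))
        return j if D[j] >= D[j + half] else j + half
    return int(np.argmax(D))

rng = np.random.default_rng(2)
chromas = rng.random((64, 12))        # placeholder beat-synchronous chromas
print(downbeat_phase(chromas, 4))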
The process is causal, although it has a lag of 2L beats, as demonstrated in Figure 4-9 for a simple synthesized example, and in Figure 4-10 for a real-world example. The lag can be completely removed through a general predictive model of the downbeat, as proposed in section 5.3. However, if real-time analysis is not a concern, then overall the present approach is statistically reasonable.
4.6.3 Pattern-synchronous similarities
Finally, we derive the pattern-synchronous self-similarity matrix, again via dy-
namic programming. Here is a good place to insert more heuristics, such as the
weight of strong beats versus weak beats. However, our current model does not
assume any weighting. We implement pitch similarities from beat-synchronous
chroma vectors, and rhythm similarities using the elaboration distance function
proposed in [129], together with our loudness function. Results for an entire
song can be found in Figure 4-11. Note that the elaboration distance function
is not symmetric: a simple rhythmic pattern is considered more similar to a
complex pattern than vice versa.
4.7 Larger Sections
Several recent works have dealt with the question of thumbnailing [28], music
summarization [132][30], or chorus detection [60]. These related topics all deal
with the problem of extracting large non-periodic sections in music. As can
be seen in Figure 4-7 (bottom-right), larger musical structures appear in the
matrix of pattern self-similarities. As mentioned in section 4.2.4, the advantages of our system in extracting large structures are 1) its invariance to tempo, and 2) its inherent segmentation: finding section boundaries is less of a concern, as we do not rely on such resolution.
Figure 4-9: [top] A series of eight different chords (four Am7 and four Dm7) is looped 6 times on a piano at 120 BPM, minus one chord discarded in the middle, i.e., 47 chords total. [middle] The chromagram depicts beat-synchronous chromas. [bottom] Four evaluations that measure the dissimilarity of two consecutive patterns of four beats are run in parallel every four beats, for \phi_0 (blue), \phi_1 (red), \phi_2 (black), and \phi_3 (green). While D_{\phi_0} is clearly the highest during the first half (i.e., the first beat is the downbeat), the highest beat phase then shifts to \phi_3 (i.e., the downbeat has shifted back by one beat).
Figure 4-10: Real-world example using a 1-minute excerpt of Lebanese blonde by Thievery Corporation, a pop tune that includes drums, percussion, electric piano, voice, sitar, flute, and bass. The song alternates between underlying chords Am7 and Dm7, with a fair amount of syncopation. The beat tracker takes about a measure to lock on. The ground truth is then \phi_2, as depicted by the black line. The most unlikely beat phase is \phi_0, as shown in blue. We find through pairs in phase opposition that |D_{\phi_2}[i] - D_{\phi_0}[i]| > |D_{\phi_3}[i] - D_{\phi_1}[i]| for 1 < i < 20, which allows us to choose \phi_2[i] with \max(D_{\phi_0}[i], D_{\phi_2}[i]). Dotted lines show the average phase estimation.
Figure 4-11: Pitch [left] and rhythm [right] self-similarity matrices of patterns for the entire Sade song Feel no pain. The red squares outline the first 16 measures studied in Figure 4-7 with timbre similarities. Note that a break section (blue dashed squares) stands out in the rhythm representation, whereas it is only the beginning of the last section in the pitch representation.
Although this is not currently implemented, previous techniques used for extracting large structures in music, such as the Gaussian-tapered checkerboard kernel in [30] or the pattern matching technique in [28], apply here, but at a much lower resolution, greatly increasing the computation speed while preserving the temporal accuracy. Note that the pattern level is a fair assumption of section boundary in most popular music. In future work, we may consider combining the different class representations (i.e., pitch, rhythm, timbre) into a single tunable system for extracting large sections.
4.8 Chapter Conclusion
In this chapter, we propose a recursive multi-class (pitch, rhythm, timbre) approach to the structure analysis of acoustic similarities in popular music. Our representation is hierarchically organized, where each level is musically meaningful. Although fairly intensive computationally, our dynamic programming method is time-aware, causal, and flexible. Future work may include inserting and optimizing heuristics at various stages of the algorithm. Our representation may be useful for fast content retrieval (e.g., through vertical rather than horizontal search: going down the tree structure towards the best similarity rather than testing all the leaves); improved song similarity architectures that include specific content considerations; and music synthesis, as described in chapter 6.
CHAPTER FIVE
Learning Music Signals
The beautiful thing about learning is that no one can take it
away from you.
B.B. King
Learning is acquiring knowledge or skill through study, experience, or teaching.
Whether a computer system learns or merely induces generalizations is
often a subject of debate. Indeed, learning from data or examples (similarity-
based learning) is another way of speaking about generalization procedures
and concept representations that are typically simplistic and brittle [119].
However, we argue that a music-listening system is not complete until it is
able to improve or adapt its performance over time on tasks similar to those
done previously. Therefore, this chapter introduces learning strategies that
apply to the music analysis context. We believe that state-of-the-art learning
algorithms are able to produce robust models of relatively complex systems, as
long as the data (i.e., musical features) is consistent and the learning problem
well posed.
5.1 Machine Learning
Machine learning [115], a subfield of artificial intelligence [140], is concerned with the question of how to construct computer programs (i.e., agents) that automatically improve their performance at some task with experience. Learning takes place as a result of the interaction between the agent and the world, and from observation by the agent of its own decision-making processes. When learning from measured or observed data (e.g., music signals), the machine learning algorithm is also concerned with how to generalize the representation of that data, i.e., to find a regression or discrimination function that best describes the data or category. There is a wide variety of algorithms and techniques, and their description would easily fill up several volumes. There is no ideal one: results usually depend on the problem at hand, the complexity of the implementation, and the execution time. Here, we recall some of the key notions and concepts of machine learning.
5.1.1 Supervised, unsupervised, and reinforcement learning
When dealing with music signals and extracting perceptual information, there is necessarily a fair amount of ambiguity and imprecision (noise) in the estimated data, due not only to the analysis technique, but also to the inherent fuzziness of the perceptual information. Therefore, statistics are widely used and often play an important role in machine perception: a machine that can recognize patterns grounded on our senses. If an external teacher provides a category label or cost for each pattern (i.e., when there is specific feedback available), the learning is said to be supervised: the learning element is given the true output for particular inputs. It adapts its internal representation of a correlation function to best match the information provided by the feedback. More formally, we say that an example (or sample) is a pair (x, f(x)), where x is the input and f(x) is the output of the function applied to x. Induction is the task that, given a collection of examples of f, returns a function h (the hypothesis) that approximates f. Supervised learning can be incremental (i.e., updating its old hypothesis whenever a new example arrives) or based on a representative training set of examples. One must use a large enough number of training samples, while keeping some of them for validation of the hypothesis function (typically around 30%).
In unsupervised learning or clustering, there is no explicit teacher, and the system forms natural clusters (groupings) of the input patterns. Different clustering algorithms may lead to different clusters, and the number of clusters can be specified ahead of time if there is some prior knowledge of the classification task. Finally, a third form of learning, reinforcement learning, specifies only whether the tentative classification or decision is right or wrong, which improves (reinforces) the classifier.

For example, if our task were to classify musical instruments from listening to their sound, in a supervised context we would first train a classifier by using a large database of sound recordings for which we know the origin. In an unsupervised learning context, several clusters would be formed, hopefully representing different instruments. With reinforcement learning, a new example with a known target label is computed, and the result is used to improve the classifier.
5.1.2 Generative vs. discriminative learning
One current way of categorizing a learning approach is differentiating whether it is generative or discriminative [77]. In generative learning, one provides domain-specific knowledge in terms of structure and parameter priors over the joint space of variables. Bayesian (belief) networks [84][87], hidden Markov models (HMM) [137], Markov random fields [177], Kalman filters [21], mixtures of experts [88], mixtures of multinomials, mixtures of Gaussians [145], and so forth, provide a rich and flexible language for specifying this knowledge and subsequently refining it with data and observations. The final result is a probabilistic distribution that is a good generator of novel examples.

Conversely, discriminative algorithms adjust a possibly non-distributional model to data, optimizing a specific task, such as classification or prediction. Popular and successful examples include logistic regression [71], Gaussian processes [55], regularization networks [56], support vector machines (SVM) [22], and traditional artificial neural networks (ANN) [13]. This often leads to superior performance, yet compromises the flexibility of generative modeling. Jaakkola et al. recently proposed Maximum Entropy Discrimination (MED) as a framework to combine both discriminative estimation and generative probability density [75]. Finally, in [77], imitative learning is presented as another variation on generative modeling, which also learns from examples from an observed data source. However, the distinction is that the generative model is an agent that is interacting in a much more complex surrounding external world. The method was demonstrated to outperform (under appropriate conditions) the usual generative approach in a classification task.
5.2 Prediction
Linear systems have two particularly desirable features: they are well characterized and are straightforward to model. One of the most standard ways of fitting a linear model to a given time series consists of minimizing squared errors between the data and the output of an ARMA (autoregressive moving average) model. It is quite common to approximate a nonlinear time series by locally linear models with slowly varying coefficients, as through Linear Predictive Coding (LPC) for speech transmission [135]. But the method quickly breaks down when the goal is to generalize nonlinear systems.

Music is considered a data-rich and theory-poor type of signal: unlike strong and simple models (i.e., theory-rich and data-poor), which can be expressed in a few equations and few parameters, music holds very few assumptions, and modeling anything about it requires a fair amount of generalization ability, as opposed to memorization. We are indeed interested in extracting regularities from training examples that transfer to new examples. This is what we call prediction.
5.2.1 Regression and classication
We use our predictive models for classification, where the task is to categorize the data into a finite number of predefined classes, and for nonlinear regression, where the task is to find a smooth interpolation between points, while avoiding overfitting (the problem of fitting the noise in addition to the signal). The regression corresponds to mapping a high-dimensional input data stream into a (usually) one-dimensional nonlinear output function. This one-dimensional signal may be the input data itself in the case of a signal forecasting task (section 5.2.5).
5.2.2 State-space forecasting
The idea of forecasting future values by using immediately preceding ones (called a time-lag vector, or tapped delay line) was first proposed by Yule [179]. It turns out that the underlying dynamics of nonlinear systems can also be understood, and their geometrical behavior inferred, from observing delay vectors in a time series, where little or no a priori information is available about its origin [157]. This general principle is known as state-space reconstruction [53], and it inspires our predictive approach. The method has previously been applied to musical applications, e.g., for determining the predictability of driven nonlinear acoustic systems [144], for musical gesture analysis and embedding synthesis [114], and more recently for modeling musical compositions [3].
5.2.3 Principal component analysis
Principal component analysis (PCA) involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for the greatest possible statistical variability (or entropy) in the data, and each succeeding component accounts for as much of the remaining variability as possible. Usually, PCA is used to discover or reduce the dimensionality of the data set, or to identify new meaningful underlying variables, i.e., patterns in the data. Principal components are found by extracting the eigenvectors and eigenvalues of the covariance matrix of the data, and are calculated efficiently via singular value decomposition (SVD). These eigenvectors describe an orthonormal basis that is effectively a rotation of the original cartesian basis (Figure 5-1).
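For reference, the SVD route to PCA takes only a few lines of NumPy; this is a generic sketch with placeholder data rather than the implementation used in this work.

import numpy as np

def pca(X, n_components):
    # Principal components via SVD of the mean-centered data matrix.
    # X: (n_samples, n_features). Returns the projected data, the principal
    # axes (eigenvectors of the covariance matrix), and their eigenvalues,
    # ordered by decreasing explained variance.
    Xc = X - X.mean(axis=0)                      # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # principal axes v1, v2, ...
    explained = (s ** 2) / (len(X) - 1)          # eigenvalues of the covariance
    return Xc @ components.T, components, explained[:n_components]

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 20))                  # placeholder time-lag vectors
Y, axes, var = pca(X, n_components=3)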
5.2.4 Understanding musical structures
When a time-lag vector space is analyzed through PCA, the resulting first few dimensions characterize the statistically most significant underlying degrees of freedom of the dynamical system.
Figure 5-1: Graphical representation of a PCA transformation in only two dimensions. The variance of the data in the original cartesian space (x, y) is best captured by the basis vectors v1 and v2 in a rotated space.
A data set that appears complicated when plotted as a time series might reveal a simpler structure, represented by a manifold, when plotted as a stereo plot (Figure 5-2). This apparently simpler structure provides evidence of embedded redundant patterns, which we seek to model through learning. Our primary goal is to build a generative (cognitive) representation of music, rather than resynthesize it from a rigid and finite description. Nevertheless, a different application consists either of transforming the manifold (e.g., through rotation, scaling, distortion, etc.) and unfolding it back into a new time series, or of using it as an attractor for generating new signals governed by similar dynamics [3].
Figure 5-2: Manifold of the first three dimensions of the time-lag vector in the eigenspace. 4000 loudness points for [A] the electronic piece Lebanese blonde by Thievery Corporation; [B] the funk piece It's all right now by Eddie Harris; [C] the jazz piece One last pitch by Harry Connick Jr. The manifold structure shows the existence of rhythmic patterns.
5.2.5 Learning and forecasting musical structures
We project a high-dimensional space onto a single dimension, i.e., a correlation between the time-lag vector [x_{t-d}, \ldots, x_{t-1}] and the current value x_t [48]. We use machine-learning algorithms to infer the mapping, which consequently provides an insight into the underlying behavior of the system dynamics. Given an initial set of d data points, the previously taught model can exploit the embedded geometry (long-term memory) to predict a new forthcoming data point x_{t+1}. By repeating the procedure further through data points x_{t+2}, x_{t+3}, and so on, the model forecasts the time series even farther into the future.
We model rhythm in this way, through both the iterative mixture-of-Gaussians framework (CWM) provided by Schoner [145], and a support vector machine package (SVMlight) provided by Joachims [85] (Figure 5-3). The outcome is an extremely compact rhythm synthesizer that has learned from examples, and can generate a loudness function given an initialization data set. We measure the robustness of the model by its ability to predict 1) previously trained data points; 2) new test data points; and 3) the future of the time series (both short-term and long-term accuracy). For given numbers of delays and other related parameters, better overall performances are typically found using SVM (Figure 5-3). However, a quantitative comparison between learning strategies goes beyond the scope of this thesis.
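The following sketch illustrates the principle of time-lag embedding followed by kernel regression and iterated forecasting; it uses scikit-learn's SVR on a synthetic loudness-like curve as a stand-in for the CWM and SVMlight models actually used, and all names and parameter values are illustrative assumptions.

import numpy as np
from sklearn.svm import SVR

def embed(x, d):
    # Time-lag embedding: rows are [x_{t-d}, ..., x_{t-1}], targets are x_t.
    X = np.array([x[t - d:t] for t in range(d, len(x))])
    y = x[d:]
    return X, y

# placeholder loudness curve; in practice this is the extracted loudness function
t = np.linspace(0, 60, 3000)
loudness = 0.5 + 0.4 * np.sin(2 * np.pi * 2 * t) * np.cos(2 * np.pi * 0.25 * t)

d = 15                                            # tap delay (15 past points)
X, y = embed(loudness, d)
model = SVR(kernel='rbf').fit(X[:2000], y[:2000]) # train on the "past"

# iterated forecasting: feed each prediction back into the delay line
window = list(loudness[2000 - d:2000])
forecast = []
for _ in range(500):
    nxt = model.predict(np.array(window[-d:])[None, :])[0]
    forecast.append(nxt)
    window.append(nxt)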
5.2.6 Support Vector Machine
Support vector machines (SVM) [164] rely on preprocessing the data so as to represent patterns in a high dimension, typically much higher than the original feature space. With an appropriate nonlinear mapping into the new space, through a basis function or kernel (such as a Gaussian, polynomial, or sigmoid function), the data can always be regressed (and classified) with a hyperplane (Figure 5-4). Support vector machines differ radically from comparable approaches in that SVM training always converges to a global minimum, i.e., the search corresponds to a convex quadratic programming problem, typically solved by matrix inversion. While obviously not the only machine learning solution, this is the one we choose for the following experiments.
5.3 Downbeat prediction
Other than modeling and forecasting one-dimensional signals, such as the loud-
ness (rhythm) signal, we can also predict new musical information given a mul-
tidimensional input. A particularly interesting example is downbeat prediction based on surface listening and time-lag embedded learning. The model is causal and tempo independent: it does not require beat tracking.
Figure 5-3: Rhythm prediction of the funk song It's all right now by Eddie Harris using: [A] CWM with 12 Gaussians, 20-point tap delay, 20 past data points; [B] SVM with a Gaussian kernel, 15-point tap delay, 15 past data points; [blue] training loudness function; [dotted blue] initial buffer; [red] predicted loudness function; [green] unknown future ground truth; [black] forecasted future. The dotted black marker indicates the present.
Figure 5-4: Typical classification task using SVM. The classifier must discriminate between two types of labeled data with a nonlinear discriminative function [input space]. The data is elevated into a higher-dimensional space where it can be discriminated with a hyperplane [feature space]. Dotted lines represent support vectors. The SVM algorithm tries to maximize the distance between support vectors of opposite classes.
In fact, it could appropriately be used as a phase-locking system for the beat tracker of section 3.5, which currently runs in an open loop (i.e., without a feedback control mechanism). This is left for future work.
5.3.1 Downbeat training
The downbeat prediction is supervised. The training is a semi-automatic task that requires little human control. If our beat tracker is accurate throughout the whole training song, and the measure length is constant, we label only one beat with an integer value p_b \in [0, M - 1], where M is the number of beats in the measure, and where 0 is the downbeat. The system extrapolates the beat-phase labeling to the rest of the song. In general, we can label the data by tapping the downbeats along with the music in real time, and by recording their location in a text file. The system finally labels segments with a phase location: a float value p_s \in [0, M[. The resulting segment phase signal looks like a sawtooth ranging from 0 to M. Taking the absolute value of its derivative returns our ground-truth downbeat prediction signal, as displayed in the top pane of Figure 5-5. Another good option consists of labeling tatums (section 3.4.3) rather than segments.
The listening stage, including auditory spectrogram, segmentation, and music feature labeling, is entirely unsupervised (Figure 3-19). So is the construction of the time-lag feature vector, which is built by appending an arbitrary number of preceding multidimensional feature vectors. Best results were obtained using 6 to 12 past segments, corresponding to nearly the length of a measure. We model short-term memory fading by linearly scaling down older segments, therefore increasing the weight of the most recent segments (Figure 5-5).
Training a support vector machine to predict the downbeat corresponds to a regression task from several dozen feature dimensions (e.g., 9 past segments x 42 features per segment = 378 features) into one single dimension (the corresponding downbeat phase of the next segment). Several variations of this principle are also possible. For instance, an additional PCA step (section 5.2.3) allows us to reduce the space considerably while preserving most of its entropy. We arbitrarily select the first 20 eigen-dimensions (Figure 5-6), which generally account for about 60 to 80% of the total entropy, while reducing the size of the feature space by an order of magnitude. It was found that the results are almost equivalent, while the learning process gains in computation speed. Another approach that we have tested consists of selecting the relative features of a running self-similarity triangular matrix rather than the original absolute features, e.g., (9^2 - 9)/2 = 36 features for 9 past segments. Results were found to be roughly equivalent, and also faster to compute.
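A minimal sketch of this feature construction is given below: segment feature vectors are stacked over a short past window with linear fading, and the target is the derivative of a toy sawtooth phase signal. The array sizes, the fading law, and the placeholder data are illustrative assumptions; the resulting X and y would feed the PCA and SVM regression steps described above.

import numpy as np

def time_lag_features(segment_features, n_past=9):
    # Stack the last n_past segment feature vectors, fading older ones.
    # segment_features: (n_segments, d) per-segment descriptors (e.g., d = 42).
    # Returns an (n_segments - n_past, n_past * d) matrix where segment t is
    # described by segments t-n_past .. t-1, linearly scaled down with age.
    n, d = segment_features.shape
    fade = np.linspace(1.0, 1.0 / n_past, n_past)      # most recent first
    rows = []
    for t in range(n_past, n):
        past = segment_features[t - n_past:t][::-1]    # newest to oldest
        rows.append((past * fade[:, None]).ravel())
    return np.array(rows)

# placeholder inputs: per-segment features and a toy downbeat-phase signal
rng = np.random.default_rng(4)
features = rng.random((400, 42))
phase = (np.arange(400) % 4) / 4.0                     # toy sawtooth phase, 4 beats
X = time_lag_features(features, n_past=9)              # 9 x 42 = 378 dimensions
y = np.abs(np.diff(phase))[8:]                         # larger where the phase wraps (downbeats)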
We expect the resulting model not only to be able to predict the downbeat of our training data set, but also to generalize well enough to predict the downbeat of new input data (denominated test data in the following experiments).
Figure 5-5: [bottom] Time-lag embedded feature vectors (9 segments of 37 features in this example) for a dozen measures of the Beatles' Love me do. Note that past features (greater index number) are faded out. [top] Corresponding target downbeat prediction signal.
Figure 5-6: PCA reduction of the time-lag feature vectors of Figure 5-5. The space is properly normalized by mean and variance across dimensions. Because each dimension successively represents axes of remaining maximum entropy, no direct interpretation of the features is possible anymore.
An overall schematic of the training method is depicted in Figure 5-7.
Figure 5-7: Supervised learning schematic (auditory spectrogram, segmentation, beat tracking, time-lagged features, human supervision of phase, downbeat learning).
Although the downbeat may often be interpreted through harmonic shift [61] or a generic template pattern [92], sometimes neither of these assumptions applies. This is the case in the following example.
5.3.2 The James Brown case
James Brown's music is often characterized by its repetitive single chord and syncopated rhythmic pattern. The usual assumptions, as just mentioned, do not hold. There may not even be any measurable energy in the signal at the perceived downbeat. Figure 5-8 shows the results of a simple learning test with a 30-second excerpt of I got the feelin'. After listening and training with only 30 seconds of music, the model demonstrates good signs of learning (left pane), and is already able to predict reasonably well some of the downbeats in the next 30-second excerpt of the same piece (right pane).
Note that for these experiments: 1) no periodicity is assumed; 2) the system does not require a beat tracker and is actually tempo independent; and 3) the predictor is causal and does, in fact, predict one segment into the future, i.e., about 60 to 300 ms. The prediction schematic is given in Figure 5-9.
5.3.3 Inter-song generalization
Our second experiment deals with a complex rhythm from the northeast of
Brazil called Maracatu. One of its most basic patterns is shown in standard
notation in Figure 5-10. The bass-drum sounds are circled by dotted lines.
Note that two out of three are syncopated.
Figure 5-8: Downbeat prediction for James Brown's I got the feelin'. [left] Training data set, including a dozen measures. [right] Test data set: the dotted green line represents the ground truth, the blue line is our prediction. Note that there can be a variable number of segments per measure, therefore the distance between downbeats may vary in the plot.
Figure 5-9: Causal downbeat prediction schematic (auditory spectrogram, segmentation, time-lagged features, downbeat prediction).
A listening test was given to several musically trained western subjects, none of whom could find the downbeat. Our beat tracker also performed very badly, and tended to lock onto syncopated accents.
Figure 5-10: Typical Maracatu rhythm score notation.
We train our model with six Maracatu pieces from the band Maracatu Estrela
Brilhante. Those pieces have different tempi, and include a large number of drum sounds, singing voices, choirs, lyrics, and a variety of complex rhythmic patterns. Best results were found using an embedded 9-segment feature vector, and a Gaussian kernel for the SVM. We verify our Maracatu expert model both on the training data set (8100 data points), and on a new piece from the same album (1200 data points).
Figure 5-11: Downbeat prediction results for the Maracatu expert model. [top] Full training data set (6 songs) including training downbeat (dotted green) and predicted downbeat (blue). Note that learning is consistent across all data. [middle] Zoom (red section) in the training data to show phase prediction accuracy. [bottom] Results for the new test data (1 song). Note that the last few measures are particularly accurate: most songs in fact tend to end in the same way.
The model performs outstandingly on the training data, and does well on the new untrained data (Figure 5-11). The total computation cost (including listening and modeling) was found to be somewhat significant at the training stage (about 15 minutes on a dual-2.5 GHz Mac G5 for the equivalent of 20 minutes of music), but minimal at the prediction stage (about 15 seconds for a 4-minute song).
This experiment demonstrates the workability of extracting the downbeat in arbitrarily complex musical structures, through supervised learning, and without needing the beat information. Although we applied it to extract the downbeat location, such a framework should allow us to learn and predict other musical information, such as beat location, time signature, key, genre, artist, etc. This is left for future work.
5.4 Time-Axis Redundancy Cancellation
Repeating sounds and patterns are widely exploited throughout music. However, although analysis and music information retrieval applications are often concerned with processing speed and music description, they typically discard the benefits of sound redundancy cancellation. Our method uses unsupervised clustering, allows for a reduction of the data complexity, and enables applications such as compression [80].
5.4.1 Introduction
Typical music retrieval applications deal with large databases of audio data. One of the major concerns of these programs is the meaningfulness of the music description, given solely an audio signal. Another concern is the efficiency of searching through a large space of information. With these considerations, some recent techniques for annotating audio include psychoacoustic preprocessing models [128], and/or a collection of frame-based (i.e., 10 to 20 ms) perceptual audio descriptors [70] [113]. The data is highly reduced, and the description hopefully relevant. However, although the annotation is appropriate for sound and timbre, it remains complex and inadequate for describing music.
In this section, two types of clustering algorithms are proposed: nonhierarchical
and hierarchical. In nonhierarchical clustering, such as the k-means algorithm,
the relationship between clusters is undetermined. Hierarchical clustering, on
the other hand, repeatedly links pairs of clusters until every data object is
included in the hierarchy. The goal is to group similar segments together to
form clusters whose centroid or representative characterizes the group, revealing
musical patterns and a certain organization of sounds in time.
5.4.2 Nonhierarchical k-means clustering
K-means clustering is an algorithm used for partitioning (clustering) N data points into K disjoint subsets S_j so as to minimize the sum-of-squares criterion:

J = \sum_{j=1}^{K} \sum_{n \in S_j} |x_n - \mu_j|^2    (5.1)

where x_n is a vector representing the n-th data point and \mu_j is the geometric centroid of the data points in S_j. The number of clusters K must be selected at the onset. The data points are assigned at random to initial clusters, and a re-estimation procedure finally leads to non-optimal (local) minima. Despite these limitations, and because of its simplicity, k-means clustering is the most popular clustering strategy. An improvement over k-means, called Spectral Clustering, consists roughly of a k-means method in the eigenspace [120], but it is not yet implemented.
We start with the segment metadata as described in section 3.7. That MDS space being theoretically normalized and Euclidean (the geometrical distance between two points is equivalent to their perceptual distance), it is acceptable to use k-means for a first prototype. Perceptually similar segments fall in the same region of the space. An arbitrary small number of clusters is chosen depending on the targeted accuracy and compactness. The process is comparable to vector quantization: the smaller the number of clusters, the smaller the lexicon and the stronger the quantization. Figure 5-12 depicts the segment distribution for a short audio excerpt at various segment ratios (defined as the number of retained segments divided by the number of original segments). Redundant segments get naturally clustered, and can be coded only once. The resynthesis for that excerpt, with 30% of the original segments, is shown in Figure 5-14.
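A minimal sketch of this quantization step is shown below, using scikit-learn's KMeans on placeholder segment vectors; the segment ratio, the random data, and the choice of the member closest to each centroid as the cluster representative are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

# placeholder for the per-segment perceptual feature vectors of section 3.7
rng = np.random.default_rng(5)
segments = rng.random((129, 25))

ratio = 0.3                                    # retain roughly 30% of the segments
k = max(1, int(round(ratio * len(segments))))
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(segments)

labels = km.labels_                            # cluster index per segment
# representative segment per cluster: the member closest to the centroid
representatives = np.array([
    np.where(labels == j)[0][
        np.argmin(np.linalg.norm(segments[labels == j] - km.cluster_centers_[j], axis=1))
    ]
    for j in range(k)
])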
Figure 5-12: Color-coded segment distribution for the 129 segments of the Watermelon Man piece by Herbie Hancock, at various segment ratios. 100% means that all segments are represented, while 10% means that only 13 different segments are retained. Note the time-independence of the segment distribution; here is an example of the distribution for the 13 calculated most perceptually relevant segments out of 129:
33 33 66 66 23 122 23 15 8 112 42 8 23 42 23 15 112 33 33 66 66 66 108 23 8 42 15 8 128 122 23 15 112 33 66 115 66 122 23 15
8 128 42 66 128 42 23 15 112 33 66 115 8 108 23 15 8 42 15 8 128 122 23 115 112 33 66 115 86 128 23 33 115 112 42 8 128 42
23 115 112 8 66 8 66 108 86 15 23 42 15 8 128 122 23 115 112 8 66 115 86 128 23 122 8 112 42 8 108 42 23 115 112 8 66 115 66
108 86 122 23 42 122 23 128 122 23 128 128
One of the main drawbacks of using k-means clustering is that we may not know
ahead of time how many clusters we want, or how many of them would ideally
describe the perceptual music redundancy. The algorithm does not adapt to the
type of data. It makes sense to consider a hierarchical description of segments
organized in clusters that have subclusters that have subsubclusters, and so on.
5.4.3 Agglomerative Hierarchical Clustering
Agglomerative hierarchical clustering is a bottom-up procedure that begins with each object as a separate group. These groups are successively combined based on similarity until there is only one group remaining, or a specified termination condition is satisfied. For n objects, n - 1 mergings are done. Agglomerative hierarchical methods produce dendrograms (Figure 5-13). These show hierarchical relations between objects in the form of a tree.

We can start from a similarity matrix as described in section 4.2.4. We order segment pairs by forming clusters hierarchically, starting from the most similar pairs. At each particular stage, the method joins together the two clusters that are the closest to each other (most similar). Differences between methods arise because of the different ways of defining the distance (or similarity) between clusters.
Figure 5-13: [A] Dendrogram of a drum loop made of 5 distinct sounds repeating 12 times. The perceptual threshold, selected manually at 0.015, groups the sounds into 5 clusters. Each cluster is composed of 12 sounds. [B] Synthesized time series of the musical path through the 5 clusters. The 12 loops are recovered.
The most basic agglomerative model is single linkage, also called nearest neighbor. In single linkage, an object is linked to a cluster if at least one object in the cluster is the closest. One defect of this distance measure is the creation of unexpectedly elongated clusters, called the chaining effect. On the other hand, in complete linkage, two clusters fuse depending on the most distant pair of objects among them. In other words, an object joins a cluster when its similarity to all the elements in that cluster is equal to or higher than the considered level. Other methods include average linkage clustering, average group, and Ward's linkage [42].
The main advantages of hierarchical clustering are that 1) we can take advantage of our already computed perceptual similarity matrices; 2) the method adapts its number of clusters automatically to the redundancy of the music; and 3) we can choose the level of resolution by defining a similarity threshold. When that threshold is high (fewer clusters), the method leads to rough quantizations of the musical description (Figure 5-13). When it is low enough (more clusters) that it barely represents the just-noticeable difference between segments (a perceptual threshold), the method allows for a reduction of the complexity of the description without altering its perception: redundant segments get clustered, and can be coded only once. This particular evidence leads to a compression application.
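As an illustration, the sketch below builds a complete-linkage dendrogram from a segment dissimilarity matrix with SciPy and cuts it at a threshold; the placeholder distances and the threshold value (scaled arbitrarily here, unlike the 0.015 used for the drum-loop example of Figure 5-13) are assumptions, not the parameters used in this work.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# placeholder segment-to-segment dissimilarity matrix (e.g., from section 4.4)
rng = np.random.default_rng(6)
X = rng.random((60, 25))
dist = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))

# complete-linkage dendrogram from the condensed distance matrix
Z = linkage(squareform(dist, checks=False), method='complete')

# cut the dendrogram at a perceptual threshold: members of a cluster are then
# considered interchangeable and can be coded only once
perceptual_threshold = 0.015 * dist.max()      # illustrative value only
labels = fcluster(Z, t=perceptual_threshold, criterion='distance')
print(labels.max(), 'clusters for', len(X), 'segments')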
5.4.4 Compression
Compression is the process by which data is reduced into a form that minimizes the space required to store or transmit it. While modern lossy audio coders efficiently exploit the limited perception capacities of human hearing in the frequency domain [17], they do not take into account the perceptual redundancy of sounds in the time domain. We believe that by canceling such redundancy, we can reach further compression rates. In our demonstration, the segment ratio indeed highly correlates with the compression rate that is gained over traditional audio coders.

Perceptual clustering allows us to reduce the audio material to the most perceptually relevant segments, by retaining only one representative (near-centroid) segment per cluster. These segments can be stored along with a list of indexes and locations. Resynthesis of the audio consists of juxtaposing the audio segments from the list at their corresponding locations (Figure 5-14). Note that no cross-fading or interpolation between segments is used at resynthesis.
If the threshold is chosen too high, too few clusters may result in musical distortions at resynthesis, i.e., the sound quality is fully maintained, but the musical syntax may audibly shift from its original form. The ideal threshold is theoretically a constant value across songs, which could be defined through empirical listening tests with human subjects; it is currently set by hand. The clustering algorithm relies on our matrix of segment similarities as introduced in section 4.4. Using the agglomerative clustering strategy with additional supervised feedback, we can optimize the distance-measure parameters of the dynamic programming algorithm (i.e., parameter h in Figure 4-4, and the edit cost as in section 4.3) to minimize the just-noticeable threshold, and equalize the effect of the algorithm across large varieties of sounds.
5.4.5 Discussion
Reducing audio information beyond current state-of-the-art perceptual codecs by structure analysis of its musical content is arguably a bad idea. Purists would certainly disagree with the benefit of cutting some of the original material altogether, especially if the music is entirely performed. There are obviously great risks of musical distortion.
Figure 5-14: [A] 26-second audio excerpt of Watermelon Man by Herbie Hancock. From top to bottom: waveform, auditory spectrogram, and loudness curve with segmentation markers (129 segments of around 200 ms). [B] Resynthesis of the piece with about 30% of the segments (less than 8 seconds of audio) determined automatically through agglomerative clustering. From top to bottom: new waveform, new auditory spectrogram, new loudness curve with segmentation markers. Note that there are few noticeable differences, both in the time and frequency domains.
The method naturally applies better to certain genres, including electronic music, pop, or rock, where repetition is an inherent part of their qualities. Formal experiments could certainly be done to measure the entropy of a given piece and its compressibility across sub-categories.
We believe that, with a real adaptive strategy and an appropriate perceptually grounded error estimation, the principle has great potential, primarily for devices such as cell phones and PDAs, where bit rate and memory space matter more than sound quality. At the moment, segments are compared and concatenated as raw material. There is no attempt to transform the audio itself. However, a much more refined system would estimate similarities independently of certain perceptual criteria, such as loudness, duration, aspects of equalization or filtering, and possibly pitch. Resynthesis would consist of transforming the retained segment parametrically (e.g., amplifying, equalizing, time-stretching, pitch-shifting, etc.) in order to match its target more closely. This could greatly improve the musical quality and increase the compression rate, while refining the description.
Perceptual coders have already provided us with a valuable strategy for estimating the perceptually relevant audio surface (by discarding what we cannot hear). Describing musical structures at the core of the codec is an attractive concept that may have great significance for many higher-level information retrieval applications, including song similarity, genre classification, rhythm analysis, transcription tasks, etc.
CHAPTER SIX
Composing with sounds
If it sounds good and feels good, then it IS good!
Duke Ellington
The motivation behind this analysis work is primarily music synthesis. We are interested in composing with a database of sound segments (of variable sizes, usually ranging from 60 to 300 ms), which we can extract from a catalog of musical samples and songs (e.g., an MP3 collection). These segments can be rearranged into sequences themselves derived from timbral, harmonic, and rhythmic structures extracted/learned/transformed from existing songs (another approach for combining segments consists of using generative algorithms). Our first application, however, does not rearrange short segments, but rather manipulates entire sections of songs by smoothly transitioning between them.
6.1 Automated DJ
A DJ (Disc Jockey) is an individual who selects, mixes, and plays pre-recorded music for the enjoyment of others. Often, DJs are expected to demonstrate the greater technical aspects of their performance by manipulating the songs they play in order to maintain a given tempo and energy level. Even simply playing records allows the DJ to bring his own creative ideas to bear upon the pre-recorded music. Playing songs in sequence offers the opportunity to observe relationships forming between different songs. Given careful attention and control, the DJ can create these relations and encourage them to become more expressive, beautiful, and telling [172]. This is called the art of programming, or track selection.
6.1.1 Beat-matching
Certainly, there is more to the art of DJ-ing than technical abilities. In this first application, however, we are essentially interested in the problem of beat-matching and cross-fading songs as smoothly as possible. This is one of the DJ's most common practices, and it is relatively simple to explain but harder to master. The goal is to select songs with similar tempos, and align their beats over the course of a transition while cross-fading their volumes. The beat markers, as found in section 3.5, are obviously particularly relevant features. The length of a transition is chosen arbitrarily by the user (or the computer), from no transition to the length of an entire song; it could also be chosen through the detection of salient changes of structural attributes, but this is not currently implemented. We extend the beat-matching principle to downbeat matching by making sure that downbeats align as well. In our application, the location of a transition is chosen by selecting the most similar rhythmic pattern between the two songs, as in section 4.6.3. The analysis may be restricted to finding the best match between specific sections of the songs (e.g., the last 30 seconds of song 1 and the first 30 seconds of song 2).
To ensure a perfect match over the course of long transitions, DJs typically adjust the playback speed of the music through specialized mechanisms, such as a relative-speed controller on a turntable (specified as a relative positive/negative percentage of the original speed). Digitally, a similar effect is implemented by sampling-rate conversion of the audio signal [154]. The procedure, however, distorts the perceptual quality of the music by detuning the whole sound. To correct this artifact, we implement a time-scaling algorithm that is capable of speeding up or slowing down the music without affecting the pitch.
6.1.2 Time-scaling
There are three main classes of audio time-scaling (or time-stretching): 1) the
time-domain approach, which involves overlapping and adding small windowed
fragments of the waveform; 2) the frequency-domain approach, which is typi-
cally accomplished through phase-vocoding [40]; and 3) the signal-modeling ap-
proach, which consists of changing the rate of a parametric signal description,
including deterministic and stochastic parameters. A review of these meth-
ods can be found in [16], and implementations for polyphonic music include
[15][94][102][43].
We have experimented with both the time-domain and frequency-domain methods, with certain original properties added to them. For instance, it is suggested in [16] to leave transients unprocessed in order to reduce artifacts due to the granularity effect of windowing. While the technique has previously been used with the phase vocoder, we apply it to our time-domain algorithm as well. The task is relatively simple since the signal is already segmented, as in section 3.4. For each sound segment, we pre-calculate the amount of required stretch, and we apply a fixed-size overlap-add algorithm onto the decaying part of the sound, as in Figure 6-1. In our frequency implementation, we compute a segment-long FFT to gain maximum frequency resolution (we assume stationary frequency content throughout the segment), and time-scale the strongest partials of the spectrum during the decay, using an improved phase-vocoding technique, i.e., by preserving the correlation between the phases of adjacent bins for a given partial [97]. Our frequency implementation performs well with harmonic sounds, but does not do well with noisier sounds: we adopt the simpler and more consistent time-domain approach.
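The time-domain idea can be sketched as follows: the attack is copied untouched and the decay is stretched by overlap-adding 50%-overlapping Hanning-windowed grains. This is a simplified granular sketch under assumed parameter values (window size, toy signal); unlike the actual implementation, it does not align grains by waveform similarity, so pitched material may exhibit some roughness.

import numpy as np

def stretch_decay(segment, attack_len, stretch, win=512):
    # Time-domain time-scaling of a sound segment (in the spirit of Figure 6-1).
    # The attack (first attack_len samples) is left untouched; the decay is
    # stretched by overlap-adding 50%-overlapping Hanning-windowed grains.
    # stretch > 1 lengthens the segment, < 1 shortens it.
    attack, decay = segment[:attack_len], segment[attack_len:]
    hop_out = win // 2                          # 50% overlap at synthesis
    hop_in = max(1, int(round(hop_out / stretch)))
    window = np.hanning(win)
    n_grains = max(1, (len(decay) - win) // hop_in + 1)
    out = np.zeros(n_grains * hop_out + win)
    norm = np.zeros_like(out)
    for k in range(n_grains):
        grain = decay[k * hop_in:k * hop_in + win]
        if len(grain) < win:
            grain = np.pad(grain, (0, win - len(grain)))
        out[k * hop_out:k * hop_out + win] += window * grain
        norm[k * hop_out:k * hop_out + win] += window
    out /= np.maximum(norm, 1e-8)               # compensate the window overlap
    return np.concatenate([attack, out])

# toy example: a 250 ms decaying sine at 44.1 kHz, stretched by an extra 30%
sr = 44100
t = np.arange(int(0.25 * sr)) / sr
seg = np.exp(-8 * t) * np.sin(2 * np.pi * 220 * t)
longer = stretch_decay(seg, attack_len=int(0.02 * sr), stretch=1.3)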
Figure 6-1: Time-scaling example of a typical sound segment. Note that we only process the decay part of the sound. The energy is preserved by overlapping and adding Hanning windows by 50%. In this example we stretch the whole segment [top] by an additional 30% [bottom].
There are several options for the cross-fading shape, the most common being linear, exponential, S-type, and constant-power. We choose to implement a constant-power cross-fader in order to keep the perceptual energy constant through the transition. This is implemented with a logarithmic fade curve that starts quickly and then slowly tapers off towards the end (Figure 6-2). In order to avoid clipping when two songs overlap, we implement a simple compression algorithm. Finally, our system adapts continuously to tempo variations by gradually interpolating one tempo into another over the course of a transition, and by time-stretching every audio segment appropriately to preserve perfect beat alignment (Figure 6-3). Beat-matching and automatic transitions are successfully accomplished for arbitrary songs whose tempos differ by up to 40%. Another interesting variant of the application transitions a song with itself, allowing for extensions of that song to infinity (and beyond).
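For illustration, one common equal-power (constant-power) cross-fade uses complementary sine and cosine gain curves, sketched below on placeholder signals; the fade law actually used here (a logarithmic curve) differs slightly, and the beat alignment and time-scaling steps are omitted.

import numpy as np

def equal_power_crossfade(a, b, fade_len):
    # Cross-fade the end of a into the beginning of b with constant power.
    # The two gains satisfy g_out**2 + g_in**2 = 1, so the summed signal keeps
    # a roughly constant perceptual energy through the transition.
    theta = np.linspace(0, np.pi / 2, fade_len)
    g_out, g_in = np.cos(theta), np.sin(theta)
    overlap = a[-fade_len:] * g_out + b[:fade_len] * g_in
    return np.concatenate([a[:-fade_len], overlap, b[fade_len:]])

sr = 44100
t = np.arange(5 * sr) / sr
song1 = 0.5 * np.sin(2 * np.pi * 220 * t)      # placeholder for beat-aligned audio
song2 = 0.5 * np.sin(2 * np.pi * 330 * t)
mix = equal_power_crossfade(song1, song2, fade_len=2 * sr)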
Figure 6-2: Example of a beat-matched transition between a funk piece at about 83 BPM and a salsa piece at about 134 BPM. Note in the loudness curve [top] that the overall loudness is transferred smoothly from one song to another. The cross-fading zone and shape are shown in green. The beat markers (red) get closer to each other as the music speeds up gradually. As a result, the tempo curve increases [bottom].
Figure 6-3: Schematic of the beat-matching procedure. Each rectangle represents a musical pattern as found in section 4.6. Rather than adjusting the speed
(i.e., time-scaling) of the second song to match the first one, we may choose to
continuously adjust both songs during the transition period.
A sound segment represents the largest unit of continuous timbre, and the short-
est fragment of audio in our representation. We believe that every segment could
eventually be described and synthesized by analytic methods, such as additive
synthesis, but this thesis work is only concerned with the issue of synthesizing
music, i.e., the structured juxtaposition of sounds over time. Therefore, we use
the sound segments found in existing pieces as primitive blocks for creating new
pieces. Several experiments have been implemented to demonstrate the advantages of our segment-based synthesis approach over the more generic, but still
ill-defined, frame-based approach.
6.2 Early Synthesis Experiments
Our synthesis principle is extremely simple: we concatenate (or string together)
audio segments as a way to create new musical sequences that never ex-
isted before. The method is commonly used in speech synthesis where a large
database of phones is first tagged with appropriate descriptors (pitch, duration,
position in the syllable, etc.). At runtime, the desired target utterance is cre-
ated by determining the best chain of candidate units from the database. This
method, also known as unit selection synthesis, gives the best results in speech
synthesis without requiring much signal processing when the database is large
enough.
Our concatenation module does not process the audio: there is no segment
overlap, windowing, or cross-fading involved, whereas granular synthesis
typically relies on these to avoid discontinuities. Since segmentation is
performed psychoacoustically at the most strategic location (i.e., just before an
onset, at the locally quietest moment, and at a zero crossing), the transitions are
generally artifact-free and seamless.
6.2.1 Scrambled Music
Figure 6-4: A short 3.25-second excerpt of "Watermelon Man" by Herbie Hancock.
This first experiment of our series assumes no structure or constraint whatsoever. Our goal is to synthesize an audio stream by randomly juxtaposing
short sound segments previously extracted from a given piece of music (about
two to eight segments per second for the music tested). During segmentation,
a list of pointers to audio samples is created. Scrambling the music simply
consists of rearranging the sequence of pointers randomly and reconstructing
the corresponding waveform. While the new sequencing generates thoroughly
unstructured music, arguably the worst possible case of
music resynthesis, the event-synchronous synthesis remains free of audio
clicks and glitches (Figure 6-5).
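A sketch of the scrambling step, assuming a hypothetical list of (start, end) sample pointers produced by the segmentation and a mono sample array:

```python
import random
import numpy as np

def scramble(samples, segments):
    # 'segments' is a list of (start, end) sample pointers; scrambling only
    # reorders the pointers, then the waveform is rebuilt by straight
    # concatenation (no windowing or cross-fading).
    order = list(range(len(segments)))
    random.shuffle(order)
    return np.concatenate([samples[s:e] for s, e in (segments[i] for i in order)])

# Reversing the music (section 6.2.2) is the same idea with
# order = list(reversed(range(len(segments)))).
```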
Figure 6-5: Scrambled version of the musical excerpt of Figure 6-4.
The underlying beat of the music, if it exists, represents a perceptual metric
on which segments lie. While beat tracking was found independently of the
segment organization, the two representations are intricately interrelated.
The same scrambling procedure is applied to beat segments
(i.e., audio segments delimited by two beat markers). As expected, the generated music is metrically structured, i.e., the beat can be extracted again, but
the underlying harmonic and melodic structures are now scrambled.
6.2.2 Reversed Music
Figure 6-6: Reversed version of the musical excerpt of Figure 6-4.
The next experiment consists of adding a simple structure to the previous
method. This time, rather than scrambling the music, the order of segments
is reversed, i.e., the last segment comes first, and the first segment comes last.
This is much like what could be expected when playing a score backwards,
starting with the last note and ending with the first one. It is, however, very
different from reversing the audio signal, which distorts sounds so that
reversed decays come first and attacks come last (Figure 6-6). Testing
on many kinds of music showed that perceptual issues with unprocessed
concatenative synthesis may occur with overlapping sustained sounds and long
reverberation: certain musical discontinuities cannot be avoided without additional
processing, but this is left for future work.
6.3 Music Restoration
Figure 6-7: Example of fragment-based image completion as found in [41].
[A] original image; [B] region deleted manually; [C] image completed automatically;
[D] region that was made up using fragments of the rest of the image.
A well-defined synthesis application that derives from our segment concatenation consists of restoring corrupted audio files and music streamed over the
Internet or to cellular phones. Some audio frames may be corrupted due to a
defective hard drive, or missing due to lost packets. Our goal, inspired by [41]
in the graphical domain (Figure 6-7), is to replace the corrupt region with original new material taken from the rest of the file. The problem differs greatly
from the traditional restoration of degraded audio material such as old tapes or
vinyl recordings, where the objective is to remove clicks, pops, and background
noise [58]. These are typically fixed through autoregressive signal models and
interpolation techniques. Instead, we deal with localized digital corruption of
arbitrary length, where standard signal-filtering methods do not easily apply.
Error-concealment methods have addressed this problem for short durations
(i.e., around 20 ms, or several packets) [166][156][176]. Our technique can deal
with much larger corrupt fragments, e.g., of several seconds. We present multiple solutions, depending on the conditions: 1) a file with known metadata; 2)
streaming music with known metadata; 3) a file with unknown metadata; and 4)
streaming music with unknown metadata.
Figure 6-8: Schematic of a restoration application, in the case of a file or of
streaming music (causal). Blocks with the same colors indicate audio segments that
sound similar, although a perfect match and metrical organization are not required.
6.3.1 With previously known structure
The metadata describing the segments and their location (section 3.7) is extremely small compared to the audio itself (i.e., a fraction of a percent of the
original file). Even the self-similarity matrices are compact enough that they
can easily be embedded in the header of a digital file (e.g., MP3), or sent ahead
of time, securely, in the streaming case. Through similarity analysis, we find the
segments that are most similar to the missing ones (section 4.4), and we concatenate a new audio stream in place of the corruption (Figure 6-9). Knowing
the music structure in advance allows us to recover the corrupt region with decent quality, sometimes hardly distinguishable from the original (section 5.4.4).
Harder cases naturally include music with lyrics, where the new lyrics make
no sense. We also consider the real-time case: the application is causal, and can
synthesize the new music using past segments only. This applies to
streaming music, and the quality generally improves as a function of the number
of segments already available (Figure 6-8).
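The selection step can be sketched as follows, under the assumption that the per-segment metadata vectors survive the corruption (only the audio is lost) and that plain Euclidean distance stands in for the similarity measures of section 4.4; the names are illustrative.

```python
import numpy as np

def pick_replacements(features, missing, causal=False):
    # 'features' holds the per-segment metadata vectors (intact even where the
    # audio is corrupted); for each missing segment we pick the most similar
    # intact segment, restricted to past segments in the causal/streaming case.
    replacements = {}
    for m in sorted(missing):
        candidates = [i for i in range(len(features))
                      if i not in missing and (not causal or i < m)]
        if not candidates:
            continue                      # streaming start-up: nothing heard yet
        dists = np.linalg.norm(features[candidates] - features[m], axis=1)
        replacements[m] = candidates[int(np.argmin(dists))]
    return replacements
```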
6.3.2 With no prior knowledge
Two more solutions assume that nothing is known about the music beforehand;
in such cases, we cannot rely on the metadata file. In a non-real-time process,
we can run the full analysis on the corrupt file and try to infer the missing structure: the previous procedure then applies again. Since detecting regions
of corruption is a different problem in and of itself, we do not consider it
here, and we delete the noise by hand, replacing it with silence. We then run
the segmentation analysis, the beat tracker, and the downbeat detector. We
assume that the tempo remains mostly steady during the silent regions and let
the beat tracker run through them. The problem becomes a constraint-solving
problem that consists of finding the smoothest musical transition between the
two boundaries.
Figure 6-9: Example of segment-based music completion for a 42-second
excerpt of "Lebanese Blonde" by Thievery Corporation. [A] is the corrupted music,
including its timbral and harmonic metadata and its loudness function. We simulate the
corruption with very loud grey noise. [B] is the restored music.
This could be achieved efficiently through dynamic programming, by searching for the closest match between: 1) a sequence of segments
surrounding the region of interest (the reference pattern), and 2) another sequence
of the same duration to test against, taken from the rest of the song (the test pattern). However, this is not fully implemented. Such a procedure is proposed in
[106] at the frame level, for the synthesis of background sounds, textural sounds,
and simple music. Instead, we choose to fully unconstrain the procedure, which
leads to the next application.
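Before moving on, and as a rough illustration only (the thesis leaves this search unimplemented), the following sketch replaces the proposed dynamic-programming search with an exhaustive sliding match over illustrative per-segment feature vectors:

```python
import numpy as np

def closest_test_pattern(features, ref_start, ref_len):
    # Compare the reference pattern (segments surrounding the corrupt region)
    # against every other window of the same length and return the start index
    # of the closest one. An exhaustive sliding match rather than true dynamic
    # programming, just to make the proposed search concrete.
    ref = features[ref_start:ref_start + ref_len]
    best, best_cost = None, np.inf
    for start in range(len(features) - ref_len + 1):
        if abs(start - ref_start) < ref_len:
            continue                      # skip windows overlapping the reference
        cost = float(((features[start:start + ref_len] - ref) ** 2).sum())
        if cost < best_cost:
            best, best_cost = start, cost
    return best
```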
6.4 Music Textures
A true autonomous music synthesizer should not only restore old music but
should invent new music. This is a more complex problem that requires the
ability to learn from the time and hierarchical dependencies of dozens of pa-
rameters (section 5.2.5). The system that is probably the closest to achieving
this task is Francois Pachet's Continuator [122], which is based on a structural
organization of Markov chains of MIDI parameters: a kind of prefix tree, where
each node contains the result of a reduction function and a list of continuation
indexes.
Our problem differs greatly from the Continuator's in the nature of the material
that we compose from, and in its inherent high dimensionality (i.e., arbitrary polyphonic audio segments). It also differs in its underlying grouping mechanism.
Our approach is essentially based on a metrical representation (tatum, beat,
meter, etc.) and on grouping by similarity: a vertical description. Pachet's
grouping strategy is based on temporal proximity and continuity: a horizontal
description. As a result, the Continuator is good at creating robust, stylistically
coherent musical phrases, but lacks the notion of beat, which is essential in the
making of popular music.
Figure 6-10: Screen shots of three different video textures as in [143]: a woman, a
waterfall, and a candle. Each movie is made infinite by jumping seamlessly between
similar frames at playback (as shown, for instance, by the arcs in the candle image),
creating smooth transitions that are unnoticeable to the viewer.
Our method is inspired by the video textures of Schodl et al. [143], a new
type of visual medium that consists of extending short video clips into smoothly,
infinitely playing videos by changing the order in which the recorded frames are
played (Figure 6-10). Given a short musical excerpt, we generate an infinite
version of that music with identical tempo, which sounds similar but never
seems to repeat. We call this new medium music texture. A variant
called audio texture, also inspired by [143], is proposed at the frame level in
[107] for textural sound effects (e.g., rain, a water stream, a horse neighing), i.e.,
where no particular temporal structure is found.
Figure 6-11: Schematic of the music texture procedure. Colors indicate relative
metrical-location similarities rather than segment similarities.
Our implementation is very simple, computationally very light (assuming we
have already analyzed the music), and gives convincing results. The downbeat
analysis allowed us to label every segment with its relative location in the
measure (i.e., a float value t ∈ [0, L[, where L is the length of the pattern). We
create a music texture by relative metrical-location similarity. That is, given
a relative metrical location t[i] ∈ [0, L[, we select the segment whose relative
metrical location is the closest to t[i]. We paste that segment and add its
duration d_s, such that t[i+1] = (t[i] + d_s) mod L, where mod is the modulo.
We reiterate the procedure indefinitely (Figure 6-11). It was found that the
method may quickly fall into a repetitive loop. To cope with this limitation,
and allow for variety, we introduce a tiny bit of jitter, i.e., a few percent of
Gaussian noise ε[i], into the system, which is counteracted by an appropriate
time-stretching ratio c:

t[i+1] = (t[i] + c·d_s + ε[i]) mod L    (6.1)
       = (t[i] + d_s) mod L    (6.2)
While preserving the perceived rhythm and metrical structure of the original, the new music
never seems to repeat (Figure 6-12). The system is tempo independent: we
can synthesize the music at an arbitrary tempo using time-scaling on every
segment, as in section 6.1.2. If the source includes multiple harmonies, the
system creates patterns that combine them all. It would be useful to impose
additional constraints based on continuity, but this is not yet implemented.
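A sketch of one step of the procedure follows, under one plausible reading of equation 6.1 (the array names and the jitter magnitude are illustrative assumptions):

```python
import numpy as np

def next_texture_segment(t, locations, durations, L, jitter=0.02):
    # One plausible reading of equation 6.1: jitter the target location so the
    # selection varies, pick the segment whose relative metrical location is
    # closest (circular distance), and stretch it by c so that c*d_s + eps = d_s,
    # i.e., the metrical grid still advances exactly by the segment duration.
    eps = np.random.normal(0.0, jitter * L)
    target = (t + eps) % L
    diff = np.abs(locations - target)
    i = int(np.argmin(np.minimum(diff, L - diff)))
    d_s = durations[i]
    c = max(1e-3, (d_s - eps) / d_s)      # time-stretching ratio counteracting eps
    return i, c, (t + d_s) % L            # index, stretch ratio, next location t[i+1]
```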
Figure 6-12: [A] First 11 seconds of Norah Jones' song "Don't Know Why." [B]
Music texture extending the length of excerpt [A] by 1600%. [C] An 11-second zoom
into the music texture of [B]. Note the overall structural similarity of [C] and [A]
(beat and pattern length), although there are no similar patterns.
A variant application, called the "intelligent hold button," requires only one
pattern-location parameter p_hold from the entire song. The system first pre-selects
(by clustering, as in section 5.4.3) a number of patterns harmonically similar
to the one representing p_hold, and then applies the described method to these
patterns (equation 6.1). The result is an infinite loop with constant harmony,
which sounds similar to pattern p_hold but does not repeat, as if the music
were on hold.
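A minimal sketch of the pre-selection step, using a nearest-neighbor shortlist over hypothetical per-pattern chroma vectors in place of the clustering of section 5.4.3:

```python
import numpy as np

def hold_patterns(pattern_chroma, p_hold, n_similar=8):
    # Shortlist the patterns whose chroma vectors are closest to the held one;
    # the music-texture procedure above is then run over this subset only.
    d = np.linalg.norm(pattern_chroma - pattern_chroma[p_hold], axis=1)
    return list(np.argsort(d)[:n_similar])
```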
6.5 Music Cross-Synthesis
Cross-synthesis is a technique used for sound production, whereby one parameter of a synthesis model is applied in conjunction with a different parameter
of another synthesis model. Physical modeling [152], linear predictive coding
(LPC), and the vocoder, for instance, enable sound cross-synthesis. We extend
that principle to music by synthesizing a new piece out of parameters taken
from other pieces. An example application takes the music-structure description of a target piece (i.e., the metadata sequence, or "musical DNA"), and the
actual sound content from a source piece (i.e., a database of unstructured labeled audio segments), and creates a completely new cross-synthesized piece
that accommodates both characteristics (Figure 6-13). This idea was first proposed by Zils and Pachet in [181] under the name "musaicing," in reference
to the corresponding photomosaicing process in the visual domain (Figure
6-14).
Our implementation, however, differs from this one in the type of metadata
considered, and, more importantly, in the event-alignment synthesis method introduced in section 6.2. Indeed, our implementation strictly preserves musical edges,
and thus the rhythmic components of the target piece. The search is based
on segment similarities; the most convincing results were found using timbral and
dynamic similarities. Given the inconsistent variability of pitches between two
distinct pieces (often not in the same key), it was found that it is usually more
meaningful to leave that space of parameters constraint-free.
Obviously, we can extend this method to larger collections of songs, increasing the chances of finding more similar segments, and therefore improving the
closeness between the synthesized piece and the target piece. When the source
database is small, it is usually useful to first align the source and
target spaces in order to maximize the variety of segments used in the synthesized piece. This is done by normalizing both the means and variances of the MDS
spaces before searching for the closest segments. The search procedure can be
greatly accelerated after a clustering step (section 5.4.3), which dichotomizes
the space into regions of interest. The hierarchical tree organization of a dendrogram is an efficient way of quickly accessing the most similar segments without
searching through the whole collection. Improvements to the synthesis might
include processing the selected segments through pitch-shifting, time-scaling,
amplitude-scaling, etc., but none of these are implemented: we are more interested in the novelty of the musical artifacts generated through this process
than in the closeness of the resynthesis.
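The matching step can be sketched as follows, with mean/variance normalization standing in for the MDS-space alignment and a brute-force nearest-neighbor search in place of the dendrogram lookup (all names are illustrative):

```python
import numpy as np

def match_segments(target_feats, source_feats):
    # Normalize both feature spaces (zero mean, unit variance), then pick, for
    # every target segment, the closest source segment by brute force; the
    # dendrogram clustering of section 5.4.3 would speed this search up.
    def norm(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)
    t, s = norm(target_feats), norm(source_feats)
    d = ((t[:, None, :] - s[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)               # one source-segment index per target segment
```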
Figure 6-15 shows an example of cross-synthesizing "Kickin' Back" by Patrice
Rushen with "Watermelon Man" by Herbie Hancock. The sound segments
of the former are rearranged using the musical structure of the latter. The
resulting new piece is musically meaningful in the sense that its rhythmic
structure is preserved, and its timbral structure is made as close as possible to
that of the target piece, given the inherent constraints of the problem.
[Figure 6-13 consists of four three-dimensional MDS plots (dim. 1, dim. 2, dim. 3), panels [A]-[D], whose legends mark sound segments, the perceptual threshold, and the musical path.]
Figure 6-13: Our cross-synthesis application takes two independent songs, as
shown in MDS spaces [A] and [B] (section 2.5.4), and as represented together in a
common space [C]. A third song is created by merging the musical path of target
[A] with the sound space of source [B], using segment similarity, and concatenative
synthesis, as shown in [D].
Figure 6-14: Example of photomosaic by [86] made out of hundreds of portraits
of Americans who have died at war in Iraq during the last few years.
Figure 6-15: Cross-synthesis between an excerpt of "Kickin' Back" by Patrice
Rushen (source) and an excerpt of "Watermelon Man" by Herbie Hancock (target).
[top] Target waveform, its auditory spectrogram, and its loudness function. [bottom]
Cross-synthesized waveform, its auditory spectrogram, and its loudness function.
Note the close timing and spectral relationship between the two pieces, although they
are composed of different sounds.
6.6 Putting it all together
The final goal is to automate the creation of entirely new compositions. Our
current solution is to arbitrarily combine the previous applications. Evolutionary methods could help to orient the system; to select a group of sound
sources, rhythmic patterns, and song structures; and to transition or interpolate between the parameters in such a way that the musical output reflects
some of the various elements that constitute it. This part remains to be implemented. However, the system already allows us to build entire pieces with
only a little guidance from the user. For instance, he or she may select a set of
songs that represents the source database, and a set of songs that represents the
target musical structures. The system can build arbitrarily long structures and
seamlessly transition between them. It can map groups of sounds to the structures, and merge several creations via alignment and tempo-curve matching.
The final outcome is a new-sounding piece with apparent coherence in both
the use of the sound palette and the underlying musical structure. Yet, due to
the inherent constraints of our analysis-resynthesis procedure, the created music is
new and different from other existing music.
CHAPTER SEVEN
Conclusion
I'll play it first and tell you what it is later.
Miles Davis
This chapter draws conclusions, presents our contributions to the fields of music
cognition and automatic music analysis-synthesis, and finally discusses future
directions.
7.1 Summary
The goal of this thesis was to computationally model the process of creating
music by listening to audio examples. We aimed to close and automate the life
cycle of listening, composing, and performing music, feeding the system only
with a song database. Our bias-free system was given generic listening and
learning primitives, and was programmed to analyze the sounds and structures
of music uniformly from the ground up. It was designed to arbitrarily combine
the extracted musical parameters as a way to synthesize new and musically
meaningful structures. These structures could be used to drive a concatenative
synthesis module that recycles audio material from the sound database
itself, finally creating an original song of convincing quality in both sound and
form.
7.2 Discussion
Because of its highly subjective nature, it is difficult, if not impossible, to
evaluate the quality of a musical production. Quality is one of those things
that cannot be quantified, and it is extremely contextual [134]. Interested
readers may judge for themselves by listening to some of the preliminary and intermediary examples produced in the context of this work at:
http://www.media.mit.edu/tristan/phd/. However, the work can be considered successful with regard to synthesizing music that expands the database:
the new music can in turn be analyzed, combined with more songs, and recycled again.
Although the goal of the thesis was primarily to create music, most of the
work's emphasis has been on analyzing and representing music through our
music cognition framework (chapters 3, 4, and 5). It was indeed hypothesized
that synthesis could share the same knowledge acquired from a uniform
analysis procedure based on perceptual listening and learning. This has been
demonstrated in section 5.4.4 with a novel musically intelligent compression algorithm, and in chapter 6 through a collection of synthesis applications, all of which
rely exclusively on this common metadata representation.
7.3 Contributions
This thesis is a practical implementation of machine intelligence for music analysis and synthesis. However, our aim was to build a generic perceptual
model of music cognition, rather than a collection of scattered, self-contained algorithms
and techniques. The system was carefully designed to meet theoretical requirements (e.g., causality, psychoacoustics), and we deliberately avoided using
possibly more reliable, though less justifiable, signal-processing methods.
7.3.1 Scientic contributions
The thesis is a contribution to machine intelligence, and more precisely
music intelligence. It is shown that a computer program can close the cycle
of listening, composing, and performing music through audio signals.
The data representation resulting from the analysis can be minimal and
common to the synthesis data.
Our analysis is based on a generic four-stage music cognition framework
that combines both machine listening and machine learning, augmenting
the standard pure signal processing approach of music analysis.
We present the first implementation of a general and unbiased downbeat
predictor. We show, for instance, that the downbeat cannot be computed
from signal processing alone, and requires training.
Superior compression can be achieved through the analysis and clustering
of redundant audio segments in time.
Musical signals such as rhythm can be forecasted by learning a time-lag
embedded feature space.
This work demonstrates the advantage of a segment-based approach over
a still ill-defined frame-based approach in a music analysis task. We highlight the importance of making a distinction between sound and music.
It is demonstrated that the chroma content of a piece of music can be
estimated more accurately, while preserving temporal resolution, if the onset
and length of the Fourier analysis are synchronized to sound events. The
computation is also faster.
Our synthesis is based on a concatenative module that arranges arbitrar-
ily complex sounds in time without overlapping or processing the audio.
Its simplicity reveals the benefits of our psychoacoustically grounded segmentation over previous methods.
We show that music similarities are more adequately represented as a
combination of three classes: pitch, rhythm, and timbre, rather than
through the generic timbral class.
The music similarities are computed recursively after a perceptual seg-
mentation stage, and are represented hierarchically for each class. As a
result, our self-similarity matrices are orders of magnitude smaller, yet
musically meaningful.
7.3.2 Engineering contributions
The work and examples presented in this thesis are the result of a stand-alone
application named Skeleton, as described in appendix A. The environment
combines a collection of algorithms, a player, a GUI, database management, and
visualizations, which together allow testing of the applications presented in
chapter 6. In itself, Skeleton is an engineering contribution that greatly facilitates the development of audio applications dealing with machine listening and
machine learning technologies.
Our DJ application removes much of the difficulty of mixing music through autonomous
tempo and beat alignment, surpassing current semi-automatic commercial DJ programs.
The restoration application goes beyond traditional approaches by fixing
corrupted sections of up to several seconds. This is made possible by
working from our musical description rather than from the audio signal
directly.
We introduce a new medium called music texture that stretches a piece
of music to an arbitrary length without affecting its tempo. The system
must invent new, similar music.
We improve the rendering of a musical mosaicing application by se-
lecting segments with perceptual criteria, and by synchronizing these to
onsets of the target piece.
7.3.3 Artistic contributions
Throughout the development of this thesis work, we have generated a few hundred more or less elaborate musical examples that testify to the artistic
potential of our system. Several acclaimed musicians have shown interest in using Skeleton as part of their creative process: the musical artifact can become
the source material for a larger production. Others have encouraged its immediate
application as an interactive art piece, where the outcome is the product of an artistic
act that must be attributed either to the user who biases the machine through
music examples, the programmer who built the software, the machine itself that
synthesizes unpredicted new music, or perhaps a collaboration between these.
Such appreciation leads us to believe that this thesis contributes on the aesthetic
and artistic fronts as well.
7.4 Future directions
It goes without saying that this work represents only a few steps in the vast
field of machine intelligence for music analysis and synthesis. There is much to
be done, if only by improving the accuracy and robustness of our current
algorithms. Here is a short list of the most immediate work that could be built
upon our current architecture.
Although we have experimented with a variety of applications using a
few hundred musical sources from many genres, several issues of reliability
and computation will typically arise when scaling up such a system.
This thesis only deals with acoustic metadata. Many more valuable con-
textual results could be achieved by considering other types of metadata.
We could explore ways of interacting with the system for freely navigating
the musical space currently delimited by a database.
We have not considered sound source separation, transcription, or multi-
channel music. There are many opportunities for other research ventures
based on our representations.
Although our listening and synthesis methods are fully causal, we have
not explored any real-time options on the synthesis front.
Some of this work has potential avenues in more specific areas of the field,
including information retrieval, where structure understanding is sometimes key.
The ultimate synthesizer will not only synthesize music, but will also
synthesize the source material. This would lead to much more compact
ways of storing and playing music.
7.5 Final Remarks
Whether machines can be creative or not is certainly not answered in this work.
The machine here creates with a small "c", but has no intention to do so. It
is not able to evaluate the quality of its own work, although it can analytically
compare it with that of others. It is unaware of any social or cultural context, which
makes the music somewhat meaningless. However, the machine is faster than
humans at listening to a song database and at generating original music. It
can produce more with less, and is not biased as humans are. Those differences account for the usefulness and potential of computers in creating new
music.
APPENDIX A
Skeleton
A score is like a skeleton.
John Zorn
Figure A-1: Skeleton software screenshot.
The author has developed this work within his own environment named Skeleton. Skeleton is both a stand-alone Mac OS X application with a simple GUI
(Figure A-1) and an API primarily designed to speed up, standardize, and simplify the development of new applications dealing with the analysis of musical
signals. Grounded upon fundamentals of perception and learning, the environment consists of machine listening and machine learning tools, supported by
flexible data structures and fast visualizations. It is being developed as an alternative to more generic and slower tools such as Matlab. It is composed of
a set of original Objective-C frameworks, and of open-source C libraries encapsulated by Objective-C frameworks. The software architecture of Skeleton is
depicted in Figure A-2 and is described below.
A.1 Machine Listening
The machine listening software includes: pitch, loudness, brightness, noisiness,
Bark, frequency masking, time-domain masking, outer ear, auditory spectro-
gram, segmentation, tatum, beat, pattern analysis, chromagram. Most of the
listening software is also implemented for real-time use in the Max/MSP envi-
ronment, and is available at: http://www.media.mit.edu/tristan/.
A.2 Machine Learning
The machine learning software includes: dynamic programming, matrix manip-
ulations, distance measures, support vector machines, artificial neural networks,
cluster-weighted modeling (mixture of Gaussians), k-means, downbeat predic-
tion, segment, beat, and pattern similarities.
A.3 Music Synthesis
The applications running in the GUI include: scrambled music, reversed music,
compression, cross-synthesis, music texture, music restoration, beat-matching
and cross-fading.
A.4 Software
The software is written for Macintosh in the native Objective-C language. It
includes a GUI built with Interface Builder, an audio player using Core Audio,
fast and flexible displays using OpenGL, fast linear algebra with BLAST, FFT
and convolutions with Altivec, reading and writing of audio files using sndLib, MP3
decoding with LibMAD, database management with mySQL, and machine
learning with the packages SVMlight, CWM, and nodeLib. The hierarchical clustering, as
well as certain graphical representations (dendrogram, downbeat, state-space
reconstruction), are currently implemented in MATLAB.
A.5 Database
The software automatically creates and connects to a mySQL server, which
can store and efficiently retrieve the analyzed data. Songs can be pre-analyzed,
and the result of their analysis is stored in the database. The various applications typically retrieve the metadata directly from the database. The database
initially contains four tables, namely SONGS, SEGMENTS, BEATS, and PATTERNS,
with the following fields:
SONGS: fileName, ID, path, sampleRate, numSamples, numFrames, hopSize,
numSegments, numBeats, meanTempo, signature, pitchSignal, loudnessSignal,
brightnessSignal, noisinessSignal, segmentSignal, barkSpectrogram
SEGMENTS: songID, ID, startInSec, lengthInMs, sampleIndex, numSamples, loudness, phase, c0 ... c11
BEATS: songID, ID, startInSec, tempo, sampleIndex, numSamples, segmentIndex, numSegments, phase, c0 ... c11
PATTERNS: songID, ID, sampleIndex, numSamples, frameIndex, numFrames,
segmentIndex, numSegments, beatIndex, numBeats, c0 ... c11
Each time a new song is analyzed, one new row is added to the SONGS
table and multiple rows are appended to the other tables. The analysis can also create a series
of self-similarity matrix tables; each of these contains as many columns as there
are segments, beats, or patterns in the song, and as many rows.
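For illustration, here is a minimal sketch of the SEGMENTS table and a typical metadata query, using Python's built-in sqlite3 purely as a stand-in for the mySQL server the actual software connects to; the schema follows the field list above.

```python
import sqlite3

# Create an in-memory table mirroring the SEGMENTS fields listed above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE SEGMENTS (
    songID INTEGER, ID INTEGER, startInSec REAL, lengthInMs REAL,
    sampleIndex INTEGER, numSamples INTEGER, loudness REAL, phase REAL,
    c0 REAL, c1 REAL, c2 REAL, c3 REAL, c4 REAL, c5 REAL,
    c6 REAL, c7 REAL, c8 REAL, c9 REAL, c10 REAL, c11 REAL)""")

# An application would typically retrieve all segments of a pre-analyzed song:
rows = conn.execute(
    "SELECT startInSec, lengthInMs, loudness FROM SEGMENTS "
    "WHERE songID = ? ORDER BY startInSec", (1,)).fetchall()
```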
Figure A-2: Skeleton software architecture.
I'm saying: to be continued, until we meet again. Meanwhile, keep
on listening and tapping your feet.
Count Basie
Bibliography
[1] S. Abdallah and M. Plumbley. Unsupervised onset detection: a probabilistic approach using ICA and a hidden Markov classifier. In Proceedings of Cambridge Music Processing Colloquium, Cambridge, UK, 2003.
[2] D. Adams. The Hitchhiker's Guide to the Galaxy. Pocket Books, New York, NY, 1979.
[3] V. Adan. Hierarchical music structure analysis, modeling and resynthesis: A dynamical systems and signal processing approach. Master's thesis, MIT Media Laboratory, 2005.
[4] M. Alonso, B. David, and G. Richard. Tempo and beat estimation of
musical signals. In Proceedings of the International Symposium on Music
Information Retrieval (ISMIR), Barcelona, Spain, October 2004. Univer-
sitat Pompeu Fabra.
[5] A. S. Association. American standard acoustical terminology. Definition 12.9, timbre, 1960.
[6] J.-J. Aucouturier and F. Pachet. Finding songs that sound the same. In Proceedings of IEEE Workshop on Model based Processing and Coding of Audio. University of Leuven, Belgium, November 2002. Invited Talk.
[7] J.-J. Aucouturier and F. Pachet. Improving timbre similarity: How high is the sky? Journal of Negative Results in Speech and Audio Sciences, 1(1), 2004.
[8] M. A. Bartsch and G. H. Wakefield. To catch a chorus: Using chroma-based representations for audio thumbnailing. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 15–18, Mohonk, NY, October 2001.
[9] S. Baumann and T. Pohle. A comparison of music similarity measures for a p2p application. Proceedings of the 6th International Conference on Digital Audio Effects (DAFx-03), September 2003.
[10] J. P. Bello and M. Sandler. Phase-based note onset detection for music signals. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003.
[11] A. Berenzweig, D. Ellis, B. Logan, and B. Whitman. A large scale evalua-
tion of acoustic and subjective music similarity measures. In Proceedings
of the 2003 International Symposium on Music Information Retrieval,
Baltimore, MD, October 2003.
[12] J. A. Bilmes. Timing is of essence. Master's thesis, Massachusetts Institute of Technology, 1993.
[13] C. M. Bishop, editor. Neural Networks for Pattern Recognition. Clarendon
Press, Oxford, 1995.
[14] M. Boden. The creative mind: Myths and mechanisms. Behavioural and
Brain Sciences, 17(3), 1994.
[15] J. Bonada. Automatic technique in frequency domain for near-lossless time-scale modification of audio. In Proceedings of International Computer Music Conference 2000, Berlin, Germany, 2000.
[16] J. Bonada. Audio Time-Scale Modification in the Context of Professional Post-Production. PhD thesis, Universitat Pompeu-Fabra, Barcelona, 2002.
[17] M. Bosi and R. E. Goldberg. Introduction to Digital Audio Coding and
Standards. Kluwer Academic Publishers, Boston, December 2002.
[18] K. Brandenburg and G. Stoll. ISO MPEG-1 audio: A generic standard for coding of high quality digital audio. Journal of the Audio Engineering Society, 10:780–792, October 1994.
[19] A. Bregman. Auditory Scene Analysis: The Perceptual Organization of
Sound. MIT Press, 1990.
[20] J. Brown and M. Cooke. Computational auditory scene analysis. Computer Speech and Language, 8(2):297–336, 1994.
[21] R. G. Brown and P. Y. Hwang. Introduction to Random Signals and
Applied Kalman Filtering. John Wiley & Sons, New York, 2nd edition,
1991.
[22] C. J. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, June 1998.
[23] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of algorithms for audio fingerprinting. In International Workshop on Multimedia Signal Processing, US Virgin Islands, December 2002.
[24] P. Cariani. Neural timing nets for auditory computation. In S. Greenberg and M. S. (Eds.), editors, Computational Models of Auditory Function, pages 235–249, Amsterdam, 1999. IOS Press.
[25] P. Cariani. Temporal codes, timing nets, and music perception. Journal
of New Music Perception, 30(2), 2001.
[26] W. Chai. Melody retrieval on the web. Master's thesis, MIT Media Laboratory, 2001.
[27] W. Chai. Automated Analysis of Musical Structure. PhD thesis, MIT
Media Laboratory, 2005.
[28] W. Chai and B. Vercoe. Music thumbnailing via structural analysis. In
Proceedings of ACM Multimedia Conference, November 2003.
[29] H. Cohen. The further exploits of AARON, painter. Stanford Humanities Review, Constructions of the Mind: Artificial Intelligence and the Humanities, 4(2), 1997.
[30] M. Cooper and J. Foote. Summarizing popular music via structural sim-
ilarity analysis. In IEEE Workshop on Applications of Signal Processing
to Audio and Acoustics (WASPAA), Mohonk, NY, October 2003.
[31] D. Cope. New Directions in Music. William C. Brown, Dubuque, Iowa,
1984. 4th edition.
[32] D. Cope. Experiments in Music Intelligence. A-R Editions, Madison, WI,
1996.
[33] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to
Algorithms (Second Edition). MIT Press and McGraw-Hill, Cambridge,
MA, 2001.
[34] T. Cover and J. Thomas. Elements of Information Theory. John Wiley
& Sons, Inc., New York, 1991.
[35] T. Dartnall. Artificial Intelligence and Creativity: an Interdisciplinary Approach. Kluwer, Dordrecht, 1994.
[36] I. Deliège. Grouping Conditions in Listening to Music: An Approach to Lerdahl and Jackendoff's grouping preference rules. Music Perception, 4:325–360, 1987.
[37] P. Desain and H. Honing. Computational models of beat induction: the rule-based approach. Journal of New Music Research, 28(1):29–42, 1999.
[38] S. Dixon. Automatic extraction of tempo and beat from expressive performances. Journal of New Music Research, 30(1), March 2001.
[39] S. E. Dixon. An empirical comparison of tempo trackers. In Proceedings
of 8th Brazilian Symposium on Computer Music, Fortaleza, Brazil, July
2001.
[40] M. Dolson. The phase vocoder: a tutorial. Computer Music Journal, 10(4):14–27, 1986.
[41] I. Drori, D. Cohen-Or, and H. Yeshurun. Fragment-based image comple-
tion. In Siggraph 2003, Computer Graphics Proceedings, New York, NY,
USA, 2003. ACM Press / ACM SIGGRAPH / Addison Wesley Longman.
[42] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley Interscience, New York, 2nd edition, 2000.
[43] C. Duxbury, M. Davies, and M. Sandler. Improved time-scaling of musical
audio using phase locking at transients. In Proceedings of the 112th AES
Convention, Munich, Germany, May 2002.
[44] Ear anatomy, 2004. Medical Encyclopedia website ADAM, ac-
credited by the American Accreditation HealthCare Commission.
http://www.nlm.nih.gov/medlineplus/ency/imagepages/1092.htm.
[45] D. Eck. A positive-evidence model for rhythmical beat induction. Journal of New Music Research, 30(2):187–200, 2001.
[46] D. P. Ellis and J. Arroyo. Eigenrhythms: Drum pattern basis sets for classification and generation. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR), Barcelona, Spain, October 2004.
[47] D. P. Ellis, B. Whitman, A. Berenzweig, and S. Lawrence. The quest
for ground truth in musical artist similarity. In Proceedings of the In-
ternational Symposium on Music Information Retrieval (ISMIR), Paris,
France, October 2002.
[48] J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[49] H. Fletcher. Auditory patterns. Rev. Mod. Phys., 12:47–55, January 1940.
[50] H. Fletcher and W. Munson. Relation between loudness and masking. Journal of the Acoustic Society of America, 9(1):1–10, July 1937.
[51] J. Foote and M. Cooper. Automatic audio segmentation using a measure of audio novelty. In Proceedings of IEEE International Conference on Multimedia and Expo, pages 452–455, New York, USA, 2000.
[52] J. Foote and M. Cooper. Visualizing musical structure and rhythm via
self-similarity. In Proceedings International Computer Music Conference,
La Habana, Cuba, 2001.
[53] N. A. Gershenfeld and A. S. Weigend. The future of time series: Learning and understanding. In A. S. Weigend and N. A. Gershenfeld, editors, Time Series Prediction: Forecasting the Future and Understanding the Past, pages 1–63, Reading, MA, 1993. Addison-Wesley.
[54] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith. Query by humming: Musical information retrieval in an audio database. In ACM Multimedia, pages 231–236, 1995.
[55] M. Gibbs. Bayesian Gaussian Processes for Regression and Classication.
PhD thesis, University of Cambridge, 1997.
[56] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219–269, 1995.
[57] B. R. Glasberg and B. C. J. Moore. A model of loudness applicable to time-varying sounds. J. Audio Eng. Soc., 50:331–342, 2002.
[58] S. Godsill, P. Rayner, and O. Cappe. Digital audio restoration. In Ap-
plications of Digital Signal Processing to Audio and Acoustics, Norwell,
MA, 1996.
[59] M. Goto. An audio-based real-time beat tracking system for music with or without drum sounds. Journal of New Music Research, 30:159–171, 2001.
[60] M. Goto. SmartMusicKIOSK: music listening station with chorus-search function. In Proceedings of the 16th annual ACM symposium on User interface software and technology, pages 31–40, November 2003.
[61] M. Goto and Y. Muraoka. Real-time beat tracking for drumless audio signals: Chord change detection for musical decisions. Journal of Speech Communication, 27:311–335, 1999.
[62] F. Gouyon, L. Fabig, and J. Bonada. Rhythmic expressiveness transformations of audio recordings: Swing modifications. In Proceedings of the 6th International Conference on Digital Audio Effects (DAFx-03), London, UK, September 2003.
[63] F. Gouyon and P. Herrera. A beat induction method for musical audio signals. In Proceedings of the 4th WIAMIS Special Session on Audio Segmentation and Digital Music, Queen Mary University, London, 2003.
[64] F. Gouyon, P. Herrera, and P. Cano. Pulse-dependent analysis of percussive music. In Workshop on Digital Audio Effects, DAFx-98, pages 188–191, Barcelona, Spain, 1998.
[65] F. Gouyon, P. Herrera, and P. Cano. Pulse-dependent analysis of percus-
sive music. In Proceedings of the AES 22nd International Conference on
Virtual, Synthetic and Entertainment Audio, 2002.
[66] J. Grey. Timbre discrimination in musical patterns. Journal of the Acoustical Society of America, 64:467–472, 1978.
[67] M. Gruhne, C. Uhle, C. Dittmar, and M. Cremer. Extraction of drum pat-
terns and their description within the MPEG-7 high-level-framework. In
Proceedings of the International Symposium on Music Information Re-
trieval (ISMIR), Barcelona, Spain, October 2004. Universitat Pompeu
Fabra.
[68] S. Handel. LISTENING: An Introduction to the Perception of Auditory
Events. MIT Press, Cambridge, Massachusetts, 1989.
[69] J. Herre, E. Allamanche, and C. Ertel. How similar do songs sound? Towards modeling human perception of musical similarities. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Mohonk, NY, October 2003.
[70] P. Herrera, X. Serra, and G. Peeters. Audio descriptors and descrip-
tor schemes in the context of MPEG-7. In Proceedings of International
Computer Music Conference, Beijing, China, 1999.
[71] D. Hosmer and S. Lemeshow. Applied Logistic Regression. Wiley and
Sons, New York, 1989.
[72] X. Huang, A. Acero, and H.-W. Hon. Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice-Hall, Englewood Cliffs, N.J., 2001.
[73] A. Hunt and A. Black. Unit selection in a concatenative sound synthesis.
In Proceedings ICASSP, Atlanta, GA, 1996.
[74] V. S. Iyer. Microstructures of Feel, Macrostructures of Sound: Embodied
Cognition in West African and African-American Musics. PhD thesis,
University of California, Berkeley, 1998.
[75] T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination.
Advances in Neural Information Processing Systems 12, 1999.
[76] A. K. Jain, J. C. Mao, and K. M. Moniuddin. Artificial neural networks: A review. IEEE Computer Special Issue on Neural Computing, March 1996.
[77] T. Jebara. Discriminative, Generative and Imitative Learning. PhD the-
sis, Massachusetts Institute of Technology, February 2002.
[78] T. Jehan. Music signal parameter estimation. Master's thesis, IFSIC, Rennes, and CNMAT, Berkeley, September 1997.
[79] T. Jehan. Perceptual synthesis engine: an audio-driven timbre generator. Master's thesis, MIT Media Laboratory, September 2001.
[80] T. Jehan. Perceptual segment clustering for music description and time-
axis redundancy cancellation. In Proceedings of the 5th International
Conference on Music Information Retrieval, Barcelona, Spain, October
2004.
[81] T. Jehan. Hierarchical multi-class self similarities. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Mohonk, NY, October 2005. (submitted).
[82] T. Jehan, T. Machover, and M. Fabio. Sparkler: An audio-driven inter-
active live computer performance for symphony orchestra. In Proceedings
International Computer Music Conference, Goteborg, Sweden, 2002.
[83] T. Jehan and B. Schoner. An audio-driven, spectral analysis-based, per-
ceptual synthesis engine. Journal of the Audio Engineering Society, 2001.
[84] F. V. Jensen. An Introduction to Bayesian Networks. Springer-Verlag,
New York, 1996.
[85] T. Joachims. Learning to Classify Text Using Support Vector Machines.
Kluwer, Boston, MA, 2002.
[86] Joe. War president, 2004. Posted at http://amleft.blogspot.com on
April 1st.
[87] M. Jordan, editor. Learning in Graphical Models. MIT Press, Cambridge,
Massachusetts, 1998.
[88] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181–214, 1994.
[89] H. Kantz. Noise reduction by local reconstruction of the dynamics. In A. Weigend and N. Gershenfeld, editors, Time Series Prediction: Forecasting the Future and Understanding the Past, pages 475–490, Reading, MA, 1993. Addison-Wesley.
[90] A. Klapuri. Sound onset detection by applying psychoacoustic knowledge. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 3089–3092, 1999.
[91] A. P. Klapuri. Musical meter estimation and music transcription. In Proceedings of the Cambridge Music Processing Colloquium, pages 40–45, Cambridge University, UK, 2003.
[92] A. P. Klapuri, A. J. Eronen, and J. T. Astola. Analysis of the meter of
acoustic musical signals. IEEE Transaction on Speech and Audio Pro-
cessing (in Press), 2005.
[93] T. Kuusi. Set-Class and Chord: Examining Connection between Theoret-
ical Resemblance and Perceived Closeness. PhD thesis, Sibelius Academy,
2001.
[94] J. Laroche. Time and pitch scale modification of audio signals. In M. Kahrs and K. Brandenburg, editors, Applications of Digital Signal Processing to Audio and Acoustics, pages 279–310. Kluwer Academic Publishers, Boston, 1998.
[95] J. Laroche. Estimating tempo, swing, and beat locations in audio recordings. In Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 135–138, Mohonk, NY, October 2001.
[96] J. Laroche. Efficient tempo and beat tracking in audio recordings. Journal of the Audio Engineering Society, 51(4):226–233, April 2003.
[97] J. Laroche and M. Dolson. Improved phase vocoder time-scale modification of audio. IEEE Transactions on Speech and Audio Processing, 7(3):323–332, May 1999.
[98] A. Lazier and P. Cook. Mosievius: Feature driven interactive audio mosaicing. Proceedings of the 6th International Conference on Digital Audio Effects (DAFx-03), September 2003.
[99] C. S. Lee. The Perception of Metrical Structure: Experimental Evidence and a Model, pages 59–127. Academic Press, London, 1991.
[100] F. Lerdahl and R. Jackendoff. A Generative Theory of Tonal Music. MIT Press, Cambridge, Mass., 1983.
[101] F. Lerdahl. Timbral hierarchies. Contemporary Music Review, 2:135–160, 1987.
[102] S. Levine. Audio Representations for Data Compression and Compressed
Domain Processing. PhD thesis, CCRMA, Stanford University, 1998.
[103] G. E. Lewis. Too many notes: Computers, complexity and culture in Voyager. Leonardo Music Journal, 10:33–39, 2000.
[104] B. Lincoln. An experimental high-fidelity perceptual audio coder. Technical report, CCRMA, Stanford University, 1998.
[105] L. Lu, H. Jiang, and H.-J. Zhang. Robust audio classification and segmentation method. In Proceedings of the 9th ACM International Multimedia Conference and Exhibition, pages 103–211, 2001.
[106] L. Lu, L. Wenyin, and H.-J. Zhang. Audio textures: Theory and applications. IEEE Transactions on Speech and Audio Processing, 12(2), March 2004.
[107] L. Lu, L. Wenyin, H.-J. Zhang, and Y. Mao. Audio textures. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 1761–1764, 2002.
[108] T. Machover. Hyperinstruments: A progress report 1987-1991. Technical
report, MIT Media Laboratory, 1992.
[109] M. Maher and J. Beauchamp. Fundamental frequency estimation of musical signals using a two-way mismatch procedure. Journal of the Acoustical Society of America, 95(4):2254–2263, 1994.
[110] M. Marolt, A. Kavcic, and M. Privosnik. Neural networks for note onset detection in piano music. In Proceedings International Computer Music Conference, Goteborg, Sweden, September 2002.
[111] K. D. Martin. Sound-Source Recognition: A Theory and Computational
Model. PhD thesis, MIT Media Lab, 1999.
[112] S. McAdams. Contributions of music to research on human auditory cognition. In Thinking in Sound: the Cognitive Psychology of Human Audition, pages 146–198. Oxford University Press, 1993.
[113] M. McKinney and J. Breebaart. Features for audio and music classification. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR), Baltimore, MD, October 2003.
[114] E. Metois. Musical Sound Information: Musical Gestures and Embedding
Synthesis. PhD thesis, MIT Media Lab, 1996.
[115] T. M. Mitchell. Machine Learning. The McGraw-Hill Companies, Inc.,
Singapore, 1997.
[116] B. C. J. Moore. An Introduction to the Psychology of Hearing. Academic
Press, New York, 1997.
[117] B. C. J. Moore and B. R. Glasberg. A revision of Zwicker's loudness model. Acta Acustica, 82:335–345, 1995.
[118] Moving Picture Experts Group. The official MPEG website. http://www.chiariglione.org/mpeg/.
[119] B. A. Myers. Peridot: Creating user interfaces by demonstration, pages 125–153. MIT Press, Cambridge, MA, 1993.
[120] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and
an algorithm. In Advances in Neural Information Processing Systems,
Vancouver, Canada, December 2001. MIT Press.
[121] J. Oswald. Plunderphonics web site: audio piracy as a compositional
prerogative, 1999. http://www.plunderphonics.com.
[122] F. Pachet. The continuator: Musical interaction with style. In Proceedings
International Computer Music Conference, Goteborg, Sweden, 2002.
[123] F. Pachet. Knowledge Management and Musical Metadata. Idea Group,
2005. In Encyclopedia of Knowledge Management, edited by Diemo
Schwarz.
[124] F. Pachet, J.-J. Aucouturier, A. L. Burthe, A. Zils, and A. Beurive. The
cuidado music browser: an end-to-end electronic music distribution sys-
tem. Multimedia Tools and Applications, 2004. Special Issue on the
CBMI03 Conference.
[125] F. Pachet, G. Westerman, and D. Laigre. Musical data mining for elec-
tronic music distribution. In Proceedings of the 1st WedelMusic Confer-
ence, 2001.
[126] F. Pachet and A. Zils. Evolving automatically high-level music descriptors
from acoustic signals. Springer Verlag LNCS, 2771, 2003.
[127] T. Painter and A. Spanias. A review of algorithms for perceptual coding of digital audio signals. In Proceedings of the International Conference of Digital Signal Processing, pages 179–205, July 1997.
[128] E. Pampalk, S. Dixon, and G. Widmer. Exploring music collections by browsing different views. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR), Baltimore, MD, October 2003.
[129] M. Parry and I. Essa. Rhythmic similarity through elaboration. In Pro-
ceedings of the International Symposium on Music Information Retrieval
(ISMIR), Baltimore, MD, October 2003.
[130] J. Paulus and A. Klapuri. Measuring the similarity of rhythmic patterns. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR), pages 175–176, Paris, 2002. IRCAM.
[131] S. Pauws. Musical key extraction from audio. In Proceedings of the Inter-
national Symposium on Music Information Retrieval (ISMIR), Barcelona,
Spain, October 2004. Universitat Pompeu Fabra.
[132] G. Peeters, A. L. Burthe, and X. Rodet. Toward automatic music audio
summary generation from signal analysis. In Proceedings of the Interna-
tional Symposium on Music Information Retrieval (ISMIR), Paris, 2002.
IRCAM.
[133] G. Peeters, S. McAdams, and P. Herrera. Instrument description in the
context of mpeg-7. In Proceedings of International Computer Music Con-
ference, Berlin, Germany, 2000.
[134] R. M. Pirsig. Zen and the Art of Motorcycle Maintenance. Morrow, 10th
edition, May 1974.
[135] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, New York, 2nd edition, 1992.
[136] T. F. Quatieri. Discrete-Time Speech Signal Processing, Principles and
Practice. Prentice Hall, Upper Saddle River, NJ, 2002.
[137] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[138] S. Rossignol, X. Rodet, J. Soumagne, J.-L. Collette, and P. Depalle. Automatic characterization of musical signals: Feature extraction and temporal segmentation. Journal of New Music Research, 28(4):281–295, 1999.
[139] R. Rowe. Interactive Music Systems. MIT Press, 1992.
[140] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Inc., Upper Saddle River, New Jersey, 1995.
[141] E. Scheirer. Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, 103(1), January 1998.
[142] E. Scheirer. Music Listening Systems. PhD thesis, MIT Media Labora-
tory, 2000.
[143] A. Schödl, R. Szeliski, D. H. Salesin, and I. Essa. Video textures. In K. Akeley, editor, Siggraph 2000, Computer Graphics Proceedings, pages 33-42. ACM Press / ACM SIGGRAPH / Addison Wesley Longman, 2000.
[144] B. Schoner. State reconstruction for determining predictability in driven nonlinear acoustical systems. Master's thesis, MIT/RWTH Aachen, 1996.
[145] B. Schoner. Probabilistic Characterization and Synthesis of Complex
Driven Systems. PhD thesis, MIT Media Laboratory, 2000.
[146] M. Schroeder, B. Atal, and J. Hall. Optimizing digital speech coders by exploiting masking properties of the human ear. Journal of the Acoustical Society of America, 66:1647-1652, 1979.
[147] D. Schwarz. The Caterpillar system for data-driven concatenative sound synthesis. In Proceedings of the 6th International Conference on Digital Audio Effects (DAFx-03), September 2003.
[148] J. Seppänen. Tatum grid analysis of musical signals. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk, NY, October 2001.
[149] X. Serra. A system for Sound Analysis/Transformation/Synthesis Based
on a Deterministic Plus Stochastic Decomposition. PhD thesis, CCRMA,
Department of Music, Stanford University, 1989.
[150] X. Serra and J. O. Smith. Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4):12-24, 1990.
[151] P. Smaragdis. Redundancy Reduction for Computational Audition, a Unifying Approach. PhD thesis, MIT Media Laboratory, 2001.
[152] J. O. Smith. Physical modeling using digital waveguides. Computer Music Journal, 16(4), 1992.
[153] J. O. Smith and J. S. Abel. Bark and ERB bilinear transforms. IEEE Transactions on Speech and Audio Processing, 7(6):697-708, November 1999.
[154] J. O. Smith and P. Gossett. A flexible sampling-rate conversion method. International Conference on Acoustics, Speech, and Signal Processing, 2:19.4.1-19.4.2, 1984.
[155] B. Snyder. Music and Memory: an Introduction. MIT Press, Cambridge,
MA, 2000.
[156] A. Stenger, K. Younes, R. Reng, and B. Girod. A new error concealment technique for audio transmission with packet loss. In Proceedings European Signal Processing Conference (EUSIPCO 96), pages 1965-1968, Trieste, Italy, September 1998.
[157] F. Takens. Detecting strange attractors in turbulence. In D. Rand and L. Young, editors, Dynamical Systems and Turbulence, volume 898 of Lecture Notes in Mathematics, pages 366-381, New York, 1981. Springer-Verlag.
[158] A. S. Tanguiane. Artificial Perception and Music Recognition. Springer-Verlag, New York, 1993.
[159] D. Temperley. The Cognition of Basic Musical Structures. MIT Press,
Cambridge, 2001.
[160] E. Terhardt. Calculating virtual pitch. Hearing Research, 1:155-182, 1979.
[161] G. Tzanetakis. Manipulation, Analysis, and Retrieval Systems for Audio
Signals. PhD thesis, Princeton University, June 2002.
[162] G. Tzanetakis and P. Cook. Multifeature audio segmentation for browsing and annotation. In Proceedings IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 1999.
[163] G. Tzanetakis and P. Cook. Automatic musical genre classication of
audio signals. In Proceedings of the International Symposium on Music
Information Retrieval (ISMIR), Bloomington, USA, October 2001.
[164] V. N. Vapnik. Estimation of Dependences Based on Empirical Data [in
Russian]. Nauka, Moscow, 1979. (English translation: Springer-Verlag,
New York, 1982).
[165] B. Vercoe. Computational auditory pathways to music understanding. In I. Deliège and J. Sloboda, editors, Perception and Cognition of Music, pages 307-326. Psychology Press, Hove, UK, 1997.
[166] Y. Wang. A beat-pattern based error concealment scheme for music delivery with burst packet loss. In Proceedings International Conference Multimedia and Expo (ICME'01), pages 73-76, 2001.
[167] M. Welsh, N. Borisov, J. Hill, R. von Behren, and A. Woo. Querying large
collections of music for similarity. Technical report, Computer Science
Division, University of California, Berkeley, 1999.
[168] D. L. Wessel. Timbre space as a musical control structure. Computer Music Journal, 3(2):45-52, 1979. Republished in Foundations of Computer Music, Curtis Roads (Ed.), MIT Press.
[169] B. Whitman. Learning the Meaning of Music. PhD thesis, MIT Media
Laboratory, 2005.
[170] B. Whitman and S. Lawrence. Inferring descriptions and similarity for music from community metadata. In Proceedings International Computer Music Conference, pages 591-598, Göteborg, Sweden, September 2002.
[171] B. Whitman and R. Rifkin. Musical query-by-description as a multiclass
learning problem. In Proceedings of the IEEE Multimedia Signal Process-
ing Conference, St. Thomas, USA, December 2002.
[172] Wikipedia website. Definition of: Disc Jockey, May 2005. http://www.wikipedia.org/wiki/DJ.
[173] T. Winkler. Composing Interactive Music: Techniques and Ideas Using Max. MIT Press, 1998.
[174] M. Wright, A. Chaudhary, A. Freed, S. Khoury, and D. Wessel. Audio
applications of the sound description interchange format standard. In Pro-
ceedings of the 107th AES Convention, New York, New York, September
1999.
[175] M. Wright and A. Freed. Open Sound Control: A new protocol for communicating with sound synthesizers. In Proceedings International Computer Music Conference, pages 101-104, Thessaloniki, Greece, 1997.
[176] L. Wyse, Y. Wang, and X. Zhu. Application of a content-based percussive sound synthesizer to packet loss recovery in music streaming. In MULTIMEDIA '03: Proceedings of the eleventh ACM international conference on Multimedia, pages 335-338, Berkeley, CA, USA, 2003. ACM Press.
[177] J. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation.
Advances in Neural Information Processing Systems, 2000.
[178] T. Yoshioka, T. Kitahara, K. Komatani, T. Ogata, and H. G. Okuno.
Tempo and beat estimation of musical signals. In Proceedings of the Inter-
national Symposium on Music Information Retrieval (ISMIR), Barcelona,
Spain, October 2004. Universitat Pompeu Fabra.
[179] G. U. Yule. On a method of investigating periodicities in disturbed series with special reference to Wolfer's sunspot numbers. Philosophical Transactions Royal Society London Ser. A, 226:267-298, 1927.
[180] D. Zicarelli. An extensible real-time signal processing environment for Max. In Proceedings International Computer Music Conference, pages 463-466, Ann Arbor, Michigan, 1998.
[181] A. Zils and F. Pachet. Musical mosaicing. In Proceedings of the COST
G-6 Conference on Digital Audio Eects (DAFx-01), Limerick, Ireland,
December 2001.
[182] A. Zils and F. Pachet. Automatic extraction of music descriptors from acoustic signals using EDS. In Proceedings of the 116th AES Convention, May 2004.
[183] E. Zwicker and H. Fastl. Psychoacoustics: Facts and Models. Springer
Verlag, Berlin, 2nd edition, 1999.