Section: Multimedia

Audio Indexing
Gaël Richard
Ecole Nationale Supérieure des Télécommunications (TELECOM ParisTech), France

INTRODUCTION

The enormous amount of unstructured audio data available nowadays and the spread of its use as a data source in many applications are introducing new challenges to researchers in information and signal processing. The continuously growing size of digital audio information increases the difficulty of its access and management, thus hampering its practical usefulness. As a consequence, the need for content-based audio data parsing, indexing and retrieval techniques to make the digital information more readily available to the user is becoming ever more critical.

The lack of proper indexing and retrieval systems is making significant portions of existing audio information (and obviously audiovisual information in general) de facto useless. In fact, while generating digital content is easy and cheap, managing and structuring it to produce effective services is clearly not. This applies to the whole range of content providers and broadcasters, whose holdings can amount to terabytes of audio and audiovisual data. It also applies to the audio content gathered in private collections of digital movies or music files stored on the hard disks of conventional personal computers.

In summary, the goal of an audio indexing system is to automatically extract high-level information from the raw digital audio in order to provide new means to navigate and search in large audio databases. Since it is not possible to cover all applications of audio indexing, the basic concepts described in this chapter will be mainly illustrated on the specific problem of musical instrument recognition.

BACKGROUND

Audio indexing was historically restricted to word spotting in spoken documents. Such an application consists of looking for pre-defined words (such as the name of a person, topics of the discussion, etc.) in spoken documents by means of Automatic Speech Recognition (ASR) algorithms (see Rabiner, 1993, for fundamentals of speech recognition). Although this application remains of great importance, the variety of applications of audio indexing now clearly goes beyond this initial scope. In fact, numerous promising applications exist, ranging from automatic segmentation of broadcast audio streams (Richard et al., 2007) to automatic music transcription (Klapuri & Davy, 2006). Typical applications can be classified in three major categories depending on the potential users (content providers, broadcasters or end-user consumers). Such applications include:

• Intelligent browsing of music sample databases for composition (Gillet & Richard, 2005), video scene retrieval by audio (Gillet et al., 2007) and automatic playlist production according to user preferences (for content providers).
• Automatic podcasting, automatic audio summarization (Peeters et al., 2002), automatic audio title identification and smart digital DJing (for broadcasters).
• Music genre recognition (Tzanetakis & Cook, 2002), music search by similarity (Berenzweig et al., 2004), intelligent browsing of personal music databases and query by humming (Dannenberg et al., 2007) (for consumers).

MAIN FOCUS

Depending on the problem tackled, different architectures are proposed in the community. For example, for musical tempo estimation and tracking, traditional architectures will include a decomposition module, which aims at splitting the signal into separate frequency bands (using a filterbank), and a periodicity detection module, which aims at estimating the periodicity of a detection function built from the time-domain envelope of the signal in each band (Scheirer, 1998; Alonso et al., 2007). When tempo or beat tracking is necessary, it will be coupled with onset detection techniques (Bello et al., 2006) which aim at locating note onsets in

