Automatic Audio Analysis For Content Description & Indexing

Automatic audio analysis
for content description & indexing

Dan Ellis
International Computer Science Institute, Berkeley CA
<dpwe@icsi.berkeley.edu>
Outline
1 Auditory Scene Analysis (ASA)
2 Computational ASA (CASA)
3 Prediction-driven CASA
4 Speech recognition & sound mixtures
5 Implications for content analysis
Audio Indexing - Dan Ellis 1998feb04 - 1

1 Auditory Scene Analysis
“The organization of complex sound scenes
according to their inferred sources”
• Sounds rarely occur in isolation
- organization required for useful information
• Human audition is very effective
- unexpectedly difficult to model
• ‘Correct’ analysis defined by goal
- source shows independence, continuity
→ecological constraints enable organization
[Figure: spectrogram of ‘city22’, f/Hz vs. time/s (0-9 s), level in dB]

Psychology of ASA
• Extensive experimental research
- organization of ‘simple pieces’
(sinusoids & white noise)
- streaming, pitch perception, ‘double vowels’
• “Auditory Scene Analysis” [Bregman 1990]
→ grouping ‘rules’
- common onset/offset/modulation,
harmonicity, spatial location
• Debated... (Darwin, Carlyon, Moore, Remez)
(from Darwin 1996)

2 Computational Auditory Scene Analysis
(CASA)
• Automatic sound organization?
- convert an undifferentiated signal into a
description in terms of different sources
[Figure: spectrogram of ‘city22’, f/Hz vs. time/s (0-9 s), level in dB, labelled with sources: horn, horn, door crash, yell, car noise]
• Translate psych. rules into programs?
- representations to reveal common onset,
harmonicity ...
• Motivations & Applications
- it’s a puzzle: new processing principles?
- real-world interactive systems (speech, robots)
- hearing prostheses (enhancement, description)
- advanced processing (remixing)
- multimedia indexing...
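
One way to read "translate psych. rules into programs" is as a rule applied to already-detected signal elements. The following is a minimal sketch of a common-onset rule, assuming each element has been reduced to a centre frequency and an onset time. The `Element` type, the `group_by_onset` function and the 30 ms tolerance are illustrative assumptions, not taken from any system described in these slides.

```python
from dataclasses import dataclass


@dataclass
class Element:
    freq_hz: float   # centre frequency of the element (hypothetical field)
    onset_s: float   # time at which the element starts (hypothetical field)


def group_by_onset(elements, tolerance_s=0.03):
    """Chain elements into groups: an element joins the current group
    if its onset is within tolerance_s of the previous element's onset."""
    groups = []
    for el in sorted(elements, key=lambda e: e.onset_s):
        if groups and el.onset_s - groups[-1][-1].onset_s <= tolerance_s:
            groups[-1].append(el)
        else:
            groups.append([el])
    return groups


# Three partials starting together near 0.5 s, two more near 1.2 s.
elements = [
    Element(200, 0.50), Element(400, 0.51), Element(600, 0.52),
    Element(1000, 1.20), Element(2000, 1.21),
]
groups = group_by_onset(elements)
for g in groups:
    print([e.freq_hz for e in g])
```

A real system would combine several cues (offset, modulation, harmonicity, location) and would have to resolve conflicts between them; this sketch applies one cue only.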

CASA survey
• Early work on co-channel speech
- listeners benefit from pitch difference
- algorithms for separating periodicities
• Utterance-sized signals need more
- cannot predict number of signals (0, 1, 2 ...)
- birth/death processes
• Ultimately, more constraints needed
- nonperiodic signals
- masked cues
- ambiguous signals

CASA1: Periodic pieces
• Weintraub 1985
- separate male & female voices
- find periodicities in each frequency channel by
auto-coincidence
- number of voices is ‘hidden state’
• Cooke & Brown (1991-3)
- divide time-frequency plane into elements
- apply grouping rules to form sources
- pull single periodic target out of noise
[Figure: spectrograms of brn1h.aif and brn1h.fi.aif, frq/Hz (100-3000) vs. time/s (0.2-1.0)]
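
Finding periodicity within a single frequency channel can be illustrated with a plain autocorrelation peak search. This is a sketch only: it uses ordinary autocorrelation as a stand-in for the auto-coincidence measure named above, and a synthetic sine as a stand-in for the output of one filterbank channel. The function names, sample rate and lag range are assumptions.

```python
import math


def autocorrelation(x, lag):
    """Unnormalised autocorrelation of sequence x at the given lag."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))


def estimate_period(x, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(x, lag))


sample_rate = 8000   # Hz (assumed)
f0 = 100.0           # Hz, period of the synthetic channel signal
channel = [math.sin(2 * math.pi * f0 * n / sample_rate) for n in range(800)]

period = estimate_period(channel, min_lag=20, max_lag=200)
pitch_hz = sample_rate / period
print(period, pitch_hz)
```

Running the same search in every channel and pooling channels that agree on a period is one way such per-channel evidence could be turned into a voice grouping.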

CASA2: Hypothesis systems
• Okuno et al. (1994-)
- ‘tracers’ follow each harmonic + noise ‘agent’
- residue-driven: account for whole signal
• Klassner 1996
- search for a combination of templates
- high-level hypotheses permit front-end tuning
[Figure: panels (a) and (b), frequency vs. time (1.0-4.0 sec), showing sound templates Buzzer-Alarm, Glass-Clink, Phone-Ring and Siren-Chirp with components between 420 Hz and 3760 Hz]
• Ellis 1996
- model for events perceived in dense scenes
- prediction-driven: observations - hypotheses

CASA3: Other approaches
• Blind source separation (Bell & Sejnowski)
- find exact separation parameters by maximizing
statistic e.g. signal independence
• HMM decomposition (RK Moore)
- recover combined source states directly
• Neural models (Malsburg, Wang & Brown)
- avoid implausible AI methods (search, lists)
- oscillators substitute for iteration?

3 Prediction-driven CASA
Perception is not direct
but a search for plausible hypotheses
• Data-driven...
[Diagram: input mixture → Front end → signal features → Object formation → discrete objects → Grouping rules → Source groups]
vs. Prediction-driven
[Diagram: input mixture → Front end → signal features → Compare & reconcile → prediction errors → Hypothesis management → hypotheses (Noise components, Periodic components) → Predict & combine → predicted features → Compare & reconcile]
• Motivations
- detect non-tonal events (noise & clicks)
- support ‘restoration illusions’...
→ hooks for high-level knowledge
+ ‘complete explanation’, multiple hypotheses,
resynthesis
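
The predict / compare / reconcile cycle can be shown with a deliberately tiny toy. The assumptions are loud ones: the "features" are a single energy value per frame, the only standing hypothesis is a noise floor of known level, and reconciliation just starts or ends one extra element when the prediction error crosses a threshold. The function name `explain` and all numbers are hypothetical; the system described in these slides is far richer.

```python
def explain(observed, noise_level, threshold=3.0):
    """Return (start_frame, end_frame) spans where an extra element is
    needed to account for energy the noise hypothesis does not predict."""
    events = []
    active_start = None
    for t, energy in enumerate(observed):
        predicted = noise_level          # predict: energy expected from current hypotheses
        error = energy - predicted       # compare: prediction error for this frame
        # reconcile: create or end an element to absorb the error
        if error > threshold and active_start is None:
            active_start = t
        elif error <= threshold and active_start is not None:
            events.append((active_start, t))
            active_start = None
    if active_start is not None:
        events.append((active_start, len(observed)))
    return events


observed = [10, 10, 11, 25, 26, 10, 9, 30, 10]   # per-frame energy (made-up values)
events = explain(observed, noise_level=10)
print(events)
```

The point of the sketch is the direction of flow: hypotheses generate predictions, and only the unexplained residue triggers new elements, rather than objects being built bottom-up from the signal.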

Analyzing the continuity illusion
• Interrupted tone heard as continuous
- .. if the interruption could be a masker
[Figure: spectrogram of ‘ptshort’, f/Hz vs. time/s (0.0-1.4 s)]
• Data-driven just sees gaps
• Prediction-driven can accommodate
- special case or general principle?
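
The condition "if the interruption could be a masker" can be written as a simple check. This sketch assumes a crude masking criterion: the tone may be hypothesised to continue only if, in every frame of the interruption, the energy in the tone's frequency band is at least the tone's own level. The function name, the criterion and the dB values are illustrative assumptions, not a model of real auditory masking.

```python
def tone_may_continue(tone_level_db, gap_levels_db):
    """True if every frame of the interruption carries at least the tone's
    level in the tone's band, so the tone could be present but masked."""
    return all(level >= tone_level_db for level in gap_levels_db)


tone_level_db = 60.0
noise_burst = [65.0, 66.0, 64.0]   # interruption filled with a louder noise
silence = [20.0, 18.0, 21.0]       # interruption that is a true gap

print(tone_may_continue(tone_level_db, noise_burst))
print(tone_may_continue(tone_level_db, silence))
```

In a prediction-driven system this check falls out of the reconciliation step: a continuing-tone prediction is consistent with the noise burst but contradicted by the silent gap.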

Phonemic Restoration (Warren 1970)
• Another ‘illusion’ instance
• Inference relies on high-level semantics
[Figure: spectrogram of nsoffee.aif, frq/Hz (0-3500) vs. time/s (1.2-1.7)]
• Incorporating knowledge into models?

Subjective ground-truth in mixtures?
• Listening tests collect ‘perceived events’:
• Consistent answers:
[Figure: spectrogram of ‘City’, f/Hz vs. time (0-9 s), with the events named by subjects S1-S10]
Horn1 (10/10)
S9−horn 2
S10−car horn
S4−horn1
S6−double horn
S2−first double horn
S7−horn
S7−horn2
S3−1st horn
S5−Honk
S8−car horns
S1−honk, honk
Crash (10/10)
S7−gunshot
S8−large object crash
S6−slam
S9−door Slam?
S2−crash
S4−crash
S10−door slamming
S5−Trash can
S3−crash (not car)
S1−slam
Horn2 (5/10)
S9−horn 5
S8−car horns
S2−horn during crash
S6−doppler horn
S7−horn3
Truck (7/10)
S8−truck engine
S2−truck accelerating
S5−Acceleration
S1−rev up/passing
S6−acceleration
S3−closeup car
S10−wheels on road

PDCASA example:
City-street ambience
[Figure: ‘City’ spectrogram and the elements extracted from it, f/Hz vs. time/s (0-9 s), level in dB. Panels and the listener events shown with them:
- Wefts1−4, Weft5, Wefts6,7, Weft8, Wefts9−12: Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10)
- Noise2, Click1: Crash (10/10)
- Noise1: Squeal (6/10), Truck (7/10)]
• Problems
- error allocation
- rating hypotheses
- source hierarchy
- resynthesis

4 Speech recognition
& sound mixtures
• Conventional speech recognition:
[Diagram: signal → Feature extraction → low-dim. features → Phoneme classifier → phoneme probabilities → HMM decoder → words]
- signal assumed entirely speech
- find valid labelling by discrete labels
- class models from training data
• Some problems:
- need to ignore lexically-irrelevant variation
(microphone, voice pitch etc.)
- compact feature space → everything speech-like
• Very fragile to nonspeech, background
- scene-analysis methods very attractive...
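
The last stage of the conventional pipeline, the HMM decoder, can be sketched as a Viterbi search over per-frame classifier outputs. The slides do not say which search is used; Viterbi is assumed here as the usual choice. The two states, the transition matrix and the per-frame probabilities are made-up illustrative values.

```python
import math


def viterbi(log_obs, log_trans, log_init):
    """Most likely state sequence.

    log_obs[t][s]   : log P(frame t | state s), e.g. from a phoneme classifier
    log_trans[r][s] : log P(state s at t+1 | state r at t)
    log_init[s]     : log P(state s at t=0)
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backptr = []
    for frame in log_obs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            # best predecessor for state s at this frame
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + frame[s])
        backptr.append(pointers)
    # trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def logs(rows):
    return [[math.log(p) for p in row] for row in rows]


obs = logs([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]])
trans = logs([[0.8, 0.2], [0.2, 0.8]])
init = [math.log(0.5), math.log(0.5)]

path = viterbi(obs, trans, init)
print(path)
```

Note that the decoder must assign every frame to some speech state, which is one way to see why such a recognizer is fragile when part of the signal is not speech at all.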

CASA for speech recognition
• Data-driven: CASA as preprocessor
- problems with ‘holes’ (but: Cooke, Okuno)
- doesn’t exploit knowledge of speech structure
• Prediction-driven: speech as component
- same ‘reconciliation’ of speech hypotheses
- need to express ‘predictions’ in signal domain
[Diagram: input mixture → Front end → Compare & reconcile → Hypothesis management → Speech components, Noise components, Periodic components → Predict & combine → Compare & reconcile]

Example of speech & nonspeech
[Figure: panels (a)-(h), f/Bark (5-15) vs. time/s (0.0-1.5), level in dB (20-60)]
(a) Clap (clap8k−env.pf)
(b) Speech plus clap (223cl−env.pf)
(c) Recognizer output
h# w n ay n tcl t uw f ay ah s ay ow h# v s eh v ah n h#
h# n ay n tcl t uw f ay v ow h# s eh v ax n
<SIL> nine two five oh <SIL> seven
(d) Reconstruction from labels alone (223cl−renvG.pf)
(e) Slowly−varying portion of original (223cl−envg.pf)
(f) Predicted speech element ( = (d)+(e) ) (223cl−renv.pf)
(g) Click5 from nonspeech analysis (justclick.pf)
(h) Spurious elements from nonspeech analysis (nonclicks.pf)
• Problems:
- undoing classification & normalization
- finding a starting hypothesis
- granularity of integration

5 Implications for content analysis:
Using CASA to index soundtracks
[Figure: spectrogram of ‘city22’, f/Hz vs. time/s (0-9 s), level in dB, labelled with sources: horn, horn, door crash, yell, car noise]
• What are the ‘objects’ in a soundtrack?
- subjective definition → need auditory model
• Segmentation vs. classification
- low-level cues → locate events
- higher-level ‘learned’ knowledge to give
semantic label (footstep, crash)
... AI complete?
• But: hard to separate
- illusion phenomena suggest auditory
organization depends on interpretation

Using speech recognition for indexing
• Active research area:
Access to news broadcast databases
- e.g. Informedia (CMU), ThisL (BBC+...)
- use LV-CSR to transcribe,
then text-retrieval to find
- 30-40% word error rate, still works OK
• Several systems at NIST TREC workshop
• Tricks to ‘ignore’ nonspeech/poor speech
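
The word error rate quoted above is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch follows; the function name and the example sentences are made up for illustration and are not drawn from any of the systems named.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words.
    Assumes a non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "the senate passed the budget bill today"
hypothesis = "the senate past the budget today"

wer = word_error_rate(reference, hypothesis)
hit = "budget" in hypothesis.split()   # a query term that survived recognition
print(round(wer, 3), hit)
```

The example also hints at why retrieval can tolerate a high error rate: a query succeeds as long as enough of its terms are among the words that were recognized correctly.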

Open issues in automatic indexing
• How to do ASA?
• Explanation/description hierarchy
- PDCASA: ‘generic’ primitives
+ constraining hierarchy
- subjective & task-dependent
• Classification
- connecting subjective & objective properties
→ finding subjective invariants, prominence
- representation of sound-object ‘classes’
• Resynthesis?
- a ‘good’ description should be adequate
- provided in PDCASA, but low quality
- requires good knowledge-based constraints

6 Conclusions
• Auditory organization is required in real
environments
• We don’t know how listeners do it!
- plenty of modeling interest
• Prediction-reconciliation can account for
‘illusions’
- use ‘knowledge’ when signal is inadequate
- important in a wider range of circumstances?
• Speech recognizers are a good source of
knowledge
• Automatic indexing implies ‘synthetic listener’
- need to solve a lot of modeling issues
