Automatic Audio Analysis For Content Description & Indexing
Outline
3 Prediction-driven CASA
[Figures, one credited "(from Darwin 1996)": spectrogram (f/Hz vs. time/s, 0-9 s, level -40 to -70 dB), shown plain and with labeled events: horn, horn, door crash, yell, car noise; two-panel time-frequency plot, 100-3000 Hz over 0.2-1.0 s; two-panel (a)/(b) TIME diagram over 1.0-4.0 s with components labeled Buzzer-Alarm, Glass-Clink, Phone-Ring and Siren-Chirp at frequencies between 420 Hz and 2540 Hz.]
• Ellis 1996
- model for events perceived in dense scenes
- prediction-driven: observations reconciled with hypotheses
• Motivations
- detect non-tonal events (noise & clicks)
- support ‘restoration illusions’...
→ hooks for high-level knowledge
+ ‘complete explanation’, multiple hypotheses,
resynthesis
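The prediction-driven idea above (hypotheses predict the signal, predictions are compared with observations, and the two are reconciled) can be illustrated with a minimal sketch. This is not the Ellis 1996 system: the function name `reconcile`, the per-band energy lists, the proportional scaling rule and the `tol` parameter are all assumptions made for illustration only.

```python
def reconcile(observed, elements, tol=0.1):
    """Compare one observed frame with the summed element predictions.

    observed: list of per-band energies for the current frame.
    elements: dict mapping element name -> list of predicted per-band energies.
    Returns a new dict of elements that together account for `observed`
    to within `tol` in every band (hypothetical reconciliation rule).
    """
    n = len(observed)
    total = [sum(pred[b] for pred in elements.values()) for b in range(n)]
    updated = {name: list(pred) for name, pred in elements.items()}

    # Over-prediction: scale every element down in that band so the sum fits.
    for b in range(n):
        if total[b] > observed[b] + tol:
            scale = observed[b] / total[b]
            for pred in updated.values():
                pred[b] *= scale
            total[b] = observed[b]

    # Under-prediction: unexplained energy becomes a new element hypothesis.
    residual = [max(observed[b] - total[b], 0.0) for b in range(n)]
    if any(r > tol for r in residual):
        updated[f"new{len(updated) + 1}"] = residual
    return updated


# One existing noise element; the observation has extra energy in band 1
# and less energy than predicted in band 2 (illustrative numbers).
result = reconcile([1.0, 3.0, 0.5], {"noise1": [1.0, 1.0, 1.0]})
```

Here the existing element is trimmed where it over-predicts and a new element is created for the energy no hypothesis explained, which is the sense in which every part of the input receives a 'complete explanation'.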
Audio Indexing - Dan Ellis 1998feb04 - 9
Analyzing the continuity illusion
• Interrupted tone heard as continuous
- ... if the interruption could be a masker
[Figure: spectrograms of "ptshort" (f/Hz vs. time/s, 1.2-1.7 s).]
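The condition in the bullets above (the tone is heard as continuous only if the interruption could have masked it) can be written as a simple test. The sketch below is a simplification under stated assumptions: a single frequency band, levels in dB per frame, and masking approximated as "the interrupting sound has at least the tone's level in that band". The function name and the numbers are hypothetical.

```python
def tone_could_continue(band_level_db, gap, tone_level_db):
    """Return True if the interruption could have masked the tone.

    band_level_db: observed level (dB) per frame in the tone's frequency band.
    gap:           (start, end) frame indices of the interruption, end exclusive.
    tone_level_db: level of the tone just before the interruption.
    """
    start, end = gap
    # The tone may have continued only if every gap frame contains at least
    # as much energy in the band as the tone itself would contribute.
    return all(level >= tone_level_db for level in band_level_db[start:end])


# Tone at 60 dB interrupted by a 70 dB noise burst: continuity is plausible.
with_noise = [60, 60, 70, 70, 70, 60, 60]
# Same tone interrupted by near silence: the tone must really have stopped.
with_silence = [60, 60, 20, 20, 20, 60, 60]

masked = tone_could_continue(with_noise, (2, 5), 60)
unmasked = tone_could_continue(with_silence, (2, 5), 60)
```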
• Consistent answers:
[Figure: spectrogram of the "City" example (f/Hz vs. time/s, 0-9 s) with the labels given by listeners S1-S10 for each event; n/10 is the number of listeners reporting it.]
- Horn1 (10/10): S9 "horn 2", S10 "car horn", S4 "horn1", S6 "double horn", S2 "first double horn", S7 "horn", S7 "horn2", S3 "1st horn", S5 "Honk", S8 "car horns", S1 "honk, honk"
- Crash (10/10): S7 "gunshot", S8 "large object crash", S6 "slam", S9 "door Slam?", S2 "crash", S4 "crash", S10 "door slamming", S5 "Trash can", S3 "crash (not car)", S1 "slam"
- Horn2 (5/10): S9 "horn 5", S8 "car horns", S2 "horn during crash", S6 "doppler horn", S7 "horn3"
- Truck (7/10): S8 "truck engine", S2 "truck accelerating", S5 "Acceleration", S1 "rev up/passing", S6 "acceleration", S3 "closeup car", S10 "wheels on road"
[Figure: extracted elements (f/Hz vs. time/s, 0-9 s, level -40 to -70 dB) shown with the listener events. Panel "Wefts1-4, Weft5, Wefts6,7, Weft8, Wefts9-12": Horn1 (10/10), Horn2 (5/10), Horn3 (5/10), Horn4 (8/10), Horn5 (10/10). Panel "Noise2, Click1": Crash (10/10). Panel "Noise1": Squeal (6/10), Truck (7/10).]
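The n/10 agreement figures can be recomputed from the listener labels listed above. The short sketch below does this for the four events whose labels are given, under the assumption that a listener counts once per event even when they supplied two labels (as S7 does for Horn1). The variable and function names are illustrative.

```python
# Listener IDs attached to each event, in the order they appear in the figure.
responses = {
    "Horn1": ["S9", "S10", "S4", "S6", "S2", "S7", "S7", "S3", "S5", "S8", "S1"],
    "Crash": ["S7", "S8", "S6", "S9", "S2", "S4", "S10", "S5", "S3", "S1"],
    "Horn2": ["S9", "S8", "S2", "S6", "S7"],
    "Truck": ["S8", "S2", "S5", "S1", "S6", "S3", "S10"],
}


def agreement(responses):
    """Count the distinct listeners who reported each event."""
    return {event: len(set(subjects)) for event, subjects in responses.items()}


counts = agreement(responses)
for event, n in counts.items():
    print(f"{event}: {n}/10")
```

The counts reproduce the figures shown on the slide: Horn1 10/10, Crash 10/10, Horn2 5/10, Truck 7/10.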
• Problems
- error allocation
- rating hypotheses
- source hierarchy
- resynthesis
[Figure: block diagram with blocks labeled: input mixture, Front end, Compare & reconcile, Periodic components.]
[Figure panels (time axis 0.0-1.5 s, level 20-60 dB): (b) Speech plus clap (223cl-env.pf); (c) Recognizer output, phone labels and words "<SIL> nine two five oh <SIL> seven"; (d) Reconstruction from labels alone (223cl-renvG.pf).]
• Problems:
- undoing classification & normalization
- finding a starting hypothesis
- granularity of integration
5 Implications for content analysis:
Using CASA to index soundtracks
[Figure: spectrogram of "city22" (f/Hz vs. time/s, 0-9 s, level -40 to -70 dB) with labeled events: horn, horn, door crash, yell, car noise.]
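One way to picture an index built from such labeled events is a list of (label, start, end) records that can be searched by label or by time. The sketch below is only an illustration: the event labels come from the figure, but the start and end times are placeholder values, not measurements read from it, and the function names are hypothetical.

```python
# (label, start time in s, end time in s); times are placeholders only.
events = [
    ("car noise", 0.0, 9.0),
    ("horn", 1.0, 2.0),
    ("door crash", 3.0, 3.5),
    ("yell", 5.0, 6.0),
    ("horn", 7.0, 8.0),
]


def find(label):
    """Return the (start, end) spans of every event with this label."""
    return [(start, end) for name, start, end in events if name == label]


def events_at(t):
    """Return the labels of all events active at time t (seconds)."""
    return [name for name, start, end in events if start <= t < end]


horn_spans = find("horn")
active = events_at(3.2)
```

A query such as `find("horn")` then returns the time spans to jump to in the soundtrack, and `events_at(3.2)` lists what is sounding at a given moment, including overlapping sources.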