Linear Prediction
Audio signal processing (ver1g), Part 2
Chapter 3: Audio feature extraction techniques
Chapter 4: Recognition procedures
Demo with Windows Media Player: play some music, then right-click and select Visualization > Bars and Waves. The display shows the spectral envelope versus frequency.
Speech recognition idea: use 4 linear filters, each with a bandwidth of 2.5 kHz, covering 0-10 kHz.
Two sounds give two spectral envelopes SE_A and SE_B (e.g. SE_A = "ar", SE_B = "ei").
[Figure: energy vs. frequency (0-10 kHz) for Spectrum A and Spectrum B, each divided into filters 1-4; the filter outputs are v1, v2, v3, v4 for Spectrum A and w1, w2, w3, w4 for Spectrum B.]
Difference between two sounds (or spectral envelopes SE_A, SE_B), e.g. SE_A = "ar", SE_B = "ei".
A simple measure is
Dist = |v1-w1| + |v2-w2| + |v3-w3| + |v4-w4|
where |x| = magnitude of x.
[Figure: uniform filter bank with bands at 500, 1 k, 1.5 k, 2 k, 2.5 k, 3 k, ... (Hz).]
Non-uniform filter banks: log frequency
A logarithmic frequency scale is close to the human ear's response.
http://instruct1.cit.cornell.edu/courses/ece576/FinalProjects/f2008/pae26_jsc59/pae26_jsc59/images/melfilt.png
Critical band scale: the Mel scale (m)
Based on perceptual studies, the Mel scale is
logarithmic when the frequency is above 1 kHz, and
linear when the frequency is below 1 kHz.
Pre-emphasis of the signal:
S~(n) = S(n) - a*S(n-1),   0.9 <= a <= 1.0, typically a = 0.95
For the samples S(0), S(1), S(2), ..., the value S~(0) does not exist and is never used.
For vowels (voiced sounds), use LPC to represent the signal.
The concept is to find a set of parameters a1, a2, a3, a4, ..., ap to represent the same waveform (typical values of p = 8 to 13; here p = 8).
For example, the input waveform is divided into 30 ms time frames, and each frame is represented by one set of LPC codes:
Time frame y   (30 ms): a1, a2, a3, a4, ..., a8
Time frame y+1 (30 ms): a1, a2, a3, a4, ..., a8
Time frame y+2 (30 ms): a1, a2, a3, a4, ..., a8
:
Each time frame has N = 512 samples (s0, s1, s2, ..., sn, ..., s_N-1 = s511), i.e. 512 floating-point values, while each LPC set has only 8 floating-point values. The waveform can be reconstructed from these LPC codes.
Concept: we want to find a set of coefficients a1, a2, ..., a8 such that, when applied to all sn in this frame (n = 0, 1, ..., N-1), the total error E (over n = 0..N-1) is minimum.
Predicted value s~n at time n, using past history:
s~n = a1*s(n-1) + a2*s(n-2) + a3*s(n-3) + ... + ap*s(n-p)
Prediction error at n:
en = sn - s~n
[Figure: signal sn over the frame, time n from 0 to N-1 = 511.]
Exercise 5: write the error function en at n = 130.
Answers
Write the error function at n = 130 and draw en on the graph:
e130 = s130 - s~130 = s130 - (a1*s129 + a2*s128 + a3*s127 + ... + ap*s122)
Write the error function at n = 288:
e288 = s288 - s~288 = s288 - (a1*s287 + a2*s286 + a3*s285 + ... + ap*s280)
Why e1 = 0?
Answer: because s-1, s-2, ..., s-8 are outside the frame and are taken as 0. The effect on the overall solution is very small.
Write E for n = 1, ..., N-1 (showing n = 1, 8, 130, 288, 511):
E = sum over n of (en)^2
To find the ai, i = 1, 2, ..., p, that give the minimum E, solve dE/dai = 0 for all i = 1, 2, ..., p.
After some manipulations we have the set of equations (2):

| r0      r1      r2      ...  r(p-1) |   | a1 |   | r1 |
| r1      r0      r1      ...  r(p-2) |   | a2 |   | r2 |
| r2      r1      r0      ...  :      | x | :  | = | :  |   (2)
| :       :       :       ...  :      |   | :  |   | :  |
| r(p-1)  r(p-2)  r(p-3)  ...  r0     |   | ap |   | rp |

Use Durbin's equation to solve this.
Derivations can be found at
http://www.cslu.ogi.edu/people/hosom/cs552/lecture07_features.ppt
r0 = sum_{n=0}^{N-1} sn*sn,   ri = sum_{n=0}^{N-1-i} sn*s(n+i)   (auto-correlation functions)
If we know r0, r1, r2, ..., rp, we can find a1, a2, ..., ap from the set of equations in (2):

void lpc_coeff(float *coeff)
{ /* on entry coeff[0..ORDER] holds r0..rp;
     on exit coeff[1..ORDER] holds a1..ap */
  int i, j; float sum, E, K, a[ORDER+1][ORDER+1];
  if (coeff[0]==0.0) coeff[0]=1.0E-30;   /* guard against a silent frame */
  E=coeff[0];
  for (i=1;i<=ORDER;i++)
  { sum=0.0;
    for (j=1;j<i;j++) sum+= a[j][i-1]*coeff[i-j];
    K=(coeff[i]-sum)/E; a[i][i]=K; E*=(1-K*K);
    for (j=1;j<i;j++) a[j][i]=a[j][i-1]-K*a[i-j][i-1];
  }
  for (i=1;i<=ORDER;i++) coeff[i]=a[i][ORDER];
}
http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif
Cepstral analysis
Signal (s) = convolution (*) of glottal excitation (e) and vocal-tract filter (h):
s(n) = e(n)*h(n), where n is the time index.
After the Fourier transform FT: FT{s(n)} = FT{e(n)*h(n)}, and convolution (*) becomes multiplication (.):
n (time) -> w (frequency),   S(w) = E(w).H(w)
Taking the magnitude of the spectrum:
|S(w)| = |E(w)|.|H(w)|
log10|S(w)| = log10{|E(w)|} + log10{|H(w)|}
Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
Cepstrum
C(n) = IDFT[log10|S(w)|] = IDFT[ log10{|E(w)|} + log10{|H(w)|} ]
The low-quefrency part gives the vocal tract cepstrum; the high-quefrency part gives the glottal excitation cepstrum.
http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
Liftering
Low-time liftering: magnify (or inspect) the low-time part to find the vocal tract filter cepstrum. Used for speech recognition.
High-time liftering: magnify (or inspect) the high-time part to find the glottal excitation cepstrum. This part is useless for speech recognition and is removed.
The cut-off is found by experiment. Frequency = FS/quefrency, where FS = sample frequency = 22050.
Reasons for liftering
Consider the cepstrum of speech. Why do we need this? Answer: to remove the ripples of the spectrum caused by glottal excitation. There are too many ripples in the spectrum caused by vocal cord vibration, but we are more interested in the speech envelope for recognition and reproduction.
Processing chain: signal -> Fourier transform -> cepstrum -> select high time (C_high) or select low time (C_low).
[Figure: spectrum (frequency axis) and cepstrum (quefrency, in sample index). The peak in the high-time part may be the pitch period and can be used to find the pitch; the low-time part gives the smoothed vocal tract spectrum.]
For more information see:
http://isdl.ee.washington.edu/people/stevenschimmel/sphsc503/files/notes10.pdf
Exercise 6
A speech waveform S has the values s0, s1, s2, s3, s4, s5, s6, s7, s8 = [1, 3, 2, 1, 4, 1, 2, 4, 3]. The frame size is 4.
Find the pre-emphasized wave if a = 0.98.
Find the auto-correlation parameters r0, r1, r2.
If we use LPC order 2 for our feature extraction system, find the LPC coefficients a1, a2.
If the number of overlapping samples for two frames is 2, find the LPC coefficients of the second frame.
[Figure: vector quantization codebook; each code (1 byte) maps to a set a1 a2 a3 a4 a5 a6 a7 a8. Training loops until the distortion change D - D' falls below a threshold (yes -> stop, no -> continue).]
Example: VQ: 240 samples, use VQ to split them into 4 classes.
Steps:
Stage 3: update the 2 centroids according to the two split groups; each group finds a new centroid.
Stage 4: split the 2 centroids again to become 4 centroids.
Final result: [Figure: recorded signal s(n) with the detected end-points marked.]
A simple end-point detection algorithm
At the beginning, the energy level is low.
If the energy level and zero-crossing rate of 3 successive frames are high, it is a starting point.
After the starting point, if the energy and zero-crossing rate of 5 successive frames are low, it is the end point.
[Figure: s(n) with numbered frames 1-6 along the time axis n; frame l = 1 has window length N.]
Windowing
To smooth out the discontinuities at the beginning and end of a frame, a Hamming or Hanning window can be used. Hamming window:
S~(n) = S(n) * W(n),  where W(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)),  0 <= n <= N-1
Tutorial: write a program segment to find the result of passing a speech frame, stored in an array int s[1000], through the Hamming window.
A MATLAB sketch (applying W(n) to frame x1, result in y1):
for i = 1:N
    hamming_window(i) = abs(0.54 - 0.46*cos(i*(2*pi/N)));
    y1(i) = hamming_window(i)*x1(i);
end
The auto-correlation of every frame (l = 1, 2, ...) of the windowed signal is calculated. If the required output is a p-th order LPC, the auto-correlation for the l-th frame is
r_l(m) = sum_{n=0}^{N-1-m} S~_l(n) * S~_l(n+m),   m = 0, 1, ..., p
LPC to cepstral coefficients conversion
Cepstral coefficients are more accurate in describing the characteristics of the speech signal. Normally, cepstral coefficients of order 1 <= m <= p are enough to describe the speech signal.
Calculate c1, c2, c3, ..., cp from a1, a2, a3, ..., ap:
c0 = r0
c_m = a_m + sum_{k=1}^{m-1} (k/m) * c_k * a_{m-k},   1 <= m <= p
c_m = sum_{k=m-p}^{m-1} (k/m) * c_k * a_{m-k},   m > p (if needed)
Ref: http://www.clear.rice.edu/elec532/PROJECTS98/speech/cepstrum/cepstrum.html
Distortion measure: the difference between two signals
To measure how different two signals are:
The cepstral distance between a frame (described by cepstral coefficients c1, c2, ..., cp) and another frame (c'1, c'2, ..., c'p) is
d = sum_{n=1}^{p} (c_n - c'_n)^2
The weighted cepstral distance gives different weightings to different cepstral coefficients (more accurate):
d = sum_{n=1}^{p} w(n) * (c_n - c'_n)^2
Step 2: accumulated score matrix D. The unknown input runs along the i-axis, the reference along the j-axis, and each cell D(i, j) is built from D(i-1, j), D(i-1, j-1) and D(i, j-1):

  R |  28  11   7   9
  O |  19   5  12  18
  O |  11   4  12  18
  F |   3   9  15  22
  F |   1   8  15  18
    +----------------
       F   O   R   R    (i-axis)
To find the optimal path in the accumulated matrix:
Starting from the top row and right-most column, find the lowest cost D(i,j)_t: it is found to be the cell at (i,j) = (3,5), D(3,5) = 7, in the top row.
From the lowest-cost position p(i,j)_t, find the next position (i,j)_{t-1} = argument_min_{i,j}{D(i-1,j), D(i-1,j-1), D(i,j-1)}.
E.g. p(i,j)_{t-1} = argument_min_{i,j}{9, 11, 4} = (3-0, 5-1) = (3,4); the cell that contains 4 is selected.
Repeat the above until the path reaches the left-most column or the lowest row.
Note: argument_min_{i,j}{cell1, cell2, cell3} means the argument (i,j) of the cell with the lowest value is selected.
distortion(dist) = (x - x')^2

YES'  : 2 4 6 9 3 4  5 8 1
NO'   : 7 6 2 4 7 6 10 4 5
Input : 3 5 5 8 4 2  3 7 2
[Figure: final centroids C1 = (1.5, 7.85) near P2 = (1.8, 6.9), and C2 = (8.15, 0.9) near P3 = (7.2, 1.5) and P4 = (9.1, 0.3).]
Appendix A.2: Binary-split K-means method when the number of required centroids is fixed (see binary_split2_a2.m.txt). (Assume you use all available samples in building the centroids at all stages of the calculations.)
P1 = (1.2, 8.8); P2 = (1.8, 6.9); P3 = (7.2, 1.5); P4 = (9.1, 0.3)
First centroid C1 = ((1.2+1.8+7.2+9.1)/4, (8.8+6.9+1.5+0.3)/4) = (4.825, 4.375)
Use e = 0.02 to find the two new centroids:
Step 1: CCa = C1(1+e) = (4.825x1.02, 4.375x1.02) = (4.9215, 4.4625)
        CCb = C1(1-e) = (4.825x0.98, 4.375x0.98) = (4.7285, 4.2875)
The function dist(Pi, CCx) = Euclidean distance between Pi and CCx:

Point  dist to CCa  -1*dist to CCb  = diff     Group to
P1     5.7152       -5.7283         = -0.0131  CCa
P2     3.9605       -3.9244         =  0.036   CCb
P3     3.7374       -3.7254         =  0.012   CCb
P4     5.8980       -5.9169         = -0.019   CCa
Step 2: binary-split K-means when the number of required centroids is fixed (say 2 here). Update each group's centroid along the direction of the split:
CCCa, CCCb are formed: CCCa = centroid of {P1, P4} = (5.15, 4.55), CCCb = centroid of {P2, P3} = (4.50, 4.20).
Regrouping and updating again converges to CCCCb = (1.5, 7.85) (group {P1 = (1.2, 8.8), P2 = (1.8, 6.9)}) and CCCCa = (8.15, 0.9) (group {P3 = (7.2, 1.5), P4 = (9.1, 0.3)}).
Appendix A.3: Cepstrum vs. spectrum
The spectrum is sensitive to the glottal excitation (E), but we are only interested in the filter (H).
In the frequency domain:
Speech wave (X) = Excitation (E) . Filter (H)
Log(X) = Log(E) + Log(H)
Cepstrum = Fourier transform of the log of the signal's power spectrum.
In the cepstrum, the Log(E) term can easily be isolated and removed.
Durbin's recursion (with E^(0) = r0):
k_i = ( r_i - sum_{j=1}^{i-1} a_j^(i-1) * r_{i-j} ) / E^(i-1),   1 <= i <= p
a_i^(i) = k_i
a_j^(i) = a_j^(i-1) - k_i * a_{i-j}^(i-1),   1 <= j <= i-1
E^(i) = (1 - k_i^2) * E^(i-1)
Finally, the LPC coefficients are a_m = a_m^(p), 1 <= m <= p.
Program to convert LPC coefficients to cepstral coefficients
cepstrum of f(t) = | FT[ log10( |FT[f(t)]|^2 ) ] |^2
http://mi.eng.cam.ac.uk/~ajr/SA95/node33.html
http://en.wikipedia.org/wiki/Cepstrum