Linear Prediction

Introduction to audio signal
processing
Part 2
Chapter 3: Audio feature extraction techniques
Chapter 4 : Recognition Procedures
Audio signal processing ver1g 1

Chapter 3: Audio feature
extraction techniques
3.1 Filtering
3.2 Linear predictive coding
3.3 Vector Quantization (VQ)

Chapter 3 : Speech data analysis
techniques
Ways to find the spectral envelope
Filter banks: uniform
[Figure: uniform filter bank; the outputs of filter1, filter2, filter3 and filter4 (energy against frequency) trace the spectral envelope]
Filter banks can also be non-uniform
LPC and Cepstral LPC parameters
Vector quantization method to represent data more efficiently
You can see the filter band output for a frame using Windows Media Player. Try to look at it:
Run Windows Media Player to play music
Right-click, select Visualization / bar and waves
[Figure: Windows Media Player visualization showing the spectral envelope, energy against frequency]
Speech recognition idea using 4 linear filters,
each bandwidth is 2.5KHz
Two sounds with two spectral envelopes SE, SE'
E.g. SE = "ar", SE' = "ei"
[Figure: two panels of energy against frequency (0 to 10 kHz). Spectrum A has spectral envelope SE = "ar" and Spectrum B has spectral envelope SE' = "ei". Each spectrum is split by filters 1 to 4; the filter outputs are v1, v2, v3, v4 for Spectrum A and w1, w2, w3, w4 for Spectrum B]
Difference between two sounds (or spectral envelopes SE, SE')
E.g. SE = "ar", SE' = "ei"
A simple measure is
Dist = |v1-w1| + |v2-w2| + |v3-w3| + |v4-w4|
where |x| = magnitude of x
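
The simple distance measure above can be sketched in C as follows. This is only an illustration: the function name filter_bank_distance and the use of double arrays are assumptions, not part of the original notes.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Sum of absolute differences between two filter-bank output vectors,
   i.e. Dist = |v1-w1| + |v2-w2| + ... for n_filters filters.
   (Hypothetical helper; the name is not from the original notes.) */
double filter_bank_distance(const double *v, const double *w, size_t n_filters)
{
    double dist = 0.0;
    assert(v != NULL && w != NULL);
    for (size_t k = 0; k < n_filters; k++)
        dist += fabs(v[k] - w[k]);
    return dist;
}
```

With 4 filters, calling filter_bank_distance(v, w, 4) on the outputs of Spectrum A and Spectrum B gives the Dist value; a smaller value means the two spectral envelopes are more alike.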

3.1 Filtering method
For each frame (10 - 30 ms) a set of filter
outputs will be calculated. (frame overlap 5ms)
There are many different methods for setting the filter bandwidths: uniform or non-uniform
[Figure: input waveform divided into 30 ms time frames i, i+1, i+2 that overlap by 5 ms; each time frame gives one set of filter outputs (v1, v2, ...)]
How to determine filter band ranges
The previous example of using 4 linear filters is too simple and primitive.
We will discuss
Uniform filter banks
Log frequency banks
Mel filter bands

Uniform Filter Banks
Bandwidth B = sampling frequency (Fs) / number of banks (N)
For example, Fs = 10 kHz, N = 20, then B = 500 Hz
Simple to implement but not too useful
[Figure: uniform filter bank; filter outputs v1, v2, v3, ... for filters 1, 2, 3, 4, 5, ..., Q against frequency, with band edges at 500, 1K, 1.5K, 2K, 2.5K, 3K, ... (Hz)]
Non-uniform filter banks: Log frequency
Log frequency scale: close to the human ear

                    filter 1  filter 2  filter 3  filter 4
Center freq. (Hz)   300       600       1200      2400
Bandwidth (Hz)      200       400       800       1600

[Figure: filter outputs v1, v2, v3, ... against frequency, with band edges at 200, 400, 800, 1600, 3200 Hz]
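
The log-frequency bands in the table follow a doubling pattern: each band starts where the previous one ends and is twice as wide. A minimal sketch that reproduces the table values, assuming this doubling rule holds for every band (the function name log_filter_bank and its parameters are hypothetical):

```c
#include <assert.h>

/* Fill lower[k], center[k] and bandwidth[k] (all in Hz) for k = 0..n_filters-1.
   Assumed rule: band k runs from edge to 2*edge, so its bandwidth equals
   its lower edge and its center is 1.5 times its lower edge. */
void log_filter_bank(double f_low, int n_filters,
                     double *lower, double *center, double *bandwidth)
{
    double edge = f_low;
    assert(f_low > 0.0 && n_filters > 0);
    for (int k = 0; k < n_filters; k++) {
        lower[k] = edge;
        bandwidth[k] = edge;            /* band runs from edge to 2*edge */
        center[k] = edge + edge / 2.0;  /* mid-point of the band */
        edge *= 2.0;                    /* next band starts where this one ends */
    }
}
```

Calling log_filter_bank(200.0, 4, lower, center, bandwidth) gives centers 300, 600, 1200, 2400 Hz and bandwidths 200, 400, 800, 1600 Hz, as in the table.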

Inner ear and the cochlea
(human also has filter bands)
Ear and cochlea

http://universe-review.ca/I10-85-cochlea2.jpg
http://www.edu.ipa.go.jp/chiyo/HuBEd/HTML1/en/3D/ear.html
Mel filter bands (found by psychological and instrumentation experiments)
Frequencies lower than 1 kHz have narrower bands (on a linear scale)
Higher frequencies have larger bands (on a log scale)
More filters below 1 kHz
Fewer filters above 1 kHz
[Figure: Mel filter bank, filter output against frequency]
http://instruct1.cit.cornell.edu/courses/ece576/FinalProjects/f2008/pae26_jsc59/pae26_jsc59/images/melfilt.png
Critical band scale: Mel scale
Based on perceptual studies
Log scale when freq. is above 1 kHz
Linear scale when freq. is below 1 kHz
Popular scales are the Mel or Bark scales
$$ m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) $$
where f is the frequency in Hz and m is the Mel-scale value
Below 1 kHz, f ≈ m (linear)
Above 1 kHz, f > m (log scale)
[Figure: Mel scale m plotted against frequency f in Hz]
http://en.wikipedia.org/wiki/Mel_scale
How to implement filter
bands
Linear Predictive coding LPC
methods

3.2 Feature extraction data flow
- The LPC (linear predictive coding) based method
Signal -> preprocess (pre-emphasis, windowing) -> autocorrelation (r0, r1, ..., rp) -> LPC by the Durbin algorithm (a1, ..., ap) -> cepstral coefficients (c1, ..., cp)

Pre-emphasis
The high concentration of energy in the low
frequency range observed for most speech
spectra is considered a nuisance because it
makes less relevant the energy of the signal
at middle and high frequencies in many
speech analysis algorithms.
From Vergin, R. et al., "Compensated mel frequency cepstrum coefficients", IEEE ICASSP-96, 1996.

Pre-emphasis -- high pass filtering
(the effect is to suppress low frequency)
To reduce noise, average transmission conditions and to average signal spectrum.
$$ \tilde{S}(n) = S(n) - \tilde{a}\, S(n-1) $$
$$ 0.9 \le \tilde{a} \le 1.0, \quad \text{typically } \tilde{a} = 0.95 $$
For S(0), S(1), S(2), ..., the value $\tilde{S}(0)$ does not exist (it would need S(-1)) and is never used.

3.2 The Linear Predictive Coding
LPC method
Linear Predictive Coding LPC method
Time domain
Easy to implement
Achieves data compression

First let's look at the LPC speech production model
Speech synthesis model:
Impulse train generator governed by the pitch period (glottis), for vowels
Random noise generator for consonants
Vocal tract parameters = LPC parameters
[Figure: LPC speech production model. A voiced/unvoiced switch selects either the impulse train generator (glottal excitation, for vowels) or the noise generator (for consonants); the selected source is multiplied by a gain and passed through a time-varying digital filter, set by the LPC parameters, to give the output]
For vowels (voiced sound),
use LPC to represent the signal
The concept is to find a set of parameters, i.e. a1, a2, a3, a4, ..., ap (here p = 8), to represent the same waveform (typical values of p are 8 to 13).
For example, the waveform can be reconstructed from these LPC codes.
[Figure: input waveform divided into 30 ms time frames y, y+1, y+2; each time frame gives one set of codes a1, a2, a3, a4, ..., a8]
Each time frame y has N = 512 samples (s0, s1, s2, ..., sn, ..., sN-1, with N-1 = 511), i.e. 512 floating point numbers.
Each LPC set has 8 floating point numbers.
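Concept: we want to find a set of a1, a2, ..., a8 so that, when applied to all s_n in this frame (n = 0, 1, ..., N-1), the total error E (summed over n = 0 to N-1) is minimum.
Predicted value of s_n at n using past history:
$$ \tilde{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3} + \dots + a_p s_{n-p} $$
Prediction error at n:
$$ e_n = s_n - \tilde{s}_n $$
So for the whole segment, n = 0 to N-1:
$$ E = \sum_{n=0}^{N-1} e_n^2 $$
[Figure: signal level against time n, from 0 to N-1 = 511, showing the past samples s_{n-4}, s_{n-3}, s_{n-2}, s_{n-1}, the actual sample s_n, the predicted value \tilde{s}_n and the error e_n]
Exercise 5
Write the error function e_n at n = 130 and draw it on the graph.
Write the error function at n = 288.
Why e_0 = s_0?
Write E for n = 0, ..., N-1 (showing n = 1, 8, 130, 288, 511).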
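Answers
Write the error function at n = 130 and draw e_n on the graph:
$$ e_{130} = s_{130} - \tilde{s}_{130} = s_{130} - (a_1 s_{129} + a_2 s_{128} + a_3 s_{127} + \dots + a_8 s_{122}) $$
(the last index is 130 - 8 = 122)
Write the error function at n = 288:
$$ e_{288} = s_{288} - \tilde{s}_{288} = s_{288} - (a_1 s_{287} + a_2 s_{286} + a_3 s_{285} + \dots + a_8 s_{280}) $$
(the last index is 288 - 8 = 280)
Why e_0 = s_0?
Answer: Because s_{-1}, s_{-2}, ..., s_{-8} are outside the frame and are considered as 0, the predicted value \tilde{s}_0 is 0 and so e_0 = s_0. The effect on the overall solution is very small.
Write E for n = 0, ..., N-1 (showing n = 1, 8, 130, 288, 511):
$$ E = (s_0 - \tilde{s}_0)^2 + (s_1 - \tilde{s}_1)^2 + \dots + (s_8 - \tilde{s}_8)^2 + \dots + (s_{130} - \tilde{s}_{130})^2 + \dots + (s_{288} - \tilde{s}_{288})^2 + \dots + (s_{511} - \tilde{s}_{511})^2 $$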

LPC idea and procedure
The idea: from all samples s0, s1, s2, ..., sN-1 (N-1 = 511), we want to find a_i (i = 1, 2, ..., 8) so that E is a minimum. The periodicity of the input signal provides information for finding the result.
Procedures
For a speech signal, we first get the signal frame of size N = 512 by windowing (discussed later).
Sampling at 25.6 kHz, the frame is equal to a period of 20 ms (512 / 25600 s).
The signal frame is (s0, s1, s2, ..., sn, ..., sN-1), with N-1 = 511.
Ignore the effect of outside elements by setting them to zero, i.e. $s_{-\infty} = \dots = s_{-2} = s_{-1} = s_{512} = s_{513} = \dots = s_{\infty} = 0$.
We want to calculate LPC parameters of order p = 8, i.e. a1, a2, a3, a4, ..., ap (p = 8).

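[Figure: input waveform; for each 30 ms time frame y, one set of LPC parameters a1, a2, a3, a4, ..., a8 is found]
The predicted value of s_n is denoted by \tilde{s}_n:
$$ \tilde{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3} + \dots + a_p s_{n-p} \quad (1) $$
The prediction error, from (1), is
$$ e_n = s_n - \tilde{s}_n = s_n - (a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3} + \dots + a_p s_{n-p}) = s_n - \sum_{i=1}^{p} a_i s_{n-i} $$
so for the whole frame, n = 0 to N-1,
$$ E = \sum_{n=0}^{N-1} e_n^2 = \sum_{n=0}^{N-1} \left( s_n - \sum_{i=1}^{p} a_i s_{n-i} \right)^2 $$
To find a_i, i = 1, 2, ..., p, that generate E_min, solve $\partial E / \partial a_i = 0$ for all i = 1, 2, ..., p.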
[Figure: input waveform; for time frame y (30 ms), solve for a1, a2, ..., ap (p = 8)]
After some manipulations we have (derivations can be found at http://www.cslu.ogi.edu/people/hosom/cs552/lecture07_features.ppt):
$$ \begin{bmatrix} r_0 & r_1 & r_2 & \dots & r_{p-1} \\ r_1 & r_0 & r_1 & \dots & r_{p-2} \\ r_2 & r_1 & r_0 & \dots & r_{p-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{p-1} & r_{p-2} & r_{p-3} & \dots & r_0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ \vdots \\ r_p \end{bmatrix} \quad (2) $$
Use Durbin's equation to solve this.
$$ r_0 = \sum_{n=0}^{N-1} s_n s_n, \qquad r_i = \sum_{n=0}^{N-1-i} s_n s_{n+i} \quad \text{(auto-correlation functions)} $$
If we know r_0, r_1, r_2, ..., r_p, we can find a_1, a_2, ..., a_p from the set of equations in (2).

The example
For each time frame (25 ms), data is valid only inside the window.
With 20.48 kHz sampling, a window frame (25 ms) has 512 samples (N).
Require 8th-order LPC, i = 1, 2, 3, ..., 8.
Calculate r0, r1, r2, ..., r8 using the above formulas, then get the LPC parameters a1, a2, ..., a8 by the Durbin recursive procedure.

Steps for each time frame to find a set of LPC
(step 1) N = WINDOW = 512, the speech signal is s0, s1, ..., s511
(step 2) Order of LPC is 8, so the r0, r1, ..., r8 required are:
r0 = s0 s0 + s1 s1 + s2 s2 + s3 s3 + s4 s4 + ... + s511 s511
r1 = s0 s1 + s1 s2 + s2 s3 + s3 s4 + s4 s5 + ... + s510 s511
r2 = s0 s2 + s1 s3 + s2 s4 + s3 s5 + s4 s6 + ... + s509 s511
r3 = s0 s3 + s1 s4 + s2 s5 + s3 s6 + s4 s7 + ... + s508 s511
:
r7 = s0 s7 + s1 s8 + s2 s9 + s3 s10 + s4 s11 + ... + s504 s511
r8 = s0 s8 + s1 s9 + s2 s10 + s3 s11 + s4 s12 + ... + s503 s511
(step 3) Solve the set of linear equations (2) (previous slide)

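Program segment for auto-correlation
WINDOW = size of the frame; coeff = auto-correlation values r0..rORDER; sig = input; ORDER = LPC order
#define WINDOW 512   /* frame size N */
#define ORDER  8     /* LPC order p */
void autocorrelation(float *sig, float *coeff)
{
    int i, j;
    for (i = 0; i <= ORDER; i++)
    {
        coeff[i] = 0.0;
        /* r_i = sum of sig[j] * sig[j-i] over the samples inside the frame */
        for (j = i; j < WINDOW; j++)
            coeff[i] += sig[j] * sig[j-i];
    }
}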

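To calculate LPC a[ ] from the auto-correlation values *coeff using Durbin's method (solve equation 2)
$$ \begin{bmatrix} r_0 & r_1 & r_2 & \dots & r_{p-1} \\ r_1 & r_0 & r_1 & \dots & r_{p-2} \\ r_2 & r_1 & r_0 & \dots & r_{p-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{p-1} & r_{p-2} & r_{p-3} & \dots & r_0 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ \vdots \\ r_p \end{bmatrix} \quad (2) $$
void lpc_coeff(float *coeff)
{
    int i, j; float sum, E, K, a[ORDER+1][ORDER+1];
    if (coeff[0] == 0.0) coeff[0] = 1.0E-30;   /* avoid division by zero */
    E = coeff[0];                              /* E(0) = r0 */
    for (i = 1; i <= ORDER; i++)
    {
        sum = 0.0;
        for (j = 1; j < i; j++) sum += a[j][i-1] * coeff[i-j];
        K = (coeff[i] - sum) / E; a[i][i] = K; E *= (1 - K*K);
        for (j = 1; j < i; j++) a[j][i] = a[j][i-1] - K * a[i-j][i-1];
    }
    /* on return coeff[1..ORDER] holds the LPC parameters a1..aORDER */
    for (i = 1; i <= ORDER; i++) coeff[i] = a[i][ORDER];
}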

Cepstrum
A new word made by reversing the first 4 letters of "spectrum": spectrum -> cepstrum.
It is the spectrum of a spectrum of a signal

Glottis and cepstrum
Speech wave (X) = Excitation (E) . Filter (H)
So voice has a strong glottal excitation frequency content.
In the cepstrum we can easily identify and remove the glottal excitation.
[Figure: glottal excitation (E) from the vocal cords (glottis) passes through the vocal tract filter (H) to give the output (S)]
http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif
Cepstral analysis
Signal(s)=convolution(*) of
glottal excitation (e) and vocal_tract_filter (h)
s(n)=e(n)*h(n), n is time index
After Fourier transform FT: FT{s(n)}=FT{e(n)*h(n)}
Convolution(*) becomes multiplication (.)
n (time) -> w (frequency),
S(w) = E(w).H(w)
Find Magnitude of the spectrum
|S(w)| = |E(w)|.|H(w)|
log10 |S(w)|= log10{|E(w)|}+ log10{|H(w)|}
Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
Cepstrum
C(n) = IDFT[log10 |S(w)|] = IDFT[ log10{|E(w)|} + log10{|H(w)|} ]
s(n) -> windowing -> x(n) -> DFT -> X(w) -> log|X(w)| -> IDFT -> C(n)
n = time index
w = frequency
IDFT = inverse discrete Fourier transform
In C(n), you can see E(n) and H(n) at two different positions
Application: useful for (i) glottal excitation and (ii) vocal tract filter analysis

Example of cepstrum
using spCepstrumDemo.m on sor1.wav (sampling frequency 22.05 kHz)
[Figure: examples of the vocal tract cepstrum and the glottal excitation cepstrum]
http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
Liftering
Low time liftering: magnify (or inspect) the low time to find the vocal tract filter cepstrum. Used for speech recognition.
High time liftering: magnify (or inspect) the high time to find the glottal excitation cepstrum (remove this part for speech recognition).
[Figure: cepstrum plot. The low-time region is the vocal tract cepstrum, used for speech recognition; the high-time region is the glottal excitation cepstrum, useless for speech recognition; the cut-off between them is found by experiment]
Frequency = FS / quefrency, where FS = sample frequency = 22050
Reasons for liftering
Why do we need this?
Answer: to remove the ripples of the spectrum caused by glottal excitation.
There are too many ripples in the spectrum, caused by vocal cord vibrations.
But we are more interested in the speech envelope for recognition and reproduction.
[Figure: speech signal x, its Fourier transform (spectrum of x) and the cepstrum of the speech]
http://isdl.ee.washington.edu/people/stevenschimmel/sphsc503/files/notes10.pdf
Liftering method: select the high time and low time liftering
[Figure: signal X and its cepstrum; the high-time part is selected as C_high and the low-time part as C_low]

Recover glottal excitation and vocal tract spectrum
[Figure, top row: cepstrum of glottal excitation (C_high, against quefrency in sample index) and the spectrum of glottal excitation recovered from it (against frequency); the peak in the cepstrum may be the pitch period]
[Figure, bottom row: cepstrum of the vocal tract (C_low) and the spectrum of the vocal tract filter recovered from it (against frequency)]
This smoothed vocal tract spectrum can be used to find pitch
For more information see:
http://isdl.ee.washington.edu/people/stevenschimmel/sphsc503/files/notes10.pdf
Exercise 6
A speech waveform S has the values
s0,s1,s2,s3,s4,s5,s6,s7,s8=
[1,3,2,1,4,1,2,4,3]. The frame size is 4.
Find the pre-emphasized wave if the pre-emphasis constant ã is 0.98.
Find auto-correlation parameter r0, r1, r2.
If we use LPC order 2 for our feature extraction
system, find LPC coefficients a1, a2.
If the number of overlapping samples for two
frames is 2, find the LPC coefficients of the second
frame.

Vector quantization is a data compression
method
raw speech 10KHz/8-bit data for a 30ms frame is
300 bytes
10th order LPC =10 floating numbers=40 bytes
after VQ it can be as small as one byte.
Used in tele-communication systems.
Enhance recognition systems since less data is
involved.

Use of Vector quantization for Further
compression
A 10th-order LPC vector is a data point in a 10-dimensional space
after VQ it can be as small as one byte.
Example, in LPC2 (2 D space)

A simple example, 2nd order LPC, LPC2
We can classify speech sound segments by vector quantization
Make a table:

code  sound  a1   a2
1     e:     0.5  1.5
2     i:     2    1.3
3     u:     0.7  0.8

Using this table, 2 bits are enough to encode each sound
Feature space and sounds are classified into three different types: e:, i:, u:
The standard sound is the centroid of all samples of e:, (a1,a2) = (0.5,1.5)
The standard sound is the centroid of all samples of i:, (a1,a2) = (2,1.3)
The standard sound is the centroid of all samples of u:, (a1,a2) = (0.7,0.8)
[Figure: feature space with axes a1 and a2 (each marked up to 2), showing the clusters of samples for e:, i: and u:]
Another example LPC8
256 different sounds are encoded by the table (one segment, which has 512 samples, is represented by one byte)
Use many samples to find the centroid of that sound, e.g. e: or i:
Each row is the centroid of that sound in LPC8.
In telecommunication, the transmitter only transmits the code (1 segment using 1 byte); the receiver reconstructs the sound using that code and the table. The table is only transmitted once.
One segment (512 samples) is compressed into 1 byte, sent from transmitter to receiver.

Code (1 byte)  a1   a2   a3   a4   a5  a6  a7  a8
0 = (e:)       1.2  8.4  3.3  0.2  ..  ..  ..  ..
1 = (i:)       ..   ..   ..   ..   ..  ..  ..  ..
2 = (u:)
:
255            ..   ..   ..   ..   ..  ..  ..  ..
VQ techniques, M code-book vectors
from L training vectors
K-means clustering algorithm
Arbitrarily choose M vectors
Nearest Neighbor search
Centroid update and reassignment, back to above
statement until error is minimum.
Binary split with K-means clustering algorithm,
this method is more efficient.
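
The K-means steps listed above (choose M vectors, nearest neighbor search, centroid update, repeat) can be sketched in C for 2-D vectors as follows. This is a minimal sketch under stated assumptions: the type Vec2, the function kmeans2d and the limit MAX_CODES are hypothetical names, the caller supplies the arbitrarily chosen starting vectors, and the loop stops when no vector changes its class (used here in place of "until error is minimum").

```c
#include <assert.h>
#include <float.h>

#define MAX_CODES 16   /* assumed upper limit on the code-book size M */

typedef struct { double x, y; } Vec2;

static double sq_dist(Vec2 a, Vec2 b)
{
    double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

/* code[0..m-1] holds the starting centroids on entry and the final code-book
   on return; label[i] receives the class of data[i].
   Returns the number of nearest-neighbor passes performed. */
int kmeans2d(const Vec2 *data, int l, Vec2 *code, int m, int *label, int max_iter)
{
    assert(l > 0 && m > 0 && m <= MAX_CODES);
    for (int i = 0; i < l; i++) label[i] = -1;
    int iter = 0;
    while (iter < max_iter) {
        int changed = 0;
        /* nearest neighbor search */
        for (int i = 0; i < l; i++) {
            int best = 0;
            double best_d = DBL_MAX;
            for (int c = 0; c < m; c++) {
                double d = sq_dist(data[i], code[c]);
                if (d < best_d) { best_d = d; best = c; }
            }
            if (label[i] != best) { label[i] = best; changed = 1; }
        }
        iter++;
        if (!changed) break;
        /* centroid update: mean of the vectors assigned to each class */
        Vec2 sum[MAX_CODES] = {{0.0, 0.0}};
        int count[MAX_CODES] = {0};
        for (int i = 0; i < l; i++) {
            sum[label[i]].x += data[i].x;
            sum[label[i]].y += data[i].y;
            count[label[i]]++;
        }
        for (int c = 0; c < m; c++)
            if (count[c] > 0) {
                code[c].x = sum[c].x / count[c];
                code[c].y = sum[c].y / count[c];
            }
    }
    return iter;
}
```

A class that receives no vectors keeps its old centroid in this sketch; other choices (for example re-seeding it) are possible.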

Binary split code-book (assume you use all available samples in building the centroids at all stages of calculations)
Split function: new_centroid = old_centroid(1 +/- e), for 0.01 <= e <= 0.05
[Flowchart]
1. m = 1; find the centroid.
2. Is m < M? Yes: go to 3. No: stop.
3. Split each centroid; m = 2*m; D' = 0.
4. Classify vectors; find centroids; compute D (distortion).
5. Is D - D' < threshold? Yes: go to 2. No: set D' = D and go to 4.
Example: VQ: 240 samples, use VQ to split them into 4 classes
Steps
Step 1: use all data to find the centroid C
Step 2: split the centroid into two, C1 = C(1+e) and C2 = C(1-e); regroup the data into two classes according to the two new centroids C1, C2
Step 3: update the 2 centroids according to the two split groups; each group finds a new centroid
Step 4: split the 2 centroids again to become 4 centroids
Step 5: regroup and update the 4 new centroids, done (final result)
Tutorials for VQ
Given 4 speech frames, each is described by a
2-D vector (x,y) as below.
P1=(1.2,8.8);P2=(1.8,6.9);P3=(7.2,1.5);P4=(9.1,0.3)
Find the code-book of size two using K-means
method. (Answer see Appendix A.1)
Write Pseudo code (or a C program segment)
to build the code book of size 4 from 100 2D-
vectors, the input vectors (x,y) are stored in
int x[100] and int y[100].

Exercise 7
Given 4 speech frames, each is described by a 2-D
vector (x,y) as below.
P1=(1.2,8.8); P2=(1.8,6.9); P3=(7.2,1.5);
P4=(9.1,0.3).
Use K-means method to find the two centroids.
Use Binary split K-means method to find the two centroids.
Assume you use all available samples in building the
centroids at all stages of calculations
A raw speech signal is sampled at 10KHz/8-bit. Estimate
compression ratio (=raw data storage/compressed data
storage) if LPC-order is 10 and frame size is 25ms with no
overlapping samples.

Example of speech signal analysis

[Figure: speech signal divided into overlapping frames; one frame = N = 512 samples, adjacent frames separated by n samples; the 1st, 2nd, 3rd, 4th, 5th, ... frames each give one set of LPC -> code word]
Chapter 4 : Recognition
Procedures
Recognition procedure
Dynamic programming
HMM

Chapter 4 : Recognition Procedures
Preprocessing for recognition
endpoint detection
Pre-emphasis
windowing
distortion measure methods
Comparison methods
Vector quantization
Dynamic programming
Hidden Markov Model

LPC processor for a 10-word isolated
speech recognition system
End-point detection
Frame blocking and Windowing
Auto-correlation analysis
LPC analysis,
Cepstral coefficients,
Weighting
Temporal cepstral derivation

End point detection
To determine the start and end points of the
speech sound
It is not always easy since the energy at the start of the sound is usually low.
Determined by energy and zero-crossing rate
[Figure: recorded signal s(n) with the detected end-points marked]
A simple End point detection algorithm
At the beginning the energy level is low.
If the energy level and zero-crossing rate of 3
successive frames is high it is a starting point.
After the starting point if the energy and zero-
crossing rate for 5 successive frames are low
it is the end point.
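
The simple algorithm above can be sketched in C once the energy and zero-crossing rate of every frame are known. Several details are assumptions made for illustration only: the thresholds energy_thr and zcr_thr (the notes do not say how "high" and "low" are decided), the function name detect_endpoints, and the choice to report the first frame of each run of 3 or 5 frames as the start or end point.

```c
#include <assert.h>

/* Scan per-frame energy and zero-crossing rate.
   Start point: first frame of the first run of 3 successive "high" frames.
   End point: first frame of the first run of 5 successive "low" frames after
   the start. Returns 1 if a start point was found; *end_frame stays -1 when
   no end point is found. Thresholds are hypothetical tuning parameters. */
int detect_endpoints(const double *energy, const double *zcr, int n_frames,
                     double energy_thr, double zcr_thr,
                     int *start_frame, int *end_frame)
{
    int high_run = 0, low_run = 0, started = 0;
    assert(n_frames >= 0);
    *start_frame = -1;
    *end_frame = -1;
    for (int f = 0; f < n_frames; f++) {
        int high = energy[f] > energy_thr && zcr[f] > zcr_thr;
        int low = energy[f] <= energy_thr && zcr[f] <= zcr_thr;
        if (!started) {
            high_run = high ? high_run + 1 : 0;
            if (high_run == 3) { started = 1; *start_frame = f - 2; }
        } else {
            low_run = low ? low_run + 1 : 0;
            if (low_run == 5) { *end_frame = f - 4; return 1; }
        }
    }
    return started;
}
```

In practice the thresholds would be set from the background noise measured at the beginning of the recording, where the energy level is low.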

Energy calculation
E(n) = s(n).s(n)
For a frame of size N,
The program to calculate the energy level:
for (n = 0; n < N; n++)
{
    Energy[n] = s[n] * s[n];
}

Energy plot

Zero crossing calculation
A zero-crossing point is obtained when
sign[s(n)] != sign[s(n-1)]
[Figure: a waveform s(n) against n with its zero-crossing points numbered 1 to 6; the number of zero-crossing points of s(n) is 6]
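
The zero-crossing rule sign[s(n)] != sign[s(n-1)] can be counted over a frame with a short C sketch. The function name count_zero_crossings is an assumption, and so is the convention that a sample equal to zero counts as positive.

```c
#include <assert.h>

/* Count the samples n (1 <= n < len) where sign[s(n)] differs from
   sign[s(n-1)]. A sample equal to zero is treated as positive here
   (an assumed convention, not stated in the notes). */
int count_zero_crossings(const int *s, int len)
{
    int count = 0;
    assert(len >= 0);
    for (int n = 1; n < len; n++) {
        int neg_now = s[n] < 0;
        int neg_prev = s[n - 1] < 0;
        if (neg_now != neg_prev)
            count++;
    }
    return count;
}
```

Dividing the count by the frame length gives a zero-crossing rate that can be compared between frames of different sizes.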

Pre-emphasis is used to reduce noise, average transmission conditions and to average signal spectrum.
$$ \tilde{S}(n) = S(n) - \tilde{a}\, S(n-1) $$
$$ 0.9 \le \tilde{a} \le 1.0, \quad \text{typically } \tilde{a} = 0.95 $$
For S(0), S(1), S(2), ..., the value $\tilde{S}(0)$ does not exist and is never used.
Tutorial: write a program segment to perform pre-emphasis on a speech frame stored in an array int s[1000].

Pre-emphasis program segment
input=sig1, output=sig2
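void pre_emphasize(const char *sig1, float *sig2)
{
    int j;
    /* the first sample has no predecessor, so it is copied unchanged */
    sig2[0] = (float)sig1[0];
    /* sig2[j] = sig1[j] - 0.95 * sig1[j-1] for the rest of the frame */
    for (j = 1; j < WINDOW; j++)
        sig2[j] = (float)sig1[j] - 0.95f * (float)sig1[j-1];
}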

Pre-emphasis

Frame blocking and Windowing
To choose the frame size (N samples) and the separation of adjacent frames (m samples).
E.g. for a 16 kHz sampling signal, a 10 ms window has N = 160 samples, m = 40 samples.
[Figure: signal s_n against n; window l = 1 and window l = 2, each of length N, with adjacent windows separated by m samples]
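
The frame-blocking arithmetic can be sketched in C as follows. The function names count_frames and frame_start are assumptions, frames are numbered from 0 here (the figure numbers windows from l = 1), and only complete frames are counted.

```c
#include <assert.h>

/* Number of complete frames of n samples that fit in a signal of total
   samples when adjacent frames start m samples apart. */
int count_frames(int total, int n, int m)
{
    assert(n > 0 && m > 0 && total >= 0);
    if (total < n)
        return 0;
    return 1 + (total - n) / m;
}

/* Index of the first sample of frame l, with l = 0 for the first frame. */
int frame_start(int l, int m)
{
    assert(l >= 0 && m > 0);
    return l * m;
}
```

For one second of signal at 16 kHz (16000 samples) with N = 160 and m = 40, count_frames(16000, 160, 40) gives 397 complete frames.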
Windowing
To smooth out the discontinuities at the beginning and
end.
Hamming or Hanning windows can be used.
Hamming window
$$ \tilde{S}(n) = S(n) \cdot W(n), \quad W(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 $$
Tutorial: write a program segment to find the result of
passing a speech frame, stored in an array int s[1000],
into the Hamming window.

Effect of Hamming window
[Figure: the signal S(n), the window W(n) = 0.54 - 0.46 cos(2πn/(N-1)) for 0 <= n <= N-1, and their product, the windowed signal $\tilde{S}(n) = S(n) \cdot W(n)$]

Matlab code segment
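x1 = wavread('violin3.wav');
N = 512;   % frame size in samples, as used earlier in these notes
for i = 1:N
    % Matlab indices start at 1, so n = i-1 runs from 0 to N-1
    hamming_window(i) = 0.54 - 0.46*cos(2*pi*(i-1)/(N-1));
    y1(i) = hamming_window(i)*x1(i);
end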

Cepstrum Vs spectrum
The spectrum is sensitive to glottal excitation (E), but we are only interested in the filter H.
In the frequency domain
Log (X) = Log (E) + Log (H)
Cepstrum = Fourier transform of the log of the signal's power spectrum
In the cepstrum, the Log(E) term can easily be isolated and removed.

Auto-correlation analysis
Auto-correlation of every frame (l = 1, 2, ...) of a windowed signal is calculated.
If the required output is p-th order LPC, the auto-correlation for the l-th frame is
$$ r_l(m) = \sum_{n=0}^{N-1-m} \tilde{S}_l(n)\, \tilde{S}_l(n+m), \quad m = 0, 1, \dots, p $$
LPC to Cepstral coefficients conversion
Cepstral coefficients are more accurate in describing the characteristics of the speech signal.
Normally cepstral coefficients of order 1 <= m <= p are enough to describe the speech signal.
Calculate c1, c2, c3, ..., cp from a1, a2, a3, ..., ap:
$$ c_0 = r_0 $$
$$ c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad 1 \le m \le p $$
$$ c_m = \sum_{k=m-p}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad m > p \ \text{(if needed)} $$
Ref:http://www.clear.rice.edu/elec532/PROJECTS98/speech/cepstrum/cepstrum.html
Distortion measure - difference between two signals
Measure how different two signals are:
Cepstral distance between a frame (described by cepstral coefficients c_1, c_2, ..., c_p) and the other frame (c'_1, c'_2, ..., c'_p) is
$$ d^2 = \sum_{n=1}^{p} (c_n - c'_n)^2 $$
Weighted cepstral distance, which gives different weighting to different cepstral coefficients (more accurate):
$$ d^2 = \sum_{n=1}^{p} w(n)\, (c_n - c'_n)^2 $$
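
Both distances can be computed by one small C function, sketched here for illustration. The function name cepstral_distance_sq and the convention that a NULL weight array means "no weighting" are assumptions; index 0 of each array is left unused so that the indices match c1..cp.

```c
#include <assert.h>
#include <stddef.h>

/* Squared cepstral distance between c[1..p] and c_ref[1..p].
   If w is NULL the plain distance is returned; otherwise term n is
   multiplied by w[n]. Index 0 of every array is unused. */
double cepstral_distance_sq(const double *c, const double *c_ref,
                            const double *w, int p)
{
    double d2 = 0.0;
    assert(p >= 1);
    for (int n = 1; n <= p; n++) {
        double diff = c[n] - c_ref[n];
        d2 += (w ? w[n] : 1.0) * diff * diff;
    }
    return d2;
}
```

The returned value can be used directly as the distortion dist(i,j) between input frame i and reference frame j.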

Matching method: Dynamic
programming DP
Correlation is a simple method for pattern matching, BUT:
The most difficult problem in speech
recognition is time alignment. No two speech
sounds are exactly the same even produced
by the same person.
Align the speech features by an elastic
matching method -- DP.

Exercise

Small Vocabulary (10 words) DP speech
recognition system
Train the system, each word has a reference
vector
For an unknown input, compare with each
reference using DP, and the one with the
minimum distance is the result.
Training is easy but recognition takes longer
time.

Dynamic programming algo.
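Step 1: calculate the distortion matrix dist(i,j)
Step 2: calculate the accumulated matrix by using
$$ D(i,j) = dist(i,j) + \min \{ D(i-1,j-1),\ D(i-1,j),\ D(i,j-1) \} $$
[Figure: cell D(i,j) and the three neighbours it is computed from: D(i-1,j) to its left, D(i,j-1) below it and D(i-1,j-1) diagonally below left]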
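
Step 2 can be sketched in C as below. The recurrence is the one given above; the treatment of the first row and first column (where some neighbours do not exist, so the cost is accumulated along the edge) is an assumption, as are the names accumulate, min3 and DP_MAX. Indices start from 0 here, with j for the reference and i for the unknown input.

```c
#include <assert.h>

#define DP_MAX 32   /* assumed maximum sequence length for this sketch */

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* dist[j][i] is the distortion between reference element j and input
   element i; acc[j][i] receives the accumulated score D(i,j). */
void accumulate(int dist[][DP_MAX], int acc[][DP_MAX], int n_i, int n_j)
{
    assert(n_i >= 1 && n_i <= DP_MAX && n_j >= 1 && n_j <= DP_MAX);
    for (int j = 0; j < n_j; j++) {
        for (int i = 0; i < n_i; i++) {
            if (i == 0 && j == 0)
                acc[j][i] = dist[j][i];                    /* starting cell */
            else if (j == 0)
                acc[j][i] = dist[j][i] + acc[j][i - 1];    /* bottom row */
            else if (i == 0)
                acc[j][i] = dist[j][i] + acc[j - 1][i];    /* left column */
            else
                acc[j][i] = dist[j][i] + min3(acc[j - 1][i - 1],
                                              acc[j][i - 1],
                                              acc[j - 1][i]);
        }
    }
}
```

Row j = 0 of the arrays corresponds to the bottom row of the matrices when they are drawn with the reference on the vertical axis.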

Example in DP (LEA, Trends in speech recognition.)
Step 1: distortion matrix (reference on the j-axis, unknown input on the i-axis)

            R   9   6   2   2
            O   8   1   8   6
Reference   O   8   1   8   6
            F   2   8   7   7
            F   1   7   7   3
                F   O   R   R
                unknown input

Step 2: accumulated score matrix (D)

            R   28  11  7   9
            O   19  5   12  18
Reference   O   11  4   12  18
(j-axis)    F   3   9   15  22
            F   1   8   15  18
                F   O   R   R
                unknown input (i-axis)
To find the optimal path in the accumulated
matrix
Starting from the top row and right-most column, find the lowest cost D(i,j)_t: it is found to be the cell at (i,j) = (3,5), D(3,5) = 7, in the top row.
From the lowest cost position p(i,j)_t, find the next position p(i,j)_{t-1} = argument_min_i,j{D(i-1,j), D(i-1,j-1), D(i,j-1)}.
E.g. p(i,j)_{t-1} = argument_min_i,j{D(2,5), D(2,4), D(3,4)} = argument_min_i,j{11, 5, 12} = (3-1, 5-1) = (2,4); the cell that contains 5 is selected.
Repeat the above until the path reaches the left-most column or the lowest row.
Note: argument_min_i,j{cell1, cell2, cell3} means the argument i,j of
the cell with the lowest value is selected.

Optimal path
It should be from any element in the top row or right
most column to any element in the bottom row or left
most column.
The reason is noise may be corrupting elements at
the beginning or the end of the input sequence.
However, in actual processing the path should be restrained near the 45 degree diagonal (from bottom left to top right); see the attached diagram. The path cannot pass through the restricted regions. The user can set these regions manually. That is a way to prohibit unrecognizable matches. See next page.

Optimal path and restricted regions.

Example of an isolated 10-word
recognition system
A word (1 second) is recorded 5 times to train
the system, so there are 5x10 templates.
Sampling freq. = 16 kHz, 16-bit, so each 1-second recording has 16,000 integers.
Each frame is 20 ms, overlapping 50%, so there are 100 frames in 1 word = 1 second.
For 12th-order LPC, each frame generates 12 LPC floating point numbers, hence 12 cepstral coefficients C1, C2, ..., C12.

So there are 5x10 samples = 5x10x100 frames
Each frame is described by a vector of 12 dimensions (12 cepstral coefficients = 12 floating point numbers)
Put all frames together to train a code-book of size 64, so each frame can be represented by an index ranging from 1 to 64
Use DP to compare an input with each of the
templates and obtain the result which has the
minimum distortion.

Exercise for DP
The VQ-LPC codes of the speech sounds of 'YES' and 'NO' and an unknown input are shown. Is the input 'YES' or 'NO'? (Answer: it is 'YES')
$$ \text{distortion (dist)} = |x - x'|^2 $$

'YES'  2 4 6 9 3 4 5  8 1
'NO'   7 6 2 4 7 6 10 4 5
Input  3 5 5 8 4 2 3  7 2
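Distortion matrix for YES (only the first column, for the first input code 3, is shown)

YES code   dist
1          4
8          25
5          4
4          1
3          0
9          36
6          9
4          1
2          1
Input:     3 5 5 8 4 2 3 7 2

Accumulation matrix for YES (only the first column is shown)

YES code   accumulated
1          81
8          77
5          52
4          48
3          47
9          47
6          11
4          2
2          1
Input:     3 5 5 8 4 2 3 7 2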
Conclusion
Speech processing is important in
communication and AI systems to build more
user friendly interfaces.
Already successful in clean (not noisy) environments.
But there is still a long way to go before it is comparable to human performance.

Appendix A.1: K-means method to find the two
centroids
P1=(1.2,8.8); P2=(1.8,6.9); P3=(7.2,1.5); P4=(9.1,0.3)
Arbitrarily choose P1 and P4 as the 2 centroids. So
C1=(1.2,8.8); C2=(9.1,0.3).
Nearest neighbor search; find closest centroid
P1-->C1 ; P2-->C1 ; P3-->C2 ; P4-->C2
Update centroids
C1=Mean(P1,P2)=(1.5,7.85); C2=Mean(P3,P4)=(8.15,0.9).
Nearest neighbor search again. No further changes,
so VQ vectors =(1.5,7.85) and (8.15,0.9)
Draw the diagrams to show the steps.
[Figure: K-means method; P1 = (1.2,8.8) and P2 = (1.8,6.9) are grouped around C1 = (1.5,7.85); P3 = (7.2,1.5) and P4 = (9.1,0.3) are grouped around C2 = (8.15,0.9)]
Appendix A.2: Binary split K-means method when the number of required centroids is fixed (see binary_split2_a2.m.txt) (assume you use all available samples in building the centroids at all stages of calculations)
P1=(1.2,8.8); P2=(1.8,6.9); P3=(7.2,1.5); P4=(9.1,0.3)
First centroid C1 = ((1.2+1.8+7.2+9.1)/4, (8.8+6.9+1.5+0.3)/4) = (4.825,4.375)
Use e = 0.02 to find the two new centroids
Step 1: CCa = C1(1+e) = (4.825x1.02, 4.375x1.02) = (4.9215,4.4625)
CCb = C1(1-e) = (4.825x0.98, 4.375x0.98) = (4.7285,4.2875)
The function dist(Pi,CCx) = Euclidean distance between Pi and CCx

points  dist to CCa  - dist to CCb  = diff     Group to
P1      5.7152       - 5.7283       = -0.0131  CCa
P2      3.9605       - 3.9244       =  0.036   CCb
P3      3.7374       - 3.7254       =  0.012   CCb
P4      5.8980       - 5.9169       = -0.019   CCa

Nearest neighbor search to form two groups. Find the centroid for each group using the K-means method. Then split again and find the 2 new centroids. P1, P4 -> CCa group; P2, P3 -> CCb group
Step 2: CCCa = mean(P1,P4), CCCb = mean(P3,P2);
CCCa = (5.15,4.55)
CCCb = (4.50,4.20)
Run K-means again based on the two centroids CCCa, CCCb for the whole pool: P1, P2, P3, P4.

points  dist to CCCa  - dist to CCCb  = diff2    Group to
P1      5.8022        - 5.6613        =  0.1409  CCCb
P2      4.0921        - 3.8184        =  0.2737  CCCb
P3      3.6749        - 3.8184        = -0.1435  CCCa
P4      5.8022        - 6.0308        = -0.2286  CCCa

Regrouping we get the final result
CCCCa = (P3+P4)/2 = (8.15,0.9); CCCCb = (P1+P2)/2 = (1.5,7.85)
[Figure, Step 1: binary split K-means method when the number of required centroids is fixed (2 here). The points P1 = (1.2,8.8), P2 = (1.8,6.9), P3 = (7.2,1.5), P4 = (9.1,0.3) and the first centroid C1 = (4.825,4.375), which is split into CCa = C1(1+e) = (4.9215,4.4625) and CCb = C1(1-e) = (4.7285,4.2875); an arrow marks the direction of the split]
[Figure, Step 2: CCCa = (5.15,4.55) and CCCb = (4.50,4.20) are formed; the final centroids are CCCCb = (1.5,7.85) near P1 and P2, and CCCCa = (8.15,0.9) near P3 and P4]
Appendix A.3. Cepstrum Vs spectrum
The spectrum is sensitive to glottal excitation (E), but we are only interested in the filter H.
In the frequency domain
Log (X) = Log (E) + Log (H)
Cepstrum = Fourier transform of the log of the signal's power spectrum
In the cepstrum, the Log(E) term can easily be isolated and removed.

Appendix A4: LPC analysis for a frame based on the auto-correlation values r(0), ..., r(p), using Durbin's method (see P.115 [Rabiner 93])
LPC parameters a1, a2, ..., ap can be obtained by applying the following formulas for i = 0 to i = p:
$$ E^{(0)} = r(0) $$
$$ k_i = \frac{r(i) - \sum_{j=1}^{i-1} a_j^{(i-1)}\, r(i-j)}{E^{(i-1)}}, \quad 1 \le i \le p $$
$$ a_i^{(i)} = k_i $$
$$ a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \quad 1 \le j \le i-1 $$
$$ E^{(i)} = (1 - k_i^2)\, E^{(i-1)} $$
Finally, the LPC coefficients are $a_m = a_m^{(p)}$, $1 \le m \le p$.
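Program to convert LPC coefficients to cepstral coefficients
void cepstrum_coeff(float *coeff)
{
    int i, n, k; float sum, h[ORDER+1];
    h[0] = coeff[0]; h[1] = coeff[1];
    /* c_n = a_n + sum over k = 1..n-1 of (k/n) * c_k * a_(n-k) */
    for (n = 2; n <= ORDER; n++) {
        sum = 0.0;
        for (k = 1; k < n; k++)
            sum += (float)k / (float)n * h[k] * coeff[n-k];
        h[n] = coeff[n] + sum;
    }
    /* weight each coefficient; PI_10 is assumed to be defined elsewhere
       as pi/10 (that is, pi/ORDER for ORDER = 10) */
    for (i = 1; i <= ORDER; i++)
        coeff[i-1] = h[i] * (1 + ORDER/2 * sin(PI_10 * i));
}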

Define Cepstrum: also called the spectrum
of a spectrum
The power cepstrum (of a signal) is the squared magnitude of the Fourier transform (FT) of the logarithm of the squared magnitude of the Fourier transform of a signal. From Norton, Michael; Karczub, Denis (2003). Fundamentals of Noise and Vibration Analysis for Engineers. Cambridge University Press.
Algorithm: signal -> FT -> abs() -> square -> log10 -> FT -> abs() -> square -> power cepstrum
$$ \text{cepstrum}\{f(t)\} = \left| FT\left\{ \log_{10}\left( \left| FT\{ f(t) \} \right|^2 \right) \right\} \right|^2 $$
http://mi.eng.cam.ac.uk/~ajr/SA95/node33.html
http://en.wikipedia.org/wiki/Cepstrum
