MFCC and Vector Quantization for Arabic Fricatives

Speech/Speaker recognition

Fatma Zohra Chelali*, Amar Djeradi

Speech Communication and Signal Processing Laboratory
Faculty of Electronics Engineering and Computer Science
University of Science and Technology Houari Boumedienne (USTHB)
Box n°32, El Alia, 16111, Algiers, Algeria
Chelali_zohra@yahoo.fr, adjeradi05@yahoo.com

Abstract—This article develops a speaker-dependent Arabic phoneme recognition system using MFCC analysis and the VQ-LBG algorithm. The system is examined with and without vector quantization in order to analyze the effect of compression in the acoustic parameterization phase. Our experimental results show that vector quantization using a codebook of size 16 achieves good results compared to the system without quantization for a majority of the phonemes studied.

Keywords-component; MFCC; VQ; speech recognition; speaker identification.

I. INTRODUCTION

Speech is the primary communication medium between people. For decades, human beings have been dreaming of an intelligent machine which can master natural speech. Automatic speaker recognition techniques make it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers [1, 2, 11].

These techniques can be classified into identification and verification. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification is the process of accepting or rejecting the identity claim of a speaker [2].

Speaker recognition methods can be divided into text-independent and text-dependent. In a text-independent method, speaker models capture characteristics of a speaker's speech irrespective of what is being said. In a text-dependent method, the recognition of the speaker's identity is based on his/her speaking specific phrases, such as passwords, card numbers, PIN codes, etc. [1, 2, 11].

Speaker recognition systems contain two main processes: feature extraction and feature matching. Feature extraction extracts a small amount of data from the voice signal that can be used later to represent each speaker. Feature matching is the procedure that identifies the unknown speaker by comparing the features extracted from his/her voice input with those of a set of known speakers [1, 2].

One of the first decisions in any pattern recognition system is the choice of which features to use and how exactly to represent the basic signal to be classified, in order to make the classification task easiest [7].

A wide range of possibilities exists for representing the speech signal in automatic speech and speaker recognition, with spectral features such as Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), Mel-Frequency Cepstral Coefficients (MFCC), and others. The most popular feature representations currently used are the mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) [1, 3, 7].

Among these, we used MFCC because it is the best known and most popular. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies have been used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz [1, 2, 6]. Indeed, psychophysical studies have shown that human perception of the frequency content of sounds for speech signals does not follow a linear scale [1, 2, 6].

There are several techniques for speaker recognition, such as the Hidden Markov Model (HMM) and Artificial Neural Networks (ANN). We used VQ because of its lower computational complexity [1]. Any speaker recognition system contains two main modules, feature extraction and feature matching [1, 2]. The speaker-specific features are extracted using a Mel-Frequency Cepstrum Coefficient (MFCC) processor.


The resulting set of mel-frequency cepstrum coefficients is called an acoustic vector [3]; these acoustic vectors are the extracted features of the speakers. They are used in feature matching by the vector quantization technique, the typical feature matching technique, in which a VQ codebook is generated from the training data. Test data are then matched against the training data by searching for the nearest neighbor [6].

Our article presents an automatic speaker recognition system using MFCC feature vectors and vector quantization, applied to Arabic fricatives. Correlation and the k-nearest-neighbor rule with Euclidean distance are used for the classification decision.

II. DESCRIPTION OF THE MFCC-VQ PROCESS

A. The extraction process of cepstral features

Speech feature extraction is one of the important blocks of the speaker recognition problem. The process of computing MFCCs is described as follows. After framing the speech signal, the next step is to window each individual frame so as to minimize the signal discontinuities at the beginning and end of the frame [5].

A diagram of the structure of an MFCC processor is given in Figure 1. The main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs are shown to be less susceptible to the aforementioned variations than the speech waveforms themselves [2, 5, 6].

Figure 1. Block diagram of the MFCC processor (continuous speech → frame blocking → windowing → FFT → mel-frequency wrapping → mel cepstrum → set of mel cepstrum coefficients)
Frame Blocking
In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples. The second frame begins M samples after the first frame and overlaps it by N − M samples. Similarly, the third frame begins 2M samples after the first frame (or M samples after the second frame) and overlaps it by N − 2M samples. This process continues until all the speech is accounted for within one or more frames. Typical values are N = 256 and M = 100 [2, 6].

Windowing
The next processing step is windowing, by means of which the signal discontinuities at the beginning and end of each frame are minimized. If we define the window as w(n), 0 ≤ n ≤ N − 1, where N is the number of samples in the frame, then the result of windowing is the signal [2, 6]

$$y(n) = x(n)\,w(n), \qquad 0 \le n \le N - 1.$$

We used the generalized Hamming window, chiefly for its ease of mathematical computation, described as

$$w(n) = \begin{cases} \alpha + (1 - \alpha)\cos(2\pi n / N), & |n| \le N/2 \\ 0, & \text{elsewhere} \end{cases} \qquad (1)$$

With α = 0.54 we obtain the Hamming window:

$$w(n) = 0.54 - 0.46\,\cos\!\left(\frac{2\pi n}{N}\right), \qquad 0 \le n \le N - 1. \qquad (2)$$

Figure 2. The generalized Hamming window

Fast Fourier Transform (FFT)
The Fast Fourier Transform converts a signal from the time domain into the frequency domain. The FFT is a fast algorithm implementing the Discrete Fourier Transform (DFT), which is defined on a set of N samples {x_k} as follows:

$$X_n = \sum_{k=0}^{N-1} x_k\, e^{-2j\pi k n / N}, \qquad n = 0, 1, \ldots, N - 1. \qquad (3)$$

In general the X_n are complex numbers. The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < F_s/2 correspond to the values 1 ≤ n ≤ N/2 − 1, while negative frequencies −F_s/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N − 1. Here F_s denotes the sampling frequency [2, 6, 11].
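To make the front-end concrete, the following Python/NumPy sketch implements the frame blocking, Hamming windowing (Eq. 2), and FFT steps above, with the typical values N = 256 and M = 100; the function names and the random test signal are ours, for illustration only.

```python
import numpy as np

def frame_block(signal, N=256, M=100):
    """Block a 1-D signal into overlapping frames of N samples,
    consecutive frames being shifted by M samples (overlap N - M)."""
    num_frames = 1 + max(0, (len(signal) - N) // M)
    return np.stack([signal[i * M : i * M + N] for i in range(num_frames)])

def windowed_power_spectra(frames, N=256):
    """Apply the Hamming window w(n) = 0.54 - 0.46 cos(2 pi n / N) of
    Eq. (2) to each frame and return the one-sided FFT power spectra."""
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / N)
    y = frames * w                        # y(n) = x(n) w(n)
    X = np.fft.rfft(y, axis=1)            # DFT of Eq. (3), one-sided
    return np.abs(X) ** 2                 # short-term power spectrum

# Illustration on a synthetic utterance
x = np.random.randn(7000)
frames = frame_block(x)                   # -> shape (68, 256)
power = windowed_power_spectra(frames)    # -> shape (68, 129)
```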
Mel-Frequency Wrapping
Psychophysical studies have shown that human perception of the frequency content of sounds for speech signals does not follow a linear scale. Thus, for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz [2, 6, 11]:

$$F_{\mathrm{MEL}} = 2595 \cdot \log_{10}\!\left(1 + \frac{f_{\mathrm{Hz}}}{700}\right) \qquad (4)$$

One approach to simulating the subjective spectrum is to use a filter bank spaced uniformly on the mel scale. Each filter has a triangular band-pass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20 [2, 3, 6].

Figure 3. Mel-spaced filter bank with 20 filters (filter responses over 0–12000 Hz)
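The mel wrapping step can be sketched as follows: Eq. (4) and its inverse place K = 20 triangular filters uniformly on the mel scale. The paper does not give the filter-bank construction details, so the band edges (0 Hz to Fs/2) and the unit-peak triangles below are our assumptions.

```python
import numpy as np

def hz_to_mel(f):
    """Eq. (4): subjective pitch in mels for a frequency f in Hz."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of Eq. (4)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(K=20, N_fft=256, fs=22000):
    """K triangular band-pass filters, uniformly spaced on the mel scale
    between 0 Hz and fs/2, sampled on the one-sided FFT frequency grid."""
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), K + 2))
    bins = np.fft.rfftfreq(N_fft, d=1.0 / fs)
    fbank = np.zeros((K, bins.size))
    for k in range(K):
        lo, mid, hi = edges[k], edges[k + 1], edges[k + 2]
        up = (bins - lo) / (mid - lo)      # rising edge of the triangle
        down = (hi - bins) / (hi - mid)    # falling edge
        fbank[k] = np.clip(np.minimum(up, down), 0.0, None)
    return fbank

# Mel spectrum coefficients a_1..a_K for one frame's power spectrum
fbank = mel_filterbank()                            # (20, 129)
power = np.abs(np.fft.rfft(np.random.randn(256))) ** 2
mel_energies = fbank @ power                        # K = 20 outputs
```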

Cepstrum
In the final step, the log mel spectrum is converted back to time. The result is called the mel-frequency cepstrum coefficients (MFCCs). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients are real numbers (and so are their logarithms), they may be converted to the time domain using the Discrete Cosine Transform (DCT). The MFCCs may be calculated using the following equation [2, 6]:

$$c(p) = \sum_{m=1}^{M_l} \log(a_m)\,\cos\!\left(p\left(m - \tfrac{1}{2}\right)\frac{\pi}{M_l}\right), \qquad p = 1, 2, \ldots, M_l \qquad (5)$$

where the a_m are the outputs of the M_l mel filters. K, the number of mel cepstrum coefficients, is typically chosen as 20 (the number of filters). This set of coefficients is called an acoustic vector [2, 6]. Note that we exclude the first component, c̃_0, from the DCT, since it represents the mean value of the input signal, which carries little speaker-specific information.
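As a sketch of this final step, the DCT of Eq. (5) maps the log filter-bank outputs a_1, ..., a_Ml of one frame to an acoustic vector; as stated above, the zeroth (mean) component is excluded. The helper name and test values are illustrative.

```python
import numpy as np

def cepstrum(mel_energies):
    """Eq. (5): c(p) = sum_{m=1..Ml} log(a_m) cos(p (m - 1/2) pi / Ml)
    for p = 1..Ml; p = 0, the mean of the log energies, is excluded."""
    Ml = len(mel_energies)
    log_a = np.log(mel_energies)
    m = np.arange(1, Ml + 1)               # filter index m = 1..Ml
    return np.array([np.sum(log_a * np.cos(p * (m - 0.5) * np.pi / Ml))
                     for p in range(1, Ml + 1)])

# One acoustic vector of 20 MFCCs from 20 (positive) filter outputs
acoustic_vector = cepstrum(np.random.rand(20) + 0.01)   # shape (20,)
```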
B. Vector Quantization Method of Feature Matching

VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook. Figure 4 illustrates this process [2].

The objective of vector quantization for a given training data set T is to design (discover) the optimal codebook, containing a predetermined number of reference code vectors, which guarantees minimization of the chosen distortion metric for all encoded patterns from the data set. Each code vector in the codebook has an associated integer index used for referencing [2, 4, 11].

Figure 4. Conceptual diagram illustrating vector quantization codebook formation [2] (acoustic vector samples and codeword centroids for speakers 1 and 2, with the VQ distortion of a sample indicated)

The circles refer to the acoustic vectors from speaker 1, while the triangles are from speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors [2, 4, 11]. The resulting codewords (centroids) are shown as black circles and black triangles for speakers 1 and 2, respectively. The distance from an acoustic vector to the closest codeword of a codebook is called the VQ distortion. In the testing phase, an input utterance from an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total VQ distortion is identified [1, 2, 8, 11].
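The testing-phase rule just described is easy to state in code: quantize the unknown utterance's acoustic vectors with each speaker's codebook and pick the speaker with the smallest total VQ distortion. This sketch assumes squared Euclidean distance; the function names and test data are illustrative.

```python
import numpy as np

def vq_distortion(vectors, codebook):
    """Average squared Euclidean distance from each acoustic vector to
    its nearest codeword: the total VQ distortion of the utterance."""
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()

def identify_speaker(vectors, codebooks):
    """Return the speaker whose codebook gives the smallest distortion."""
    return min(codebooks, key=lambda s: vq_distortion(vectors, codebooks[s]))

# Example: 44 frames of 20 MFCCs against two 16-codeword codebooks
test = np.random.randn(44, 20)
books = {"speaker1": np.random.randn(16, 20),
         "speaker2": np.random.randn(16, 20)}
print(identify_speaker(test, books))
```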
In 1980, Linde, Buzo, and Gray (LBG) proposed a VQ design algorithm based on a training sequence. A VQ designed using this algorithm is referred to in the literature as an LBG-VQ.

LBG (Linde, Buzo and Gray) Algorithm
The LBG algorithm is a finite sequence of steps in which, at every step, a new quantizer with a total distortion less than or equal to that of the previous one is produced. We can distinguish two phases: the initialization of the codebook and its optimization [8].

The codebook optimization starts from an initial codebook and, after some iterations, generates a final codebook with a distortion corresponding to a local minimum. The steps of the LBG algorithm are the following [8].

a- Initialization. The following values are fixed:
N_C: the number of codewords;
ε ≥ 0: the precision of the optimization process;
Y_0: the initial codebook;
the training data set

$$X = \{\, x_j \; ; \; j = 1, \ldots, N_p \,\} \qquad (6)$$

Additionally, the iteration counter is initialized: m = 0.

b- Partition calculation. Given the codebook Y_m, the partition P(Y_m) is calculated according to the nearest-neighbour condition [8]:

$$S_i = \{\, x \in X : d(x, y_i) \le d(x, y_j),\; j = 1, 2, \ldots, N_C,\; j \ne i \,\} \qquad (7)$$

c- Termination condition check. The quantizer distortion D_m = D(Y_m, P(Y_m)) is calculated as the mean quantization error:

$$\mathrm{MQE} = D(Y, S) = \frac{1}{N_p} \sum_{p=1}^{N_p} d\big(x_p, q(x_p)\big) = \frac{1}{N_p} \sum_{i=1}^{N_C} D_i \qquad (8)$$

where D_i denotes the total distortion of the i-th cell. If (D_{m−1} − D_m)/D_m ≤ ε, the optimization ends and Y_m is the final returned codebook [8].

d- New codebook calculation. Given the partition P(Y_m), the new codebook is calculated according to the centroid condition. In symbols:

$$Y_{m+1} = \bar{X}\big(P(Y_m)\big) \qquad (9)$$

The counter m is then increased by one, and the procedure continues from step b [8].

When the distortion is minimized, redistribution does not result in any movement of vectors among the clusters; this can be used as an indicator for terminating the algorithm. The total distortion can also be used as an indicator of convergence: upon convergence, the total distortion does not change as a result of redistribution. Note that in each iteration the K-means algorithm estimates the means of all M clusters [1, 2, 8].
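The iterative core of steps b-d, with the relative-distortion stopping rule of Eq. (8), might be implemented as below. The paper does not specify how Y_0 is chosen, so initializing the codebook by random sampling from the training set is our assumption.

```python
import numpy as np

def lbg(X, NC=16, eps=1e-3, max_iter=100, seed=0):
    """LBG codebook design: alternate the nearest-neighbour partition
    (Eq. 7) and the centroid update (Eq. 9) until the relative drop in
    the mean quantization error (Eq. 8) is at most eps."""
    rng = np.random.default_rng(seed)
    Y = X[rng.choice(len(X), NC, replace=False)]    # initial codebook Y0
    D_prev = np.inf
    for _ in range(max_iter):
        # b- partition: nearest codeword for every training vector
        d = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # c- termination check on the mean quantization error
        D = d[np.arange(len(X)), labels].mean()
        if (D_prev - D) / D <= eps:
            break
        D_prev = D
        # d- new codebook: centroid of each non-empty cell
        for i in range(NC):
            if np.any(labels == i):
                Y[i] = X[labels == i].mean(axis=0)
    return Y

# A size-16 codebook from 20-dimensional training vectors
codebook = lbg(np.random.randn(500, 20), NC=16)     # -> (16, 20)
```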
III. SIMULATION RESULTS

In this section we analyze the feature extraction process, describing the MFCC (mel-scale frequency cepstral coefficient) acoustic features and the VQ-LBG features, and then the recognition rate obtained for each parameterization.

Phoneme database
Our experiments were performed using 14 Arabic fricative phonemes. We collected a large number of speech signals from four speakers, male and female, recorded at different moments pronouncing 14 Arabic syllables (with short vowels). In our experiments we chose the voiced and the unvoiced fricative phonemes:

‫ غ ف ح خ ز ظ ض ص ج ذ  ث‬,‫ ش ع‬,‫س‬

The database includes 700 speech signals from four (4) different subjects. The speech input is recorded at a sampling rate of 22 kHz.

Feature extraction (MFCC coefficients)
This operation is performed for every individual and for all the phonemes used (700 speech signals). For good speaker/phoneme recognition accuracy, 20 MFCC coefficients per frame were used.
Figure 5. Spectrum representation of phoneme cha

By applying the procedure described above to each speech frame, an acoustic vector of 20 mel-frequency cepstrum coefficients is computed. These are the result of a cosine transform of the logarithm of the short-term power spectrum expressed on a mel-frequency scale. Some examples illustrating the MFCC extraction process can be found in [10, 11].

The MFCC feature extraction was performed frame by frame, using a Hamming window to segment each speech utterance, with M = 100 and N = 256. The frame length is set to 11 ms with an overlap of 5 ms. We used 20 MFCC coefficients, applying the procedure of Section II; this set of coefficients is called an acoustic vector. Each input utterance of 7000 samples is transformed into a sequence of acoustic vectors of size 880 (20 MFCC coefficients for each of the 44 frames of the signal). These acoustic vectors can be used to represent and recognize the voice characteristics of the speaker. Figure 6 shows the distribution of the MFCC feature vectors in the 3rd and 4th dimensions.

Figure 6. MFCC coefficients in the third and fourth dimensions for the four speakers (phoneme cha)

Dimensionality reduction, or speech parameterization, is therefore a very important step which greatly improves the performance of the speaker recognition system. In addition, discrimination is very clear in the 5th and 6th dimensions; we therefore retain these vectors for speech parameterization in the next step.

The input matrix (training parameterization) has a dimension of 880 real values, corresponding to the 20 coefficients calculated for each of the 44 frames of a signal.

The VQ-LBG algorithm was applied with a codebook of size 16. Hence the 880 (20 × 44) values are reduced to 320 (20 × 16) values by the VQ-LBG method.

In the first phase, we created a "codebook" for each speaker to characterize his or her vocal characteristics, using training monosyllabic utterances. Then we compared a sample of a speaker's voice against the codebooks to determine the identity of the speaker.

Figure 7. Codebook description using phoneme cha (2D plot of the acoustic vectors and codebooks of the four speakers in the 5th and 6th dimensions)

In general, VQ achieves a better recognition rate for unvoiced signals, except for the phoneme (cha), for which low accuracy was obtained. We also noticed that for voiced signals the recognition rates obtained with the MFCC-VQ-based classifier were better than those of the MFCC-based classifier, except for the phoneme (za).

Our system achieves between 65% and 100% accuracy in identifying the correct speaker when a codebook of size 16 is used. Figures 8 and 9 show the simulation results obtained for unvoiced and voiced Arabic fricative phonemes.
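Putting the two phases together, and assuming the lbg and identify_speaker sketches from Section II are in scope (with random vectors standing in for real MFCC features), the experiment reduces to a few lines:

```python
import numpy as np

rng = np.random.default_rng(1)
# Training phase: one 16-codeword codebook per speaker, built from
# that speaker's pooled training acoustic vectors (20-dimensional).
train = {f"speaker{s}": rng.normal(loc=s, size=(440, 20)) for s in range(4)}
codebooks = {spk: lbg(vecs, NC=16) for spk, vecs in train.items()}

# Testing phase: an unknown utterance (44 frames x 20 MFCCs) is
# assigned to the codebook with the smallest total VQ distortion.
test_utterance = rng.normal(loc=2, size=(44, 20))   # truly from speaker 2
print(identify_speaker(test_utterance, codebooks))  # -> "speaker2"
```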
Figure 8. Recognition rate obtained for unvoiced fricatives (RR%, MFCC vs. MFCC-VQ, for the phonemes ث ح خ س ش ف)

Figure 9. Recognition rate obtained for voiced fricatives (RR%, MFCC vs. MFCC-VQ, for the phonemes ذ ز ع غ)

IV. CONCLUSION

A speaker-dependent Arabic phoneme recognition system using MFCC analysis and the VQ-LBG algorithm has been examined in this work. The system was examined with and without vector quantization in order to analyze the effect of compression in the acoustic parameterization phase.

Based on the results of the experiments, vector quantization using a codebook of size 16 achieves good results compared to the system without quantization, except for two or three phonemes where the degradation introduced was significant (for the phoneme za, the recognition rate with quantization was about 54%, whereas with plain MFCC parameterization it was about 100%).

For future work, other techniques for speech parameterization can be investigated; we can also extend this work to an audiovisual speaker/speech recognition system, combining the acoustic and visual modalities with an artificial neural network or support vector machine classifier.

REFERENCES

[1] José Ramón Calvo de Lara, "A Method of Automatic Speaker Recognition Using Cepstral Features and Vectorial Quantization", in M. Lazo and A. Sanfeliu (Eds.): CIARP 2005, LNCS 3773, pp. 146-153, Springer-Verlag Berlin Heidelberg, 2005.
[2] Minh N. Do, "An Automatic Speaker Recognition System", Digital Signal Processing Mini-Project, Audio Visual Communications Laboratory, Swiss Federal Institute of Technology, Lausanne, Switzerland, 1996, pp. 1-14.
[3] Tomi Kinnunen, "Spectral Features for Automatic Text-Independent Speaker Recognition", University of Joensuu, Department of Computer Science, Joensuu, Finland, December 21, 2003.
[4] F. K. Soong, A. E. Rosenberg, and B. H. Juang, "A vector quantisation approach to speaker recognition", AT&T Technical Journal, Vol. 66, No. 2, pp. 14-26, March 1987.
[5] G. Senthil Raja and S. Dandapat, "Speaker recognition under stressed condition", Int. J. Speech Technol., Vol. 13, pp. 141-161, 2010, DOI 10.1007/s10772-010-9075-z.
[6] P. Chakraborty, F. Ahmed, Md. Monirul Kabir, Md. Shahjahan, and Kazuyuki Murase, "An Automatic Speaker Recognition System", in M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 517-526, Springer-Verlag Berlin Heidelberg, 2008.
[7] Maati, H. Marvi, and M. Lankarany, "Vowels recognition using Mellin transform and PLP-based feature extraction", Acoustics-08, Paris, 2008.
[8] Sheeraz Memon and Margaret Lech, "Speaker Verification Based on Information Theoretic Vector Quantization", in D. M. A. Hussain et al. (Eds.): IMTIC 2008, CCIS 20, pp. 391-399, Springer-Verlag Berlin Heidelberg, 2008.
[9] Ashish Jain and John Harris, "Speaker Identification using MFCC and HMM based techniques", Univ. of Florida, April 25, 2004.
[10] "Matlab VOICEBOX" toolbox.
[11] Scholarpedia, http://www.scholarpedia.org/article/Speaker_recognition