MFCC and Vector Quantization for Arabic Fricatives (2012)
Speech/Speaker recognition
Abstract—This article develops a speaker-dependent Arabic phoneme recognition system using MFCC analysis and the VQ-LBG algorithm. The system is examined with and without vector quantization in order to analyze the effect of compression in the acoustic parameterization phase. Our experimental results show that vector quantization with a codebook of size 16 achieves good results compared to the system without quantization for the majority of the phonemes studied.

Keywords: MFCC; VQ; speech recognition; speaker identification.

I. INTRODUCTION

Speech is the primary communication medium between people. For decades, human beings have dreamed of an intelligent machine that can master natural speech. Automatic Speaker Recognition techniques make it possible to use the speaker's voice to verify their identity and to control access to services such as voice dialing, telephone banking, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers [1, 2, 11].

These techniques can be classified into identification and verification. Speaker identification is the process of determining which registered speaker produced a given utterance. Speaker verification is the process of accepting or rejecting the identity claim of a speaker [2].

Speaker Recognition methods can be divided into text-independent and text-dependent. In a text-independent method, speaker models capture characteristics of the speaker's speech irrespective of what is being said. In a text-dependent method, recognition of the speaker's identity is based on his/her speaking specific phrases, such as passwords, card numbers, PIN codes, etc. [1, 2, 11].

Speaker Recognition systems contain two main processes: feature extraction and feature matching. Feature extraction extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching identifies an unknown speaker by comparing the features extracted from his/her voice input with those of a set of known speakers [1, 2].

One of the first decisions in any pattern recognition system is the choice of which features to use and how exactly to represent the basic signal to be classified, so as to make the classification task as easy as possible [7].

A wide range of possibilities exists for representing the speech signal in automatic speech and speaker recognition, including spectral features such as Linear Prediction Coefficients (LPC), Linear Prediction Cepstral Coefficients (LPCC), and Mel-Frequency Cepstral Coefficients (MFCC). The most popular feature representations currently in use are MFCC and Perceptual Linear Prediction (PLP) [1, 3, 7].

Among these, we used MFCC because it is the best known and most popular. Psychophysical studies have shown that human perception of the frequency content of speech sounds does not follow a linear scale. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. This is expressed in the mel-frequency scale, which has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz [1, 2, 6].

There are many techniques for speaker recognition, such as Hidden Markov Models (HMM) and Artificial Neural Networks (ANN); we have used VQ because of its lower computational complexity [1]. Any speaker recognition system comprises the two main modules introduced above: feature extraction and feature matching [1, 2]. The speaker-specific features are extracted using a Mel-Frequency Cepstrum Coefficient (MFCC) processor, which produces a set of mel-frequency cepstrum coefficients for each frame of speech.
Figure 1. Block diagram of the MFCC processor (continuous speech → frame blocking → windowing → FFT → mel-frequency wrapping → mel cepstrum transform → set of mel cepstrum coefficients)

Frame Blocking

In this step the continuous speech signal is blocked into frames of N samples, with adjacent frames separated by M samples (M < N). The first frame consists of the first N samples; the second frame begins M samples after the first frame, and overlaps it by N − M samples.

Figure 2. The generalized Hamming window

Fast Fourier Transform (FFT)

The Fast Fourier Transform converts a signal from the time domain into the frequency domain. The FFT is a fast algorithm for implementing the Discrete Fourier Transform (DFT), which is defined on a set of N samples {x_k} as follows:

X_n = Σ_{k=0}^{N−1} x_k · e^{−2πjkn/N},  n = 0, 1, ..., N − 1   (3)

In general the X_n are complex numbers. The resulting sequence {X_n} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 1 ≤ n ≤ N/2 − 1, while negative frequencies −Fs/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N − 1. Here Fs denotes the sampling frequency [2, 6, 11].

Mel-Frequency Wrapping

Psychophysical studies have shown that human perception of the frequency content of speech sounds does not follow a linear scale. Thus for each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on a scale called the 'mel' scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. We can therefore use the following approximate formula to compute the mels for a given frequency f in Hz [2, 6, 11]:

F_MEL = 2595 · log10(1 + f_Hz / 700)   (4)

One approach to simulating the subjective spectrum is to use a filter bank spaced uniformly on the mel scale. Each filter has a triangular band-pass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20 [2, 3, 6].

Figure 3. Mel-spaced filter bank with 20 filters

Cepstrum

In the final step, the log mel spectrum is converted back to the time domain. The result is called the mel-frequency cepstrum coefficients (MFCCs). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithms) are real numbers, they may be converted to the time domain using the Discrete Cosine Transform (DCT). The MFCCs may be calculated using this equation [2, 6]:

c(p) = Σ_{m=1}^{Ml} log(ã_m) · cos(p · (m − 1/2) · π/Ml),  p = 1, 2, ..., Ml   (5)

where Ml (= K) denotes the number of mel cepstrum coefficients, typically chosen as 20 (the number of filters). This set of coefficients is called an acoustic vector [2, 6]. Note that we exclude the first component, c̃_0, from the DCT, since it represents the mean value of the input signal, which carries little speaker-specific information.

B. Vector Quantization Method of Feature Matching

VQ is a process of mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword. The collection of all codewords is called a codebook. Figure 4 illustrates this process [2].

The objective of vector quantization for a given training data set T is to design (discover) the optimal codebook, containing a predetermined number of reference code vectors, which guarantees minimization of the chosen distortion metric for all encoded patterns from the data set. Each code vector in the codebook has an associated integer index used for referencing [2, 4, 11].

Figure 4. Conceptual diagram illustrating vector quantization codebook formation [2]

In Figure 4, the circles refer to the acoustic vectors from speaker 1 while the triangles are from speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his/her training acoustic vectors [2, 4, 11]. The resulting codewords (centroids) are shown by black circles. The mean quantization error (MQE) of a set of Np acoustic vectors Y = {x_1, ..., x_Np} against a codebook S of NC codewords is

MQE = D(Y, S) = (1/Np) · Σ_{p=1}^{Np} d(x_p, q(x_p)),  q(x_p) = argmin_{1≤i≤NC} d(x_p, s_i)

where d(·, ·) is the chosen distortion measure and q(x_p) is the codeword of S nearest to x_p.
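The MFCC front end described above (frame blocking, Hamming windowing, FFT, triangular mel filter bank, and DCT of the log filter-bank outputs, Eqs. (3)-(5)) can be sketched in Python/NumPy. This is an illustrative reimplementation, not the authors' code; the frame length N = 256, frame shift M = 100, 20 filters, and 12 retained coefficients are placeholder values:

```python
import numpy as np

def hz_to_mel(f):
    # Eq. (4): F_MEL = 2595 * log10(1 + f/700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of Eq. (4)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular band-pass filters spaced uniformly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                      # rising edge
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling edge
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, fs, n=256, m=100, n_filters=20, n_ceps=12):
    # Frame blocking: frames of N samples, adjacent frames shifted by M (M < N)
    frames = [signal[s:s + n] for s in range(0, len(signal) - n + 1, m)]
    fb = mel_filterbank(n_filters, n, fs)
    win = np.hamming(n)                            # Hamming window per frame
    feats = []
    for fr in frames:
        spec = np.abs(np.fft.rfft(fr * win)) ** 2  # power spectrum via FFT, Eq. (3)
        energies = np.maximum(fb @ spec, 1e-12)    # mel-frequency wrapping
        logm = np.log(energies)
        # Eq. (5): DCT of log mel energies; p starts at 1, so c0 is excluded
        p = np.arange(1, n_ceps + 1)[:, None]
        mg = np.arange(1, n_filters + 1)[None, :]
        dct = np.cos(np.pi * p * (mg - 0.5) / n_filters)
        feats.append(dct @ logm)
    return np.array(feats)                         # one acoustic vector per frame
```

Each row of the returned array is one acoustic vector; starting the DCT index p at 1 drops the c̃_0 term, as noted above.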
of 44 frames for each signal). These acoustic vectors can be used to represent and recognize the voice characteristics of the speaker. Figure 6 shows the distribution of the MFCC feature vectors in the 3rd and 4th dimensions.

Figure 5. MFCC coefficients for the signals cha1, cha2, cha3 and cha4

Figure 6. Distribution of the MFCC feature vectors and the codebook centroids
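The codebook training and minimum-distortion matching described in Section B can be sketched as follows. This is an illustrative Python/NumPy reimplementation of LBG-style binary splitting followed by k-means refinement, not the authors' code; the split factor, iteration count, and codebook size are placeholder values:

```python
import numpy as np

def lbg(vectors, size, eps=0.01, n_iter=20):
    """Grow a codebook by LBG binary splitting, refining with k-means passes."""
    codebook = vectors.mean(axis=0, keepdims=True)          # start: global centroid
    while len(codebook) < size:
        # Split every centroid into a slightly perturbed pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            # Assign each training vector to its nearest codeword
            d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # Move each codeword to the centroid of its cluster
            for i in range(len(codebook)):
                members = vectors[nearest == i]
                if len(members):
                    codebook[i] = members.mean(axis=0)
    return codebook

def mean_quantization_error(vectors, codebook):
    # MQE = (1/Np) * sum_p min_i d(x_p, s_i), with Euclidean distortion
    d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(test_vectors, codebooks):
    """Return the speaker whose codebook yields the lowest average distortion."""
    return min(codebooks, key=lambda s: mean_quantization_error(test_vectors, codebooks[s]))
```

In use, `lbg` is run once per enrolled speaker on that speaker's training acoustic vectors (e.g. with `size=16`, the codebook size reported in the abstract), and `identify` scores an unknown utterance against every stored codebook.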
MFCC
60
MFCC-VQ For future work, other techniques for speech
40 parameterization can be investigated; we can improve our work
20 in order to develop an audiovisual speaker/speech recognition
system using the acoustic and the visual modality using
0
artificial neural network or Support vector machine classifier.
ث ح خ س ش ف
phonemes REFERENCES
MFCC
60 [5] G. Senthil Raja · S. Dandapat, «Speaker recognition under stressed
MFCC-VQ condition”, Int J Speech Technol (2010) 13: 141–161, DOI 10.
40
1007/s10772-010-9075-z.
20
[6] P. Chakraborty1, F. Ahmed , Md. Monirul Kabir , Md. Shahjahan1, and
0 Kazuyuki Murase “An Automatic Speaker Recognition System”, M.
ذ ز ع غ Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 517–526,
2008. Springer-Verlag Berlin Heidelberg 2008.
phonemes
[7] Maati, H. Marvi and M. Lankarany, “vowels recognition using mellin
transform and plp-based feature extraction”, acoustics-08 Paris.
[8] Sheeraz Memon and Margaret Lech,” Speaker Verification Based on
Information Theoretic Vector Quantization”, D.M.A. Hussain et al.
(Eds.): IMTIC 2008, CCIS 20, pp. 391–399, 2008. Springer-Verlag
Figure 9. Recognition rate obtained for voiced fricatives Berlin Heidelberg 2008.
[9] Ashish Jain& John Harris, Speaker Identification using MFCC and
HMM based techniques, Univ. of Florida, April 25, 2004.
IV. CONCLUSION
[10] “Matlab VOICEBOX”.
A speaker-dependent Arabic phonemes recognition system [11] report, http://www.scholarpedia.org/article/Speaker_recognition
using MFCC analysis and the VQ-LBG algorithm has been
examined in this work. The system is examined with and