Paper Presentation on Digital Signal Processing: "Speech Recognition"

Authors:
N.Sreeja P.Udayasree
Electronics and Communication Engg Electronics and Communication Engg
nagirimadugu.sreeja@gmail.com pidugu.udayasree@gmail.com
Ph: 9395152353 Ph: 0877-2234712
III B. Tech
Department of Electronics and Communication Engineering
Note that we use j here to denote the imaginary unit, i.e. j = √−1. In general the Xn's are complex numbers. The resulting sequence {Xn} is interpreted as follows: the zero frequency corresponds to n = 0, positive frequencies 0 < f < Fs/2 correspond to values 1 ≤ n ≤ N/2 − 1, and negative frequencies −Fs/2 < f < 0 correspond to N/2 + 1 ≤ n ≤ N − 1, where Fs denotes the sampling frequency. The result of this step is often referred to as the spectrum or periodogram.
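As a concrete illustration (not from the original paper), this step can be sketched in a few lines of NumPy, assuming a single speech frame that has already been cut from the signal:

import numpy as np

def power_spectrum(frame, n_fft=512):
    # Taper the frame edges with a Hamming window to reduce spectral leakage,
    # then take the one-sided FFT and return the power in each frequency bin.
    windowed = frame * np.hamming(len(frame))
    spectrum = np.fft.rfft(windowed, n=n_fft)
    return np.abs(spectrum) ** 2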
Mel-frequency wrapping:
As mentioned above, psychophysical studies have shown that human perception of the frequency
contents of sounds for speech signals does not follow a linear scale. Thus for each tone with an
actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the 'mel'
scale. The mel-frequency scale is linear frequency spacing below 1000 Hz and a logarithmic
spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the
perceptual hearing threshold, is defined as 1000 mels.
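The mel scale is not given analytically in the text; a widely used approximation consistent with the description above (roughly linear below 1 kHz, logarithmic above) is mel(f) = 2595 log10(1 + f/700). A small sketch:

import numpy as np

def hz_to_mel(f):
    # Common analytic fit to the mel scale; hz_to_mel(1000) is about 1000 mels.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse mapping, used when placing filter-bank edges.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)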
One approach to simulating the subjective spectrum is to use a filter bank, spaced uniformly on
the Mel scale. That filter bank has a triangular band pass frequency response, and the spacing as
well as the bandwidth is determined by a constant Mel frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. The number of Mel spectrum coefficients, K, is typically chosen as 20.
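A sketch of such a filter bank, reusing the hz_to_mel/mel_to_hz helpers above and assuming an 8 kHz sampling rate; each of the K = 20 rows is one triangular filter over the one-sided FFT bins:

import numpy as np

def mel_filterbank(n_filters=20, n_fft=512, fs=8000):
    # Filter centres are spaced uniformly on the mel scale; the triangles
    # overlap so that each filter starts at its left neighbour's centre.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for k in range(1, n_filters + 1):
        left, centre, right = bins[k - 1], bins[k], bins[k + 1]
        fbank[k - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[k - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)
    return fbank

The modified (mel) spectrum of a frame is then fbank @ power_spectrum(frame).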
Cepstrum
In this final step, the log Mel spectrum is converted back to time. The result is called the Mel
frequency cepstrum coefficients (MFCC). The cepstral representation of the speech spectrum
provides a good representation of the local spectral properties of the signal for the given frame
analysis. Because the Mel spectrum coefficients (and so their logarithm) are real numbers, we
can convert them to the time domain using the Discrete Cosine Transform (DCT). Therefore if
we denote those Mel power spectrum coefficients that are the result of the last step are
Note that the first component is excluded, from the DCT since it represents the mean value
of the input signal which carried little speaker specific information.
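A direct transcription of this DCT into code, keeping 12 coefficients (a typical choice, assumed here) and skipping the n = 0 mean term:

import numpy as np

def mfcc_from_mel(mel_energies, n_ceps=12):
    # c_n = sum_{k=1..K} log(S_k) * cos(n * (k - 1/2) * pi / K), n = 1..n_ceps
    K = len(mel_energies)
    log_s = np.log(mel_energies)
    k = np.arange(1, K + 1)
    return np.array([np.sum(log_s * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(1, n_ceps + 1)])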
By applying the procedure described above, for each speech frame of around 30 ms with overlap, a set of Mel-frequency cepstrum coefficients is computed. These are the result of a cosine
transform of the logarithm of the short-term power spectrum expressed on a Mel-frequency
scale. This set of coefficients is called an acoustic vector. Therefore each input utterance is
transformed into a sequence of acoustic vectors. In the next section we will see how those
acoustic vectors can be used to represent and recognize the voice characteristic of the speaker.
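Stitching the sketches above together (power_spectrum, mel_filterbank and mfcc_from_mel are the hypothetical helpers defined earlier), each utterance becomes a matrix with one acoustic vector per frame:

import numpy as np

def acoustic_vectors(frames, fbank):
    # One MFCC vector per ~30 ms frame; a small floor avoids log(0)
    # for filters that capture no energy.
    return np.array([mfcc_from_mel(fbank @ power_spectrum(f) + 1e-12)
                     for f in frames])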
Feature Matching
Introduction
The problem of speaker recognition belongs to the field of pattern recognition. The objective of pattern
recognition is to classify objects of interest into one of a number of categories or classes. The
objects of interest are generically called patterns and in our case are sequences of acoustic
vectors that are extracted from an input speech using the techniques described in the previous
section. The classes here refer to individual speakers. Since the classification procedure in our case is applied to extracted features, it can also be referred to as feature matching.
The state-of-the-art in feature matching techniques used in speaker recognition includes
Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization
(VQ). In this paper the VQ approach will be used, due to ease of implementation and high
accuracy. VQ is a process of mapping vectors from a large vector space to a finite number of
regions in that space. Each region is called a cluster and can be represented by its center called a
codeword. The collection of all codewords is called a codebook.
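These definitions translate directly into code; a minimal sketch, assuming Euclidean distance as the similarity measure and a codebook stored as an (M, d) array of codewords:

import numpy as np

def vq_distortion(vectors, codebook):
    # Distance from every acoustic vector to its closest codeword,
    # summed over the whole utterance.
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).sum()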
Figure 4 shows a conceptual diagram to illustrate this recognition process. In the figure, only two
speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1, while the triangles are from speaker 2. In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his or her training acoustic vectors. The resulting codewords (centroids) are shown in Figure 4 as black circles and black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest
codeword of a codebook is called a VQ-distortion. In the recognition phase, an input utterance of
an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion
is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified.
Figure 4. Conceptual diagram illustrating vector quantization codebook formation. One speaker can be discriminated from another based on the location of the centroids.
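The recognition phase then reduces to picking the codebook with the smallest total distortion; a sketch using the vq_distortion helper above:

def identify_speaker(vectors, codebooks):
    # codebooks: dict mapping each enrolled speaker to an (M, d) codeword array.
    return min(codebooks, key=lambda spk: vq_distortion(vectors, codebooks[spk]))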
The codebooks are built with the LBG algorithm, which clusters a set of training vectors into M codewords through the following recursive procedure:
1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).
2. Double the size of the codebook by splitting each current codeword y_n according to the rule y_n+ = y_n(1 + ε), y_n− = y_n(1 − ε), where n varies from 1 to the current size of the codebook, and ε is a splitting parameter (we choose ε = 0.01).
3. Nearest-Neighbor Search: for each training vector, find the codeword in the current codebook
that is closest (in terms of similarity measurement), and assign that vector to the corresponding
cell (associated with the closest codeword).
4. Centroid Update: update the codeword in each cell using the centroid of the training vectors
assigned to that cell.
5. Iteration 1: repeat steps 3 and 4 until the average distance falls below a preset threshold.
6. Iteration 2: repeat steps 2, 3 and 4 until a codebook size of M is designed.
Intuitively, the LBG algorithm designs an M-vector codebook in stages. It starts by designing a 1-vector codebook, then uses a splitting technique on the codewords to initialize the search for a 2-vector codebook, and continues the splitting process until the desired M-vector
codebook is obtained.
Figure 5 shows, in a flow diagram, the detailed steps of the LBG algorithm. "Cluster vectors" is
the nearest-neighbor search procedure which assigns each training vector to a cluster associated
with the closest codeword. "Find centroids" is the centroid update procedure. "Compute D
(distortion)" sums the distances of all training vectors in the nearest-neighbor search so as to
determine whether the procedure has converged.
Figure 5. Flow diagram of the LBG algorithm.
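As a compact sketch of the procedure in the flow diagram (assuming M is a power of two, so that repeated doubling reaches it exactly):

import numpy as np

def lbg(training, M, eps=0.01, threshold=1e-3):
    # Step 1: a 1-vector codebook, the centroid of all training vectors.
    codebook = training.mean(axis=0, keepdims=True)
    while len(codebook) < M:
        # Step 2: split each codeword y into y(1 + eps) and y(1 - eps).
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        while True:
            # Step 3: nearest-neighbour search ("cluster vectors").
            d = np.linalg.norm(training[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # Step 4: centroid update ("find centroids").
            for j in range(len(codebook)):
                members = training[nearest == j]
                if len(members):
                    codebook[j] = members.mean(axis=0)
            # Step 5: iterate until the average distortion stops improving.
            dist = np.linalg.norm(training - codebook[nearest], axis=1).mean()
            if prev - dist < threshold:
                break
            prev = dist
    return codebook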
Conclusion:
Even with much care, it is difficult to obtain an efficient speaker recognition system, since the task is challenged by highly variable input speech signals. The principal source of this variance is the speaker himself or herself. Speech signals in training and testing sessions can differ greatly due to many factors: people's voices change with time, with health conditions (e.g. when the speaker has a cold), with speaking rate, and so on. There are also other factors, beyond speaker variability, that present a challenge to speaker recognition technology. Because of all these difficulties, this technology is still an active area of research.