

BUILD AUTOMATIC SPEECH

RECOGNITION SYSTEM
Internship Report Submitted in Partial Fulfillment of the Requirements for the Degree of

Bachelor of Technology
in
Computer Science and Engineering

Submitted by
GUDIPATI SAI (roll no. 18CSE1009)

Department of Computer Science and Engineering
National Institute of Technology Goa
May, 2022
Contents

1 ABSTRACT

2 INTRODUCTION

3 LITERATURE REVIEW

4 AUTOMATIC SPEECH RECOGNITION
    4.0.1 CLASSIFICATION OF SPEECH RECOGNITION

5 METHODOLOGY
    5.0.1 A Basic Primer on How Automatic Speech Recognition Works
    5.0.2 Speech Preprocessing

6 CONCLUSION

References
Chapter 1

ABSTRACT

Speech is the most efficient mode of communication between people. Being the most natural way of communicating, it could also serve as a useful interface for communicating with machines. As a result, automatic speech recognition systems have grown greatly in popularity. There are different approaches to speech recognition, such as Hidden Markov Models (HMM), Dynamic Time Warping (DTW), and Vector Quantization (VQ). This report provides a comprehensive study of the use of Artificial Neural Networks (ANN) in speech recognition. It focuses on the different neural-network-related methods that can be used for speech recognition, compares their advantages and disadvantages, and concludes with the most suitable method.

Chapter 2

INTRODUCTION

Speech is probably the most efficient and natural way for humans to communicate with each other. Humans learn all the relevant skills during early childhood, without explicit instruction, and continue to rely on speech communication throughout their lives. Humans also want a similarly natural, easy and efficient mode of communication with machines. Speech is greatly affected by accents, articulation, pronunciation, roughness, emotional state, gender, pitch, speed, volume, background noise and echoes. Speech Recognition, or Automatic Speech Recognition (ASR), plays an important role in human-computer interaction. Speech recognition uses algorithms, implemented as computer programs, to convert speech signals into a sequence of words. Theoretically, it should be possible to recognize speech directly from the digitized waveform. At present, speech recognition systems are capable of understanding thousands of words under suitable operating conditions. The purpose of this survey is to obtain an understanding of the state of the art in the use of neural networks for speech recognition. The goal is to identify the different neural-network-related methods that can be used for speech recognition and, where possible, to compare the pros and cons of each technique.

Chapter 3

LITERATURE REVIEW

In 1952, the Audrey system, designed at Bell Laboratories, became the first speech recognition system; it recognized only digits spoken by a single person. Ten years later, in 1962, IBM produced a system that recognized 16 English words. In collaboration, the Soviet Union, the United States, England and Japan developed hardware that recognized 4 vowels and 9 consonants. Carnegie Mellon's "Harpy" speech-understanding system recognized 1,011 words between 1971 and 1976. Threshold Technology and Bell Laboratories were the first commercial speech recognition companies to interpret the voices of multiple speakers. A new statistical method, the Hidden Markov Model (HMM), was introduced in 1980; it expanded recognition vocabularies from a few hundred words to several thousand words and made it possible, in principle, to recognize an unlimited number of words. In 1987, children could train Worlds of Wonder's Julie doll to respond to their voices. In 1985, the Kurzweil text-to-speech system recognized a 5,000-word vocabulary, a capability also established by IBM. Dragon launched the first consumer speech recognition product, Dragon Dictate, which recognized 100 words per minute; the system took 45 minutes to train. In 1996, Voice Activated Link (VAL) from BellSouth launched a dial-in interactive voice recognition system which gave information based on what the speaker said over the phone. In 2001, speech recognition systems attained 80% accuracy [9]. Ten years later, Google's English Voice Search system incorporated 230 billion words from actual user queries.

Thiang et al. (2011) presented speech recognition using Linear Predictive Coding (LPC) and an Artificial Neural Network (ANN) for controlling the movement of a mobile robot. Input signals were sampled directly from the microphone, and features were extracted by LPC before classification by the ANN [1]. Ms. Vimala C. and Dr. V. Radha (2012) proposed a speaker-independent isolated speech recognition system for the Tamil language. Feature extraction, the acoustic model, the pronunciation dictionary and the language model were implemented using HMMs, which produced 88% accuracy on 2,500 words [7]. Cini Kurian and Kannan Balakrishnan (2012) reported the development and evaluation of different acoustic models for Malayalam continuous speech recognition. HMMs were used to compare and evaluate Context Dependent (CD), Context Independent (CI) and Context Dependent tied (CD-tied) models; the CI model achieved 21%. The database consisted of 21 speakers, including 10 males and 11 females [2]. Suma Swamy et al. (2013) introduced an efficient speech recognition system using Mel Frequency Cepstrum Coefficients (MFCC), Vector Quantization (VQ) and HMMs, which recognized speech with 98% accuracy. The database consisted of five words spoken by 4 speakers ten times each [10]. Annu Choudhary et al. (2013) proposed an automatic speech recognition system for isolated and connected words of the Hindi language using the Hidden Markov Model Toolkit (HTK). Hindi words in the dataset were represented by MFCC features, and the recognition system achieved 95% accuracy on isolated words and 90% on connected words [3]. Preeti Saini et al. (2013) proposed Hindi automatic speech recognition using HTK. Isolated words were recognized with a 10-state HMM topology, which produced 96.61% accuracy [12]. Md. Akkas Ali et al. (2013) presented an automatic speech recognition technique for Bangla words. Feature extraction was done with Linear Predictive Coding (LPC) and a Gaussian Mixture Model (GMM). In total, 100 words were recorded 1,000 times, which gave 84% accuracy [11]. Maya Moneykumar et al. (2014) developed Malayalam word identification for a speech recognition system. The proposed work used syllable-based segmentation with HMMs on MFCC features [5]. Jitendra Singh Pokhariya and Dr. Sanjay Mathur (2014) introduced Sanskrit speech recognition using HTK. MFCC features and two-state HMMs were used, producing 95.2% to 97.2% accuracy [4]. In 2014, Geeta Nijhawan et al. developed a real-time speaker recognition system for Hindi words. Feature extraction was done with MFCC using the Vector Quantization Linde-Buzo-Gray (VQLBG) algorithm, and a Voice Activity Detector (VAD) was proposed to remove silence [10]. In 2015, Google's speech recognition experimented with Connectionist Temporal Classification (CTC)-trained Long Short-Term Memory (LSTM) networks, which were deployed in Google Voice [8]. The various techniques suggested by many researchers for developing different applications in speech recognition are elaborated in this report.

Chapter 4

AUTOMATIC SPEECH
RECOGNITION

Automatic speech recognition is a technology that enables a machine to turn a speech signal into the corresponding text or command after recognizing and understanding it. Automatic speech recognition (ASR) involves the extraction and determination of acoustic features, the acoustic model, and the language model. The extraction and determination of the acoustic features is a significant part of speech recognition; it is a procedure of information compression as well as a procedure of signal deconvolution.

The acoustic model is the calculation from speech to syllable probability. The acoustic model in speech recognition usually uses Hidden Markov Models (HMMs) to model each recognition unit, and a speech element is typically a three- to five-state HMM. A word is an HMM formed by concatenating the HMMs of the speech elements that make up that word, while a complete continuous speech recognition model is an HMM combining words and silence.

The language model is the calculation from syllables or words to sentence probability, and is mainly divided into statistical models and rule models. A statistical language model uses probability statistics to express the inner statistical regularities of the language; the N-gram model is simple and efficient and is widely used. A rule model refers to a model based on explicit rules or grammatical structure.

Due to the diversity and complexity of the speech signal, present speech recognition systems can only perform satisfactorily under certain conditions or in specific applications. The performance of a speech recognition system depends roughly on four factors: the vocabulary size and acoustic complexity, the quality of the speech signal, whether there is a single speaker or multiple speakers, and the hardware platform. Speech is the most natural communication medium in present communication systems. With the development of computer and voice processing technology, voice-to-voice translation between different languages is becoming a hot spot of speech research. The research hot spots of speech recognition include the design of natural language databases, speech feature extraction, the use of corpora for acoustic modelling, speech recognition algorithms, language translation, speech synthesis, and dialog processing.

Li Deng, working with Geoffrey Hinton, found that deep networks can improve the precision of speech recognition. This result was further developed by Microsoft Research Asia, who built a huge neural network containing 6.6 million neural connections, the biggest such model in the research history of speech recognition at the time. This model reduced the recognition error rate by a third from the lowest error rate on the Switchboard standard data set, a benchmark on which the lowest error rate had not been improved for many years.

4.0.1 CLASSIFICATION OF SPEECH RECOGNITION
Speech recognition can be classified by utterance type into four categories:
1. Isolated Words
2. Connected Words
3. Continuous Speech
4. Spontaneous Speech
Isolated Words:
An isolated-word recognizer accepts and recognizes a single word or a single utterance at a time, and pronunciation requires a pause between utterances, often described as a listen/non-listen state. Such systems are sensitive to how the utterance boundaries are chosen, since a different boundary can alter the entire result. Examples of isolated words are 'start', 'stop', 'read', etc. Table I shows isolated-word speech recognition systems for different languages.
Connected Words:
A connected-word recognition system is similar to an isolated-word system, but it allows speakers to pronounce words together with only a minimal interval between them. The utterance can be a single word, a collection of a few words, a single sentence, or even multiple sentences. Table II shows a list of language-specific connected-word recognition systems using different techniques.
Continuous Speech:
Continuous speech recognition allows the user to speak naturally, while the computer takes up the task of determining the content. Recognizers with continuous speech capability face the greatest difficulty in determining utterance boundaries.

Chapter 5

METHODOLOGY

5.0.1 A Basic Primer on How Automatic Speech Recognition Works
The basic sequence of events that lets any Automatic Speech Recognition software, regardless of its sophistication, pick up and break down your words for analysis and response is as follows:
1. You speak to the software via an audio feed.
2. The device you are speaking to creates a wave file of your words.
3. The wave file is cleaned by removing background noise and normalizing the volume.
4. The resulting filtered waveform is then broken down into what are called phonemes. (Phonemes are the basic building-block sounds of language and words; English has 44 of them, consisting of sound blocks such as "wh", "th", "ka" and "t".)
5. Each phoneme is like a chain link: by analyzing them in sequence, starting from the first phoneme, the ASR software uses statistical probability analysis to deduce whole words and, from there, complete sentences.
6. Your ASR, now having "understood" your words, can respond to you in a meaningful way.
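As a rough, hedged illustration of this end-to-end flow, the sketch below runs an existing wave file through an off-the-shelf recognizer using the third-party SpeechRecognition package; the library choice and the file name "command.wav" are assumptions for the example, not components prescribed by this report.

```python
# A minimal sketch of the pipeline above, assuming the SpeechRecognition
# package is installed (pip install SpeechRecognition). "command.wav" is a
# placeholder file name.
import speech_recognition as sr

recognizer = sr.Recognizer()

# Step 2: load the recorded wave file of the spoken words.
with sr.AudioFile("command.wav") as source:
    # Step 3 (partly): estimate ambient noise so it can be compensated for.
    recognizer.adjust_for_ambient_noise(source, duration=0.5)
    audio = recognizer.record(source)

# Steps 4-5 happen inside the recognizer: the audio is decomposed into
# phoneme-level units and a statistical model deduces the most likely words.
try:
    text = recognizer.recognize_google(audio)   # uses Google's free web API
    print("Recognized:", text)                  # Step 6: the system can now respond
except sr.UnknownValueError:
    print("Speech was unintelligible")
except sr.RequestError as err:
    print("Recognition service unavailable:", err)
```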

5.0.2 Speech Preprocessing
Speech recognition begins with the speaker producing an utterance, which consists of audio waves. The audio waves are captured by a microphone and converted into an electrical signal, which is in turn converted into a digital signal so that it can be processed by the speech system. The relevant information about the given utterance is then extracted for accurate recognition. Finally, the speech recognition system finds the best match for the utterance.

Figure 5.1: Automatic speech recognition system architecture

1) Analysis Techniques:
Analysis is the initial stage of a speech recognition system and involves sampling, windowing, framing and noise cancellation. Analysis deals with the frame size used for segmenting the speech signal; the signal contains various kinds of information about the speaker arising from the vocal tract, behavioural features and the excitation source, which are examined in the analysis types below: segmentation analysis, sub-segmental analysis and supra-segmental analysis.

Segmentation Analysis:
Speech is analyzed using a frame size and shift in the range of 10-30 ms to extract vocal tract information about the speaker.

Sub-segmental Analysis:
The frame size and shift are around 3-5 ms, which extracts characteristics of the speaker from the excitation source.

Supra-segmental Analysis:
Speech is analyzed using the frame size to capture behavioural characteristics of the speaker. Various framing and windowing techniques are used in this analysis phase; a minimal framing sketch is given below.
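The following is a small, hedged sketch of the framing and windowing step shared by these analysis types, written directly in NumPy. The 16 kHz sample rate, 25 ms frame size and 10 ms shift are illustrative values within the ranges mentioned above, not parameters taken from a specific system.

```python
# Segment a digitized mono speech signal into overlapping, windowed frames.
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    frame_len = int(sample_rate * frame_ms / 1000)    # samples per frame
    frame_shift = int(sample_rate * shift_ms / 1000)  # samples between frame starts
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len]
        for i in range(num_frames)
    ])
    # A Hamming window tapers each frame to reduce edge discontinuities.
    return frames * np.hamming(frame_len)

# Example: one second of audio at 16 kHz yields 98 frames of 400 samples each.
frames = frame_signal(np.zeros(16000))
print(frames.shape)
```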
2) Feature Extraction:
According to speech recognition theory, it should be possible to recognize speech directly from the digitized waveform. However, since speech signals are highly variable, statistical representations are generated to compress that variability, which is achieved by feature extraction. Feature extraction transforms the time-domain signal into an effective parametric representation. The most widely used extraction technique is MFCC. The block diagram of the MFCC technique is shown in Fig. 5.2.

Figure 5.2: Block Diagram for MFCC Feature Extraction

The MFCC processor involves the following steps:
Pre-emphasis:
In pre-processing, the amplitudes of the high-frequency bands of the speech signal are increased and those of the lower bands are decreased, typically implemented with an FIR filter.
Framing and windowing:
The speech signal is split into a number of frames. The frame size is taken as 25 ms, and a Hamming window is applied to minimize signal discontinuities at the edges of each frame.
Fast Fourier Transform (FFT):
Each frame of N samples is converted from the time domain into the frequency domain.
Mel Filter Bank:
The linear frequency scale is converted to the mel scale using a bank of filters called the mel filter bank.
Logarithm:
The logarithm of the mel filter bank outputs is taken, giving the log mel spectrum.
Discrete Cosine Transform (DCT):
The log mel spectrum is converted back from the frequency domain to the time domain, producing the MFCC features.
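A hedged end-to-end sketch of these MFCC steps is given below using the librosa library; the toolkit, the file name "utterance.wav", the 13 coefficients and the 25 ms / 10 ms framing are assumptions for illustration, not values fixed by this report.

```python
# MFCC extraction sketch: pre-emphasis, framing/windowing, FFT, mel filter
# bank, logarithm and DCT are performed by librosa internally.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)    # digitized waveform
y = librosa.effects.preemphasis(y, coef=0.97)      # pre-emphasis (FIR filter)

mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr),        # 25 ms frames
    hop_length=int(0.010 * sr),   # 10 ms shift
    window="hamming",             # windowing to reduce edge discontinuities
)
# The result is a (13, num_frames) matrix of MFCC feature vectors.
print(mfcc.shape)
```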
2.1 Recognition:
Recognition approaches are broadly classified into three categories: the acoustic-phonetic approach, the pattern recognition approach and the artificial intelligence approach. In the training phase of a recognition system, the parameters of the classification model are estimated from a large number of training samples. During the testing phase, the features of a test utterance are matched against the trained speech model of each class.
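The following is a small, hedged sketch of this train/test split: one statistical model per word class is fitted on training feature vectors, and a test utterance is assigned to the class whose model scores it highest. The use of scikit-learn Gaussian Mixture Models and the random "MFCC-like" features are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(features_by_class, n_components=4):
    """features_by_class: dict mapping class label -> (num_frames, num_features) array."""
    models = {}
    for label, feats in features_by_class.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              random_state=0)
        models[label] = gmm.fit(feats)   # training phase: estimate model parameters
    return models

def classify(models, test_features):
    # Testing phase: score the test utterance under every class model.
    scores = {label: gmm.score(test_features) for label, gmm in models.items()}
    return max(scores, key=scores.get)

# Toy usage with random stand-in features for two word classes.
rng = np.random.default_rng(0)
models = train_models({"start": rng.normal(0, 1, (200, 13)),
                       "stop":  rng.normal(2, 1, (200, 13))})
print(classify(models, rng.normal(2, 1, (50, 13))))   # expected: "stop"
```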
2.2 Modeling Technique:
The aim of the modeling technique is to create speaker models using speaker-specific feature vectors. Speaker modeling techniques fall into two categories, namely speaker identification and speaker recognition. The speaker identification technique automatically identifies who is speaking on the basis of individual information embedded in the speech signal. Speaker recognition is the identification of a person from the characteristics of that specific speaker's voice.

3) Speech Classification:
The most common techniques used for speech classification are discussed briefly below. These systems involve complex mathematical functions and extract hidden information from the processed input signal. The following are the different types of classification models.
VQ:
Vector Quantization (VQ) is a technique in which vectors from a large vector space are mapped to a finite number of regions in that space. The technique is based on the block coding principle. The density matching property of vector quantization is powerful, especially for identifying the density of large, high-dimensional data. Since data points are represented by the index of their closest centroid, commonly occurring data have low error and rare data have high error, which is why VQ is suitable for lossy data compression. It can also be used for lossy data correction and density estimation. Each region is called a cluster and can be represented by its centre, known as a code word; the code book is the collection of all code words.
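Below is a minimal, hedged sketch of building a VQ code book and quantizing feature vectors using scipy's vector-quantization utilities; the toolkit, the 16-entry code book size and the random stand-in features are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 13))          # e.g. MFCC vectors from training speech

codebook, distortion = kmeans(features, 16)    # 16 code words = cluster centres
print("code book shape:", codebook.shape)      # (16, 13)

# Each new vector is represented by the index of its closest code word.
indices, errors = vq(rng.normal(size=(50, 13)), codebook)
print("first few indices:", indices[:5])
```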
DTW:
The Dynamic Time Warping (DTW) technique compares spoken words with reference words. It is an algorithm for measuring the similarity between two sequences that may vary in time or speed, i.e. a method to measure the similarity of patterns with different time alignments. The smaller the distance produced, the more similar the two sound patterns are; if the patterns are sufficiently similar, the two utterances are judged to be the same. The initial data in the speech recognition process is transformed into frequency components. Pronunciation volume, pronunciation time, and noise from the surroundings where the recording takes place all affect the distance generated: the smaller these effects, the smaller the resulting distance. In this technique, the time dimension of the unknown word is warped until it matches that of the reference word.
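The sketch below computes the DTW distance between two feature sequences directly in NumPy; it is a generic textbook formulation of the algorithm, not code taken from this report, and the synthetic signals only illustrate that a stretched copy of a pattern still scores a small distance.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """seq_a, seq_b: arrays of shape (length, num_features)."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # local distance
            # Warping: allow a match, an insertion, or a deletion.
            cost[i, j] = dist + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# The same pattern spoken slower (stretched in time) still yields a small distance.
word = np.sin(np.linspace(0, 3, 40)).reshape(-1, 1)
slower = np.sin(np.linspace(0, 3, 60)).reshape(-1, 1)
print(dtw_distance(word, slower))
```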
HMM:
The HMM is the most successfully used pattern recognition technique for speech recognition. It is a mathematical model based on the Markov model together with a set of output distributions. Speech is split into the smallest audible entities (not only vowels and consonants but also compound sounds such as 'ou', 'ea', 'eu', etc.), and all of these entities are represented as states in the Markov model. As a word enters the Hidden Markov Model it is compared to the best-suited model (entity). Markov models tend to perform quite well in noisy environments because every sound entity is treated separately: if a sound entity is lost in the noise, the model may still be able to guess that entity from the probability of going from one sound entity to another.
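A hedged sketch of a per-word HMM classifier in this spirit is shown below using the hmmlearn package; the toolkit, the five-state models and the random stand-in features are assumptions for illustration, not a system described in this report.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

def train_word_hmm(frames, n_states=5):
    # Estimate transition and emission parameters from the word's feature frames.
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=25, random_state=0)
    model.fit(frames)
    return model

word_models = {
    "yes": train_word_hmm(rng.normal(0, 1, (300, 13))),
    "no":  train_word_hmm(rng.normal(3, 1, (300, 13))),
}

test = rng.normal(3, 1, (60, 13))     # frames of an unknown utterance
best = max(word_models, key=lambda w: word_models[w].score(test))
print("recognized word:", best)       # expected: "no"
```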
Neural Networks (NN):
Neural networks have many similarities with Markov models. Both are statistical models that can be represented as graphs. Where Markov models use probabilities for state transitions, neural networks use connection strengths and activation functions. A key difference is that neural networks are fundamentally parallel while Markov chains are serial. Frequencies in speech occur in parallel, while syllable series and words are essentially serial, so each technique is very powerful in a different context. Just as the challenge in a neural network is to set the appropriate connection weights, the challenge for a Markov model is to find the appropriate transition and observation probabilities. Speech can be represented in different ways; depending on the situation and the kind of speech information that needs to be present, one representation domain may be more appropriate than another. In many speech recognition systems, both techniques are implemented together and work in a symbiotic relationship: neural networks perform very well at learning phoneme probabilities from highly parallel audio input, while Markov models can use the phoneme observation probabilities that neural networks provide to produce the likeliest phoneme sequence or word. This is at the core of the hybrid approach to natural language understanding.

a) Waveform: This is the most general way to represent a signal: variations of amplitude over time are presented. The biggest disadvantage of this method is that it does not expose speech-related information well; a raw time-domain signal contains too much irrelevant data to be used directly for classification.

Figure 5.3: Time domain representation of the words ‘left’ and ’one’

Figure 5.4: Spectrogram of the words ‘left’ and ’one’

b) Spectrogram: A better representation domain is the spectrogram, which shows the change in the amplitude spectrum over time. It has three dimensions: the X-axis is time (ms), the Y-axis is frequency, and the Z-axis (colour intensity) represents magnitude. The complete sample is split into different time frames (with a 50% overlap), and for every time frame the short-term frequency spectrum is calculated. Although the spectrogram provides a good visual representation of speech, it still varies significantly between samples: samples never start at exactly the same moment, words may be pronounced faster or slower, and they may have different intensities at different times.
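The following is a small, hedged sketch of computing such a spectrogram with 50% overlapping frames using scipy; the toolkit and the synthetic chirp signal (standing in for a recorded word) are assumptions for illustration.

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                                        # sample rate in Hz
t = np.arange(0, 1.0, 1 / fs)
signal = np.sin(2 * np.pi * (200 + 400 * t) * t)  # frequency rising over time

nperseg = int(0.025 * fs)                         # 25 ms frames
freqs, times, Sxx = spectrogram(signal, fs=fs, window="hamming",
                                nperseg=nperseg, noverlap=nperseg // 2)  # 50% overlap
# Sxx[f, t] holds the magnitude for each frequency bin and time frame,
# i.e. the three dimensions (time, frequency, magnitude) described above.
print(freqs.shape, times.shape, Sxx.shape)
```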
A. Multilayer Feedforward Network
The first type of neural network used for speech classification is a Multilayer Feedforward Network trained with the Back Propagation algorithm. This is the most popular type of NN and is used worldwide in many different kinds of applications. Our network consists of an input layer, one hidden layer and an output layer. The hidden layer consists of neurons with a non-linear sigmoidal activation function. First, the Oja rule of thumb is applied to make an initial guess at how many hidden-layer neurons are required:

H = T / (5 × (N + M))

where H is the number of hidden-layer neurons, N is the size of the input layer, M is the size of the output layer, and T is the size of the training set.
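As a hedged sketch of such a network, the example below uses scikit-learn's MLPClassifier with sigmoidal hidden units and gradient-based training; the toolkit, the random stand-in features and the sizes (T = 2000 samples, N = 13 inputs, M = 10 classes, so H = 2000 / (5 × 23) ≈ 17 by the rule of thumb) are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 13))              # toy feature vectors (N = 13)
y = rng.integers(0, 10, size=2000)           # toy labels for M = 10 word classes

hidden = round(len(X) / (5 * (13 + 10)))     # Oja rule of thumb -> about 17 neurons

mlp = MLPClassifier(hidden_layer_sizes=(hidden,),
                    activation="logistic",   # sigmoidal hidden units
                    solver="sgd",            # gradient-based (back-propagation style) training
                    max_iter=200, random_state=0)
mlp.fit(X, y)
print("hidden neurons:", hidden, "training accuracy:", mlp.score(X, y))
```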
B. Radial Basis Function Network
Another approach to classifying the speech samples is to use a Radial Basis Function (RBF) Network. This network also consists of three layers: an input layer, a hidden layer and an output layer. The main difference in this type of network is that the hidden layer uses (Gaussian) mapping functions. RBF networks are mostly used for function approximation, but they can also solve classification problems. "Radial" means that the functions are symmetric around their centres; "basis functions" means that a linear combination of these functions can generate (approximate) an arbitrary function.
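The sketch below is a minimal NumPy implementation of this structure (Gaussian hidden units around chosen centres, followed by a linear output layer fitted by least squares); the number of centres, the shared width and the toy data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))                      # toy feature vectors
y = (X[:, 0] > 0).astype(int)                       # toy two-class labels

centres = X[rng.choice(len(X), 20, replace=False)]  # 20 hidden units, centres taken from data
width = 2.0                                         # shared Gaussian width

def hidden_activations(data):
    # Gaussian response of each hidden unit, symmetric around its centre.
    dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-(dists ** 2) / (2 * width ** 2))

# Output layer: a linear combination of the basis functions, fitted by least squares.
H = hidden_activations(X)
targets = np.eye(2)[y]                              # one-hot targets
weights, *_ = np.linalg.lstsq(H, targets, rcond=None)

predictions = hidden_activations(X) @ weights
accuracy = np.mean(predictions.argmax(axis=1) == y)
print("training accuracy:", accuracy)
```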

Chapter 6

CONCLUSION

This report shows that neural networks can be very powerful speech-signal classifiers. A small set of words could be recognized with some very simplified models. The quality of the pre-processing has the biggest impact on neural network performance. In cases where the spectrogram combined with entropy-based endpoint detection was used, we observed poor classification performance, making this combination a poor strategy for the pre-processing stage. On the other hand, we observed that Mel Frequency Cepstrum Coefficients are a very reliable tool for the pre-processing stage, given the good results they provide. Both the Multilayer Feedforward Network with the back-propagation algorithm and the Radial Basis Function Neural Network achieve satisfying results when Mel Frequency Cepstrum Coefficients are used. The combination of an artificial neural network and a Hidden Markov Model in speech recognition has been found to work about as well as the more traditional Gaussian Mixture Model approach. The pros include fast classification after training and the possibility of solving more complex problems with specialized networks, but there are also drawbacks such as long training times and possible instability outside the range of the training data.

References

[1] Thiang and Suryo Wijoyo, “Speech Recognition Using Linear Predictive
Coding and Artificial Neural Network for Controlling Movement of Mo-
bile Robot”, in Proceedings of International Conference on Information
and Electronics Engineering (IPCSIT), Singapore, IACSIT Press, Vol.6,
2011, pp.179-183.

[2] Cini Kurian and Kannan Balakrishnan, "Development and evaluation of different acoustic models for Malayalam continuous speech recognition", in Proceedings of the International Conference on Communication Technology and System Design 2011, Elsevier, December 2011, pp.1081-1088.

[3] Suma Swamy and K.V. Ramakrishnan, "An Efficient Speech Recognition System", Computer Science and Engineering: An International Journal (CSEIJ), Vol.3, No.4, DOI: 10.5121/cseij.2013.3403, August 2013, pp.21-27.

[4] Md. Akkas Ali, Manwar Hossain, Mohammad Nuruzzaman Bhuiyan,


“Automatic Speech Recognition Technique for Bangla Words”, Interna-
tional Journal of Advanced Science and Technology, Vol. 50, January,
2013, pp.51-60.

[5] Mousmita Sarma, Krishna Dutta and Kandarpa Kumar Sarma, "Assamese Numeral Corpus for Speech Recognition using Cooperative ANN Architecture", International Journal of Electrical and Electronics Engineering, Vol.3, Issue 8, Nov 2009, pp.456-465.

21
[6] Purnima Pandit, Shardav Bhatt “Automatic Speech Recognition of Gu-
jarati digits using Dynamic Time Warping”, International Journal of
Engineering and Innovative Technology (IJEIT), Vol.3, Issue 12, ISSN:
2277-3754, June 2014, pp.69-73.

[7] G. Saha, Sandipan Chakroborty and Suman Senapati, "A New Silence Removal and Endpoint Detection Algorithm for Speech and Speaker Recognition Applications".

[8] Lawrence R. Rabiner and Biing-Hwang Juang, "Fundamentals of Speech Recognition", Prentice Hall, 1993.

[9] Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays and Johan Schalkwyk, "Google voice search: faster and more accurate", September 2015.

[10] I. Mohamed Kalith, David Ashirvatham and Samantha Thelijjagoda, "Isolated to Connected Tamil Digit Speech Recognition System Based on Hidden Markov Model", International Journal of New Technologies in Science and Engineering, Vol.3, Issue 4, ISSN: 2231-5381, April 2016, pp.51-60.

[11] K. Waheed, K. Weaver and F.M. Salam, "A robust algorithm for detecting speech segments using an entropic contrast", Circuits and Systems (MWSCAS), Vol.3, pp.328-331, Michigan State University, 2002.

[12] Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, 1989.

   
 

THIS CERTIFICATE IS PROUDLY PRESENTED TO

Gudipati Sai

participated in "Artificial Intelligence-Personifwy" from 11th Nov, 2021 to 11th Jan, 2022 and successfully completed the program.

Date: 12-Jan-2022
Certificate ID: T-IITB-2201000543
Paul Mathew. I, Overall Coordinator

You might also like