Documentation (AA20)
Bachelor of Technology
in
Department of Computer Science and Engineering
(Artificial Intelligence and Machine Learning)
by
Mohammad Arif 21241A6645
Jadala Sriram 21241A6628
Kandimalla Dhanush Kumar 21241A6634
Under the guidance of
Dr. Sanjeeva Polepaka
Associate Professor
GOKARAJU RANGARAJU INSTITUTE OF ENGINEERING AND
TECHNOLOGY
(Autonomous) Hyderabad-500090
CERTIFICATE
This is to certify that the mini project entitled “Deciphering Speech: A Deep Learning Approach to Lip Reading”
External Examiner
ACKNOWLEDGEMENT
There are many people who helped us directly and indirectly to complete our project successfully.
We would like to take this opportunity to thank one and all. First, we would like to express our
deep gratitude towards our internal guide Dr. Sanjeeva Polepaka, Associate Professor,
Department of Computer Science and Engineering (Artificial Intelligence and Machine Learning)
for his support in the completion of our dissertation. We are thankful to mini project coordinator
Mr. B. Rajasekhar, Assistant Professor, for his valuable suggestions and comments during this
project period.
We wish to express our sincere thanks to Dr. G. Karuna, Head of the Department, and
to our principal Dr. J. PRAVEEN, for providing the facilities to complete the dissertation. We
would like to thank all our faculty and friends for their help and constructive criticism during the
project period. Finally, we are very much indebted to our parents for their moral support and
encouragement to achieve goals.
DECLARATION
We hereby declare that the mini project titled “Deciphering Speech: A Deep Learning
Approach to Lip Reading” is the work done during the period from 6th February 2024
to 29th June 2024 and is submitted in partial fulfillment of the requirements for the award
of the degree of Bachelor of Technology in Computer Science and Engineering (Artificial
Intelligence and Machine Learning) from Gokaraju Rangaraju Institute of Engineering and
Technology (Autonomous under Jawaharlal Nehru Technological University, Hyderabad). The
results embodied in this project have not been submitted to any other University or Institution
for the award of any degree or diploma.
ABSTRACT
This project aims to develop a lip-reading model using deep learning techniques, specifically
convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The model is
trained on a dataset comprising 450 videos of a single speaker, with each video capturing the
speaker articulating various phrases. The objective is to accurately recognize spoken words
based solely on visual input of the speaker's lip movements. The deep learning architecture is
designed to extract and analyze spatiotemporal features from the video frames, leading to a
robust recognition system. The model shows significant potential in applications such as silent
communication and assistive technologies for the hearing impaired.
LIST OF FIGURES
LIST OF TABLES
Table No. | Table Name | Page No.
LIST OF ACRONYMS
Acronym | Full Form
TABLE OF CONTENTS
Acknowledgement iii
Declaration iv
Abstract v
List of Figures vi
1 Introduction 1
2 Literature Survey 10
2.1 Summary of existing approaches 10
2.2 Drawbacks of existing approaches 21
3 Proposed Method 24
3.1 Problem Statement and Objectives 24
3.2 Detailed Explanation of Architecture Diagram 27
3.3 Modules Connectivity Diagram 30
3.4 Software and Hardware Requirements 31
3.6 Testing 36
4 Results and discussions 38
4.2 Detailed Explanation about the Experimental Results 40
6 Appendices 45
CHAPTER 1
INTRODUCTION
1.1 Deciphering Speech: A Deep Learning Approach to Lip Reading
Following accurate lip detection and tracking, feature extraction is the next critical step.
This process identifies and captures relevant information from lip movements indicative of
speech-related gestures. Effective feature extraction methods must distinguish subtle variations
in lip shapes and motions associated with different phonemes and words. Advanced techniques
like convolutional neural networks (CNNs) are highly effective in capturing intricate spatial
features from images.
Deep learning architectures, particularly CNNs and recurrent neural networks (RNNs),
play a pivotal role in interpreting and classifying lip movements. CNNs process spatial
information, making them ideal for analysing static features of the lip region, while RNNs handle
temporal sequences, understanding dynamic aspects of lip movements over time. Combining
these neural networks allows the system to comprehensively understand both spatial and temporal
dimensions of speech.
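As a minimal illustration of this combination (not the exact architecture used in this project), the sketch below wires a per-frame CNN into a bidirectional LSTM with Keras. The input shape of 75 grayscale lip crops of 46x140 pixels follows the appendix code, while the layer sizes and the 41 output classes are assumptions.

# Illustrative sketch only: a per-frame CNN (spatial features) feeding an LSTM
# (temporal modelling). Layer sizes and class count are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_rnn(num_classes: int = 41) -> tf.keras.Model:
    frames = layers.Input(shape=(75, 46, 140, 1))            # (time, H, W, channels)
    x = layers.TimeDistributed(layers.Conv2D(32, 3, activation='relu', padding='same'))(frames)
    x = layers.TimeDistributed(layers.MaxPool2D(2))(x)        # spatial downsampling per frame
    x = layers.TimeDistributed(layers.Flatten())(x)           # one feature vector per frame
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)  # temporal context
    outputs = layers.Dense(num_classes, activation='softmax')(x)          # per-frame class scores
    return models.Model(frames, outputs)

model = build_cnn_rnn()
model.summary()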
To ensure the system's effectiveness and reliability, rigorous evaluation is essential. This
involves testing on benchmark datasets using standard performance metrics such as word error
rate (WER) and accuracy. Benchmarking objectively assesses the system's capabilities and
compares its performance with existing solutions. Additionally, extensive real-world testing
validates the system's robustness across different conditions and environments, including varied
lighting and noise levels and diverse populations. This helps identify limitations and areas for
improvement, ensuring the system can generalize well beyond controlled settings.
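As a concrete example of one such metric, a minimal word error rate (WER) computation based on word-level edit distance is sketched below.

# Minimal WER sketch: Levenshtein distance between reference and hypothesis
# word sequences, divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("bin blue at f two now", "bin blue at f two now"))  # 0.0
print(wer("bin blue at f two now", "bin blue by f two now"))  # 1/6 ≈ 0.167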
1.2.2 Robustness
• Ensure the model remains effective under various conditions, such as different lighting
environments, camera angles, and distances.
• Maintain high performance despite background noise and visual obstructions.
1.2.4 Real-Time Processing
• Design the model for real-time processing to enable immediate feedback and interaction.
• Optimize algorithms for low latency and efficient computation to support live applications,
such as video conferencing and assistive technologies.
1.2.5 Scalability
• Develop a model that can handle large and diverse datasets, ensuring it performs well with
various speakers, accents, and languages.
• Implement efficient training and inference mechanisms to scale across different devices
and platforms
1.2.6 Generalization
• Ensure the model generalizes well to new, unseen speakers and diverse linguistic contexts.
• Avoid overfitting to specific datasets by using regularization techniques and diverse
training data.
1.2.7 Error Reduction
• Focus on reducing errors in recognizing challenging words and phrases, especially those
that are visually similar.
• Conduct thorough error analysis to identify and address common sources of mistakes,
enhancing the model's reliability
1.2.8 User-Friendly
• Create an intuitive and accessible interface for users, making the technology easy to deploy
and use in real-world applications.
• Ensure the model can be integrated seamlessly into various applications, such as assistive
devices for the hearing impaired, security systems, and silent communication tools.
1.3. Methodology
• Video and Audio Sources: Collect large datasets of synchronized video and audio recordings
of people speaking. Use publicly available datasets like LRS2 and LRS3.
• Annotations: Ensure the datasets are well-annotated with transcriptions of spoken words.
• Face and Lip Detection: Use face detection algorithms to identify and crop the lip region from
each frame.
• Normalization: Normalize the lip region images to a consistent size and scale.
• Data Augmentation: Apply techniques such as cropping, rotation, time masking, and noise
addition to create a more robust training set (a short augmentation sketch follows this list).
• Pre-trained CNN Models: Utilize pre-trained CNN architectures like VGG19 and ResNet50
to extract spatial features from the keyframes.
• Temporal Feature Extraction: Capture temporal characteristics using models like Bi-
directional Gated Recurrent Units (BGRUs) and Dilated Convolutional Temporal
Convolutional Networks (DC-TCNs).
• CNN and RNN Integration: Combine the spatial features from CNNs with temporal models
such as LSTM or attention-based LSTM networks to learn the relationship between visual and
auditory information.
• Attention Mechanism: Incorporate an attention mechanism to focus on key frames and
enhance robustness against image translation, rotation, and distortion.
• Ensemble Learning: Implement ensemble learning by combining predictions from multiple
models to improve performance.
• Self-Distillation: Use self-distillation techniques to improve the model by using the model's
own predictions as additional training data.
• Word Boundary Indicators: Incorporate word boundary indicators to help the model recognize
where words start and end, improving accuracy.
• Performance Metrics: Evaluate the model using metrics such as accuracy, precision, recall,
and F1-score.
• Error Analysis: Perform thorough error analysis to identify and address common mistakes,
focusing on challenging words and visually similar lip movements.
• Real-Time Processing: Optimize the model for real-time processing to enable immediate
feedback and interaction.
• User Interface: Develop a user-friendly interface for practical deployment in applications such
as assistive devices, security systems, and silent communication tools.
• Feedback Loop: Implement a feedback loop where the model learns from new data and user
interactions to continuously improve its performance.
• Scalability: Ensure the model can handle increasing amounts of data and adapt to new speakers
and languages efficiently.
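The following sketch (referred to from the data augmentation step above) applies crop jitter, time masking, and additive noise to a dummy frame sequence; the parameter values are illustrative assumptions rather than the settings actually used.

# Illustrative augmentation sketch for a (T, H, W, C) frame sequence.
import numpy as np

def augment_clip(frames: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    t, h, w, c = frames.shape
    out = frames.astype(np.float32).copy()
    # 1) small random spatial crop followed by zero-padding back to (h, w)
    dy, dx = rng.integers(0, 4, size=2)
    out = np.pad(out[:, dy:, dx:, :], ((0, 0), (dy, 0), (dx, 0), (0, 0)))
    # 2) time masking: blank a short random span of frames
    start = rng.integers(0, max(t - 5, 1))
    out[start:start + 5] = 0.0
    # 3) additive Gaussian noise
    out += rng.normal(0.0, 0.01, size=out.shape)
    return out

clip = np.zeros((75, 46, 140, 1), dtype=np.float32)   # dummy clip with the appendix dimensions
augmented = augment_clip(clip, np.random.default_rng(0))
print(augmented.shape)  # (75, 46, 140, 1)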
1.4. Architecture diagram
The architecture of a lip-reading model is made up of several sub-modules, each with its own
specialized function in the lip-reading pipeline. The architecture diagram of a lip-reading
model can be broken down into the following stages.
1.4.1 Input Stage
Video Frames: Raw video frames are the individual frames captured from a video input source.
Taken in sequence, they show how the lip shapes transition as the speaker articulates.
1.4.2 Preprocessing Stage
Face Detection: Facial landmarks such as the eyes, eyebrows, and lips are identified in each
frame so that the frame can be aligned; the MTCNN (Multi-task Cascaded Convolutional
Networks) algorithm then crops the frame to the face region of interest (ROI).
Lip Localization: Within the detected face, the lip area is localized further, often using the
facial landmarks.
1.4.3 Frame Normalization
The lip region detected in each video frame is resized to a fixed size and normalized (for
example, mean-centered) so that all inputs share a consistent scale.
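A minimal preprocessing sketch is shown below; it uses OpenCV's Haar cascade face detector as a simple stand-in for MTCNN and takes the lower third of the face box as a rough lip region, with the crop heuristic and target size being illustrative assumptions.

# Preprocessing sketch: Haar-cascade face detection (stand-in for MTCNN),
# rough lip crop, resize, and mean/std normalization.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def extract_lip_region(frame_bgr: np.ndarray, size=(140, 46)):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                    # no face found in this frame
    x, y, w, h = faces[0]
    lips = gray[y + 2 * h // 3 : y + h, x : x + w]     # lower third of the face box
    lips = cv2.resize(lips, size).astype(np.float32)
    return (lips - lips.mean()) / (lips.std() + 1e-6)  # zero-mean, unit-variance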
1.4.4 Feature Extraction Stage
Convolutional Neural Networks (CNNs): The preprocessed frames are fed to a CNN, which
produces spatial features for each frame. Standard architectures such as VGG or ResNet can be
used, or a new architecture can be designed when only lip features are the focus.
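As an illustration of reusing a pre-trained backbone (an assumption for this sketch rather than the project's final choice), the snippet below extracts one 2048-dimensional feature vector per frame with ResNet50.

# Per-frame spatial feature extraction with a frozen, pre-trained ResNet50.
import tensorflow as tf

backbone = tf.keras.applications.ResNet50(include_top=False, weights='imagenet',
                                           pooling='avg', input_shape=(224, 224, 3))
backbone.trainable = False                      # use the backbone as a fixed feature extractor

frames = tf.random.uniform((75, 224, 224, 3))   # dummy clip: 75 frames resized to 224x224 RGB
features = backbone(tf.keras.applications.resnet50.preprocess_input(frames * 255.0))
print(features.shape)                           # (75, 2048): one 2048-d vector per frame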
1.4.5 Training the model
Recurrent Neural Networks (RNNs): Because lip movements have a temporal nature, with specific
movements occurring at particular points in an utterance, the per-frame feature vectors are
processed by an RNN such as an LSTM or a GRU.
Temporal Convolutional Networks (TCNs): Alternatively, TCNs may be used to capture long-range
relations across the frames of the sequence.
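A minimal TCN-style sketch is given below: stacked 1D convolutions with increasing dilation so that later layers see a longer span of frames. The filter counts and dilation rates are illustrative, not the DC-TCN configuration from the literature.

# TCN-style block: dilated causal 1D convolutions over per-frame features.
import tensorflow as tf
from tensorflow.keras import layers

def temporal_conv_stack(feature_dim: int = 512, num_classes: int = 41) -> tf.keras.Model:
    seq = layers.Input(shape=(None, feature_dim))        # (time, per-frame feature vector)
    x = seq
    for dilation in (1, 2, 4, 8):                        # receptive field grows with each layer
        x = layers.Conv1D(256, kernel_size=3, padding='causal',
                          dilation_rate=dilation, activation='relu')(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return tf.keras.Model(seq, outputs)

tcn = temporal_conv_stack()
print(tcn(tf.random.uniform((1, 75, 512))).shape)        # (1, 75, 41)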
1.4.6 Sequence to Sequence Modeling
Encoder-Decoder Architecture: Sequence-to-sequence learning can be implemented with an
encoder-decoder framework: the encoder extracts the features needed to encode the input
sequence, while the decoder generates the output sequence, for instance words or phonemes.
1.4.7 Output Stage
Fully Connected Layers: The output of the temporal modeling stage is passed through fully
connected layers that map the features to the desired output space (e.g., character
probabilities).
SoftMax Activation: A SoftMax layer at the end of the network outputs a probability
distribution over each character or phoneme class.
This report gives a brief and concise overview of the topics discussed, in the sequence in
which they are presented.
Chapter 1: Introduction
This chapter introduces the project and its use case, explains how it benefits its users, and
describes the basic working of the overall system.
Chapter 2: Literature Survey
This chapter discusses the existing approaches to the problem, along with their advantages and
drawbacks. It provides the required background and the momentum to carry out the project.
Chapter 3: Proposed Method
This chapter describes the logical sequence in which the problem is solved and the methods
adopted to solve it.
Chapter 4: Results and Discussions
This chapter presents the dataset, the experimental results obtained with the trained model,
and the significance and advantages of the proposed method.
Chapter 5: Conclusion and Future Enhancements
This chapter summarizes the work and outlines future enhancements such as live webcam input,
multi-speaker support, a user-friendly interface, and multilingual translation.
CHAPTER 2
LITERATURE SURVEY
2.1 Summary of Existing Approaches
Pingchuan Ma, Yujiang Wang [1] In deep learning-based lip reading, lip movements that encode
speech are interpreted using neural networks. Two families of techniques are essential:
recurrent neural networks (RNNs), often combined with Long Short-Term Memory (LSTM) units to
model the temporal dependencies of lip movement sequences, and convolutional neural networks
(CNNs), which acquire spatial features from the frames of the video sequence. These techniques
are strengthened by large annotated datasets and established architectures such as VGG-Net and
ResNet. In addition, considerable incremental improvements in accuracy and robustness have been
achieved with combined or hybrid CNN-RNN approaches that enhance visual speech recognition (VSR).
Mutallip Mamut, Nurbiya Yadikar [3] Data augmentation, which enriches datasets by
adding variables like scaling, rotation, and flipping to boost model generalization, is one
training strategy for better lip-reading using deep learning. Another significant approach is
transfer learning, which involves fine-tuning models pre-trained on sizable datasets for
particular lip-reading tasks. This cuts down on training time and improves performance.
Moreover, methods including adjusting hyperparameters, applying more complex loss
functions, and putting ensemble approaches into practice have been used to increase the
precision and resilience of lip-reading systems.
Atharva Karekar, Aakansha Gharate [4] AUTO-AVSR or Audio-Video Speech
Recognition with Automatic Labels improves accuracy of speech recognition through
integration of audio and video streams. Traditional solutions in this area include manual
tagging, which is a very lengthy process and carries an inherent risk of a high error rate.
AUTO-AVSR, a methodology based on deep learning algorithms, provides an effective way to
replace the time-consuming manual labeling process. It can improve speech identification
because it learns to associate lip movements with the corresponding
audio signals using audio-visual datasets.
spatial characteristics of the videos and audio, this method also incorporates more advanced
neural networks, including LSTM units and CNNs.
Priyanshu Aggarwal [5] Deep learning has been thoroughly investigated in lip-reading
technology research to understand speech using visual clues from lip movements.
Convolutional neural networks (CNNs) are a key tool for extracting spatial features, while
recurrent neural networks (RNNs), especially Long Short-Term Memory (LSTM) networks, are
a key tool for collecting temporal sequences. CNN and RNN hybrid models have demonstrated
remarkable success. Performance has been further improved by training tactics such as data
augmentation, transfer learning, and automated labeling (as in AUTO-AVSR). Large datasets
with annotations and complex architectures such as VGG Net and ResNet have made
tremendous progress in improving the accuracy and resilience of lip-reading systems.
Nikita Deshmukh [6] There are also lip-reading methods derived from two-stream approaches,
which consider both lip images and their temporal changes with CNNs to improve speech
recognition. Lip maps and lip contours are used to quantize the motions and transitions of lip
images in a way that helps track the dynamics involved in speaking. CNNs then learn such
features, together with the spatial and temporal hierarchy that exists in most lip-reading
datasets. By focusing on temporal rather than purely spatial differences in lip movement, the
model is in a position to distinguish between lip movements that look similar but correspond to
entirely different spoken phrases. Consequently, a system that combines CNNs with dynamic
feature extraction has much greater potential than one that relies on conventional lip-reading
approaches alone.
Gaoyan Zhang [7] Using appearance-based visual features and deep learning techniques,
lip-reading via deep neural networks analyzes lip movements to interpret speech through
evaluating visual cues. This method captures the fine-grained visual characteristics of the lips
by extracting appearance-based elements from video frames. Typically, these data are processed
using Convolutional Neural Networks (CNNs), which learn the spatial patterns corresponding
to various speech sounds. This approach is capable of identifying speech from visual data only,
focusing on the look of the lips instead of dynamic changes. By using this method, lip-reading
systems perform better and become more dependable.
Karan Shrestha [8] Lip-reading via deep neural networks using appearance-based visual
features involves utilizing deep learning techniques to interpret speech by analyzing visual cues
from lip movements. This approach extracts appearance-based features from video frames,
capturing the detailed visual information of the lips. Convolutional Neural Networks (CNNs)
are typically employed to process these features, learning the spatial patterns associated with
different speech sounds. By focusing on the appearance of the lips rather than dynamic changes,
this method can effectively recognize speech from visual input alone. This technique enhances
the performance of lip-reading systems, making them more accurate and reliable.
Yiting Li, Yuki Takashima [9] Deep learning lip reading uses advanced neural network
designs to analyze lip movements visually and interpret speech. This method extracts spatial
data from video frames using Convolutional Neural Networks (CNNs) to capture the texture and
contour of the lips. To simulate the temporal dynamics of lip movements sequences, recurrent
neural networks (RNNs) are also used, frequently equipped with LSTM (Long Short-Term
Memory) units. These systems learn to recognize spoken words just from visual signals with
great accuracy by using massive data sets with labelled lip motions. The accuracy and efficacy
of lip-reading systems have been greatly enhanced by the inclusion of deep learning techniques,
expanding the range of possible real-world applications for these systems.
speech recognition. This approach aims to achieve high performance using minimal
computational resources, which is crucial for real-time applications on devices with limited
processing power. Efficient-Ghost Net integrates optimizations from both Efficient Net and
Ghost Net methodologies, focusing on reducing model complexity and parameter count while
maintaining accuracy. By leveraging these techniques, the algorithm extracts and interprets
meaningful features from lip images to accurately recognize spoken words. This research strives
to improve the efficiency and applicability of lip-reading systems in various technological
contexts.
Nergis Pervan Akman [11] Using advanced machine learning techniques, lip reading using
neural networks and deep learning involves separating spoken words from visual information
provided by lip movements. Typically, this method uses Recurrent Neural Networks (RNNs)
with LSTM units to capture temporal dynamics in lip movement sequences and Convolutional
Neural Networks (CNNs) for retrieving spatial features. Depending on the dataset and model
complexity, these models are trained on datasets with annotations to reach high accuracy,
usually between 70% and 90%. Deep learning has been integrated into lip-reading systems,
greatly enhancing their performance and reliability and enabling them to be used in a variety of
speech styles and settings.
Tayyip Özcan, Alper Basturk [12] In the context of lip-reading multiclass classification
with a Turkish dataset, the approach utilizes Dilated Convolutional Neural Networks (CNNs) to
interpret spoken words based on visual cues from lip movements. Dilated CNNs are chosen for
their ability to capture both local and global dependencies in the lip image sequences effectively.
The Turkish dataset provides annotated examples of lip movements corresponding to different
spoken words or phonemes, enabling supervised learning. This method aims to achieve accurate
classification by leveraging deep learning techniques to extract and analyze spatial features from
lip images. The research focuses on enhancing the precision and reliability of lip-reading
systems specifically tailored to Turkish speech patterns and contexts.
Souheil Fenghour [13] Lip reading using Convolutional Neural Networks (CNNs), both
with and without pre-trained models, explores the effectiveness of leveraging deep learning for
interpreting speech from lip movements. The approach involves training CNN architectures on
visual sequences of lip movements to extract spatial features. In one scenario, models are trained
from scratch without pre-existing weights, allowing them to learn directly from the lip-reading
task data. Alternatively, pre-trained CNN models, which have learned generic visual features
from large datasets like ImageNet, are fine-tuned on lip-reading datasets to enhance
performance. This comparative study aims to assess the benefits of transfer learning in
improving accuracy and efficiency in lip-reading tasks, thereby advancing the capabilities of
automated speech recognition systems based on visual cues.
The project on lip reading sentences using deep learning exclusively with visual cues
focuses on interpreting spoken language solely from the visual information of lip movements.
This approach employs Convolutional Neural Networks (CNNs) to extract spatial features from
video frames depicting lip motion sequences. By training on annotated datasets containing
examples of lip movements corresponding to spoken sentences, the model learns to recognize
and transcribe words without relying on audio information. This method aims to enhance
accessibility for individuals with hearing impairments and improve the accuracy of automated
speech recognition systems in noisy environments where audio signals may be compromised.
Table 2.1.1: Literature Survey

Ref | Author(s) | Title | Methodology | Year | Accuracy
[3] | Mutallip Mamut, Nurbiya Yadikar | Training strategies for improved lip-reading | Combines cropping and time masking for data augmentation, BGRUs and DC-TCNs for temporal modelling, and employs self-distillation and word boundary indicators during training | 2022 | 83.4%
[5] | Priyanshu Aggarwal | A survey of research on lip reading technology | Outlines the classification methods employed, such as Template Matching, DTW, HMM, SVM, and TDNN | 2020 | 87.55%
[8] | Karan Shrestha | Lip reading using deep learning | Deep learning models such as CNNs and RNNs are trained on the preprocessed dataset to learn the relationship between visual and auditory information | 2022 | 74.9%
[9] | Yiting Li, Yuki Takashima | Research on a lip-reading algorithm based on Efficient-GhostNet | Proposes an optimization approach based on GhostNet, a lightweight network architecture, enhancing it to create an even more efficient model named Efficient-GhostNet | 2019 | 76.3%
[10] | Fatemeh Vakhshiteh | Lip reading using neural networks and deep learning | A Haar feature-based cascade classifier detects the face and mouth region in each input video; the preprocessed data is then used to train a 3D CNN lip-reading model | 2017 | 77.14%
[12] | Tayyip Özcan, Alper Basturk | Lip reading using convolutional neural networks with and without pre-trained models | Lip reading from video is performed using the CNN technique; the standard and AVLetters datasets are used for training and testing | 2019 | 64.40%
[13] | Souheil Fenghour | Lip2audspec: Speech reconstruction from silent lip movements video | CNN and LSTM models are used to train the system and reconstruct speech from lip reading | 2018 | 79%
2.2 Drawbacks of Existing Approaches
• Existing vision-based lip-reading systems using deep learning face challenges due to
computational demands of complex models like VGG19 and ResNet50, limiting real-time
application. Dependency on pre-trained models may hinder accuracy without fine-tuning for lip-
reading tasks. Ensemble learning, while improving performance, adds complexity in model
integration and increases computational overhead. Despite achieving 85% accuracy, these
systems may lack robustness against variations in lighting, facial expressions, and diverse speech
patterns encountered in real-world settings.
• Training strategies for improved lip-reading face challenges in complexity due to methods like
cropping, time masking, BGRUs, and DC-TCNs, requiring substantial computational resources.
Dependency on specific techniques such as self-distillation and word boundary indicators may
limit generalization beyond training conditions. While achieving 93.4% accuracy, scalability in
real-world scenarios with diverse speech styles and environments remains a concern,
necessitating robust validation and accessibility to large, varied datasets for effective
implementation.
• Lip reading research utilizes diverse methods (Template Matching, DTW, HMM, SVM, TDNN)
with varying accuracy and computational efficiency. Challenges include robust feature
extraction from lip movements, impacting performance. Integrating new technologies improves
accuracy but increases complexity and computational requirements. Achieving 87.55% accuracy
highlights potential, but reliance on large-scale databases for training poses challenges in data
management and accessibility for broader deployment.
• Lip reading using dynamic features and CNNs faces challenges despite achieving 71.76%
accuracy. Issues include complex image processing requirements, sensitivity to image quality
affecting alignment and clarity, and limitations in generalizing to diverse real-world conditions
beyond specific variations like translation and rotation. Acquiring and annotating large, varied
datasets remains crucial for improving robustness and overcoming training data constraints in
practical applications.
• Lip-reading systems using Deep Neural Networks (DNNs) and appearance-based visual features
face challenges despite achieving 45.63% accuracy. Issues include limited accuracy in
interpreting lip movements and recognizing words, highlighting the need for enhanced feature
extraction and model refinement. Dependency on high-quality lip images for effective
performance also remains a significant concern for real-world application and robustness.
• Lip reading with deep learning, leveraging CNNs and RNNs on preprocessed datasets, has
notably enhanced accuracy compared to traditional methods. Challenges include the high
computational demands for training and the critical dependency on the quality and diversity of
training data. Real-time application feasibility remains a concern due to these computational
requirements.
• Efficient-Ghost Net for lip reading achieves 76.3% accuracy but faces challenges. These include
potential difficulties in generalizing to diverse lighting, facial expressions, and speech styles not
well-represented in training. Implementing and optimizing the architecture require specialized
expertise, and effective performance hinges on access to large, diverse datasets for robust
training and validation.
• Lip reading with neural networks and deep learning achieves 77.14% accuracy using Haar
Feature-Based Cascade for face and mouth detection, followed by training with a 3D CNN
architecture. Despite its accuracy, challenges include potential limitations in handling diverse
facial orientations and expressions not adequately represented in training data, and the need for
robustness in real-world environments with varying lighting conditions and speech styles.
• Using Dilated Convolutional Neural Networks (DCNN) for lip reading achieves 58.90%
accuracy with a Turkish dataset but faces challenges. These include limited accuracy in
multiclass classification, reliance on dataset expansion for improved performance, and
complexities in preprocessing strategies that may affect scalability and real-time implementation
in diverse environments.
• Lip reading with CNNs achieves 64.40% accuracy using both standard and AVLetters datasets,
with and without pre-trained models. Challenges include potential limitations in accurately
capturing nuanced lip movements and variability in different speaking styles and environments.
Improving robustness and generalization remains crucial for practical deployment in varied real-
world scenarios.
• Lip2audspec synthesizes speech from silent lip movements with 79% accuracy using CNN and
LSTM models. Challenges include potential difficulties in accurately capturing speech nuances
and variations in real-world noisy environments, affecting robustness and reliability in practical
applications.
• Lip reading sentences using only visual cues achieves 64.04% accuracy by classifying Visemes
and employing perplexity analysis for word conversion. Challenges include limitations in
accurately transcribing varied speech patterns and accounting for diverse environmental
conditions not sufficiently covered in training data, affecting overall robustness and applicability
in real-world scenarios.
CHAPTER 3
PROPOSED METHOD
3.1. Problem Statement and Objectives
The problem statement and the objectives of the project are discussed in this section
3.1.1 Problem Statement
An estimated 466 million people worldwide live with disabling hearing impairment, which creates
a significant barrier to everyday interaction. Lip reading can be particularly helpful for
hearing-impaired people, especially in situations with heavy acoustic interference where it can
outperform conventional hearing aids. However, current lip-reading models suffer from
limitations such as low accuracy and poor adaptability, which restrict their practical use.
Within the scope of this project, a new lip-reading model is therefore proposed that builds on
existing deep learning algorithms with the goal of increasing both the accuracy of the model
and its robustness under varying conditions. The aim is to remove at least some of the
shortcomings still found in existing systems and to improve the quality of communication for,
and with, the hearing-impaired community.
The project focuses on designing an optimal lip-reading system that can accurately interpret
facial movements, in particular those of the lips and mouth, so that communication remains
possible irrespective of the level of noise in the environment. It seeks to exploit the
potential of deep learning techniques, which have proved effective in a wide range of visual
and auditory tasks, to learn lip movement recognition as a function of speaker and context.
In the long run, this project is intended to help create means by which hearing people can
communicate effectively with hearing-impaired and disabled members of society, and to present
spoken speech in a form that the hearing impaired can comprehend and rely on. This would also
help them attain a better quality of life as they seek employment and, in doing so, reduce the
difficulties faced by disabled persons in social and working contexts.
3.1.2 Objectives
The creation of a novel lip-reading model for the hearing-impaired entails the following
objectives, all of which share the common aim of improving communication where hearing is
impaired. The refined objectives are:
Achieve High Accuracy: Build a model that interprets lip movements with a high level of
accuracy, reliably recovering the actual spoken language across a variety of speakers.
Ensure Environmental Robustness: Design the model to work well in different environments,
including poor or artificial lighting and noisy settings, so that it mirrors real-world
conditions.
Adapt to Diverse Speakers: Make the lip-reading system highly adaptive so that it handles the
range of lip movements, facial movements, and speaking patterns found across different groups
of people.
Support Scalability: Create a model that can be integrated into various platforms, such as
telecommunication applications and mobile devices, as well as software that assists the
disabled in their day-to-day activities, so that it remains versatile across a broad range of
applications.
Provide a User-friendly Experience: Make the technology easy to use; since the primary goal is
lip reading, it should not require the user to go through a long learning process in order to
make good use of it.
Integrate with Assistive Technologies: Ensure integration with current assistive devices such
as hearing aids and speech-to-text software, making the solution a complete communication tool.
Facilitate Continuous Learning: Implement frameworks for the model’s continuous
development, ensuring incoming data and user feedback are incorporated to enhance and update
the model.
Maintain Cultural and Linguistic Sensitivity: The model should take culture and language
differences into consideration so that it is universally acceptable for all users of different ethnic
origin and language abilities.
Prioritize Privacy and Security: Put measures in place to ensure that data privacy and security
are observed, in adherence to the highest ethical standards.
3.2. Detailed Explanation of Architecture Diagram
Lip reading involves several critical steps, each contributing to the accurate interpretation of
spoken language through visual cues from lip movements. Below is a detailed breakdown of
each step in the lip-reading process:
3.2.1 Input Video
The process begins with an input video that captures the speaker’s face, specifically focusing on
the lip region. This video serves as the raw data from which visual speech cues will be extracted.
The quality and resolution of the video are important factors, as they affect the clarity of the lip
movements and, consequently, the model's performance.
3.2.2 Frame Conversion
In this step, the input video is divided into a sequence of individual frames. Each frame represents
a single moment in time, capturing the position and shape of the lips as the speaker articulates
different sounds. This conversion is crucial as it transforms the continuous video stream into
discrete units that can be processed by the model.
3.2.3 Preprocessing
Preprocessing involves several sub-steps to prepare the frames for feature extraction:
Lip Detection and Cropping: The lip region is detected in each frame, and the area surrounding
the lips is cropped to focus on the relevant part of the face. This ensures that the model analyzes
only the necessary visual information.
Normalization: The cropped frames are normalized to a standard size and scale. Normalization
adjusts the pixel values to a common range, improving consistency across frames and enhancing
the model's ability to learn from the data.
Data Augmentation: Techniques such as rotation, scaling, and flipping may be applied to the
frames to create a more diverse training dataset, helping the model generalize better to different
speakers and conditions.
3.2.4 Feature Extraction
Feature extraction is performed using Convolutional Neural Networks (CNNs), which are
effective in identifying and capturing spatial patterns in images:
Convolutional Layers: These layers apply filters to the input frames to detect features such as
edges, textures, and shapes. The convolutional process results in feature maps that highlight
important visual details of the lip movements.
Pooling Layers: Pooling operations reduce the dimensionality of the feature maps, retaining the
most significant information while making the data more manageable for the subsequent layers.
3.2.5 Training the Model
The training phase involves teaching the model to recognize and interpret lip movements:
Loss Function and Optimization: The model is trained using a loss function that measures the
difference between the predicted and actual outputs. Optimization algorithms, such as Adam or
SGD (Stochastic Gradient Descent), are employed to minimize this loss, adjusting the model’s
parameters to improve accuracy.
Once trained, the model is evaluated on a separate test dataset that it has not seen during training.
This step assesses the model's ability to generalize to new, unseen data:
Performance Metrics: Metrics such as accuracy, precision, recall, and F1 score are used to
quantify the model’s performance. These metrics help determine how well the model can
interpret lip movements and convert them into text.
The final step is the generation of the text output, which corresponds to the spoken words
represented by the lip movements in the input video:
Classification: The processed features are passed through a SoftMax layer, which outputs a
probability distribution over possible speech classes (e.g., phonemes or words).
Text Generation: The highest probability class is selected for each frame sequence, and these are
combined to form the final predicted text. This output provides a readable transcription of the
visual speech input, effectively translating lip movements into written language.
3.3 Modules Connectivity Diagram
The module connectivity diagram of a lip-reading model depicts the flow of information and
processing steps that are central to comprehending spoken language using vision. The process
starts with the Video Input module, which receives consecutive frames depicting lip movements.
These frames are passed to the Preprocessing module, where normalization and cropping are
applied so that the lip region is clear and in focus. The Preprocessing module then passes each
processed frame to the Feature Extraction module, where techniques such as CNNs are used to
extract spatial features from each frame.
The pipeline then proceeds to the Temporal CNN module, where temporal relations across frames
are extracted through temporal convolution. From here, the processed sequence of features is
fed into a Bidirectional LSTM/GRU module, which improves the model's grasp of context and of
successive lip movements. The output of the LSTM/GRU is then passed through an Attention
Mechanism, which 'pays attention' to the frames or features that are most significant for
correct interpretation.
After attentional processing, the extracted features are sent to the Classification module, in
which the model outputs the phonemes or words mapped to the lip movements. Finally, the results
obtained from Classification are transferred to the Lip-Reading module, where the predicted
linguistic information is fused to give the final output.
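A minimal sketch of this temporal-attention-classification chain, assuming per-frame feature vectors of size 512 and 41 output classes and using Keras' built-in Attention layer as a simple self-attention stand-in, could look as follows.

# Sketch of the BiGRU -> attention -> classification chain described above.
import tensorflow as tf
from tensorflow.keras import layers

features = layers.Input(shape=(75, 512))                                     # per-frame CNN features
x = layers.Bidirectional(layers.GRU(128, return_sequences=True))(features)   # temporal context
attended = layers.Attention()([x, x])                                         # weight frames by relevance
logits = layers.Dense(41, activation='softmax')(attended)                     # per-frame class probabilities
connectivity_sketch = tf.keras.Model(features, logits)
connectivity_sketch.summary()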
This connectivity chart defines the concrete links between the visual information carried by
lip movements and its transformation into linguistic output, illustrating the coordinated
cooperation of the separate processing modules that are critical for lip reading.
3.4.1 Software
• Language: Python
3.4.2 Hardware
• Operating System: Windows 11
• Processor: 12th Gen Intel® Core™ i5
• RAM: 8 GB
• System type: 64-bit operating system
• Graphics Processing Unit (GPU)
3.5 Analysis and Design through UML
3.5.1 Class Diagram
Preprocessor
The `Preprocessor` class pre-processes raw video data through face detection, lip localization,
and frame normalization. It prepares video frames for feature extraction by removing
unnecessary information and keeping the input standardized.
Postprocessor
The `Postprocessor` class interprets model outputs and translates them into readable text. It
employs methods such as CTC decoding to turn the predicted probabilities into actual sentences
or phonemes.
Feature Extractor
The `Feature Extractor` class extracts spatial features from the preprocessed video frames
using CNNs. It turns raw video frames into meaningful representations that feed the lip-reading
model.
Lipreading Model
The `Lipreading Model` class is the central class responsible for constructing the whole neural
network architecture. It encompasses layers for temporal modeling, such as RNNs or TCNs, and
predicts text from the extracted features.
Trainer
The `Trainer` class trains the `Lipreading Model` by supplying it with training data, evaluating
the loss, and tuning the parameters. It covers the training loop, loss computation, gradient
updates, and optimization with suitable optimization algorithms.
Evaluator
The `Evaluator` class evaluates the trained `Lipreading Model` on the test set. It measures
metrics such as the word error rate, comparing the actual and predicted words to determine its
efficiency and effectiveness.
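A hypothetical set of Python class skeletons mirroring the classes described above is sketched below; the method names and signatures are illustrative assumptions, not taken from the project's source code.

# Hypothetical skeletons of the classes in the class diagram.
class Preprocessor:
    def process(self, video_path: str):
        """Detect the face, localize the lips, and return normalized frames."""
        raise NotImplementedError

class FeatureExtractor:
    def extract(self, frames):
        """Run a CNN over the frames and return per-frame spatial features."""
        raise NotImplementedError

class LipreadingModel:
    def predict(self, features):
        """Temporal modelling (RNN/TCN) producing per-frame class probabilities."""
        raise NotImplementedError

class Trainer:
    def fit(self, model: LipreadingModel, dataset):
        """Training loop: compute the loss, backpropagate, update parameters."""
        raise NotImplementedError

class Evaluator:
    def word_error_rate(self, references, hypotheses) -> float:
        """Compare predicted and reference transcripts on the test set."""
        raise NotImplementedError

class Postprocessor:
    def decode(self, probabilities) -> str:
        """CTC-decode class probabilities into readable text."""
        raise NotImplementedError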
Description about Data Flow Diagram
The DFD for the lip-reading model shows the flow of data and various activities that take place
to acquire processed text data from raw video data. In a nutshell, the diagram enables the
identification of flow of information from the input stage to the output stage.
Data Loading: It starts with the loading of the video data and the alignment files to the system
by `Data Loader` component. This step involves acquiring video files that contain lips and
captions in the video that indicate the occurrences of a spoken message.
Face Detection and Lip Localization: The `Preprocessor` module performs face detection and lip
localization within the video frames. This step excludes from the subsequent stages any visual
data that is not relevant to lip reading.
Frame Normalization: The selected frames are resized and normalized so that their input quality
and scale are consistent across all the videos in the database.
Spatial Feature Extraction: The `Feature Extractor` component applies a Convolutional Neural
Network (CNN) to extract spatial features from the preprocessed video frames. These features
capture the essential visual patterns and details required to understand the lip movements.
Lip Reading Model: The main `Lipreading Model` module analyzes the extracted spatial
features using temporal modeling that includes RNNs or TCNs. This stage deals with capturing
the motion of the lips over time to infer the spoken text.
Text Prediction: Finally, the model outputs the text (or, more precisely, the phonemes) that
corresponds to the observed lip movements. This step constitutes the last phase of the
lip-reading process and reflects how effectively the system identifies the visual input and
translates it into useful text.
3.5.3 Use case diagram
Users, or actors, perform several significant activities in the diagram. They start by
inputting video data into the system; in addition to the video files, the input includes the
files containing textual annotations of alignments. The system takes these inputs through
different processes. Second, the system uses a feature extraction module, often utilizing CNNs,
to extract spatial features from the raw video frames that have been preprocessed. These
features capture the relevant visual data that is vital to the lip-reading process. Further on,
the lip-reading model is trained using the extracted features and the aligned text data,
processing them through training loops and optimizing the model parameters.
After training, the system measures the model's accuracy and performance on the test set using
WER and CER, as is usual for a speech recognition system. Users can then use the model to
predict spoken text in real time, demonstrating its practical use. Lastly, the system formats
the output into comprehensible words, phrases, or phonemes as the final interpreted text, ready
for use or further analysis.
3.6 Testing
Evaluating a lip-reading model is a complex process that encompasses assessing its accuracy,
robustness, and adaptability. Such an evaluation generally starts with the preparation of a
dataset, which in this case consists of a single speaker, accent, and speaking condition. The
dataset is divided into training, validation, and test sets so that the model is trained on one
portion of the data but tested on data it has not seen, allowing its performance to be
evaluated. The test set is especially valuable since it gives an independent estimate of the
model's performance and of its ability to remain accurate on new examples.
When testing the model, several output measures are taken into consideration, including the
word error rate (WER), sentence error rate (SER), and character error rate (CER). These
measures describe different aspects of the model's performance. For instance, WER quantifies
word-level errors, reflecting the proportion of incorrectly predicted words, while CER counts
character-level mistakes.
Furthermore, the model is confronted with variations in lighting, background noise, and speaker
lip-movement dynamics as part of the evaluation criteria, in order to check its robustness.
This is typically accomplished by applying various distortions to the test dataset and
assessing the model's performance under these conditions.
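The sketch below illustrates this idea under stated assumptions: it perturbs test clips with a brightness shift and additive noise and compares accuracy before and after; the model and test_clips/test_labels names are placeholders rather than objects from the project's code.

# Robustness-check sketch: perturb test clips and re-evaluate.
import numpy as np

def perturb(clip: np.ndarray, brightness: float = 0.2, noise_std: float = 0.05,
            rng=np.random.default_rng(0)) -> np.ndarray:
    noisy = clip + brightness + rng.normal(0.0, noise_std, size=clip.shape)
    return np.clip(noisy, clip.min(), clip.max())   # keep values in the original range

def accuracy(model, clips, labels) -> float:
    preds = model.predict(clips).argmax(axis=-1)
    return float((preds == labels).mean())

# Example usage (placeholders):
# clean_acc = accuracy(model, test_clips, test_labels)
# perturbed_acc = accuracy(model, np.stack([perturb(c) for c in test_clips]), test_labels)
# print(f"clean: {clean_acc:.3f}, perturbed: {perturbed_acc:.3f}")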
Furthermore, cross-validation is used as a way of confirming the validity of the developed
model. In k-fold cross-validation, for instance, the dataset is divided into k subsets and the
model is trained k times, with a different subset used for testing each time while the rest are
used for training. This helps reveal overfitting and ensures that the model is not too
specialized to make predictions on new samples. Finally, the results are compared against
previously best-known algorithms or models for the same problem, to measure the gain in
performance that was obtained.
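A minimal sketch of such a k-fold split over a list of video files is shown below, using scikit-learn's KFold; the file names and the build_model/train/evaluate calls are placeholders for the project's own routines.

# k-fold cross-validation sketch over a hypothetical list of video paths.
from sklearn.model_selection import KFold
import numpy as np

video_paths = np.array([f"./data/s1/clip_{i}.mpg" for i in range(450)])  # hypothetical file list
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(kfold.split(video_paths)):
    train_files, test_files = video_paths[train_idx], video_paths[test_idx]
    print(f"fold {fold}: {len(train_files)} training clips, {len(test_files)} test clips")
    # model = build_model()                # placeholder: construct a fresh model per fold
    # train(model, train_files)            # placeholder: train on this fold's training split
    # score = evaluate(model, test_files)  # placeholder: evaluate on the held-out split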
CHAPTER 4
RESULTS AND DISCUSSIONS
4.1 Description about the Dataset
The dataset is designed to facilitate the training of a lip-reading model and consists of two
main parts, referred to as the alignments file and the s1 file.
4.1.1 Alignments File
Purpose: Stores the alignment information for the videos contained in the s1 file.
Contents
Alignments: This file describes what is said in each video and the relation between the frames
of the video and its phonemes or words.
Format: Each entry in the alignments file is associated with one video in the s1 file and can
include times or frame numbers along with the text.
Silence Representation: Where there is no speech in the video, the alignment file labeling uses the
label “sil”.
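For illustration, the snippet below parses a hypothetical alignment excerpt (the numbers and the word "bin" are invented examples, not taken from the actual dataset); each whitespace-separated line carries a start value, an end value, and a token, with "sil" marking silence, matching how the load_alignments routine in the appendix reads the file.

# Parse a hypothetical alignment excerpt.
example_alignment = """0 9500 sil
9500 14500 bin
14500 21000 sil"""

for line in example_alignment.splitlines():
    start, end, token = line.split()
    if token != 'sil':                       # silence entries are skipped during training
        print(f"{token}: time/frames {start}-{end}")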
4.1.2 S1 File
Purpose: Stores the video files that were used in the process of building the model.
Contents
Videos: There are 450 videos in the s1 file, each featuring a single speaker.
Duration: Each video is 2 to 3 seconds long on average.
Single Speaker: All videos feature the same speaker, in order to maintain consistency of lip
movement and speech motor patterning.
File Format: The recordings are saved in a format commonly used for videos (e.g., MPG).
4.2 Detailed Explanation about the Experimental Results
The model was trained on a total of 420 videos and evaluated on the remaining test videos; the
trained model produced accurate predictions, giving the spoken text as output.
RESULT:
4.3 Significance of the Proposed Method with its Advantages
The proposed lip-reading model has the potential to raise human-computer interaction, and
enabling technologies for the disabled in particular, to a new level. By correctly interpreting
spoken language from visual signals alone, the model provides new opportunities for
hearing-impaired people and brings them closer to effective communication in conditions where
audio information generally does not suffice. In addition, the versatile uses of the model
include security alarms, surveillance systems, and silent information exchange in noisy
environments.
Among its key benefits, the proposed lip-reading model is very effective in cases where the
source audio is either absent or of poor quality. This makes it especially suitable for use
where there is heavy interference, in which conventional speech recognition systems are apt to
fail. By relying on features extracted from the video, the proposed model avoids the effects of
low sound quality and the problems that arise from it.
For Hearing Impaired Individuals: This model offers a way of translating spoken language into a
form that is accessible to persons with hearing impairment by translating lip movements. This can
greatly improve communication in live discussions and online communication; where there are no
subtitles or sign language interpreters.
Silent Communication: In all the essential instances where it is important not to talk including
libraries or meetings, the model provides a way of communication through lip movement that the
computer can interpret without sounds.
Speech Recovery in Noisy Environments: The model can be useful at noisy events or in security
settings to recover spoken content that is hard to understand because of the surrounding noise.
Single Speaker Focus: The model is trained on single-speaker videos, which enables accurate
interpretation of the lip movements without the errors that might be introduced by multiple
speakers.
CHAPTER 5
CONCLUSION AND FUTURE ENHANCEMENTS
5.1 Conclusion
In our lip-reading project, we developed a robust system for accurately converting visual speech
into text. Using advanced machine learning algorithms, we achieved significant accuracy in single-
speaker scenarios. Our work included the development of preprocessing and prediction pipelines,
setting a strong foundation for future enhancements.
While we made substantial progress, future enhancements will focus on integrating live webcam
video input, improving accuracy with multiple speakers, creating a user-friendly interface, and
adding multilingual translation capabilities. Our project establishes a solid groundwork, ready to
be built upon for more advanced and versatile applications.
5.2.4 Multilingual Translation Capabilities
Implement translation capabilities that can convert both the visual lip movements and
corresponding text into multiple languages. This enhancement facilitates communication across
linguistic barriers, making the technology accessible to a global audience.
CHAPTER 6
APPENDICES
6.1 Importing the required libraries
!pip list
!pip install opencv-python matplotlib imageio gdown tensorflow
import os
import cv2
import tensorflow as tf
import numpy as np
from typing import List
from matplotlib import pyplot as plt
import imageio
tf.config.list_physical_devices('GPU')
physical_devices = tf.config.list_physical_devices('GPU')
try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except:
    pass
import gdown
url = 'https://drive.google.com/uc?id=1YlvpDLix3S-U8fd-gqRwPcWXAXm8JwjL'
output = 'data.zip'
gdown.download(url, output, quiet=False)
gdown.extractall('data.zip')
def load_video(path: str) -> List[float]:
    cap = cv2.VideoCapture(path)
    frames = []
    for _ in range(int(cap.get(cv2.CAP_PROP_FRAME_COUNT))):
        ret, frame = cap.read()
        frame = tf.image.rgb_to_grayscale(frame)
        frames.append(frame[190:236, 80:220, :])  # crop a fixed lip region
    cap.release()
    mean = tf.math.reduce_mean(frames)
    std = tf.math.reduce_std(tf.cast(frames, tf.float32))
    return tf.cast((frames - mean), tf.float32) / std  # standardize the clip
vocab = [x for x in "abcdefghijklmnopqrstuvwxyz'?!123456789 "]
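# The char_to_num / num_to_char lookups used below are not shown in this
# listing; a standard way to define them (an assumption) is with Keras StringLookup:
char_to_num = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")
num_to_char = tf.keras.layers.StringLookup(vocabulary=char_to_num.get_vocabulary(),
                                           oov_token="", invert=True)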
print(
f"The vocabulary is: {char_to_num.get_vocabulary()} "
f"(size ={char_to_num.vocabulary_size()})"
)
char_to_num.get_vocabulary()
char_to_num(['n','i','c','k'])
num_to_char([14, 9, 3, 11])
def load_alignments(path: str) -> List[str]:
    with open(path, 'r') as f:
        lines = f.readlines()
    tokens = []
    for line in lines:
        line = line.split()
        if line[2] != 'sil':
            tokens = [*tokens, ' ', line[2]]
    return char_to_num(tf.reshape(tf.strings.unicode_split(tokens, input_encoding='UTF-8'), (-1)))[1:]
tf.convert_to_tensor(test_path).numpy().decode('utf-8').split('\\')[-1].split('.')
alignments
data = tf.data.Dataset.list_files('./data/s1/*.mpg')
data = data.shuffle(500, reshuffle_each_iteration=False)
data = data.map(mappable_function)
data = data.padded_batch(2, padded_shapes=([75,None,None,None],[40]))
data = data.prefetch(tf.data.AUTOTUNE)
# Added for split
train = data.take(420)
test = data.skip(420)
len(test)
len(frames)
sample = data.as_numpy_iterator()
plt.imshow(val[0][0][35])
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, LearningRateScheduler
# Imports required by the model definition below (missing from the listing)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv3D, Activation, MaxPool3D, TimeDistributed, Flatten, Dense
data.as_numpy_iterator().next()[0][0].shape
model = Sequential()
model.add(Conv3D(128, 3, input_shape=(75,46,140,1), padding='same'))
model.add(Activation('relu'))
model.add(MaxPool3D((1,2,2)))
model.add(Conv3D(256, 3, padding='same'))
model.add(Activation('relu'))
model.add(MaxPool3D((1,2,2)))
model.add(Conv3D(75, 3, padding='same'))
model.add(Activation('relu'))
model.add(MaxPool3D((1,2,2)))
model.add(TimeDistributed(Flatten()))
model.add(Dense(char_to_num.vocabulary_size()+1, kernel_initializer='he_normal',
activation='softmax'))
model.summary()
yhat = model.predict(val[0])
6.5. Setup Training Options and Train
class ProduceExample(tf.keras.callbacks.Callback):
    def __init__(self, dataset) -> None:
        self.dataset = dataset.as_numpy_iterator()
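# CTCLoss and scheduler are referenced below but not defined in this listing.
# A common CTC loss wrapper and a simple learning-rate schedule (assumed,
# not necessarily the report's exact choices) would be:
def CTCLoss(y_true, y_pred):
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64") * tf.ones(shape=(batch_len, 1), dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64") * tf.ones(shape=(batch_len, 1), dtype="int64")
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)

def scheduler(epoch, lr):
    # keep the initial rate for the first 30 epochs, then decay exponentially
    return lr if epoch < 30 else lr * tf.math.exp(-0.1)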
model.compile(optimizer=Adam(learning_rate=0.0001), loss=CTCLoss)
schedule_callback = LearningRateScheduler(scheduler)
example_callback = ProduceExample(test)
url = 'https://drive.google.com/uc?id=1vWscXs4Vt0a_1IH1-ct2TCgXAZT-N3_Y'
output = 'checkpoints.zip'
gdown.download(url, output, quiet=False)
gdown.extractall('checkpoints.zip', 'models')
model.load_weights('models/checkpoint')
test_data = test.as_numpy_iterator()
sample = test_data.next()
yhat = model.predict(sample[0])
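# 'decoded' is used below but not defined in this listing; a plausible greedy
# CTC decoding step (an assumption) is:
decoded = tf.keras.backend.ctc_decode(yhat, input_length=[75] * yhat.shape[0], greedy=True)[0][0].numpy()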
print('~'*100, 'PREDICTIONS')
[tf.strings.reduce_join([num_to_char(word) for word in sentence]) for sentence in decoded]
sample = load_data(tf.convert_to_tensor('.\\data\\s1\\bras9a.mpg'))
print('~'*100, 'PREDICTIONS')
[tf.strings.reduce_join([num_to_char(word) for word in sentence]) for sentence in decoded]
REFERENCES
[1]. Pingchuan Ma, Yujiang Wang, Stavros Petridis, Jie Shen, Maja Pantic, “Training Strategies
for Improved Lip-Reading,” 2022 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 8472-8476, 2022.
[2]. Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros
Petridis, Maja Pantic, “Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels,”
arXiv:2303.14307v3 [cs.CV], 28 Jun 2023.
[3]. Mutallip Mamut, Nurbiya Yadikar, Mingfeng Hao, Alimjan Aysa, Kurban Ubul, “A Survey of
Research on Lip Reading Technology,” DOI: 10.1109/ACCESS.2020.3036865.
[4]. Atharva Karekar, Aakansha Gharate, Ravish Shaikh, “Lip Reading Using Deep Learning,”
Volume 05, Issue 04, April 2023.
[5]. Kartik Datar, Meet N. Gandhi, Priyanshu Aggarwal, Mayank Sohani, “A Review on Deep
Learning Based Lip-Reading,” DOI: doi.org/10.32628/CSEIT206140.
[6]. Nikita Deshmukh, Anamika Ahire, Smriti H Bhandari, “Vision based Lip Reading System using
Deep Learning,” 2021 International Conference on Computing, Communication and Green
Engineering, DOI: 10.1109/CCGE50943.2021.9776430.
[7]. Gaoyan Zhang and Yuanyao Lu, “Research on a Lip-Reading Algorithm Based on
Efficient-GhostNet,” DOI: https://doi.org/10.3390/electronics12051151.
[8]. Karan Shrestha, “Lip Reading using Neural Network and Deep Learning.”
[9]. Yiting Li, Yuki Takashima, Tetsuya Takiguchi, Yasuo Ariki, “Lip Reading Using a Dynamic
Feature of Lip Images and Convolutional Neural Networks,” 978-1-5090-0806-3/16/$31.00 © 2016 IEEE.
[10]. Fatemeh Vakhshiteh, Farshad Almasganj, “Lip-Reading via Deep Neural Network Using
Appearance-Based Visual Features,” 2017 24th National and 2nd International Iranian Conference
on Biomedical Engineering (ICBME), Amirkabir University of Technology, Tehran, Iran,
30 November - 1 December 2017, 978-1-5386-3609-1/17/$31.00 © 2017 IEEE.
[11]. Nergis Pervan Akman, Talya Tumer Sivri, Ali Berkol, Hamit Erdem, “Lip Reading Multiclass
Classification by Using Dilated CNN with Turkish Dataset,” 2022 International Conference on
Electrical, Computer and Energy Technologies (ICECET), 978-1-6654-7087-2/22/$31.00 © 2022 IEEE,
DOI: 10.1109/ICECET55527.2022.9873011.
[12]. Tayyip Özcan, Alper Basturk, “Lip Reading Using Convolutional Neural Networks with and
without Pre-Trained Models,” Balkan Journal of Electrical and Computer Engineering, April 2019,
DOI: 10.17694/bajece.479891.
[13]. Souheil Fenghour (Associate Member, IEEE), Daqing Chen (Member, IEEE), “Lip Reading
Sentences Using Deep Learning with Only Visual Cues,” Digital Object Identifier
10.1109/ACCESS.2020.3040906, Volume 8, 2020.
[14]. Hassan Akbari, Himani Arora, Liangliang Cao, Nima Mesgarani, “Lip2AudSpec: Speech
Reconstruction from Silent Lip Movements Video.”