
A Mini Project with Seminar On

Deciphering Speech: A Deep Learning Approach to Lip Reading

Submitted in partial fulfilment of the requirements for the award of the degree of

Bachelor of Technology
in
Computer Science and Engineering
(Artificial Intelligence and Machine Learning)

by
Mohammad Arif 21241A6645
Jadala Sriram 21241A6628
Kandimalla Dhanush Kumar 21241A6634

Under the Esteemed guidance of

Dr. Sanjeeva Polepaka

Associate Professor

Department of Computer Science and Engineering


(Artificial Intelligence and Machine Learning)
GOKARAJU RANGARAJU INSTITUTE OF ENGINEERING AND TECHNOLOGY

(Approved by AICTE, Autonomous under JNTUH, Hyderabad)


Bachupally, Kukatpally, Hyderabad-500090

GOKARAJU RANGARAJU INSTITUTE OF ENGINEERING AND
TECHNOLOGY
(Autonomous) Hyderabad-500090

CERTIFICATE

This is to certify that the mini project entitled “Deciphering Speech: A Deep Learning Approach to Lip Reading” is submitted by Md. Arif (21241A6645), J. Sriram (21241A6628), and K. Dhanush Kumar (21241A6634) in partial fulfillment of the requirements for the award of the degree of BACHELOR OF TECHNOLOGY in Computer Science and Engineering (Artificial Intelligence and Machine Learning) during the academic year 2023-2024.

Internal Guide Head of the Department


Dr. Sanjeeva Polepaka Dr. G. Karuna

External Examiner

ACKNOWLEDGEMENT

There are many people who helped us directly and indirectly to complete our project successfully.
We would like to take this opportunity to thank one and all. First, we would like to express our
deep gratitude towards our internal guide Dr. Sanjeeva Polepaka, Associate Professor,
Department of Computer Science and Engineering (Artificial Intelligence and Machine Learning)
for his support in the completion of our dissertation. We are thankful to mini project coordinator
Mr. B. Rajasekhar, Assistant Professor, for his valuable suggestions and comments during this
project period.
We wish to express our sincere thanks to Dr. G. Karuna, Head of the Department, and
to our principal Dr. J. PRAVEEN, for providing the facilities to complete the dissertation. We
would like to thank all our faculty and friends for their help and constructive criticism during the
project period. Finally, we are very much indebted to our parents for their moral support and
encouragement in achieving our goals.

Md. Arif (21241A6645)


J. Sriram (21241A6628)
K. Dhanush Kumar (21241A6634)

DECLARATION

We hereby declare that the mini project titled “Deciphering Speech: A Deep Learning Approach to Lip Reading” is the work done during the period from 6th February 2024 to 29th June 2024 and is submitted in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering (Artificial Intelligence and Machine Learning) from Gokaraju Rangaraju Institute of Engineering and Technology (Autonomous under Jawaharlal Nehru Technological University, Hyderabad). The
results embodied in this project have not been submitted to any other University or Institution
for the award of any degree or diploma.

Md. Arif (21241A6645)


J. Sriram (21241A6628)
K. Dhanush Kumar (21241A6634)

ABSTRACT

This project aims to develop a lip-reading model using deep learning techniques, specifically
convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The model is
trained on a dataset comprising 450 videos of a single speaker, with each video capturing the
speaker articulating various phrases. The objective is to accurately recognize spoken words
based solely on visual input of the speaker's lip movements. The deep learning architecture is
designed to extract and analyze spatiotemporal features from the video frames, leading to a
robust recognition system. The model shows significant potential in applications such as silent
communication and assistive technologies for the hearing impaired.

LIST OF FIGURES

Figure No.   Figure Name                     Page No.
1.4          Architecture diagram            7
3.3          Module connectivity diagram     30
3.5.1        Class diagram                   32
3.5.2        Data flow diagram               33
3.5.3        Use Case diagram                35

LIST OF TABLES
Table No.   Table Name                              Page No.
2.1         Summary of the Existing Approaches      15

LIST OF ACRONYMS
Acronym Full Form

CNN Convolutional Neural Network


RNN Recurrent Neural Network
LSTM Long Short-Term Memory
Bi-LSTM Bidirectional Long Short-Term Memory
GRU Gated Recurrent Units

TABLE OF CONTENTS

Chapter No.   Chapter Name   Page No.
Certificate ii

Acknowledgement iii

Declaration iv

Abstract v

List of Figures vi

List of Tables vii

List of Acronyms viii

1 Introduction 1

1.1 Introduction to project work 2


1.2 Objective of the Project 3
1.3 Methodology 5
1.4 Architecture diagram 7
1.5 Organization of the report 9

2 Literature Survey 10
2.1 Summary of existing approaches 10
2.2 Drawbacks of existing approaches 21

3 Proposed Method 24
3.1 Problem Statement and Objectives 24
3.2 Detailed Explanation of Architecture Diagram 27
3.3 Modules Connectivity Diagram 30
3.4 Software and Hardware Requirements 31

3.5 Analysis and Design through UML 32

3.6 Testing 36
4 Results and discussions 38

4.1 Description about dataset 38

4.2 Detailed Explanation about the Experimental Results 40

4.3 Significance of the Proposed Method with its 41


Advantages

5 Conclusion and future enhancements 43

6 Appendices 45

CHAPTER 1
INTRODUCTION
1.1 Deciphering Speech: A Deep Learning Approach to Lip Reading

Lip reading, or visual speech recognition, involves interpreting speech by analyzing


movements of the lips, face, and tongue, particularly when audio signals are unavailable or
compromised. This technology is invaluable in various fields, especially in noisy environments
like crowded public spaces or industrial sites where audio clarity is impaired. It also serves as a
crucial assistive tool for individuals with hearing impairments, enabling them to comprehend
spoken language through visual cues. Moreover, in security and surveillance, lip reading can help
decipher conversations when audio capture is impractical, enhancing monitoring capabilities.

Developing an effective lipreading system involves overcoming complex challenges,


primarily in accurately detecting and tracking lip movements in video sequences. A robust lip
detection and tracking algorithm must precisely locate the lips in each frame and maintain
consistent tracking across successive frames despite variations in lighting, head pose, and facial
expressions. This requires sophisticated image processing techniques and adaptive machine
learning models. The accuracy of lip tracking is fundamental, as any errors can propagate through
the system.

Following accurate lip detection and tracking, feature extraction is the next critical step.
This process identifies and captures relevant information from lip movements indicative of
speech-related gestures. Effective feature extraction methods must distinguish subtle variations
in lip shapes and motions associated with different phonemes and words. Advanced techniques
like convolutional neural networks (CNNs) are highly effective in capturing intricate spatial
features from images.

Deep learning architectures, particularly CNNs and recurrent neural networks (RNNs),
play a pivotal role in interpreting and classifying lip movements. CNNs process spatial
information, making them ideal for analysing static features of the lip region, while RNNs handle
temporal sequences, understanding dynamic aspects of lip movements over time. Combining
these neural networks allows the system to comprehensively understand both spatial and temporal
dimensions of speech.

To ensure the system's effectiveness and reliability, rigorous evaluation is essential. This
involves testing on benchmark datasets using standard performance metrics such as word error
rate (WER) and accuracy. Benchmarking objectively assesses the system's capabilities and
compares its performance with existing solutions. Additionally, extensive real-world testing
validates the system's robustness across different conditions and environments, including varied
lighting and noise levels and diverse populations. This helps identify limitations and areas for
improvement, ensuring the system can generalize well beyond controlled settings.

In conclusion, a robust lip-reading system can significantly enhance human-computer


interaction and accessibility. By accurately detecting and tracking lip movements, extracting
meaningful features, and employing advanced deep learning models, the system can interpret
speech with high accuracy in challenging environments. Rigorous evaluation and real-world
testing will ensure its effectiveness, reliability, and adaptability to various conditions. As machine
learning and computer vision technologies advance, the prospects for lip reading technology will
expand, offering new opportunities for communication and interaction.

1.2. Objectives of the Project

1.2.1 Accurate Speech Recognition


• Develop a model that can precisely interpret and transcribe spoken words by analysing
visual cues from lip movements.
• Aim to match or exceed the accuracy of traditional audio-based speech recognition systems

1.2.2 Robustness
• Ensure the model remains effective under various conditions, such as different lighting
environments, camera angles, and distances.
• Maintain high performance despite background noise and visual obstructions.

1.2.3 Integration with Audio


• Combine visual lip movement data with audio signals to improve overall speech
understanding.
• Use multimodal inputs to enhance recognition accuracy, especially in noisy environments
where audio quality is compromised.

1.2.4 Real-Time Processing

• Design the model for real-time processing to enable immediate feedback and interaction.
• Optimize algorithms for low latency and efficient computation to support live applications,
such as video conferencing and assistive technologies.

1.2.5 Scalability

• Develop a model that can handle large and diverse datasets, ensuring it performs well with
various speakers, accents, and languages.
• Implement efficient training and inference mechanisms to scale across different devices
and platforms

1.2.6 Generalization

• Ensure the model generalizes well to new, unseen speakers and diverse linguistic contexts.
• Avoid overfitting to specific datasets by using regularization techniques and diverse
training data.

1.2.7 Error Reduction

• Focus on reducing errors in recognizing challenging words and phrases, especially those
that are visually similar.
• Conduct thorough error analysis to identify and address common sources of mistakes,
enhancing the model's reliability

1.2.8 User-Friendly

• Create an intuitive and accessible interface for users, making the technology easy to deploy
and use in real-world applications.
• Ensure the model can be integrated seamlessly into various applications, such as assistive
devices for the hearing impaired, security systems, and silent communication tools.

1.3. Methodology

1.3.1 Data Collection

• Video and Audio Sources: Collect large datasets of synchronized video and audio recordings
of people speaking. Use publicly available datasets like LRS2 and LRS3.
• Annotations: Ensure the datasets are well-annotated with transcriptions of spoken words.

1.3.2 Data Preprocessing

• Face and Lip Detection: Use face detection algorithms to identify and crop the lip region from
each frame.
• Normalization: Normalize the lip region images to a consistent size and scale.
• Data Augmentation: Apply techniques such as cropping, rotation, time masking and noise
addition to create a more robust training set.
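The lip detection, cropping, and normalization steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the project's implementation: it assumes OpenCV's bundled frontal-face Haar cascade and uses the lower third of the detected face as a stand-in for a dedicated lip detector.

```python
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_lip_region(frame, size=(100, 50)):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    # Heuristic: take the lower third of the detected face as the mouth area.
    lips = gray[y + 2 * h // 3 : y + h, x : x + w]
    lips = cv2.resize(lips, size)
    # Normalize pixel values to [0, 1] for the downstream network.
    return lips.astype(np.float32) / 255.0

def preprocess_video(path, size=(100, 50)):
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        lips = extract_lip_region(frame, size)
        if lips is not None:
            frames.append(lips)
    cap.release()
    return np.stack(frames) if frames else np.empty((0, size[1], size[0]))
```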

1.3.3 Feature Extraction

• Pre-trained CNN Models: Utilize pre-trained CNN architectures like VGG19 and ResNet50
to extract spatial features from the keyframes.
• Temporal Feature Extraction: Capture temporal characteristics using models like Bi-
directional Gated Recurrent Units (BGRUs) and Dilated Convolutional Temporal
Convolutional Networks (DC-TCNs).
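As a brief illustration of per-frame spatial feature extraction with a pre-trained CNN, the sketch below uses ResNet50 from Keras with ImageNet weights. The frame size, the average-pooling choice, and the assumption that cropped lip frames are available as RGB arrays are illustrative, not values fixed by the project.

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# Pre-trained backbone without the classification head; global average pooling
# turns each frame into a single 2048-dimensional feature vector.
backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                    input_shape=(112, 112, 3))

def frame_features(frames_rgb):
    """frames_rgb: (T, 112, 112, 3) uint8 array of cropped lip frames."""
    x = preprocess_input(frames_rgb.astype(np.float32))
    return backbone.predict(x, verbose=0)  # -> (T, 2048) spatial features
```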

1.3.4 Model Training

• CNN and RNN Integration: Combine the spatial features from CNNs with temporal models
such as LSTM or attention-based LSTM networks to learn the relationship between visual and
auditory information.
• Attention Mechanism: Incorporate an attention mechanism to focus on key frames and
enhance robustness against image translation, rotation, and distortion.
• Ensemble Learning: Implement ensemble learning by combining predictions from multiple
models to improve performance.
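One possible way to wire the CNN and RNN integration described above is sketched below in Keras: a TimeDistributed CNN extracts spatial features per frame and bidirectional LSTMs model the temporal sequence. Layer sizes, sequence length, and the number of output classes are placeholder assumptions, and the attention and ensemble components are omitted for brevity.

```python
from tensorflow.keras import layers, models

NUM_FRAMES, H, W, NUM_CLASSES = 75, 50, 100, 40  # assumed constants

def build_lipreading_model():
    inp = layers.Input(shape=(NUM_FRAMES, H, W, 1))
    # Spatial features per frame via a small CNN applied to every time step.
    x = layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu"))(inp)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
    x = layers.TimeDistributed(layers.Conv2D(64, 3, activation="relu"))(x)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
    x = layers.TimeDistributed(layers.Flatten())(x)
    # Temporal modeling of the frame sequence with bidirectional LSTMs.
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Bidirectional(layers.LSTM(128))(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inp, out)

model = build_lipreading_model()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```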

1.3.5 Training Strategies

• Self-Distillation: Use self-distillation techniques to improve the model by using the model's
own predictions as additional training data.

• Word Boundary Indicators: Incorporate word boundary indicators to help the model recognize
where words start and end, improving accuracy.

1.3.6 Model Evaluation

• Performance Metrics: Evaluate the model using metrics such as accuracy, precision, recall,
and F1-score.
• Error Analysis: Perform thorough error analysis to identify and address common mistakes,
focusing on challenging words and visually similar lip movements.

1.3.7 Integration and Deployment

• Real-Time Processing: Optimize the model for real-time processing to enable immediate
feedback and interaction.
• User Interface: Develop a user-friendly interface for practical deployment in applications such
as assistive devices, security systems, and silent communication tools.

1.3.8 Continuous Improvement

• Feedback Loop: Implement a feedback loop where the model learns from new data and user
interactions to continuously improve its performance.
• Scalability: Ensure the model can handle increasing amounts of data and adapt to new speakers
and languages efficiently.

1.4. Architecture diagram

Figure 1.4 Architecture Diagram for Lip Reading Model

The architecture of a lip-reading model is made up of several sub-modules, each with its own specialized function in the lip-reading pipeline. The architecture diagram can be broken down into the following stages.

1.4.1 Input Stage
Video Frames: Raw video frames are the individual frames captured from a video input source. These frames record how the lip shapes transition from one position to the next.
1.4.2 Preprocessing Stage
Face Detection: Facial landmarks such as the eyes, eyebrows, and lips are identified in each frame, typically with the MTCNN (Multi-task Cascaded Convolutional Networks) algorithm, which crops the frame to the face region of interest (ROI).
Lip Localization: Within the detected face, the lip area is localized further, often using the detected landmarks.
1.4.3 Frame Normalization
The lip region extracted from the video is resized to a fixed resolution and its pixel values are normalized (for example, by subtracting the mean).
1.4.4 Feature Extraction Stage
Convolutional Neural Networks (CNNs): The pre-processed frames are fed to a CNN, which produces spatial features for each frame. Standard architectures such as VGG or ResNet can be used, or a custom architecture designed specifically around lip features.
1.4.5 Training the model
Recurrent Neural Networks (RNNs): Because lip movements are temporal in nature, the per-frame feature vectors are processed with an RNN such as an LSTM or GRU, which models how the movements evolve over the course of an utterance.
Temporal Convolutional Networks (TCNs): Alternatively, TCNs may be used to capture long-range relations across the frame sequence.
1.4.6 Sequence to Sequence Modeling
Encoder-Decoder Architecture: A supervised encoder-decoder framework can be used, in which the encoder summarizes the input feature sequence and the decoder generates the output sequence, for instance words or phonemes.
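As a rough illustration of the encoder-decoder idea, the sketch below builds a small GRU-based sequence-to-sequence model in Keras. The feature dimension and vocabulary size are assumptions, and decoding details such as teacher forcing and beam search are left out.

```python
from tensorflow.keras import layers, models

FEAT_DIM, VOCAB = 256, 40  # assumed feature size and output vocabulary

# Encoder: consumes per-frame visual features and summarizes them in a state.
enc_in = layers.Input(shape=(None, FEAT_DIM))
_, enc_state = layers.GRU(256, return_state=True)(enc_in)

# Decoder: generates output tokens (characters or phonemes) from that state.
dec_in = layers.Input(shape=(None,))
dec_emb = layers.Embedding(VOCAB, 128)(dec_in)
dec_seq = layers.GRU(256, return_sequences=True)(dec_emb, initial_state=enc_state)
dec_out = layers.Dense(VOCAB, activation="softmax")(dec_seq)

seq2seq = models.Model([enc_in, dec_in], dec_out)
seq2seq.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```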
1.4.7 Output Stage
Fully Connected Layers: The temporal output of the modeling stage is passed through fully connected layers that map the features to the desired output space (e.g., character probabilities).

Softmax Activation: A softmax layer at the tail end of the network outputs a probability for each character or phoneme class.

1.5 Organization of the Report

This section gives a brief, concise overview of the topics discussed in this report and the order in which they are presented.

Chapter 1: Introduction

This chapter introduces the project, its use cases, how it benefits users, and the basic working of the overall system.

Chapter 2: Literature Survey

This chapter reviews the existing approaches to the problem, along with their advantages and drawbacks. It provides the background knowledge and motivation needed to carry out the project.

Chapter 3: Proposed Methods

This chapter describes the logical sequence in which the problem is solved and the methods adopted to solve it.

Chapter 4: Results and Discussions

This chapter describes the dataset used for training, presents the experimental results, and discusses the significance of the proposed method and its advantages.

Chapter 5: Conclusion and Future Enhancements

This chapter summarizes the work carried out on the lip-reading system and outlines possible future enhancements.

CHAPTER 2
LITERATURE SURVEY
2.1 Summary of Existing Approaches
Pingchuan Ma, Yujiang Wang [1] In deep learning-based lip reading, neural networks decode speech from lip movements. Convolutional neural networks (CNNs), which extract spatial features from the frames of the video sequence, and recurrent neural networks (RNNs), often combined with Long Short-Term Memory (LSTM) units to model the temporal dependencies of lip movement sequences, are the essential building blocks. These techniques are strengthened by large annotated datasets and established architectures such as VGG-Net and ResNet. In addition, substantial improvements in accuracy and robustness have been achieved with hybrid CNN and RNN approaches, enhancing visual speech recognition (VSR).

Alexandros Haliassos, Adriana Fernandez-lopez [2] Advanced designs of neural networks


are used by vision-based lip-reading systems that use deep learning to decipher speech by
interpreting visual information from lip movements. Key approaches involve using Recurrent
Neural Networks (RNNs), frequently incorporating Long Short-Term Memory (LSTM) units,
to capture the temporal dynamics of these motions and Convolutional Neural Networks (CNNs)
for collecting spatial features from video frames. CNN and RNN hybrid models have proved
highly effective. Trained on large annotated datasets, these systems recognize visual speech with high accuracy and robustness. Prominent designs such as VGG-Net and ResNet have been used extensively to improve performance, reflecting substantial progress in the field.

Mutallip Mamut, Nurbiya Yadikar [3] Data augmentation, which enriches datasets with variations such as scaling, rotation, and flipping to boost model generalization, is one training strategy for improved deep learning-based lip reading. Another significant approach is transfer learning, which fine-tunes models pre-trained on sizable datasets for the specific lip-reading task; this cuts down on training time and improves performance.
Moreover, methods including adjusting hyperparameters, applying more complex loss
functions, and putting ensemble approaches into practice have been used to increase the
precision and resilience of lip-reading systems.

Atharva Karekar, Aakansha Gharate [4] AUTO-AVSR, or Audio-Visual Speech Recognition with Automatic Labels, improves the accuracy of speech recognition through the integration of audio and video streams. Traditional solutions in this area rely on manual tagging, which is a lengthy process with an inherent risk of a high error rate. AUTO-AVSR, as a methodology based on deep learning, provides an effective way to remove this time-consuming manual labelling step: it learns to associate lip movements with the corresponding audio signals from audio-visual datasets, improving speech identification. To take into account the temporal and spatial characteristics of the video and audio, the method also incorporates advanced neural networks, including LSTM units and CNNs.

Priyanshu Aggarwal [5] Deep learning has been thoroughly investigated in lip-reading
technology research to understand speech using visual clues from lip movements.
Convolutional neural networks (CNNs) are a key tool for extracting spatial features, while
recurrent neural networks (RNNs), especially Long Short-Term Memory (LSTM) networks, are
a key tool for collecting temporal sequences. CNN and RNN hybrid models have demonstrated
remarkable success. Performance has been further improved by training tactics such as data augmentation, transfer learning, and automatic labelling (as in AUTO-AVSR). Large annotated datasets and sophisticated architectures such as VGG-Net and ResNet have driven tremendous progress in the accuracy and resilience of lip-reading systems.

Nikita Deshmukh [6] Two-stream approaches to lip reading combine static lip images with their temporal changes to improve speech recognition. Lip maps and lip contours quantize the motions and transitions of the lip images, helping to track the dynamics involved in speaking. CNNs then learn these features together with the spatial and temporal hierarchy present in the datasets. By focusing on temporal rather than purely spatial differences in lip movement, the model can distinguish between lip movements that look similar but correspond to completely different spoken phrases. Consequently, combining CNNs with dynamic feature extraction offers much better prospects than conventional lip-reading approaches based on static images alone.

Gaoyan Zhang [7] Using appearance-based visual features and deep learning techniques,
lip-reading via deep neural networks analyzes lip movements to interpret speech through
evaluating visual cues. This method captures the fine-grained visual characteristics of the lips
by extracting appearance-based elements from video frames. Typically, these data are processed
using Convolutional Neural Networks (CNNs), which learn the spatial patterns corresponding
to various speech sounds. This approach is capable of identifying speech from visual data only,
focusing on the look of the lips instead of dynamic changes. By using this method, lip-reading
systems perform better and become more dependable.

Karan Shrestha [8] Lip-reading via deep neural networks using appearance-based visual
features involves utilizing deep learning techniques to interpret speech by analyzing visual cues
from lip movements. This approach extracts appearance-based features from video frames,
capturing the detailed visual information of the lips. Convolutional Neural Networks (CNNs)
are typically employed to process these features, learning the spatial patterns associated with
different speech sounds. By focusing on the appearance of the lips rather than dynamic changes,
this method can effectively recognize speech from visual input alone. This technique enhances
the performance of lip-reading systems, making them more accurate and reliable.

Yiting Li, Yuki Takashima [9] Deep learning lip reading uses advanced neural network
designs to analyze lip movements visually and interpret speech. This method extracts spatial
data from video frames using Convolutional Neural Networks (CNNs) to capture the texture and
contour of the lips. To model the temporal dynamics of lip movement sequences, recurrent neural networks (RNNs) are also used, frequently equipped with Long Short-Term Memory (LSTM) units. By training on massive datasets of labelled lip motions, these systems learn to recognize spoken words from visual signals alone with high accuracy. The accuracy and efficacy
of lip-reading systems have been greatly enhanced by the inclusion of deep learning techniques,
expanding the range of possible real-world applications for these systems.

Fatemeh Vakhshiteh [10] Research on a lip-reading algorithm based on Efficient-Ghost


Net explores the implementation of a lightweight neural network architecture for efficient visual

speech recognition. This approach aims to achieve high performance using minimal
computational resources, which is crucial for real-time applications on devices with limited
processing power. Efficient-Ghost Net integrates optimizations from both Efficient Net and
Ghost Net methodologies, focusing on reducing model complexity and parameter count while
maintaining accuracy. By leveraging these techniques, the algorithm extracts and interprets
meaningful features from lip images to accurately recognize spoken words. This research strives
to improve the efficiency and applicability of lip-reading systems in various technological
contexts.

Nergis Pervan Akman [11] Using advanced machine learning techniques, lip reading with neural networks and deep learning recognizes spoken words from the visual information provided by lip movements. Typically, this method uses Recurrent Neural Networks (RNNs)
with LSTM units to capture temporal dynamics in lip movement sequences and Convolutional
Neural Networks (CNNs) for retrieving spatial features. Depending on the dataset and model
complexity, these models are trained on datasets with annotations to reach high accuracy,
usually between 70% and 90%. Deep learning has been integrated into lip-reading systems,
greatly enhancing their performance and reliability and enabling them to be used in a variety of
speech styles and settings.

Tayyip Özcan, Alper Basturk [12] In the context of lip-reading multiclass classification
with a Turkish dataset, the approach utilizes Dilated Convolutional Neural Networks (CNNs) to
interpret spoken words based on visual cues from lip movements. Dilated CNNs are chosen for
their ability to capture both local and global dependencies in the lip image sequences effectively.
The Turkish dataset provides annotated examples of lip movements corresponding to different
spoken words or phonemes, enabling supervised learning. This method aims to achieve accurate
classification by leveraging deep learning techniques to extract and analyze spatial features from
lip images. The research focuses on enhancing the precision and reliability of lip-reading
systems specifically tailored to Turkish speech patterns and contexts.

Souheil Fenghour [13] Lip reading using Convolutional Neural Networks (CNNs), both
with and without pre-trained models, explores the effectiveness of leveraging deep learning for
interpreting speech from lip movements. The approach involves training CNN architectures on

visual sequences of lip movements to extract spatial features. In one scenario, models are trained
from scratch without pre-existing weights, allowing them to learn directly from the lip-reading
task data. Alternatively, pre-trained CNN models, which have learned generic visual features
from large datasets like ImageNet, are fine-tuned on lip-reading datasets to enhance
performance. This comparative study aims to assess the benefits of transfer learning in
improving accuracy and efficiency in lip-reading tasks, thereby advancing the capabilities of
automated speech recognition systems based on visual cues.

Hassan Akbari [14] Lip2audspec is a novel approach focused on reconstructing speech


from silent lip movements captured in video footage. This innovative method uses deep learning
techniques to interpret visual cues from lip movements and convert them into corresponding
auditory spectrograms, which represent the sound characteristics of spoken words. By training
on large datasets containing synchronized video and audio samples, the model learns to associate
specific lip movements with corresponding speech sounds. This technology aims to facilitate
speech reconstruction for individuals with speech impairments or in scenarios where audio
information is unavailable. Lip2audspec demonstrates promising advancements in bridging the
gap between visual and auditory speech perception through neural network-based reconstruction
methods.

The project on lip reading sentences using deep learning exclusively with visual cues
focuses on interpreting spoken language solely from the visual information of lip movements.
This approach employs Convolutional Neural Networks (CNNs) to extract spatial features from
video frames depicting lip motion sequences. By training on annotated datasets containing
examples of lip movements corresponding to spoken sentences, the model learns to recognize
and transcribe words without relying on audio information. This method aims to enhance
accessibility for individuals with hearing impairments and improve the accuracy of automated
speech recognition systems in noisy environments where audio signals may be compromised.

Table 2.1.1: Literature Survey

| Ref. No. | Author | Title | Methodology | Year of Publication | Accuracy |
| --- | --- | --- | --- | --- | --- |
| [1] | Pingchuan Ma, Yujiang Wang | A Review on Deep Learning Based Lip-Reading | Train the CNN model on the dataset to identify words; use an attention-based LSTM to focus on key frames, improving robustness against image translation, rotation, and distortion. | 2022 | 88.2% |
| [2] | Alexandros Haliassos, Adriana Fernandez-Lopez | Vision Based Lip Reading System Using Deep Learning | VGG19 and ResNet50 extract spatial features from keyframes, which are then processed by an attention-based LSTM to capture temporal characteristics and ensure robust word identification. | 2023 | 85% (using ResNet50 and ensemble learning) |
| [3] | Mutallip Mamut, Nurbiya Yadikar | Training Strategies for Improved Lip-Reading | Combines cropping and time masking for data augmentation, BGRUs and DC-TCNs for the temporal model, and employs self-distillation and word boundary indicators in training. | 2022 | 83.4% |
| [4] | Atharva Karekar, Aakansha Gharate | AUTO-AVSR: Audio-Visual Speech Recognition with Automatic Labels | Publicly available pre-trained ASR models automatically transcribe unlabelled datasets; the training set is augmented with these transcriptions alongside LRS2 and LRS3 data. | 2023 | 80% |
| [5] | Priyanshu Aggarwal | A Survey of Research on Lip Reading Technology | Outlines the classification methods employed, such as Template Matching, DTW, HMM, SVM, and TDNN. | 2020 | 87.55% |
| [6] | Nikita Deshmukh | Lip Reading Using a Dynamic Feature of Lip Images | Convolutional Neural Networks (CNNs) process the dynamic feature, reducing negative influences such as face alignment blurring. | 2016 | 71.76% |
| [7] | Gaoyan Zhang | Lip-Reading via Deep Neural Network Using Appearance-Based Visual Features | A Deep Belief Network (DBN) is used for the recognition part of the lip reading. | 2020 | 45.63% |
| [8] | Karan Shrestha | Lip Reading Using Deep Learning | Deep learning models, such as CNNs and RNNs, are trained on the preprocessed dataset to learn the relationship between visual and auditory information. | 2022 | 74.9% |
| [9] | Yiting Li, Yuki Takashima | Research on a Lip-Reading Algorithm Based on Efficient-Ghost Net | Proposes an optimization approach based on Ghost Net, a lightweight network architecture, enhancing it to create an even more efficient model named Efficient-Ghost Net. | 2019 | 76.3% |
| [10] | Fatemeh Vakhshiteh | Lip Reading Using Neural Network and Deep Learning | A Haar feature-based cascade classifier detects the face and mouth region in each input video; the pre-processed data is then used to train the lip-reading model with a 3D CNN architecture. | 2017 | 77.14% |
| [11] | Nergis Pervan Akman | Lip Reading Multiclass Classification by Using Dilated CNN with Turkish Dataset | Evaluated using a Dilated Convolutional Neural Network (DCNN), a variation of the CNN. | 2022 | 58.90% |
| [12] | Tayyip Özcan, Alper Basturk | Lip Reading Using Convolutional Neural Network with and Without Pre-Trained Models | Lip reading from video is performed with the CNN technique; the standard and AVLetters datasets are used for training and testing the CNN. | 2019 | 64.40% |
| [13] | Souheil Fenghour | Lip2AudSpec: Speech Reconstruction from Silent Lip Movements Video | CNN and LSTM models are used to train the system and reconstruct speech from lip reading. | 2018 | 79% |
| [14] | Hassan Akbari | Lip Reading Sentences Using Deep Learning with Only Visual Cues | Classification of visemes, which are then converted to words using perplexity analysis. | 2020 | 64.04% |
2.2 Drawbacks of Existing Approaches

• Existing vision-based lip-reading systems using deep learning face challenges due to
computational demands of complex models like VGG19 and ResNet50, limiting real-time
application. Dependency on pre-trained models may hinder accuracy without fine-tuning for lip-
reading tasks. Ensemble learning, while improving performance, adds complexity in model
integration and increases computational overhead. Despite achieving 85% accuracy, these
systems may lack robustness against variations in lighting, facial expressions, and diverse speech
patterns encountered in real-world settings.

• Training strategies for improved lip-reading face challenges in complexity due to methods like
cropping, time masking, BGRUs, and DC-TCNs, requiring substantial computational resources.
Dependency on specific techniques such as self-distillation and word boundary indicators may
limit generalization beyond training conditions. While achieving 93.4% accuracy, scalability in
real-world scenarios with diverse speech styles and environments remains a concern,
necessitating robust validation and accessibility to large, varied datasets for effective
implementation.

• Lip reading research utilizes diverse methods (Template Matching, DTW, HMM, SVM, TDNN)
with varying accuracy and computational efficiency. Challenges include robust feature
extraction from lip movements, impacting performance. Integrating new technologies improves
accuracy but increases complexity and computational requirements. Achieving 87.55% accuracy
highlights potential, but reliance on large-scale databases for training poses challenges in data
management and accessibility for broader deployment.

• Lip reading using dynamic features and CNNs faces challenges despite achieving 71.76%
accuracy. Issues include complex image processing requirements, sensitivity to image quality
affecting alignment and clarity, and limitations in generalizing to diverse real-world conditions
beyond specific variations like translation and rotation. Acquiring and annotating large, varied
datasets remains crucial for improving robustness and overcoming training data constraints in
practical applications.

• Lip-reading systems using Deep Neural Networks (DNNs) and appearance-based visual features
face challenges despite achieving 45.63% accuracy. Issues include limited accuracy in

interpreting lip movements and recognizing words, highlighting the need for enhanced feature
extraction and model refinement. Dependency on high-quality lip images for effective
performance also remains a significant concern for real-world application and robustness.

• Lip reading with deep learning, leveraging CNNs and RNNs on preprocessed datasets, has
notably enhanced accuracy compared to traditional methods. Challenges include the high
computational demands for training and the critical dependency on the quality and diversity of
training data. Real-time application feasibility remains a concern due to these computational
requirements.

• Efficient-Ghost Net for lip reading achieves 76.3% accuracy but faces challenges. These include
potential difficulties in generalizing to diverse lighting, facial expressions, and speech styles not
well-represented in training. Implementing and optimizing the architecture require specialized
expertise, and effective performance hinges on access to large, diverse datasets for robust
training and validation.

• Lip reading with neural networks and deep learning achieves 77.14% accuracy using Haar
Feature-Based Cascade for face and mouth detection, followed by training with a 3D CNN
architecture. Despite its accuracy, challenges include potential limitations in handling diverse
facial orientations and expressions not adequately represented in training data, and the need for
robustness in real-world environments with varying lighting conditions and speech styles.

• Using Dilated Convolutional Neural Networks (DCNN) for lip reading achieves 58.90%
accuracy with a Turkish dataset but faces challenges. These include limited accuracy in
multiclass classification, reliance on dataset expansion for improved performance, and
complexities in preprocessing strategies that may affect scalability and real-time implementation
in diverse environments.

• Lip reading with CNNs achieves 64.40% accuracy using the standard and AVLetters datasets, with and without pre-trained models. Challenges include potential limitations in accurately
capturing nuanced lip movements and variability in different speaking styles and environments.
Improving robustness and generalization remains crucial for practical deployment in varied real-
world scenarios.

• Lip2audspec synthesizes speech from silent lip movements with 79% accuracy using CNN and
LSTM models. Challenges include potential difficulties in accurately capturing speech nuances
and variations in real-world noisy environments, affecting robustness and reliability in practical
applications.

• Lip reading sentences using only visual cues achieves 64.04% accuracy by classifying Visemes
and employing perplexity analysis for word conversion. Challenges include limitations in
accurately transcribing varied speech patterns and accounting for diverse environmental
conditions not sufficiently covered in training data, affecting overall robustness and applicability
in real-world scenarios.

CHAPTER 3
PROPOSED METHOD
3.1. Problem Statement and Objectives
The problem statement and the objectives of the project are discussed in this section
3.1.1 Problem Statement
An estimated 466 million people live with disabling hearing impairment, which creates a significant barrier to everyday interaction. Lip reading is particularly valuable for hearing-impaired people in situations with heavy acoustic interference, where it can outperform conventional hearing aids. However, current lip-reading models suffer from low accuracy and limited adaptability, which restricts their practical use.

Within the scope of this project, a new lip-reading model is proposed that builds on existing deep learning algorithms, with the goal of increasing accuracy and robustness across varying conditions. The objective is to remove some of the shortcomings still found in existing systems and to improve the quality of communication for the hearing-impaired community.

The project therefore focuses on designing a lip-reading system that accurately interprets facial movements, particularly of the lips and mouth, so that communication remains possible regardless of the noise level in the environment. It exploits deep learning techniques, which have proven effective in a wide range of visual and auditory tasks, to learn lip movement recognition that adapts to the speaker and the context.

In the long run, this project aims first to make it easier for hearing people to communicate effectively with the hearing-impaired, and second to present spoken speech in a form that the hearing-impaired can comprehend and rely on. This would also improve their quality of life, for example when seeking employment, and reduce the difficulties that disabled persons face in social and working contexts.

3.1.2 Objectives

The creation of a novel lip-reading model for the hearing-impaired entails the following objectives, all of which share the common aim of improving communication where hearing is impaired. The refined objectives are:

Achieve High Accuracy: Establish a model that interprets lip movements as the actual spoken language with a high level of accuracy across a variety of speakers.

Ensure Environmental Robustness: Design the model to work well in different environments, including artificial or poor lighting and noisy settings, to mirror real-world conditions.

Adapt to Diverse Speakers: Design a highly adaptive system that handles a wide range of lip movements, facial movements, and speaking patterns across different groups of people.

Enable Real-time Processing: Support real-time lip reading so that actual conversations can take place without noticeable delay.

Support Scalability: Create a model that can be integrated into various platforms, such as telecommunication applications, mobile devices, and assistive software for day-to-day activities, keeping it versatile across a broad range of applications.

Provide a User-friendly Experience: Make the technology easy to use; it should not require the user to undergo a long learning process in order to benefit from it.

Integrate with Assistive Technologies: Ensure integration with current assistive devices such as hearing aids and speech-to-text software, making the solution a complete communication tool.

Facilitate Continuous Learning: Implement frameworks for the model’s continuous
development, ensuring incoming data and user feedback are incorporated to enhance and update
the model.

Maintain Cultural and Linguistic Sensitivity: The model should take culture and language
differences into consideration so that it is universally acceptable for all users of different ethnic
origin and language abilities.

Prioritize Privacy and Security: Put measures in place to ensure that data privacy and security are observed, in adherence to the highest ethical standards.

3.2. Detailed Explanation of Architecture Diagram

Figure 3.2.1 Architecture Diagram for Lip Reading Model

Lip reading involves several critical steps, each contributing to the accurate interpretation of
spoken language through visual cues from lip movements. Below is a detailed breakdown of
each step in the lip-reading process:

3.2.1 Input Video

The process begins with an input video that captures the speaker’s face, specifically focusing on
the lip region. This video serves as the raw data from which visual speech cues will be extracted.
The quality and resolution of the video are important factors, as they affect the clarity of the lip
movements and, consequently, the model's performance.

3.2.2 Frame Conversion

In this step, the input video is divided into a sequence of individual frames. Each frame represents
a single moment in time, capturing the position and shape of the lips as the speaker articulates
different sounds. This conversion is crucial as it transforms the continuous video stream into
discrete units that can be processed by the model.

3.2.3 Preprocessing

Preprocessing involves several sub-steps to prepare the frames for feature extraction:

Lip Detection and Cropping: The lip region is detected in each frame, and the area surrounding
the lips is cropped to focus on the relevant part of the face. This ensures that the model analyzes
only the necessary visual information.

Normalization: The cropped frames are normalized to a standard size and scale. Normalization
adjusts the pixel values to a common range, improving consistency across frames and enhancing
the model's ability to learn from the data.

Data Augmentation: Techniques such as rotation, scaling, and flipping may be applied to the
frames to create a more diverse training dataset, helping the model generalize better to different
speakers and conditions.
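A small sketch of how the augmentation listed above might be applied at the clip level, using the same random rotation, scaling, and flip for every frame so that the temporal sequence stays consistent. The parameter ranges are illustrative assumptions.

```python
import cv2
import numpy as np

def augment_clip(frames, max_angle=5.0, max_scale=0.1):
    """frames: (T, H, W) array of normalized lip crops; one transform per clip."""
    _, h, w = frames.shape
    angle = np.random.uniform(-max_angle, max_angle)
    scale = 1.0 + np.random.uniform(-max_scale, max_scale)
    flip = np.random.rand() < 0.5
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    out = []
    for f in frames:
        g = cv2.warpAffine(f, m, (w, h), borderMode=cv2.BORDER_REPLICATE)
        out.append(np.fliplr(g) if flip else g)
    return np.stack(out)
```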

3.2.4 Feature Extraction

Feature extraction is performed using Convolutional Neural Networks (CNNs), which are
effective in identifying and capturing spatial patterns in images:

Convolutional Layers: These layers apply filters to the input frames to detect features such as
edges, textures, and shapes. The convolutional process results in feature maps that highlight
important visual details of the lip movements.

Pooling Layers: Pooling operations reduce the dimensionality of the feature maps, retaining the
most significant information while making the data more manageable for the subsequent layers.
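The convolution and pooling stack described above could look roughly like the following, written here with 3D convolutions so that each filter also sees a short temporal window. The filter counts and input shape are assumptions for illustration only.

```python
from tensorflow.keras import layers, models

def build_feature_extractor(frames=75, height=50, width=100):
    inp = layers.Input(shape=(frames, height, width, 1))
    # Convolutional layers detect edges, textures and shapes around the lips.
    x = layers.Conv3D(32, (3, 3, 3), padding="same", activation="relu")(inp)
    x = layers.MaxPooling3D((1, 2, 2))(x)   # pool spatially, keep all frames
    x = layers.Conv3D(64, (3, 3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling3D((1, 2, 2))(x)
    # Flatten each frame's feature map into a vector: (frames, feature_dim).
    x = layers.TimeDistributed(layers.Flatten())(x)
    return models.Model(inp, x)
```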

3.2.5 Training the Model

The training phase involves teaching the model to recognize and interpret lip movements:

Temporal Modeling: Recurrent Neural Networks (RNNs), particularly Long Short-Term


Memory (LSTM) or Gated Recurrent Units (GRUs), are used to capture the temporal
dependencies between frames. These networks process the sequence of feature maps and learn
the patterns associated with different phonemes and words.

Loss Function and Optimization: The model is trained using a loss function that measures the
difference between the predicted and actual outputs. Optimization algorithms, such as Adam or
SGD (Stochastic Gradient Descent), are employed to minimize this loss, adjusting the model’s
parameters to improve accuracy.
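As one concrete but hypothetical realization of the loss and optimization step, the sketch below pairs a CTC-style objective with the Adam optimizer, a common choice when the network emits per-frame character probabilities. The shapes and the assumption that labels are not padded are simplifications.

```python
import tensorflow as tf

def ctc_loss(y_true, y_pred):
    # y_pred: (batch, frames, vocab) per-frame softmax including a blank token.
    # y_true: (batch, label_len) integer character labels (assumed unpadded).
    batch = tf.shape(y_pred)[0]
    input_len = tf.fill([batch, 1], tf.shape(y_pred)[1])
    label_len = tf.fill([batch, 1], tf.shape(y_true)[1])
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

# Hypothetical usage with a model whose output is (batch, frames, vocab):
# model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=ctc_loss)
```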

3.2.6 Testing the Model

Once trained, the model is evaluated on a separate test dataset that it has not seen during training.
This step assesses the model's ability to generalize to new, unseen data:

Performance Metrics: Metrics such as accuracy, precision, recall, and F1 score are used to
quantify the model’s performance. These metrics help determine how well the model can
interpret lip movements and convert them into text.
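Assuming scikit-learn is available and that predictions are integer class labels for the held-out clips, the metrics named above could be computed as in the short sketch below.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def report_metrics(y_true, y_pred):
    # Macro averaging treats every class (word or phrase) equally.
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```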

3.2.7 Text Output

The final step is the generation of the text output, which corresponds to the spoken words
represented by the lip movements in the input video:

Classification: The processed features are passed through a softmax layer, which outputs a probability distribution over the possible speech classes (e.g., phonemes or words).

Text Generation: The highest probability class is selected for each frame sequence, and these are
combined to form the final predicted text. This output provides a readable transcription of the
visual speech input, effectively translating lip movements into written language.
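A minimal sketch of the classification and text-generation step: greedy (best-path) decoding of the per-frame softmax output, collapsing repeated symbols and dropping blanks in the CTC style. The character set shown here is an assumption.

```python
import numpy as np

VOCAB = list("abcdefghijklmnopqrstuvwxyz '") + ["<blank>"]  # assumed character set
BLANK = len(VOCAB) - 1

def greedy_decode(probs):
    """probs: (frames, len(VOCAB)) softmax output for one clip."""
    best = np.argmax(probs, axis=-1)
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != BLANK:   # collapse repeats, drop blanks
            chars.append(VOCAB[idx])
        prev = idx
    return "".join(chars)
```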

3.3 Modules Connectivity Diagram

Figure 3.3 Modules Connectivity Diagram

Description About Modules Connectivity Diagram

The module connectivity diagram of the lip-reading model depicts the flow of information and the processing steps central to comprehending spoken language from vision. The process starts with the Video Input module, which receives consecutive frames depicting lip movements. The frames are passed to the Preprocessing module, where cropping and normalization make them clearer and better focused, and then to the Feature Extraction module, where techniques such as CNNs extract spatial features from each frame.

The features then proceed to the Temporal CNN module, where temporal relations across frames are extracted through temporal convolutions. The processed sequence of features is fed into a Bidirectional LSTM/GRU module, which models the context of successive lip movements. The LSTM/GRU output then passes through an Attention Mechanism that focuses on the frames or features most significant for correct interpretation.

After attentional processing, the features are sent to the Classification module, in which the model outputs the phonemes or words mapped to the lip movements. Finally, the classification results are transferred to the Lip-Reading module, where the predicted linguistic information is fused into the final output.

This connectivity chart defines the concrete links between the visual information from lip movements and its transformation into linguistic output, illustrating the coordinated cooperation of the separate processing modules that are critical for lip reading.

3.4 Software and Hardware Requirements

3.4.1 Software

• Language: Python

• Libraries: OpenCV, TensorFlow

• Text Editor or IDE such as VS Code, or Google Colab

3.4.2 Hardware
• Operating System: Windows 11
• Processor: Intel® Core™ i5 (12th Gen)
• RAM: 8 GB
• System type: 64-bit operating system
• Graphics Processing Unit (GPU)

3.5 Analysis and Design through UML
3.5.1 Class Diagram

Figure 3.5.1 Class Diagram

Description about Class Diagram


Data Loader
The `Data Loader` class handles loading the data: reading the video files and their alignments, splitting the data into training and test sets, and providing utilities for obtaining batches.

Preprocessor
The `Preprocessor` class pre-processes raw video data through face detection, lip localization, and frame normalization. It prepares video frames for feature extraction by removing unnecessary information and keeping the input standardized.

Postprocessor
The `Postprocessor` class interprets model outputs and translates them into readable text. It employs methods such as CTC decoding to convert predicted probabilities into sentences or phonemes.

Feature Extractor
The `Feature Extractor` class uses CNNs to extract spatial features from the pre-processed video frames, supplying the lip-reading model with meaningful information.

Lipreading Model
The `Lipreading Model` class is the central class responsible for constructing the whole neural network architecture. It encompasses layers for temporal modeling, such as RNNs or TCNs, and predicts text from the extracted features.

Trainer
The `Trainer` class trains the `Lipreading Model` by supplying it with training data, evaluating the loss, and tuning the parameters. It implements the training loop, loss computation, gradient updates, and optimization algorithms.

Evaluator
The `Evaluator` class evaluates the trained `Lipreading Model` on the test set. It measures metrics such as the word error rate, comparing the predicted words against the actual words to determine the model's effectiveness.
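A skeleton of the classes described above, as they might appear in Python. The method names and signatures are illustrative assumptions used only to show each class's responsibility, not the report's actual implementation.

```python
class DataLoader:
    """Reads video/alignment pairs, splits them, and serves batches."""
    def __init__(self, video_dir, align_dir, split=0.9):
        self.video_dir, self.align_dir, self.split = video_dir, align_dir, split
    def load(self): ...
    def get_batch(self, batch_size): ...

class Preprocessor:
    """Face detection, lip localization, and frame normalization."""
    def process(self, frames): ...

class FeatureExtractor:
    """Runs the CNN backbone and returns per-frame spatial features."""
    def extract(self, frames): ...

class LipreadingModel:
    """Temporal model (RNN/TCN) mapping features to character probabilities."""
    def predict(self, features): ...

class Trainer:
    """Training loop: loss computation, gradient updates, optimization."""
    def __init__(self, model, loss_fn, optimizer):
        self.model, self.loss_fn, self.optimizer = model, loss_fn, optimizer
    def fit(self, loader, epochs): ...

class Evaluator:
    """Compares predictions with references, e.g. via word error rate."""
    def word_error_rate(self, references, hypotheses): ...

class Postprocessor:
    """CTC-decodes predicted probabilities into readable sentences."""
    def decode(self, probabilities): ...
```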

3.5.2 Dataflow Diagram

Figure 3.5.2. Data Flow Diagram

Description about Data Flow Diagram

The DFD for the lip-reading model shows the flow of data and various activities that take place
to acquire processed text data from raw video data. In a nutshell, the diagram enables the
identification of flow of information from the input stage to the output stage.

3.5.2.1 Input Stage

Data Loading: The process starts with the `Data Loader` component loading the video data and alignment files into the system. This step acquires the video files containing the lip footage together with the annotations that indicate what is spoken.

3.5.2.2 Preprocessing Stage

Face Detection and Lip Localization: The `Preprocessor` module performs face detection and lip localization within the video frames. This step excludes from later stages any data not relevant to the visual speech content.

Frame Normalization: The frames are resized and normalized so that the input quality is consistent across all videos in the dataset.

3.5.2.3 Feature Extraction Stage

Spatial Feature Extraction: The `Feature Extractor` component uses a Convolutional Neural Network (CNN) to extract spatial features from the preprocessed video frames. These features capture the essential visual patterns and details required to understand the lip movements.

3.5.2.4 Temporal Modeling Stage

Lip Reading Model: The main `Lipreading Model` module analyzes the extracted spatial
features using temporal modeling that includes RNNs or TCNs. This stage deals with capturing
the motion of the lips over time to infer the spoken text.

3.5.2.5 Output Stage

Text Prediction: The model outputs the predicted text (or phonemes) corresponding to the observed lip movements. This step constitutes the last phase of the lip-reading process and reflects how well the system translates the visual input into useful text.

3.5.3 Use case diagram

Figure 3.5.3 Use Case Diagram

Description About Use Case Diagram


The use case diagram for the lip-reading model demonstrates the main roles and features involved in the interaction between users and the system components. Fundamentally, the diagram shows how users engage with the system to process and transcribe lip movements into written text.

Users, or actors, perform several significant activities in the diagram. They start by inputting video data into the system; in addition to the video files, the input includes the files containing textual alignment annotations. The system then takes these inputs through different processes. First, a feature extraction module, usually based on CNNs, extracts spatial features from the preprocessed video frames; these features capture the visual information vital to lip reading. Next, the lip-reading model is trained on the extracted features and the aligned text data through training loops and optimization of the model parameters.

After training, the system measures the model's accuracy and performance on the test set using metrics such as WER and CER. Users can then employ the model to predict spoken text in real time, demonstrating its practical use. Lastly, the system formats the output into comprehensible phrases or phonemes as the final interpreted text, ready for use or further analysis.

3.6Testing
Evaluating a lip-reading model is a multi-step process that covers its accuracy, robustness, and
ability to generalize. The evaluation starts with the preparation of a dataset, which in this
project consists of a single speaker recorded under consistent speaking conditions. The dataset
is divided into training, validation, and test sets so that the model is trained on one portion
of the data while being evaluated on data it has never seen. The test set is especially valuable
because it gives an independent estimate of the model's performance and of how accurately it
handles new examples.
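
Such a split can be obtained directly on a tf.data pipeline, as in the sketch below; the split sizes are assumptions for illustration and differ from the simple train/test split used in the appendix.

import tensorflow as tf

# Minimal sketch: split a shuffled dataset of clips into train/validation/test.
dataset = tf.data.Dataset.list_files('./data/s1/*.mpg')
dataset = dataset.shuffle(500, reshuffle_each_iteration=False)

train_ds = dataset.take(360)                 # assumed split sizes
val_ds   = dataset.skip(360).take(45)
test_ds  = dataset.skip(405)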

When testing the model, several output measures are considered, including the word error rate
(WER), sentence error rate, and character error rate (CER). These metrics reflect different
aspects of the model's performance: WER measures the proportion of incorrectly predicted words,
while CER measures the proportion of character-level mistakes. In addition, the model is
evaluated for robustness against variations in lighting, background noise, and speaker
lip-movement dynamics. This is typically done by applying various distortions to the test dataset
and assessing the model's performance under those conditions. A minimal sketch of how WER and CER
can be computed is given below.
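
The following sketch computes WER and CER from a reference string and a hypothesis string using an edit distance; the function names and the sample sentences are illustrative assumptions, not project code.

def edit_distance(ref, hyp):
    # Levenshtein distance between two sequences, computed with a single rolling row.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: word-level edits divided by the number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    # Character error rate: character-level edits divided by the reference length.
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

print(wer('bin blue at f two now', 'bin blue at f two no'))   # one wrong word out of six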

Furthermore, cross-validation is used to confirm the validity of the developed model. In k-fold
cross-validation, for example, the dataset is divided into k subsets and the model is trained k
times, each time holding out a different subset for testing while the remaining subsets are used
for training. This helps to detect overfitting and to ensure that the model does not become too
specialized to generalize to new samples. Finally, the results are compared against previously
published algorithms or models for the same problem to quantify the performance gain that was
obtained. A sketch of such a k-fold split over the video files follows.
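
The sketch below builds k-fold splits over the list of video files; the data path follows the layout described in Chapter 4, and the round-robin fold assignment is an assumption for illustration.

import glob

def k_fold_splits(files, k=5):
    # Yield (train_files, test_files) pairs for k-fold cross-validation.
    folds = [files[i::k] for i in range(k)]      # round-robin assignment to k folds
    for i in range(k):
        test_files = folds[i]
        train_files = [f for j, fold in enumerate(folds) if j != i for f in fold]
        yield train_files, test_files

video_files = sorted(glob.glob('./data/s1/*.mpg'))
for fold_no, (train_files, test_files) in enumerate(k_fold_splits(video_files, k=5), 1):
    print(f'Fold {fold_no}: {len(train_files)} training clips, {len(test_files)} test clips')
    # A fresh model would be built and trained on train_files, then evaluated on test_files.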

CHAPTER 4

RESULTS AND DISCUSSIONS

4.1. Description about Dataset

The dataset is designed to facilitate the training of a lip-reading model and consists of two main
folders: alignments and s1.
4.1.1 Alignments File
Purpose: Stores the alignment information for the videos held in the s1 folder.
Contents
Alignments: Each alignment file describes what is said in the corresponding video and relates the
frames of the video to its phonemes or words.
Format: Every line of an alignment file is associated with one video in the s1 folder and can
include times or frame numbers together with the text; a short parsing sketch is given below.
Silence Representation: Where there is no speech in the video, the alignment file uses the
label “sil”.
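
For illustration only, the sketch below parses lines of the "timing timing token" layout described above and keeps the word tokens, mirroring the loading code in the appendix. The sample lines and their timing values are hypothetical.

def parse_alignment_line(line: str):
    # Split one alignment line into (start, end, token).
    start, end, token = line.split()
    return start, end, token

# Hypothetical sample lines following the layout described above.
sample = ["0 9500 sil", "9500 14250 bin", "14250 19000 blue"]
words = [parse_alignment_line(l)[2] for l in sample if parse_alignment_line(l)[2] != 'sil']
print(words)   # ['bin', 'blue']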

Figure 4.1 Alignment folder in data.zip file

4.1.2 S1 File
Purpose: Stores the video files that were used to build the model.
Contents
Videos: The s1 folder contains 450 videos, each featuring a single speaker.
Duration: Each video is about 2 to 3 seconds long on average.
Single Speaker: All videos feature the same speaker, which keeps the lip movements and speech
motor patterns consistent.
File Format: The recordings are saved in a common video format (e.g., MPG); a short inspection
sketch is given below.
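
As a quick check of these properties, one clip can be inspected with OpenCV as sketched below; the file name is an example taken from the dataset layout used later in the appendix.

import cv2

# Minimal sketch: inspect one clip's frame count, frame rate, and duration.
cap = cv2.VideoCapture('./data/s1/bbal6n.mpg')
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()

print(f'{frame_count} frames at {fps:.1f} fps '
      f'-> about {frame_count / fps:.2f} seconds')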

Figure 4.2 s1 folder in data.zip file

4.2 Detailed Explanation about the Experimental Results

The model was trained on 420 videos and evaluated on the remaining test videos; the trained model
produced accurate predictions, returning the spoken text as output.

RESULT:

Figure 4.1.1 Text generation from video

4.3 Significance of the Proposed Method with its Advantages
The proposed lip-reading model has the potential to raise human-computer interaction, and in
particular assistive technologies for people with disabilities, to a new level. By correctly
interpreting spoken language from visual cues alone, the model opens new opportunities for
hearing-impaired people and brings them closer to effective communication in conditions where
audio information alone does not suffice. In addition, the model's versatility extends to security
alarms, surveillance systems, and silent information exchange in noisy environments.

Among its key benefits, the proposed lip-reading model is particularly effective in cases where
the source audio is either absent or of poor quality. This makes it especially suitable for
settings with heavy interference, where audio-based speech recognition systems are prone to fail.
By relying on features extracted from the video alone, the proposed model avoids the problems
caused by low sound quality.

4.3.1 Enhanced Accessibility

For Hearing Impaired Individuals: By translating lip movements, this model offers a way of
converting spoken language into a form that is accessible to persons with hearing impairment.
This can greatly improve communication in live discussions and online communication where no
subtitles or sign language interpreters are available.

Silent Communication: In settings where it is important not to speak aloud, such as libraries or
meetings, the model allows communication through lip movements alone, which the computer can
interpret without any sound.

4.3.2 Improved Security and Surveillance

Speech Recovery in Noisy Environments: The model can be used at crowded events or in security
settings to recover spoken content that is hard to understand because of surrounding noise.

4.3.3 Technical Robustness and Versatility

Noise-Independent Performance: Unlike conventional speech recognition models, which rely primarily
on the audio feed, the lip-reading model remains reliable across different acoustic conditions
because acoustic noise has no effect on its visual input.

Single Speaker Focus: The model is trained on videos of a single speaker, which enables accurate
interpretation of the lip movements without the errors that might be introduced by multiple
speakers.

CHAPTER 5
CONCLUSION AND FUTURE ENHANCEMENTS

5.1 Conclusion
In our lip-reading project, we developed a robust system for accurately converting visual speech
into text. Using advanced machine learning algorithms, we achieved significant accuracy in single-
speaker scenarios. Our work included the development of preprocessing and prediction pipelines,
setting a strong foundation for future enhancements.

While we made substantial progress, future enhancements will focus on integrating live webcam
video input, improving accuracy with multiple speakers, creating a user-friendly interface, and
adding multilingual translation capabilities. Our project establishes a solid groundwork, ready to
be built upon for more advanced and versatile applications.

5.2 Future Enhancement


To further enhance the lip-reading model and expand its capabilities, several key improvements
and features can be implemented:

5.2.1 Real-time Lip Reading with Webcam Integration


Integrate the lip-reading model with a webcam to enable real-time testing of lip-reading accuracy.
This enhancement allows users to receive immediate feedback on spoken language interpretation,
making the technology more interactive and practical for everyday use.
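
As a starting point, a minimal sketch of such an integration is given below. It assumes that the trained `model` and the `num_to_char` lookup from the appendix are already available, and it reuses the same fixed mouth crop and normalization as the offline pipeline; a real deployment would need face and lip tracking instead of a fixed crop.

import cv2
import tensorflow as tf

# Minimal sketch: buffer 75 grayscale mouth-region crops from the webcam,
# then run the trained lip-reading model on the buffered clip.
# 'model' and 'num_to_char' are assumed to be defined as in the appendix.
cap = cv2.VideoCapture(0)
frames = []
while len(frames) < 75:
    ret, frame = cap.read()
    if not ret:
        break
    gray = tf.image.rgb_to_grayscale(frame)
    frames.append(gray[190:236, 80:220, :])      # fixed crop, an assumption for this sketch
cap.release()

if len(frames) == 75:
    clip = tf.cast(frames, tf.float32)
    clip = (clip - tf.math.reduce_mean(clip)) / tf.math.reduce_std(clip)
    yhat = model.predict(tf.expand_dims(clip, axis=0))
    decoded = tf.keras.backend.ctc_decode(yhat, input_length=[75], greedy=True)[0][0].numpy()
    print(tf.strings.reduce_join(num_to_char(decoded[0])).numpy().decode('utf-8'))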

5.2.2 Improved Accuracy in Multi-speaker Environments


Enhance the model’s accuracy in scenarios where multiple individuals are speaking simultaneously
in the video source. This can be achieved through advanced audio-visual fusion techniques and
sophisticated deep learning architectures that can differentiate and interpret overlapping lip
movements more effectively.

5.2.3 User Interface (UI) Development for Enhanced Usability


Create a user-friendly interface that simplifies interaction with the lip-reading technology. The UI
should provide intuitive controls for starting and stopping video input, displaying real-time
translations, adjusting settings, and viewing historical data or logs. Design considerations should
focus on accessibility and ease of use for users with varying levels of technical expertise.

43
5.2.4 Multilingual Translation Capabilities
Implement translation capabilities that can convert both the visual lip movements and
corresponding text into multiple languages. This enhancement facilitates communication across
linguistic barriers, making the technology accessible to a global audience.

5.2.5 Implementation Considerations


Data Synchronization: Ensure synchronization between visual input from the webcam and audio
input for accurate lip reading in real-time.
Audio-Visual Fusion: Develop algorithms that combine visual lip movements with audio cues to
improve accuracy, especially in noisy environments or when multiple speakers are present.
Machine Translation Integration: Integrate machine translation models to convert the text output
into different languages, leveraging advances in natural language processing (NLP) for accurate
and fluent translations (see the sketch below).
User-Centric Design: Improve the user experience with an intuitive interface that enhances
usability and accessibility.
By integrating these enhancements, the lip-reading model can evolve into a versatile and
indispensable tool for improving communication across diverse settings and user needs. These
advancements not only enhance the accuracy and usability of the technology but also contribute to
its broader adoption and impact in various fields.
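
One possible route for the translation step is sketched below. It assumes the Hugging Face transformers library and a pretrained translation checkpoint are available; neither the library nor the checkpoint name is part of this project, and the input sentence is only an example of the model's text output.

from transformers import pipeline   # assumption: Hugging Face transformers is installed

# Minimal sketch: translate the predicted English text into French.
translator = pipeline('translation_en_to_fr', model='Helsinki-NLP/opus-mt-en-fr')
predicted_text = 'bin blue at f two now'     # example output from the lip-reading model
print(translator(predicted_text)[0]['translation_text'])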

CHAPTER 6
APPENDICES
6.1 Importing the required libraries
!pip list
!pip install opencv-python matplotlib imageio gdown tensorflow

import os
import cv2
import tensorflow as tf
import numpy as np
from typing import List
from matplotlib import pyplot as plt
import imageio
tf.config.list_physical_devices('GPU')
physical_devices = tf.config.list_physical_devices('GPU')
try:
    # Let TensorFlow grow GPU memory on demand instead of reserving it all upfront.
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except:
    pass

6.2. Build Data Loading Functions

import gdown
url = 'https://drive.google.com/uc?id=1YlvpDLix3S-U8fd-gqRwPcWXAXm8JwjL'
output = 'data.zip'
gdown.download(url, output, quiet=False)
gdown.extractall('data.zip')
def load_video(path:str) -> List[float]:
    cap = cv2.VideoCapture(path)
    frames = []
    for _ in range(int(cap.get(cv2.CAP_PROP_FRAME_COUNT))):
        ret, frame = cap.read()
        frame = tf.image.rgb_to_grayscale(frame)
        # Crop the fixed mouth region of this speaker.
        frames.append(frame[190:236, 80:220, :])
    cap.release()

    # Standardize the clip (zero mean, unit variance).
    mean = tf.math.reduce_mean(frames)
    std = tf.math.reduce_std(tf.cast(frames, tf.float32))
    return tf.cast((frames - mean), tf.float32) / std

45
vocab = [x for x in "abcdefghijklmnopqrstuvwxyz'?!123456789 "]

char_to_num = tf.keras.layers.StringLookup(vocabulary=vocab, oov_token="")


num_to_char = tf.keras.layers.StringLookup(
vocabulary=char_to_num.get_vocabulary(), oov_token="", invert=True
)

print(
f"The vocabulary is: {char_to_num.get_vocabulary()} "
f"(size ={char_to_num.vocabulary_size()})"
)

char_to_num.get_vocabulary()
char_to_num(['n','i','c','k'])

num_to_char([14, 9, 3, 11])
def load_alignments(path:str) -> List[str]:
    with open(path, 'r') as f:
        lines = f.readlines()
    tokens = []
    for line in lines:
        line = line.split()
        if line[2] != 'sil':
            # Keep the word token and separate consecutive words with spaces.
            tokens = [*tokens, ' ', line[2]]
    return char_to_num(tf.reshape(tf.strings.unicode_split(tokens, input_encoding='UTF-8'), (-1)))[1:]

def load_data(path: str):
    path = bytes.decode(path.numpy())
    #file_name = path.split('/')[-1].split('.')[0]
    # File name splitting for windows
    file_name = path.split('\\')[-1].split('.')[0]
    video_path = os.path.join('data','s1',f'{file_name}.mpg')
    alignment_path = os.path.join('data','alignments','s1',f'{file_name}.align')
    frames = load_video(video_path)
    alignments = load_alignments(alignment_path)
    return frames, alignments


test_path = '.\\data\\s1\\bbal6n.mpg'

tf.convert_to_tensor(test_path).numpy().decode('utf-8').split('\\')[-1].split('.')

frames, alignments = load_data(tf.convert_to_tensor(test_path))


plt.imshow(frames[40])

46
alignments

tf.strings.reduce_join([bytes.decode(x) for x in num_to_char(alignments.numpy()).numpy()])

def mappable_function(path:str) -> List[str]:
    result = tf.py_function(load_data, [path], (tf.float32, tf.int64))
    return result

6.3. Create Data Pipeline

from matplotlib import pyplot as plt

data = tf.data.Dataset.list_files('./data/s1/*.mpg')
data = data.shuffle(500, reshuffle_each_iteration=False)
data = data.map(mappable_function)
data = data.padded_batch(2, padded_shapes=([75,None,None,None],[40]))
data = data.prefetch(tf.data.AUTOTUNE)
# Added for split
train = data.take(420)
test = data.skip(420)

len(test)

frames, alignments = data.as_numpy_iterator().next()

len(frames)
sample = data.as_numpy_iterator()

val = sample.next(); val[0]

imageio.mimsave('./animation.gif', val[0][0], fps=10)

plt.imshow(val[0][0][35])

tf.strings.reduce_join([num_to_char(word) for word in val[1][0]])

6.4. Design the Deep Neural Network

from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import (Conv3D, LSTM, Dense, Dropout, Bidirectional,
    MaxPool3D, Activation, Reshape, SpatialDropout3D, BatchNormalization,
    TimeDistributed, Flatten)

47
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, LearningRateScheduler

data.as_numpy_iterator().next()[0][0].shape   # (75, 46, 140, 1)

model = Sequential()
model.add(Conv3D(128, 3, input_shape=(75,46,140,1), padding='same'))
model.add(Activation('relu'))
model.add(MaxPool3D((1,2,2)))

model.add(Conv3D(256, 3, padding='same'))
model.add(Activation('relu'))
model.add(MaxPool3D((1,2,2)))

model.add(Conv3D(75, 3, padding='same'))
model.add(Activation('relu'))
model.add(MaxPool3D((1,2,2)))

model.add(TimeDistributed(Flatten()))

model.add(Bidirectional(LSTM(128, kernel_initializer='Orthogonal', return_sequences=True)))


model.add(Dropout(.5))

model.add(Bidirectional(LSTM(128, kernel_initializer='Orthogonal', return_sequences=True)))


model.add(Dropout(.5))

model.add(Dense(char_to_num.vocabulary_size()+1, kernel_initializer='he_normal',
activation='softmax'))
model.summary()

yhat = model.predict(val[0])

tf.strings.reduce_join([num_to_char(x) for x in tf.argmax(yhat[0],axis=1)])

tf.strings.reduce_join([num_to_char(tf.argmax(x)) for x in yhat[0]])


model.input_shape
model.output_shape

48
6.5. Setup Training Options and Train

def scheduler(epoch, lr):
    if epoch < 30:
        return lr
    else:
        return lr * tf.math.exp(-0.1)

def CTCLoss(y_true, y_pred):
    batch_len = tf.cast(tf.shape(y_true)[0], dtype="int64")
    input_length = tf.cast(tf.shape(y_pred)[1], dtype="int64")
    label_length = tf.cast(tf.shape(y_true)[1], dtype="int64")

    input_length = input_length * tf.ones(shape=(batch_len, 1), dtype="int64")
    label_length = label_length * tf.ones(shape=(batch_len, 1), dtype="int64")

    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
    return loss

class ProduceExample(tf.keras.callbacks.Callback):
    def __init__(self, dataset) -> None:
        self.dataset = dataset.as_numpy_iterator()

    def on_epoch_end(self, epoch, logs=None) -> None:
        data = self.dataset.next()
        yhat = self.model.predict(data[0])
        decoded = tf.keras.backend.ctc_decode(yhat, [75,75], greedy=False)[0][0].numpy()
        for x in range(len(yhat)):
            print('Original:', tf.strings.reduce_join(num_to_char(data[1][x])).numpy().decode('utf-8'))
            print('Prediction:', tf.strings.reduce_join(num_to_char(decoded[x])).numpy().decode('utf-8'))
            print('~'*100)

model.compile(optimizer=Adam(learning_rate=0.0001), loss=CTCLoss)

checkpoint_callback = ModelCheckpoint(os.path.join('models','checkpoint'), monitor='loss', save_weights_only=True)

schedule_callback = LearningRateScheduler(scheduler)
example_callback = ProduceExample(test)

model.fit(train, validation_data=test, epochs=100, callbacks=[checkpoint_callback, schedule_callback, example_callback])
6.6. Make a Prediction

url = 'https://drive.google.com/uc?id=1vWscXs4Vt0a_1IH1-ct2TCgXAZT-N3_Y'
output = 'checkpoints.zip'
gdown.download(url, output, quiet=False)
gdown.extractall('checkpoints.zip', 'models')

model.load_weights('models/checkpoint')

test_data = test.as_numpy_iterator()

sample = test_data.next()

yhat = model.predict(sample[0])

print('~'*100, 'REAL TEXT')


[tf.strings.reduce_join([num_to_char(word) for word in sentence]) for sentence in sample[1]]

decoded = tf.keras.backend.ctc_decode(yhat, input_length=[75,75], greedy=True)[0][0].numpy()

print('~'*100, 'PREDICTIONS')
[tf.strings.reduce_join([num_to_char(word) for word in sentence]) for sentence in decoded]

6.7. Test on a Video

sample = load_data(tf.convert_to_tensor('.\\data\\s1\\bras9a.mpg'))

print('~'*100, 'REAL TEXT')


[tf.strings.reduce_join([num_to_char(word) for word in sentence]) for sentence in [sample[1]]]

yhat = model.predict(tf.expand_dims(sample[0], axis=0))

decoded = tf.keras.backend.ctc_decode(yhat, input_length=[75], greedy=True)[0][0].numpy()

print('~'*100, 'PREDICTIONS')
[tf.strings.reduce_join([num_to_char(word) for word in sentence]) for sentence in decoded]

