Sign Language Recognition
Digital Image Processing
Submitted by:
Mudassir Alam 20224503
Manish Goutam 20224093
B.Tech. (VIth Sem)
Department of Electronics & Communication Engineering
Motilal Nehru National Institute of Technology, Prayagraj, U.P.

Table of Contents
1. Abstract
2. Introduction
3. Literature
4. Problem Definition
5. Methodology
6. Code Structure
7. Result
8. Conclusion

1. Abstract
This project presents the design, development, and evaluation of a real-time sign
language recognition system capable of detecting five commonly used American Sign
Language (ASL) gestures: Thank You, Yes, No, Please, and I Love You. The system is
engineered to operate using only a standard laptop camera, eliminating the need for any
specialized or expensive hardware and thereby enhancing its accessibility and ease of
deployment. Developed as part of a Digital Image Processing course, this project
leverages widely used open-source computer vision tools, specifically OpenCV, in
conjunction with a pre-trained deep learning model to accurately classify static hand
gestures from live video input.
The primary motivation behind this work stems from the significant communication
barriers faced by individuals who are hearing or speech-impaired. In many everyday
situations, the inability to communicate effectively can lead to frustration, social
isolation, and reduced opportunities for participation in educational, professional, or
social contexts. By providing a technological solution that can interpret specific sign
language gestures in real time and translate them into text, this system aims to foster
more inclusive and accessible communication channels between sign language users
and the broader community.
The methodology adopted in this project encompasses several key stages, including data
collection, image preprocessing, fine-tuning of a pre-trained model, and system integration. Hand
gesture data was collected using a webcam, and frames were processed to enhance
image quality and isolate relevant features. The deep learning model, trained on a
custom dataset of the five selected signs, utilizes convolutional neural networks (CNNs)
to extract spatial features from the hand region and accurately classify the gesture. The
integration of OpenCV facilitates efficient real-time video capture and processing,
ensuring that the system responds promptly to user input.
In terms of system design, particular attention was given to user-friendliness and
robustness. The interface provides immediate visual feedback by displaying the
recognized sign as text on the screen, allowing users to verify system output in real
time. The modular architecture of the codebase also allows for future expansion to
include additional signs or adapt to different sign languages.
This report provides a comprehensive overview of the project, detailing the underlying
motivation, technical methodology, system architecture, implementation challenges, and
evaluation results. Through rigorous testing, the system demonstrated high accuracy in
recognizing the selected signs under various lighting and background conditions.
Ultimately, this work highlights the potential of combining digital image processing and
deep learning to create practical assistive technologies that can bridge communication
gaps and promote greater inclusivity for the hearing and speech-impaired community.

2. Introduction
2.1 Background of the Project
Communication is a core component of daily human life, allowing individuals to
interact, share ideas, express emotions, and convey information. For individuals with
hearing or speech impairments, traditional modes of communication such as spoken
language often pose significant challenges. To bridge this gap, sign language has been
developed as a rich and expressive alternative communication medium. Using hand
gestures, facial expressions, and body language, sign language enables users to
communicate complex ideas without speaking.
Despite its effectiveness, a key barrier remains — most of the general population is not
trained in sign language. As a result, sign language users often experience difficulties in
everyday communication, whether in educational settings, professional environments,
or social interactions. This communication gap has motivated researchers and
developers to seek technological solutions that can serve as real-time interpreters,
automatically recognizing and translating sign language into spoken or written
language.
The growth of digital image processing, combined with advances in machine learning
and computer vision, has made real-time sign language recognition more achievable
than ever. By using standard webcams, open-source libraries like OpenCV, and deep
learning frameworks, it is possible to create lightweight, cost-effective systems capable
of identifying specific hand gestures and displaying their meanings in real time.
Sign language is a vital communication medium for individuals with speech and hearing
impairments. However, the language barrier between sign language users and
non-users often leads to communication difficulties. The integration of computer vision
and digital image processing has opened avenues for real-time sign recognition
systems. This project aims to contribute a simple yet functional model that recognizes a
subset of American Sign Language (ASL) gestures and displays the corresponding words
on-screen.
The focus was on ease of use, minimal hardware requirements (a laptop camera), and
implementation of five signs that are among the most frequently used in polite and
emotional conversation.

2.2 Objective of the Project
This project was undertaken with the primary goal of developing a basic yet functional
real-time sign language recognition system using digital image processing techniques.
Specifically, the system is designed to detect and recognize five frequently used
American Sign Language (ASL) signs:
• Thank You
• Yes
• No
• Please
• I Love You
These particular signs were chosen due to their importance in polite conversation and
emotional expression, making them highly relevant for real-world communication
scenarios. The objective was to use a standard laptop camera for video input and train
a deep learning model capable of accurately classifying each gesture in live video
frames. By limiting the scope to five signs, the focus remained on building a reliable
and accessible proof-of-concept system that could be later scaled or improved.
The broader goal of the project is to contribute to the field of assistive technology and
digital accessibility, offering a solution that could potentially enhance communication
between deaf individuals and those unfamiliar with sign language.
2.3 Scope of the Report
This lab report documents the entire development process of the Sign Language
Recognition system. It covers the technical and conceptual aspects of the project in a
structured manner. The report is divided into several key sections, each focusing on a
critical aspect of the work:
• Literature Review: An overview of existing sign language recognition systems and relevant research.
• Problem Definition: A clear outline of the challenges and goals addressed by the project.
• Methodology: Detailed explanation of the tools, datasets, and algorithms used.
• Design and Implementation: Description of how the system was built, from data collection to real-time prediction.
• Results: Performance evaluation of the system based on accuracy and usability.
• Conclusion: Summary of findings and suggestions for future improvements.
By the end of this report, the reader should have a comprehensive understanding of
how digital image processing was applied to recognize sign language gestures in real
time, the challenges encountered during development, and the significance of the
results obtained.

3. Literature
3.1 Background and Research in Sign Language Recognition
Before starting my project, I looked into how sign language recognition has been
handled in the past and what approaches are commonly used. I found that early
systems mostly used gloves with sensors or hardware-based tools to track hand
motion. Although those systems were accurate, they required users to wear special
equipment, which made them inconvenient for everyday use.
With the growth of computer vision and deep learning, I noticed that more recent
research focused on vision-based systems using just a camera. These systems used
machine learning models to recognize hand gestures from video or images without
needing any extra devices. Since my goal was to keep things simple and easy to use,
especially with a standard laptop camera, I decided to follow the same vision-based
approach.
I also came across various models like CNNs, ResNet, and MobileNet being used for
gesture recognition. But for real-time detection and speed, I felt that I needed
something that could both detect and classify in one shot. That’s when I came across
YOLO — a powerful family of object detection models.
3.2 Why I Chose YOLOvS for Gesture Recognition
‘Among all the object detection models | researched, YOLOvS stood out because of its
speed, simplicity, and real-time performance. | used YOLOvS in my project mainly
because | needed a model that could work with live webcam input and give fast and
accurate predictions. YOLO (You Only Look Once) models are known for doing
detection and classification in one go, making them ideal for tasks like this.
YOLOv5, in particular, was easy to set up and work with because it is written in Python
using PyTorch, which I was already somewhat familiar with through tutorials. Another
reason I chose YOLOv5 was because of the available pre-trained weights and the ability
to fine-tune the model on my own dataset of hand gestures. Since I only had a few
hours to work on this project, training a model from scratch wasn’t possible —
YOLOv5's transfer learning capabilities helped me a lot.
In my setup, I used YOLOv5 to detect hand gestures directly from the video feed. I trained the model to recognize five specific signs: "Thank you," "Yes," "No," "Please,"
and "I love you." YOLOv5 handled the task well and was able to detect the gestures in
real time while I performed them in front of the camera.
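As a rough illustration of this workflow, the indented sketch below loads a fine-tuned YOLOv5 model through torch.hub and runs it on a single image. It is a minimal example rather than the project's exact code; the weights file best.pt and the sample image name are assumed placeholders.

    import torch

    # Load fine-tuned YOLOv5 weights via torch.hub (fetches the ultralytics/yolov5 repo on first use)
    model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')

    # Run detection on one image; YOLOv5 accepts file paths, PIL images, or NumPy arrays
    results = model('thank_you_sample.jpg')

    results.print()                    # summary of detected classes and confidences
    print(results.pandas().xyxy[0])    # bounding boxes with class names as a DataFrame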
3.3 Role of Python and OpenCV in My Project
Python was the main programming language I used for the entire project. One of the
biggest reasons I went with Python is because of the huge number of libraries and
community support available for computer vision and deep learning. It made the
development process smoother and faster.
I used OpenCV for capturing the video from the laptop camera and for displaying the
results. OpenCV also helped with image preprocessing tasks like resizing the frames,
drawing bounding boxes, and adding the label of the detected gesture on the screen. It
was very straightforward to integrate OpenCV with the YOLOv5 model.
Other Python libraries I used include:
• NumPy, for handling arrays and image data.
• PyTorch, which was used to load and run the YOLOv5 model.
• Matplotlib, which helped me visualize training performance and test results.
Thanks to Python's simplicity, I was able to build the project quickly, even though I
didn't have much time or deep experience with machine learning frameworks.
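To make the OpenCV steps mentioned in this section concrete, the indented sketch below grabs one frame from the default camera, resizes it, and overlays a bounding box with a label. It is only an illustration under assumed values; the box coordinates and the label text are placeholders, not output from the trained model.

    import cv2

    cap = cv2.VideoCapture(0)      # open the default laptop camera
    ret, frame = cap.read()        # grab a single frame
    cap.release()

    if ret:
        frame = cv2.resize(frame, (640, 640))        # resize toward the model's input size
        x1, y1, x2, y2 = 100, 100, 300, 300          # placeholder bounding box
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, 'Thank You', (x1, y1 - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
        cv2.imshow('Annotated frame', frame)
        cv2.waitKey(0)
        cv2.destroyAllWindows()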
3.4 Using Pre-trained Models to Save Time and Effort
Given the time constraint for this project, I knew I wouldn't be able to collect a huge
dataset or train a model from scratch. That's why I decided to use pretrained weights
provided with YOLOv5 and fine-tune them using a small dataset of hand gestures that I
either collected myself or sourced from open datasets online.
The idea behind using a pre-trained model is that it already knows how to detect
general shapes and features (like edges, curves, and patterns) from large-scale datasets like COCO. I only had to retrain it slightly on my custom classes (the five hand gestures),
which took much less time and gave decent accuracy.
Fine-tuning a pre-trained YOLOv5 model allowed me to:
• Get results fast, even with a small number of images.
• Avoid the need for powerful GPUs or long training times.
• Focus more on the application side: integrating the model with a webcam and making the predictions user-friendly.
Overall, using a pretrained model made the entire project more manageable within the
short deadline I had. It also showed me how powerful transfer learning can be in
real-world tasks like gesture recognition.
4. Problem Definition
4.1 Communication Barriers for the Hearing and Speech Impaired
One of the core problems I wanted to address with this project is the communication
gap between sign language users and non-signers. People who are hearing or speech
impaired often depend on sign language to communicate, but unfortunately, a large
part of the population does not understand it. This creates a barrier in daily life
situations — whether it’s talking to a shopkeeper, asking for help in public, or
participating in a classroom or workplace.
While professional human interpreters can help, they are not always available or
affordable. I realized that a system able to recognize sign language
using just a webcam would help make communication more inclusive and
accessible. I wanted to explore how far I could go in building a basic version of that
system using just a laptop and open-source tools.
4.2 Need for Real-Time, Easy-to-Use Recognition Systems
Another problem I faced early in the planning stage was deciding how to make the
system real-time and easy to use. Many gesture recognition systems exist in academic
papers or research labs, but they often require expensive hardware like depth cameras,
special gloves, or high-end GPUs. I didn't have access to those things; I only had my
laptop and its built-in camera.
So the challenge for me became: How do I create a gesture recognition system that
works in real time, on a basic laptop, using just a webcam? I needed something that:
• Can detect and recognize hand gestures instantly.
• Doesn't require training a model from scratch.
• Can be built and tested within a few hours or a day.
This narrowed my search to real-time object detection models, and that's when I
decided to use YOLOv5, which was light, fast, and could work even with a smaller
dataset. My goal was to strike a balance between simplicity and functionality — just
enough to show that the system works and can recognize at least a few common signs.
4.3 Defining the Scope of the Problem for This Project
Since I was short on time and resources, I decided to limit the scope of the project to
only five gestures from American Sign Language:
• Thank You
• Yes
• No
• Please
• I Love You
These signs were chosen because they are among the most frequently used in daily
communication and are also relatively easy to distinguish in terms of hand shape and
position. By focusing on just five signs, I was able to:
• Collect and label enough data quickly.
• Train the model without requiring a large dataset.
• Keep the recognition task more manageable.
The problem, therefore, became a multi-class image classification task — where each
image (or frame from a video) had to be classified into one of five categories. I also
needed to make sure that the system works fast enough to give the user immediate
feedback, which is critical for real-world usability.
5. Methodology
5.1 Overview of the Approach
To develop a real-time sign language recognition system, my primary goal was to
ensure that the approach was simple, practical, and achievable within the available
time and hardware constraints. I opted for a computer vision-based method that relies
solely on a webcam feed. The project pipeline was designed around the following key
stages:
1. Data Collection and Annotation
2. Model Selection and Pretrained Weights
3. Training using YOLOv5
4. Integration with Live Webcam Feed
5. Real-Time Detection and Display
Each stage played a crucial role in achieving the final outcome. Below, I explain the
methodology in detail, breaking down the technical process and the reasons behind
each decision.
5.2 Data Collection and Preprocessing
Since I only planned to recognize five specific gestures, I decided to either create a
small custom dataset myself or source freely available hand gesture images from the
web. The goal was to gather images representing the five gestures:
• Thank you
• Yes
• No
• Please
• I love you
I captured these images using my laptop webcam under different lighting conditions
and angles. For each gesture, I tried to capture around 100-150 images. Although this is
a small dataset by deep learning standards, it was sufficient for a proof-of-concept
when paired with transfer learning.
I used LabelImg, an open-source tool, to annotate the images. Each hand gesture was
labeled with the appropriate class name, and the annotations were saved in YOLO
format. This tool allowed me to draw bounding boxes around the hand performing the
gesture and assign the corresponding class label.
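For reference, LabelImg's YOLO export writes one text file per image, with one line per object in the form "class_id x_center y_center width height", where all coordinates are normalized to the 0-1 range. The indented snippet below parses one such line; the numbers and the class mapping are invented for illustration only.

    # Hypothetical YOLO-format label line (values invented for illustration)
    label_line = "3 0.512 0.430 0.210 0.305"

    class_id, x_c, y_c, w, h = label_line.split()
    print(f"class {class_id}: centre=({float(x_c):.2f}, {float(y_c):.2f}), "
          f"box size=({float(w):.2f} x {float(h):.2f}) in normalized units")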
Preprocessing Steps:
• Resized images to 640x640 pixels (the input size used for YOLOv5)
• Ensured class balancing to prevent overfitting
• Split the dataset into training (80%) and validation (20%) sets, as sketched below
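A minimal sketch of that 80/20 split, assuming a flat dataset/ folder of images with matching YOLO label files and the images/labels directory layout that YOLOv5 expects. The directory names here are placeholders, not the project's actual paths.

    import random
    import shutil
    from pathlib import Path

    random.seed(42)
    images = sorted(Path('dataset/images').glob('*.jpg'))
    random.shuffle(images)

    split = int(0.8 * len(images))                       # 80% train, 20% validation
    subsets = {'train': images[:split], 'val': images[split:]}

    for subset, files in subsets.items():
        for img in files:
            label = Path('dataset/labels') / (img.stem + '.txt')   # matching YOLO annotation
            for kind, src in (('images', img), ('labels', label)):
                dest = Path('data') / kind / subset
                dest.mkdir(parents=True, exist_ok=True)
                shutil.copy(src, dest / src.name)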
5.3 Model Selection and YOLOv5 Overview
After evaluating different object detection models, I selected YOLOv5 for its lightweight
architecture, speed, and ease of deployment. YOLO (You Only Look Once) is a real-time
object detection system that can predict bounding boxes and class labels in a single
forward pass.
YOLOv5 is implemented in Python using the PyTorch framework, which made it ideal
for quick setup and integration. I used the YOLOv5s (small) variant to ensure faster
inference on my laptop.
Key Features of YOLOv5:
• High inference speed
• Easy to train on custom data
• Pretrained on the COCO dataset
• Export support for ONNX, TorchScript, and CoreML
By using the pretrained weights from YOLOv5, I was able to take advantage of the
model's ability to recognize low-level image features, then fine-tune it on my small
dataset of hand gestures.
5.4 Transfer Learning and Training Process
Transfer learning was the backbone of this project. Since I didn't have the
computational power or time to train a large-scale model from scratch, I fine-tuned
YOLOv5's pretrained model on my gesture dataset.
Setup:
• Model: YOLOv5s (small variant)
• Framework: PyTorch
• Epochs: 50
• Batch size: 16
• Image size: 640x640
• Optimizer: SGD
• Loss functions: Classification loss, Objectness loss, Localization loss
I used the command-line training interface provided by YOLOv5; a sketch of a typical invocation appears after this list. The model was trained
using a single GPU on Google Colab, which helped reduce local resource consumption.
During training, I monitored:
• mAP (mean Average Precision): to measure detection accuracy
• Loss curves: to track overfitting or underfitting
• Validation images: to visually check prediction results
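The indented sketch below shows how such a training run can be driven from Python, assuming the ultralytics/yolov5 repository is cloned into a yolov5/ folder next to the data and that the dataset YAML points at the split created earlier. All paths, the YAML file name, and the class ordering are assumptions, not the project's own files.

    import subprocess
    from pathlib import Path

    # Dataset description read by YOLOv5: train/val folders, number of classes, class names
    data_yaml = (
        "train: ../data/images/train\n"
        "val: ../data/images/val\n"
        "nc: 5\n"
        "names: ['Thank You', 'Yes', 'No', 'Please', 'I Love You']\n"
    )
    Path('yolov5/data/signs.yaml').write_text(data_yaml)

    # Equivalent to: python train.py --img 640 --batch 16 --epochs 50 --data data/signs.yaml --weights yolov5s.pt
    subprocess.run(
        ['python', 'train.py', '--img', '640', '--batch', '16', '--epochs', '50',
         '--data', 'data/signs.yaml', '--weights', 'yolov5s.pt'],
        cwd='yolov5', check=True)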
5.5 Integration with Webcam and Real-Time Inference
After training the model, the next step was to integrate it with a live webcam feed so
that it could detect and classify gestures in real time. I used OpenCV for this purpose.
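A condensed sketch of such a real-time loop, combining the OpenCV capture code with the fine-tuned model, is shown below; the weights path best.pt is an assumption.

    import cv2
    import torch

    model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
    cap = cv2.VideoCapture(0)

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        # YOLOv5's hub wrapper expects RGB arrays; OpenCV delivers BGR frames
        results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        # render() returns the frame with boxes and class labels drawn in (still RGB)
        annotated = cv2.cvtColor(results.render()[0], cv2.COLOR_RGB2BGR)
        cv2.imshow('Sign Language Recognition', annotated)

        if cv2.waitKey(1) & 0xFF == ord('q'):   # press q to quit
            break

    cap.release()
    cv2.destroyAllWindows()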