
Sign Language Recognition System using CNN


Aditi Deshpande¹, Ansh Shriwas², Vaishnavi Deshmukh³, Shubhangi Kale⁴
¹,²,³School of Electrical Engineering, ⁴Assistant Professor, School of Computer Engineering and Technology
MIT Academy of Engineering, Alandi, Pune, Maharashtra 412105
¹avdeshpande@mitaoe.ac.in, ²aashriwas@mitaoe.ac.in, ³vsdeshmukh@mitaoe.ac.in, ⁴spkale@mitaoe.ac.in

Abstract— The most common way of communication for a speech-impaired person is sign language. Generally, people do not learn sign language to communicate with deaf and mute (D&M) people, which causes their isolation. Sign language is an ancient form of communication that comes naturally, but since most people are unaware of systematic sign language, and a signer cannot have a translator with him at all times, there is a need for a mediator system that can translate sign language. For that purpose, a real-time method is presented in this paper in which deep learning is used for ASL translation. The main aim of this project is to design a system that can identify the alphabets of American Sign Language as they are being signed. The camera captures frames of the hand being signed; each frame is passed through a filter and later through a classifier that predicts the class of the hand gesture. The proposed system is an initial step toward a sign language translator that makes communication easier. The result is an HCI system that enables people to communicate with D&M people without knowing sign language.

Keywords— American Sign Language, Human Computer Interface

I. INTRODUCTION

In today's world, hardly anyone takes special efforts to learn sign language to communicate with speech- and hearing-impaired persons, who therefore strongly need a mediator. Our communication happens mainly through visual aids such as body language, gestures and reading, but speech is what is mostly used. Communication in the form of signs uses hand movements and hand shape orientation to convey the speaker's message. There is no universal sign language, which makes it a cumbersome task for other people to understand it. Hence fingerspelling alone is not sufficient for communication. An intermediate system can solve this problem by identifying signs. Researchers have been working on sign language recognition for the last few decades, as it requires understanding not only of signs but also of facial expressions, body language and different body postures. It may even happen that the same sign appears differently when made by different signers. Nowadays, on television, a mediator is provided on news and sports channels for people who are unable to hear. This shows that there is a need for a system that helps these people to communicate as well as to understand. The main aim was to design a system which can translate American Sign Language to text. The proposed system is a small step toward overcoming these communication problems and helping such users communicate with the rest of the world more easily and confidently.

Fig. (a) American Sign Language

II. LITERATURE SURVEY

To extract signs from a video sequence, colour segmentation is performed and the Viola-Jones algorithm is used to cut off the facial region. For object detection, Faster R-CNN increases detection speed; the input is warped to the same size and then given to a convolution block. To decode signs at the sentence level, centroids of facial as well as spatial aspects are extracted as a function of a fuzzy membership class. When the input is a sequence of gestures, an LSTM model is used to identify the contiguous series of indications, which is further broken into subcomponents and fed to a neural network. Two different 3-dimensional CNNs are used for learning: one for the global body configuration, which has coarse-grained characteristics, and one for the hand region, which has fine-grained features. Various pre-processing techniques, such as the ABC (Artificial Bee Colony) optimization algorithm and FBPNN (Flexible Back-Propagation Neural Network), include skin colour segmentation and morphological filtering, with SIFT as a descriptor; both alphabets and numerals are recognized by this process.

The proposed algorithm is able to extract signs and features from a continuous video sequence by applying colour segmentation to identify the hand image, but the dynamic background needs to be minimally cluttered to get a satisfactory result, because otherwise it becomes difficult to extract the gesture. Using a Support Vector Machine, it can distinguish between static and dynamic signs and find their feature vectors as well. It uses the Viola-Jones algorithm to detect and remove the facial part of the video sequence, as the considered lexicon only includes hand signs.
Zernike moments, also known as a shape descriptor, are used for static signs. Curve features are extracted for dynamic gestures. The given system also converts speech to text, for which it uses the Sphinx module. Post-processing includes finding the centre of gravity and detection of fingertips. [1] (Anup Kumar, 2016)

The gestures are made using hand postures together with supporting features such as facial expressions and body position. The model was trained with large datasets. In this system, a convex hull was used for feature extraction, while KNN was used for classification, with an accuracy rate of 65%. [2] (Amrutha K, 2021)

Comparison with popular methods:

Classifier    | HMM        | Particle filtering | HMM     | Multiclass SVM | Multiclass SVM
Background    | Data glove | Cluttered          | Uniform | Cluttered      | Minimally cluttered
Accuracy (%)  | 80         | 86                 | 85      | 96.2           | >90

Table 1

In Faster R-CNN models, detection speed increases because of the RPN module they design. Faster R-CNN can localize gestures accurately compared to YOLO. The proposed system uses a 3D CNN network consisting of 4 convolution blocks. The input to each of the convolution blocks is sampled to the same size. ReLU is used as the activation function. [3] (He, 2019)

A webcam is used to capture the images, and the VS Code IDE and the OpenCV library are used to preprocess the input image sequence. Preprocessing involves background noise removal using a slope distance algorithm. [4] (Ashish, Ambekar, & G., 2016)

In this system, frames are extracted from a real-time video captured through a webcam. The image frames are converted from RGB to the YCbCr domain. For feature extraction, a feed-forward ANN architecture is used. There are 5 types of gestures, and a database of 25 images was created. [5] (Pankajakshan, C, & Thilagavathi, 2015)

For detecting and tracking the hand, face detection, skin colour segmentation and object stabilization techniques are used. Classification of these hand postures is done using KNN. [6] (Shenoy, Dastan, Rao, & Vyavaharkar, 2018)

A similar technique can be used in OCR (Optical Character Recognition). [18] (Chikmurge & Shriram, 2021)

The challenge of understanding signs made at the sentence level by speech-impaired people has been addressed. Through the use of fuzzy membership functions, the facial and spatial aspects of a signer's hands are extracted from the frames of a provided video of a sign. [7] (H, B, & R, 2016)

The system detects a series of interconnected movements using an LSTM model for continuous sequences of gestures, i.e. continuous sign language recognition. It is founded on decomposing continuous indications into smaller components and modelling such sub-units using neural networks. Indian Sign Language (ISL) was used to test the proposed system with 942 signed sentences; 35 distinct sign terms are used to recognize these sentences. The accuracy was 72.3% on signed sentences and 89.5% on isolated sign words. [8] (Mittal, Kumar, Roy, Balasubramanian, & Chaudhuri, 2019)

The retrieved local features in this system were combined and globalized using MLP and autoencoders, and classification was performed using the SoftMax function. [9] (Muneer-al-Hammadi, Muhammad, Abdul, & Alsulaiman, 2020)

For 10 dynamic gestures of ISL, 1080 videos were used for training. NVIDIA Tesla K80 GPUs were used to train all the models. [10] (Bhagat, Y, & N, 2019)

The VGG11 model, a 2D-CNN, was used in the spatial representation module to produce multi-cue features of the full frame, hands, face, and posture. [11] (Zhou, Wengang, Yun, & Li, 2022)

A CNN was employed in this model's creation for both classification and recognition. Pre-processing is the first phase, which removes the facial pixels while keeping the pixels of the hands. [12] (Lijiya, 2019)

This system uses a 3-SU (ASL) subunit sign modelling framework to extract large-vocabulary multimodal signals from continuous video sessions. A Parallel Bayesian HMM (BPaHMM) is used for the spatial and temporal properties of the subunits. Their concept builds the sign lexicon from two temporal components (velocity and position) and two spatial subunits (hand form). [13] (R. & Selvamani, 2019)

The detected region was then transformed to a binary frame. Once the binary image is obtained, the Euclidean distance transformation is performed. [14] (Roade & Jadav, 2017)

An Artificial Bee Colony algorithm and a flexible back-propagation neural network are used as the classifier in this system. [15] (Kaur & Krishna, 2019)

III. METHODOLOGY

Data set generation:

Generation of the data set was one of the most important as well as time-consuming tasks. Nearly 600 images of each alphabet were generated for testing and 1000 images for training. Images are captured by the webcam, the region of interest (ROI) is defined and extracted, the ROI (which is in colour) is converted into grey scale, and finally a Gaussian blur filter is applied to the image. This process was followed for all the ASL alphabets.

Fig. (b) Dataset Generation
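The capture pipeline described above can be sketched in a few lines of OpenCV. This is a minimal sketch under stated assumptions: the fixed ROI coordinates, the 5*5 blur kernel, the output folder layout and the image counts are illustrative choices, not taken from the paper.

import os
import cv2  # OpenCV, which the paper uses for preprocessing

letter = "A"                       # generate one alphabet at a time
os.makedirs(f"dataset/train/{letter}", exist_ok=True)
cap = cv2.VideoCapture(0)          # webcam
count = 0
while count < 1000:                # e.g. 1000 training images per alphabet
    ok, frame = cap.read()
    if not ok:
        break
    roi = frame[100:400, 100:400]                 # hypothetical fixed ROI
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)  # colour -> grey scale
    blur = cv2.GaussianBlur(gray, (5, 5), 0)      # Gaussian blur filter
    cv2.imwrite(f"dataset/train/{letter}/{count}.jpg", blur)
    count += 1
cap.release()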

Feature Extraction:
Frames were extracted and fed to the CNN model for feature extraction, and thus the details were extracted from the frames. The architecture used here is AlexNet. The activation function used is ReLU (rectified linear unit) in each convolution layer as well as in the fully connected layers; it adds non-linearity to the images.

Training and testing:
The training and testing of the model were done on the generated datasets. For training the model 1000 images per alphabet were used, and 600 for testing. The final layer classifies the image by calculating its likelihood, which is made possible with the help of the SoftMax activation function.
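A minimal training sketch follows, assuming a Keras/TensorFlow setup (the paper does not name its framework) and the folder layout from the data set sketch above; the small network and the 227*227 input size are placeholders rather than the paper's exact configuration.

import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/train", image_size=(227, 227), color_mode="grayscale")
test_ds = tf.keras.utils.image_dataset_from_directory(
    "dataset/test", image_size=(227, 227), color_mode="grayscale")

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    # the SoftMax output layer turns class scores into per-letter likelihoods
    tf.keras.layers.Dense(26, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=90)    # Section VIII reports accuracy rising up to ~90+ epochs
print(model.evaluate(test_ds))    # evaluation on the 600-image-per-letter test set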

Special feature of the system:
The system also suggests words similar to the signed word shown by the user. This makes it user-friendly and makes it easier to predict complex words.

IV. BLOCK DIAGRAM

Steps to be followed for hand gesture recognition are:

Fig. (c) Block Diagram

1) Data acquisition: Approaches to gather data are as follows.
Sensory devices – This is an expensive method, as it uses electromechanical devices to capture the hand configuration; various glove-based methods are also used to draw out information.
Vision approach – In this method a camera is used as the input device. The major challenges faced in this method are the different possible skin tones of the human hand, movement of the hand, differences in viewpoint, and the camera's capture speed.

2) Data pre-processing: The background needs to be subtracted, as it may contain the facial region, which is of the same colour as the hand. Here an AdaBoost face detector is used to differentiate between the face and hand regions so that the facial part can be removed. Then a Gaussian blur filter is applied to the extracted image, which is then used for training. To get better accuracy, the background of the hand should be kept a single colour, as prediction accuracy is highly dependent on lighting conditions.
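As a sketch of this pre-processing step: OpenCV's stock Haar-cascade face detector is AdaBoost-based, so it can stand in for the face detector described above; blacking out the detected face leaves only the similarly coloured hand pixels. The cascade file, detector parameters and blur kernel here are assumptions.

import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def remove_face(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # detect the face and blank it out, keeping only the hand region
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
        frame[y:y + h, x:x + w] = 0
    # apply the Gaussian blur filter before the image is used for training
    return cv2.GaussianBlur(frame, (5, 5), 0)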
3) Feature extraction:
If the image is represented as a 3D matrix, its height, width and depth are its dimensional properties. The depth value of each pixel is 1 for a grayscale image and 3 for an RGB image. These values become important while extracting features from an image using a CNN.

Fig. (d) Feature Extraction

4) Gesture classification: The following steps are followed for identification of signs (a sketch follows this list):
• By applying Gaussian blur and a threshold to the input image, we get the processed image.
• The processed image is then passed through the CNN model, and if the same sign is recognized across more than 50 images, the identified letter corresponding to the provided sign is printed.
• A blank symbol (no sign) is treated as the space between words.
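A sketch of these classification steps, assuming a trained Keras model and a label list; the Otsu threshold, the 227*227 input size and the helper names are illustrative, not from the paper.

import cv2
import numpy as np

LETTERS = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + ["blank"]

def classify(roi_gray, model):
    # Gaussian blur plus threshold gives the processed image
    blur = cv2.GaussianBlur(roi_gray, (5, 5), 0)
    _, processed = cv2.threshold(blur, 0, 255,
                                 cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    x = cv2.resize(processed, (227, 227)).reshape(1, 227, 227, 1) / 255.0
    probs = model.predict(x, verbose=0)[0]    # pass through the CNN model
    return LETTERS[int(np.argmax(probs))]

The letter returned here is only printed once the same prediction has been seen for more than 50 consecutive frames, as sketched in Section VI.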
V. ALGORITHM USED

Convolutional Neural Network

Unlike in a regular neural network, neurons here have three dimensions: height, width and depth. Each neuron is connected to a window-sized part of the preceding layer, i.e. each neuron is connected only to such parts of the input feature layer. In the fully connected region, each neuron is connected to every other neuron. The output layer produces a single vector of class scores, as the full image is reduced to its number of classes. For object classification the CNN model is considered adept: even when trained on millions of images, the overfitting problem does not reach a critical state. However, it is difficult to apply a CNN to high-resolution images, which turns out to be its drawback.

i) Convolution layer – In the convolution layer, a small part of the input matrix is taken, known as the window size (generally 5*5). The layer also contains learnable filters of the same size as the window. The dot product of the filter values and the selected region of the layer is taken, which produces an activation matrix. The window is slid by the stride value (typically one). Thus, the network learns filters that activate whenever a particular orientation of an edge or object is seen.
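The following toy implementation illustrates the sliding-window dot product described above; the sizes are illustrative (a 5*5 window, stride 1), and a real CNN learns the filter values rather than fixing them.

import numpy as np

def conv2d(image, kernel, stride=1):
    k = kernel.shape[0]                       # window size (e.g. 5*5)
    out = (image.shape[0] - k) // stride + 1
    act = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            window = image[i*stride:i*stride+k, j*stride:j*stride+k]
            act[i, j] = np.sum(window * kernel)   # dot product with the filter
    return np.maximum(act, 0)                     # ReLU non-linearity

act = conv2d(np.random.rand(28, 28), np.random.rand(5, 5))  # activation matrix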
ii) Pooling layer – The function of the pooling layer is to reduce the size of the activation matrix, which ultimately decreases the number of learnable parameters. Max pooling and average pooling are the two types of pooling layers. The convolution layer produces feature maps, and the function of the pooling layer is to extract an abstract of those features. Feature maps generated by the convolutional layer are location-specific, meaning they try to associate a particular feature with a specific portion of the input image.

This location specificity reduces performance, as the network focuses on gritty details. Focusing on higher-level features can solve this problem; using a stride with a higher value could be one solution. Another approach is pooling, which focuses on higher-level details.

Fig. (e) Pooling

• Max pooling – In max pooling, the maximum value within the window is selected. If the window size is 2*2, the maximum of the corresponding 4 values is chosen. Thus the new activation matrix is half the size of the original one.

Fig. (f) Max Pooling

• Average pooling – It takes the average of all values present in the window.

Fig. (g) Average Pooling
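A toy version of both pooling operations, as a sketch: a 2*2 window with stride equal to the window size halves each dimension of the activation matrix, as stated above.

import numpy as np

def pool(act, size=2, mode="max"):
    h, w = act.shape[0] // size, act.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = act[i*size:(i+1)*size, j*size:(j+1)*size]
            # max pooling keeps the largest of the 4 values,
            # average pooling takes their mean
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out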
iii) Fully connected layer – In the convolution layer, neurons are connected only to other neurons from the same frame, but in the fully connected layer all inputs are connected to each neuron. It applies a linear transformation to the input vector so that every input influences every element of the output vector. The output from the preceding pooling layer is flattened and then fed to the fully connected layer.

iv) Final output layer – This is the final layer of neurons, whose count equals the number of possible classes. The final output layer predicts the probability of the input image belonging to each particular class.

v) Dropout layer – This layer abolishes the effect of some neurons. It refers to setting those input unit values to 0, preventing the model from overfitting.
Architecture used:

Fig. (h) AlexNet Architecture [16]

The AlexNet architecture is considered a milestone for image classification. It has eight weighted layers (five convolutional and three fully connected), interleaved with three max-pooling layers. [17] (A. Gadre, 2021)

Layer   | Layer name                 | Number of kernels | Kernel size | Stride (pixels)
Layer 1 | First convolutional layer  | 96                | 11*11       | 4
Layer 2 | First max-pooling layer    | 96                | 3*3         | 2
Layer 3 | Second convolutional layer | 256               | 5*5         | 1
Layer 4 | Second max-pooling layer   | 256               | 3*3         | 2
Layer 5 | Third convolutional layer  | 384               | 3*3         | 1
Layer 6 | Fourth convolutional layer | 384               | 3*3         | 1
Layer 7 | Fifth convolutional layer  | 256               | 3*3         | 2
Layer 8 | Third max-pooling layer    | 256               | 3*3         | 2

The first and second fully connected layers have 4096 neurons each. The output layer has 1000 neurons.
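The layer stack in the table translates directly into, for example, Keras as follows. The kernel counts, sizes and strides are taken from the table; the 227*227 input size and the padding choices are assumptions made so the dimensions work out.

import tensorflow as tf
from tensorflow.keras import layers

alexnet = tf.keras.Sequential([
    layers.Input((227, 227, 3)),
    layers.Conv2D(96, 11, strides=4, activation="relu"),                  # Layer 1
    layers.MaxPooling2D(3, strides=2),                                    # Layer 2
    layers.Conv2D(256, 5, strides=1, padding="same", activation="relu"),  # Layer 3
    layers.MaxPooling2D(3, strides=2),                                    # Layer 4
    layers.Conv2D(384, 3, strides=1, padding="same", activation="relu"),  # Layer 5
    layers.Conv2D(384, 3, strides=1, padding="same", activation="relu"),  # Layer 6
    layers.Conv2D(256, 3, strides=2, padding="same", activation="relu"),  # Layer 7
    layers.MaxPooling2D(3, strides=2),                                    # Layer 8
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),    # first fully connected layer
    layers.Dense(4096, activation="relu"),    # second fully connected layer
    layers.Dense(1000, activation="softmax"), # 1000-neuron output layer
])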

ReLU activation function non-linearity:
The tanh function has a saturating non-linearity: it compresses the output into the range -1 to 1 and thus shows limiting behaviour at the boundaries. ReLU has a non-saturating non-linearity, f(x) = max(0, x), which makes it easier to learn complex features; it is also faster to train a neural network with ReLU than with a saturating non-linearity.
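A quick numerical illustration of the two behaviours:

import numpy as np

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0, 50.0])
print(np.tanh(x))        # saturates: large inputs are squashed toward -1 and 1
print(np.maximum(0, x))  # f(x) = max(0, x): keeps growing for positive inputs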

Fig. (i) ReLU Activation Function [16]

Reducing overfitting:
Dropout – This regularization technique refers to dropping out hidden and visible units in the network. The dropout rate is the probability of ignoring a given neuron during training: a probability of 0.5 means 50 percent of the neurons are dropped out and hence ignored.
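In, for example, Keras this is a one-line layer; the surrounding dense layers here are just placeholders, a sketch rather than the paper's exact network.

import tensorflow as tf

fc = tf.keras.Sequential([
    tf.keras.layers.Dense(4096, activation="relu"),
    # with rate 0.5, about 50 percent of the units are ignored per training step
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4096, activation="relu"),
    tf.keras.layers.Dropout(0.5),
])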

VI. FLOW OF SYSTEM

• The model captures an image of the signed gesture using the webcam.
• The images are then filtered through a Gaussian blur filter.
• The processed frame is passed through AlexNet (the CNN model) for prediction of the signed alphabet.
• If the same alphabet or number is recognized for more than 50 images/frames, it is displayed on the screen and considered for word formation.
• When nothing is signed, it is treated as a blank symbol (space).
• Sets of signs that show similar results are further classified using particular classifiers made for those sets.
• The model displays an alphabet only when the number of detections exceeds the threshold value and no other alphabet's count is near it; the displayed alphabet is then added to the string.
• For every incorrect word it then suggests possible correct alternatives, and a set of words similar to the input word is displayed on the screen.
• The user can select a word from the suggestions and add it to the sentence being formed; thus sign language can be converted to text letter by letter and word by word.

A minimal end-to-end sketch of this loop follows the list.
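The sketch below ties the flow together, reusing the assumed model and letter list from the classification sketch in Section IV; the ROI coordinates, input size and exit key are illustrative, and the word-suggestion step is omitted.

import cv2
import numpy as np

def run(model, letters, threshold=50):
    cap = cv2.VideoCapture(0)
    last, streak, word = None, 0, ""
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        roi = cv2.cvtColor(frame[100:400, 100:400], cv2.COLOR_BGR2GRAY)
        roi = cv2.GaussianBlur(cv2.resize(roi, (227, 227)), (5, 5), 0)
        probs = model.predict(roi.reshape(1, 227, 227, 1) / 255.0, verbose=0)[0]
        letter = letters[int(np.argmax(probs))]
        streak = streak + 1 if letter == last else 1
        last = letter
        if streak == threshold + 1:          # stable for more than 50 frames
            word += " " if letter == "blank" else letter  # blank acts as space
            print(word)                      # letter added to the string
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    cap.release()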

VII. MODEL COMPARISON

CNN models are found to be more difficult to work with, because finding the right values, filters, and numbers of units (neurons) for the final tune-up to improve performance is challenging. We found LeNet-5 and AlexNet parameters that produced excellent outcomes. After evaluation it was found that AlexNet performed better than LeNet-5, with an accuracy of 98.28% compared to LeNet-5's 96.3%. LeNet-5 achieved only 56.59% true positive results compared to 71.78% for AlexNet (ROC AUC metric), which suggests that LeNet-5's accuracy is subpar. LeNet-5 also has a greater mean squared error (5.15%) than AlexNet (3.81%).

Fig. (j) Accuracy Comparison

Fig. (k) Loss Comparison

VIII. OBSERVATIONS

For the proposed model, when the epoch count lies in the range 0-23 the average accuracy achieved is 20%; between epochs 23-45 accuracy reaches 50%; between 45-67 it is 70% of the maximum; and above 90 epochs we get 95% accuracy, which increases roughly linearly as the number of epochs grows. The final accuracy obtained is 98.6%.

Fig. (l) Results

Fig. (m) Plot of Model Accuracy

IX. CONCLUSION

The creation of a practical real-time vision-based American Sign Language recognition system for D&M users, employing the ASL alphabet, is described in this article. A final accuracy of 98.0 percent is achieved on the dataset. The accuracy of the prediction is increased by creating two layers of algorithms for symbols that are more similar to one another. If the signs are displayed appropriately, there is no background noise, and the illumination is suitable, all symbols are recognized correctly.

X. REFERENCES

[1] Anup Kumar; Karun Thankachan; Mevin M. "Sign Language Recognition," 3rd Int'l Conf. on Recent Advances in Information Technology (RAIT), 2016.
[2] Amrutha K; Prabhu P. "ML Based Sign Language Recognition System," 2021 International Conference on Innovative Trends in Information Technology (ICITIIT), ©2021 IEEE.
[3] Siming He. "Research of a Sign Language Translation System Based on Deep Learning," International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM), 2019.
[4] Nikam, Ashish; S. Ambekar; Aarti G. "Sign Language Recognition Using Image Based Hand Gesture Recognition Techniques," Online International Conference on Green Engineering and Technologies (IC-GET), 2016.
[5] Pankajakshan; Priyanka C.; Thilagavathi. "Sign Language Recognition System," IEEE Sponsored 2nd International Conference on Innovations in Information Embedded and Communication Systems (ICIIECS'15), 2015.
[6] Kartik Shenoy; Tejas Dastan; Varun Rao; Devendra Vyavaharkar. "Real-time Indian Sign Language (ISL) Recognition," doi: 10.1109/ICCCNT.2018.8493808, IEEE, July 2018.
[7] Nagendraswamy H; Chethana Kumara B; Lekha Chinmayi R. "Indian Sign Language Recognition: An Approach Based on Fuzzy-Symbolic Data," 2016 Intl. Conference on Advances in Computing, Communications and Informatics (ICACCI), Sept. 21-24, 2016.
[8] Anshul Mittal; Pradeep Kumar; Partha Pratim Roy; Raman Balasubramanian; Bidyut B. Chaudhuri. "A Modified-LSTM Model for Continuous Sign Language Recognition using Leap Motion," doi: 10.1109/JSEN.2019.2909837, IEEE Sensors Journal, 2019.
[9] Muneer Al-Hammadi; Ghulam Muhammad; Wadood Abdul; Mansour Alsulaiman; Mohammed A. Bencherif; Tareq S. Alrayes; Hassan Mathkour; Mohamed Amine Mekhtiche. "Deep Learning-Based Approach for Sign Language Gesture Recognition With Efficient Hand Gesture Representation," IEEE, October 19, 2020.
[10] Neel Kamal Bhagat; Vishnu Sai Y; Rathna G N. "Indian Sign Language Gesture Recognition using Image Processing and Deep Learning," 2019 Digital Image Computing: Techniques and Applications (DICTA), Perth, Australia, Dec. 2-4, 2019.
[11] Hao Zhou; Wengang Zhou; Yun Zhou; Houqiang Li. "Spatial-Temporal Multi-Cue Network for Sign Language Recognition and Translation," doi: 10.1109/TMM.2021.3059098, IEEE, 2022.
[12] Sruthi C. J; Lijiya A. "Signet: A Deep Learning based Indian Sign Language Recognition System," International Conference on Communication and Signal Processing, April 4-6, 2019, India.
[13] Elakkiya R.; Selvamani K. (2019). "Subunit sign modelling framework for continuous sign language recognition," Computers and Electrical Engineering, 74, 379-390, doi: 10.1016/j.compeleceng.2019.02.012.
[14] Yogeshwar I. Roade; Prashant M. Jadav. "Indian Sign Language Recognition System," doi: 10.21817/ijet/2017/v9i3/170903S030, Vol. 9, No. 3S, July 2017.
[15] Jasmine Kaur; C. Rama Krishna. "An Efficient Indian Sign Language Recognition System using SIFT Descriptor," International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume-8, Issue-6, August 2019.
[16] Siddhesh Bangar. "AlexNet Architecture Explained" (n.d.). Retrieved from https://medium.com/@siddheshb008/alexnet-architecture-explained-b6240c528bd5.
[17] A. Gadre, P. Pund, G. Ajmire and S. Kale. "Signature Recognition Models: Performance Comparison," 2021 International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), 2021, pp. 1-6, doi: 10.1109/ICAECA52838.2021.9675598.
[18] Chikmurge, Diptee; Shriram, R. (2021). "Marathi Handwritten Character Recognition Using SVM and KNN Classifier," doi: 10.1007/978-3-030-49336-3_32.
