Sign Language Recognition System Using CNN
Abstract— The most common way of communication for a speech-impaired person is Sign Language. Generally, people do not learn Sign Language for communication with deaf and mute people, which causes their isolation. Sign language is an ancient language for communication that comes naturally, but since people are unaware of systematic sign language, and a signer cannot have a translator with him at all times, there is a need for a mediator system that can translate sign language. For that purpose, a real-time method in which deep learning is used for ASL translation is presented in this paper. The main aim of this project is to design a system that can identify the alphabets of American Sign Language as they are being signed. The camera captures frames of the hand being signed; each frame is passed through a filter and later through a classifier that predicts the class of the hand gesture. The proposed system is an initial step towards creating a sign language translator that makes communication easier. The result is an HCI system that enables people to communicate with D&M people without knowing sign language.
Keywords— American Sign Language, Human Computer Interface

I. INTRODUCTION

In today's world, hardly any normal person takes special efforts to learn sign language to communicate with speech- and hearing-impaired persons, who therefore strongly need a mediator. Our communication happens mainly through visual aids such as body language, gestures and reading, but speech is what is used most. Communication in the form of signs uses hand movements, hand shape and orientation to convey the speaker's message. There is no universal sign language, which makes it a cumbersome task for other people to understand it. Hence fingerspelling alone is not sufficient for communication. An intermediate system can solve this problem by identifying signs. Researchers have been working on sign language recognition for the last few decades, as it requires not only an understanding of signs but also an understanding of facial expressions, body language and different body postures. It may even happen that the same sign appears different when made by different signers. Nowadays, a mediator is provided on television news and sports channels for people who are unable to hear. This shows that there is a need for a system to help these people communicate as well as understand. The main aim was to design a system which can translate American Sign Language to text. The proposed system is a small step towards overcoming communication problems and helping them towards better and more confident communication with the rest of the world.

Fig. (a) American Sign Language

II. LITERATURE SURVEY

In order to extract signs from a video sequence, colour segmentation is done and the Viola-Jones algorithm is used to cut off the facial region. When it comes to object detection, Faster R-CNN increases detection speed: the input is warped to the same size and then given to a convolution block. To decode signs at sentence level, centroids of facial as well as spatial aspects are extracted as a function of a fuzzy membership class. When the input is a sequence of gestures, an LSTM model is used to identify the contiguous series of indications, which is further broken into subcomponents and fed to a neural network. For the global body configuration, with its coarse-grained characteristics, and the hand region, with its fine-grained features, two different 3-dimensional CNNs are used for learning. Various pre-processing techniques such as the ABC (Artificial Bee Colony) optimization algorithm and FBPNN (Flexible Back Propagation Neural Network) include skin colour segmentation and morphological filtering, with SIFT as a descriptor; both alphabets and numerals are recognized by this process.

The proposed algorithm is able to extract the signs and features from a continuous video sequence by applying colour segmentation to identify the hand image, but the dynamic background needs to be minimally cluttered to get a satisfactory result, because otherwise it becomes difficult to extract the gesture. Using a Support Vector Machine, it can distinguish between static and dynamic signs and find their feature vectors as well. It uses the Viola-Jones algorithm to detect and remove the facial part of the video sequence, as the considered lexicon only includes hand signs. Zernike
moments, which are also known as a shape descriptor, are used for static signs. Curve features are extracted for dynamic gestures. The given system also converts speech to text, for which it uses the Sphinx module. Post-processing includes finding the centre of gravity and detecting the fingertips. [1]

The gestures are made using hand postures together with supporting features such as facial expressions and body position. The model was trained with large datasets. In this system, the convex hull was used for feature extraction, while KNN was used for classification, with an accuracy rate of 65%. [2]

Comparison with popular methods:

Classifier:     HMM          Particle filtering   HMM       Multiclass SVM   Multiclass SVM
Background:     Data Glove   Cluttered            Uniform   Cluttered        Minimally cluttered
Accuracy (%):   80           86                   85        96.2             >90

Table 1

In Faster R-CNN models, detection speed increases because of the RPN module. Based on the location of the gestures, Faster R-CNN can detect accurately compared to YOLO. The proposed system uses a 3D CNN network consisting of 4 convolution blocks. The input to each of the convolution blocks is sampled to the same size. ReLU is used as the activation function. [3]

A webcam is used to capture the images, and the VS Code IDE and the OpenCV library are used to preprocess the input sequence of images. Preprocessing involves background noise removal using a slope distance algorithm. [4]

In this system, frames are extracted from a real-time video captured through a webcam. The image frames are converted from RGB to the YCbCr domain. For feature extraction a feed-forward ANN architecture is used. There are 5 types of gestures, and a database of 25 images was created. [5]

For detecting and tracking the hand, face detection, skin colour segmentation and object stabilization techniques are used. Classification of these hand postures is done using KNN. [6]

A similar technique can be used in OCR (Optical Character Recognition). [18]

The challenge of understanding signs made at sentence level by speech-impaired people has been addressed. Through the use of fuzzy membership functions, the face and spatial aspects of a signer's hands are extracted from the frames of a provided video of a sign. [7]

The system detects a series of interconnected movements using an LSTM model for continuous sequences of gestures, i.e. continuous Sign Language Recognition. It is founded on decomposing continuous indications into smaller components and modelling such sub-units using neural networks. Indian Sign Language has been used to test the proposed system with 942 signed sentences (ISL); 35 distinct sign terms are used to recognize these sign sentences. The accuracy was found to be 72.3% on signed sentences and 89.5% on isolated sign words. [8]

The retrieved local features in this system were combined and globalized using MLP and autoencoders, and the classification was performed using the SoftMax algorithm. [9]

10 dynamic gestures of ISL were covered, for which 1080 videos were used for training. NVIDIA Tesla K80 GPUs were used to train all the models. [10]

The VGG11 model, a 2D-CNN model, was used. It is utilized in the spatial representation module to produce multi-cue features of the full frame, hands, face, and posture. [11]

A CNN was employed in this model's creation for both classification and recognition. Pre-processing is the first phase, which gets rid of the facial pixels while keeping the pixels for the hands. [12]

This system uses a 3-SU (ASL) subunit sign modelling framework to extract large-vocabulary multimodal signals from continuous video sessions. A Parallel Bayesian HMM (BPaHMM) is used for the spatial and temporal properties of the subunits. Their concept created the sign lexicon from two temporal components (velocity and position) and two spatial subunits (hand form). [13]

The detected region was then transformed to a binary frame. Once the binary image is obtained, the Euclidean distance transformation is performed. [14]

An Artificial Bee Colony algorithm and a Flexible Back Propagation Neural Network are used as the classifier in this system. [15]

III. METHODOLOGY

Data set generation:

Generation of the data set was one of the most important as well as time-consuming tasks. Nearly 600 images of each alphabet were generated for testing and 1000 images for training. Images are captured by the webcam, a region of interest is defined, and the ROI (which is RGB) is extracted and converted into grey scale; a Gaussian blur filter is then applied to the image. This process was followed for all the ASL alphabets.

Fig. (b) Dataset Generation
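A minimal Python/OpenCV sketch of the capture-and-preprocess pipeline described above. The webcam index, the fixed ROI box coordinates, the blur kernel size, and the save path are illustrative assumptions; the paper does not give them.

import cv2

cap = cv2.VideoCapture(0)                              # open the default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Hypothetical fixed region of interest in the top-right of the frame.
    x0, y0, x1, y1 = 300, 50, 600, 350
    roi = frame[y0:y1, x0:x1]                          # extract the RGB ROI
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)       # convert to grey scale
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)        # apply Gaussian blur filter
    cv2.rectangle(frame, (x0, y0), (x1, y1), (0, 255, 0), 2)
    cv2.imshow("frame", frame)
    cv2.imshow("processed", blurred)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("s"):                                # save one sample for an alphabet
        cv2.imwrite("dataset/A/sample.png", blurred)   # hypothetical dataset layout
    if key == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()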
Feature Extraction:

Frames were extracted from the images and fed to the CNN model for feature extraction; in this way the details were extracted from the frames. The architecture used here is AlexNet. The activation function used in each of the convolution layers, as well as in the fully connected layers, is ReLU (Rectified Linear Unit); it adds non-linearity.

Fig. (d) Feature Extraction
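To make one ReLU convolution stage concrete, here is a small TensorFlow/Keras sketch. The framework and the 227x227 single-channel input size are assumptions (the paper does not state its exact input dimensions); the random array stands in for a real preprocessed frame.

import numpy as np
import tensorflow as tf

# Stand-in for a blurred grayscale ROI, shaped (batch, height, width, depth).
frame = np.random.rand(1, 227, 227, 1).astype("float32")

# First AlexNet-style convolution: 96 kernels of 11x11, stride 4, with ReLU.
conv = tf.keras.layers.Conv2D(96, kernel_size=11, strides=4, activation="relu")
features = conv(frame)        # ReLU zeroes negative responses, adding non-linearity
print(features.shape)         # (1, 55, 55, 96) feature maps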
Training and testing:

The training and testing of the model was done on the generated datasets. For training the model 1000 images were used, and 600 for testing. The final layer classifies the image by calculating its likelihood; this is done with the help of the SoftMax activation function.

Special feature of the system:

The system also suggests words similar to the signed word shown by the user. This makes it user-friendly and makes complex words easier to predict.

IV. BLOCK DIAGRAM

Steps to be followed for hand gesture recognition are:

If the image is represented in the form of a 3D matrix, its height, width and depth are its dimensional properties. The depth value of each pixel is 1 in the case of a grayscale image and 3 in the case of an RGB image. These values become important while extracting features of the image using a CNN.

4) Gesture Classification: The following steps are followed for identification of signs (a sketch of this step follows the list):
• By applying Gaussian blur and threshold on the input image we get the processed image.
• The processed image is then passed through the CNN model; if the image resembles more than 50 images, the letter corresponding to the provided sign is printed.
• A blank symbol (no symbol) is used for the space between words.
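A hedged Python sketch of this classification step. The model file name, label set, and vocabulary are illustrative assumptions, and the paper's "resembles more than 50 images" rule is approximated here by a SoftMax-confidence threshold; the word suggestion uses difflib as an illustrative stand-in.

import difflib
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("asl_cnn.h5")        # hypothetical trained model
labels = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + ["blank"]
VOCAB = ["HELLO", "HELP", "HOME"]                        # hypothetical word list

def classify(processed, sentence):
    """processed: blurred/thresholded ROI, shape (227, 227, 1), scaled to [0, 1]."""
    probs = model.predict(processed[np.newaxis, ...])[0]  # SoftMax likelihoods
    best = int(np.argmax(probs))
    if probs[best] < 0.5:       # assumed stand-in for the >50-matches rule
        return sentence
    letter = labels[best]
    sentence += " " if letter == "blank" else letter      # blank symbol = word space
    return sentence

def suggest(word):
    """The word-suggestion feature: vocabulary words close to the signed word."""
    return difflib.get_close_matches(word.upper(), VOCAB, n=3)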
V. ALGORITHM USED
…further reduces the performance, as it focuses on gritty details. Focusing on higher-level features can solve this problem; one solution is to use a stride with a higher value. Another approach is pooling, which focuses on higher-level details.

• Max Pooling – In max pooling the maximum value within the window is selected. If the window is of size 2*2, the maximum is chosen from the corresponding 4 values. The new activation matrix we get is thus half the size of the original one.

Fig. (f) Max Pooling

• Average Pooling – It takes the average of all values present in the window.

Fig. (g) Average Pooling
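A small NumPy/Keras sketch of both pooling types on a 4x4 activation map (framework assumed; the values are illustrative). A 2x2 window with stride 2 halves each spatial dimension, as described above.

import numpy as np
import tensorflow as tf

act = np.array([[1, 3, 2, 4],
                [5, 6, 1, 2],
                [7, 2, 9, 1],
                [3, 4, 6, 8]], dtype="float32").reshape(1, 4, 4, 1)

max_pooled = tf.keras.layers.MaxPooling2D(pool_size=2)(act)
avg_pooled = tf.keras.layers.AveragePooling2D(pool_size=2)(act)

print(max_pooled[0, :, :, 0])   # [[6. 4.] [7. 9.]]       max of each 2x2 window
print(avg_pooled[0, :, :, 0])   # [[3.75 2.25] [4. 6.]]   mean of each 2x2 window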
iii) Fully Connected Layer – In a convolution layer the neurons are connected only to the other neurons from the same frame, but in the fully connected layer all the inputs are connected to each neuron. It applies a linear transformation to the input vector so that every element of the input vector influences every element of the output vector. The output from the preceding pooling layer is flattened and then fed to the fully connected layer.

iv) Final Output Layer – This is the final layer of neurons, whose count equals the number of possible classes. The final output layer predicts the probability of the input image belonging to a particular class.

v) Dropout Layer – This layer abolishes the effect of some neurons: it sets those input units' values to 0, preventing the model from overfitting.

Architecture used:

The AlexNet architecture is considered a milestone for image classification. It has eight learnable layers, five convolutional and three fully connected, with three max-pooling layers interleaved. [17]

Layer     Layer Name                    Number of kernels   Kernel Size (pixels)   Strides
Layer 1   First convolutional layer     96                  11*11                  4
Layer 2   First max-pooling layer       96                  3*3                    2
Layer 3   Second convolutional layer    256                 5*5                    1
Layer 4   Second max-pooling layer      256                 3*3                    2
Layer 5   Third convolutional layer     384                 3*3                    1
Layer 6   Fourth convolutional layer    384                 3*3                    1
Layer 7   Fifth convolutional layer     256                 3*3                    1
Layer 8   Third max-pooling layer       256                 3*3                    2

The first and second fully connected layers have 4096 neurons each; the output layer has 1000 neurons.

ReLU activation function non-linearity:

The tanh function has a saturating non-linearity: it compresses the output into the range −1 to 1 and therefore shows limiting behaviour at the boundaries. ReLU has the non-saturating non-linearity f(x) = max(0, x), which makes it easier to learn complex features; it is also faster to train the neural network with than a saturating non-linearity.
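A compact Keras sketch of the AlexNet-style stack tabulated above, as one possible reading of the table, not the authors' code: the 227x227 grayscale input, the "same" padding on the inner convolutions, and a 26-class output for the ASL alphabets are all assumptions.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_alexnet(input_shape=(227, 227, 1), num_classes=26):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(96, 11, strides=4, activation="relu"),                  # Layer 1
        layers.MaxPooling2D(3, strides=2),                                    # Layer 2
        layers.Conv2D(256, 5, strides=1, padding="same", activation="relu"),  # Layer 3
        layers.MaxPooling2D(3, strides=2),                                    # Layer 4
        layers.Conv2D(384, 3, strides=1, padding="same", activation="relu"),  # Layer 5
        layers.Conv2D(384, 3, strides=1, padding="same", activation="relu"),  # Layer 6
        layers.Conv2D(256, 3, strides=1, padding="same", activation="relu"),  # Layer 7
        layers.MaxPooling2D(3, strides=2),                                    # Layer 8
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),        # first fully connected layer
        layers.Dropout(0.5),                          # dropout, see Reducing Overfitting
        layers.Dense(4096, activation="relu"),        # second fully connected layer
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),  # output layer
    ])

model = build_alexnet()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])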
The comparison of the two architectures produced excellent outcomes. After evaluation it was found that AlexNet performed better than LeNet-5, with an accuracy of 98.28% compared to LeNet-5's 96.3%. LeNet-5 achieved only 56.59% true positives compared to AlexNet's 71.78% (ROC AUC metric), which suggests that LeNet-5's accuracy is subpar. LeNet-5 also has a greater mean squared error (5.15%) compared to AlexNet's 3.81%.
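For illustration, such a comparison could be computed with scikit-learn as below; the arrays are placeholders, not the authors' data.

import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1])             # placeholder ground-truth labels
p_pred = np.array([0.1, 0.9, 0.4, 0.2, 0.7])   # placeholder predicted probabilities
y_pred = (p_pred >= 0.5).astype(int)           # hard predictions at a 0.5 threshold

print("accuracy:", accuracy_score(y_true, y_pred))
print("roc auc :", roc_auc_score(y_true, p_pred))
print("mse     :", mean_squared_error(y_true, p_pred))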
Reducing Overfitting:

Dropout – This regularization technique refers to dropping out hidden and visible units in the network. The rate is the probability with which individual neurons are dropped during training: a probability of 0.5 indicates that 50 percent of the neurons are dropped out and hence ignored.
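A minimal Keras illustration of a 0.5 dropout rate (framework assumed): during training roughly half the activations are zeroed and the survivors are scaled up, while at inference dropout is inactive.

import numpy as np
import tensorflow as tf

drop = tf.keras.layers.Dropout(0.5)
x = np.ones((1, 8), dtype="float32")
print(drop(x, training=True).numpy())   # ~half the units zeroed, survivors scaled by 2
print(drop(x, training=False).numpy())  # all ones: dropout disabled at inference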
Fig. (m) Plot of Model Accuracy

IX. CONCLUSION

The creation of a practical real-time vision-based American Sign Language recognition system for D&M users, employing the ASL alphabets, is described in this article. A final accuracy of 98.0 percent is achieved on the dataset. The accuracy of prediction for symbols that are more similar to one another is increased by creating two layers of algorithms. Provided the signs are displayed appropriately, there is no background noise, and the illumination is suitable, all symbols are recognized correctly.

X. REFERENCES

[1] Anup Kumar; Karun Thankachan; Mevin M., "Sign Language Recognition," 3rd Int'l Conf. on Recent Advances in Information Technology (RAIT-2016), 2016.
[2] Amrutha K; Prabhu P., "ML Based Sign Language Recognition System," 2021 International Conference on Innovative Trends in Information Technology (ICITIIT), IEEE, 2021.
[3] Siming He, "Research of a Sign Language Translation System Based on Deep Learning," International Conference on Artificial Intelligence and Advanced Manufacturing (AIAM), 2019.
[4] Nikam Ashish; S. Ambekar; Aarti G., "Sign Language Recognition Using Image Based Hand Gesture Recognition Techniques," Online International Conference on Green Engineering and Technologies (IC-GET), 2016.
[5] Pankajakshan; Priyanka C.; Thilagavathi, "Sign Language Recognition System," IEEE Sponsored 2nd International Conference on Innovations in Information Embedded and Communication Systems (ICIIECS'15), 2015.
[6] Kartik Shenoy; Tejas Dastan; Varun Rao; Devendra Vyavaharkar, "Real-time Indian Sign Language (ISL) Recognition," doi: 10.1109/ICCCNT.2018.8493808, IEEE, July 2018.
[7] Nagendraswamy H; Chethana Kumara B; Lekha Chinmayi R, "Indian Sign Language Recognition: An Approach Based on Fuzzy-Symbolic Data," 2016 Intl. Conference on Advances in Computing, Communications and Informatics (ICACCI), Sept. 21-24, 2016.
[8] Anshul Mittal; Pradeep Kumar; Partha Pratim Roy; Raman Balasubramanian; Bidyut B. Chaudhuri, "A Modified-LSTM Model for Continuous Sign Language Recognition using Leap Motion," doi: 10.1109/JSEN.2019.2909837, IEEE Sensors Journal, 2019.
[9] Muneer Al-Hammadi; Ghulam Muhammad; Wadood Abdul; Mansour Alsulaiman; Mohammed A. Bencherif; Tareq S. Alrayes; Hassan Mathkour; Mohamed Amine Mekhtiche, "Deep Learning-Based Approach for Sign Language Gesture Recognition With Efficient Hand Gesture Representation," IEEE, October 19, 2020.
[10] Neel Kamal Bhagat; Vishnu Sai Y; Rathna G N, "Indian Sign Language Gesture Recognition using Image Processing and Deep Learning," 2019 Digital Image Computing: Techniques and Applications (DICTA), Perth, Australia, Dec. 2-4, 2019.
[11] Hao Zhou; Wengang Zhou; Yun Zhou; Houqiang Li, "Spatial-Temporal Multi-Cue Network for Sign Language Recognition and Translation," doi: 10.1109/TMM.2021.3059098, IEEE, 2022.
[12] Sruthi C. J; Lijiya A, "Signet: A Deep Learning based Indian Sign Language Recognition System," International Conference on Communication and Signal Processing, April 4-6, 2019, India.
[13] Elakkiya R.; Selvamani K., "Subunit sign modelling framework for continuous sign language recognition," Computers and Electrical Engineering, vol. 74, pp. 379-390, 2019, doi: 10.1016/j.compeleceng.2019.02.012.
[14] Yogeshwar I. Roade; Prashant M. Jadav, "Indian Sign Language Recognition System," doi: 10.21817/ijet/2017/v9i3/170903S030, Vol. 9 No. 3S, July 2017.
[15] Jasmine Kaur; C. Rama Krishna, "An Efficient Indian Sign Language Recognition System using Sift Descriptor," International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume-8 Issue-6, August 2019.
[16] Siddhesh Bangar, "AlexNet Architecture Explained," retrieved from https://medium.com/@siddheshb008/alexnet-architecture-explained-b6240c528bd5.
[17] A. Gadre, P. Pund, G. Ajmire and S. Kale, "Signature Recognition Models: Performance Comparison," 2021 International Conference on Advancements in Electrical, Electronics, Communication, Computing and Automation (ICAECA), 2021, pp. 1-6, doi: 10.1109/ICAECA52838.2021.9675598.
[18] Chikmurge Diptee; Shriram R., "Marathi Handwritten Character Recognition Using SVM and KNN Classifier," 2021, doi: 10.1007/978-3-030-49336-3_32.