
Interface for Sign Language to Text Conversion using Convolutional Neural Networks


1st Prof. Kopal Gangrade
Asst. Professor
Dept. of Computer Engineering
Pune Institute of Computer Technology

2nd Abhishek Kulkarni
Student
Dept. of Computer Engineering
Pune Institute of Computer Technology

3rd Deep Thombare
Student
Dept. of Computer Engineering
Pune Institute of Computer Technology

4th Vishwatej Harer
Student
Dept. of Computer Engineering
Pune Institute of Computer Technology

Abstract—Sign languages utilize a visual-manual modality to convey meaning and are commonly used by individuals who are deaf or hard of hearing. These communication systems involve hand gestures that are specific to each language and can be difficult for those unfamiliar with them to interpret. To address this issue, our goal is to develop an interface for Sign Language Recognition that can translate sign language into text and audio, making it more accessible to a wider audience. However, current techniques for sign language translation have limitations, such as inaccuracies, difficulties with detecting skin tones, excessive motion in gestures, clutter, and variability. Despite these challenges, we aim to create an interface using Convolutional Neural Networks that applies techniques such as Edge Detection and Hand Gesture Recognition to convert sign language to text and audio while mitigating these drawbacks to the best of our abilities.

Index Terms—Sign Language Recognition, Convolutional Neural Network, Image Processing, Edge Detection, Hand Gesture Recognition.

I. INTRODUCTION
Sign language recognition refers to the process of converting a user's signs and gestures into text, allowing for communication between individuals who are unable to speak and the general public. Image processing algorithms and neural networks are utilized to map gestures to the appropriate text found within the training data, thereby converting raw images and videos into text that can be easily read and understood. Communication can be challenging for individuals who are deaf or hard of hearing, as they often rely on visual communication. Sign language serves as the primary means of communication within the deaf and hard of hearing community, utilizing the visual modality to exchange information. However, many people are unaware of the grammar associated with sign language, limiting communication opportunities for individuals who rely on it. This has led to a growing demand for a computer-based system that can accurately recognize and translate sign language. Researchers have been working to address this problem by developing technologies that can recognize speech, facial expressions, and human gestures. Gestures are a form of nonverbal communication that can be complex and difficult to code into machine language. In our project, we focus on image processing and the use of a Convolutional Neural Network to improve the accuracy of our output. We implement a model that recognises finger-spelling based hand gestures in order to form a complete word by combining the letters predicted for the individual gestures. The gestures we aim to train are the letters of the American Sign Language (ASL) fingerspelling alphabet.
II. PROPOSED SYSTEM
Our system employs convolutional neural networks to recognize hand gestures in sign language. It captures video and converts it into frames, which are then used to segment the hand pixels. The resulting image is compared against a trained model to obtain accurate text labels for the letters. Using this approach, our system achieves robustness in accurately recognizing a variety of hand gestures.
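To make the pipeline concrete, the following is a minimal sketch of such a capture-and-predict loop in Python with OpenCV and Keras. The model file name, ROI coordinates, and preprocessing parameters are illustrative assumptions rather than the exact values used in our implementation; the detailed steps are described in the Methodology section.

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

def preprocess(roi_bgr):
    """Grayscale + Gaussian blur + threshold + resize, as described in Section III."""
    gray = cv2.cvtColor(roi_bgr, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5, 5), 2)
    _, thresh = cv2.threshold(blur, 70, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    resized = cv2.resize(thresh, (128, 128))
    return resized.reshape(1, 128, 128, 1) / 255.0

model = load_model("asl_cnn.h5")                   # assumed model file name
labels = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + ["blank"]

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    roi = frame[100:300, 100:300]                  # region of interest (assumed coordinates)
    pred = model.predict(preprocess(roi), verbose=0)
    letter = labels[int(np.argmax(pred))]
    cv2.putText(frame, letter, (100, 90), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2)
    cv2.imshow("Sign Language to Text", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```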
III. METHODOLOGY

The system follows a vision-based approach. All the signs are represented with bare hands, which eliminates the need for any artificial devices for interaction.

A. DATASET GENERATION
In our project, we encountered a challenge in finding pre-existing datasets that met our requirement of raw images, so we decided to create our own dataset using the OpenCV library. We captured approximately 800 images of each ASL symbol for training and around 200 images per symbol for testing. To create the dataset, we used the webcam on our machine to capture each frame. Within each frame, we defined a region of interest (ROI) marked by a blue bounded square. We then extracted the ROI, which was in RGB format, and converted it into a grayscale image. Finally, we applied a Gaussian blur filter, which helped us extract various features of the image.
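A minimal sketch of this capture procedure with OpenCV is shown below; the directory layout, ROI coordinates, blur parameters, and key bindings are illustrative assumptions.

```python
import os
import cv2

# Minimal dataset-capture sketch (assumed directory layout and key bindings).
symbol = "A"                                   # current ASL symbol being recorded
out_dir = os.path.join("dataset", "train", symbol)
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(0)
count = 0
while count < 800:                             # ~800 training images per symbol
    ok, frame = cap.read()
    if not ok:
        break
    # Blue-bounded square marking the region of interest (assumed coordinates).
    cv2.rectangle(frame, (100, 100), (300, 300), (255, 0, 0), 2)
    roi = frame[100:300, 100:300]
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5, 5), 2)   # blur brings out coarse hand features
    cv2.imshow("frame", frame)
    cv2.imshow("roi", blur)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("c"):                        # press 'c' to capture one image
        cv2.imwrite(os.path.join(out_dir, f"{count}.jpg"), blur)
        count += 1
    elif key == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```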
B. GESTURE CLASSIFICATION
Our approach uses two layers of algorithms to predict the final symbol shown by the user.
• Algorithm Layer 1:

1) Apply a Gaussian blur filter and a threshold to the frame captured with OpenCV to obtain the processed image after feature extraction.
2) This processed image is passed to the CNN model for prediction; if a letter is detected for more than 50 frames, the letter is printed and taken into consideration for forming the word.
3) Space between words is indicated using the blank symbol.

• Algorithm Layer 2:
1) We detect the sets of symbols that give similar results when detected.
2) We then classify between those sets using classifiers trained for those sets only.

Layer 1:

• CNN Model
1) The first step in our convolutional neural network involves processing an input picture with a resolution of 128x128 pixels through the first convolutional layer, which uses 32 filter weights (3x3 pixels each). This produces a 126x126 pixel image, one for each filter weight. (A Keras sketch of the full architecture is given after this list.)
2) Next, we downsample the image using max pooling of 2x2, keeping only the highest value in each 2x2 square of the array. This results in a picture that is downsampled to 63x63 pixels.
3) We then process this 63x63 image through the second convolutional layer, which uses 32 filter weights (3x3 pixels each) and produces a 61x61 pixel image.
4) The resulting images are downsampled again using max pooling of 2x2, reducing the resolution to 30x30.
5) Next, we use the resulting images as input to a fully connected layer with 128 neurons. The output after the second pooling step is reshaped into an array of 30x30x32 = 28,800 values.
6) The output of the first densely connected layer is then fed to a second fully connected layer with 96 neurons. We also incorporate a dropout layer with a rate of 0.5 to avoid overfitting.
7) Finally, the output of the second densely connected layer serves as input to the final layer, which has a number of neurons equal to the number of classes we are classifying (i.e., the alphabets plus the blank symbol).

• Activation Function:
In our convolutional neural network, we have incorporated the Rectified Linear Unit (ReLU) function in each of the
layers, including the convolutional and fully connected neurons. The ReLU function calculates max(x,0) for each input
pixel, which adds nonlinearity to the formula and helps to learn more complicated features. This function helps to eliminate
the vanishing gradient problem and speed up the training by reducing the computation time.

• Pooling Layer:
We apply max pooling to the input image with a pool size of (2, 2) after the ReLU activation. This reduces the number of parameters, thus lessening the computation cost and reducing overfitting.
• Dropout Layers:
The dropout layer is added to our model to prevent overfitting, which occurs when the weights of the network are overly
adjusted to the training examples and the network does not generalize well to new examples. The dropout layer
randomly sets a subset of the activations in the previous layer to zero during training, which helps prevent the model
from relying too much on any specific activation. This allows the model to learn more robust and generalizable features
and reduces the chances of overfitting to the training data. During testing, the dropout layer is disabled and all
activations are used for making predictions.

• Optimizer:
To update our model in response to the output of the loss function, we have used the Adam optimizer. Adam
combines the advantages of two stochastic gradient descent algorithms: adaptive gradient algorithm (AdaGrad) and root
mean square propagation (RMSProp). AdaGrad adapts the learning rate of each parameter based on its historical gradient information, so the effective learning rate decays faster for frequently updated parameters and remains larger for infrequently updated ones. RMSProp, on the other hand, divides the learning rate by a running average of the
magnitudes of recent gradients, which helps to avoid exploding gradients and converge faster. Adam optimizer combines
these two methods and adapts the learning rate of each parameter based on the first and second moments of the gradients.
This results in faster convergence and better performance on a variety of deep learning tasks.
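Putting the pieces above together, a minimal Keras sketch of the described architecture could look as follows. The exact placement of the dropout layer and the training hyperparameters are assumptions; layer sizes follow the valid-convolution arithmetic described in the list above.

```python
from tensorflow.keras import layers, models, optimizers

NUM_CLASSES = 27  # 26 alphabets + blank symbol, per the description above

def build_model():
    model = models.Sequential([
        # 128x128 grayscale input -> first conv layer, 32 filters of 3x3 (ReLU)
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 1)),
        layers.MaxPooling2D((2, 2)),                  # downsample by 2x2 max pooling
        layers.Conv2D(32, (3, 3), activation="relu"), # second conv layer, 32 filters
        layers.MaxPooling2D((2, 2)),                  # downsample again -> 30x30x32
        layers.Flatten(),                             # reshape to 28,800 values
        layers.Dense(128, activation="relu"),         # first fully connected layer
        layers.Dense(96, activation="relu"),          # second fully connected layer
        layers.Dropout(0.5),                          # dropout to reduce overfitting
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer=optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```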

Layer 2:

To improve the accuracy of our symbol detection and prediction, we have implemented two layers of algorithms that can differentiate between symbols that are similar to each other. However, during our testing phase, we found that certain symbols were not being detected accurately and were being confused with other symbols. Specifically, the following symbols were frequently confused:
1) For D : R and U
2) For U : D and R
3) For I : T, D and K
4) For S : M and N
To handle the above cases, we made three different classifiers for classifying these sets (a sketch of how a layer-1 prediction is routed to these classifiers is given after the list below):
1) D,R,U
2) T,K,D,I
3) S,M,N
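A minimal sketch of how a layer-1 prediction could be routed to the matching subset classifier is shown below. The model file names, and the handling of letters that appear in more than one confusable set, are assumptions.

```python
import numpy as np
from tensorflow.keras.models import load_model

# Subset classifiers for the confusable groups (assumed file names).
subset_models = {
    frozenset("DRU"):  load_model("cnn_dru.h5"),
    frozenset("TKDI"): load_model("cnn_tkdi.h5"),
    frozenset("SMN"):  load_model("cnn_smn.h5"),
}
subset_labels = {
    frozenset("DRU"):  list("DRU"),
    frozenset("TKDI"): list("TKDI"),
    frozenset("SMN"):  list("SMN"),
}

def refine_prediction(layer1_letter, image):
    """If the layer-1 letter falls in a confusable set, re-classify with the
    dedicated subset model; otherwise keep the layer-1 prediction."""
    for group, model in subset_models.items():
        if layer1_letter in group:          # first matching group (simplification)
            probs = model.predict(image, verbose=0)
            return subset_labels[group][int(np.argmax(probs))]
    return layer1_letter
```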

C. FINGER SPELLING SENTENCE FORMATION IMPLEMENTATION


1) Whenever the count of a detected letter exceeds a specific value and no other letter is within a certain threshold of it, we print the letter and add it to the current string (in our code we kept the value as 50 and the difference threshold as 20).
2) Otherwise, we clear the current dictionary, which holds the detection counts of the present symbol, to avoid the possibility of a wrong letter getting predicted.
3) Whenever the count of a detected blank (plain background) exceeds a specific value and the current buffer is empty, no space is added.
4) Otherwise, it predicts the end of the word by printing a space, and the current word gets appended to the sentence below (a sketch of this counting logic is given after this list).
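A minimal sketch of this counting logic, using the thresholds of 50 and 20 mentioned above, is shown below; the exact bookkeeping in our implementation may differ.

```python
from collections import defaultdict

DETECTION_THRESHOLD = 50   # frames a letter must dominate before it is committed
DIFFERENCE_THRESHOLD = 20  # required margin over the runner-up letter

counts = defaultdict(int)  # per-letter detection counts for the current symbol
word = ""                  # current word buffer
sentence = ""              # accumulated sentence

def update(predicted_letter):
    """Update counts with the letter predicted for the current frame and
    commit a letter / word when the counting rules are satisfied."""
    global word, sentence
    counts[predicted_letter] += 1

    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    best, best_count = ranked[0]
    runner_up_count = ranked[1][1] if len(ranked) > 1 else 0

    if best_count > DETECTION_THRESHOLD:
        if best_count - runner_up_count <= DIFFERENCE_THRESHOLD:
            counts.clear()                 # too ambiguous: discard and restart
            return
        if best == "blank":
            if word:                       # blank ends the current word
                sentence += word + " "
                word = ""
        else:
            word += best                   # commit the detected letter
        counts.clear()
```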

D. AUTOCORRECT FEATURE
The system incorporates the Hunspell spell checker, through its Python bindings, to provide users with suggested alternative words for any incorrectly spelled input word. These suggestions are displayed to the user as a set, allowing them to select the most appropriate option to append to the current sentence. This feature not only reduces the number of spelling errors but also aids in the prediction of complex words.
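A minimal sketch using the Python Hunspell bindings is shown below; the dictionary file paths are system-dependent assumptions.

```python
import hunspell

# Dictionary paths are system-dependent (these are typical Linux locations).
checker = hunspell.HunSpell("/usr/share/hunspell/en_US.dic",
                            "/usr/share/hunspell/en_US.aff")

def suggestions_for(word):
    """Return candidate corrections for a finger-spelled word."""
    if checker.spell(word):
        return [word]              # already a valid word
    return checker.suggest(word)   # e.g. suggestions_for("helo") -> ["hello", ...]
```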

E. TRAINING AND TESTING


We preprocess our input images by converting them from RGB to grayscale and applying a Gaussian blur to remove noise.
We then use adaptive thresholding to extract the hand from the background and resize the images to 128 x 128 pixels. These
preprocessed images are fed into our model for training and testing. During prediction, the output of the model is normalized using the softmax function so that the values across the classes sum to 1. The model's output may initially be far from the actual labels, so we use labeled data to train the network. To measure the performance of the model, we use the cross-entropy loss, which measures the difference between the predicted and the labeled values. We minimize the cross-entropy by adjusting the weights of the neural network using gradient descent, specifically the Adam optimizer, which is well suited to this task.
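A minimal sketch of this preprocessing and training setup is shown below; the adaptive-threshold parameters and the training hyperparameters are assumptions.

```python
import cv2

def preprocess_for_model(bgr_image):
    """BGR -> grayscale -> Gaussian blur -> adaptive threshold -> 128x128."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (5, 5), 2)
    thresh = cv2.adaptiveThreshold(blur, 255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY_INV, 11, 2)
    resized = cv2.resize(thresh, (128, 128))
    return resized.astype("float32").reshape(128, 128, 1) / 255.0

# Training uses softmax outputs with the cross-entropy loss and the Adam
# optimizer, as configured in the model sketch above (hyperparameters assumed).
# X_train, y_train would be the preprocessed images and one-hot labels:
# model.fit(X_train, y_train, validation_split=0.1, epochs=10, batch_size=32)
```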

IV. RESULT
Our model achieved an accuracy of 95.8% with only layer 1 of our algorithm. With the combination of layer 1 and layer 2,
we achieved an accuracy of 98.0%, surpassing the accuracy of many current research papers on American Sign Language,
which tend to focus on using devices such as Kinect for hand detection. For example, in [7], researchers built a recognition
system for Flemish Sign Language using convolutional neural networks and Kinect, achieving an error rate of 2.5%. In [8], a
recognition model was developed using a hidden Markov model classifier and a vocabulary of 30 words, resulting in an error
rate of 10.90%.
In [9], they achieved an average accuracy of 86% for 41 static gestures in Japanese Sign Language. Using depth sensors,
[10] achieved an accuracy of 99.99% for observed signers and 83.58% and 85.49% for new signers, also using CNN for their
recognition system. It is worth noting that our model does not use any background subtraction algorithm, unlike some of the
models mentioned above, so the accuracy may vary once we implement background subtraction in our project.
Additionally, most of the projects mentioned above use Kinect devices, but our goal was to create a system that could be used with readily available resources. Since a sensor like Kinect is not only less readily available but also expensive for most people to purchase, our model's reliance on a normal webcam is a significant advantage. The confusion matrices for our results are shown below.

V. CHALLENGES FACED
The project encountered numerous challenges, starting with the dataset, which needed to consist of square raw images to work conveniently with the CNN in Keras. Since no pre-existing dataset met this requirement, we created our own. The second obstacle involved selecting a filter that could extract the essential features of the images for input to the CNN model. We experimented with several filters, including binary thresholding, Canny edge detection, and Gaussian blur, before settling on the Gaussian blur filter. Additional issues arose concerning the accuracy of the previously trained model, which we resolved by enlarging the input image size and enhancing the dataset.

VI. CONCLUSION
We have developed a functional, real-time American Sign Language (ASL) recognition system for the deaf and mute (D&M) community, specifically for the ASL alphabet. Our model achieved a final accuracy of 98.0% on our dataset. To improve prediction accuracy, we implemented two layers of algorithms that verify and predict symbols that are similar to each other. As a result, our system is now able to detect almost all symbols accurately, provided they are shown properly, there is no background noise, and the lighting conditions are adequate.

VII. FUTURE SCOPE


Our team is working on increasing the accuracy of our gesture recognition system even in the presence of complex
backgrounds. We plan to achieve this by exploring various background subtraction algorithms. Additionally, we aim to improve
the pre-processing stage to enable accurate gesture recognition in low light conditions.
To make our project more accessible to users, we are considering developing it as a web or mobile application. Currently, our
system is designed to recognize American Sign Language (ASL) gestures. However, with sufficient data and training, we could
extend it to recognize other native sign languages.
While our system currently focuses on finger-spelling translation, sign languages also include contextual gestures that can represent objects or verbs. Identifying these contextual gestures would require a higher level of processing and natural language processing (NLP).

REFERENCES
[1] T. Yang and Y. Xu, "Hidden Markov Model for Gesture Recognition," CMU-RI-TR-94-10, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, May 1994.

[2] P. Ziaie, T. Müller, M. E. Foster, and A. Knoll, "A Naive Bayes Classifier with Distance Weighting for Hand-Gesture Recognition," Technische Universität München, Dept. of Informatics VI, Robotics and Embedded Systems, Boltzmannstr. 3, DE-85748 Garching, Germany.

[3] A. Deshpande, "A Beginner's Guide To Understanding Convolutional Neural Networks (Part 2)," adeshpande3.github.io.

[4] M. W. Kalous, "Machine Recognition of Auslan Signs Using PowerGloves: Towards Large-Lexicon Recognition of Sign Language."

[5] L. Pigou, S. Dieleman, P.-J. Kindermans, and B. Schrauwen, "Sign Language Recognition Using Convolutional Neural Networks," in L. Agapito, M. Bronstein, and C. Rother (eds.), Computer Vision - ECCV 2014 Workshops, Lecture Notes in Computer Science, vol. 8925, Springer, Cham, 2015.

[6] M. M. Zaki and S. I. Shaheen, "Sign language recognition using a combination of new vision based features," Pattern Recognition Letters 32(4), 572-577, 2011.

[7] N. Mukai, N. Harada, and Y. Chang, "Japanese Fingerspelling Recognition Based on Classification Tree and Machine Learning," 2017 Nicograph International (NicoInt), Kyoto, Japan, 2017, pp. 19-24. doi:10.1109/NICOInt.2017.9

[8] https://opencv.org/

[9] https://en.wikipedia.org/wiki/TensorFlow

[10] https://en.wikipedia.org/wiki/Convolutional_neural_network
