Abstract
In this paper, we propose an efficient cascaded model for sign language recognition from videos that exploits spatio-temporal hand-based information using deep learning approaches, specifically the Single Shot Detector (SSD), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM). Our simple yet efficient and accurate model includes two main parts: hand detection and sign recognition. Three types of spatial features, namely hand features, Extra Spatial Hand Relation (ESHR) features, and Hand Pose (HP) features, are fused in the model and fed to the LSTM for temporal feature extraction. We train the SSD model for hand detection using videos collected from five online sign dictionaries. Our model is evaluated on our proposed dataset (Rastgoo et al., Expert Syst Appl 150:113336, 2020), which includes 10,000 sign videos of 100 Persian signs performed by 10 contributors against 10 different backgrounds, and on the isoGD dataset. Using 5-fold cross-validation, our model outperforms state-of-the-art alternatives in sign language recognition.
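The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes the three per-frame feature vectors (hand, ESHR, HP) are simply concatenated and passed through a single-layer LSTM whose final hidden state is classified with a softmax; the abstract does not specify the fusion operator, the layer sizes, or the classifier head, so all of these are assumptions. The function names, feature dimensions, gate ordering, and random weights below are hypothetical, and random vectors stand in for the SSD and CNN outputs, which are not reproduced here.

```python
import numpy as np

def fuse_frame_features(hand_feat, eshr_feat, hp_feat):
    """Concatenate the three per-frame feature arrays along the feature axis.
    Concatenation is an assumed fusion operator, not confirmed by the source."""
    return np.concatenate([hand_feat, eshr_feat, hp_feat], axis=-1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_state(seq, W, U, b):
    """Run a single-layer LSTM over seq of shape (T, D); return the final hidden state (H,).
    W: (4H, D), U: (4H, H), b: (4H,). Assumed gate order: input, forget, candidate, output."""
    H = U.shape[1]
    h = np.zeros(H)
    c = np.zeros(H)
    for x in seq:
        z = W @ x + U @ h + b
        i = sigmoid(z[:H])            # input gate
        f = sigmoid(z[H:2 * H])       # forget gate
        g = np.tanh(z[2 * H:3 * H])   # candidate cell state
        o = sigmoid(z[3 * H:])        # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

def classify_video(hand, eshr, hp, params):
    """Fuse per-frame features, summarize them with the LSTM, and return class probabilities."""
    fused = fuse_frame_features(hand, eshr, hp)
    h = lstm_last_state(fused, params["W"], params["U"], params["b"])
    logits = params["W_out"] @ h + params["b_out"]
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Hypothetical sizes: 8 frames, 100 sign classes (matching the 100 Persian signs).
rng = np.random.default_rng(0)
T, d_hand, d_eshr, d_hp, H, n_classes = 8, 16, 4, 6, 12, 100
D = d_hand + d_eshr + d_hp
params = {
    "W": rng.normal(scale=0.1, size=(4 * H, D)),
    "U": rng.normal(scale=0.1, size=(4 * H, H)),
    "b": np.zeros(4 * H),
    "W_out": rng.normal(scale=0.1, size=(n_classes, H)),
    "b_out": np.zeros(n_classes),
}
hand = rng.normal(size=(T, d_hand))   # stand-in for CNN hand features
eshr = rng.normal(size=(T, d_eshr))   # stand-in for ESHR features
hp = rng.normal(size=(T, d_hp))       # stand-in for hand pose features
fused = fuse_frame_features(hand, eshr, hp)
probs = classify_video(hand, eshr, hp, params)
```

Here `probs` is a probability vector over the 100 sign classes for one video. In a trained system the weights would be learned rather than sampled at random.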
References
Acton B, Koum J (2009) Yahoo. www.whatsapp.com
Chai X, Guang L, Lin Y, Xu Zh, Tang Y, Chen X, Zhou M (2013) Sign language recognition and translation with kinect. In: IEEE International conference on automatic face and gesture recognition (FG2013). April 22–26. Shanghai
Chen Ch, Zhang B, Zhenjie H, Jiang J, Liu M, Yang Y (2017) Action recognition from depth sequences using weighted fusion of 2D and 3D auto-correlation of gradients features. Multimedia Tools and Applications
Cooper H, Ong W-J, Pugeault N, Bowden R (2012) Sign language recognition using sub-units. J Mach Learn Res 13:2205–2231
Duan J, Zhou Sh, Wan J, Guo X, Li SZ (2016) Multi-modality fusion based on consensus-voting and 3D convolution for isolated gesture recognition. arXiv:1611.06689v2
El Khattabi Z, Tabii Y, Benkaddour A (2015) Video summarization: techniques and applications. Int J Comput Inform Eng 4:9
Forster et al (2012) RWTH-PHOENIX v1 - German sign language RWTH-PHOENIX v2
Ge L, Liang H, Yuan J, Thalmann D (2018) Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. IEEE Transactions on Image Processing
Goodwyn S, Acredolo L, Brown C (2000) Impact of symbolic gesturing on early language development. Nonverbal Behavior, 81–103. https://www.babysignlanguage.com/dictionary/?v=04c19fa1e772
He K, Zhang X, Ren Sh, Sun J (2016) Deep residual learning for image recognition. CVPR
Jameson L, et al. (2004) American Sign Language
Kang B, Tripathi S, Nguyen TQ (2015) Real-time sign language fingerspelling recognition using convolutional neural networks from depth map. In: 3rd IAPR Asian conference on pattern recognition (ACPR)
Kapuscinski T, Oszust M, Wysocki M, Warchol D (2015) Recognition of hand gestures observed by depth cameras. International Journal of Advanced Robotic Systems
Kim S, Ban Y, Lee S (2017) Tracking and classification of in-air hand gesture based on thermal guided joint filter. Sensors
Koller O, Forster J, Hermann N (2015) Continuous sign language recognition: towards large vocabulary statistical recognition systems handling multiple signers. Comput Vis Image Underst, 108–125
Le TH, Jaw DW, Lin ICh, Liu HB, Huang ShCh (2018) An efficient hand detection method based on convolutional neural network. In: The 7th IEEE international symposium on next-generation electronics
Liu W, Anguelov D, Erhan D, Szegedy Ch, Reed S, Fu ChY, Berg AC (2016) SSD: single shot MultiBox detector. ECCV, 21–37
Miao Q, Li Y, Ouyang W, Ma Z, Xu X, Shi W, Cao X, Liu Z, Chai X, Liu Z et al (2017) Multimodal gesture recognition based on the resc3d network. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Miller J, Winn B, Winn J (2019) Signing savvy. Online dictionary
Narayana P, Beveridge JR, Bruce AD (2018) Gesture recognition: focus on the hands. CVPR, 5235–5244
Neverova N, Wolf Ch, Taylor GW, Nebout F (2014) Hand segmentation with structured convolutional learning. In: Asian conference on computer vision (ACCV) 2014: computer vision, pp 687–702
Ong WJ, Cooper H, Pugeault N, Bowden R (2012) Sign language recognition using sequential pattern trees. CVPR
Oszust M, Wysocki M (2013) Polish sign language words recognition with Kinect. In: 6th International conference on human system interactions (HSI)
Pagebites Inc. (2019) United States. www.imo.com
Pugeault N, Bowden R (2011) Spelling it out: real-time ASL fingerspelling recognition. In: Proceedings of the 1st IEEE workshop on consumer depth cameras for computer vision, jointly with ICCV’2011
Rastgoo R, Kiani K, Escalera S (2018) Multi-modal deep hand sign language recognition in still images using restricted Boltzmann machine. Entropy 20:809
Rastgoo R, Kiani K, Escalera S (2020) Hand sign language recognition using multi-view hand skeleton. Expert Syst Appl 150:113336. https://doi.org/10.1016/j.eswa.2020.113336
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. NIPS
Ronchetti F, Quiroga F, Estrebou C, Lanzarini L (2016) Handshape recognition for Argentinian sign language using ProbSom. JCS-T
Ronchetti F, Quiroga F, Estrebou C, Lanzarini LC, Rosete A (2016) LSA64: an Argentinian sign language dataset. Congreso Argentino de Ciencias de la Computación (CACIC 2016)
Scogin J (2008) Texas math sign language dictionary. http://www.tsdvideo.org/about.php
Simon T, Joo H, Matthews I, Sheikh Y (2017) Hand keypoint detection in single images using multi-view bootstrapping. arXiv:1704.07809
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556v6
Sun A, Wei Y, Liang S, Tang X, Sun J (2015) Cascaded hand pose regression. CVPR, 824–832
Thangali A, Nash J, Sclaroff S, Neidle C (2011) Exploiting phonological constraints for handshape inference in ASL video. CVPR
Wang H, Wang P, Song Z, Li W (2017) Large-scale multimodal gesture recognition using heterogeneous networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition
William V (2013) American sign language. William Vicars Publisher, http://www.lifeprint.com/index.htm
Yan Sh, Xia Y, Smith JS, Lu W, Zhang B (2017) Multi-scale convolutional neural networks for hand detection. Applied Computational Intelligence and Soft Computing
Zhang L, Zhu G, Shen P, Song J, Shah SA, Bennamoun M (2017) Learning spatiotemporal features using 3dcnn and convolutional lstm for gesture recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Zhou Y, Lu J, Lin X, Sun Y, Ma X (2018) HBE: hand branch ensemble network for real-time 3D Hand Pose Estimation. ECCV
Zimmermann Ch, Brox T (2017) Learning to estimate 3D hand pose from single RGB images. ICCV
Acknowledgements
This work is partially supported by the Spanish project TIN2016-74946-P (MINECO/FEDER, UE), the CERCA Programme/Generalitat de Catalunya, ICREA under the ICREA Academia programme, and the High Intelligent Solution (HIS) company of Iran. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan XP GPU used for this research. We would also like to thank two deaf centers of Iran (Semnan and Tehran) and the Computer Vision Center (CVC) of Spain for their collaboration.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflict of interests
The authors certify that they have no conflict of interest.
Cite this article
Rastgoo, R., Kiani, K. & Escalera, S. Video-based isolated hand sign language recognition using a deep cascaded model. Multimed Tools Appl 79, 22965–22987 (2020). https://doi.org/10.1007/s11042-020-09048-5
DOI: https://doi.org/10.1007/s11042-020-09048-5