Abstract
Visual scene interpretation has been a major area of research in recent years, and recognizing human-object interaction is a fundamental step towards understanding visual scenes. Videos can be described through a variety of human-object interaction scenarios: both the human and the object are static (static-static), one is static while the other is dynamic (static-dynamic), or both are dynamic (dynamic-dynamic). This paper presents a unified framework for describing these interactions between humans and a variety of objects, with deep learning as the pivot methodology. Human-object interaction is extracted through conventional machine learning techniques, while spatial relations are captured by a model trained with a convolutional neural network. We also address the recognition of human posture in detail to provide an egocentric visual description. After visual features are extracted, sequential minimal optimization is employed to train the model. The extracted interaction, spatial relations and posture information, together with the interacting object label, are fed into a natural language generation module to produce the scene description. The proposed framework is evaluated on two state-of-the-art datasets, MSCOCO and the MSR3D Daily Activity dataset, achieving accuracies of 78% and 91.16%, respectively.
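
To make the pipeline concrete, the sketch below shows one way the stages summarized above could be composed. It is a minimal illustration under assumed interfaces, not the authors' implementation: every name (describe_scene, detector, feature_extractor, relation_cnn, posture_model, interaction_svm, nlg) is a hypothetical placeholder for the corresponding module, and the scikit-learn SVC import only reflects that libsvm-backed SVMs are fitted with an SMO-type solver, matching the sequential minimal optimization step mentioned in the abstract.

# Minimal, illustrative sketch of the described pipeline (assumed interfaces,
# not the authors' code). Every callable passed in is a hypothetical placeholder.
from sklearn.svm import SVC  # libsvm-backed SVC is fitted with an SMO-type solver


def describe_scene(frame, detector, feature_extractor, relation_cnn,
                   posture_model, interaction_svm, nlg):
    """Compose the pipeline stages for a single frame and return a sentence."""
    # 1. Localize the human and the interacting object (e.g., a YOLO-style detector).
    human_box, object_box, object_label = detector(frame)

    # 2. Hand-crafted features describing the human-object interaction.
    interaction_features = feature_extractor(frame, human_box, object_box)

    # 3. Spatial relation between the two boxes from the trained CNN
    #    (e.g., "next to", "behind").
    spatial_relation = relation_cnn(frame, human_box, object_box)

    # 4. Human posture for the egocentric description (e.g., "standing", "sitting").
    posture = posture_model(frame, human_box)

    # 5. Interaction label predicted by the SMO-trained classifier.
    interaction = interaction_svm.predict([interaction_features])[0]

    # 6. Template- or grammar-based natural language generation.
    return nlg(posture, interaction, object_label, spatial_relation)


# The interaction classifier itself could be trained offline beforehand, e.g.:
#   interaction_svm = SVC(kernel="rbf").fit(X_train, y_train)

In such a composition, the natural language generation step can be as simple as filling a sentence template with the predicted posture, interaction, object label and spatial relation, which is consistent with feeding those labels into the NLG module as described above.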












Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MSIP) (No. 2016R1A2B4011712) and by IGNITE, National Technology Fund, Pakistan, for the project entitled “Automatic Surveillance System for Video Sequences”.
Cite this article
Khan, G., Ghani, M.U., Siddiqi, A. et al. Egocentric visual scene description based on human-object interaction and deep spatial relations among objects. Multimed Tools Appl 79, 15859–15880 (2020). https://doi.org/10.1007/s11042-018-6286-9