Abstract
Human action recognition is a challenging computer vision task, and many efforts have been made to improve performance. Most previous work has concentrated on hand-crafted features or on spatial-temporal features learned from multiple contiguous frames. In this paper, we present a dual-channel model that decouples spatial and temporal feature extraction. More specifically, we propose to capture complementary static form information from single frames and dynamic motion information from multi-frame differences in two separate channels. In both channels we use two stacked classical subspace networks to learn hierarchical representations, which are subsequently fused for action recognition. Our model is trained and evaluated on three typical benchmarks: the KTH, UCF and Hollywood2 datasets. The experimental results show that our approach achieves performance comparable to state-of-the-art methods. In addition, feature analysis and control experiments are carried out to demonstrate the effectiveness of the proposed approach for feature extraction and thereby action recognition.
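The dual-channel idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the spatial channel receives a single frame, the temporal channel receives differences of contiguous frames, and a hypothetical `toy_features` function stands in for the stacked subspace (ISA-style) networks, whose outputs are fused by concatenation before classification.

```python
import numpy as np

def dual_channel_inputs(video, t):
    """Build the two complementary inputs.

    video : (T, H, W) grayscale frame stack
    t     : index of the centre frame

    The spatial channel sees one frame (static form); the temporal
    channel sees differences of contiguous frames (dynamic motion).
    """
    spatial = video[t].astype(np.float32)
    temporal = np.diff(video[t - 1:t + 2].astype(np.float32), axis=0)
    return spatial, temporal

def toy_features(x, dim=16):
    # Placeholder for a stacked subspace network: project the
    # flattened input with a fixed random basis, then apply a
    # square-root-of-square (energy-style) nonlinearity.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((dim, x.size)) / np.sqrt(x.size)
    return np.sqrt((w @ x.ravel()) ** 2)

# Fuse the two channels by concatenating their feature vectors,
# which a linear classifier (e.g. an SVM) would then consume.
video = np.random.default_rng(1).integers(0, 256, size=(5, 8, 8))
s, m = dual_channel_inputs(video, t=2)
fused = np.concatenate([toy_features(s), toy_features(m)])
print(fused.shape)  # (32,)
```

In the paper the two channels are trained separately and fused only at the representation level, which is what the final concatenation mimics here.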
Acknowledgements
The work was supported by the National Natural Science Foundation of China (Grant No. 91420302), the National Basic Research Program of China (Grant No. 2015CB856004) and the Key Basic Research Program of Shanghai (Grant No. 15JC1400103).
Cite this article
Zhang, K., Zhang, L. Extracting hierarchical spatial and temporal features for human action recognition. Multimed Tools Appl 77, 16053–16068 (2018). https://doi.org/10.1007/s11042-017-5179-7