Abstract
Temporal modelling remains challenging for action recognition in video. To address this problem, this paper proposes a new video architecture, the Adaptive Region Perception (ARP) network, which combines short-range and long-range temporal information for effective action recognition. The core of ARP is the Movement and Spatial-Temporal (MST) module, which consists of two branches: a movement-information branch and a spatial-temporal-information branch. For short-range temporal information, the movement branch computes feature differences between successive frames to obtain a fine-grained representation of motion. For long-range temporal information, the spatial-temporal branch applies adaptive region perception across the whole video to enhance the model's representation in the spatio-temporal domain, and then models the pooled features with temporal convolution. We insert the MST module into ResNet-50 to build the ARP network and evaluate it on the Something-Something V1, Something-Something V2 and Kinetics-400 datasets, where it achieves strong performance.
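To make the two-branch design concrete, below is a minimal PyTorch sketch of the idea the abstract describes: frame-to-frame feature differencing for short-range motion, and adaptive region pooling followed by temporal convolution for long-range context. All names (MSTBlock, motion_conv, region_pool, temporal_conv), the fusion rule and the region-grid size are illustrative assumptions for exposition, not the authors' implementation.

    # Minimal sketch of a two-branch movement / spatial-temporal block.
    # Names and the fusion rule are assumptions, not the paper's released code.
    import torch
    import torch.nn as nn

    class MSTBlock(nn.Module):
        """Toy two-branch block: frame differencing for short-range motion,
        adaptive region pooling + temporal convolution for long-range context."""

        def __init__(self, channels: int, num_regions: int = 4):
            super().__init__()
            # Short-range branch: embed the frame-to-frame feature difference.
            self.motion_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            # Long-range branch: pool each frame into a small grid of regions,
            # then convolve along the temporal axis.
            self.region_pool = nn.AdaptiveAvgPool2d(num_regions)
            self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, channels, height, width)
            b, t, c, h, w = x.shape

            # Short-range: difference between successive frames, padded back to t.
            diff = x[:, 1:] - x[:, :-1]                    # (b, t-1, c, h, w)
            diff = torch.cat([diff, diff[:, -1:]], dim=1)  # (b, t, c, h, w)
            motion = self.motion_conv(diff.reshape(b * t, c, h, w))
            motion = motion.reshape(b, t, c, h, w)

            # Long-range: region-aware pooling, then convolution over time.
            regions = self.region_pool(x.reshape(b * t, c, h, w))  # (b*t, c, r, r)
            regions = regions.mean(dim=(2, 3)).reshape(b, t, c)    # (b, t, c)
            context = self.temporal_conv(regions.transpose(1, 2))  # (b, c, t)
            context = context.transpose(1, 2).reshape(b, t, c, 1, 1)

            # Fuse short- and long-range cues with a residual connection.
            return x + motion * torch.sigmoid(context)

Because the block preserves the input shape (e.g. torch.randn(2, 8, 64, 56, 56) passes through unchanged in shape), it can be dropped into a 2D backbone such as ResNet-50 in the residual style the abstract describes.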
Data availability
The data that support the findings of this study are not openly available due to site restrictions and are available from the corresponding author upon reasonable request (https://20bn.com/datasets/something-something/).
Acknowledgements
This work is supported by the Hubei Technology Innovation Project (2019AAA045), the National Natural Science Foundation of China (62171328 and 62171327), and the Graduate Innovative Fund of Wuhan Institute of Technology (CX2021276).
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
About this article
Cite this article
Lu, T., Yang, Q., Min, F. et al. Action recognition based on adaptive region perception. Neural Comput & Applic 36, 943–959 (2024). https://doi.org/10.1007/s00521-023-09069-9