Multi-scale spatiotemporal topology unveiled: enhancing skeleton-based action recognition

Chen, Hongwei; Wang, Jianpeng; Chen, Zexi

doi:10.1007/s11227-024-06531-w

Published: 16 October 2024

Volume 81, article number 10, (2025)

The Journal of Supercomputing

Abstract

In recent years, skeleton-based action recognition has received considerable attention because human skeleton data remain robust in complex environments. However, many existing methods struggle to learn global temporal information because they extract spatiotemporal features inadequately and neglect long-term dependencies. Subtle joint movements also play a critical role in skeleton-based action recognition, as they are essential for distinguishing between similar actions. To address these challenges, this paper proposes a Multi-Scale Spatiotemporal Topology-Aware Network (MSTC3D), which integrates data from differently sampled frame sequences into a dual-channel network and employs lateral connections to merge features from different temporal scales. This facilitates the dynamic learning of global temporal channel variations and improves the modeling of long-term temporal dependencies. The proposed Multi-Scale 3D Convolutional Block (M3D) incorporates a pyramid-like structure that expands the receptive field, enabling it to capture multi-level detail features of subtle joint movements. To further improve the model's fine-grained recognition of features associated with individual joints and body regions, a Spatial Topological Focus Module is embedded within the M3D. By considering both short-term and long-term temporal dependencies, and by leveraging the efficient feature representation provided by multi-scale convolutional blocks, MSTC3D demonstrates superior performance in action recognition tasks. Experiments on the NTU RGB+D and FineGym datasets validate the effectiveness of MSTC3D, showing state-of-the-art performance among CNN-based methods and performance comparable or superior to leading GCN-based methods.
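The data flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`sample_frames`, `lateral_fuse`, `multi_scale_smooth`), the frame counts, the input layout (frames, joints, coordinates), and the use of plain averaging in place of learned 3D convolutions are all assumptions made for illustration. The sketch only shows how two temporal sampling rates can be merged through a lateral connection and how a pyramid of temporal window sizes widens the receptive field.

```python
import numpy as np


def sample_frames(seq, num_frames):
    """Uniformly sample `num_frames` frames from a (T, J, C) skeleton sequence."""
    idx = np.linspace(0, seq.shape[0] - 1, num_frames).round().astype(int)
    return seq[idx]


def lateral_fuse(sparse_feat, dense_feat):
    """Merge the densely sampled pathway into the sparsely sampled one.

    The dense features are average-pooled over time down to the sparse
    length, then concatenated along the channel axis. A learned strided
    convolution would normally take the place of the pooling (assumption).
    """
    t_sparse, t_dense = sparse_feat.shape[0], dense_feat.shape[0]
    if t_dense % t_sparse != 0:
        raise ValueError("dense length must be a multiple of sparse length")
    stride = t_dense // t_sparse
    pooled = dense_feat.reshape(t_sparse, stride, *dense_feat.shape[1:]).mean(axis=1)
    return np.concatenate([sparse_feat, pooled], axis=-1)


def multi_scale_smooth(feat, windows=(1, 3, 5)):
    """Pyramid-like temporal aggregation with several odd window sizes.

    Each branch is a moving average over `k` frames (edge-padded so the
    output keeps the input length); branches are concatenated on channels.
    """
    outs = []
    for k in windows:
        pad = k // 2
        padded = np.pad(feat, ((pad, pad), (0, 0), (0, 0)), mode="edge")
        out = np.stack([padded[t:t + k].mean(axis=0) for t in range(feat.shape[0])])
        outs.append(out)
    return np.concatenate(outs, axis=-1)


if __name__ == "__main__":
    # Hypothetical input: 64 frames, 25 joints, 3D coordinates.
    rng = np.random.default_rng(0)
    skeleton = rng.normal(size=(64, 25, 3))
    sparse = sample_frames(skeleton, 8)    # long-range, low temporal rate
    dense = sample_frames(skeleton, 32)    # short-range, high temporal rate
    fused = lateral_fuse(sparse, dense)
    pyramid = multi_scale_smooth(fused)
    print(fused.shape, pyramid.shape)
```

In a trained network each averaging step would be replaced by learned layers, and the Spatial Topological Focus Module, which this sketch omits, would reweight joints and regions inside each block.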

Data availability

No datasets were generated or analyzed during the current study.

Funding

This study did not receive any funding.

Author information

Authors and Affiliations

School of Computer Science, Hubei University of Technology, Wuhan, 430068, China
Hongwei Chen & Jianpeng Wang
Xiaomi Technology (Wuhan) Co., Ltd, Wuhan, China
Zexi Chen

Contributions

H.C. and J.W. wrote the main manuscript text and Z.C. prepared figures. All authors reviewed the manuscript.

Corresponding author

Correspondence to Jianpeng Wang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Human or animal participation

This study does not involve human or animal subjects.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, H., Wang, J. & Chen, Z. Multi-scale spatiotemporal topology unveiled: enhancing skeleton-based action recognition. J Supercomput 81, 10 (2025). https://doi.org/10.1007/s11227-024-06531-w

Accepted: 17 September 2024
Published: 16 October 2024
DOI: https://doi.org/10.1007/s11227-024-06531-w
