Abstract
Crowd counting aims to predict the number of pedestrians in a surveillance video or image and to produce a crowd density map. Accurate counting across different crowded scenes remains challenging because of drastic scale changes and severe occlusions. This paper proposes an efficient multi-scale contextual feature fusion network, abbreviated as MSC-FFN, for counting crowds with varying densities and scales. We design a spatial pyramid feature extraction module that extracts rich contextual features so that the network can adapt to rapid scale changes. To further suppress background and emphasize the main features, we also design a spatial channel attention module that integrates feature map correlations along the spatial and channel dimensions. With this module, the network focuses on the main crowd features, filters out irrelevant background information, and outputs a high-quality density map. We conduct extensive experiments on multiple challenging crowd counting datasets, including UCF_CC_50, ShanghaiTech, WorldExpo’10, and Mall, and the results demonstrate that MSC-FFN outperforms many state-of-the-art methods in both counting performance and the quality of the generated density maps.
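As an illustration of the relationship between a density map and a crowd count mentioned above, the following minimal sketch builds a density map from head annotations and recovers the count by summing it. It is not the MSC-FFN model: the function names, the fixed Gaussian width `sigma`, and the per-head normalization are assumptions chosen for illustration, following the common convention that each annotated head contributes a total mass of one.

```python
import math


def gaussian_density_map(height, width, heads, sigma=2.0):
    """Build a density map from (row, col) head annotations.

    Hypothetical helper for illustration: each head is spread out as a
    Gaussian blob that is normalized inside the image, so every head
    contributes exactly 1 to the sum of the map.
    """
    density = [[0.0] * width for _ in range(height)]
    for hy, hx in heads:
        # Unnormalized Gaussian weights centered on the head position.
        weights = [
            [
                math.exp(-((y - hy) ** 2 + (x - hx) ** 2) / (2.0 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        total = sum(map(sum, weights))
        # Normalize so this head adds a total mass of 1 to the map.
        for y in range(height):
            for x in range(width):
                density[y][x] += weights[y][x] / total
    return density


def count_from_density(density):
    """Estimate the crowd count as the sum over all density map pixels."""
    return sum(map(sum, density))


if __name__ == "__main__":
    # Three annotated heads in a 16 x 16 image (made-up coordinates).
    density = gaussian_density_map(16, 16, [(4, 4), (10, 12), (0, 15)])
    print(round(count_from_density(density), 6))  # 3.0
```

A counting network is typically trained to regress such a map from the image, and the predicted count is then read off as the sum of the predicted map.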
Data Availability
The labeled dataset used to support the findings of this study is available from the corresponding author upon request.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (Nos. 62067002, 61967006, and 62062033), and in part by the Science and Technology Project of the Transportation Department of Jiangxi Province, China (Nos. 2021X0011 and 2022X0040).
Ethics declarations
Competing interests
The authors declare that there are no competing interests related to the content of this article.
About this article
Cite this article
Xiong, L., Yi, H., Huang, X. et al. An efficient multi-scale contextual feature fusion network for counting crowds with varying densities and scales. Multimed Tools Appl 82, 13929–13949 (2023). https://doi.org/10.1007/s11042-022-13920-x
DOI: https://doi.org/10.1007/s11042-022-13920-x