An efficient multi-scale contextual feature fusion network for counting crowds with varying densities and scales

Published in Multimedia Tools and Applications

Abstract

The crowd counting problem aims to predict the number of pedestrians in a surveillance video or image and to produce a crowd density map. Accurate crowd counting across different crowded scenes remains challenging due to drastic scale changes and severe occlusions. This paper proposes an efficient multi-scale contextual feature fusion network for counting crowds with varying densities and scales, abbreviated as MSC-FFN. We design a spatial pyramid feature extraction module that extracts rich contextual features to adapt to rapid scale changes. To further strengthen the model’s ability to suppress background and focus on salient features, we also design a spatial channel attention module that integrates feature map correlations along the spatial and channel dimensions, so that the network can concentrate on the main crowd features, filter out irrelevant background information, and output a high-quality density map. We conduct extensive experiments on multiple challenging crowd counting datasets, including UCF_CC_50, ShanghaiTech, WorldExpo’10, and Mall, and the results demonstrate that MSC-FFN outperforms many state-of-the-art methods in both counting accuracy and the quality of the generated density maps.
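The authors' implementation is not reproduced here. As a rough, hedged illustration of the two modules named in the abstract, the PyTorch sketch below pairs a spatial pyramid extractor (parallel dilated convolutions for multi-scale context) with a spatial-channel attention block (channel re-weighting followed by spatial re-weighting). All layer widths, dilation rates, kernel sizes, and the reduction ratio are assumptions chosen for illustration, not the paper's actual configuration.

```python
# Hypothetical sketch of the two modules described in the abstract.
# Not the authors' released code; hyperparameters are illustrative.
import torch
import torch.nn as nn


class SpatialPyramidExtractor(nn.Module):
    """Extracts multi-scale context with parallel dilated 3x3 convolutions."""

    def __init__(self, in_ch: int, out_ch: int, dilations=(1, 2, 3, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )
        # Fuse the concatenated branch outputs back to out_ch channels.
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(feats)


class SpatialChannelAttention(nn.Module):
    """Re-weights features along the channel and spatial dimensions."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatially, excite per channel.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: squeeze channels, excite per location.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)   # suppress uninformative channels
        x = x * self.spatial_gate(x)   # suppress background regions
        return x


if __name__ == "__main__":
    # Example: a 512-channel backbone feature map of spatial size 64x64.
    features = torch.randn(1, 512, 64, 64)
    pyramid = SpatialPyramidExtractor(512, 256)
    attention = SpatialChannelAttention(256)
    out = attention(pyramid(features))
    print(out.shape)  # torch.Size([1, 256, 64, 64])
```

In this sketch the attention output would feed a small decoder that regresses the density map; summing the predicted map then yields the crowd count.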


Data Availability

The labeled dataset used to support the findings of this study is available from the corresponding author upon request.


Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Nos. 62067002, 61967006, and 62062033), and in part by the Science and Technology Project of the Transportation Department of Jiangxi Province, China (Nos. 2021X0011 and 2022X0040).

Author information


Corresponding author

Correspondence to Hu Yi.

Ethics declarations

Competing interests

The authors declare that there are no competing interests related to the content of this article.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Xiong, L., Yi, H., Huang, X. et al. An efficient multi-scale contextual feature fusion network for counting crowds with varying densities and scales. Multimed Tools Appl 82, 13929–13949 (2023). https://doi.org/10.1007/s11042-022-13920-x

