
CSMB-VSS: video scene segmentation with cosine similarity matrix

Published in Multimedia Tools and Applications

Abstract

Video scene segmentation is a crucial step in video structural analysis: it divides a long video into discrete scenes, each consisting of a series of semantically coherent shots. The goal of video scene segmentation is to identify the locations of scene boundaries in a shot sequence. Existing algorithms primarily treat this as token classification. However, given the small size of current video scene segmentation datasets and the abundance of redundant, scene-irrelevant information in video embeddings, this approach lacks prior knowledge, which makes the learning process hard to interpret and control. To address this issue, we propose a cosine similarity matrix-based video scene segmentation (CSMB-VSS) algorithm, which uses the relationship between scene segmentation and shot similarity as prior information and yields significant improvements. First, we use self-supervised learning to map shot features into the scene space for feature adjustment, and propose dynamic-programming-plus-nearest-neighbor and clustering methods to generate pseudo-scenes for training. Then, we build a similarity matrix from the adjusted features and use a convolutional neural network to mine the typical patterns that scene boundaries form around the diagonal of the matrix. On the official MovieNet-SSeg video scene segmentation dataset, CSMB-VSS achieves an average precision (AP) 3.4% higher than the state of the art (SOTA). We also explore different ways of using the similarity matrix for scene boundary detection and find that each is suited to a different feature-adjustment method; the paper provides a detailed analysis of this.
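To make the pseudo-scene idea concrete, the following is a minimal sketch of one plausible clustering variant, not the authors' implementation: contiguous pseudo-scenes are formed by repeatedly merging the most similar pair of adjacent segments until a target count is reached. The function name `pseudo_scenes`, the parameter `num_scenes`, and the greedy merging criterion are all illustrative assumptions.

```python
import numpy as np

def pseudo_scenes(shot_feats: np.ndarray, num_scenes: int) -> list:
    """shot_feats: (N, D) shot embeddings. Returns num_scenes contiguous
    groups of shot indices, formed by greedily merging adjacent segments."""
    # L2-normalize so dot products are cosine similarities
    normed = shot_feats / np.linalg.norm(shot_feats, axis=1, keepdims=True)
    segments = [[i] for i in range(len(normed))]   # start with one segment per shot
    centroids = list(normed)                       # mean feature of each segment
    while len(segments) > num_scenes:
        # cosine similarity between each pair of adjacent segment centroids
        sims = [centroids[i] @ centroids[i + 1] /
                (np.linalg.norm(centroids[i]) * np.linalg.norm(centroids[i + 1]))
                for i in range(len(segments) - 1)]
        j = int(np.argmax(sims))                   # most similar adjacent pair
        segments[j:j + 2] = [segments[j] + segments[j + 1]]
        centroids[j:j + 2] = [normed[segments[j]].mean(axis=0)]
    return segments
```

Shots grouped into the same pseudo-scene can then serve as positive pairs for the self-supervised feature-adjustment step.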
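The similarity-matrix step can likewise be sketched in a few lines. The snippet below, again an illustration under assumed shapes and names rather than the paper's actual network, computes a cosine similarity matrix from adjusted shot features, crops a small window around the diagonal at every candidate boundary, and scores each window with a tiny CNN; the window size (8) and layer widths are arbitrary placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_similarity_matrix(shot_feats: torch.Tensor) -> torch.Tensor:
    """shot_feats: (N, D) adjusted shot features -> (N, N) cosine similarities."""
    normed = F.normalize(shot_feats, dim=1)
    return normed @ normed.T

def diagonal_windows(sim: torch.Tensor, window: int = 8) -> torch.Tensor:
    """For each candidate boundary between shot i and shot i+1, crop a
    (window x window) patch roughly centred on the diagonal at that position.
    Returns (N-1, 1, window, window) patches for the CNN."""
    half = window // 2
    padded = F.pad(sim, (half, half, half, half))  # zero-pad so edge shots get full patches
    patches = [padded[i:i + window, i:i + window] for i in range(sim.size(0) - 1)]
    return torch.stack(patches).unsqueeze(1)

class BoundaryCNN(nn.Module):
    """Scores each diagonal patch; a high logit marks a likely scene boundary."""
    def __init__(self, window: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * window * window, 1),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.net(patches).squeeze(-1)       # (N-1,) boundary logits

feats = torch.randn(120, 512)                      # stand-in for 120 adjusted shot features
sim = cosine_similarity_matrix(feats)              # (120, 120)
logits = BoundaryCNN()(diagonal_windows(sim))      # one score per shot transition
```

The intuition is that scenes appear as high-similarity blocks along the diagonal, so a boundary looks like the corner where one block ends and the next begins, a local pattern a small CNN can learn.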


Data Availability Statement

The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.


Author information


Corresponding author

Correspondence to Zeyu Chen.

Ethics declarations

Conflict of interest/Competing interests

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chen, Z., Wang, X., Wang, J. et al. CSMB-VSS: video scene segmentation with cosine similarity matrix. Multimed Tools Appl 83, 61451–61467 (2024). https://doi.org/10.1007/s11042-023-17985-0

