
CSMB-VSS: video scene segmentation with cosine similarity matrix

Published in Multimedia Tools and Applications

Abstract

Video scene segmentation is a crucial step in video structural analysis: it divides a long video into discrete scenes, each consisting of a series of semantically coherent shots. The goal of video scene segmentation is to identify the locations of scene boundaries in a shot sequence. Existing algorithms primarily treat this as token classification. However, given the small size of current video scene segmentation datasets and the abundance of redundant, scene-irrelevant information in video embeddings, this approach lacks prior knowledge, which makes the learning process hard to interpret and control. To address this issue, we propose a cosine similarity matrix-based video scene segmentation (CSMB-VSS) algorithm, which uses the relationship between scene segmentation and shot similarity as prior information and yields significant improvements. First, we use self-supervised learning to map shot features into the scene space for feature adjustment, and propose dynamic-programming-plus-nearest-neighbor and clustering methods to generate pseudo-scenes for training. Then, we build a similarity matrix from the adjusted features and use a convolutional neural network to mine the typical patterns that scene boundaries form around the diagonal of the matrix. On the official MovieNet-SSeg video scene segmentation dataset, CSMB-VSS achieves an average precision (AP) 3.4% higher than the state of the art (SOTA). We also explore different ways of using the similarity matrix for scene boundary detection and find that each is suited to a different feature-adjustment method; the paper provides a detailed analysis of this.
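To make the pseudo-scene idea concrete, the following is a minimal sketch of one plausible clustering variant, not the authors' implementation: contiguous pseudo-scenes are formed by repeatedly merging the most similar pair of adjacent segments until a target count is reached. The function name `pseudo_scenes`, the parameter `num_scenes`, and the greedy merging criterion are all illustrative assumptions.

```python
import numpy as np

def pseudo_scenes(shot_feats: np.ndarray, num_scenes: int) -> list:
    """shot_feats: (N, D) shot embeddings. Returns num_scenes contiguous
    groups of shot indices, formed by greedily merging adjacent segments."""
    # L2-normalize so dot products are cosine similarities
    normed = shot_feats / np.linalg.norm(shot_feats, axis=1, keepdims=True)
    segments = [[i] for i in range(len(normed))]   # start with one segment per shot
    centroids = list(normed)                       # mean feature of each segment
    while len(segments) > num_scenes:
        # cosine similarity between each pair of adjacent segment centroids
        sims = [centroids[i] @ centroids[i + 1] /
                (np.linalg.norm(centroids[i]) * np.linalg.norm(centroids[i + 1]))
                for i in range(len(segments) - 1)]
        j = int(np.argmax(sims))                   # most similar adjacent pair
        segments[j:j + 2] = [segments[j] + segments[j + 1]]
        centroids[j:j + 2] = [normed[segments[j]].mean(axis=0)]
    return segments
```

Shots grouped into the same pseudo-scene can then serve as positive pairs for the self-supervised feature-adjustment step.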
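The similarity-matrix step can likewise be sketched in a few lines. The snippet below, again an illustration under assumed shapes and names rather than the paper's actual network, computes a cosine similarity matrix from adjusted shot features, crops a small window around the diagonal at every candidate boundary, and scores each window with a tiny CNN; the window size (8) and layer widths are arbitrary placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_similarity_matrix(shot_feats: torch.Tensor) -> torch.Tensor:
    """shot_feats: (N, D) adjusted shot features -> (N, N) cosine similarities."""
    normed = F.normalize(shot_feats, dim=1)
    return normed @ normed.T

def diagonal_windows(sim: torch.Tensor, window: int = 8) -> torch.Tensor:
    """For each candidate boundary between shot i and shot i+1, crop a
    (window x window) patch roughly centred on the diagonal at that position.
    Returns (N-1, 1, window, window) patches for the CNN."""
    half = window // 2
    padded = F.pad(sim, (half, half, half, half))  # zero-pad so edge shots get full patches
    patches = [padded[i:i + window, i:i + window] for i in range(sim.size(0) - 1)]
    return torch.stack(patches).unsqueeze(1)

class BoundaryCNN(nn.Module):
    """Scores each diagonal patch; a high logit marks a likely scene boundary."""
    def __init__(self, window: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * window * window, 1),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.net(patches).squeeze(-1)       # (N-1,) boundary logits

feats = torch.randn(120, 512)                      # stand-in for 120 adjusted shot features
sim = cosine_similarity_matrix(feats)              # (120, 120)
logits = BoundaryCNN()(diagonal_windows(sim))      # one score per shot transition
```

The intuition is that scenes appear as high-similarity blocks along the diagonal, so a boundary looks like the corner where one block ends and the next begins, a local pattern a small CNN can learn.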


Data Availability Statement

The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.


Author information


Corresponding author

Correspondence to Zeyu Chen.

Ethics declarations

Conflict of interest/Competing interests

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chen, Z., Wang, X., Wang, J. et al. CSMB-VSS: video scene segmentation with cosine similarity matrix. Multimed Tools Appl 83, 61451–61467 (2024). https://doi.org/10.1007/s11042-023-17985-0

