Abstract
Video scene segmentation is a crucial step in video structural analysis: it divides a long video into discrete scenes, each consisting of a series of semantically coherent shots. The goal is to identify the locations of scene boundaries within a shot sequence. Existing algorithms primarily treat this as token classification. However, given the small size of current video scene segmentation datasets and the abundance of redundant, scene-irrelevant information in video embeddings, this approach incorporates no prior knowledge, which makes the learning process uninterpretable and difficult to control. To address this issue, we propose a cosine similarity matrix-based video scene segmentation (CSMB-VSS) algorithm, which leverages the relationship between scene segmentation and shot similarity as prior information and yields significant improvements. First, we use self-supervised learning to map shot features into the scene space for feature adjustment, and propose dynamic programming combined with nearest-neighbor or clustering methods to generate pseudo-scenes for training. Then, we build a similarity matrix from the adjusted features and use a convolutional neural network to mine the characteristic patterns of scene boundaries around the diagonal of the matrix. On the official MovieNet-SSeg video scene segmentation dataset, CSMB-VSS achieves an average precision (AP) 3.4% higher than the state of the art (SOTA). Notably, this paper explores different ways of using the similarity matrix for scene boundary detection, finds that each is suited to a different feature adjustment method, and provides a detailed analysis.
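The two core operations the abstract describes can be sketched in a few lines: computing a pairwise cosine similarity matrix over shot features, and cropping windows centered on the matrix diagonal at each candidate boundary (the patches a boundary-classifying CNN would consume). This is a minimal illustration assuming NumPy arrays of shape `(num_shots, dim)`; the function names, the window size, and the zero-padding choice are this sketch's assumptions, not the paper's exact implementation.

```python
import numpy as np

def cosine_similarity_matrix(shot_feats: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between shot embeddings.

    shot_feats: (num_shots, dim) array.
    Returns an (num_shots, num_shots) matrix S with S[i, j] = cos(f_i, f_j).
    """
    norms = np.linalg.norm(shot_feats, axis=1, keepdims=True)
    normed = shot_feats / np.clip(norms, 1e-8, None)  # L2-normalize rows
    return normed @ normed.T

def diagonal_windows(sim: np.ndarray, half: int = 4) -> np.ndarray:
    """Crop a (2*half, 2*half) patch centered on the diagonal at each
    candidate boundary between shot i and shot i+1 (zero-padded at the
    edges). These patches would be the input to a boundary-detecting CNN.
    """
    n = sim.shape[0]
    padded = np.zeros((n + 2 * half, n + 2 * half), dtype=sim.dtype)
    padded[half:half + n, half:half + n] = sim
    patches = []
    for i in range(n - 1):
        c = i + 1 + half  # boundary between shots i and i+1, padded coords
        patches.append(padded[c - half:c + half, c - half:c + half])
    return np.stack(patches)
```

Within a scene, adjacent shots are similar, so the similarity matrix shows bright blocks along the diagonal; a scene boundary appears as a corner between two blocks, which is the local pattern the diagonal windows expose to the CNN.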









Data Availability Statement
The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request.
Ethics declarations
Conflict of interest/Competing interests
The authors declare no conflict of interest.
About this article
Cite this article
Chen, Z., Wang, X., Wang, J. et al. CSMB-VSS: video scene segmentation with cosine similarity matrix. Multimed Tools Appl 83, 61451–61467 (2024). https://doi.org/10.1007/s11042-023-17985-0